Thanks Matt, that was a pretty informative answer. I took a look at the
# of missing data patterns and it grows much faster than linearly with
the # of variables, so that may have a strong effect on performance.
I'm wondering if there are other ways of approaching this. Here's what
I'm interested in: I'm working with a data set with hundreds of
variables, but only about 30-40 are of interest to me. I'd hoped to feed
as many of the extra variables as possible into Amelia purely in order
to make the imputation more accurate for the ones I care about.
Ultimately I don't care about the imputed values for the extra variables
(or even about building a "realistic" multivariate model over them).
If I could treat the extra variables as completely observed, I guess
that would speed things up a lot. For categorical variables I could
easily treat "missing" as another category. I'm not sure how best to
handle continuous variables: assigning an arbitrary value far outside
the range of observed values would probably skew the multivariate
normal fits. I could instead assign an average value to each missing
value, or even assign a value from a simple regression model (basically
a "pre-imputation imputation").
Any ideas for good approaches here? Again, all I am really interested in
is getting the most accurate imputation in a manageable amount of time
for the 30-40 variables of interest. Thanks!
Kurt
On Wed, 2010-07-07 at 23:10 -0400, Matt Blackwell wrote:
Hi Kurt,
The computing time depends heavily on the number of parameters in the
model, which is a quadratic function of the number of variables.
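For a sense of scale: with p variables, the multivariate normal model
behind Amelia has p means plus p*(p+1)/2 variances and covariances. A
quick count in R (n.params is just a throwaway helper, not an Amelia
function):

n.params <- function(p) p + p * (p + 1) / 2
n.params(24)    # 324 parameters
n.params(240)   # 29,160 parameters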
Another key component of the computation time, though, is the number
of distinct patterns of missingness in the data, that is, how many
different ways a unit can be missing in the data. Each step of the EM
algorithm must loop over each of these patterns and do a bit of math.
Thus, fewer patterns means fewer computations. If missingness is
generated randomly, then as you add variables, there might be a
tendency for each of the possible missingness patterns to appear in
the data. Since the number of possible patterns is exponential in the
number of variables, this would likely increase the computing time
exponentially. In real data, however, missingness tends to be
"chunky"--groups of variables are missing together for systematic
reasons (it is harder to collect all data on certain subgroups, for
example). Thus, it is likely that a 200-variable dataset would not
contain all of the possible patterns. This is
something that you could check in your data:
nrow(unique(is.na(myData)))
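For reference, that count is bounded above both by 2^(number of
variables) and by the number of rows, since each row contributes at
most one pattern:

2 ^ ncol(myData)   # all possible patterns
nrow(myData)       # cannot have more patterns than rows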
Another relevant issue that might crop up is the so-called fraction of
missing information: the larger the share of the complete-data
information that lies in the missing part of the data, the longer the
EM algorithm will take to converge.
These issues imply that it would be hard to estimate the length of
time until convergence based on the size of the data matrix. That
being said, adding an additional covariate has the potential to
worsen each of these problems, perhaps markedly.
Hope that helps.
Cheers,
matt.
On Wed, Jul 7, 2010 at 10:01 PM, Kurt Smith
<kurt.smith(a)archimedesmodel.com> wrote:
> Hi,
>
> I've just started working with Amelia II to do multiple imputation for
> large data sets. It works great but I have some questions about how well
> it scales.
>
> In the Honaker & King "What to do about Missing Values..." paper the
> authors mention imputing for data sets with 240 variables and 32,000
> observations, which I would love to do, but I estimate this would take
> ~10^6 hours to do one imputation.
>
> I did some test runs and it seems like computing time grows
> exponentially with the number of variables. I timed several runs in R
> 2.10.1 (on an Intel Xeon desktop) and fit a regression that gave me
> roughly the following:
>
> time [seconds] = 10^-4 * (# of imputations) * (# of subjects)^0.92 *
> 1.118^(# of variables)
>
> In these runs I used up to 25,000 subjects and 24 variables. Missing
> rates were ~7-12% for most variables.
>
> Based on this, it looks like using ~200 variables would take O(10^6)
> hours, while 120 variables could be done in about a week. As
> parallelization only reduces the # of imputations per processor, not
> the # of variables, it doesn't look like that would help.
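> As a rough check on those extrapolations in R (est.time below is just
> the fitted formula above wrapped in a helper, so the numbers are only
> ballpark):
>
> est.time <- function(m, n, p) 1e-4 * m * n^0.92 * 1.118^p
> est.time(1, 25000, 120) / 86400   # roughly 8 days, i.e. about a week
> est.time(1, 25000, 200) / 3600    # on the order of 10^6 hours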
>
> Can anyone comment on run times for large sets? It's possible I've
> missed something or the exponential relation doesn't hold for more
> variables.
>
> Thanks!
> Kurt
>
>
> --
> Kurt Smith, PhD
> Scientist II
> Archimedes Inc
> 201 Mission Street, 29th Floor
> San Francisco, CA 94105
> -
> Amelia mailing list served by Harvard-MIT Data Center
> [Un]Subscribe/View Archive:
> http://lists.gking.harvard.edu/?info=amelia
> More info about Amelia:
> http://gking.harvard.edu/amelia
>
>