Hi Kurt,
The computing time depends heavily on the number of parameters in the
model, and this is a quadratic function of the number of variables.
Another key component of the computation time, though, is the number
of distinct patterns of missingness in the data--that is, the number
of different ways a unit can be missing data. Each step of the EM
algorithm must loop over each of these patterns and do a bit of math.
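As a sketch of why the patterns matter (the data frame here is purely illustrative), you can group rows by their missingness pattern; each EM sweep then does one block of work per group, not per row:

```r
# Illustrative data frame standing in for your own data
myData <- data.frame(x = c(1, NA, 3, NA), y = c(NA, NA, 6, NA), z = 1:4)

# Encode each row's missingness pattern as a string key
patternKey <- apply(is.na(myData), 1, paste, collapse = "")

# Rows sharing a pattern can be handled together, so per-iteration
# work scales with the number of distinct patterns
rowsByPattern <- split(seq_len(nrow(myData)), patternKey)
length(rowsByPattern)  # number of distinct patterns (3 here)
```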
Thus, fewer patterns means fewer computations. If missingness is
generated randomly, then as you add variables, more and more of the
possible missingness patterns tend to appear in the data. Since the
number of possible patterns is exponential in the number of variables
(2^p for p variables), this would likely increase the computing time
exponentially. In real data,
however, missingness tends to be "chunky"--groups of variables are
missing together for systematic reasons (harder to collect all data on
certain subgroups, for example). Thus, it is likely that a
200-variable dataset would not contain all the possible patterns. This
is something you can check in your data:
nrow(unique(is.na(myData)))
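As an illustrative simulation of that difference (the sizes and rates here are assumptions, not from your data), random missingness produces far more distinct patterns than "chunky" block-wise missingness at the same missing rate:

```r
set.seed(1)
n <- 1000; p <- 20

# Random (MCAR) missingness: each cell missing independently, prob 0.10
randomNA <- matrix(runif(n * p) < 0.10, n, p)

# "Chunky" missingness: a block of 5 variables missing together
# for roughly 10% of the rows
chunkyNA <- matrix(FALSE, n, p)
chunkyNA[runif(n) < 0.10, 1:5] <- TRUE

nrow(unique(randomNA))  # hundreds of distinct patterns
nrow(unique(chunkyNA))  # only 2: block missing or fully observed
```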
Another relevant issue that might crop up is the so-called fraction of
missing information: the larger the share of the complete-data
information that resides in the missing part of the data, the longer
the EM algorithm will take to converge.
These issues imply that it would be hard to estimate the length of
time until convergence based on the size of the data matrix. That
being said, adding an additional covariate has the potential to
worsen each of these factors, perhaps markedly.
Hope that helps.
Cheers,
matt.
On Wed, Jul 7, 2010 at 10:01 PM, Kurt Smith
<kurt.smith(a)archimedesmodel.com> wrote:
Hi,
I've just started working with Amelia II to do multiple imputation for
large data sets. It works great but I have some questions about how well
it scales.
In the Honaker & King "What to do about Missing Values..." paper the
authors mention imputing for data sets with 240 variables and 32,000
observations, which I would love to do, but I estimate this would take
~10^6 hours to do one imputation.
I did some test runs and it seems like computing time grows
exponentially with the number of variables. I timed several runs in R
2.10.1 (on an Intel Xeon desktop) and fit a regression that gave me
roughly the following:
time [seconds] = 10^-4 * (# of imputations) * (# of subjects)^0.92 *
1.118^(# of variables)
In these runs I used up to 25,000 subjects and 24 variables. Missing
rates were ~7-12% for most variables.
Based on this it looks like using ~200 variables would take O(10^6)
hours while 120 variables could be done in about a week. As
parallelization only reduces the number of imputations per processor,
not the number of variables, it doesn't look like that would help.
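For reference, those extrapolations come from evaluating the fitted
relationship directly (a sketch; the function name is mine):

```r
# Fitted run-time model from the regression above (time in seconds)
predSeconds <- function(imputations, subjects, variables) {
  1e-4 * imputations * subjects^0.92 * 1.118^variables
}

# Extrapolations for one imputation on 25,000 subjects
predSeconds(1, 25000, 200) / 3600        # on the order of 10^6 hours
predSeconds(1, 25000, 120) / (3600 * 24) # roughly a week, in days
```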
Can anyone comment on run times for large sets? It's possible I've
missed something or the exponential relation doesn't hold for more
variables.
Thanks!
Kurt
--
Kurt Smith, PhD
Scientist II
Archimedes Inc
201 Mission Street, 29th Floor
San Francisco, CA 94105
-
Amelia mailing list served by Harvard-MIT Data Center
[Un]Subscribe/View Archive:
http://lists.gking.harvard.edu/?info=amelia
More info about Amelia:
http://gking.harvard.edu/amelia