Hi,
I've just started working with Amelia II to do multiple imputation for
large data sets. It works great, but I have some questions about how
well it scales.
In the Honaker & King paper "What to do about Missing Values...", the
authors mention imputing data sets with 240 variables and 32,000
observations, which I would love to do, but I estimate this would take
~10^6 hours for a single imputation.
I did some test runs, and it seems like computing time grows
exponentially with the number of variables. I timed several runs in R
2.10.1 (on an Intel Xeon desktop) and fit a regression that gave me
roughly the following:

time [seconds] = 10^-4 * (# of imputations) * (# of subjects)^0.92 *
1.118^(# of variables)
In these runs I used up to 25,000 subjects and 24 variables. Missing
rates were ~7-12% for most variables.
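In case it's useful, here is roughly the kind of timing run I did. The
data frame df and the counts below are placeholders, not my actual data:

library(Amelia)

## Time one amelia() call on the first n.subj rows and first
## n.vars columns of df (df = a test data set with missingness).
time.run <- function(df, n.subj, n.vars, m = 1) {
  sub <- df[seq_len(n.subj), seq_len(n.vars)]
  system.time(amelia(sub, m = m))["elapsed"]
}

## With the timings collected in a data frame 'runs' (columns:
## seconds, m, n.subj, n.vars), the model above is linear on the
## log scale: the 0.92 exponent is the log(n.subj) coefficient,
## and 1.118 = exp(coefficient on n.vars).
fit <- lm(log(seconds) ~ log(m) + log(n.subj) + n.vars, data = runs)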
Based on this, it looks like using ~200 variables would take O(10^6)
hours, while 120 variables could be done in about a week.
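For concreteness, that's just the fitted formula evaluated directly:

## Predicted run time in hours from the regression above.
pred.hours <- function(m, n.subj, n.vars)
  1e-4 * m * n.subj^0.92 * 1.118^n.vars / 3600

pred.hours(1, 25000, 120)   # ~200 hours -- about a week
pred.hours(1, 25000, 200)   # ~1.5e6 hours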
And since parallelization only divides the number of imputations across
processors, not the number of variables, it doesn't look like that
would help.
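To be clear about what I mean by parallelization: a sketch like the one
below (using the snow package; the cluster size and df are
placeholders) spreads the imputations across workers, but wall time is
still bounded by one full imputation, which is the part that explodes
with the number of variables.

library(snow)

cl <- makeCluster(4, type = "SOCK")   # one worker per imputation
clusterEvalQ(cl, library(Amelia))
clusterExport(cl, "df")               # df = the full data set
## Each worker runs a single imputation (m = 1); in practice each
## worker should also get a distinct RNG seed.
fits <- parLapply(cl, 1:4, function(i) amelia(df, m = 1))
imps <- lapply(fits, function(f) f$imputations[[1]])
stopCluster(cl)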
Can anyone comment on run times for large data sets? It's possible I've
missed something, or that the exponential relation doesn't hold for
larger numbers of variables.
Thanks!
Kurt
--
Kurt Smith, PhD
Scientist II
Archimedes Inc
201 Mission Street, 29th Floor
San Francisco, CA 94105