I have a question for the MI experts about imputations and experiments:
We often run experiments in which we hypothesize that responses will vary across conditions. We expose subjects to a condition -- CONDa, CONDb, or CONDc, say -- and then measure responses on dependent variables across all conditions, say DV1, DV2, etc. We also collect data on various independent variables, say IV1, IV2, etc.
But because we anticipate that the relationship between the IVs and DVs will vary across conditions, it seems like we ought to do one of two things when imputing missing data:
(1) Interact every DV with the condition dummies so that we have, in effect, DV1_CONDa, DV1_CONDb, DV1_CONDc, DV2_CONDa, etc. But for any case not in a given condition, that interaction is just zero, since the condition dummy is zero rather than one. This seems wasteful of information to me, which leads me to my alternative...
(2) Rather than setting DV1_CONDa to zero when CONDa is zero, I'm tempted to treat every off-condition DV (DV1_CONDa, DV1_CONDb, DV1_CONDc, ...) as missing for each case and impute it. Missingness will be high, of course (only 1/[number of conditions] of the condition-specific DV values will be observed for each case), but at least I won't be throwing away lots of valuable data. I hesitate to do so because I can't find anyone else who has done this, which makes me think I am probably misguided (a rough sketch of both setups follows).
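To make the two setups concrete, here is a minimal sketch in R of how I'm imagining the data prep; the data frame and column names (dat, cond, DV1, DV2) are purely hypothetical:

library(Amelia)

## hypothetical data: "dat" has a condition factor plus DVs and IVs, e.g.
## dat <- data.frame(cond = factor(...), DV1 = ..., DV2 = ..., IV1 = ..., IV2 = ...)
dvs <- c("DV1", "DV2")

## Option (1): condition-specific DV columns, zero whenever that condition's dummy is zero
dat1 <- dat
for (dv in dvs) {
  for (lev in levels(dat$cond)) {
    dat1[[paste(dv, "_COND", lev, sep = "")]] <- ifelse(dat$cond == lev, dat[[dv]], 0)
  }
}

## Option (2): the same columns, but off-condition cells set to NA so they get imputed
dat2 <- dat
for (dv in dvs) {
  for (lev in levels(dat$cond)) {
    dat2[[paste(dv, "_COND", lev, sep = "")]] <- ifelse(dat$cond == lev, dat[[dv]], NA)
  }
}

## then drop the pooled DVs and impute, e.g.
## a.out <- amelia(dat2[, setdiff(names(dat2), dvs)], m = 5, noms = "cond")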
P.S. I realize this is a question about imputation generally, but I thought I'd post it here since I use Amelia for my imputation needs -- let me know if I shouldn't post something like this here and I'll look elsewhere.
Donald Braman
phone: 413-628-1221
-
Amelia mailing list served by Harvard-MIT Data Center
[Un]Subscribe/View Archive: http://lists.gking.harvard.edu/?info=amelia
More info about Amelia: http://gking.harvard.edu/amelia
Hi,
I've just started working with Amelia II to do multiple imputation for
large data sets. It works great, but I have some questions about how well it scales.
In the Honaker & King paper "What to do about Missing Values...", the authors mention imputing data sets with 240 variables and 32,000 observations, which I would love to do, but I estimate this would take ~10^6 hours for one imputation.
I did some test runs, and it seems like computing time grows exponentially with the number of variables. I timed several runs in R 2.10.1 (on an Intel Xeon desktop) and fit a regression that gave me roughly the following:

time [seconds] = 10^-4 * (# of imputations) * (# of subjects)^0.92 * 1.118^(# of variables)
In these runs I used up to 25,000 subjects and 24 variables. Missing
rates were ~7-12% for most variables.
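For reference, here is that fit written out as a small R helper; the coefficients are just the ones from the regression above, so this is purely descriptive of my runs:

## rough wall-time estimate (seconds) for a single amelia() run,
## using the empirical fit described above
est_time_sec <- function(n_imputations, n_subjects, n_variables) {
  1e-4 * n_imputations * n_subjects^0.92 * 1.118^n_variables
}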
Based on this, it looks like ~200 variables would take O(10^6) hours, while 120 variables could be done in about a week. Since parallelization only spreads imputations across processors and doesn't reduce the number of variables in each run, it doesn't look like that would help.
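For concreteness, plugging illustrative numbers into the helper above (one imputation, subject counts in the range I described):

est_time_sec(1, 32000, 200) / 3600        # ~2e6 hours
est_time_sec(1, 25000, 120) / 3600 / 24   # ~8 days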
Can anyone comment on run times for large sets? It's possible I've
missed something or the exponential relation doesn't hold for more
variables.
Thanks!
Kurt
--
Kurt Smith, PhD
Scientist II
Archimedes Inc
201 Mission Street, 29th Floor
San Francisco, CA 94105