I have a question for the MI experts about imputations and experiments:
We often run experiments in which we hypothesize that responses will vary across conditions. We expose subjects to a condition -- CONDa, CONDb, or CONDc, say -- and then measure responses on dependent variables across all conditions, say DV1, DV2, etc. We also collect data on various independent variables, say IV1, IV2, etc.
But because we anticipate that the relationship between the IVs and DVs will vary across conditions, it seems like we ought to do one of two things when imputing missing data:
(1) Interact every DV with the condition dummies so that we have, in effect, DV1_CONDa, DV1_CONDb, DV1_CONDc, DV2_CONDa, etc. But for any case not in a given condition, that interaction is just zero, since the condition dummy is zero rather than one. This seems wasteful of information to me, which leads me to my alternative...
(2) Rather than setting DV1_CONDa to zero when CONDa is zero, I'm tempted to treat every off-condition DV (DV1_CONDa, DV1_CONDb, DV1_CONDc, ...) as missing for each case and impute it. Missingness will be high, of course (only 1/[number of conditions] of the condition-specific DV values will be observed for each case), but at least I won't be throwing away lots of valuable data. I hesitate to do so because I can't find anyone else who has done this, which makes me think I am probably misguided (a rough sketch of both setups follows).
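To make the two setups concrete, here is a minimal sketch in R of how I'm imagining the data prep; the data frame and column names (dat, cond, DV1, DV2) are purely hypothetical:

library(Amelia)

## hypothetical data: "dat" has a condition factor plus DVs and IVs, e.g.
## dat <- data.frame(cond = factor(...), DV1 = ..., DV2 = ..., IV1 = ..., IV2 = ...)
dvs <- c("DV1", "DV2")

## Option (1): condition-specific DV columns, zero whenever that condition's dummy is zero
dat1 <- dat
for (dv in dvs) {
  for (lev in levels(dat$cond)) {
    dat1[[paste(dv, "_COND", lev, sep = "")]] <- ifelse(dat$cond == lev, dat[[dv]], 0)
  }
}

## Option (2): the same columns, but off-condition cells set to NA so they get imputed
dat2 <- dat
for (dv in dvs) {
  for (lev in levels(dat$cond)) {
    dat2[[paste(dv, "_COND", lev, sep = "")]] <- ifelse(dat$cond == lev, dat[[dv]], NA)
  }
}

## then drop the pooled DVs and impute, e.g.
## a.out <- amelia(dat2[, setdiff(names(dat2), dvs)], m = 5, noms = "cond")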
P.S. I realize this is a question about imputation generally, but I thought I'd post it here since I use Amelia for my imputation needs -- let me know if I shouldn't post something like this here and I'll look elsewhere.
Donald Braman
phone: 413-628-1221
-
Amelia mailing list served by Harvard-MIT Data Center
[Un]Subscribe/View Archive: http://lists.gking.harvard.edu/?info=amelia
More info about Amelia: http://gking.harvard.edu/amelia
Hi,
I've just started working with Amelia II to do multiple imputation for
large data sets. It works great, but I have some questions about how well it scales.
In the Honaker & King paper "What to do about Missing Values...", the authors mention imputing data sets with 240 variables and 32,000 observations, which I would love to do, but I estimate this would take ~10^6 hours for one imputation.
I did some test runs, and it seems like computing time grows exponentially with the number of variables. I timed several runs in R 2.10.1 (on an Intel Xeon desktop) and fit a regression that gave me roughly the following:

time [seconds] = 10^-4 * (# of imputations) * (# of subjects)^0.92 * 1.118^(# of variables)
In these runs I used up to 25,000 subjects and 24 variables. Missing
rates were ~7-12% for most variables.
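For reference, here is that fit written out as a small R helper; the coefficients are just the ones from the regression above, so this is purely descriptive of my runs:

## rough wall-time estimate (seconds) for a single amelia() run,
## using the empirical fit described above
est_time_sec <- function(n_imputations, n_subjects, n_variables) {
  1e-4 * n_imputations * n_subjects^0.92 * 1.118^n_variables
}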
Based on this, it looks like ~200 variables would take O(10^6) hours, while 120 variables could be done in about a week. Since parallelization only spreads imputations across processors and doesn't reduce the number of variables in each run, it doesn't look like that would help.
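For concreteness, plugging illustrative numbers into the helper above (one imputation, subject counts in the range I described):

est_time_sec(1, 32000, 200) / 3600        # ~2e6 hours
est_time_sec(1, 25000, 120) / 3600 / 24   # ~8 days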
Can anyone comment on run times for large sets? It's possible I've
missed something or the exponential relation doesn't hold for more
variables.
Thanks!
Kurt
--
Kurt Smith, PhD
Scientist II
Archimedes Inc
201 Mission Street, 29th Floor
San Francisco, CA 94105