Hi Kurt,
The computing time depends heavily on the number of parameters in the
model, and this is a quadratic function of the number of variables.
Another key component of the computation time, though, is the number
of distinct patterns of missingness in the data--that is, the number
of different ways a unit can be missing data. Each step of the EM
algorithm must loop over each of these patterns and do a bit of math.
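As a sketch of why the patterns matter (the data frame here is purely illustrative), you can group rows by their missingness pattern; each EM sweep then does one block of work per group, not per row:

```r
# Illustrative data frame standing in for your own data
myData <- data.frame(x = c(1, NA, 3, NA), y = c(NA, NA, 6, NA), z = 1:4)

# Encode each row's missingness pattern as a string key
patternKey <- apply(is.na(myData), 1, paste, collapse = "")

# Rows sharing a pattern can be handled together, so per-iteration
# work scales with the number of distinct patterns
rowsByPattern <- split(seq_len(nrow(myData)), patternKey)
length(rowsByPattern)  # number of distinct patterns (3 here)
```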
Thus, fewer patterns means fewer computations. If missingness is
generated randomly, then as you add variables, more and more of the
possible missingness patterns tend to appear in the data. Since the
number of possible patterns is exponential in the number of variables
(2^p for p variables), this would likely increase the computing time
exponentially. In real data,
however, missingness tends to be "chunky"--groups of variables are
missing together for systematic reasons (harder to collect all data on
certain subgroups, for example). Thus, it is likely that a
200-variable dataset would not contain all the possible patterns. This
is something you can check in your data:
nrow(unique(is.na(myData)))
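As an illustrative simulation of that difference (the sizes and rates here are assumptions, not from your data), random missingness produces far more distinct patterns than "chunky" block-wise missingness at the same missing rate:

```r
set.seed(1)
n <- 1000; p <- 20

# Random (MCAR) missingness: each cell missing independently, prob 0.10
randomNA <- matrix(runif(n * p) < 0.10, n, p)

# "Chunky" missingness: a block of 5 variables missing together
# for roughly 10% of the rows
chunkyNA <- matrix(FALSE, n, p)
chunkyNA[runif(n) < 0.10, 1:5] <- TRUE

nrow(unique(randomNA))  # hundreds of distinct patterns
nrow(unique(chunkyNA))  # only 2: block missing or fully observed
```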
Another relevant issue that might crop up is the so-called fraction of
missing information: the larger the share of the complete-data
information that resides in the missing part of the data, the longer
the EM algorithm will take to converge.
These issues imply that it would be hard to estimate the length of
time until convergence based on the size of the data matrix. That
being said, adding an additional covariate has the potential to
worsen each of these factors, perhaps markedly.
Hope that helps.
Cheers,
matt.
On Wed, Jul 7, 2010 at 10:01 PM, Kurt Smith
<kurt.smith(a)archimedesmodel.com> wrote:
Hi,
I've just started working with Amelia II to do multiple imputation for
large data sets. It works great but I have some questions about how well
it scales.
In the Honaker & King "What to do about Missing Values..." paper the
authors mention imputing for data sets with 240 variables and 32,000
observations, which I would love to do, but I estimate this would take
~10^6 hours to do one imputation.
I did some test runs and it seems like computing time grows
exponentially with the number of variables. I timed several runs in R
2.10.1 (on an Intel Xeon desktop) and fit a regression that gave me
roughly the following:
time [seconds] = 10^-4 * (# of imputations) * (# of subjects)^0.92 *
1.118^(# of variables)
In these runs I used up to 25,000 subjects and 24 variables. Missing
rates were ~7-12% for most variables.
Based on this it looks like using ~200 variables would take O(10^6)
hours while 120 variables could be done in about a week. As
parallelization only reduces the number of imputations per processor,
not the number of variables, it doesn't look like that would help.
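For reference, those extrapolations come from evaluating the fitted
relationship directly (a sketch; the function name is mine):

```r
# Fitted run-time model from the regression above (time in seconds)
predSeconds <- function(imputations, subjects, variables) {
  1e-4 * imputations * subjects^0.92 * 1.118^variables
}

# Extrapolations for one imputation on 25,000 subjects
predSeconds(1, 25000, 200) / 3600        # on the order of 10^6 hours
predSeconds(1, 25000, 120) / (3600 * 24) # roughly a week, in days
```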
Can anyone comment on run times for large sets? It's possible I've
missed something or the exponential relation doesn't hold for more
variables.
Thanks!
Kurt
--
Kurt Smith, PhD
Scientist II
Archimedes Inc
201 Mission Street, 29th Floor
San Francisco, CA 94105
-
Amelia mailing list served by Harvard-MIT Data Center
[Un]Subscribe/View Archive:
http://lists.gking.harvard.edu/?info=amelia
More info about Amelia:
http://gking.harvard.edu/amelia