Daniel,
There are a lot of people who know a lot more about this than I do, but I am writing a
response to try to help anyway.
There is an option in Amelia II to put priors on the imputations. I used Amelia quite
extensively with survey data and found that quite often the imputed values would exceed the
range I expected for the survey question. For example, if a question was scored on a scale
of 1-7, I would often get values of 0, -1, or 8, 9, 10 in individual imputations. Using a
Bayesian rationale, I set my prior to the 1-7 range to ensure that the imputations did not
extend beyond credible limits in the eyes of my colleagues, who were seeing multiple
imputation for the first time. Mind you, they never questioned listwise deletion, because
they had been taught it by their professors at university or it was the default in their
package.
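In the R version of Amelia, the way I handled hard range restrictions was the bounds
argument (the priors argument is the softer, explicitly Bayesian version). A minimal
sketch, untested on your data, assuming a data frame called survey whose 4th column is
the 1-7 item (the name and column number are purely for illustration):

    library(Amelia)

    # bounds is a (column, lower, upper) matrix; draws for column 4 are
    # kept inside [1, 7] by Amelia's rejection resampling
    bds <- matrix(c(4, 1, 7), nrow = 1, ncol = 3)

    a.out <- amelia(survey, m = 5, bounds = bds)
    summary(a.out)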
My sense is that the errors in your imputations can be reined in using prior historical
knowledge and beliefs. For example, with the fixed costs you could impose a prior on the
range you would expect to see for the variable (something like: 'these cost values will
not fall outside the historical range of $8-11 million'), so that the very stochastic
patterns you are seeing are constrained by the prior; in other words, the minimum and
maximum would be $8 million and $11 million respectively.
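If it helps, here is a rough sketch (untested, with made-up names) of how that belief
could be expressed in the R version of Amelia, using the (row, column, min, max,
confidence) form of the priors matrix. Here plants is a hypothetical data frame whose
6th column is fixed cost in dollars, year is the time index, plant.id the cross-section,
and row = 0 applies the prior to every missing cell in that column:

    library(Amelia)

    # belief: missing fixed costs lie between $8m and $11m, with 95% confidence
    pr <- matrix(c(0, 6, 8e6, 11e6, 0.95), nrow = 1, ncol = 5)

    a.out <- amelia(plants, m = 5, ts = "year", cs = "plant.id",
                    polytime = 2, priors = pr)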
Also, if I were going to take an average of my imputed datasets (which is against
conventional wisdom and not what I would recommend), I would probably impute a minimum of
30 datasets. Intuitively, if I were using a sampling approach to estimate an average, then
based on the CLT I would expect around 30 samples before the mean starts to converge
towards a normal distribution. I have never tried this, and I would probably check the
distribution of my imputations as well. NB: I don't do this myself, so this advice could
be totally wrong.
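As a sketch of what that check might look like (same hypothetical names as above, and
assuming all columns are numeric so the datasets can be added together), you could draw
more imputations, eyeball the observed-versus-imputed densities, and only then average:

    a.out <- amelia(plants, m = 30, ts = "year", cs = "plant.id", polytime = 2)

    # compare observed and imputed densities for the cost variable
    compare.density(a.out, var = "fixed.cost")

    # observed vs. imputed time series for one plant ("plant 1" is a placeholder)
    tscsPlot(a.out, var = "fixed.cost", cs = "plant 1")

    # cell-by-cell average of the 30 completed datasets
    avg <- Reduce(`+`, a.out$imputations) / length(a.out$imputations)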
I would also run my models with listwise deletion versus multiple imputation to see
whether the parameter estimates differ.
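Something along these lines (again a sketch with a made-up formula; mi.meld() in the
Amelia package combines the per-imputation estimates using Rubin's rules):

    # listwise deletion: lm() drops incomplete rows by default
    fit.lw <- lm(fixed.cost ~ hours + om.expense, data = plants)

    # same model on each imputed dataset, then combine
    m <- length(a.out$imputations)
    b <- se <- matrix(NA, nrow = m, ncol = 3)
    for (i in seq_len(m)) {
      fit <- lm(fixed.cost ~ hours + om.expense, data = a.out$imputations[[i]])
      b[i, ]  <- coef(fit)
      se[i, ] <- coef(summary(fit))[, "Std. Error"]
    }
    mi.meld(q = b, se = se)  # pooled coefficients and standard errors

    summary(fit.lw)          # compare with the listwise estimates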
HTH Paul
Dan Matisoff <dmatisof(a)umail.iu.edu> wrote:
Hi all-
I have a panel dataset of about 1000 power plants in the U.S. over 13 years, including
cost data: total non-fuel expenditures, fixed costs, operations & maintenance expenses,
and hours of operation. For each of these variables I am missing about 20% of the
observations; overall, I have about 50% complete observations, meaning that with listwise
deletion I would lose 50% of my observations. Thus, this seems like a perfect application
of Amelia.
If I were to run a simple OLS, I could predict each of the variables with an R-squared of
75% to 95%, depending on whether I include lagged values. However, I can't use this to
fill in missing data, because of the many missing values in the predictor variables.
Again, a perfect reason to use Amelia.
When I run Amelia, I run into several problems.
First, regardless of whether I use polynomials of time of degree 0, 1, 2, or 3, my
imputed dataset seems highly stochastic - much more so than the original data. Fixed cost
data fluctuates from -$5 million one year to $17 million the next, when all of the
observed data for the same power plant is a relatively stable $8 million to $11 million
over 8 years. In another case, where I don't have observed data to compare against, cost
data varies from -$5 million, to +$2 million, to -$13 million, and then to +$11 million,
all in a 4-year span! Even when I average the five datasets, the imputed data seems
extremely stochastic and unrealistic, and highly dependent on the polynomial of time I
select.
Am I doing something wrong? Why is the imputed dataset so stochastic?
Much of the advice on the listserv seems to suggest that one should proceed with the
regressions and not worry about the stochasticity of 20% of the observations. However, I
am going to be using this data as part of a dependent variable in a difference-in-differences
model and in many other complex techniques - several steps down the road, after performing
matching, etc. - and I would strongly prefer not to have to do each statistical step five
times. Instead, I would like to come up with one reasonable dataset from which to proceed
with my regression (and perform the regression steps once). Can I simply average the five
datasets to generate one useful dataset? (And again, why is the imputed data so variable?)
Second, if I attempt to allow an individual time trend to be estimated for each unit, by
interacting the time polynomial with the cross-section, my computer grinds for hours and
never produces anything. Once, after several hours, I got an unknown error. Normally -
with a time polynomial of degree 1, 2, or 3 - it takes my computer (a MacBook running
OS X, with 2 GB of RAM and a 2.4 GHz Intel Core Duo) about 20 seconds to 1 minute to
produce all five datasets. Do I need a supercomputer, or am I doing something wrong?
Thanks in advance for your help,
Daniel Matisoff
Indiana University
Georgia Institute of Technology
-
Amelia mailing list served by Harvard-MIT Data Center
[Un]Subscribe/View Archive: http://lists.gking.harvard.edu/?info=amelia
More info about Amelia: http://gking.harvard.edu/amelia