Hi all-
I have a panel dataset of about 1000 power plants in the U.S. over 13
years, including cost data: total non-fuel expenditures, fixed costs,
operations & maintenance expenses, and hours of operation. For each of
these variables I am missing about 20% of the observations; overall,
only about 50% of my observations are complete, so listwise deletion
would cost me half the sample. This seems like a perfect application
of Amelia.
If I run a simple OLS, I can predict each of these variables with an
R-squared of 0.75 to 0.95, depending on whether I include lagged
values. However, I can't use those regressions to fill in the missing
data, because the predictor variables themselves have many missing
values. Again, a perfect reason to use Amelia.
When I run Amelia, however, I hit several problems.
First, regardless of whether I use a time polynomial of degree 0, 1,
2, or 3, my imputed dataset is highly stochastic - much more so than
the original data. Fixed-cost imputations fluctuate from -$5 million
in one year to $17 million the next, when all of the observed data for
the same power plant sit at a relatively stable $8 to $11 million over
8 years. In another case, where I have no observed data to compare
against, imputed cost data swing from -$5 million to +$2 million, to
-$13 million, and then to +$11 million, all within a 4-year span! Even
when I average the five imputed datasets, the imputations remain
extremely stochastic and unrealistic, and highly dependent on which
time polynomial I select.
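For reference, the call I am running looks roughly like this (dataset
and variable names are illustrative, not my exact code):

```r
library(Amelia)

# ts/cs identify the time index and the cross-section unit;
# polytime is the degree of the time polynomial (I have tried 0-3)
a.out <- amelia(plants, m = 5,
                ts = "year", cs = "plant.id",
                polytime = 2)
```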
Am I doing something wrong? Why is the imputed dataset so stochastic?
Much of the advice on the listserv seems to be to proceed with the
regressions and not worry about the stochasticity of 20% of the
observations. However, this data will eventually feed into the
dependent variable of a difference-in-differences model and other
fairly involved techniques - several steps down the road, after
performing matching, etc. - so I would strongly prefer not to repeat
every statistical step five times. Instead, I would like to arrive at
one reasonable dataset and perform the regression steps once. Can I
simply average the five imputed datasets to generate one usable
dataset? (And again, why is the imputed data so variable?)
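Concretely, by "average each dataset" I mean something like the
following (a sketch, assuming all the imputed columns are numeric):

```r
# imps: the list of m completed data frames that Amelia returns
# in a.out$imputations; average them cell-wise into one dataset
# (factor columns would break the `+`, hence the numeric assumption)
average_imputations <- function(imps) {
  Reduce(`+`, imps) / length(imps)
}
```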
Second, if I try to let an individual time trend be estimated for each
unit by interacting the time polynomial with the cross-section, my
computer grinds for hours and never produces anything. Once, after
several hours, I got an unknown error. Normally - with a time
polynomial of degree 1, 2, or 3 - it takes my computer (a MacBook
running OS X, with 2 GB of RAM and a 2.4 GHz Intel Core Duo) about 20
seconds to 1 minute to produce all five datasets. Do I need a
supercomputer, or am I doing something wrong?
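The call that never finishes looks roughly like this (names again
illustrative, not my exact code):

```r
library(Amelia)

# intercs = TRUE interacts the time-polynomial terms with each
# cross-section unit; with ~1000 plants this adds thousands of
# columns to the imputation model, and this is the call that grinds
a.out <- amelia(plants, m = 5,
                ts = "year", cs = "plant.id",
                polytime = 2, intercs = TRUE)
```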
Thanks in advance for your help,
Daniel Matisoff
Indiana University
Georgia Institute of Technology
-
Amelia mailing list served by Harvard-MIT Data Center
[Un]Subscribe/View Archive:
http://lists.gking.harvard.edu/?info=amelia
More info about Amelia:
http://gking.harvard.edu/amelia