Hi all-
I have a panel dataset of about 1000 power plants in the U.S. over 13
years, including cost data: total non-fuel expenditures, fixed costs,
operations & maintenance expenses, and hours of operation. For each of
these variables I am missing about 20% of the observations; overall,
only about 50% of my observations are complete, so listwise deletion
would cost me half the sample. This seems like a perfect application
of Amelia.
If I run a simple OLS, I can predict each of these variables with an
R-squared of 0.75 to 0.95, depending on whether I include lagged
values. However, I can't use those regressions to fill in the missing
data, because the predictor variables themselves have many missing
values. Again, a perfect reason to use Amelia.
When I run Amelia, however, I hit several problems.
First, regardless of whether I use a time polynomial of degree 0, 1,
2, or 3, my imputed dataset is highly stochastic - much more so than
the original data. Fixed-cost imputations fluctuate from -$5 million
in one year to $17 million the next, when all of the observed data for
the same power plant sit at a relatively stable $8 to $11 million over
8 years. In another case, where I have no observed data to compare
against, imputed cost data swing from -$5 million to +$2 million, to
-$13 million, and then to +$11 million, all within a 4-year span! Even
when I average the five imputed datasets, the imputations remain
extremely stochastic and unrealistic, and highly dependent on which
time polynomial I select.
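For reference, the call I am running looks roughly like this (dataset
and variable names are illustrative, not my exact code):

```r
library(Amelia)

# ts/cs identify the time index and the cross-section unit;
# polytime is the degree of the time polynomial (I have tried 0-3)
a.out <- amelia(plants, m = 5,
                ts = "year", cs = "plant.id",
                polytime = 2)
```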
Am I doing something wrong? Why is the imputed dataset so stochastic?
Much of the advice on the listserv seems to be to proceed with the
regressions and not worry about the stochasticity of 20% of the
observations. However, this data will eventually feed into the
dependent variable of a difference-in-differences model and other
fairly involved techniques - several steps down the road, after
performing matching, etc. - so I would strongly prefer not to repeat
every statistical step five times. Instead, I would like to arrive at
one reasonable dataset and perform the regression steps once. Can I
simply average the five imputed datasets to generate one usable
dataset? (And again, why is the imputed data so variable?)
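Concretely, by "average each dataset" I mean something like the
following (a sketch, assuming all the imputed columns are numeric):

```r
# imps: the list of m completed data frames that Amelia returns
# in a.out$imputations; average them cell-wise into one dataset
# (factor columns would break the `+`, hence the numeric assumption)
average_imputations <- function(imps) {
  Reduce(`+`, imps) / length(imps)
}
```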
Second, if I try to let an individual time trend be estimated for each
unit by interacting the time polynomial with the cross-section, my
computer grinds for hours and never produces anything. Once, after
several hours, I got an unknown error. Normally - with a time
polynomial of degree 1, 2, or 3 - it takes my computer (a MacBook
running OS X, with 2 GB of RAM and a 2.4 GHz Intel Core Duo) about 20
seconds to 1 minute to produce all five datasets. Do I need a
supercomputer, or am I doing something wrong?
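The call that never finishes looks roughly like this (names again
illustrative, not my exact code):

```r
library(Amelia)

# intercs = TRUE interacts the time-polynomial terms with each
# cross-section unit; with ~1000 plants this adds thousands of
# columns to the imputation model, and this is the call that grinds
a.out <- amelia(plants, m = 5,
                ts = "year", cs = "plant.id",
                polytime = 2, intercs = TRUE)
```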
Thanks in advance for your help,
Daniel Matisoff
Indiana University
Georgia Institute of Technology
-
Amelia mailing list served by Harvard-MIT Data Center
[Un]Subscribe/View Archive:
http://lists.gking.harvard.edu/?info=amelia
More info about Amelia:
http://gking.harvard.edu/amelia