Daniel,
There are a lot of people who know a lot more about this than I do, but I am writing a
response to try to help anyway.
There is an option in Amelia II to put priors on the imputations. I used Amelia quite
extensively with survey data and found that quite often the imputed values would exceed the
range I expected for the survey question. For example, if a question was scored on a scale
of 1-7, I would often get values of 0, -1, or 8, 9, 10 in individual imputations. Using a
Bayesian rationale, I set my prior to the 1-7 range to ensure that the imputations did not
extend beyond credible limits in the eyes of my colleagues, who were seeing multiple
imputation for the first time. Mind you, they never questioned listwise deletion, because
they had been taught it by their professors at university or it was the default in their
package.
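In the R version of Amelia, the way I handled hard range restrictions was the bounds
argument (the priors argument is the softer, explicitly Bayesian version). A minimal
sketch, untested on your data, assuming a data frame called survey whose 4th column is
the 1-7 item (the name and column number are purely for illustration):

    library(Amelia)

    # bounds is a (column, lower, upper) matrix; draws for column 4 are
    # kept inside [1, 7] by Amelia's rejection resampling
    bds <- matrix(c(4, 1, 7), nrow = 1, ncol = 3)

    a.out <- amelia(survey, m = 5, bounds = bds)
    summary(a.out)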
My sense is that the errors in your imputations can be reined in using prior historical
knowledge and beliefs. For example, with the fixed costs you could impose a prior on the
range you would expect to see for the variable (something like: 'these cost values will
not fall outside the historical range of $8-11 million'), so that the very stochastic
patterns you are seeing are constrained by the prior; in other words, the minimum and
maximum would be $8 million and $11 million respectively.
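If it helps, here is a rough sketch (untested, with made-up names) of how that belief
could be expressed in the R version of Amelia, using the (row, column, min, max,
confidence) form of the priors matrix. Here plants is a hypothetical data frame whose
6th column is fixed cost in dollars, year is the time index, plant.id the cross-section,
and row = 0 applies the prior to every missing cell in that column:

    library(Amelia)

    # belief: missing fixed costs lie between $8m and $11m, with 95% confidence
    pr <- matrix(c(0, 6, 8e6, 11e6, 0.95), nrow = 1, ncol = 5)

    a.out <- amelia(plants, m = 5, ts = "year", cs = "plant.id",
                    polytime = 2, priors = pr)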
Also, if I were going to take an average of my imputed datasets (which is against
conventional wisdom and not what I would recommend), I would probably impute a minimum of
30 datasets. Intuitively, if I were using a sampling approach to estimate an average, then
based on the CLT I would expect around 30 samples before the mean starts to converge
towards a normal distribution. I have never tried this, and I would probably check the
distribution of my imputations as well. NB: I don't do this myself, so this advice could
be totally wrong.
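As a sketch of what that check might look like (same hypothetical names as above, and
assuming all columns are numeric so the datasets can be added together), you could draw
more imputations, eyeball the observed-versus-imputed densities, and only then average:

    a.out <- amelia(plants, m = 30, ts = "year", cs = "plant.id", polytime = 2)

    # compare observed and imputed densities for the cost variable
    compare.density(a.out, var = "fixed.cost")

    # observed vs. imputed time series for one plant ("plant 1" is a placeholder)
    tscsPlot(a.out, var = "fixed.cost", cs = "plant 1")

    # cell-by-cell average of the 30 completed datasets
    avg <- Reduce(`+`, a.out$imputations) / length(a.out$imputations)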
I would also run my models with listwise deletion versus multiple imputation to see
whether the parameter estimates differ.
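Something along these lines (again a sketch with a made-up formula; mi.meld() in the
Amelia package combines the per-imputation estimates using Rubin's rules):

    # listwise deletion: lm() drops incomplete rows by default
    fit.lw <- lm(fixed.cost ~ hours + om.expense, data = plants)

    # same model on each imputed dataset, then combine
    m <- length(a.out$imputations)
    b <- se <- matrix(NA, nrow = m, ncol = 3)
    for (i in seq_len(m)) {
      fit <- lm(fixed.cost ~ hours + om.expense, data = a.out$imputations[[i]])
      b[i, ]  <- coef(fit)
      se[i, ] <- coef(summary(fit))[, "Std. Error"]
    }
    mi.meld(q = b, se = se)  # pooled coefficients and standard errors

    summary(fit.lw)          # compare with the listwise estimates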
HTH Paul
Dan Matisoff <dmatisof(a)umail.iu.edu> wrote:
Hi all-
I have a panel dataset of about 1000 power plants in the U.S. over 13 years, including
cost data: total non-fuel expenditures, fixed costs, operations & maintenance expenses,
and hours of operation. For each of these variables I am missing about 20% of the
observations; overall, I have about 50% complete observations, meaning that with listwise
deletion I would lose 50% of my observations. Thus, this seems like a perfect application
of Amelia.
If I were to run a simple OLS, I could predict each of the variables with an R-squared of
75% to 95%, depending on whether I include lagged values. However, I can't use this to
fill in missing data, because of the many missing values in the predictor variables.
Again, a perfect reason to use Amelia.
When I run Amelia, I run into several problems.
First, regardless of whether I use polynomials of time of degree 0, 1, 2, or 3, my
imputed dataset seems highly stochastic - much more so than the original data. Fixed cost
data fluctuates from -$5 million one year to $17 million the next, when all of the
observed data for the same power plant is a relatively stable $8 million to $11 million
over 8 years. In another case, where I don't have observed data to compare against, cost
data varies from -$5 million, to +$2 million, to -$13 million, and then to +$11 million,
all in a 4-year span! Even when I average the five datasets, the imputed data seems
extremely stochastic and unrealistic, and highly dependent on the polynomial of time I
select.
Am I doing something wrong? Why is the imputed dataset so stochastic?
Much of the advice on the listserv seems to suggest that one should proceed with the
regressions and not worry about the stochasticity of 20% of the observations. However, I
am going to be using this data as part of a dependent variable in a difference-in-differences
model and in many other complex techniques - several steps down the road, after performing
matching, etc. - and I would strongly prefer not to have to do each statistical step five
times. Instead, I would like to come up with one reasonable dataset from which to proceed
with my regression (and perform the regression steps once). Can I simply average the five
datasets to generate one useful dataset? (And again, why is the imputed data so variable?)
Second, if I attempt to allow an individual time trend to be estimated for each unit, by
interacting the time polynomial with the cross-section, my computer grinds for hours and
never produces anything. Once, after several hours, I got an unknown error. Normally -
with a time polynomial of degree 1, 2, or 3 - it takes my computer (a MacBook running
OS X, with 2 GB of RAM and a 2.4 GHz Intel Core Duo) about 20 seconds to 1 minute to
produce all five datasets. Do I need a supercomputer, or am I doing something wrong?
Thanks in advance for your help,
Daniel Matisoff
Indiana University
Georgia Institute of Technology
-
Amelia mailing list served by Harvard-MIT Data Center
[Un]Subscribe/View Archive: http://lists.gking.harvard.edu/?info=amelia
More info about Amelia: http://gking.harvard.edu/amelia