Matt -
Thank you, this is very helpful, and it has given me a lot to think
about.
When I leave out polynomials of time, the data still seems to be quite
stochastic (I'm not sure whether it's worse or better than with
polynomials of time). It does seem that the imputations are being heavily
influenced by variance between units, and there is a lot of variation in
the between-unit data. Power plants in the dataset range from over a
hundred years old to brand new. New plants often have negative cost
data, due to capital depreciation schedules (which is why I can't use
Bayesian priors to bound the data), while older plants have higher
costs. Many plants are not operated in certain years, or are operated
at very low capacity, which leads them to have costs of 0. Plant age,
capacity, electricity generation, etc. provide a lot of explanatory
power for costs, but there is certainly a lot of variation between
units. This is why it's particularly important to use an MI program
that accounts for units.
I also understand why I can't interact the cross-section with time due
to computational resources...
My remaining follow-up question is: is it possible to interact a linear
trend with a cross-section using Amelia? I'm not exactly positive what
you mean... does this require estimating a single time trend and
interacting it with fixed-effect intercepts for each unit?
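Just to make sure I follow, is something along these lines what you have
in mind? This is only a rough sketch; "plants", "year", and "plant" are
stand-ins for my actual data frame and column names.

library(Amelia)

# Linear time trend that is allowed to vary across units:
# polytime = 1 requests a linear trend, and intercs = TRUE lets that
# trend (and the intercept) differ across the cross-sectional units.
a.out <- amelia(plants,          # placeholder data frame of plant-years
                m  = 5,
                ts = "year",     # time index
                cs = "plant",    # cross-sectional identifier
                polytime = 1,
                intercs  = TRUE)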
Kind regards,
Dan Matisoff
On Oct 8, 2009, at 9:12 AM, Matt Blackwell wrote:
Hi Dan,
I am curious what happens to these imputations when you leave out
polynomials of time. It sounds to me like the imputations are being
heavily influenced by the variance between units and not within units.
Perhaps you could simply impute with fixed effects.
The problem with adding an interaction with the cross-section is that
it adds N x T variables to the dataset, where N is the number of units
and T is the order of the polynomial of time. You can see how this
would add roughly 3000 variables to your model and why it would slow
down your imputations considerably. Using just polynomials of time,
you only add 3 (or fewer) variables to the regression.
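As a quick back-of-the-envelope check (taking roughly 1000 plants and a
cubic polynomial of time):

n_units    <- 1000   # approximate number of plants in the panel
poly_order <- 3      # cubic polynomial of time
n_units * poly_order # roughly 3000 extra columns once interacted with the cross-section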
Perhaps you could try a linear trend interacted with the cross section.
I hope that helps.
On Wed, Oct 7, 2009 at 6:03 PM, Dan Matisoff <dmatisof(a)umail.iu.edu>
wrote:
Hi all-
I have a panel dataset of about 1000 power plants in the U.S. over 13
years, including cost data: total non-fuel expenditures, fixed costs,
operations & maintenance expenses, and hours of operation. For each of
these variables, I am missing about 20% of the observations. Overall,
I have about 50% complete observations, meaning that with listwise
deletion I would lose 50% of my observations. Thus, this seems like a
perfect application of Amelia.
If I were to run a simple OLS, I could predict each of the variables
with an R-squared of 75% to 95%, depending on whether I include lagged
values. However, I can't use this to fill in missing data because of
the many missing values of the predictor variables. Again, the perfect
reason to use Amelia.
When I run Amelia, I am running into several problems.
First, regardless of whether I use polynomials of time 0, 1, 2, or 3,
my imputed dataset seems highly stochastic, much more so than the
original data. Fixed cost data fluctuates from -$5 million one year to
$17 the next, when all of the observed data for the same power plant is
a relatively stable $8 million to $11 million over 8 years. In another
case, where I don't have observed data to compare against, cost data
varies from -$5 to +$2 million, to -$13 million, and then to +$11
million, all in a 4-year span! Even when I average the five datasets,
the imputed data seems extremely stochastic and unrealistic, and highly
dependent on the polynomial of time I select.
Am I doing something wrong? Why is the imputed dataset so stochastic?
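For reference, here is roughly what I have been running and how I have
been eyeballing the imputations. This is a sketch only; the data frame
and variable names ("plants", "year", "plant", "fixed_cost",
"plant_00123") stand in for my actual ones.

library(Amelia)

# Roughly the call I have been trying (with polytime varied from 0 to 3):
a.out <- amelia(plants, m = 5,
                ts = "year", cs = "plant",
                polytime = 2)

# Compare one plant's imputed series against its observed values
# ("plant_00123" is a placeholder for a single plant's ID):
tscsPlot(a.out, var = "fixed_cost", cs = "plant_00123")

# Overimputation diagnostic: re-impute observed cells and compare.
overimpute(a.out, var = "fixed_cost")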
It appears that much of the advice on the listserv suggests that one
should proceed with the regressions and not worry about the
stochasticity of 20% of the observations. However, if I am going to be
using this data as part of a dependent variable in a
difference-in-differences model, or as part of many other complex
techniques several steps down the road (after performing matching,
etc.), I would strongly prefer not to have to do each statistical step
five times, and would instead like to come up with a reasonable dataset
from which to proceed with my regression (and perform the regression
steps once). Can I simply average the datasets to generate one useful
dataset? (And again, why is the imputed data so variable?)
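For what it's worth, my understanding of the standard route is
something like the sketch below, running the analysis on each imputed
dataset and combining with Amelia's mi.meld (the lm formula and
variable names are just placeholders), which is exactly the per-step
repetition I was hoping to avoid:

# a.out is the amelia() output; fit the model on each imputed dataset,
# then combine estimates and standard errors with Rubin's rules.
b.list  <- NULL
se.list <- NULL
for (i in 1:a.out$m) {
  fit <- lm(fixed_cost ~ capacity + plant_age,
            data = a.out$imputations[[i]])
  b.list  <- rbind(b.list,  coef(fit))
  se.list <- rbind(se.list, coef(summary(fit))[, "Std. Error"])
}
combined <- mi.meld(q = b.list, se = se.list)
combined$q.mi   # combined coefficients
combined$se.mi  # combined standard errors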
Second, if I attempt to allow an individual time trend to be estimated
for each unit, by interacting with the cross-section, my computer
grinds for hours and never produces anything. Once, after several
hours, I got an unknown error. Normally, with a 1, 2, or 3 time
polynomial, it takes my computer (a MacBook running OS X, with 2 GB of
RAM and a 2.4 GHz Intel Core Duo chip) about 20 seconds to 1 minute to
produce all five datasets. Do I need a supercomputer, or am I doing
something wrong?
Thanks in advance for your help,
Daniel Matisoff
Indiana University
Georgia Institute of Technology
-
Amelia mailing list served by Harvard-MIT Data Center
[Un]Subscribe/View Archive:
http://lists.gking.harvard.edu/?info=amelia
More info about Amelia:
http://gking.harvard.edu/amelia