Hi Dan,
For unit-specific linear trends, you can use the polynomial of
time argument (polytime = 1) and the cross-section interaction
argument (intercs = TRUE) to impute with a linear trend within each
unit. Your interpretation of how this works is right on--a time
trend with interactions to account for between-unit variation.
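For concreteness, a minimal sketch of that call (here "plants",
"year", and "plant" are placeholders for your data frame and your
time and cross-section identifiers; substitute your own):

  library(Amelia)
  # polytime = 1 adds a linear time term; intercs = TRUE lets that
  # trend (and the intercept) vary across the cross-section units
  a.out <- amelia(plants, m = 5, ts = "year", cs = "plant",
                  polytime = 1, intercs = TRUE)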
Cheers,
matt.
On Thu, Oct 8, 2009 at 12:19 PM, Dan Matisoff <dmatisof(a)umail.iu.edu> wrote:
Matt -
Thank you, this is very helpful, and it has given me a lot to think about.
When I leave out polynomials of time, the data seem quite stochastic
(I'm not sure if it's worse or better than with polynomials of time). It
does seem that the imputations are being heavily influenced by variance
between units. There is a lot of variation in the between-unit data.
Power plants in the dataset range from over a hundred years old to brand
new. New plants often have negative cost data due to capital
depreciation schedules (which is why I can't use Bayesian priors to
bound the data), while older plants have higher costs. Many plants are
not operated in certain years, or are operated at very low capacity,
which leads them to have costs of 0. Plant age, capacity, electricity
generation, etc., provide a lot of explanatory power for costs, but
there is certainly a lot of variation between units. This is why it's
particularly important to use an MI program that accounts for unit
effects.
I also understand why I can't interact the cross-section with time due to
computational resources...
My remaining / follow-up questions are:
Is it possible to interact a linear trend with a cross-section using Amelia?
I'm not positive exactly what you mean... does this require estimating
a single time trend and interacting it with fixed-effect intercepts for
each unit?
Kind regards,
Dan Matisoff
On Oct 8, 2009, at 9:12 AM, Matt Blackwell wrote:
Hi Dan,
I am curious what happens to these imputations when you leave out
polynomials of time. It sounds to me like the imputations are being
heavily influenced by the variance between units and not within units.
Perhaps you could simply impute with fixed effects.
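As a rough sketch of that - this is my reading of the arguments, so do
check the resulting imputations - a constant level interacted with the
cross-section should reduce to unit-specific intercepts (again with
"plants", "year", and "plant" standing in for your own names):

  library(Amelia)
  # polytime = 0 fits constant levels; intercs = TRUE lets those
  # levels vary across units, i.e., unit fixed effects
  a.fe <- amelia(plants, m = 5, ts = "year", cs = "plant",
                 polytime = 0, intercs = TRUE)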
The problem with adding an interaction with the cross section is that
it adds N x T variables to the dataset, where N is the number of units
and T is the order of the polynomial of time. With your roughly 1000
plants and a cubic polynomial, that is about 3000 added variables, and
you can see why this would slow down your imputations considerably.
Using just polynomials of time, you only add 3 (or fewer) variables to
the regression.
Perhaps you could try a linear trend interacted with the cross section,
which by the same arithmetic adds only N variables rather than 3N.
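To make the contrast concrete, a sketch with the same hypothetical
names as above:

  library(Amelia)
  # cubic trend interacted with ~1000 units: ~3000 added columns (slow)
  a.cubic <- amelia(plants, m = 5, ts = "year", cs = "plant",
                    polytime = 3, intercs = TRUE)
  # linear trend interacted with the cross section: ~1000 added columns
  a.lin <- amelia(plants, m = 5, ts = "year", cs = "plant",
                  polytime = 1, intercs = TRUE)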
I hope that helps.
On Wed, Oct 7, 2009 at 6:03 PM, Dan Matisoff <dmatisof(a)umail.iu.edu> wrote:
Hi all-
I have a (panel) dataset of about 1000 power plants in the U.S. over 13
years, including cost data: total non-fuel expenditures, fixed costs,
operations & maintenance expenses, and hours of operation. For each of
these variables, I am missing about 20% of the observations; overall, I
have about 50% complete observations, meaning that with listwise
deletion I would lose 50% of my observations. Thus, this seems like a
perfect application of Amelia.
If I were to run a simple OLS, I could predict each of the variables
with an R-squared of 75% to 95%, depending on whether I include lagged
values. However, I can't use this to fill in missing data, because of
the many missing values of the predictor variables. Again, the perfect
reason to use Amelia.
When I run Amelia, however, I am running into several problems.
First, regardless of whether I use polynomials of time of order 0, 1,
2, or 3, my imputed dataset seems highly, highly stochastic - much more
so than the original data. Fixed cost data fluctuates from -$5 million
one year to $17 million the next, when all of the observed data for the
same power plant are a relatively stable $8 million to $11 million over
8 years. In another case, where I don't have observed data to compare
it to, cost data varies from -$5 million, to +$2 million, to -$13
million, and then to +$11 million, all in a 4-year span! Even when I
average the five datasets, the imputed data seem extremely stochastic
and unrealistic, and highly dependent upon the polynomial of time I
select.
Am I doing something wrong? Why is the imputed dataset so stochastic?
It appears that much of the advice on the listserv suggests that one
should proceed with the regressions and not worry about the
stochasticity of 20% of the observations; however, I am going to be
using this data as part of a dependent variable in a
difference-in-differences model and in many other complex techniques -
several steps down the road, after performing matching, etc. - and I
would strongly prefer not to have to do each statistical step five
times, and instead come up with a reasonable dataset from which to
proceed with my regression (and perform the regression steps once). Can
I simply average the datasets to generate one useful dataset? (And
again, why is the imputed data so variable?)
Second, if I attempt to allow an individual time trend to be estimated
for each individual by interacting with the cross-section, my computer
grinds for hours and never produces anything. Once, after several
hours, I got an unknown error. Normally - with a polynomial of time of
order 1, 2, or 3 - it takes my computer (a MacBook running OS X, with 2
GB of RAM and a 2.4 GHz Intel Core Duo chip) about 20 seconds to 1
minute to produce all five datasets. Do I need a supercomputer, or am I
doing something wrong?
Thanks in advance for your help,
Daniel Matisoff
Indiana University
Georgia Institute of Technology
-
Amelia mailing list served by Harvard-MIT Data Center
[Un]Subscribe/View Archive:
http://lists.gking.harvard.edu/?info=amelia
More info about Amelia:
http://gking.harvard.edu/amelia