Matt -
Thank you, this is very helpful, and it has given me a lot to think
about.
When I leave out polynomials of time, the data still seems to be quite
stochastic (I'm not sure whether it's worse or better than with
polynomials of time). It does seem that the imputations are being heavily
influenced by variance between units, and there is a lot of variation in
the between-unit data. Power plants in the dataset range from over a
hundred years old to brand new. New plants often have negative cost
data, due to capital depreciation schedules (which is why I can't use
Bayesian priors to bound the data), while older plants have higher
costs. Many plants are not operated in certain years, or are operated
at very low capacity, which leads them to have costs of 0. Plant age,
capacity, electricity generation, etc. provide a lot of explanatory
power for costs, but there is certainly a lot of variation between
units. This is why it's particularly important to use an MI program
that accounts for units.
I also understand why I can't interact the cross-section with time due
to computational resources...
My remaining follow-up question is: is it possible to interact a linear
trend with a cross-section using Amelia? I'm not exactly positive what
you mean... does this require estimating a single time trend and
interacting it with fixed-effect intercepts for each unit?
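Just to make sure I follow, is something along these lines what you have
in mind? This is only a rough sketch; "plants", "year", and "plant" are
stand-ins for my actual data frame and column names.

library(Amelia)

# Linear time trend that is allowed to vary across units:
# polytime = 1 requests a linear trend, and intercs = TRUE lets that
# trend (and the intercept) differ across the cross-sectional units.
a.out <- amelia(plants,          # placeholder data frame of plant-years
                m  = 5,
                ts = "year",     # time index
                cs = "plant",    # cross-sectional identifier
                polytime = 1,
                intercs  = TRUE)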
Kind regards,
Dan Matisoff
On Oct 8, 2009, at 9:12 AM, Matt Blackwell wrote:
Hi Dan,
I am curious what happens to these imputations when you leave out
polynomials of time. It sounds to me like the imputations are being
heavily influenced by the variance between units and not within units.
Perhaps you could simply impute with fixed effects.
The problem with adding an interaction with the cross-section is that
it adds N x T variables to the dataset, where N is the number of units
and T is the order of the polynomial of time. You can see how this
would add roughly 3000 variables to your model and why it would slow
down your imputations considerably. Using just polynomials of time,
you only add 3 (or fewer) variables to the regression.
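As a quick back-of-the-envelope check (taking roughly 1000 plants and a
cubic polynomial of time):

n_units    <- 1000   # approximate number of plants in the panel
poly_order <- 3      # cubic polynomial of time
n_units * poly_order # roughly 3000 extra columns once interacted with the cross-section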
Perhaps you could try a linear trend interacted with the cross section.
I hope that helps.
On Wed, Oct 7, 2009 at 6:03 PM, Dan Matisoff <dmatisof(a)umail.iu.edu>
wrote:
Hi all-
I have a panel dataset of about 1000 power plants in the U.S. over 13
years, including cost data: total non-fuel expenditures, fixed costs,
operations & maintenance expenses, and hours of operation. For each of
these variables, I am missing about 20% of the observations. Overall,
I have about 50% complete observations, meaning that with listwise
deletion I would lose 50% of my observations. Thus, this seems like a
perfect application of Amelia.
If I were to run a simple OLS, I could predict each of the variables
with an R-squared of 75% to 95%, depending on whether I include lagged
values. However, I can't use this to fill in missing data because of
the many missing values of the predictor variables. Again, the perfect
reason to use Amelia.
When I run Amelia, I am running into several problems.
First, regardless of whether I use polynomials of time 0, 1, 2, or 3,
my imputed dataset seems highly stochastic, much more so than the
original data. Fixed cost data fluctuates from -$5 million one year to
$17 the next, when all of the observed data for the same power plant is
a relatively stable $8 million to $11 million over 8 years. In another
case, where I don't have observed data to compare against, cost data
varies from -$5 to +$2 million, to -$13 million, and then to +$11
million, all in a 4-year span! Even when I average the five datasets,
the imputed data seems extremely stochastic and unrealistic, and highly
dependent on the polynomial of time I select.
Am I doing something wrong? Why is the imputed dataset so stochastic?
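For reference, here is roughly what I have been running and how I have
been eyeballing the imputations. This is a sketch only; the data frame
and variable names ("plants", "year", "plant", "fixed_cost",
"plant_00123") stand in for my actual ones.

library(Amelia)

# Roughly the call I have been trying (with polytime varied from 0 to 3):
a.out <- amelia(plants, m = 5,
                ts = "year", cs = "plant",
                polytime = 2)

# Compare one plant's imputed series against its observed values
# ("plant_00123" is a placeholder for a single plant's ID):
tscsPlot(a.out, var = "fixed_cost", cs = "plant_00123")

# Overimputation diagnostic: re-impute observed cells and compare.
overimpute(a.out, var = "fixed_cost")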
It appears that much of the advice on the listserv suggests that one
should proceed with the regressions and not worry about the
stochasticity of 20% of the observations. However, if I am going to be
using this data as part of a dependent variable in a
difference-in-differences model, or as part of many other complex
techniques several steps down the road (after performing matching,
etc.), I would strongly prefer not to have to do each statistical step
five times, and would instead like to come up with a reasonable dataset
from which to proceed with my regression (and perform the regression
steps once). Can I simply average the datasets to generate one useful
dataset? (And again, why is the imputed data so variable?)
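For what it's worth, my understanding of the standard route is
something like the sketch below, running the analysis on each imputed
dataset and combining with Amelia's mi.meld (the lm formula and
variable names are just placeholders), which is exactly the per-step
repetition I was hoping to avoid:

# a.out is the amelia() output; fit the model on each imputed dataset,
# then combine estimates and standard errors with Rubin's rules.
b.list  <- NULL
se.list <- NULL
for (i in 1:a.out$m) {
  fit <- lm(fixed_cost ~ capacity + plant_age,
            data = a.out$imputations[[i]])
  b.list  <- rbind(b.list,  coef(fit))
  se.list <- rbind(se.list, coef(summary(fit))[, "Std. Error"])
}
combined <- mi.meld(q = b.list, se = se.list)
combined$q.mi   # combined coefficients
combined$se.mi  # combined standard errors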
Second, if I attempt to allow an individual time trend to be estimated
for each unit, by interacting with the cross-section, my computer
grinds for hours and never produces anything. Once, after several
hours, I got an unknown error. Normally, with a 1, 2, or 3 time
polynomial, it takes my computer (a MacBook running OS X, with 2 GB of
RAM and a 2.4 GHz Intel Core Duo chip) about 20 seconds to 1 minute to
produce all five datasets. Do I need a supercomputer, or am I doing
something wrong?
Thanks in advance for your help,
Daniel Matisoff
Indiana University
Georgia Institute of Technology
-
Amelia mailing list served by Harvard-MIT Data Center
[Un]Subscribe/View Archive:
http://lists.gking.harvard.edu/?info=amelia
More info about Amelia:
http://gking.harvard.edu/amelia