Dear all,
I am using Amelia II to impute but I have a question regarding one of
my variables which seems to pose a problem for multiple imputation
unless I can find a better coding rule:
I want to control for *compliance with treatment* (a binary 0/1 variable), but
I am not sure how to deal with the following:
Say patient 1 is under T1 in 1990 and complies: then the dummy for T1 = 1 and
the dummy for compliance = 1.
Say patient 2 is under T1 in 1999 and does not comply: then the dummy for
T1 = 1 and the dummy for compliance = 0.
But then say patient 1 is NOT under T1 in 1980 (T1 = 0):
what value should I assign to compliance in this case? Should I leave it
missing (no value)? That makes sense, but I will lose many observations.
On the other hand, it doesn't make sense to assign 1 or 0 to compliance if
there was nothing to comply (or fail to comply) with in the first place.
Moreover, if I leave the compliance value missing in this case, then when I
use Amelia II the missing value will be imputed, and I am not sure that would
be correct, given that in reality compliance did not exist because there was
no treatment to comply with.
Thanks in advance,
Regards,
Helen A. Brown
-
Amelia mailing list served by Harvard-MIT Data Center
[Un]Subscribe/View Archive: http://lists.gking.harvard.edu/?info=amelia
More info about Amelia: http://gking.harvard.edu/amelia
Hi Folks,
I'm getting some odd levels in my post-imputation data:
> levels(as.factor(a.out$imputations[[1]]$facts_convict))
[1] "1" "1" "2" "3" "4" "5" "6"
The "facts_convict" variable is imputed as ordinal, but the double "1"s
aren't there prior to imputing:
> levels(as.factor(my.data$facts_convict))
[1] "1" "2" "3" "4" "5" "6"
Also, I can't seem to cure this by converting to numeric and back:
> levels(as.factor(as.numeric(a.out$imputations[[1]]$facts_convict)))
[1] "1" "1" "2" "3" "4" "5" "6"
Not sure how or why the double "1" crept in during imputation. As you can
imagine, the ordered logits end up being odd, with two estimates for "1". Any
thoughts?
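For reference, here is a quick check I can run on the imputed column (a
sketch; it just prints the distinct underlying values at high precision, in
case two slightly different numbers are both displaying as "1"):

x <- a.out$imputations[[1]]$facts_convict
class(x)                              # numeric or factor after imputation?
length(unique(x))                     # how many distinct underlying values?
print(sort(unique(x)), digits = 17)   # show them at (near) full precision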
Don
Donald Braman
phone: 413-628-1221
http://www.culturalcognition.net/braman/
http://ssrn.com/author=286206
http://www.law.gwu.edu/Faculty/profile.aspx?id=10123
Matt -
Thank you, this is very helpful, and it has given me a lot to think
about.
When I leave out polynomials of time, the imputations still seem quite
stochastic (I'm not sure whether it's worse or better than with polynomials
of time). It does seem that the imputations are being heavily influenced by
variance between units, and there is a lot of variation in the between-unit
data. Power plants in the dataset range from over a hundred years old to
brand new. New plants often have negative cost data, due to capital
depreciation schedules (which is why I can't use Bayesian priors to bound the
data), while older plants have higher costs. Many plants are not operated in
certain years, or are operated at very low capacity, which leads them to have
costs of 0. Plant age, capacity, electricity generation, etc., provide a lot
of explanatory power for costs, but there is certainly a lot of variation
between units. This is why it's particularly important to use an MI program
that accounts for the unit.
I also understand why I can't interact the cross-section with time due
to computational resources...
My remaining / follow-up questions are:
Is it possible to interact a linear trend with a cross-section using Amelia?
I'm not exactly sure what you mean... does this require estimating a single
time trend and interacting it with fixed-effect intercepts for each unit?
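My best guess at what that would look like in an Amelia call is something
like this (just a sketch; 'costdata', 'plant', and 'year' are placeholders
for my actual object and column names):

# Sketch: a linear time trend (polytime = 1) interacted with the
# cross-section (intercs = TRUE), so each unit gets its own trend.
# 'costdata', 'plant', and 'year' are placeholder names.
library(Amelia)
a.out <- amelia(costdata, m = 5, ts = "year", cs = "plant",
                polytime = 1, intercs = TRUE)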
Kind regards,
Dan Matisoff
On Oct 8, 2009, at 9:12 AM, Matt Blackwell wrote:
Hi Dan,
I am curious what happens to these imputations when you leave out
polynomials of time. It sounds to me like the imputations are being
heavily influenced by the variance between units and not within units.
Perhaps you could simply impute with fixed effects.
The problem with adding an interaction with the cross section is that
it adds NxT variables to the dataset, where N is the number of units
and T is the order of the polynomials of time. You can see how this
would add roughly 3000 variables to your model and why this would slow
down your imputations considerably. Using just polynomials of time,
you only add 3 (or fewer) variables to the regression.
Perhaps you could try a linear trend interacted with the cross section.
I hope that helps.
On Wed, Oct 7, 2009 at 6:03 PM, Dan Matisoff <dmatisof(a)umail.iu.edu>
wrote:
> Hi all-
>
> I have a (panel) dataset of about 1000 powerplants in the U.S., over
> 13 years, including cost data, which includes total non-fuel
> expenditures, fixed costs, operations & maintenance expenses, and
> hours of operation. For each of the variables, I am missing about 20%
> of the observations.. overall, I have about 50% complete observations,
> meaning, with listwise deletion I would lose 50% of my observations.
> Thus, this seems like a perfect application of Amelia.
>
> If I were to run a simple OLS, I could predict each of the variables
> at an r squared of 75% to 95%, depending on whether I include lagged
> values. However, I can't use this to fill in missing data, because of
> the many missing values of predictor variable. Again, the perfect
> reason to use Amelia.
>
> When I run Amelia, I am running into several problems.
> First, regardless of whether I use polynomials of time 0, 1, 2, or 3,
> my imputed dataset seems highly, highly stochastic - much more so than
> the original data. fixed cost data is fluctuating from -$5 million
> one year, to $17 the next year, when all of the observed data for the
> same powerplant is a relatively stable $8mil to $11mil over 8 years.
> Another case, where I don't have observed data to compare it to - cost
> data varies from -$5 to +$2mil, to -$13mil, and then to +$11 million,
> all in a 4 year span! Even when I average the five datasets, the
> imputed data seems extremely stochastic and unrealistic, and highly
> dependent upon the polynomial of time I select.
>
> Am I doing something wrong? Why is the imputed dataset so stochastic?
>
> It appears that much of the advice on the listserve suggests that one
> should proceed with the regressions, and not worry about the
> stochasticity of 20% of the observations; however, if I am going to be
> using this data as part of a dependent variable in a difference in
> differences model or as part of many other complex techniques -
> several steps down the road, after performing matching, etc - I would
> strongly prefer not to have to do each statistical step 5x each, and
> instead, come up with a reasonable dataset from which to proceed with
> my regression (and perform the regression steps once). Can I simply
> average each dataset to generate 1 useful dataset? (and again, why is
> the imputed data so variable?)
>
> Second, if I attempt to allow an individual time trend to be estimated
> for each individual, by interacting with the cross-section, my
> computer grinds for hours and never produces anything. Once, after
> several hours, I got an unknown error. Normally - with a 1, 2, or 3
> time polynomial, it takes my computer (a Macbook OS X, with 2 gb ram,
> and a 2.4 ghz intel core duo chip) about 20 seconds to 1 minute to
> produce all five datasets. Do I need a supercomputer - or am I doing
> something wrong?
>
> Thanks in advance for your help,
>
> Daniel Matisoff
>
> Indiana University
> Georgia Institute of Technology
-
Amelia mailing list served by Harvard-MIT Data Center
[Un]Subscribe/View Archive: http://lists.gking.harvard.edu/?info=amelia
More info about Amelia: http://gking.harvard.edu/amelia
Hi Amelia II developers and users,
According to King et al. (APSR 95(1), 2001, p. 57, footnote 18):
"If the data are generated using a complex or multistage survey design, then
information abouth the design should be included in the imputation model.
For example. ot accoount for stratified sampling, the imputation model
should include the strata coded as dummy variables."
How should I proceed if my data come from a survey design using clusters?
Almost all the data I analyze use census tracts as PSUs: first, n census
tracts are randomly selected (in the dataset I'm currently working with,
n = 127); then households are randomly selected from each census tract (each
tract containing around a dozen cases). The dataset includes one variable
indicating which census tract/PSU each case comes from. Should I just include
this variable in the MI process as it is (numbering census tracts from 1 to
127), or should I create dummy variables (one for each of the 127 census
tracts in my sample)?
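(If dummies are the way to go, I assume I could let Amelia build them itself
by declaring the tract identifier as a nominal variable, e.g. something like
the sketch below, where 'mydata' and 'tract' are placeholders for my actual
data frame and column names:)

# Sketch: declare the census-tract/PSU identifier as nominal so that
# Amelia treats it as a categorical variable (expanded to dummies
# internally) rather than as the number 1-127.
# 'mydata' and 'tract' are placeholder names.
library(Amelia)
a.out <- amelia(mydata, m = 5, noms = "tract")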
Thanks for all help (again).
Sincerely,
Fabricio Fialho
Daniel,
There are a lot of people who know a lot more about this than I do, but I am writing a response to try to help anyway.
There is an option in Amelia II to use priors on the imputations. I have used Amelia quite extensively with survey data and found that, quite often, the imputations would exceed the range I expected for a survey question. For example, if a question was scored on a scale of 1-7, I would often get values of 0, -1, or 8, 9, 10 in individual imputations. Using a Bayesian rationale, I set my prior to the 1-7 range to ensure that the imputations did not extend beyond credible limits in the eyes of my colleagues, who were seeing multiple imputation for the first time. Mind you, they never questioned listwise deletion, because they were taught it by their professors at university or it was the default in their package.
My sense is that the odd imputations you are seeing can be reined in using prior historical knowledge and beliefs. For example, with the fixed costs you could impose a prior on the range you would expect to see for the variable (e.g., "cost values will not extend outside the historical range of $8-11 million"), so the very stochastic patterns you are seeing would be constrained by your prior. In other words, the minimum and maximum would be 8 and 11 million, respectively.
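In the R version of Amelia, that kind of range restriction can be expressed with the bounds argument; a rough sketch (not my actual code - the column number 4 and the 8-11 million limits are placeholders for wherever your fixed-cost variable sits and whatever range you believe in):

# Sketch: bound imputed values for one column to a plausible range.
# Each row of 'bds' is: column number, lower bound, upper bound.
# Column 4 and the 8-11 million limits are placeholders.
bds <- matrix(c(4, 8e6, 11e6), nrow = 1, ncol = 3)
a.out <- amelia(costdata, m = 5, bounds = bds)
# (add your usual ts/cs/polytime arguments to the call as well)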
Also, if I were going to take an average of my datasets (which is against conventional wisdom and not what I would recommend), I would probably impute a minimum of 30 datasets. Intuitively, if I were using a sampling approach to estimate an average, then based on the CLT I would expect around 30 samples before things start to converge toward a normal distribution. I have never tried this, and I would probably check the distribution of my imputations as well. NB: I don't use this approach, so this advice could be totally wrong.
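For completeness, the more standard route is to fit the same model to each imputed dataset and then combine the estimates with Rubin's rules, rather than averaging the data themselves; roughly like this (a sketch - the formula and variable names are placeholders):

# Sketch: fit the same model to each imputed dataset and combine the
# coefficients with Rubin's rules (placeholder formula and names).
fits  <- lapply(a.out$imputations,
                function(d) lm(cost ~ age + capacity, data = d))
coefs <- sapply(fits, coef)                       # p x m matrix of estimates
vars  <- sapply(fits, function(f) diag(vcov(f)))  # p x m matrix of variances
m     <- length(fits)
qbar  <- rowMeans(coefs)                          # combined point estimates
ubar  <- rowMeans(vars)                           # within-imputation variance
b     <- apply(coefs, 1, var)                     # between-imputation variance
total.var   <- ubar + (1 + 1/m) * b               # Rubin's total variance
combined.se <- sqrt(total.var)                    # combined standard errors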
I would also run my models with listwise deletion versus multiple imputation to see whether my parameters differ.
HTH Paul
> Dan Matisoff <dmatisof(a)umail.iu.edu> wrote:
>
> Hi all-
>
> I have a (panel) dataset of about 1000 powerplants in the U.S., over
> 13 years, including cost data, which includes total non-fuel
> expenditures, fixed costs, operations & maintenance expenses, and
> hours of operation. For each of the variables, I am missing about 20%
> of the observations.. overall, I have about 50% complete observations,
> meaning, with listwise deletion I would lose 50% of my observations.
> Thus, this seems like a perfect application of Amelia.
>
> If I were to run a simple OLS, I could predict each of the variables
> at an r squared of 75% to 95%, depending on whether I include lagged
> values. However, I can't use this to fill in missing data, because of
> the many missing values of predictor variable. Again, the perfect
> reason to use Amelia.
>
> When I run Amelia, I am running into several problems.
> First, regardless of whether I use polynomials of time 0, 1, 2, or 3,
> my imputed dataset seems highly, highly stochastic - much more so than
> the original data. fixed cost data is fluctuating from -$5 million
> one year, to $17 the next year, when all of the observed data for the
> same powerplant is a relatively stable $8mil to $11mil over 8 years.
> Another case, where I don't have observed data to compare it to - cost
> data varies from -$5 to +$2mil, to -$13mil, and then to +$11 million,
> all in a 4 year span! Even when I average the five datasets, the
> imputed data seems extremely stochastic and unrealistic, and highly
> dependent upon the polynomial of time I select.
>
> Am I doing something wrong? Why is the imputed dataset so stochastic?
>
> It appears that much of the advice on the listserve suggests that one
> should proceed with the regressions, and not worry about the
> stochasticity of 20% of the observations; however, if I am going to be
> using this data as part of a dependent variable in a difference in
> differences model or as part of many other complex techniques -
> several steps down the road, after performing matching, etc - I would
> strongly prefer not to have to do each statistical step 5x each, and
> instead, come up with a reasonable dataset from which to proceed with
> my regression (and perform the regression steps once). Can I simply
> average each dataset to generate 1 useful dataset? (and again, why is
> the imputed data so variable?)
>
> Second, if I attempt to allow an individual time trend to be estimated
> for each individual, by interacting with the cross-section, my
> computer grinds for hours and never produces anything. Once, after
> several hours, I got an unknown error. Normally - with a 1, 2, or 3
> time polynomial, it takes my computer (a Macbook OS X, with 2 gb ram,
> and a 2.4 ghz intel core duo chip) about 20 seconds to 1 minute to
> produce all five datasets. Do I need a supercomputer - or am I doing
> something wrong?
>
> Thanks in advance for your help,
>
> Daniel Matisoff
>
> Indiana University
> Georgia Institute of Technology
-
Amelia mailing list served by Harvard-MIT Data Center
[Un]Subscribe/View Archive: http://lists.gking.harvard.edu/?info=amelia
More info about Amelia: http://gking.harvard.edu/amelia
Hi all-
I have a (panel) dataset of about 1000 power plants in the U.S., over
13 years, including cost data: total non-fuel expenditures, fixed costs,
operations & maintenance expenses, and hours of operation. For each of the
variables, I am missing about 20% of the observations; overall, I have about
50% complete observations, meaning that with listwise deletion I would lose
50% of my observations. Thus, this seems like a perfect application of Amelia.
If I were to run a simple OLS, I could predict each of the variables at an
R-squared of 75% to 95%, depending on whether I include lagged values.
However, I can't use this to fill in missing data, because of the many
missing values of the predictor variables. Again, the perfect reason to use
Amelia.
When I run Amelia, I run into several problems.
First, regardless of whether I use polynomials of time of order 0, 1, 2, or
3, my imputed dataset seems highly stochastic - much more so than the
original data. Fixed cost data fluctuates from -$5 million one year to $17
the next, when all of the observed data for the same power plant is a
relatively stable $8 million to $11 million over 8 years. In another case,
where I don't have observed data to compare to, the imputed cost data varies
from -$5 to +$2 million, to -$13 million, and then to +$11 million, all in a
4-year span! Even when I average the five datasets, the imputed data seems
extremely stochastic and unrealistic, and highly dependent upon the
polynomial of time I select.
Am I doing something wrong? Why is the imputed dataset so stochastic?
It appears that much of the advice on the listserv suggests that one should
proceed with the regressions and not worry about the stochasticity of 20% of
the observations; however, if I am going to be using this data as part of a
dependent variable in a difference-in-differences model, or as part of many
other complex techniques several steps down the road (after performing
matching, etc.), I would strongly prefer not to have to do each statistical
step 5 times and instead come up with a reasonable dataset from which to
proceed with my regression (and perform the regression steps once). Can I
simply average the datasets to generate one useful dataset? (And again, why
is the imputed data so variable?)
Second, if I attempt to allow an individual time trend to be estimated for
each individual unit, by interacting with the cross-section, my computer
grinds for hours and never produces anything. Once, after several hours, I
got an unknown error. Normally, with a time polynomial of order 1, 2, or 3,
it takes my computer (a MacBook running OS X, with 2 GB of RAM and a 2.4 GHz
Intel Core Duo) about 20 seconds to 1 minute to produce all five datasets.
Do I need a supercomputer, or am I doing something wrong?
Thanks in advance for your help,
Daniel Matisoff
Indiana University
Georgia Institute of Technology
-
Amelia mailing list served by Harvard-MIT Data Center
[Un]Subscribe/View Archive: http://lists.gking.harvard.edu/?info=amelia
More info about Amelia: http://gking.harvard.edu/amelia