On Wed, 5 Mar 2003, alison holman wrote:
Dear Dr. King,
I was given your name by the statistical consultant on our grant
because I am trying to figure out the best approach to handling
the missing data in our study. I am the data analyst for a study
addressing the health (mental and physical) consequences of the 9-11
terrorist attacks. Our study is a longitudinal survey of adults
(nationwide random sample, repeated measures over time). We have data
collected at 4 time points (9-14 days, 2 mo, 6 mo, and 12 mo post-attacks).
Some respondents have only 2 time points, others have responded at all 4
time points.
I am writing you because I am struggling to learn the best way to
deal with the missing data issue. I would like to make the most of
these data, as this is the most interesting and richest dataset I have
had an opportunity to work on. We have health data collected pre-9-11
with approximately 12-19% missing. After reading through a few papers,
I have realized that I still may be able to do MI *even though* I
suspect the data are not really MAR. However, since I cannot directly
test the MAR assumption, I am not sure how exactly to proceed. I have
been trying to identify potential biases in the missing value patterns
using SPSS. I have identified that the missing health data are
associated with being younger, and the association is rather strong: the
older folks are 80% less likely to have missing health data than the
18-30 yr olds. Given these differences, I was considering imputing
values *within* age categories, using the other demographic data I have
available in the dataset as well.
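A quick sketch of the missingness-by-age tabulation described above (Python; all variable names and numbers here are invented for illustration, not from the study): build an indicator for whether the health item is missing and compute its rate within each age group.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Invented data: three age groups and one health item with missingness.
age_group = rng.choice(["18-30", "31-60", "61+"], size=n)
# Make the younger group far more likely to be missing, as in the study.
p_miss = np.where(age_group == "18-30", 0.30, 0.06)
missing = rng.random(n) < p_miss

# Missingness rate by age group: the pattern described above.
rates = {g: missing[age_group == g].mean() for g in ["18-30", "31-60", "61+"]}
print(rates)
```

A common alternative to imputing *within* age categories is to include age (and the other demographics) among the predictors in a single imputation model; that uses the same information without slicing the sample into small subgroups.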
I have looked at the descriptions of the programs you offer on your
website and I am not sure which of these programs would be best to use
for my purposes. The health data I have are *completely categorical*
(never diagnosed, self-diagnosed, MD-diagnosed) ailments. But as I said
earlier, my dataset is a national probability sample with oversampling
in 4 communities, and with repeated measures on each participant over a
year. I also have post-stratification weights that I need to use for my
analyses. Given that I have complex survey data, what would you
recommend vis-à-vis:
(a) is one of the MI programs a reasonable and valid way to solve my
missing data problem?
(b) which (if any) of the programs would you recommend for me to use for
imputing my missing values?
Have a look at Amelia. It is not designed specifically for sample
attrition (which is your problem), but it has been used for that purpose.
(c) are there special considerations I need to be aware of for complex
survey data?
Not really. You can try to include the sample weights as a (fully
observed) variable in the analysis.
(d) do I need to use weights when doing the imputations? If so, can
Amelia accommodate them?
Since you're not computing causal effects as part of this first-stage
(imputation) analysis, you don't need them as weights. But if they're not
functions of other variables in your analysis, you might control for them
as above.
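To make the "control for the weights" suggestion concrete, here is a minimal sketch in Python rather than Amelia itself (the data and variable names are invented, and a simple single regression fill stands in for a full multiple-imputation model): the weight is just another fully observed column on the right-hand side of the imputation model.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
weight = rng.uniform(0.5, 2.0, n)   # post-stratification weight, fully observed
age = rng.uniform(18, 80, n)        # fully observed demographic
health = 0.02 * age + rng.normal(0, 1, n)
miss = rng.random(n) < 0.15         # ~15% missing, roughly as in the study
health_obs = health.copy()
health_obs[miss] = np.nan

# The weight enters the imputation model as an ordinary right-hand-side
# variable (alongside age), NOT as a case weight.
X = np.column_stack([np.ones(n), weight, age])
obs = ~np.isnan(health_obs)
beta, *_ = np.linalg.lstsq(X[obs], health_obs[obs], rcond=None)

# Fill in the missing values with the model's predictions. (A real MI run
# would add draws of estimation and fundamental uncertainty and repeat m
# times; this deterministic fill only illustrates the model structure.)
health_imp = health_obs.copy()
health_imp[~obs] = X[~obs] @ beta
```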
(e) I intuitively (and perhaps naively) think I should use the
individual items rather than a scale score when imputing values, am I
right? (I have had some stats people advise me to impute at the scale
score level, but that seems to me to compound any potential biases there
may be in the data)...
You're absolutely right. The only qualification: if the individual
items are always missing whenever one is missing, then you might as well
use the scale score. Since this is not normally the case, your intuition
is right.
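A small sketch of the item-level route (invented data; a single regression fill again stands in for a full imputation model): imputing the individual items lets the observed items predict the missing one, and the scale score is built afterward from the completed items.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400
# Three correlated items forming one scale (entirely made-up data).
latent = rng.normal(size=n)
items = latent[:, None] + rng.normal(0, 0.5, (n, 3))
miss = rng.random(n) < 0.20
items[miss, 0] = np.nan  # item 0 missing for ~20%; items 1-2 stay observed

# Item-level imputation: regress the missing item on the observed items.
obs = ~miss
X = np.column_stack([np.ones(n), items[:, 1], items[:, 2]])
beta, *_ = np.linalg.lstsq(X[obs], items[obs, 0], rcond=None)
items_imp = items.copy()
items_imp[miss, 0] = X[miss] @ beta

# The scale score is built AFTER imputing the items, so the information
# in the observed items is not thrown away. Imputing at the scale-score
# level would discard exactly this within-scale information.
scale = items_imp.sum(axis=1)
```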
(f) I am a beginner at using Stata--can Amelia run in Stata, and can you
refer me to someone or some article that describes how to use it in
Stata?
Amelia is a stand-alone program, but it produces imputations that you can
use in another program like Stata. If you use Stata, I'd suggest you use
Clarify (also at the same web page), which will automatically combine the
separate imputations.
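The combining step that such software automates is, in essence, Rubin's rules: average the m point estimates, and set the total variance to the mean within-imputation variance plus (1 + 1/m) times the between-imputation variance. A minimal sketch (the numbers are arbitrary):

```python
import numpy as np

def combine(estimates, variances):
    """Combine m completed-data estimates via Rubin's rules."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    qbar = estimates.mean()            # combined point estimate
    w = variances.mean()               # within-imputation variance
    b = estimates.var(ddof=1)          # between-imputation variance
    t = w + (1 + 1 / m) * b            # total variance of qbar
    return qbar, t

# Five imputations' estimates of one coefficient and their variances.
qbar, t = combine([1.0, 1.2, 0.9, 1.1, 1.05],
                  [0.04, 0.05, 0.04, 0.05, 0.045])
```

The (1 + 1/m) factor is the finite-m correction: with few imputations, the between-imputation spread is itself estimated noisily.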
Finally, I have a more general, philosophical question about doing
imputation. Please pardon my naivete about this stuff, but I am a
novice at this, and I am very serious about wanting to do the
right thing with these data. I understand that there is some debate about
whether it is legitimate to impute DVs in a dataset, yet the
variables I am hoping to impute are going to be both IVs and DVs in
different theoretical analyses (the health data). Many of my colleagues
are uncomfortable imputing DVs--given the nature of my dataset, is this
something I should avoid? I understand that if I don't impute I could
introduce bias simply by deleting cases or using the mean, but when the
data are not MAR, other than having a rich mix of variables to use in
making the imputations, what other precautions do you recommend?
There is some misunderstanding about this issue, but I don't think there
is any real disagreement. You should certainly include the dependent
variables. Omitting them will cause bias. Since the imputations are
drawn from the posterior, there is no endogeneity bias.
Thank you in advance for your expert advice... I really appreciate your
help with this!
Good luck! Sounds like a great project.
Gary
Gary King, King@Harvard.Edu, http://GKing.Harvard.Edu
Center for Basic Research in the Social Sciences
34 Kirkland Street, Rm. 2, Harvard U, Cambridge, MA 02138
Direct (617) 495-2027; Assistant (617) 495-9271; HU-MIT DC (617) 495-4734; eFax (928) 832-7022
Sincerely,
Alison Holman, Ph.D.
Professional Researcher
Center for Health Policy and Research
University of California, Irvine
Irvine, CA 92697
(949) 824-6849 (phone)
(949) 824-3002 (fax)