On Wed, 5 Mar 2003, alison holman wrote:
Dear Dr. King,
I was given your name by the statistical consultant on our grant
because I am trying to figure out the best approach to handling
the missing data in our study. I am the data analyst for a study
addressing the health (mental and physical) consequences of the 9-11
terrorist attacks. Our study is a longitudinal survey of adults
(nationwide random sample, repeated measures over time). We have data
collected at 4 time points (9-14 days, 2 mo, 6 mo, and 12 mo post-attacks).
Some respondents have only 2 time points, others have responded at all 4
time points.
I am writing you because I am struggling to learn the best way to
deal with the missing data issue. I would like to make the most of
these data, as this is the most interesting and richest dataset I have
had an opportunity to work on. We have health data collected pre-9-11
with approximately 12-19% missing. After reading through a few papers,
I have realized that I still may be able to do MI *even though* I
suspect the data are not really MAR. However, since I cannot directly
test the MAR assumption, I am not sure how exactly to proceed. I have
been trying to identify potential biases in the missing value patterns
using SPSS. I have identified that the missing health data are
associated with being younger, and the association is rather strong: the
older folks are 80% less likely to have missing health data than the
18-30 yr olds. Given these differences, I was considering imputing
values *within* age categories, using the other demographic data I have
available in the dataset as well.
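A quick sketch of the missingness-by-age tabulation described above (Python; all variable names and numbers here are invented for illustration, not from the study): build an indicator for whether the health item is missing and compute its rate within each age group.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Invented data: three age groups and one health item with missingness.
age_group = rng.choice(["18-30", "31-60", "61+"], size=n)
# Make the younger group far more likely to be missing, as in the study.
p_miss = np.where(age_group == "18-30", 0.30, 0.06)
missing = rng.random(n) < p_miss

# Missingness rate by age group: the pattern described above.
rates = {g: missing[age_group == g].mean() for g in ["18-30", "31-60", "61+"]}
print(rates)
```

A common alternative to imputing *within* age categories is to include age (and the other demographics) among the predictors in a single imputation model; that uses the same information without slicing the sample into small subgroups.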
I have looked at the descriptions of the programs you offer on your
website and I am not sure which of these programs would be best to use
for my purposes. The health data I have are *completely categorical*
(never diagnosed, self-diagnosed, MD-diagnosed) ailments. But as I said
earlier, my dataset is a national probability sample with oversampling
in 4 communities, and with repeated measures on each participant over a
year. I also have post-stratification weights that I need to use for my
analyses. Given that I have complex survey data, what would you
recommend vis-à-vis:
(a) is one of the MI programs a reasonable and valid way to solve my
missing data problem?
(b) which (if any) of the programs would you recommend for me to use for
imputing my missing values?
Have a look at Amelia. It is not designed specifically for sample
attrition (which is your problem), but it has been used for that purpose.
(c) are there special considerations I need to be aware of for complex
survey data?
Not really. You can try to include the sample weights as a (fully
observed) variable in the analysis.
(d) do I need to use weights when doing the imputations? If so, can
Amelia accommodate them?
Since you're not computing causal effects as part of this first-stage
(imputation) analysis, you don't need them as weights. But if they're not
functions of other variables in your analysis, you might control for them
as above.
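To make the "control for the weights" suggestion concrete, here is a minimal sketch in Python rather than Amelia itself (the data and variable names are invented, and a simple single regression fill stands in for a full multiple-imputation model): the weight is just another fully observed column on the right-hand side of the imputation model.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
weight = rng.uniform(0.5, 2.0, n)   # post-stratification weight, fully observed
age = rng.uniform(18, 80, n)        # fully observed demographic
health = 0.02 * age + rng.normal(0, 1, n)
miss = rng.random(n) < 0.15         # ~15% missing, roughly as in the study
health_obs = health.copy()
health_obs[miss] = np.nan

# The weight enters the imputation model as an ordinary right-hand-side
# variable (alongside age), NOT as a case weight.
X = np.column_stack([np.ones(n), weight, age])
obs = ~np.isnan(health_obs)
beta, *_ = np.linalg.lstsq(X[obs], health_obs[obs], rcond=None)

# Fill in the missing values with the model's predictions. (A real MI run
# would add draws of estimation and fundamental uncertainty and repeat m
# times; this deterministic fill only illustrates the model structure.)
health_imp = health_obs.copy()
health_imp[~obs] = X[~obs] @ beta
```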
(e) I intuitively (and perhaps naively) think I should use the
individual items rather than a scale score when imputing values, am I
right? (I have had some stats people advise me to impute at the scale
score level, but that seems to me to compound any potential biases there
may be in the data)...
You're absolutely right. The only qualification: if the individual
items are always missing whenever one is missing, then you might as well
use the scale score. Since this is not normally the case, your intuition
is right.
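A small sketch of the item-level route (invented data; a single regression fill again stands in for a full imputation model): imputing the individual items lets the observed items predict the missing one, and the scale score is built afterward from the completed items.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400
# Three correlated items forming one scale (entirely made-up data).
latent = rng.normal(size=n)
items = latent[:, None] + rng.normal(0, 0.5, (n, 3))
miss = rng.random(n) < 0.20
items[miss, 0] = np.nan  # item 0 missing for ~20%; items 1-2 stay observed

# Item-level imputation: regress the missing item on the observed items.
obs = ~miss
X = np.column_stack([np.ones(n), items[:, 1], items[:, 2]])
beta, *_ = np.linalg.lstsq(X[obs], items[obs, 0], rcond=None)
items_imp = items.copy()
items_imp[miss, 0] = X[miss] @ beta

# The scale score is built AFTER imputing the items, so the information
# in the observed items is not thrown away. Imputing at the scale-score
# level would discard exactly this within-scale information.
scale = items_imp.sum(axis=1)
```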
(f) I am a beginner at using Stata--can Amelia run in Stata, and can you
refer me to someone or some article that describes how to use it in
Stata?
Amelia is a stand-alone program, but it produces imputations that you can
use in another program like Stata. If you use Stata, I'd suggest you use
Clarify (also at the same web page), which will automatically combine the
separate imputations.
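The combining step that such software automates is, in essence, Rubin's rules: average the m point estimates, and set the total variance to the mean within-imputation variance plus (1 + 1/m) times the between-imputation variance. A minimal sketch (the numbers are arbitrary):

```python
import numpy as np

def combine(estimates, variances):
    """Combine m completed-data estimates via Rubin's rules."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    qbar = estimates.mean()            # combined point estimate
    w = variances.mean()               # within-imputation variance
    b = estimates.var(ddof=1)          # between-imputation variance
    t = w + (1 + 1 / m) * b            # total variance of qbar
    return qbar, t

# Five imputations' estimates of one coefficient and their variances.
qbar, t = combine([1.0, 1.2, 0.9, 1.1, 1.05],
                  [0.04, 0.05, 0.04, 0.05, 0.045])
```

The (1 + 1/m) factor is the finite-m correction: with few imputations, the between-imputation spread is itself estimated noisily.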
Finally, I have a more general, philosophical question about doing
imputation. Please pardon my naivete about this stuff, but I am a
novice at this, and I am very serious about wanting to do the
right thing with these data. I understand that there is some debate about
whether it is legitimate to impute DVs in a dataset, yet the
variables I am hoping to impute are going to be both IVs and DVs in
different theoretical analyses (the health data). Many of my colleagues
are uncomfortable imputing DVs--given the nature of my dataset, is this
something I should avoid? I understand that if I don't impute I could
introduce bias simply by deleting cases or using the mean, but when the
data are not MAR, other than having a rich mix of variables to use in
making the imputations, what other precautions do you recommend?
There is some misunderstanding about this issue, but I don't think there
is any real disagreement. You should certainly include the dependent
variables. Omitting them will cause bias. Since the imputations are
drawn from the posterior, there is no endogeneity bias.
Thank you in advance for your expert advice... I really appreciate your
help with this!
Good luck! Sounds like a great project.
Gary
Gary King, King@Harvard.Edu, http://GKing.Harvard.Edu
Center for Basic Research in the Social Sciences
34 Kirkland Street, Rm. 2, Harvard U, Cambridge, MA 02138
Direct (617) 495-2027; Assistant (617) 495-9271; HU-MIT DC (617) 495-4734; eFax (928) 832-7022
Sincerely,
Alison Holman, Ph.D.
Professional Researcher
Center for Health Policy and Research
University of California, Irvine
Irvine, CA 92697
(949) 824-6849 (phone)
(949) 824-3002 (fax)