Dear Joseph
This is great. I have wanted to see such a discussion for a while. Here
is my $0.02.
> There will always be some uncertainty about our estimates, because
> they are simulations that represent possible values of data that we
> do not have. The only sure-fire means of validating the imputations
> is to have the actual values, which would eliminate the need for
> imputation. Ultimately, you have to make a judgment about the
> credibility of the imputation model itself — does it create
> reasonable estimates?
Yes, this is true. But you make it sound like this uncertainty is a
bad thing. It is a good thing, as the imputation model needs to model
the uncertainty. If we were to impute the expected values (à la SPSS
MVA), we would be artificially reducing our standard errors.
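The variance-shrinkage point can be sketched numerically. Everything below (the normal distribution, the 30% missingness rate) is made up purely for illustration; the contrast between filling holes with the mean versus drawing from a predictive distribution is the part that matters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical complete variable and a 30% completely-at-random mask.
x = rng.normal(loc=50, scale=10, size=10_000)
missing = rng.random(x.size) < 0.3
obs = x[~missing]

# "Expected value" imputation: fill every hole with the observed mean.
mean_imputed = x.copy()
mean_imputed[missing] = obs.mean()

# Proper imputation draw: sample from a predictive distribution instead.
draw_imputed = x.copy()
draw_imputed[missing] = rng.normal(obs.mean(), obs.std(), missing.sum())

# Mean imputation visibly shrinks the spread; drawn values preserve it,
# which is what keeps downstream standard errors honest.
print(round(x.std(), 2), round(mean_imputed.std(), 2), round(draw_imputed.std(), 2))
```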
> AMELIA II offers two diagnostic tools for judging imputed values —
> compare and overimpute (both explained in Prof. King's recommended
> readings). The former command lets you compare the distributions of
> reported and imputed values of a variable. Ask yourself: should the
> missing values have the same distribution as the reported values, or
> should their distribution have a different central tendency,
> dispersion, and/or skew? The graphs produced here will allow you to
> assess your imputation's conformity to these expectations. Compare
> will allow you to check whether these expectations about imputed
> value distributions are fulfilled. Imputed values' distributions do
> not have to match the distribution of reported values, but the
> differences between the two should be explainable.
Let's formalize this just a little more. If your missingness is MCAR,
you should expect the distribution of the imputed values to be the
same as that of the observed values. If your missing data are MAR,
you might see differences. If you have good predictors of income (or
of the missingness of income), and low-income people tend not to
respond to the income question, you will see more imputations at the
lower end of the observed distribution.
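That MAR scenario is easy to simulate. In this toy sketch (all numbers hypothetical, and a simple regression-with-noise imputer standing in for Amelia's actual model), education predicts income, low-education respondents skip the income question more often, and the imputations land at the low end:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Hypothetical data: income depends on education.
edu = rng.normal(12, 3, n)
income = 2_000 * edu + rng.normal(0, 5_000, n)

# MAR: low-education respondents are more likely to skip the income item.
p_miss = 1 / (1 + np.exp(edu - 10))        # high probability when edu is low
missing = rng.random(n) < p_miss

obs_income, obs_edu = income[~missing], edu[~missing]

# Simple model-based imputation among responders: regress income on
# education, then draw imputations with residual noise.
b1, b0 = np.polyfit(obs_edu, obs_income, 1)
resid_sd = np.std(obs_income - (b0 + b1 * obs_edu))
imputed = b0 + b1 * edu[missing] + rng.normal(0, resid_sd, missing.sum())

# As described above: imputed incomes sit below the observed average.
print(round(obs_income.mean()), round(imputed.mean()))
```

Here the difference between the two distributions is exactly the "explainable" kind: it follows from the stated predictor of missingness, not from a broken model.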
You should justify why you are assuming MAR missingness. Therefore
you should be making claims like: I believe education is a good
predictor of the missingness of income, and therefore for low-education
people I expect imputations at the lower end of the distribution. I
do not believe either graphical diagnostic is very useful for making
such statements (and I wonder if there is a good way of producing such
a visual).
So I really do wonder what the purpose of the graphical comparison
of the two distributions is. It can tell me whether my data are more
MAR or MCAR (I guess; maybe not). It can tell me whether the
imputations look the same as the observed values. While seeing this is
kinda *fun* in a very geekish way, I would love to know if anyone else
makes better use of this information in arguing any points, etc.
> Under some circumstances, your model will not predict extreme values
> well. This has happened to me sometimes. My principal concern here
> is that these extreme values do not constitute much of your sample,
> or do not represent data points with undue influence on your results.
> If such observations influence your model, then you have a concern
> with which I have not yet dealt.
Interesting. Very good point. You should see how many of these cases
are off. I never thought of it this way.
In my book I drew the conclusion that our inferences for cases with
extreme values will be less reliable. I suggested being more cautious
when drawing inferences for the variables that suffered from this issue.
But I have no fixes. There might be practical implications not too
well discussed in the manual. Sorry, I have not yet read the newer
paper carefully, but I skimmed it and it did not seem helpful in these
respects. I just found it a week or so ago. I should have put a note
here when you posted it. Please do so when you post future materials,
publish papers on or using Amelia, etc. Everyone, not just Matt and
Gary. Here is mine:
Levente Littvay (2007). Corruption and Democratic Performance. VDM
Verlag Dr. Mueller e.K.
http://www.amazon.com/Corruption-Democratic-Performance-Levente-Littvay/dp/…
> In addition, you can use your preferred spreadsheet or statistical
> package to graph reported and imputed values within panels. Compare
> the imputed and reported values within panels, and ask yourself
> whether the imputed values make sense. If the variable is expected
> to take the form of a random walk, then do the imputed values also
> suggest such a walk? If the variable is one that maintains stable
> trends within panels over time, do the imputed values roughly
> approximate this stable trend? There is always judgment involved,
> and it is important that you are able to make a case for the reader
> to believe in your imputations.
Again, the imputations do model a level of uncertainty. So I would
argue that what you describe above would be easy to do only if the
expected values were imputed. You can average your imputations across
the m datasets and you should get exactly what you describe. But it is
not the expected values we use in the analysis. So if one imputation
is off the trend line, that could be because it was drawn from the
tail end of the predictive distribution. From the perspective of
trends it would look off. (Or am I off on this?)
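A quick numeric sketch of that averaging point, with a made-up linear panel trend and made-up noise (none of these numbers come from Amelia; they just illustrate why a single draw can legitimately sit off the trend while the average of m draws does not):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical panel following a stable linear trend, with one missing
# time point at t = 4 (expected value 3*4 + 5 = 17).
t = np.arange(10)
true_trend = 3.0 * t + 5.0
m = 50  # number of imputed datasets

# Each imputation is a draw around the expected value, so any single
# draw can look visibly off the trend line without being "wrong".
draws = true_trend[4] + rng.normal(0, 4.0, m)

print(round(draws[0], 1))       # one imputation: may sit off the trend
print(round(draws.mean(), 1))   # average over m: close to the expected value
```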
I wonder if there is an easy way of doing this. I often wish I could
ask Amelia to produce datasets where only the imputed values are
present and the originals are not, or to output a file where 0s mark
observed cells and 1s mark imputed cells. (I guess this would not be
hard in R, but I am not that good with R, so I tend to use the GUI.
One day...)
> This is how I've interpreted the materials recommended by Prof.
> King, but I am not a leading expert in missing data imputation. If
> I am completely mistaken, someone please tell me.
What I am missing from Dr. King and coauthors is good practical
prescriptions for the diagnostic features. If you see X, what could
that mean, how can you figure out what it means, is it a concern,
how much of a concern, is there anything you can do to alleviate the
concern, anything you can do to fix it, etc.? A practical guide like
that would be VERY useful for all of us.
Levi
PS: Matt, have you gotten anywhere with the bug that Sheeder and Lynne
ran into? I saw a recent posting with a similar error message. I'd
appreciate an update on this. I need to talk to Lynne about
something I want from him. He might ask. Thanks.
-
Amelia mailing list served by Harvard-MIT Data Center
[Un]Subscribe/View Archive:
http://lists.gking.harvard.edu/?info=amelia