Hi Matissa,
Thanks for pointing this out; we'll look into it. I just want to
confirm that you are setting these variables as nominals, correct?
I should point out that Amelia is designed to handle multivariate
normal data. It is true that even when the data are non-normal, Amelia
often still does quite well. Unfortunately, even an 8-category nominal
variable is fairly non-normal. If you plan to impute a categorical
variable with 180 levels, you may need a different approach that is
better suited to that type of data. Having said that, the behavior
you are describing sounds like a bug that we can take a look at.
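For reference, nominal variables need to be declared explicitly through the noms argument so that Amelia dummy-codes them rather than treating them as continuous. A minimal sketch on simulated data (all variable names and the data-generating setup here are hypothetical, chosen to mimic the 8-category case described in this thread):

```r
library(Amelia)
set.seed(1)

# Hypothetical data: a binary covariate and an 8-level nominal
# variable whose most common level covers ~89% of cases, with
# values deleted completely at random.
n <- 500
mydata <- data.frame(
  sex = rbinom(n, 1, 0.5),
  occ = factor(sample(letters[1:8], n, replace = TRUE,
                      prob = c(0.89, rep(0.11 / 7, 7))))
)
mydata$occ[sample(n, 100)] <- NA

# Declare occ as nominal so Amelia dummy-codes it internally.
a.out <- amelia(mydata, m = 5, noms = "occ")

# Compare the imputed distribution against the observed one to
# check for the mismatch described above.
prop.table(table(a.out$imputations[[1]]$occ))
```

Comparing these proportions across all five imputations is a quick way to see whether the most common level is being systematically under-imputed.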
Thanks,
matt.
On Thu, Oct 14, 2010 at 10:01 AM, Matissa Hollister <m73hollis(a)yahoo.com> wrote:
Thanks for the advice. So how do I view the source code? I'm not an
R user. I tried loading the library in R and typing amelia, but the
only code it revealed was:
function (x, ...)
{
UseMethod("amelia", x)
}
<environment: namespace:Amelia>
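For what it's worth, that output shows amelia is a generic function that only dispatches to a method; the actual implementation lives elsewhere in the package. One way to reach it from an R prompt (assuming, as the method listing should confirm, that the workhorse is a method such as amelia.default):

```r
library(Amelia)

# List the methods registered for the amelia generic.
methods("amelia")

# Print the body of a method even if it is not exported
# from the package namespace.
getAnywhere("amelia.default")
```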
I'm skeptical that I'll be able to figure it out on my own, but I'll
certainly take a look. Meanwhile, I'm hoping that someone more familiar
with the nominal option in Amelia, particularly the developers, might
respond to my plea.
I've run a few more tests, and it seems to be a persistent problem even
with small numbers of possible values. I reduced the dataset down to
just 8 possible outcomes, and yet the value that represents 89% of the
complete cases is imputed in only 66% of the missing cases. I understand
that random variation means the distribution of the imputed values will
not exactly match the complete cases, but this 66% is consistent across
all five imputations. It occurs even when I use just a single variable,
sex, for the imputation. The missing values are missing completely at
random, so there are no differences in measured characteristics that
should cause this discrepancy. Any additional insights would be
gratefully received.
Thanks,
Matissa Hollister
----- Original Message ----
From: James Marca <jmarca(a)translab.its.uci.edu>
To: amelia(a)lists.gking.harvard.edu
Sent: Wed, October 13, 2010 1:20:11 AM
Subject: Re: [amelia] Losing the most common value in nominal imputation
Hi,
I'm afraid I can't offer any insights into your excellent questions.
However, I would suggest reading through the source code to see what
is going on. I found that the code is fairly well organized, and once
you get over the initial learning curve of figuring out the logic, it
helps a great deal with lifting the veil of the "magic" of the
multiple imputation steps.
Regards,
James Marca
On Tue, Oct 12, 2010 at 07:13:56PM -0700, Matissa Hollister wrote:
Hi,
I'm experimenting for the first time with MI and Amelia, so I apologize
if I'm missing something obvious. I may also be trying to do something
that is unfeasible and/or inadvisable: MI for a nominal variable with
many possible values, many of which are very uncommon. In certain cases
Amelia is giving me results that are highly suspicious. In particular,
it seems to be greatly reducing the probability of imputing the most
common value, and at times dropping that value completely. In other
words, value "b" accounts for 85% of the complete cases, and yet not a
single one of the imputed values is assigned "b" in any of the five sets
of imputations. This doesn't seem right.
Here are more details about the specifics of what I'm trying to do. I'm
looking to do a rough approximation of an MI approach covered in the
following paper:

Clogg, C.C., D.B. Rubin, Nathaniel Schenker, Bradley Schultz, and Lynn
Weidman. 1991. "Multiple Imputation of Industry and Occupation Codes in
Census Public-use Samples Using Bayesian Logistic Regression." Journal
of the American Statistical Association 86:68–78.
http://www.jstor.org/stable/2289716
The authors used a sub-sample of Census observations that were
double-coded under both the 1970 and 1980 occupation coding schemes to
multiply impute 1980 occupation codes for the entire 1970 Census. I'm
looking to do a similar thing, but for the 1990-to-2000 change in
occupation coding schemes. Clogg et al.'s approach was to tackle each
1970 occupation code separately. So, for instance, they would take all
observations with 1970 occupation "funeral director" and make this a
separate sample (the sample would include both double-coded funeral
directors (complete cases) and those without 1980 codes (missing
values)). They examined the variety of 1980 occupation codes that were
assigned to the "funeral directors" in the double-coded dataset, and
used observed characteristics (sex, education, industry, etc.) to impute
1980 occupation codes for those funeral directors that were not
double-coded. I'm looking to do a similar procedure, but assigning 1990
occupation codes to observations with only 2000 codes. I have a large
sample of double-coded observations.
The challenge is that some occupations have a very large number of
possible 1990 codes. For instance, I have 7,463 "chief executives" in my
double-coded dataset, and they are assigned to 183 different 1990
occupation codes. Most of these 183 codes are very uncommon, though, and
over 75% of the double-coded observations are assigned to a single code
of "managers n.e.c.". When I use Amelia to do MI and impute 1990
occupation codes for the "chief executives" in my dataset, though, not a
single observation in any of the five imputations is assigned the
"managers n.e.c." code. Instead they are distributed across pretty much
every code except the "managers n.e.c." code.
I think this has to do with the very large number of possible values
being imputed for this nominal variable. Similar cases with a large
number of possible values tend either to have the same problem (no
imputations at all of the most common value) or to vastly
under-represent the most common category (e.g., 96% of the double-coded
dataset has a particular code, but only 22% of the imputed values do).
Cases where the number of possible codes is small seem to have
distributions that are more similar between the complete (double-coded)
and imputed values.
Does this have to do with how nominal variables are treated within
Amelia? The documentation indicates that nominal variables are
transformed into a set of dummy variables for the MI process and then
converted back to a nominal variable at the end. Does the transformation
to the set of dummy variables leave the most common value as the omitted
group? Is it possible that each of the dummy variables is given a
slightly higher probability than it should be, so that by the time the
procedure gets to the omitted group, it is much less likely to be
imputed than it should be?
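The arithmetic behind this guess is easy to illustrate. A toy sketch (not Amelia's actual algorithm; the bias value is invented purely for illustration) shows how, with K - 1 dummy variables, even a tiny upward bias on each dummy's probability is paid for entirely by the omitted category:

```r
# Toy illustration of the omitted-category hypothesis: small
# per-dummy inflation accumulates across many dummies and is
# subtracted wholesale from the omitted group.
K    <- 183                                  # number of codes, as in the example
p    <- c(0.75, rep(0.25 / (K - 1), K - 1))  # true distribution; category 1 omitted
bias <- 0.004                                # hypothetical per-dummy inflation

p.dummies <- p[-1] + bias        # inflated probabilities of the K - 1 dummies
p.omitted <- 1 - sum(p.dummies)  # whatever probability is left over

p.omitted                        # 0.022, far below the true 0.75
```

So a per-dummy error far too small to notice in any single category could, across 182 dummies, almost entirely crowd out the most common value, which is consistent with the pattern described above.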
These are only vague guesses. As I said, I realize that trying to impute a
nominal variable with so many possible values is quite unusual, but at the same
time I am trying to use it for an application for which MI was originally
developed.
Any thoughts, advice, or criticism would be greatly appreciated. I am
happy to provide a sample dataset (just 200k) that demonstrates this
problem.
Thank you for your help,
Matissa Hollister
-
Amelia mailing list served by Harvard-MIT Data Center
[Un]Subscribe/View Archive:
http://lists.gking.harvard.edu/?info=amelia
More info about Amelia:
http://gking.harvard.edu/amelia