Losing the most common value in nominal imputation - Amelia

12 Oct 2010

Hi,

I'm experimenting for the first time with MI and Amelia so I apologize if I'm 
missing something obvious. I'm also perhaps trying to do something that is 
unfeasible and/or inadvisable. I'm trying to do MI for a nominal variable with 
many possible values, and many of those values are very uncommon. In certain 
cases Amelia is giving me results that are highly suspicious. In particular, it 
seems to be greatly reducing the probability of imputing the most common value, 
and at times dropping this value completely. In other words, value "b" accounts

for 85% of the complete cases, and yet not a single one of the imputed values is 
assigned "b" in any of the five sets of imputations. This doesn't seem
right.

Here are more details about the specifics of what I'm trying to do. I'm looking

to do a rough approximation of an MI approach covered in the following paper:

Clogg, C.C., D.B. Rubin, Nathaniel Schenker, Bradley Schultz, and Lynn Weidman. 
1991. "Multiple Imputation of Industry and Occupation Codes in Census Public-use 
Samples Using Bayesian Logistic Regression." Journal of the American Statistical 
Association 86:68–78. http://www.jstor.org/stable/2289716.

The authors used a sub-sample of Census observations that were double-coded 
under both the 1970 and 1980 occupation coding schemes to multiply impute 1980 
occupation codes for the entire 1970 Census. I'm looking to do a similar thing 
but for the 1990 to 2000 change in occupation coding schemes. Clogg & al's 
approach was to tackle each 1970 occupation code separately. So, for instance, 
they would take all observations with 1970 occupation "funeral director" and 
make this a separate sample (the sample would include both double-coded funeral 
directors (complete cases) and those without 1980 codes (missing values)). They 
examined the variety of 1980 occupation codes that were assigned to the "funeral 
directors" in the double-coded dataset, and used observed characteristics (sex, 
education, industry, etc) to impute 1980 occupation codes for those funeral 
directors that were not double-coded. I'm looking to do a similar procedure, but 
assigning 1990 occupation codes to observations with only 2000 codes. I have a 
large sample of double-coded observations.

The challenge is that some occupations have a very large number of possible 1990 
codes. For instance, I have 7,463 "chief executives" in my double-coded dataset,

and they are assigned to 183 different 1990 occupation codes. Most of these 183 
codes are very uncommon, though, and over 75% of the double-coded observations 
are assigned to a single code of "managers n.e.c.". When I use Amelia to do MI 
and impute 1990 occupation codes for the "chief executives" in my dataset, 
though, not a single observation in any of the five imputations is assigned the 
"managers n.e.c." code. Instead they are distributed across pretty much every 
code except the "managers n.e.c" code.

I think this has to do with very large number of possible values being imputed 
in this nominal variable. Similar cases where there are a large number of 
possible values tend to either have the same problem (no imputations at all of 
the most common value) or it vastly under-represents the most common category 
(e.g. 96% of the double-coded dataset has a particular code but only 22% of the 
imputed values do). Cases where the number of possible codes are small seem to 
have distributions that are more similar between the complete (double-coded) and 
imputed values.

Does this have to do with how nominal variables are treated within Amelia? The 
documentation indicates that nominal variables are transformed into a set of 
dummy variables for the MI process, and then converted back to a nominal 
variable at the end. Does the transformation to the set of dummy variables leave 
the most common value as the omitted group? Is it possible that each of the 
dummy variables is given a slightly higher probability than it should, so that 
by the time it gets to the omitted group it's much less likely to be imputed 
than it should be?

These are only vague guesses. As I said, I realize that trying to impute a 
nominal variable with so many possible values is quite unusual, but at the same 
time I am trying to use it for an application for which MI was originally 
developed.

Any thoughts, advice, or criticism would be greatly appreciated. I am happy to 
provide a sample dataset (just 200k) that demonstrates this problem.

Thank you for your help,

Matissa Hollister

-
Amelia mailing list served by Harvard-MIT Data Center
[Un]Subscribe/View Archive: http://lists.gking.harvard.edu/?info=amelia
More info about Amelia: http://gking.harvard.edu/amelia