Dear CEM list,
I am using CEM to match individuals receiving workers' compensation for
an injury to non-injured workers. We have several continuous variables
(e.g., income, firm size, age), a categorical variable (e.g., industry)
as well as some dichotomous variables (e.g., gender, born in state).
The sample is quite large, with many more potential controls (1.2
million) than injured workers (4 thousand). Prior to using CEM I
coarsened the data myself by putting income into quintiles, firm size
into 4 categories, age into 4 groups, and industry into 10 categories.
I then ran CEM with automatic cuts. However, based on the sample size,
Sturges' rule creates 22 bins for each variable, many of which
correspond to values that cannot occur (e.g., 1/2 a woman). The
resulting match is also not very "coarse," with approximately 2,000
strata.
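For reference, the 22 bins are what the textbook form of Sturges' rule gives for a sample of this size. The Python sketch below only reproduces that arithmetic; it assumes CEM's automatic cuts use the same formula, and the variable names and the list of six variables are illustrative (taken from the examples above, not the full model).

```python
import math

def sturges_bins(n: int) -> int:
    """Textbook Sturges' rule: ceil(log2(n)) + 1 bins for n observations."""
    return math.ceil(math.log2(n)) + 1

# Sample size from the description above: 1.2 million controls + 4 thousand treated.
n_total = 1_200_000 + 4_000
bins = sturges_bins(n_total)
print(bins)  # 22

# Number of distinct values per variable after the manual coarsening.
# (Illustrative names; assumes these six variables are the whole matching set.)
levels = {
    "income": 5,         # quintiles
    "firm_size": 4,
    "age": 4,
    "industry": 10,
    "gender": 2,
    "born_in_state": 2,
}

# Upper bound on the number of strata if every distinct value is its own bin.
max_strata = math.prod(levels.values())
print(max_strata)  # 3200
```

Since each variable has at most 10 distinct values, 22 bins per variable cannot coarsen anything: most bins are empty, and the strata are bounded by the product of the distinct values rather than by the bin count.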
To try to improve this, I specified my own cut points (similar to, but
coarser than, the categories mentioned above), and then the program
never seemed to finish running (I killed it after 2 days).
Thus, I am thinking of using a different set of automatic cuts, but I
think the Freedman-Diaconis rule would yield even more cut points, and I
am not sure what other algorithms are available (none are listed in the
Stata Journal article).
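As a rough check on that suspicion, the standard Freedman-Diaconis bin width is 2 * IQR / n^(1/3), so the bin count grows with the cube root of the sample size. The Python sketch below applies it to a variable already coded as quintiles 1 to 5; the range of 4 and the IQR of 2 are assumed values for such a variable, not numbers from the actual data, and CEM's own implementation may differ in detail.

```python
import math

def fd_bins(value_range: float, iqr: float, n: int) -> int:
    """Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3)."""
    width = 2.0 * iqr / n ** (1.0 / 3.0)
    return math.ceil(value_range / width)

n_total = 1_204_000

# Assumed summary statistics for a quintile-coded variable (values 1..5):
# range = 5 - 1 = 4, IQR = 4 - 2 = 2.
fd = fd_bins(value_range=4.0, iqr=2.0, n=n_total)

# Textbook Sturges' rule on the same sample size, for comparison.
sturges = math.ceil(math.log2(n_total)) + 1

print(fd, sturges)
```

Under these assumptions Freedman-Diaconis asks for roughly five times as many bins as Sturges' rule, which is consistent with the concern above.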
Do you have any suggestions on how to coarsen the data further so that I
can get the most out of the program?
Thanks in advance for your help!
Ethan Scherer MPP, CPA
Doctoral Fellow, Pardee RAND Graduate School
1776 Main St., Mailstop M1N
Santa Monica, CA 90401
W: 310-393-0411 x6056
E: escherer(a)rand.org