Mark, Are you running Amelia on hmdc condor? I found running 50 separate
Amelia programs simultaneously on hmdc relatively quick and easy -- the
first 5 imputed datasets were returned within two or three weeks, and then
after that I got about 1 more condor egg ;) per day. I could start my data
analysis with the first five datasets and then additional datasets were
added as they came in. Anders
Note to the powers that be: it would be useful to configure condor to give
streaming output. It is a little difficult to know what is going on with
Amelia (you can't take advantage of the verbose option) when this standard
condor feature isn't enabled. Thank you for such a great service, hmdc!
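For what it's worth, Condor can stream a job's standard output as it runs if the submit description asks for it. A minimal sketch of a submit file that fans out 50 Amelia runs; the wrapper script name and file layout here are hypothetical, and whether streaming is honored may also depend on how the pool administrators have configured things:

```
universe      = vanilla
# run_amelia.sh is a hypothetical wrapper that invokes R on one chunk
executable    = run_amelia.sh
arguments     = $(Process)
output        = amelia_$(Process).out
error         = amelia_$(Process).err
log           = amelia.log
# ask Condor to write stdout as it is produced, not only at job exit
stream_output = True
queue 50
```

This is only what the submit side can request; the pool policy has the last word.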
On Fri, 14 Mar 2008, Mark Manger wrote:
Hi,
I apologize in advance for the lengthy question, but it's representative of
many issues I face when working with large panels of economic data, so I
would be extremely grateful for your suggestions, best practices, experiences
etc.
I'm wondering what I could do to speed up the imputation of my rather large
dataset (a panel of N = 2120 x T = 80 = 169,600 obs). At this pace, my
imputations would run for months. Memory is not the issue; rather, I think
that I have too many priors and/or too many missing values on certain
variables. See below, especially lnAid and lnFDI. Note that the missing
values are concentrated at certain time points (early in the panel) rather
than in specific cross-sectional units.
          Variable |     Obs       Mean   Std. Dev.        Min        Max
-------------------+-----------------------------------------------------
            Polity |  168160   .8924833   6.955011        -10         10
        Corruptlvl |  157820   5.441431   1.799652          0         10
         RuleofLaw |  157820   5.247434   2.204846          0         10
           GovStab |  157820   5.935454   2.064963          0         10
 log of bilat. Aid |   76079   1.919392   2.338255  -2.302585   9.692112
log of FDI in host |   32080   3.918487   2.928901  -2.372018   10.98025
  Capital openness |  155200  -.2888318   1.379179  -1.766966   2.602508
          Polcon V |  154320   .3490876   .3158385          0        .89
log of GDPcap_host |  154560    7.95649   1.053043   4.933741   10.48464
   log of GDP_host |  166480   29.62135    3.04193   22.97718   43.12974
   log of GDP_home |  147381    31.1144   2.128313   26.15253   37.36032
If I don't set range priors, I get nonsensical values for most of the
variables: negative GDP (real GDP, not negative log values), polity scores
out of range, etc. I haven't even tried higher-order polynomials or
interactions with cross-sectional units, although I would prefer to, given
that FDI exhibits a clear trend. Breaking up the dataset randomly into
pieces by cross-section doesn't improve speed.
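On the out-of-range values: instead of (or alongside) a large matrix of observation-level priors, Amelia's `bounds` argument takes a three-column matrix of (column index, lower bound, upper bound) and resamples any imputed cell that falls outside its range, which is typically much cheaper for the EM step than many individual priors. A minimal sketch, assuming the data frame is called `panel` and the column positions are as shown (both hypothetical):

```r
library(Amelia)

# bounds: one row per constrained variable --
# c(column index, lower bound, upper bound).
# Column indices here are illustrative, not taken from the real data.
bds <- rbind(
  c(3, -10, 10),   # Polity
  c(4,   0, 10),   # Corruptlvl
  c(5,   0, 10)    # RuleofLaw
)

a.out <- amelia(panel,
                m        = 5,         # number of imputed datasets
                ts       = "year",    # time index (hypothetical name)
                cs       = "country", # cross-section index
                bounds   = bds,
                polytime = 2)         # quadratic time trend
```

Note that adding `intercs = TRUE` interacts the time polynomial with every cross-sectional unit; with N = 2120 units that adds thousands of parameters, so it may be worth trying `polytime` alone first.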
It seems that I have to make tradeoffs. What do you think would be the best
thing to do, i.e., what is the most time-consuming issue for the EM algorithm?

1. Constrain/shorten the sample to have a higher proportion of observed
   values on lnAid and lnFDI?
2. Accept imputations that are out of range (probably not)?
3. Break up the dataset "vertically" into one with the Aid and one with the
   FDI variable, run two sets of imputations, and merge them again?
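The vertical split in the last option could be sketched as below; the obvious cost is that each slice's imputations ignore the covariance between lnAid and lnFDI, since neither sees the other variable. All column names here are hypothetical:

```r
library(Amelia)

ids    <- c("country", "year")
shared <- c("Polity", "GovStab", "lnGDPcap_host")  # illustrative covariates

# Two vertical slices, each keeping the identifiers and shared covariates
aid.data <- panel[, c(ids, shared, "lnAid")]
fdi.data <- panel[, c(ids, shared, "lnFDI")]

a.aid <- amelia(aid.data, m = 5, ts = "year", cs = "country")
a.fdi <- amelia(fdi.data, m = 5, ts = "year", cs = "country")

# Merge the i-th completed datasets back together on the identifiers
merged <- lapply(seq_len(5), function(i) {
  merge(a.aid$imputations[[i]],
        a.fdi$imputations[[i]][, c(ids, "lnFDI")],
        by = ids)
})
```

Pairing the i-th imputation from each slice keeps m = 5 completed datasets overall, but the lost cross-slice covariance means this is a speed hack, not a statistically equivalent procedure.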
Many thanks,
Mark
--
Mark S. Manger, PhD
Assistant Professor
Department of Political Science, McGill University
mark.manger(a)mcgill.ca
on leave 2007-08:
Advanced Research Fellow, Program on US-Japan Relations
Weatherhead Center for International Affairs
Harvard University
61 Kirkland Street, Room 301
Cambridge, MA 02138
617-495-5998
-
Amelia mailing list served by Harvard-MIT Data Center
[Un]Subscribe/View Archive:
http://lists.gking.harvard.edu/?info=amelia