Mark,

Are you running Amelia on hmdc condor? I found running 50 separate
Amelia programs simultaneously on hmdc relatively quick and easy -- the first
5 imputed datasets were returned within two or three weeks, and after that
I got about 1 more condor egg ;) per day. I could start my data analysis
with the first five datasets and add the additional ones as they came in.

Anders
Note to the powers that be: it would be useful to set condor to give
streaming output. It is a little difficult to know what is going on with
Amelia (you can't take advantage of the verbose option) when this standard
condor feature isn't enabled. Thank you for such a great service, hmdc!
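A minimal sketch of this one-imputation-per-job pattern, assuming the R
version of Amelia; the data frame "mydata", the ts/cs column names, and the
JOB_ID environment variable are hypothetical stand-ins for whatever your
Condor submit file passes to each job:

    library(Amelia)

    ## each Condor job draws one completed dataset and saves it;
    ## the m result files are pooled afterwards for the analysis
    job.id <- as.integer(Sys.getenv("JOB_ID"))  # e.g. Condor's $(Process)
    set.seed(1000 + job.id)                     # distinct seed per job

    a.out <- amelia(mydata, m = 1, ts = "year", cs = "country")
    save(a.out, file = paste("imputation-", job.id, ".RData", sep = ""))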
On Fri, 14 Mar 2008, Mark Manger wrote:
Hi,
I apologize in advance for the lengthy question, but it's representative
of many issues I face when working with large panels of economic data, so
I would be extremely grateful for your suggestions, best practices,
experiences, etc.
I'm wondering what I could do to speed up the imputation of my rather
large dataset (a panel of N = 2120 x T = 80 = 169,600 obs). At this pace, my
imputations would run for months. Memory is not the issue; rather, I think
I have too many priors and/or too many missing values on certain variables.
See below, especially lnAid and lnFDI. Note that the missing values are
concentrated at certain time points (early in the panel) rather than in
specific cross-sectional units.
Variable           |    Obs        Mean   Std. Dev.         Min         Max
-------------------+-------------------------------------------------------
Polity             | 168160    .8924833    6.955011         -10          10
Corruptlvl         | 157820    5.441431    1.799652           0          10
RuleofLaw          | 157820    5.247434    2.204846           0          10
GovStab            | 157820    5.935454    2.064963           0          10
log of bilat. Aid  |  76079    1.919392    2.338255   -2.302585    9.692112
log of FDI in host |  32080    3.918487    2.928901   -2.372018    10.98025
Capital openness   | 155200   -.2888318    1.379179   -1.766966    2.602508
Polcon V           | 154320    .3490876    .3158385           0         .89
log of GDPcap_host | 154560     7.95649    1.053043    4.933741    10.48464
log of GDP_host    | 166480    29.62135     3.04193    22.97718    43.12974
log of GDP_home    | 147381     31.1144    2.128313    26.15253    37.36032
If I don't set range priors, I get nonsensical values for most of the
variables: negative GDP (real GDP, not negative log values), polity scores
out of range, etc. I haven't even tried higher-order polynomials or
interactions with cross-sectional units, although I would prefer to, given
that FDI exhibits a clear trend. Breaking up the dataset randomly into
pieces by cross-sections doesn't improve speed.
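For concreteness, one way to express such range restrictions in the R
Amelia II interface is, as far as I can tell, the bounds argument, a
three-column matrix of (column index, lower, upper); the indices below are
hypothetical and would have to match the actual data frame. Out-of-range
draws are resampled, which itself adds running time:

    library(Amelia)

    ## hypothetical column indices; logical ranges from the summary above
    bds <- rbind(c(2, -10,  10),   # Polity runs from -10 to 10
                 c(3,   0,  10),   # Corruptlvl runs from 0 to 10
                 c(8,   0, .89))   # Polcon V runs from 0 to .89

    a.out <- amelia(mydata, m = 5, ts = "year", cs = "country", bounds = bds)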
It seems that I have to make tradeoffs. What do you think would be the
best thing to do, i.e. what is the most time-consuming issue for the EM
algorithm?

1. Constrain/shorten the sample to have a higher proportion of observed
   values on lnAid and lnFDI?
2. Accept imputations that are out of range (probably not)?
3. Break up the dataset "vertically" into one with the Aid variable and
   one with the FDI variable, run two sets of imputations, and merge them
   again (see the sketch below)?
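In case it clarifies what I mean by option 3, a minimal sketch with
hypothetical variable names, assuming the completed datasets sit in the
imputations component of the amelia output as in recent versions of the
package:

    library(Amelia)

    id.vars <- c("country", "year")
    aid.set <- c(id.vars, "Polity", "Corruptlvl", "lnGDPcap", "lnAid")
    fdi.set <- c(id.vars, "Polity", "Corruptlvl", "lnGDPcap", "lnFDI")

    a.aid <- amelia(mydata[, aid.set], m = 5, ts = "year", cs = "country")
    a.fdi <- amelia(mydata[, fdi.set], m = 5, ts = "year", cs = "country")

    ## merge the i-th completed dataset from each run on the ids, keeping
    ## only lnFDI from the second run so the shared covariates (imputed
    ## separately in each run) are not duplicated
    completed <- lapply(1:5, function(i)
      merge(a.aid$imputations[[i]],
            a.fdi$imputations[[i]][, c(id.vars, "lnFDI")],
            by = id.vars))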
Many thanks,
Mark
--
Mark S. Manger, PhD
Assistant Professor
Department of Political Science, McGill University
mark.manger@mcgill.ca
on leave 2007-08:
Advanced Research Fellow, Program on US-Japan Relations
Weatherhead Center for International Affairs
Harvard University
61 Kirkland Street, Room 301
Cambridge, MA 02138
617-495-5998
-
Amelia mailing list served by Harvard-MIT Data Center
[Un]Subscribe/View Archive:
http://lists.gking.harvard.edu/?info=amelia