Hi,
Thanks for the translation. When a parallel environment is running, say, 10
instances of Amelia simultaneously (the condor_submit default), Amelia saves
the datasets as outdata1.csv, outdata2.csv, etc., in the single directory
from which the condor (the computing-cluster program) job was submitted. Is
it possible that one Amelia run is overwriting imputed datasets that another
of the 10 runs has already created? If so, maybe condor could create a
separate directory for each of the 10 Amelia runs, and/or Amelia could check
the directory it is saving into to make sure it isn't overwriting any
already-existing files.
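One way to guard against this on the Amelia side would be to pick a per-run
output stem before anything is written. A minimal sketch, assuming each run
can choose its own stem (the helper name and the file-naming pattern are
hypothetical, not part of Amelia's interface):

## Sketch: choose a per-run output stem so simultaneous runs never
## write to the same files. The function name and the naming pattern
## are assumptions for illustration only.
pick.unique.stem <- function(base = "outdata") {
  stem <- paste(base, Sys.getpid(), sep = "_")  # process ID keeps runs apart
  existing <- list.files(pattern = paste("^", stem, sep = ""))
  if (length(existing) > 0)                     # refuse to overwrite
    stop("files for stem '", stem, "' already exist")
  stem
}
stem <- pick.unique.stem()  # e.g. "outdata_12345"; use this in the write step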
Also, if I set Amelia to create 5 imputed datasets, set a random-number
seed so I can replicate the results exactly, and then set condor to run my
single Amelia file 10 times, does that mean that condor will create 10
identical sets of 5 imputed datasets? If so, only one of the 10 sets would
be useful.
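For what it's worth, one way around the identical-seeds problem is to derive
each run's seed from its condor process number. A sketch only; it assumes
the submit file passes $(Process) (0 through 9) as the first command-line
argument to R, which is wiring the user would have to add:

## Sketch: a distinct but replicable seed for each of the 10 runs.
## Assumes $(Process) arrives as the first command-line argument;
## that wiring is an assumption, not something Amelia provides.
args <- commandArgs(trailingOnly = TRUE)
run.id <- as.integer(args[1])    # 0, 1, ..., 9
base.seed <- 20080314            # fixed base seed, so results replicate
set.seed(base.seed + run.id)     # each run draws a different stream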
Alternatively, I suppose the Amelia programmers may have intended users to
simply create as many directories and separate Amelia.R files as needed (in
my case 50), and run the condor submit program once for each of the 50
random seeds set in those files. Fifty simultaneous Amelia runs may sound
excessive, but for a relatively large dataset (mine is 17,000 x 100) it is
almost necessary in order to get 10 or 20 imputed datasets back within the
first month of the run.
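If that is the intended workflow, the directory-and-seed setup at least can
be automated. A sketch, with hypothetical directory names and seed-file
convention:

## Sketch: create 50 run directories, each recording its own seed.
## The "runNN" names and seed.txt convention are illustrative only.
base.seed <- 20080314
for (i in 1:50) {
  d <- sprintf("run%02d", i)
  if (!file.exists(d)) dir.create(d)
  writeLines(as.character(base.seed + i), file.path(d, "seed.txt"))
}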
Again, thanks for all the excellent work by the people at HMDC.
Anders
On Fri, 14 Mar 2008, Gary King wrote:
a translation: condor distributes jobs to a 200-node research computing
cluster we have built at the Harvard-MIT Data Center, which is part of the
Institute for Quantitative Social Science at Harvard. unfortunately, that
system isn't open to the public (although we are working on an open-source
version of the software we produced to create the cluster). Anders' point,
more generally, is that Amelia can be used well in a parallel environment.
Gary
On Fri, 14 Mar 2008, Anders Schwartz Corr wrote:
Mark, are you running Amelia on hmdc condor? I found running 50 separate
Amelia programs simultaneously on hmdc relatively quick and easy -- the
first 5 imputed datasets were returned within two or three weeks, and after
that I got about 1 more condor egg ;) per day. I could start my data
analysis with the first five datasets, and additional datasets were added
as they came in. Anders
Note to the powers that be: it would be useful to set condor to give
streaming output. It is a little difficult to know what is going on with
Amelia (one can't take advantage of its verbose output) when this standard
condor feature isn't enabled. Thank you for such a great service, hmdc!
On Fri, 14 Mar 2008, Mark Manger wrote:
Hi,
I apologize in advance for the lengthy question, but it's representative
of many issues I face when working with large panels of economic data, so
I would be extremely grateful for your suggestions, best practices,
experiences etc.
I'm wondering what I could do to speed up the imputation of my rather
large dataset (a panel of N = 2120 x T = 80 = 169,600 obs). At this pace,
my imputations would run for months. Memory is not the issue; rather, I
think I have too many priors and/or too many missing values on certain
variables. See below, especially lnAid and lnFDI. Note that the missing
values are concentrated at certain time points (early in the series)
rather than in specific cross-sectional units.
Variable             Obs      Mean        Std. Dev.   Min         Max
Polity               168160    .8924833   6.955011    -10         10
Corruptlvl           157820   5.441431    1.799652      0         10
RuleofLaw            157820   5.247434    2.204846      0         10
GovStab              157820   5.935454    2.064963      0         10
log of bilat. Aid     76079   1.919392    2.338255    -2.302585    9.692112
log of FDI in host    32080   3.918487    2.928901    -2.372018   10.98025
Capital openness     155200   -.2888318   1.379179    -1.766966    2.602508
Polcon V             154320    .3490876    .3158385     0           .89
log of GDPcap_host   154560   7.95649     1.053043     4.933741   10.48464
log of GDP_host      166480  29.62135     3.04193     22.97718    43.12974
log of GDP_home      147381  31.1144      2.128313    26.15253    37.36032
If I don't set range priors, I get nonsensical values for most of the
variables: negative GDP (real GDP, not negative log values), polity scores
out of range, etc. I haven't even tried higher-order polynomials or
interactions with cross-sectional units, although I would prefer to, given
that FDI exhibits a clear trend. Breaking up the dataset randomly into
pieces by cross-sections doesn't improve speed.
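For reference, the kind of call I have in mind looks roughly like this. A
sketch only: "mydata", "year", and "country" are placeholder names, and the
bounds matrix assumes Amelia II's (column, lower, upper) format:

library(Amelia)

## Sketch with placeholder names and column numbers.
bds <- rbind(c(2, -10, 10),   # Polity stays in [-10, 10]
             c(5,   0, 10))   # GovStab stays in [0, 10]

a.out <- amelia(mydata, m = 5,
                ts = "year", cs = "country",  # panel identifiers
                polytime = 2,    # quadratic time trend
                intercs = TRUE,  # trend interacted with each unit
                bounds = bds)    # logical range restrictions

Note that intercs = TRUE adds a separate set of trend terms for every
cross-sectional unit, which with over 2,000 units is itself a heavy
computational load.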
It seems that I have to make trade-offs. What do you think would be the
best thing to do, i.e., what is the most time-consuming issue for the EM
algorithm?

1. Constrain/shorten the sample to have a higher proportion of observed
values on lnAid and lnFDI?
2. Accept imputations that are out of range (probably not)?
3. Break up the dataset "vertically" into one with the Aid and one with
the FDI variable, run two sets of imputations, and merge them again (see
the sketch after this list)?
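Concretely, option 3 might look like the following. A sketch with
hypothetical column names, assuming the interface where imputed data frames
are returned in a.out$imputations; the shared covariates go into both runs
so the two imputation models overlap:

## Sketch of the "vertical" split, with hypothetical column names.
common <- c("country", "year", "Polity", "Corruptlvl", "RuleofLaw",
            "GovStab", "lnGDPcap_host", "lnGDP_host", "lnGDP_home")

a.aid <- amelia(mydata[, c(common, "lnAid")], m = 5,
                ts = "year", cs = "country")
a.fdi <- amelia(mydata[, c(common, "lnFDI")], m = 5,
                ts = "year", cs = "country")

## Rows stay in the original order, so the i-th imputed datasets can
## be recombined column-wise:
imp1 <- cbind(a.aid$imputations[[1]],
              lnFDI = a.fdi$imputations[[1]]$lnFDI)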
Many thanks,
Mark
--
Mark S. Manger, PhD
Assistant Professor
Department of Political Science, McGill University
mark.manger@mcgill.ca
on leave 2007-08:
Advanced Research Fellow, Program on US-Japan Relations
Weatherhead Center for International Affairs
Harvard University
61 Kirkland Street, Room 301
Cambridge, MA 02138
617-495-5998
-
Amelia mailing list served by Harvard-MIT Data Center
[Un]Subscribe/View Archive:
http://lists.gking.harvard.edu/?info=amelia