R, and therefore Amelia, doesn't have parallel processing built in. But given
the way it works, if you want m imputations you can run Amelia m times in
separate jobs, each producing one imputation, and the full set will finish
roughly m times faster. You do need to tell your parallelization software
(condor in this case) to run each job completely separately, with distinct
output file names, so the jobs don't overwrite one another's imputations.
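A minimal sketch of what each job's R script could look like, assuming the
Amelia II functions amelia() and write.amelia() and a job index passed on the
command line; the seed offset, file stem, and cs/ts column names here are just
placeholders, not anything Amelia requires:

```r
## one_imputation.R -- run as: Rscript one_imputation.R <job-index>
library(Amelia)

## job index supplied by the scheduler (e.g. condor's $(Process))
idx <- as.integer(commandArgs(trailingOnly = TRUE)[1])

## a distinct seed per job, so the m imputations differ
set.seed(1000 + idx)

mydata <- read.csv("mydata.csv")

## one imputation per job (m = 1); cs/ts names are placeholders
a.out <- amelia(mydata, m = 1, cs = "country", ts = "year")

## a distinct file stem per job, so jobs never overwrite each other
write.amelia(a.out, file.stem = paste0("outdata_job", idx, "_"),
             format = "csv")
```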
Best,
Gary
On Sat, 15 Mar 2008, Anders Schwartz Corr wrote:
Hi,
Thanks for the translation. When a parallel environment is running, say, 10
runs of Amelia simultaneously (the condor_submit default), Amelia saves the
datasets as outdata1.csv, outdata2.csv, etc., into the single directory from
which the condor (the computing-cluster scheduler) job was submitted. Is it
possible that one Amelia run is overwriting imputed datasets that another of
the 10 runs has already created? If so, perhaps condor could create a separate
directory for each of the 10 Amelia runs, and/or Amelia could check the
directory it is saving into to make sure it isn't overwriting any
already-existing files.
Also, if I set Amelia to create 5 imputed datasets and set a random-number
seed so I can replicate the results exactly, and then set condor to run my
single Amelia file 10 times, does that mean condor will create 10 identical
sets of 5 imputed datasets? If so, only one of the 10 sets would be useful.
Alternatively, I suppose the Amelia programmers may have intended users to
simply create as many directories and separate Amelia.R files (in my case 50)
as needed, and run the condor submit program once for each of the 50 files,
with a different random seed set in each. Fifty simultaneous Amelia runs may
sound excessive, but for a relatively large dataset (mine is 17,000 x 100) it
is almost necessary in order to get 10 or 20 imputed datasets back within the
first month of the run.
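For what it's worth, a single condor submit file can stamp out those runs
without 50 hand-edited copies, using $(Process) to give each job its own
directory and seed argument. This is only a sketch of standard condor submit
syntax with made-up file names; it assumes the run0 through run49 directories
already exist and that each can see the data and an R script (here called
one_imputation.R) that takes the job index as its argument:

```
# amelia.sub -- submit once with: condor_submit amelia.sub
universe   = vanilla
executable = /usr/bin/Rscript
arguments  = one_imputation.R $(Process)
initialdir = run$(Process)
output     = amelia.out
error      = amelia.err
log        = amelia.log
queue 50
```

Because initialdir changes per job, each run's outdata*.csv files land in
their own directory and nothing gets overwritten.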
Again, thanks for all the excellent work by the people at HMDC.
Anders
On Fri, 14 Mar 2008, Gary King wrote:
A translation: condor distributes jobs to a 200-node research computing
cluster we have built at the Harvard-MIT Data Center, which is part of the
Institute for Quantitative Social Science at Harvard. Unfortunately, that
system isn't open to the public (although we are working on an open-source
version of the software we produced to create the cluster). Anders' point,
more generally, is that Amelia can be used well in a parallel environment.
Gary
On Fri, 14 Mar 2008, Anders Schwartz Corr wrote:
Mark, are you running Amelia on hmdc condor? I found running 50 separate
Amelia programs simultaneously on hmdc relatively quick and easy -- the
first 5 imputed datasets were returned within two or three weeks, and
after that I got about one more condor egg ;) per day. I could start my
data analysis with the first five datasets, and additional datasets were
added as they came in.
Anders
Note to the powers that be: it would be useful to set condor to give
streaming output. It is a little difficult to know what is going on with
Amelia (you can't take advantage of the verbose option) when this standard
condor feature isn't enabled. Thank you for such a great service, hmdc!
On Fri, 14 Mar 2008, Mark Manger wrote:
Hi,
I apologize in advance for the lengthy question, but it's representative
of many issues I face when working with large panels of economic data, so
I would be extremely grateful for your suggestions, best practices,
experiences, etc.
I'm wondering what I could do to speed up the imputation of my rather
large dataset (a panel of N = 2120 x T = 80 = 169,600 obs). At this pace,
my imputations would run for months. Memory is not the issue; rather, I
think that I have too many priors and/or too many missing values on
certain variables. See below, especially lnAid and lnFDI. Note that the
missings are concentrated at certain time points (early in the series)
rather than in specific cross-sectional units.
Variable            |    Obs       Mean   Std. Dev.        Min        Max
--------------------+---------------------------------------------------
Polity              | 168160   .8924833    6.955011        -10         10
Corruptlvl          | 157820   5.441431    1.799652          0         10
RuleofLaw           | 157820   5.247434    2.204846          0         10
GovStab             | 157820   5.935454    2.064963          0         10
log of bilat. Aid   |  76079   1.919392    2.338255  -2.302585   9.692112
log of FDI in host  |  32080   3.918487    2.928901  -2.372018   10.98025
Capital openness    | 155200  -.2888318    1.379179  -1.766966   2.602508
Polcon V            | 154320   .3490876    .3158385          0        .89
log of GDPcap_host  | 154560    7.95649    1.053043   4.933741   10.48464
log of GDP_host     | 166480   29.62135     3.04193   22.97718   43.12974
log of GDP_home     | 147381    31.1144    2.128313   26.15253   37.36032
If I don't set range priors, I get nonsensical values for most of the
variables: negative GDP (real GDP, not negative log values), polity scores
out of range, etc. I haven't even tried higher-order polynomials or
interactions with cross-sectional units, although I would prefer to, given
that FDI exhibits a clear trend. Breaking up the dataset randomly into
pieces by cross-sections doesn't improve speed.
It seems that I have to make tradeoffs. What do you think would be the
best thing to do, i.e., what is the most time-consuming issue for the EM
algorithm?
Constrain/shorten the sample to have a higher proportion of observed
values on lnAid and lnFDI?
Accept imputations that are out of range (probably not)?
Break up the dataset "vertically" into one with the Aid variable and one
with the FDI variable, run two sets of imputations, and merge them again?
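Concretely, the mechanics of that last option would look something like this
(a sketch with made-up column names, keeping the panel identifiers and some
common covariates in both halves; whether this is statistically defensible is
exactly what I'm unsure about):

```r
## sketch of the "vertical" split -- column names are placeholders
library(Amelia)

mydata <- read.csv("mydata.csv")

keys   <- c("country", "year")               # panel identifiers
shared <- c(keys, "Polity", "lnGDPcap")      # covariates kept in both halves

aid <- mydata[, c(shared, "lnAid")]
fdi <- mydata[, c(shared, "lnFDI")]

a.aid <- amelia(aid, m = 5, cs = "country", ts = "year")
a.fdi <- amelia(fdi, m = 5, cs = "country", ts = "year")

## recombine the i-th imputation from each half on the panel keys
merged <- lapply(1:5, function(i)
  merge(a.aid$imputations[[i]],
        a.fdi$imputations[[i]][, c(keys, "lnFDI")],
        by = keys))
```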
Many thanks,
Mark
--
Mark S. Manger, PhD
Assistant Professor
Department of Political Science, McGill University
mark.manger(a)mcgill.ca
on leave 2007-08:
Advanced Research Fellow, Program on US-Japan Relations
Weatherhead Center for International Affairs
Harvard University
61 Kirkland Street, Room 301
Cambridge, MA 02138
617-495-5998
-
Amelia mailing list served by Harvard-MIT Data Center
[Un]Subscribe/View Archive:
http://lists.gking.harvard.edu/?info=amelia