R, and therefore Amelia, doesn't have parallel processing built in.
But because of the way it works, if you want m imputations, you can run
Amelia m times in separate jobs and it will finish roughly m times
faster. You do need to tell your parallelization software (condor in
this case) to run each job completely separately, so that the jobs
don't overwrite each other's imputations by using the same file names. Best,
Gary
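A minimal sketch of what such a per-job run might look like (the job index, input file, and output naming scheme here are assumptions for illustration, not part of Amelia itself):

```r
# Hypothetical per-job Amelia run: each condor job calls this script
# with a distinct job index so output files never collide.
library(Amelia)  # assumes the Amelia package is installed

args <- commandArgs(trailingOnly = TRUE)
job  <- as.integer(args[1])             # e.g. 1..m, one per condor job

mydata <- read.csv("mydata.csv")        # hypothetical input file
a.out  <- amelia(mydata, m = 1)         # one imputation per job

# Job-specific file name prevents jobs from overwriting each other.
write.csv(a.out$imputations[[1]],
          file = paste0("outdata", job, ".csv"),
          row.names = FALSE)
```

Running m such jobs, each with m = 1, yields the same m imputed datasets as one serial run with m imputations, but in roughly 1/m the wall-clock time.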
On Sat, 15 Mar 2008, Anders Schwartz Corr wrote:
Hi,
Thanks for the translation. When a parallel environment is running, say,
10 runs (the condor_submit default) of Amelia simultaneously, Amelia
saves the datasets as outdata1.csv, outdata2.csv, etc., to the single
directory from which the condor (the computing cluster program) job was
submitted. Is it possible that Amelia is overwriting imputed datasets
that one of the other 10 Amelia runs has already created? If so, maybe
condor could create a separate directory for each of the 10 Amelia runs,
and/or Amelia could check the directory it is saving into to make sure
it isn't overwriting any already-existing files.
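The per-run-directory idea can be expressed on the condor side; a hypothetical HTCondor submit file might look like this (the script name and directory layout are assumptions, and the run0..run9 directories must exist before submission):

```
# Hypothetical HTCondor submit file: queue 10 runs, each started in its
# own working directory so output files cannot collide.
universe    = vanilla
executable  = /usr/bin/Rscript
arguments   = run_amelia.R $(Process)
initialdir  = run$(Process)
output      = amelia.out
error       = amelia.err
log         = ../amelia.log
queue 10
```

The `$(Process)` macro expands to 0..9, so each job gets its own `initialdir` and its own stdout/stderr files.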
Also, if I set Amelia to create 5 imputed datasets, and set a
random-number-generating seed so I can replicate the results exactly,
and then set condor to run my single Amelia file 10 times, does that
mean that condor will create 10 identical sets of 5 imputed datasets?
If so, only one of the 10 sets would be useful.
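One way around this (a sketch, assuming each condor job can pass its `$(Process)` number to R as a command-line argument) is to derive the seed from the job index, so every job is reproducible but no two jobs are identical:

```r
# Hypothetical seed scheme: reproducible across reruns, but distinct
# across parallel jobs.
args <- commandArgs(trailingOnly = TRUE)
job  <- as.integer(args[1])   # condor's $(Process): 0, 1, 2, ...

base.seed <- 20080315         # fixed base seed chosen by the user
set.seed(base.seed + job)     # job 0 and job 1 now use different streams
```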
Alternatively, I suppose the Amelia programmers may have intended users
to simply create as many directories and different Amelia.R files (in my
case 50) as needed, set a different random seed in each of the 50 files,
and run the condor submit program once for each. Fifty
simultaneous Amelia runs may sound excessive, but for a relatively large
dataset (mine is 17,000x100) it is almost necessary in order to get 10
or 20 imputed datasets back within the first month of the run.
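Setting up 50 such directories by hand is tedious; a small shell sketch can generate them (the directory names, seed scheme, and `amelia.sub` submit file are hypothetical; the assumed convention is that each run's R script reads its seed from `seed.txt`):

```shell
#!/bin/sh
# Create one working directory per run, each with its own seed, so 50
# independent Amelia jobs can be submitted without output collisions.
for i in $(seq 1 50); do
  mkdir -p "run$i"
  echo $((20080315 + i)) > "run$i/seed.txt"   # distinct seed per run
done

# Submit each run separately, only if condor_submit exists on this host.
if command -v condor_submit >/dev/null 2>&1; then
  for i in $(seq 1 50); do
    (cd "run$i" && condor_submit ../amelia.sub)
  done
fi
```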
Again, thanks for all the excellent work by the people at HMDC.
Anders
On Fri, 14 Mar 2008, Gary King wrote:
A translation: condor distributes jobs to a 200-node research computing
cluster we have built at the Harvard-MIT Data Center, which is part of
the Institute for Quantitative Social Science at Harvard. Unfortunately,
that system isn't open to the public (although we are working on an
open-source version of the software we produced to create the cluster).
Anders' point, more generally, is that Amelia can be used well in a
parallel environment.
Gary
On Fri, 14 Mar 2008, Anders Schwartz Corr wrote:
> Mark, Are you running Amelia on hmdc condor? I found running 50
> separate Amelia programs simultaneously on hmdc relatively quick and
> easy -- the first 5 imputed datasets were returned within two or three
> weeks, and then after that I got about 1 more condor egg ;) per day. I
> could start my data analysis with the first five datasets, and then
> additional datasets were added as they came in.
>
> Anders
>
> Note to the powers that be: it would be useful to set condor to give
> streaming output. It is a little difficult to know what is going on
> with Amelia (can't take advantage of the verbose function) when this
> standard condor function isn't set. Thank you for such a great
> service, hmdc!
> On Fri, 14 Mar 2008, Mark Manger wrote:
> > Hi,
> >
> > I apologize in advance for the lengthy question, but it's
> > representative of many issues I face when working with large panels
> > of economic data, so I would be extremely grateful for your
> > suggestions, best practices, experiences, etc.
> >
> > I'm wondering what I could do to speed up the imputation of my
> > rather large dataset (a panel of N 2120 x T 80 = 169600 obs). At
> > this pace, my imputations would run for months. Memory is not the
> > issue; rather, I think that I have too many priors and/or too many
> > missings on certain variables. See below, especially lnAid and
> > lnFDI. Note that the missings are concentrated at certain T points
> > (early time points) rather than in specific cross-sectional units.
> > Variable           |    Obs       Mean  Std. Dev.        Min        Max
> > -------------------|----------------------------------------------------
> > Polity             | 168160   .8924833   6.955011        -10         10
> > Corruptlvl         | 157820   5.441431   1.799652          0         10
> > RuleofLaw          | 157820   5.247434   2.204846          0         10
> > GovStab            | 157820   5.935454   2.064963          0         10
> > log of bilat. Aid  |  76079   1.919392   2.338255  -2.302585   9.692112
> > log of FDI in host |  32080   3.918487   2.928901  -2.372018   10.98025
> > Capital openness   | 155200  -.2888318   1.379179  -1.766966   2.602508
> > Polcon V           | 154320   .3490876   .3158385          0        .89
> > log of GDPcap_host | 154560    7.95649   1.053043   4.933741   10.48464
> > log of GDP_host    | 166480   29.62135    3.04193   22.97718   43.12974
> > log of GDP_home    | 147381    31.1144   2.128313   26.15253   37.36032
> >
> > If I don't set range priors, I get nonsensical values for most of
> > the variables: negative GDP (real GDP, not negative log values),
> > polity scores out of range, etc. I haven't even tried higher-order
> > polynomials or interactions with cross-sectional units, although I
> > would prefer to, given that FDI exhibits a clear trend. Breaking up
> > the dataset randomly into pieces by cross-sections doesn't improve
> > speed.
> >
> > It seems that I have to make tradeoffs. What do you think would be
> > the best thing to do, i.e., what is the most time-consuming issue
> > for the EM algorithm?
> > - Constrain/shorten the sample to have a higher proportion of
> >   observed values on lnAid and lnFDI?
> > - Accept imputations that are out of range (probably not)?
> > - Break up the dataset "vertically" into one with the Aid and one
> >   with the FDI variable, run two sets of imputations, and merge
> >   them again?
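On the out-of-range problem specifically, Amelia's `bounds` argument can restrict imputed draws to a range; a minimal sketch (the column indices, limits, panel identifiers, and data object here are assumptions for this dataset, not a recommendation on which tradeoff to make):

```r
# Hypothetical use of Amelia's bounds argument to keep imputed values
# in range, with a quadratic time trend for the panel.
library(Amelia)  # assumes the Amelia package is installed

# bounds is a 3-column matrix: column index, lower limit, upper limit.
b <- rbind(c(1, -10, 10),   # e.g. Polity in column 1
           c(2,   0, 10))   # e.g. Corruptlvl in column 2

a.out <- amelia(mydata, m = 5,
                ts = "year", cs = "country",  # hypothetical panel ids
                polytime = 2,                 # quadratic trends in time
                bounds = b)
```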
> >
> > Many thanks,
> >
> > Mark
> >
> > --
> > Mark S. Manger, PhD
> > Assistant Professor
> > Department of Political Science, McGill University
> > mark.manger(a)mcgill.ca
> >
> > On leave 2007-08:
> > Advanced Research Fellow, Program on US-Japan Relations
> > Weatherhead Center for International Affairs
> > Harvard University
> > 61 Kirkland Street, Room 301
> > Cambridge, MA 02138
> > 617-495-5998
> > -
> > Amelia mailing list served by Harvard-MIT Data Center
> > [Un]Subscribe/View Archive: http://lists.gking.harvard.edu/?info=amelia