Hi,
Thanks for the translation. When a parallel environment is running,
say, 10 runs of Amelia simultaneously (the condor_submit default),
Amelia saves the datasets as outdata1.csv, outdata2.csv, etc., into the
single directory from which the condor (the computing cluster program)
job was submitted. Is it possible that Amelia is overwriting other
imputed datasets that one of the other 10 Amelia runs has already
created? If so, maybe condor could create a separate directory for
each of the 10 Amelia runs, and/or Amelia could check the directory
it is saving into to make sure it isn't overwriting any
already-existing files.
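For what it's worth, here is a minimal sketch of the overwrite guard I
have in mind, assuming the Amelia II interface in R, where the completed
data frames sit in the $imputations list; the run.id tag and the file
names are my own invention, not anything Amelia does by default.

library(Amelia)

## "mydata" stands in for whatever data frame is being imputed.
run.id <- Sys.getpid()              # any tag unique to this condor run would do
a.out  <- amelia(mydata, m = 5)

for (i in seq_along(a.out$imputations)) {
  fname <- sprintf("outdata_run%d_%d.csv", run.id, i)
  if (file.exists(fname))           # refuse to clobber another run's output
    stop("refusing to overwrite ", fname)
  write.csv(a.out$imputations[[i]], fname, row.names = FALSE)
}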
Also, if I set Amelia to create 5 imputed datasets, and set a
random-number generating seed so I can replicate the results exactly,
and then set condor to run my single Amelia file 10 times, does that
mean that condor will create 10 identical sets of 5 imputed datasets?
But then only one of the 10 sets would be useful.
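One way around that, I think, is to give each run a different seed, for
example by having the condor submit file pass its $(Process) number
(0 through 9) to R as a command-line argument. The argument handling
below is my assumption, not something condor or Amelia sets up for you.

library(Amelia)

## hypothetical: the submit file contains a line like "arguments = $(Process)"
proc <- as.integer(commandArgs(trailingOnly = TRUE)[1])
set.seed(20080314 + proc)   # reproducible, yet different in each of the 10 runs
a.out <- amelia(mydata, m = 5)      # "mydata" is again a placeholder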
Alternatively, I suppose the Amelia programmers may have intended users
to simply create as many directories and separate Amelia.R files (in my
case 50) as needed, and run the condor submit program 50 times, once
for each of the 50 random seeds the user sets in the 50 files.
Fifty simultaneous Amelia runs may sound excessive, but for a
relatively large dataset (mine is 17,000x100) it is almost necessary
in order to get 10 or 20 imputed datasets back within the first month
of the run.
Again, thanks for all the excellent work by the people at HMDC.
Anders
On Fri, 14 Mar 2008, Gary King wrote:
A translation: condor distributes jobs to a 200-node research computing
cluster we have built at the Harvard-MIT Data Center, which is part of
the Institute for Quantitative Social Science at Harvard. Unfortunately,
that system isn't open to the public (although we are working on an
open-source version of the software we produced to create the cluster).
Anders' point, more generally, is that Amelia can be used well in a
parallel environment.
Gary
On Fri, 14 Mar 2008, Anders Schwartz Corr wrote:
Mark, are you running Amelia on hmdc condor? I found running 50
separate Amelia programs simultaneously on hmdc relatively quick and
easy -- the first 5 imputed datasets were returned within two or three
weeks, and then after that I got about 1 more condor egg ;) per day. I
could start my data analysis with the first five datasets, and then
additional datasets were added as they came in.
Anders

Note to the powers that be: it would be useful to set condor to give
streaming output. It is a little difficult to know what is going on
with Amelia (you can't take advantage of the verbose function) when
this standard condor feature isn't set. Thank you for such a great
service, hmdc!
On Fri, 14 Mar 2008, Mark Manger wrote:
Hi,

I apologize in advance for the lengthy question, but it's
representative of many issues I face when working with large panels of
economic data, so I would be extremely grateful for your suggestions,
best practices, experiences, etc.

I'm wondering what I could do to speed up the imputation of my rather
large dataset (a panel of N 2120 x T 80 = 169,600 obs). At this pace,
my imputations would run for months. Memory is not the issue; rather, I
think that I have too many priors and/or too many missings on certain
variables. See below, especially lnAid and lnFDI. Note that the
missings are concentrated at certain T points (early time points)
rather than in specific cross-sectional units.
Variable           |    Obs       Mean   Std. Dev.        Min        Max
-------------------+-----------------------------------------------------
Polity             | 168160   .8924833    6.955011        -10         10
Corruptlvl         | 157820   5.441431    1.799652          0         10
RuleofLaw          | 157820   5.247434    2.204846          0         10
GovStab            | 157820   5.935454    2.064963          0         10
log of bilat. Aid  |  76079   1.919392    2.338255  -2.302585   9.692112
log of FDI in host |  32080   3.918487    2.928901  -2.372018   10.98025
Capital openness   | 155200  -.2888318    1.379179  -1.766966   2.602508
Polcon V           | 154320   .3490876    .3158385          0        .89
log of GDPcap_host | 154560    7.95649    1.053043   4.933741   10.48464
log of GDP_host    | 166480   29.62135     3.04193   22.97718   43.12974
log of GDP_home    | 147381    31.1144    2.128313   26.15253   37.36032
If I don't set range priors, I get nonsensical values for most of the
variables: negative GDP (real GDP, not negative log values), polity
scores out of range, etc. I haven't even tried higher-order polynomials
or interactions with cross-sectional units, although I would prefer to,
given that FDI exhibits a clear trend. Breaking up the dataset randomly
into pieces by cross-sections doesn't improve speed.
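For concreteness, here is roughly how I understand those options are
spelled in recent versions of Amelia II; the column numbers in the
bounds matrix and the variable names are only guesses at my data's
layout, and intercs would of course be expensive with 2,120 units.

library(Amelia)

## logical bounds on the imputations: each row is
## c(column number, lower bound, upper bound) -- positions are illustrative
b <- rbind(c(3, -10, 10),    # Polity
           c(4,   0, 10),    # Corruptlvl
           c(5,   0, 10))    # RuleofLaw

a.out <- amelia(paneldata, m = 5,
                ts = "year", cs = "country",
                polytime = 2,      # quadratic time trend, for the drift in FDI
                intercs = TRUE,    # trend interacted with each cross-section (costly)
                bounds = b,
                p2s = 2)           # verbose EM output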
It seems that I have to make tradeoffs. What do you think would be the
best thing to do, i.e. what is the most time-consuming issue for the EM
algorithm?
Constrain/shorten the sample to have a higher proportion of observed
values on lnAid and lnFDI?
Accept imputations that are out of range (probably not)?
Break up the dataset "vertically" into one with the Aid and one with
the FDI variable, run two sets of imputations, and merge it again
(roughly as sketched below)?
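A rough sketch of what I mean by the vertical split, assuming the two
halves share the id/time columns and the mostly observed covariates;
the variable names are placeholders, not my actual ones.

library(Amelia)

shared <- c("country", "year", "Polity", "lnGDPcap_host")  # placeholder names
half.a <- paneldata[, c(shared, "lnAid")]
half.b <- paneldata[, c(shared, "lnFDI")]

imp.a <- amelia(half.a, m = 5, ts = "year", cs = "country")
imp.b <- amelia(half.b, m = 5, ts = "year", cs = "country")

## merge the i-th completed dataset from each half back on the id/time keys
merged1 <- merge(imp.a$imputations[[1]],
                 imp.b$imputations[[1]][, c("country", "year", "lnFDI")],
                 by = c("country", "year"))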
Many thanks,

Mark

--
Mark S. Manger, PhD
Assistant Professor
Department of Political Science, McGill University
mark.manger(a)mcgill.ca

on leave 2007-08:
Advanced Research Fellow, Program on US-Japan Relations
Weatherhead Center for International Affairs
Harvard University
61 Kirkland Street, Room 301
Cambridge, MA 02138
617-495-5998
-
Amelia mailing list served by Harvard-MIT Data Center
[Un]Subscribe/View Archive:
http://lists.gking.harvard.edu/?info=amelia