R, and therefore Amelia, doesn't have parallel processing built in. But given
the way it works, if you want m imputations you can run Amelia m times in
separate jobs, each producing one imputation, and the full set will finish
roughly m times faster. You do need to tell your parallelization software
(condor in this case) to run each job completely separately, with distinct
output file names, so the jobs don't overwrite one another's imputations.
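A minimal sketch of what each job's R script could look like, assuming the
Amelia II functions amelia() and write.amelia() and a job index passed on the
command line; the seed offset, file stem, and cs/ts column names here are just
placeholders, not anything Amelia requires:

```r
## one_imputation.R -- run as: Rscript one_imputation.R <job-index>
library(Amelia)

## job index supplied by the scheduler (e.g. condor's $(Process))
idx <- as.integer(commandArgs(trailingOnly = TRUE)[1])

## a distinct seed per job, so the m imputations differ
set.seed(1000 + idx)

mydata <- read.csv("mydata.csv")

## one imputation per job (m = 1); cs/ts names are placeholders
a.out <- amelia(mydata, m = 1, cs = "country", ts = "year")

## a distinct file stem per job, so jobs never overwrite each other
write.amelia(a.out, file.stem = paste0("outdata_job", idx, "_"),
             format = "csv")
```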
Best,
Gary
On Sat, 15 Mar 2008, Anders Schwartz Corr wrote:
Hi,
Thanks for the translation. When a parallel environment is running, say, 10
runs of Amelia simultaneously (the condor_submit default), Amelia saves the
datasets as outdata1.csv, outdata2.csv, etc., into the single directory from
which the condor (the computing-cluster scheduler) job was submitted. Is it
possible that one Amelia run is overwriting imputed datasets that another of
the 10 runs has already created? If so, perhaps condor could create a separate
directory for each of the 10 Amelia runs, and/or Amelia could check the
directory it is saving into to make sure it isn't overwriting any
already-existing files.
Also, if I set Amelia to create 5 imputed datasets and set a random-number
seed so I can replicate the results exactly, and then set condor to run my
single Amelia file 10 times, does that mean condor will create 10 identical
sets of 5 imputed datasets? If so, only one of the 10 sets would be useful.
Alternatively, I suppose the Amelia programmers may have intended users to
simply create as many directories and separate Amelia.R files (in my case 50)
as needed, and run the condor submit program once for each of the 50 files,
with a different random seed set in each. Fifty simultaneous Amelia runs may
sound excessive, but for a relatively large dataset (mine is 17,000 x 100) it
is almost necessary in order to get 10 or 20 imputed datasets back within the
first month of the run.
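For what it's worth, a single condor submit file can stamp out those runs
without 50 hand-edited copies, using $(Process) to give each job its own
directory and seed argument. This is only a sketch of standard condor submit
syntax with made-up file names; it assumes the run0 through run49 directories
already exist and that each can see the data and an R script (here called
one_imputation.R) that takes the job index as its argument:

```
# amelia.sub -- submit once with: condor_submit amelia.sub
universe   = vanilla
executable = /usr/bin/Rscript
arguments  = one_imputation.R $(Process)
initialdir = run$(Process)
output     = amelia.out
error      = amelia.err
log        = amelia.log
queue 50
```

Because initialdir changes per job, each run's outdata*.csv files land in
their own directory and nothing gets overwritten.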
Again, thanks for all the excellent work by the people at HMDC.
Anders
On Fri, 14 Mar 2008, Gary King wrote:
A translation: condor distributes jobs to a 200-node research computing
cluster we have built at the Harvard-MIT Data Center, which is part of the
Institute for Quantitative Social Science at Harvard. Unfortunately, that
system isn't open to the public (although we are working on an open-source
version of the software we produced to create the cluster). Anders' point,
more generally, is that Amelia can be used well in a parallel environment.
Gary
On Fri, 14 Mar 2008, Anders Schwartz Corr wrote:
Mark, are you running Amelia on hmdc condor? I found running 50 separate
Amelia programs simultaneously on hmdc relatively quick and easy -- the
first 5 imputed datasets were returned within two or three weeks, and
after that I got about one more condor egg ;) per day. I could start my
data analysis with the first five datasets, and additional datasets were
added as they came in.
Anders
Note to the powers that be: it would be useful to set condor to give
streaming output. It is a little difficult to know what is going on with
Amelia (you can't take advantage of the verbose option) when this standard
condor feature isn't enabled. Thank you for such a great service, hmdc!
On Fri, 14 Mar 2008, Mark Manger wrote:
Hi,
I apologize in advance for the lengthy question, but it's representative
of many issues I face when working with large panels of economic data, so
I would be extremely grateful for your suggestions, best practices,
experiences, etc.
I'm wondering what I could do to speed up the imputation of my rather
large dataset (a panel of N = 2120 x T = 80 = 169,600 obs). At this pace,
my imputations would run for months. Memory is not the issue; rather, I
think that I have too many priors and/or too many missing values on
certain variables. See below, especially lnAid and lnFDI. Note that the
missings are concentrated at certain time points (early in the series)
rather than in specific cross-sectional units.
Variable            |    Obs       Mean   Std. Dev.        Min        Max
--------------------+---------------------------------------------------
Polity              | 168160   .8924833    6.955011        -10         10
Corruptlvl          | 157820   5.441431    1.799652          0         10
RuleofLaw           | 157820   5.247434    2.204846          0         10
GovStab             | 157820   5.935454    2.064963          0         10
log of bilat. Aid   |  76079   1.919392    2.338255  -2.302585   9.692112
log of FDI in host  |  32080   3.918487    2.928901  -2.372018   10.98025
Capital openness    | 155200  -.2888318    1.379179  -1.766966   2.602508
Polcon V            | 154320   .3490876    .3158385          0        .89
log of GDPcap_host  | 154560    7.95649    1.053043   4.933741   10.48464
log of GDP_host     | 166480   29.62135     3.04193   22.97718   43.12974
log of GDP_home     | 147381    31.1144    2.128313   26.15253   37.36032
If I don't set range priors, I get nonsensical values for most of the
variables: negative GDP (real GDP, not negative log values), polity scores
out of range, etc. I haven't even tried higher-order polynomials or
interactions with cross-sectional units, although I would prefer to, given
that FDI exhibits a clear trend. Breaking up the dataset randomly into
pieces by cross-sections doesn't improve speed.
It seems that I have to make tradeoffs. What do you think would be the
best thing to do, i.e., what is the most time-consuming issue for the EM
algorithm?
Constrain/shorten the sample to have a higher proportion of observed
values on lnAid and lnFDI?
Accept imputations that are out of range (probably not)?
Break up the dataset "vertically" into one with the Aid variable and one
with the FDI variable, run two sets of imputations, and merge them again?
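Concretely, the mechanics of that last option would look something like this
(a sketch with made-up column names, keeping the panel identifiers and some
common covariates in both halves; whether this is statistically defensible is
exactly what I'm unsure about):

```r
## sketch of the "vertical" split -- column names are placeholders
library(Amelia)

mydata <- read.csv("mydata.csv")

keys   <- c("country", "year")               # panel identifiers
shared <- c(keys, "Polity", "lnGDPcap")      # covariates kept in both halves

aid <- mydata[, c(shared, "lnAid")]
fdi <- mydata[, c(shared, "lnFDI")]

a.aid <- amelia(aid, m = 5, cs = "country", ts = "year")
a.fdi <- amelia(fdi, m = 5, cs = "country", ts = "year")

## recombine the i-th imputation from each half on the panel keys
merged <- lapply(1:5, function(i)
  merge(a.aid$imputations[[i]],
        a.fdi$imputations[[i]][, c(keys, "lnFDI")],
        by = keys))
```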
Many thanks,
Mark
--
Mark S. Manger, PhD
Assistant Professor
Department of Political Science, McGill University
mark.manger(a)mcgill.ca
on leave 2007-08:
Advanced Research Fellow, Program on US-Japan Relations
Weatherhead Center for International Affairs
Harvard University
61 Kirkland Street, Room 301
Cambridge, MA 02138
617-495-5998
-
Amelia mailing list served by Harvard-MIT Data Center
[Un]Subscribe/View Archive:
http://lists.gking.harvard.edu/?info=amelia