Hi,
Thanks for the translation. When a parallel environment is running, say, 10
instances of Amelia simultaneously (the condor_submit default), Amelia saves
the datasets as outdata1.csv, outdata2.csv, etc., in the single directory
from which the condor (the computing-cluster program) job was submitted. Is
it possible that one Amelia run is overwriting imputed datasets that another
of the 10 runs has already created? If so, maybe condor could create a
separate directory for each of the 10 Amelia runs, and/or Amelia could check
the directory it is saving into to make sure it isn't overwriting any
already-existing files.
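One way to guard against this on the Amelia side would be to pick a per-run
output stem before anything is written. A minimal sketch, assuming each run
can choose its own stem (the helper name and the file-naming pattern are
hypothetical, not part of Amelia's interface):

## Sketch: choose a per-run output stem so simultaneous runs never
## write to the same files. The function name and the naming pattern
## are assumptions for illustration only.
pick.unique.stem <- function(base = "outdata") {
  stem <- paste(base, Sys.getpid(), sep = "_")  # process ID keeps runs apart
  existing <- list.files(pattern = paste("^", stem, sep = ""))
  if (length(existing) > 0)                     # refuse to overwrite
    stop("files for stem '", stem, "' already exist")
  stem
}
stem <- pick.unique.stem()  # e.g. "outdata_12345"; use this in the write step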
Also, if I set Amelia to create 5 imputed datasets, set a random-number
seed so I can replicate the results exactly, and then set condor to run my
single Amelia file 10 times, does that mean that condor will create 10
identical sets of 5 imputed datasets? If so, only one of the 10 sets would
be useful.
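For what it's worth, one way around the identical-seeds problem is to derive
each run's seed from its condor process number. A sketch only; it assumes
the submit file passes $(Process) (0 through 9) as the first command-line
argument to R, which is wiring the user would have to add:

## Sketch: a distinct but replicable seed for each of the 10 runs.
## Assumes $(Process) arrives as the first command-line argument;
## that wiring is an assumption, not something Amelia provides.
args <- commandArgs(trailingOnly = TRUE)
run.id <- as.integer(args[1])    # 0, 1, ..., 9
base.seed <- 20080314            # fixed base seed, so results replicate
set.seed(base.seed + run.id)     # each run draws a different stream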
Alternatively, I suppose the Amelia programmers may have intended users to
simply create as many directories and separate Amelia.R files as needed (in
my case 50), and run the condor submit program once for each of the 50
random seeds set in those files. Fifty simultaneous Amelia runs may sound
excessive, but for a relatively large dataset (mine is 17,000 x 100) it is
almost necessary in order to get 10 or 20 imputed datasets back within the
first month of the run.
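If that is the intended workflow, the directory-and-seed setup at least can
be automated. A sketch, with hypothetical directory names and seed-file
convention:

## Sketch: create 50 run directories, each recording its own seed.
## The "runNN" names and seed.txt convention are illustrative only.
base.seed <- 20080314
for (i in 1:50) {
  d <- sprintf("run%02d", i)
  if (!file.exists(d)) dir.create(d)
  writeLines(as.character(base.seed + i), file.path(d, "seed.txt"))
}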
Again, thanks for all the excellent work by the people at HMDC.
Anders
On Fri, 14 Mar 2008, Gary King wrote:
a translation: condor distributes jobs to a 200-node research computing
cluster we have built at the Harvard-MIT Data Center, which is part of the
Institute for Quantitative Social Science at Harvard. unfortunately, that
system isn't open to the public (although we are working on an open-source
version of the software we produced to create the cluster). Anders' point,
more generally, is that Amelia can be used well in a parallel environment.
Gary
On Fri, 14 Mar 2008, Anders Schwartz Corr wrote:
Mark, are you running Amelia on hmdc condor? I found running 50 separate
Amelia programs simultaneously on hmdc relatively quick and easy -- the
first 5 imputed datasets were returned within two or three weeks, and after
that I got about 1 more condor egg ;) per day. I could start my data
analysis with the first five datasets, and additional datasets were added
as they came in. Anders
Note to the powers that be: it would be useful to set condor to give
streaming output. It is a little difficult to know what is going on with
Amelia (one can't take advantage of its verbose output) when this standard
condor feature isn't enabled. Thank you for such a great service, hmdc!
On Fri, 14 Mar 2008, Mark Manger wrote:
Hi,
I apologize in advance for the lengthy question, but it's representative
of many issues I face when working with large panels of economic data, so
I would be extremely grateful for your suggestions, best practices,
experiences etc.
I'm wondering what I could do to speed up the imputation of my rather
large dataset (a panel of N = 2120 x T = 80 = 169,600 obs). At this pace,
my imputations would run for months. Memory is not the issue; rather, I
think I have too many priors and/or too many missing values on certain
variables. See below, especially lnAid and lnFDI. Note that the missing
values are concentrated at certain time points (early in the series)
rather than in specific cross-sectional units.
Variable             Obs      Mean        Std. Dev.   Min         Max
Polity               168160    .8924833   6.955011    -10         10
Corruptlvl           157820   5.441431    1.799652      0         10
RuleofLaw            157820   5.247434    2.204846      0         10
GovStab              157820   5.935454    2.064963      0         10
log of bilat. Aid     76079   1.919392    2.338255    -2.302585    9.692112
log of FDI in host    32080   3.918487    2.928901    -2.372018   10.98025
Capital openness     155200   -.2888318   1.379179    -1.766966    2.602508
Polcon V             154320    .3490876    .3158385     0           .89
log of GDPcap_host   154560   7.95649     1.053043     4.933741   10.48464
log of GDP_host      166480  29.62135     3.04193     22.97718    43.12974
log of GDP_home      147381  31.1144      2.128313    26.15253    37.36032
If I don't set range priors, I get nonsensical values for most of the
variables: negative GDP (real GDP, not negative log values), polity scores
out of range, etc. I haven't even tried higher-order polynomials or
interactions with cross-sectional units, although I would prefer to, given
that FDI exhibits a clear trend. Breaking up the dataset randomly into
pieces by cross-sections doesn't improve speed.
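For reference, the kind of call I have in mind looks roughly like this. A
sketch only: "mydata", "year", and "country" are placeholder names, and the
bounds matrix assumes Amelia II's (column, lower, upper) format:

library(Amelia)

## Sketch with placeholder names and column numbers.
bds <- rbind(c(2, -10, 10),   # Polity stays in [-10, 10]
             c(5,   0, 10))   # GovStab stays in [0, 10]

a.out <- amelia(mydata, m = 5,
                ts = "year", cs = "country",  # panel identifiers
                polytime = 2,    # quadratic time trend
                intercs = TRUE,  # trend interacted with each unit
                bounds = bds)    # logical range restrictions

Note that intercs = TRUE adds a separate set of trend terms for every
cross-sectional unit, which with over 2,000 units is itself a heavy
computational load.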
It seems that I have to make trade-offs. What do you think would be the
best thing to do, i.e., what is the most time-consuming issue for the EM
algorithm?

1. Constrain/shorten the sample to have a higher proportion of observed
values on lnAid and lnFDI?
2. Accept imputations that are out of range (probably not)?
3. Break up the dataset "vertically" into one with the Aid and one with
the FDI variable, run two sets of imputations, and merge them again (see
the sketch after this list)?
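Concretely, option 3 might look like the following. A sketch with
hypothetical column names, assuming the interface where imputed data frames
are returned in a.out$imputations; the shared covariates go into both runs
so the two imputation models overlap:

## Sketch of the "vertical" split, with hypothetical column names.
common <- c("country", "year", "Polity", "Corruptlvl", "RuleofLaw",
            "GovStab", "lnGDPcap_host", "lnGDP_host", "lnGDP_home")

a.aid <- amelia(mydata[, c(common, "lnAid")], m = 5,
                ts = "year", cs = "country")
a.fdi <- amelia(mydata[, c(common, "lnFDI")], m = 5,
                ts = "year", cs = "country")

## Rows stay in the original order, so the i-th imputed datasets can
## be recombined column-wise:
imp1 <- cbind(a.aid$imputations[[1]],
              lnFDI = a.fdi$imputations[[1]]$lnFDI)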
Many thanks,
Mark
--
Mark S. Manger, PhD
Assistant Professor
Department of Political Science, McGill University
mark.manger@mcgill.ca
on leave 2007-08:
Advanced Research Fellow, Program on US-Japan Relations
Weatherhead Center for International Affairs
Harvard University
61 Kirkland Street, Room 301
Cambridge, MA 02138
617-495-5998
-
Amelia mailing list served by Harvard-MIT Data Center
[Un]Subscribe/View Archive:
http://lists.gking.harvard.edu/?info=amelia