R, and therefore Amelia, doesn't have parallel processing built in.
But because of the way it works, if you want m imputations, you can run
Amelia m times in separate jobs and it will finish roughly m times
faster. You do need to tell your parallelization software (condor in
this case) to run each job completely separately, so that the jobs
don't overwrite each other's imputations by using the same file names. Best,
Gary
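A minimal sketch of what such a per-job run might look like (the job index, input file, and output naming scheme here are assumptions for illustration, not part of Amelia itself):

```r
# Hypothetical per-job Amelia run: each condor job calls this script
# with a distinct job index so output files never collide.
library(Amelia)  # assumes the Amelia package is installed

args <- commandArgs(trailingOnly = TRUE)
job  <- as.integer(args[1])             # e.g. 1..m, one per condor job

mydata <- read.csv("mydata.csv")        # hypothetical input file
a.out  <- amelia(mydata, m = 1)         # one imputation per job

# Job-specific file name prevents jobs from overwriting each other.
write.csv(a.out$imputations[[1]],
          file = paste0("outdata", job, ".csv"),
          row.names = FALSE)
```

Running m such jobs, each with m = 1, yields the same m imputed datasets as one serial run with m imputations, but in roughly 1/m the wall-clock time.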
On Sat, 15 Mar 2008, Anders Schwartz Corr wrote:
Hi,
Thanks for the translation. When a parallel environment is running, say,
10 runs (the condor_submit default) of Amelia simultaneously, Amelia
saves the datasets as outdata1.csv, outdata2.csv, etc., to the single
directory from which the condor (the computing cluster program) job was
submitted. Is it possible that Amelia is overwriting imputed datasets
that one of the other 10 Amelia runs has already created? If so, maybe
condor could create a separate directory for each of the 10 Amelia runs,
and/or Amelia could check the directory it is saving into to make sure
it isn't overwriting any already-existing files.
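The per-run-directory idea can be expressed on the condor side; a hypothetical HTCondor submit file might look like this (the script name and directory layout are assumptions, and the run0..run9 directories must exist before submission):

```
# Hypothetical HTCondor submit file: queue 10 runs, each started in its
# own working directory so output files cannot collide.
universe    = vanilla
executable  = /usr/bin/Rscript
arguments   = run_amelia.R $(Process)
initialdir  = run$(Process)
output      = amelia.out
error       = amelia.err
log         = ../amelia.log
queue 10
```

The `$(Process)` macro expands to 0..9, so each job gets its own `initialdir` and its own stdout/stderr files.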
Also, if I set Amelia to create 5 imputed datasets, and set a
random-number-generating seed so I can replicate the results exactly,
and then set condor to run my single Amelia file 10 times, does that
mean that condor will create 10 identical sets of 5 imputed datasets?
If so, only one of the 10 sets would be useful.
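One way around this (a sketch, assuming each condor job can pass its `$(Process)` number to R as a command-line argument) is to derive the seed from the job index, so every job is reproducible but no two jobs are identical:

```r
# Hypothetical seed scheme: reproducible across reruns, but distinct
# across parallel jobs.
args <- commandArgs(trailingOnly = TRUE)
job  <- as.integer(args[1])   # condor's $(Process): 0, 1, 2, ...

base.seed <- 20080315         # fixed base seed chosen by the user
set.seed(base.seed + job)     # job 0 and job 1 now use different streams
```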
Alternatively, I suppose the Amelia programmers may have intended users
to simply create as many directories and different Amelia.R files (in my
case 50) as needed, set a different random seed in each of the 50 files,
and run the condor submit program once for each. Fifty
simultaneous Amelia runs may sound excessive, but for a relatively large
dataset (mine is 17,000x100) it is almost necessary in order to get 10
or 20 imputed datasets back within the first month of the run.
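Setting up 50 such directories by hand is tedious; a small shell sketch can generate them (the directory names, seed scheme, and `amelia.sub` submit file are hypothetical; the assumed convention is that each run's R script reads its seed from `seed.txt`):

```shell
#!/bin/sh
# Create one working directory per run, each with its own seed, so 50
# independent Amelia jobs can be submitted without output collisions.
for i in $(seq 1 50); do
  mkdir -p "run$i"
  echo $((20080315 + i)) > "run$i/seed.txt"   # distinct seed per run
done

# Submit each run separately, only if condor_submit exists on this host.
if command -v condor_submit >/dev/null 2>&1; then
  for i in $(seq 1 50); do
    (cd "run$i" && condor_submit ../amelia.sub)
  done
fi
```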
Again, thanks for all the excellent work by the people at HMDC.
Anders
On Fri, 14 Mar 2008, Gary King wrote:
A translation: condor distributes jobs to a 200-node research computing
cluster we have built at the Harvard-MIT Data Center, which is part of
the Institute for Quantitative Social Science at Harvard. Unfortunately,
that system isn't open to the public (although we are working on an
open-source version of the software we produced to create the cluster).
Anders' point, more generally, is that Amelia can be used well in a
parallel environment.
Gary
On Fri, 14 Mar 2008, Anders Schwartz Corr wrote:
> Mark, Are you running Amelia on hmdc condor? I found running 50
> separate Amelia programs simultaneously on hmdc relatively quick and
> easy -- the first 5 imputed datasets were returned within two or three
> weeks, and then after that I got about 1 more condor egg ;) per day. I
> could start my data analysis with the first five datasets, and then
> additional datasets were added as they came in.
>
> Anders
>
> Note to the powers that be: it would be useful to set condor to give
> streaming output. It is a little difficult to know what is going on
> with Amelia (can't take advantage of the verbose function) when this
> standard condor function isn't set. Thank you for such a great
> service, hmdc!
> On Fri, 14 Mar 2008, Mark Manger wrote:
> > Hi,
> >
> > I apologize in advance for the lengthy question, but it's
> > representative of many issues I face when working with large panels
> > of economic data, so I would be extremely grateful for your
> > suggestions, best practices, experiences, etc.
> >
> > I'm wondering what I could do to speed up the imputation of my
> > rather large dataset (a panel of N 2120 x T 80 = 169600 obs). At
> > this pace, my imputations would run for months. Memory is not the
> > issue; rather, I think that I have too many priors and/or too many
> > missings on certain variables. See below, especially lnAid and
> > lnFDI. Note that the missings are concentrated at certain T points
> > (early time points) rather than in specific cross-sectional units.
> > Variable           |    Obs       Mean  Std. Dev.        Min        Max
> > -------------------|----------------------------------------------------
> > Polity             | 168160   .8924833   6.955011        -10         10
> > Corruptlvl         | 157820   5.441431   1.799652          0         10
> > RuleofLaw          | 157820   5.247434   2.204846          0         10
> > GovStab            | 157820   5.935454   2.064963          0         10
> > log of bilat. Aid  |  76079   1.919392   2.338255  -2.302585   9.692112
> > log of FDI in host |  32080   3.918487   2.928901  -2.372018   10.98025
> > Capital openness   | 155200  -.2888318   1.379179  -1.766966   2.602508
> > Polcon V           | 154320   .3490876   .3158385          0        .89
> > log of GDPcap_host | 154560    7.95649   1.053043   4.933741   10.48464
> > log of GDP_host    | 166480   29.62135    3.04193   22.97718   43.12974
> > log of GDP_home    | 147381    31.1144   2.128313   26.15253   37.36032
> >
> > If I don't set range priors, I get nonsensical values for most of
> > the variables: negative GDP (real GDP, not negative log values),
> > polity scores out of range, etc. I haven't even tried higher-order
> > polynomials or interactions with cross-sectional units, although I
> > would prefer to, given that FDI exhibits a clear trend. Breaking up
> > the dataset randomly into pieces by cross-sections doesn't improve
> > speed.
> >
> > It seems that I have to make tradeoffs. What do you think would be
> > the best thing to do, i.e., what is the most time-consuming issue
> > for the EM algorithm?
> > - Constrain/shorten the sample to have a higher proportion of
> >   observed values on lnAid and lnFDI?
> > - Accept imputations that are out of range (probably not)?
> > - Break up the dataset "vertically" into one with the Aid and one
> >   with the FDI variable, run two sets of imputations, and merge
> >   them again?
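On the out-of-range problem specifically, Amelia's `bounds` argument can restrict imputed draws to a range; a minimal sketch (the column indices, limits, panel identifiers, and data object here are assumptions for this dataset, not a recommendation on which tradeoff to make):

```r
# Hypothetical use of Amelia's bounds argument to keep imputed values
# in range, with a quadratic time trend for the panel.
library(Amelia)  # assumes the Amelia package is installed

# bounds is a 3-column matrix: column index, lower limit, upper limit.
b <- rbind(c(1, -10, 10),   # e.g. Polity in column 1
           c(2,   0, 10))   # e.g. Corruptlvl in column 2

a.out <- amelia(mydata, m = 5,
                ts = "year", cs = "country",  # hypothetical panel ids
                polytime = 2,                 # quadratic trends in time
                bounds = b)
```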
> >
> > Many thanks,
> >
> > Mark
> >
> > --
> > Mark S. Manger, PhD
> > Assistant Professor
> > Department of Political Science, McGill University
> > mark.manger(a)mcgill.ca
> >
> > On leave 2007-08:
> > Advanced Research Fellow, Program on US-Japan Relations
> > Weatherhead Center for International Affairs
> > Harvard University
> > 61 Kirkland Street, Room 301
> > Cambridge, MA 02138
> > 617-495-5998
> > -
> > Amelia mailing list served by Harvard-MIT Data Center
> > [Un]Subscribe/View Archive: http://lists.gking.harvard.edu/?info=amelia