Hi Nicole,
Apologies for the delay. I think your best bet here is to experiment with
the "empri" argument. Setting it to N is roughly equivalent to adding N
pseudo-observations to the dataset in which each variable is centered at
the overall observed mean of that variable and all of the variables are
independent of one another. The "empri" argument therefore pulls the
imputation model slightly toward one in which the variables are mutually
independent, and this small amount of smoothing can help quite a bit with
numerical problems.
You might try setting this to about 1% of the number of rows in the data
to see if that stabilizes things. If not, try increasing it to about 5%,
but if things still are not working at that point, you should look at your
data for pairs of variables that are extremely highly correlated, or even
linearly dependent, and that shouldn't both be in the imputation model.
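For concreteness, here is a rough sketch of what I have in mind, based on
the call in your script below (the object name on the left is just a
placeholder, and the 1% value is only a starting point):

# Empirical/ridge prior at roughly 1% of the rows; try 0.05 * nrow(t)
# if the EM run is still unstable.
amelia.out.empri <- amelia(t, m = nmi, ts = "year", cs = "country",
                           intercs = TRUE, polytime = 1, logs = vlogs,
                           empri = 0.01 * nrow(t))
# If runs are still too slow, the "tolerance" argument from the advice
# you quoted can also be loosened (e.g. tolerance = 0.001).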
Hope that helps!
Cheers,
Matt
~~~~~~~~~~~
Matthew Blackwell
Assistant Professor of Government
Harvard University
On Wed, May 6, 2015 at 5:26 AM Nicole Janz <nicolejanz(a)gmail.com> wrote:
Dear all,
I am trying to impute a data set containing panel data (country-years) on
socio-economic indicators for developing countries. I am running into a
one-week time limit on our cluster service and am looking for options to
speed things up.
The overall missingness in the dataset is 40%, and some variables have more
than 60% missing (see the per-variable missingness for one of my data sets
below).
Here are my amelia specifications (using Amelia 1.7.2 and R 3.1.3):
set.seed(1001)
nmi <- 1
amelia.out.tiny2015.04.16_1001 <- amelia(t, m = nmi, ts = "year",
  cs = "country", intercs = TRUE, polytime = 1, logs = vlogs)
I created about 150 single jobs (m=1) with different sets of variables
included in the imputation (e.g. leaving out a high-missingness variable in
one version, or leaving out years that have high missingness). I wanted to
run all 150 imputations on a cluster, but none of them converges before the
one-week cut-off.
I wonder if I can do anything else other than reducing missingness in the
original data and submitting parallel jobs to a cluster service.
I did think about requesting more memory for each job, but I am not sure
that would help, since the automatic memory-allocation script already tends
to assign as much memory as is needed.
I saw on this help list:
"The "empri" argument sets an empirical/ridge prior. A value of a half
to
1 percent of the sample size would be small, aid numerical stability, and
unlikely to noticably change results (unless you are using time series
cross sectional data, in which case you might use 1 percent of the sample
within any cross sectional unit).
The "tolerance" changes the point at which the EM algorithm is judged to
have converged, and setting that larger, (like .001, or even .005) is
probably quite safe. We were very conservative with this tolerance choice,
and should reexamine other options to set it dynamically."
Unfortunately I don’t quite understand these options, and I’m not sure how
to go about testing which of them speeds up Amelia, or how far I can push
them without messing things up.
If you have any tips for me on what my next steps should be, I’d be very
grateful!
Best,
Nicole
P.S. If anyone wants to see the data to help, please let me know and I’d be
happy to send them over.
####################### Missingness per variable #################
# variable nmiss n propmiss
# 29 Lag_confl 160 4640 0.03448276
# 27 Lag_population 171 4640 0.03685345
# 2 infmortality 176 4640 0.03793103
# 1 lifeexp 312 4640 0.06724138
# 30 Lag_UN_FDI_flow 565 4640 0.12176724
# 31 Lag_UN_FDI_flow_pgdp 565 4640 0.12176724
# 26 Lag_GDP_curr 586 4640 0.12629310
# 32 Lag_UN_FDI_stock 699 4640 0.15064655
# 33 Lag_UN_FDI_stock_pgdp 699 4640 0.15064655
# 28 Lag_trade 921 4640 0.19849138
# 3 CIRI_WECON 1071 4640 0.23081897
# 25 Lag_polity2 1278 4640 0.27543103
# 18 Lag_US_fdi_electrical 1772 4640 0.38189655
# 19 Lag_US_fdi_transport 1821 4640 0.39245690
# 17 Lag_US_fdi_machinery 1822 4640 0.39267241
# 4 CIRI_WOSOC 1879 4640 0.40495690
# 16 Lag_US_fdi_prim_fab_metal 1930 4640 0.41594828
# 15 Lag_US_fdi_chemical 1959 4640 0.42219828
# 14 Lag_US_fdi_food 1999 4640 0.43081897
# 20 Lag_US_fdi_whole_trade 2085 4640 0.44935345
# 13 Lag_US_fdi_total_manuf 2099 4640 0.45237069
# 11 Lag_US_fdi_total 2144 4640 0.46206897
# 22 Lag_US_fdi_finance_except 2154 4640 0.46422414
# 21 Lag_US_fdi_depository 2429 4640 0.52349138
# 23 Lag_US_fdi_other_ind 2604 4640 0.56120690
# 8 Core.Country.Right.to.Housing 3095 4640 0.66702586
# 7 Core.Country.Right.to.Health 3198 4640 0.68922414
# 6 Core.Country.Right.to.Education 3238 4640 0.69784483
# 9 Core.Country.Right.to.Food 3335 4640 0.71875000
# 24 Lag_US_fdi_mining 3416 4640 0.73620690
# 10 Core.Country.Right.to.Work 3449 4640 0.74331897
# 12 Lag_US_fdi_petrol 3659 4640 0.78857759
# 5 Core.Country.SERF.Index 3682 4640 0.79353448
####################### Amelia Specifications #################
setwd("/Users/Janz01/Dropbox/MI_SERGIO_2015_04_19/MI1/")
require(Amelia)
# load data
load("tiny2015.04.16.RData")
t <- tiny2015.04.16
# Declare variables NOT to log by Amelia
vlogs <- c(
"lifeexp","infmortality","Core.Country.SERF.Index","Core.Country.Right.to.Education",
"Core.Country.Right.to.Health","Core.Country.Right.to.Housing","Core.Country.Right.to.Food","Core.Country.Right.to.Work",
"Lag_GDP_curr","Lag_population","Lag_trade",
"l_Lag_US_fdi_total","l_Lag_US_fdi_petrol","l_Lag_US_fdi_total_manuf","l_Lag_US_fdi_food",
"l_Lag_US_fdi_chemical","l_Lag_US_fdi_prim_fab_metal","l_Lag_US_fdi_machinery","l_Lag_US_fdi_electrical",
"l_Lag_US_fdi_transport","l_Lag_US_fdi_whole_trade","l_Lag_US_fdi_depository","l_Lag_US_fdi_finance_except",
"l_Lag_US_fdi_other_ind","l_Lag_US_fdi_mining","l_Lag_UN_FDI_flow","l_Lag_UN_FDI_flow_pgdp",
"l_Lag_UN_FDI_stock","l_Lag_UN_FDI_stock_pgdp")
set.seed(1001)
nmi <- 1
amelia.out.tiny2015.04.16_1001 <- amelia(t, m = nmi, ts = "year",
  cs = "country", intercs = TRUE, polytime = 1, logs = vlogs)
# save file
save(amelia.out.tiny2015.04.16_1001,
file="amelia.out.tiny2015.04.16_1001.Rdata")
--
Dr Nicole Janz
Research Methods Associate
University of Cambridge
Social Sciences Research Methods Centre
Email (personal): nj248(a)cam.ac.uk
Website:
www.nicolejanz.de
Skype: nicole.janz
Twitter: @polscireplicate
Blog:
http://politicalsciencereplication.wordpress.com/
--
Amelia mailing list served by HUIT
[Un]Subscribe/View Archive:
http://lists.gking.harvard.edu/?info=amelia
More info about Amelia:
http://gking.harvard.edu/amelia