Dear all,
I am trying to impute a data set containing panel data (country-years) on socio-economic
indicators for developing countries. I am running into a one-week runtime limit on our
cluster service and am looking for options to speed things up.
The overall missingness in the data set is 40%, and some variables have more than 60%
missing (see the per-variable missingness below for one of my data sets).
Here are my amelia specifications (Amelia 1.7.2, R 3.1.3):
set.seed(1001)
nmi <- 1
amelia.out.tiny2015.04.16_1001 <- amelia(t, m = nmi, ts = "year",
  cs = "country", intercs = TRUE, polytime = 1, logs = vlogs)
I created about 150 single-imputation jobs (m=1), each with a different set of variables
included in the imputation (e.g. leaving out one high-missingness variable, or leaving
out years with high missingness). I wanted to run all 150 imputations on a cluster, but
none of them converges before the one-week cut-off.
I wonder whether I can do anything other than reducing missingness in the original data
and submitting parallel jobs to the cluster service.
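One alternative I considered to 150 separate m=1 jobs is Amelia's own parallel option,
which (if I understand the documentation correctly) runs the m imputations across cores,
though it would not speed up any single EM run:

```r
## Sketch (assumes Amelia >= 1.7, which added built-in parallel support):
## one job producing 5 imputations across 5 cores on a single node,
## with t and vlogs defined as in my script below.
amelia.out.par <- amelia(t, m = 5, ts = "year", cs = "country",
                         intercs = TRUE, polytime = 1, logs = vlogs,
                         parallel = "multicore", ncpus = 5)
```

Since each EM chain still has to converge on its own, this would only help if a single
imputation finishes within the one-week limit, which is exactly my problem.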
I did think about requesting more memory for each job, but I doubt this would help: the
automatic memory-allocation script is already good at assigning as much memory as needed,
and the bottleneck seems to be EM convergence time rather than memory.
I saw on this help list:
"The "empri" argument sets an empirical/ridge prior. A value of a half to
1 percent of the sample size would be small, aid numerical stability, and unlikely to
noticeably change results (unless you are using time series cross sectional data, in which
case you might use 1 percent of the sample within any cross sectional unit).
The "tolerance" changes the point at which the EM algorithm is judged to have
converged, and setting that larger, (like .001, or even .005) is probably quite safe. We
were very conservative with this tolerance choice, and should reexamine other options to
set it dynamically."
Unfortunately I don’t quite understand these options, and I’m not sure how to test which
of them actually speeds up Amelia, or how far I can push them without compromising the
imputations.
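If I understand these suggestions correctly, they would translate into my call roughly
like this (with n = 4640 rows, 1% of the sample is about 46; both values below are
guesses to be checked, not settings I know to be safe):

```r
## Sketch: add a small ridge prior and loosen the EM convergence criterion.
## empri of ~0.5-1% of n aids numerical stability; tolerance defaults to 1e-4.
amelia.out.fast <- amelia(t, m = 1, ts = "year", cs = "country",
                          intercs = TRUE, polytime = 1, logs = vlogs,
                          empri = 0.01 * nrow(t),  # ~1% of the sample size
                          tolerance = 0.001)       # coarser than the default
```

Would it be sensible to first raise tolerance alone, and only add empri if chains still
fail to converge?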
If you have any tips for me on what my next steps should be I’d be very grateful!
Best,
Nicole
P.S. If anyone wants to see the data to help please let me know and I’d be happy to send
them over.
####################### Missingness per variable #################
# variable nmiss n propmiss
# 29 Lag_confl 160 4640 0.03448276
# 27 Lag_population 171 4640 0.03685345
# 2 infmortality 176 4640 0.03793103
# 1 lifeexp 312 4640 0.06724138
# 30 Lag_UN_FDI_flow 565 4640 0.12176724
# 31 Lag_UN_FDI_flow_pgdp 565 4640 0.12176724
# 26 Lag_GDP_curr 586 4640 0.12629310
# 32 Lag_UN_FDI_stock 699 4640 0.15064655
# 33 Lag_UN_FDI_stock_pgdp 699 4640 0.15064655
# 28 Lag_trade 921 4640 0.19849138
# 3 CIRI_WECON 1071 4640 0.23081897
# 25 Lag_polity2 1278 4640 0.27543103
# 18 Lag_US_fdi_electrical 1772 4640 0.38189655
# 19 Lag_US_fdi_transport 1821 4640 0.39245690
# 17 Lag_US_fdi_machinery 1822 4640 0.39267241
# 4 CIRI_WOSOC 1879 4640 0.40495690
# 16 Lag_US_fdi_prim_fab_metal 1930 4640 0.41594828
# 15 Lag_US_fdi_chemical 1959 4640 0.42219828
# 14 Lag_US_fdi_food 1999 4640 0.43081897
# 20 Lag_US_fdi_whole_trade 2085 4640 0.44935345
# 13 Lag_US_fdi_total_manuf 2099 4640 0.45237069
# 11 Lag_US_fdi_total 2144 4640 0.46206897
# 22 Lag_US_fdi_finance_except 2154 4640 0.46422414
# 21 Lag_US_fdi_depository 2429 4640 0.52349138
# 23 Lag_US_fdi_other_ind 2604 4640 0.56120690
# 8 Core.Country.Right.to.Housing 3095 4640 0.66702586
# 7 Core.Country.Right.to.Health 3198 4640 0.68922414
# 6 Core.Country.Right.to.Education 3238 4640 0.69784483
# 9 Core.Country.Right.to.Food 3335 4640 0.71875000
# 24 Lag_US_fdi_mining 3416 4640 0.73620690
# 10 Core.Country.Right.to.Work 3449 4640 0.74331897
# 12 Lag_US_fdi_petrol 3659 4640 0.78857759
# 5 Core.Country.SERF.Index 3682 4640 0.79353448
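For reference, a table like the one above can be computed along these lines (toy data
frame used here purely for illustration):

```r
# Toy example: per-variable missingness counts and proportions
d <- data.frame(x = c(1, NA, 3, NA), y = c(NA, 2, 3, 4))
miss <- data.frame(variable = names(d),
                   nmiss    = colSums(is.na(d)),  # NAs per column
                   n        = nrow(d))            # total observations
miss$propmiss <- miss$nmiss / miss$n
miss[order(miss$propmiss), ]                      # least to most missing
```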
####################### Amelia Specifications #################
setwd("/Users/Janz01/Dropbox/MI_SERGIO_2015_04_19/MI1/")
require(Amelia)
# load data
load("tiny2015.04.16.RData")
t <- tiny2015.04.16  # note: this masks base::t() for the session
# Variables to log-transform via Amelia's "logs" argument
vlogs <- c(
"lifeexp","infmortality","Core.Country.SERF.Index","Core.Country.Right.to.Education",
"Core.Country.Right.to.Health","Core.Country.Right.to.Housing","Core.Country.Right.to.Food","Core.Country.Right.to.Work",
"Lag_GDP_curr","Lag_population","Lag_trade",
"l_Lag_US_fdi_total","l_Lag_US_fdi_petrol","l_Lag_US_fdi_total_manuf","l_Lag_US_fdi_food",
"l_Lag_US_fdi_chemical","l_Lag_US_fdi_prim_fab_metal","l_Lag_US_fdi_machinery","l_Lag_US_fdi_electrical",
"l_Lag_US_fdi_transport","l_Lag_US_fdi_whole_trade","l_Lag_US_fdi_depository","l_Lag_US_fdi_finance_except",
"l_Lag_US_fdi_other_ind","l_Lag_US_fdi_mining","l_Lag_UN_FDI_flow","l_Lag_UN_FDI_flow_pgdp",
"l_Lag_UN_FDI_stock","l_Lag_UN_FDI_stock_pgdp")
set.seed(1001)
nmi <- 1
amelia.out.tiny2015.04.16_1001 <- amelia(t, m = nmi, ts = "year",
  cs = "country", intercs = TRUE, polytime = 1, logs = vlogs)
# save file
save(amelia.out.tiny2015.04.16_1001,
file="amelia.out.tiny2015.04.16_1001.Rdata")
--
Dr Nicole Janz
Research Methods Associate
University of Cambridge
Social Sciences Research Methods Centre
Email (personal): nj248(a)cam.ac.uk
Website:
www.nicolejanz.de
Skype: nicole.janz
Twitter: @polscireplicate
Blog:
http://politicalsciencereplication.wordpress.com/