Hi Nicole,
Apologies for the delay. I think your best bet here is to experiment with
the "empri" argument. Setting it to N is roughly equivalent to adding N
pseudo-observations to the dataset in which each variable is centered at
the overall observed mean of that variable and all of the variables are
independent of one another. The "empri" argument therefore pulls the
imputation model slightly toward one in which the variables are mutually
independent, and this small amount of smoothing can help quite a bit with
numerical problems.
You might try setting this to about 1% of the number of rows in the data
to see if that stabilizes things. If not, try increasing it to about 5%,
but if things still are not working at that point, you should look at your
data for pairs of variables that are extremely highly correlated, or even
linearly dependent, and that shouldn't both be in the imputation model.
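For concreteness, here is a rough sketch of what I have in mind, based on
the call in your script below (the object name on the left is just a
placeholder, and the 1% value is only a starting point):

# Empirical/ridge prior at roughly 1% of the rows; try 0.05 * nrow(t)
# if the EM run is still unstable.
amelia.out.empri <- amelia(t, m = nmi, ts = "year", cs = "country",
                           intercs = TRUE, polytime = 1, logs = vlogs,
                           empri = 0.01 * nrow(t))
# If runs are still too slow, the "tolerance" argument from the advice
# you quoted can also be loosened (e.g. tolerance = 0.001).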
Hope that helps!
Cheers,
Matt
~~~~~~~~~~~
Matthew Blackwell
Assistant Professor of Government
Harvard University
On Wed, May 6, 2015 at 5:26 AM Nicole Janz <nicolejanz(a)gmail.com> wrote:
Dear all,
I am trying to impute a data set containing panel data (country-years) on
socio-economic indicators for developing countries. I am running into a
one-week time limit on our cluster service and am looking for options to
speed things up.
The overall missingness in the dataset is 40%, and some variables have more
than 60% missing (see the per-variable missingness for one of my data sets
below).
Here are my amelia specifications (using Amelia 1.7.2 and R 3.1.3):
set.seed(1001)
nmi <- 1
amelia.out.tiny2015.04.16_1001 <- amelia(t, m = nmi, ts = "year",
  cs = "country", intercs = TRUE, polytime = 1, logs = vlogs)
I created about 150 single jobs (m=1) with different sets of variables
included in the imputation (e.g. leaving out a high-missingness variable in
one version, or leaving out years that have high missingness). I wanted to
run all 150 imputations on a cluster, but none of them converges before the
one-week cut-off.
I wonder if I can do anything else other than reducing missingness in the
original data and submitting parallel jobs to a cluster service.
I did think about requesting more memory for each job, but I am not sure
that would help, since the automatic memory-allocation script already tends
to assign as much memory as is needed.
I saw on this help list:
"The "empri" argument sets an empirical/ridge prior. A value of a half
to
1 percent of the sample size would be small, aid numerical stability, and
unlikely to noticably change results (unless you are using time series
cross sectional data, in which case you might use 1 percent of the sample
within any cross sectional unit).
The "tolerance" changes the point at which the EM algorithm is judged to
have converged, and setting that larger, (like .001, or even .005) is
probably quite safe. We were very conservative with this tolerance choice,
and should reexamine other options to set it dynamically."
Unfortunately I don’t quite understand these options, and I’m not sure how
to go about testing which of them speeds up Amelia, or how far I can push
them without messing things up.
If you have any tips for me on what my next steps should be, I’d be very
grateful!
Best,
Nicole
P.S. If anyone wants to see the data to help, please let me know and I’d be
happy to send them over.
####################### Missingness per variable #################
# variable nmiss n propmiss
# 29 Lag_confl 160 4640 0.03448276
# 27 Lag_population 171 4640 0.03685345
# 2 infmortality 176 4640 0.03793103
# 1 lifeexp 312 4640 0.06724138
# 30 Lag_UN_FDI_flow 565 4640 0.12176724
# 31 Lag_UN_FDI_flow_pgdp 565 4640 0.12176724
# 26 Lag_GDP_curr 586 4640 0.12629310
# 32 Lag_UN_FDI_stock 699 4640 0.15064655
# 33 Lag_UN_FDI_stock_pgdp 699 4640 0.15064655
# 28 Lag_trade 921 4640 0.19849138
# 3 CIRI_WECON 1071 4640 0.23081897
# 25 Lag_polity2 1278 4640 0.27543103
# 18 Lag_US_fdi_electrical 1772 4640 0.38189655
# 19 Lag_US_fdi_transport 1821 4640 0.39245690
# 17 Lag_US_fdi_machinery 1822 4640 0.39267241
# 4 CIRI_WOSOC 1879 4640 0.40495690
# 16 Lag_US_fdi_prim_fab_metal 1930 4640 0.41594828
# 15 Lag_US_fdi_chemical 1959 4640 0.42219828
# 14 Lag_US_fdi_food 1999 4640 0.43081897
# 20 Lag_US_fdi_whole_trade 2085 4640 0.44935345
# 13 Lag_US_fdi_total_manuf 2099 4640 0.45237069
# 11 Lag_US_fdi_total 2144 4640 0.46206897
# 22 Lag_US_fdi_finance_except 2154 4640 0.46422414
# 21 Lag_US_fdi_depository 2429 4640 0.52349138
# 23 Lag_US_fdi_other_ind 2604 4640 0.56120690
# 8 Core.Country.Right.to.Housing 3095 4640 0.66702586
# 7 Core.Country.Right.to.Health 3198 4640 0.68922414
# 6 Core.Country.Right.to.Education 3238 4640 0.69784483
# 9 Core.Country.Right.to.Food 3335 4640 0.71875000
# 24 Lag_US_fdi_mining 3416 4640 0.73620690
# 10 Core.Country.Right.to.Work 3449 4640 0.74331897
# 12 Lag_US_fdi_petrol 3659 4640 0.78857759
# 5 Core.Country.SERF.Index 3682 4640 0.79353448
####################### Amelia Specifications #################
setwd("/Users/Janz01/Dropbox/MI_SERGIO_2015_04_19/MI1/")
require(Amelia)
# load data
load("tiny2015.04.16.RData")
t <- tiny2015.04.16
# Declare variables NOT to log by Amelia
vlogs <- c(
"lifeexp","infmortality","Core.Country.SERF.Index","Core.Country.Right.to.Education",
"Core.Country.Right.to.Health","Core.Country.Right.to.Housing","Core.Country.Right.to.Food","Core.Country.Right.to.Work",
"Lag_GDP_curr","Lag_population","Lag_trade",
"l_Lag_US_fdi_total","l_Lag_US_fdi_petrol","l_Lag_US_fdi_total_manuf","l_Lag_US_fdi_food",
"l_Lag_US_fdi_chemical","l_Lag_US_fdi_prim_fab_metal","l_Lag_US_fdi_machinery","l_Lag_US_fdi_electrical",
"l_Lag_US_fdi_transport","l_Lag_US_fdi_whole_trade","l_Lag_US_fdi_depository","l_Lag_US_fdi_finance_except",
"l_Lag_US_fdi_other_ind","l_Lag_US_fdi_mining","l_Lag_UN_FDI_flow","l_Lag_UN_FDI_flow_pgdp",
"l_Lag_UN_FDI_stock","l_Lag_UN_FDI_stock_pgdp")
set.seed(1001)
nmi <- 1
amelia.out.tiny2015.04.16_1001 <- amelia(t, m = nmi, ts = "year",
  cs = "country", intercs = TRUE, polytime = 1, logs = vlogs)
# save file
save(amelia.out.tiny2015.04.16_1001,
file="amelia.out.tiny2015.04.16_1001.Rdata")
--
Dr Nicole Janz
Research Methods Associate
University of Cambridge
Social Sciences Research Methods Centre
Email (personal): nj248(a)cam.ac.uk
Website:
www.nicolejanz.de
Skype: nicole.janz
Twitter: @polscireplicate
Blog:
http://politicalsciencereplication.wordpress.com/
--
Amelia mailing list served by HUIT
[Un]Subscribe/View Archive:
http://lists.gking.harvard.edu/?info=amelia
More info about Amelia:
http://gking.harvard.edu/amelia