Dear list & Matt,
After running a successful imputation in AmeliaView, I wish to produce a
few further sets of (5) imputations and view the corresponding diagnostics
for each set, without exiting and restarting AmeliaView.
If I do restart AmeliaView and run an (apparently) identically specified set
of imputations on identical data, I notice that the diagnostic plots vary
slightly from set to set. I assume this is normal?
However, after running a second and subsequent sets of imputations, the
output log does not seem to change. Without restarting AmeliaView, I change
the output directory before each fresh set of imputations; AmeliaView behaves
well, producing fresh sets of (5) imputations and saving them to the
allocated directory (as csv files). However, the log does not appear to
change, and the diagnostic plots appear identical to those produced for the
first set of imputations. Am I missing something, please?
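(In case it helps to test this outside the GUI: AmeliaView is a front-end to the amelia() function, so the same behaviour can be checked from the R console, where each run draws fresh random numbers unless a seed is fixed. A minimal sketch using the africa example data shipped with Amelia; the m, ts, and cs settings here are illustrative, not the actual specification above.)

```r
library(Amelia)
data(africa)  # example panel data set shipped with the Amelia package

set.seed(42)                                          # fix the seed: runs become reproducible
a1 <- amelia(africa, m = 5, ts = "year", cs = "country")
plot(a1)                                              # diagnostic plots for this run

set.seed(43)                                          # a different seed gives slightly
a2 <- amelia(africa, m = 5, ts = "year", cs = "country")  # different imputations...
plot(a2)                                              # ...and slightly different diagnostics
```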
many thanks
Simon UK
I have a question about combining imputed data from Amelia. I understand the rationale for running your end analysis on each imputed data set separately and then combining the model results. However, what if your analysis is more complicated than a simple LM?
For example, my analysis uses imputed data sets (5) of time series variables (12 independent water quality variables, all time series, and 1 dependent time series). I decompose each series using loess and extract the trend only. Then I use prewhitening and cross-correlation to identify lags of variables that may be useful predictors. Finally, I difference each series and create and compare ARIMA models with external regressors to find the best model.
I am having a hard time understanding how going through each of these steps with each imputed data set separately (and trying to combine the best models) would not create more variability and decrease the confidence of the model, compared to averaging the imputed data sets before doing any analysis.
In short, if my imputed data sets are not "that" different, and the range of values for each of my predictors is relatively small, could it possibly be better to average the data first instead of trying to combine the best model from each?
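(For reference, the standard combine-after-analysis approach pools one quantity of interest across the m fits with Rubin's rules, rather than pooling the data. A minimal sketch below; a.out, y, x1, and x2 are placeholders for an Amelia output and a model of interest, and Amelia's mi.meld() performs the same pooling given vectors of estimates and standard errors.)

```r
library(Amelia)

# Fit the same model to each of the m imputed data sets (placeholders:
# a.out is an amelia() result; y, x1, x2 are hypothetical column names).
fits <- lapply(a.out$imputations, function(d) lm(y ~ x1 + x2, data = d))

qs <- sapply(fits, function(f) coef(f)["x1"])        # point estimates of x1
us <- sapply(fits, function(f) vcov(f)["x1", "x1"])  # within-imputation variances

m     <- length(qs)
q.bar <- mean(qs)                   # pooled point estimate
b     <- var(qs)                    # between-imputation variance
t.var <- mean(us) + (1 + 1/m) * b   # total variance (Rubin's rules)
se    <- sqrt(t.var)                # pooled standard error
```

Note that the between-imputation variance b is exactly the extra uncertainty that averaging the data sets first would hide, which is the usual argument against averaging even when the imputations look similar.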
I would greatly appreciate any comments or suggestions.
Thank you for your help.
Hello,
I have a question about an error code, and whether or not there are options for working around it.
Here is the situation:
I have 4 variables (actually 4 classes of variables) that are linear combinations of each other; thus, they cannot all be in the imputation model. I am using the "idvars" argument to hold one of these variables out of the imputation process to avoid the "error: inv_sympd(): matrix appears to be singular" message, which I believe arises because these variables are linearly dependent.
However, when I try to hold one of these variables out of the imputations using the "idvars" argument, I get the message below:
>
Amelia Error Code: 32
Transformations must be mutually exclusive, so one
variable can only be assigned one transformation. You have the
same variable designated for two transformations.
>
I have been able to get the imputations to run with the following:
1.) Running the imputation with only 3 of the 4 variables
2.) Running the imputation with 3 of the variables included, but one left out using "idvars", and specifying incheck=FALSE
Also, the "idvars" argument works just fine when I hold out two ID variables from the imputation... it only fails when I also add one of these 4 correlated variables.
The 4 variables I am speaking of are not necessarily important for the imputation process, but are important for the stats I would like to run down the road on the imputed data (thus I need them in the final imputed data set). When I put all 4 of the variables into the "idvars" argument, I get the error related to the number of excluded variables >> number of variables used for imputation.
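(For concreteness, the workaround in point 2 above looks roughly like the sketch below; mydata, v4, and the id columns are placeholders, and incheck = FALSE skips the pre-run checks that raise error code 32 in this setup.)

```r
library(Amelia)

# Hypothetical: v1..v4 are the linearly dependent variables. Keeping v4 in
# idvars leaves it out of the imputation model but passes it through to the
# imputed data sets, so it is available for later analysis.
a.out <- amelia(mydata,
                m = 5,
                idvars = c("id1", "id2", "v4"),  # ID columns plus one dependent variable
                incheck = FALSE)                 # skip pre-run checks that trip error 32
```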
Any advice on how to handle this problem is greatly appreciated! Thank you!
All the best regards,
Ryan
Good afternoon,
I am a student at the Rochester Institute of Technology, currently
working with a time-series data set that I am trying to impute; however, I
get the following error:
Loading required package: Rcpp
##
## Amelia II: Multiple Imputation
## (Version 1.7.3, built: 2014-11-14)
## Copyright (C) 2005-2015 James Honaker, Gary King and Matthew Blackwell
## Refer to http://gking.harvard.edu/amelia/ for more information
##
amelia starting
beginning prep functions
Variables used: Volume Speed Delay Stops
Error in colSums(sapply(priors[, 1, drop = FALSE], ">", blanks)) :
'x' must be an array of at least two dimensions
Calls: amelia ... amelia.default -> amelia.prep -> amsubset -> colSums
I am calling Amelia with the following command:
amelia(dfMat, m=5, ts=5, priors=priorsMat, bounds=boundsMat, p2s=2)
where dfMat is the data matrix of size 60000x5 (the fifth column is
the time variable) and priorsMat is the matrix of observational priors with 4
columns (I specify the mean and standard deviation for each data point). I've
checked that dfMat and priorsMat have the correct dimensions, so I am not
sure what could be giving me this error.
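(One thing worth double-checking, since the traceback shows sapply being applied to priors[, 1]: Amelia expects priors to be a numeric matrix, not a data frame, in the four-column form (row, column, mean, sd) (or the five-column range form). A hypothetical sketch of the expected shape; the cell positions and values here are made up.)

```r
# Observation-level priors as a numeric matrix: one row per prior, with
# columns (row index, column index, prior mean, prior standard deviation).
priorsMat <- matrix(c(
  1, 2, 10, 1.5,   # cell at row 1, column 2: prior N(10, 1.5^2)
  7, 3,  4, 0.8    # cell at row 7, column 3: prior N(4, 0.8^2)
), ncol = 4, byrow = TRUE)

stopifnot(is.matrix(priorsMat))  # a data.frame here can trigger dimension errors
```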
I would be very grateful if somebody could give me any pointers as to what
might be the issue.
Thank you very much.
Sincerely,
Michal Kucer
--
Michal Kucer
Rochester Institute of Technology
Dear List Members,
I experience an error when running tscsPlot on an Amelia output with the message:
Error in output$theta[, , ceiling(i/drawsperimp)] :
subscript out of bounds
The error is due to the ceiling function, which for some settings returns wrong results because of floating-point arithmetic. In my case m = 6 and draws = 100, resulting in
ceiling(100 / ((1/6) * 100)) = 7 != 6
so the index is out of bounds.
I attached a patch which only uses integer arithmetic and is thus not susceptible to floating point arithmetic errors.
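(For anyone hitting the same bug before a fix lands, the idea of the patch can be sketched as an integer-only replacement for the failing expression; imp_index is a hypothetical name, and drawsperimp in tscsPlot is draws / m.)

```r
# Failing form: ceiling(i / drawsperimp), with drawsperimp = draws / m.
# When draws / m is not exactly representable (e.g. 100 / 6), the division
# can land a hair above an integer and ceiling() rounds up one too far.
# Integer-only equivalent of ceiling(i * m / draws), for 1 <= i <= draws:
imp_index <- function(i, draws, m) {
  (i * m + draws - 1) %/% draws  # %/% is integer division: no rounding error
}

imp_index(100, 100, 6)  # 6, not 7
```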
Hope that helps!
Kind regards,
David
Dear all,
I am trying to impute a data set containing panel data (country-years) on socio-economic indicators for developing countries. I am running into a one-week time limit on our cluster service and am looking for options to speed things up.
The overall missingness in the dataset is 40%. Some variables have missingness over 60% (see below missingness per variable in one of my data sets).
Here are my amelia specifications (using Amelia 1.7.2 and R 3.1.3):
set.seed(1001)
nmi=1
amelia.out.tiny2015.04.16_1001 <- amelia(t, m=nmi, ts="year", cs="country",intercs=TRUE,polytime=1,log=vlogs)
I created about 150 single jobs (m=1) with different variables included in the imputation (e.g. leaving out a high-missingness variable in one version; leaving out years that have high missingness). I wanted to run all 150 imputations on a cluster, but none of them converge before the cut-off limit of one week.
I wonder if I can do anything else other than reducing missingness in the original data and submitting parallel jobs to a cluster service.
I did think about requesting more memory for each job, but I am not sure this would help, since the automatic memory-allocation script tends to be good at assigning as much memory as is needed.
I saw on this help list:
"The "empri" argument sets an empirical/ridge prior. A value of a half to 1 percent of the sample size would be small, aid numerical stability, and unlikely to noticably change results (unless you are using time series cross sectional data, in which case you might use 1 percent of the sample within any cross sectional unit).
The "tolerance" changes the point at which the EM algorithm is judged to have converged, and setting that larger, (like .001, or even .005) is probably quite safe. We were very conservative with this tolerance choice, and should reexamine other options to set it dynamically."
Unfortunately I don’t quite understand these options, and I’m not sure how to go about testing which of them speeds Amelia up, or how far I can push them without compromising the results.
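(To make those two suggestions concrete, here is a sketch of how they might be added to the call above. The specific values are starting points to experiment with on a subset of the data, not recommendations for final runs; also note that the documented argument name is logs.)

```r
n <- nrow(t)  # overall sample size of the data being imputed
# For TSCS data, the quoted advice suggests basing empri on the sample
# size within a cross-sectional unit rather than on the full n.
amelia.out <- amelia(t, m = 1, ts = "year", cs = "country",
                     intercs = TRUE, polytime = 1, logs = vlogs,
                     empri = 0.01 * n,    # ridge prior; try ~0.5-1% of sample size
                     tolerance = 0.001)   # default is 1e-4; larger stops EM sooner
```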
If you have any tips for me on what my next steps should be I’d be very grateful!
Best,
Nicole
P.S. If anyone wants to see the data to help please let me know and I’d be happy to send them over.
####################### Missingness per variable #################
# variable nmiss n propmiss
# 29 Lag_confl 160 4640 0.03448276
# 27 Lag_population 171 4640 0.03685345
# 2 infmortality 176 4640 0.03793103
# 1 lifeexp 312 4640 0.06724138
# 30 Lag_UN_FDI_flow 565 4640 0.12176724
# 31 Lag_UN_FDI_flow_pgdp 565 4640 0.12176724
# 26 Lag_GDP_curr 586 4640 0.12629310
# 32 Lag_UN_FDI_stock 699 4640 0.15064655
# 33 Lag_UN_FDI_stock_pgdp 699 4640 0.15064655
# 28 Lag_trade 921 4640 0.19849138
# 3 CIRI_WECON 1071 4640 0.23081897
# 25 Lag_polity2 1278 4640 0.27543103
# 18 Lag_US_fdi_electrical 1772 4640 0.38189655
# 19 Lag_US_fdi_transport 1821 4640 0.39245690
# 17 Lag_US_fdi_machinery 1822 4640 0.39267241
# 4 CIRI_WOSOC 1879 4640 0.40495690
# 16 Lag_US_fdi_prim_fab_metal 1930 4640 0.41594828
# 15 Lag_US_fdi_chemical 1959 4640 0.42219828
# 14 Lag_US_fdi_food 1999 4640 0.43081897
# 20 Lag_US_fdi_whole_trade 2085 4640 0.44935345
# 13 Lag_US_fdi_total_manuf 2099 4640 0.45237069
# 11 Lag_US_fdi_total 2144 4640 0.46206897
# 22 Lag_US_fdi_finance_except 2154 4640 0.46422414
# 21 Lag_US_fdi_depository 2429 4640 0.52349138
# 23 Lag_US_fdi_other_ind 2604 4640 0.56120690
# 8 Core.Country.Right.to.Housing 3095 4640 0.66702586
# 7 Core.Country.Right.to.Health 3198 4640 0.68922414
# 6 Core.Country.Right.to.Education 3238 4640 0.69784483
# 9 Core.Country.Right.to.Food 3335 4640 0.71875000
# 24 Lag_US_fdi_mining 3416 4640 0.73620690
# 10 Core.Country.Right.to.Work 3449 4640 0.74331897
# 12 Lag_US_fdi_petrol 3659 4640 0.78857759
# 5 Core.Country.SERF.Index 3682 4640 0.79353448
####################### Amelia Specifications #################
setwd("/Users/Janz01/Dropbox/MI_SERGIO_2015_04_19/MI1/")
require(Amelia)
# load data
load("tiny2015.04.16.RData")
t <- tiny2015.04.16
# Declare variables NOT to log by Amelia
vlogs <- c( "lifeexp","infmortality","Core.Country.SERF.Index","Core.Country.Right.to.Education",
"Core.Country.Right.to.Health","Core.Country.Right.to.Housing","Core.Country.Right.to.Food","Core.Country.Right.to.Work",
"Lag_GDP_curr","Lag_population","Lag_trade",
"l_Lag_US_fdi_total","l_Lag_US_fdi_petrol","l_Lag_US_fdi_total_manuf","l_Lag_US_fdi_food",
"l_Lag_US_fdi_chemical","l_Lag_US_fdi_prim_fab_metal","l_Lag_US_fdi_machinery","l_Lag_US_fdi_electrical",
"l_Lag_US_fdi_transport","l_Lag_US_fdi_whole_trade","l_Lag_US_fdi_depository","l_Lag_US_fdi_finance_except",
"l_Lag_US_fdi_other_ind","l_Lag_US_fdi_mining","l_Lag_UN_FDI_flow","l_Lag_UN_FDI_flow_pgdp",
"l_Lag_UN_FDI_stock","l_Lag_UN_FDI_stock_pgdp")
set.seed(1001)
nmi=1
amelia.out.tiny2015.04.16_1001 <- amelia(t, m=nmi, ts="year", cs="country",intercs=TRUE,polytime=1,log=vlogs)
# save file
save(amelia.out.tiny2015.04.16_1001, file="amelia.out.tiny2015.04.16_1001.Rdata")
--
Dr Nicole Janz
Research Methods Associate
University of Cambridge
Social Sciences Research Methods Centre
Email (personal): nj248(a)cam.ac.uk
Website: www.nicolejanz.de
Skype: nicole.janz
Twitter: @polscireplicate
Blog: http://politicalsciencereplication.wordpress.com/