Amelia July 2013

amelia@lists.gking.harvard.edu

7 participants
11 discussions

by James Marca

While I'm bombarding this list...

A long time ago I read that Amelia had the ability to do spatial imputations as well as time series. I think it was in one of the PDF papers linked from the website. I believe the example was countries in Africa, and how their economic activity (?) could be linked spatially and temporally.

Is there any activity on developing this function? As a transportation researcher, both space and time are important to my work, and it would be useful to be able to specify, say, network distance between locations and somehow use that variable to influence the imputation process.

Thanks,
James

--
James E. Marca, PhD
Researcher
Institute of Transportation Studies
AIRB Suite 4000
University of California
Irvine, CA 92697-3600

10 years, 9 months

splinetime vs polytime options?

by James Marca

Hi Amelia users,

The reason I was hitting the website was to see if there was any better documentation on the difference between the splinetime and polytime options in Amelia. My old code uses splinetime, and I have a comment saying "use splinetime, not polytime", but I forget exactly why and it isn't well documented in the PDF:

## polytime
integer between 0 and 3 indicating what power of polynomial should be included in the imputation model to account for the effects of time. A setting of 0 would indicate constant levels, 1 would indicate linear time effects, 2 would indicate squared effects, and 3 would indicate cubic time effects.

## splinetime
interger (sic) value of 0 or greater to control cubic smoothing splines of time. Values between 0 and 3 create a simple polynomial of time (identical to the polytime argument). Values k greater than 3 create a spline with an additional k-3 knot-points

To me that says splinetime is a superset of polytime (values k in (0,1,2,3) are the same as polytime=k), which in turn implies to me that polytime is there only for backwards compatibility. Is that correct?

Regards,
James

--
James E. Marca, PhD
Researcher
Institute of Transportation Studies
AIRB Suite 4000
University of California
Irvine, CA 92697-3600
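
Editor's note: the quoted documentation can be illustrated with a small base-R sketch. It builds a polynomial time basis and a cubic B-spline time basis with the recommended `splines` package and checks that, with no interior knots, the two span the same space. The helper names `poly_basis` and `spline_basis` and the 21-period time index are assumptions made for illustration; this is not Amelia's internal code and may differ from how Amelia constructs its time terms.

```r
library(splines)  # ships with R; provides bs()

# Hypothetical time index: 21 periods in one cross-sectional unit.
time <- 1:21

# Polynomial of time up to degree k (what the docs describe for polytime = k).
poly_basis <- function(time, k) {
  outer(time, 0:k, `^`)  # columns: 1, t, t^2, ..., t^k
}

# Cubic B-spline of time with `knots` interior knot points, plus a constant.
spline_basis <- function(time, knots) {
  cbind(1, bs(time, df = 3 + knots, degree = 3))
}

p3 <- poly_basis(time, 3)    # cubic polynomial: 4 columns
s3 <- spline_basis(time, 0)  # cubic spline, no interior knots: 4 columns
s5 <- spline_basis(time, 2)  # cubic spline, 2 interior knots: 6 columns

# Regress the polynomial columns on the knot-free spline columns.
# Residuals that are numerically zero mean both bases span the same space.
fit <- lm.fit(s3, p3)
max(abs(fit$residuals))
```

Under the documentation's description, the 2-knot basis above would correspond to splinetime=5 (k-3 = 2 additional knot-points), which a polynomial of degree 3 or less cannot reproduce.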

10 years, 9 months

Amelia website problems?

by James Marca

Hi Amelia maintainers,

I was just hitting the Amelia website from CRAN (http://gking.harvard.edu/amelia) and it has pegged a CPU core twice on my laptop and once on my desktop. I've had to kill Firefox to close the page.

Laptop: Firefox version 19.0 on Linux. Desktop: Firefox version 22.0 on Linux.

Regards,
James

--
James E. Marca, PhD
Researcher
Institute of Transportation Studies
AIRB Suite 4000
University of California
Irvine, CA 92697-3600

10 years, 9 months

parallel option: only one R process/thread is created

by Nicole Janz

Dear all,

I have been working on submitting amelia jobs on the cluster service of my university. I cannot get the parallel option to work. When I invoke amelia with:

amelia(t, m=nmi, ts="year", cs="country", intercs=TRUE, polytime=1, log=vlogs, parallel="multicore", ncpus=4)

only one R process/thread is created.

Cluster information: 16-core RedHat Enterprise Linux 6 x86_64 server.

Any ideas? Thank you!
Nicole

Nicole Janz, PhD Cand.
Lecturer at Social Sciences Research Methods Centre 2013/14
University of Cambridge
Department of Politics and International Studies
www.nicolejanz.de | nj248(a)cam.ac.uk | Mobile: +44 (0) 7905 70 1 69 4
Skype: nicole.janz
Blog: politicalsciencereplication.wordpress.com
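
Editor's note: a possible explanation, not confirmed in this thread, is that the parallel option distributes whole imputations across workers, so the number of busy processes cannot exceed m. The call above passes m=nmi; if nmi is 1, a single process would be expected regardless of ncpus. The sketch below shows that reasoning and a guarded call on the `africa` example data that ships with Amelia. The values of `m` and `ncpus` are illustrative assumptions, and the Amelia call is skipped when the package is not installed.

```r
library(parallel)  # ships with R

# Cores visible to this R session. A cluster scheduler may grant fewer
# cores than the node physically has.
cores_seen <- detectCores()

# Assumption: one worker handles one imputation at a time, so the number
# of workers that can be busy is bounded by both m and ncpus.
m <- 5
ncpus <- 4
workers_used <- min(m, ncpus)

# Guarded example; "multicore" relies on forking, so it is not available
# on Windows.
if (requireNamespace("Amelia", quietly = TRUE)) {
  data(africa, package = "Amelia")
  out <- Amelia::amelia(africa, m = m, ts = "year", cs = "country",
                        parallel = "multicore", ncpus = ncpus, p2s = 0)
  length(out$imputations)  # one imputed data set per requested imputation
}
```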

10 years, 9 months

Error in match.fun(FUN) : object 'is' not found

by Nicole Janz

Can anyone help me with this error when trying to run Amelia:

> Error in match.fun(FUN) : object 'is' not found
> Calls: amelia -> amelia.default -> sapply -> match.fun
> Execution halted
> Loading required package: Amelia
> Loading required package: foreign
> Loading required package: Rcpp
> Loading required package: RcppArmadillo

I run the same Amelia code (see copied below) fine on my mac, and on my uni's windows 7 computer (both with Amelia 1.7.2. and R 3.0.1). When I try to run the code for several imputation files on our uni's cluster, I get the above error in the output file. I cannot re-create the error on my own computers. The cluster also uses the same R and Amelia version. My code is copied below.

Thank you,
Nicole

Nicole Janz, PhD Cand.
Lecturer at Social Sciences Research Methods Centre 2013/14
University of Cambridge
Department of Politics and International Studies
www.nicolejanz.de | nj248(a)cam.ac.uk | Mobile: +44 (0) 7905 70 1 69 4
Skype: nicole.janz
Blog: politicalsciencereplication.wordpress.com

Code for imputation (one of 25 for m=1 each):

#######################
# Multiple Imputation
# Darwin - Janz - original FDI - 1 Year Lag
#######################

require(Amelia)

#load data set
load("t1.Rdata")
dim(t1) # 3822 90
colnames(t1)
head(t1)
t <- t1

# remove unnecessary variables (original FDI)
vrm <- c(
  "Mos_labor",
  "CIRI_NEW_EMPINX",
  "US_fdi_total",
  "US_fdi_petrol",
  "US_fdi_total_manuf",
  "US_fdi_food",
  "US_fdi_chemical",
  "US_fdi_prim_fab_metal",
  "US_fdi_machinery",
  "US_fdi_electrical",
  "US_fdi_transport",
  "US_fdi_whole_trade",
  "US_fdi_depository",
  "US_fdi_finance_except",
  "US_fdi_mining",
  "polity2",
  "GDP_const2000",
  "GDP_curr",
  "WB_FDI_percentGDP",
  "WB_FDI_curr",
  "population",
  "trade",
  "confl",
  "UN_FDI_flow",
  "UN_FDI_flow_pgdp",
  "UN_FDI_stock",
  "UN_FDI_stock_pgdp",
  "l_US_fdi_total",
  "l_US_fdi_petrol",
  "l_US_fdi_total_manuf",
  "l_US_fdi_food",
  "l_US_fdi_chemical",
  "l_US_fdi_prim_fab_metal",
  "l_US_fdi_machinery",
  "l_US_fdi_electrical",
  "l_US_fdi_transport",
  "l_US_fdi_whole_trade",
  "l_US_fdi_depository",
  "l_US_fdi_finance_except",
  "l_US_fdi_mining",
  "l_WB_FDI_percentGDP",
  "l_WB_FDI_curr",
  "l_UN_FDI_flow",
  "l_UN_FDI_flow_pgdp",
  "l_UN_FDI_stock",
  "l_UN_FDI_stock_pgdp",
  "Lag_l_WB_FDI_percentGDP",
  "Lag_l_WB_FDI_curr",
  "Lag_WB_FDI_percentGDP",
  "Lag_WB_FDI_curr",
  "Lag_GDP_const2000",
  "Lag_UN_FDI_flow",
  "Lag_UN_FDI_flow_pgdp",
  "Lag_UN_FDI_stock",
  "Lag_UN_FDI_stock_pgdp"
)
length(vrm) #55
t.new <- t[, !(colnames(t) %in% vrm)]
dim(t) # 3822
dim(t.new) #3822 37
t <- t.new
colnames(t)

# Declare log variables that have no negative values
vlogs <- c(
  "Lag_l_US_fdi_mining", #newly inserted
  "Lag_GDP_curr",
  "Lag_population",
  "Lag_trade",
  "Lag_l_US_fdi_total",
  "Lag_l_US_fdi_petrol",
  "Lag_l_US_fdi_total_manuf",
  "Lag_l_US_fdi_food",
  "Lag_l_US_fdi_chemical",
  "Lag_l_US_fdi_prim_fab_metal",
  "Lag_l_US_fdi_machinery",
  "Lag_l_US_fdi_electrical",
  "Lag_l_US_fdi_transport",
  "Lag_l_US_fdi_whole_trade",
  "Lag_l_US_fdi_depository",
  "Lag_l_US_fdi_finance_except",
  #"Lag_l_US_fdi_mining",
  "Lag_l_UN_FDI_flow",
  "Lag_l_UN_FDI_flow_pgdp",
  "Lag_l_UN_FDI_stock",
  "Lag_l_UN_FDI_stock_pgdp",
  "lifeexp",
  "infmortality"
)

set.seed(1004)
nmi=1
amelia.out2013.07.10.0004 <- amelia(t, m=nmi, ts="year", cs="country", intercs=TRUE, polytime=1, log=vlogs)

# save file
save(amelia.out2013.07.10.0004, file="amelia.out2013.07.10.0004.Rdata")
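
Editor's note: one possible cause, not confirmed in this thread, is that the cluster job runs the script through Rscript. In R versions before 3.5.0, Rscript did not attach the `methods` package at startup, so `is()` could be missing in batch jobs while the same code worked in an interactive session. A minimal sketch of the workaround, assuming that is the cause here, is to attach `methods` explicitly before loading Amelia:

```r
# Attach 'methods' explicitly. An interactive R session does this at
# startup, but Rscript in older R versions (before 3.5.0) did not.
library(methods)

# Sanity check: is() must now be visible on the search path.
stopifnot(exists("is", mode = "function"))

# Load Amelia afterwards (skipped here if the package is not installed).
if (requireNamespace("Amelia", quietly = TRUE)) {
  library(Amelia)
}
```

Placing the `library(methods)` line at the top of each imputation script keeps the script behaving the same way under R, Rscript, and R CMD BATCH.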

10 years, 9 months

Using Amelia on a Cluster

by Jacalyn Huband

I have been able to run the "multicore" option of amelia on a single (Linux) computer with 8 cores, but I am having problems porting my program to a (Linux) cluster. Specifically, I have requested (through the PBS resource management software) a single node with 16 cores. My program appears to run correctly, but then gets killed by PBS (error message: ncpus 4.01 exceeded limit 1).

In the amelia function, I'm setting m = 5, parallel = "multicore", and ncpus = 1. I've tried other values for ncpus (including 5, 10, and 16), but the job still gets killed. (Note: It does not get killed if m = 1, but that defeats the purpose of using multiple cores.)

Is the amelia function trying to spawn threads of which I am unaware?

J. Huband

10 years, 9 months

resume job after disruption

by Nicole Janz

Is there a way Amelia could resume the imputation after being disrupted, e.g. when you run a job on a cluster and you have a 12-hour maximum for each job? I'm wondering if I could save the current imputation and resume from the last saved state.

Thank you!
Nicole

Nicole Janz, PhD Cand.
Lecturer at Social Sciences Research Methods Centre 2013/14
University of Cambridge
Department of Politics and International Studies
www.nicolejanz.de | nj248(a)cam.ac.uk | Mobile: +44 (0) 7905 70 1 69 4
Skype: nicole.janz
Blog: politicalsciencereplication.wordpress.com
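
Editor's note: the thread does not answer whether a single EM chain can be checkpointed. A coarser workaround, sketched below under stated assumptions, is to run each imputation as its own m=1 job, save the result to disk, and on restart rerun only the imputations whose files are missing. The saved runs can then be merged with Amelia's `ameliabind()`. The helper `pending_runs`, the file naming scheme, and the output folder are hypothetical; the example uses the `africa` data shipped with Amelia and is skipped if the package is not installed.

```r
out_dir <- file.path(tempdir(), "imputations")  # hypothetical output folder
dir.create(out_dir, showWarnings = FALSE)

# Hypothetical helper: which of the m planned runs have no saved file yet?
pending_runs <- function(m, dir) {
  files <- file.path(dir, sprintf("imp_%03d.rds", seq_len(m)))
  which(!file.exists(files))
}

todo <- pending_runs(5, out_dir)  # all five on a first pass

if (requireNamespace("Amelia", quietly = TRUE)) {
  data(africa, package = "Amelia")
  for (i in todo) {
    set.seed(1000 + i)  # a distinct seed per run, so runs differ
    fit <- Amelia::amelia(africa, m = 1, ts = "year", cs = "country", p2s = 0)
    saveRDS(fit, file.path(out_dir, sprintf("imp_%03d.rds", i)))
  }
  # Merge every finished run into one amelia object.
  fits <- lapply(list.files(out_dir, full.names = TRUE), readRDS)
  combined <- do.call(Amelia::ameliabind, fits)
  length(combined$imputations)
}
```

This restarts at the level of whole imputations only. A run that is interrupted partway through its chain is repeated from the start.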

10 years, 9 months

multilevel model in Amelia

by Natalia & David, Freedman & Pinto

Dear list server group,

Because I've previously analyzed already-imputed data (DXA data from NHANES) and performed some simple imputations in cross-sectional data, I've volunteered (been nominated?) to help co-workers perform multiple imputation in a longitudinal, multilevel data set.

The sample is ~1500 infants who were visited every month for the first year of life, with the main exposure being parent-reported sugar-sweetened beverage (SSB) consumption over the last week (yes/no); the outcome is obesity at age 6 y. Overall, about 17% of the data for SSB is missing, with the amount of missing data increasing in the later monthly visits. While SSB consumption generally increases from 1% to 11% over the 12 months, about 10% of the sample has a ‘yes’ that is followed by a ‘no’ for SSB intake. There is also a large amount of missing data on other level-1 variables, such as solid food introduction, and for level-2 covariates such as family income, birth weight, mother’s weight status, etc.

In Amelia, I've been treating each child as a cross-sectional unit (cs='childID') and using month of visit for the time-series variable (ts='month'). I've included SSB in the lags and leads options. An initial attempt at using polytime=2 (with or without the intercs=T option) failed to converge even after an hour.

So, my question is whether this approach, based on using cs=, ts=, lags=, and leads=, is adequate for dealing with multilevel data of this type. Or should I really be using polytime and intercs=T in Amelia, or using mice.impute.2l.norm in MICE? None of the intercorrelations in the data are very strong, with the highest being about r=0.20.

I’ve been running the imputations in Ubuntu with options(amelia.parallel='multicore', amelia.ncpus=4).

Thanks very much for any help/suggestions.

David Freedman, Division of Nutrition, CDC Atlanta

10 years, 9 months

error: inv(): matrix appears to be singular

by Nicole Janz

Dear all,

I am rerunning an imputation from a while ago (Amelia version 1.6.3), and the only change is that I'm using a one-year lag on all independent variables instead of the 'original' variables. Otherwise the code + data is pretty much the same, and it ran fine then. I now use Amelia_1.7.2.

After a chain length of 21, I get several lines saying error: inv(): matrix appears to be singular.

I have now already kicked out a variable that was .98 correlated with another one - that did not solve the problem.

The link to my data set: https://www.dropbox.com/s/uc2jst4asiwae4h/t1.Rdata
The Rcode: https://www.dropbox.com/s/33cfk7fqcasngdq/Janz_Darwin_0001.R

The session info is copied below. I would be very grateful for any hints on what I can improve.

Thank you!
Nicole

--
> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] Amelia_1.7.2 RcppArmadillo_0.3.900.0 Rcpp_0.10.3 foreign_0.8-54

loaded via a namespace (and not attached):
[1] tools_3.0.1

--
Nicole Janz, PhD Cand.
Lecturer at Social Sciences Research Methods Centre
University of Cambridge
Department of Politics and International Studies
www.nicolejanz.de | nj248(a)cam.ac.uk | Mobile: +44 (0) 7905 70 1 69 4
Skype: nicole.janz
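
Editor's note: dropping a highly correlated pair does not rule out exact linear dependence among three or more variables, which can also produce a singular matrix. The base-R sketch below uses simulated data (not the data set linked above) to show a column that is an exact sum of two others: no pairwise correlation comes close to .98, yet the matrix rank is one less than the number of columns. Whether this is what happened in the posted data is not established in the thread.

```r
set.seed(1)

# Simulated example: column c is an exact linear combination of a and b.
x <- data.frame(a = rnorm(50), b = rnorm(50))
x$c <- x$a + x$b

# Work on complete rows only, since rank is undefined with NAs present.
complete <- as.matrix(x[complete.cases(x), ])

# Largest absolute pairwise correlation among distinct columns.
cors <- cor(complete)
max_cor <- max(abs(cors[upper.tri(cors)]))

# Numerical rank of the data matrix; full rank would equal ncol(complete).
rank_found <- qr(complete)$rank
rank_found < ncol(complete)  # TRUE signals a linear dependency
```

Applying the same rank check to the complete rows of a real data set, or to subsets of its columns, is one way to look for a dependency that pairwise correlations miss.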

10 years, 9 months

Re: [amelia] Is single imputation faster in parallel? Need help speeding up imputation.

by Honaker, James

Isaac,

In a time-series cross-sectional setting, I might suggest 5% (to 10%) of the n in each cross-section (which is typically going to be smaller than 1% of the total n). So in your series of 21 observations per cross section, an empri=1 (or 2) should again aid stability and not shrink the coefficients significantly. This is advice from a mix of intuition, exploration and experience from use cases, but of course this could really vary in some settings.

Off list I got some follow-up email about my earlier note, which made it clear that I wasn't very clear. The "tolerance" argument I also suggested adjusting changes how the EM algorithm judges whether it has converged. This is a separate thing you might adjust in addition to empirical/ridge priors. Larger numbers would mean that the model parameters (on z-transformed data) can have larger changes between EM-steps and be considered converged.

James

--
James Honaker, Senior Research Scientist
//// Institute for Quantitative Social Science, Harvard University

-----Original message-----
From: Isaac Petersen <dadrivr(a)gmail.com>
To: "Honaker, James" <jhonaker(a)iq.harvard.edu>
Cc: Amelia Mailing List <amelia(a)lists.gking.harvard.edu>
Sent: Tue, Jul 2, 2013 21:13:19 GMT+00:00
Subject: Re: [amelia] Is single imputation faster in parallel? Need help speeding up imputation.

Thanks, James. Your response was very helpful. Just to clarify on the ridge prior: My matrix to be imputed is 12,285 rows by 62 columns, composed of 585 cross-sectional units and 21 time series units. Would a good ridge prior be 1 percent of 21 (where 21 is the number of rows, i.e., time series units, within each cross-sectional unit)?

Thanks for clarifying.
-Isaac

On Tue, Jul 2, 2013 at 10:41 AM, Honaker, James <jhonaker(a)iq.harvard.edu> wrote:

Isaac,

In addition to the newer "multicore" abilities you mention, a small empirical prior will speed up convergence. The "empri" argument sets an empirical/ridge prior. A value of a half to 1 percent of the sample size would be small, aid numerical stability, and be unlikely to noticeably change results (unless you are using time series cross sectional data, in which case you might use 1 percent of the sample within any cross sectional unit).

The "tolerance" changes the point at which the EM algorithm is judged to have converged, and setting that larger (like .001, or even .005) is probably quite safe. We were very conservative with this tolerance choice, and should reexamine other options to set it dynamically.

Best,
James.

--
James Honaker, Senior Research Scientist
//// Institute for Quantitative Social Science, Harvard University

________________________________
From: amelia-bounces(a)lists.gking.harvard.edu [amelia-bounces(a)lists.gking.harvard.edu] on behalf of Isaac Petersen [dadrivr(a)gmail.com]
Sent: Tuesday, July 02, 2013 9:55 AM
To: Amelia Mailing List
Subject: [amelia] Is single imputation faster in parallel? Need help speeding up imputation.

I'm looking to speed up the run time of a single imputation on a large data set with repeated measures that takes many hours. Will running the imputation in parallel with the parallel="multicore" option and 6 cores speed up the run time of a single imputation, or will it only speed up the run time of multiple imputations (by running them simultaneously)? What are my best options for making the single imputation run faster while minimizing any sacrifices in imputation accuracy?

Many thanks!
-Isaac
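
Editor's note: the arithmetic behind the ridge-prior advice in this thread can be worked through in a few lines of R. The panel dimensions (585 units, 21 periods) come from the thread; the variable names are illustrative. The final call is only a sketch on the `africa` example data shipped with Amelia, using the documented `empri` and `tolerance` arguments, and is skipped if Amelia is not installed.

```r
# Panel dimensions quoted in the thread.
n_units   <- 585   # cross-sectional units
n_periods <- 21    # time points per unit
n_total   <- n_units * n_periods  # rows in the stacked data

# 5% to 10% of the observations within one cross-section, which is the
# range behind the suggested empri = 1 (or 2).
empri_low  <- 0.05 * n_periods
empri_high <- 0.10 * n_periods

# For comparison: 1% of the total sample, the cross-sectional rule of thumb.
empri_1pct_total <- 0.01 * n_total

# Guarded sketch: a small ridge prior plus a looser convergence tolerance.
# africa has 20 yearly observations per country, so empri = 1 is about 5%
# of one cross-section.
if (requireNamespace("Amelia", quietly = TRUE)) {
  data(africa, package = "Amelia")
  out <- Amelia::amelia(africa, m = 1, ts = "year", cs = "country",
                        empri = 1, tolerance = 0.001, p2s = 0)
}
```

The within-unit values are far smaller than 1% of the total sample, which matches the remark in the thread that the per-cross-section prior is typically the smaller of the two.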

10 years, 9 months
