I added a new real number variable to my dataset, but when I try to impute
it along with the rest of the dataset, I receive the following error:
Error in if (sum(non.vary == 0)) { :
argument is not interpretable as logical
It is a real number variable (I checked it with is.real()), and it doesn't
seem to differ from the other real number variables in the dataset.
I've updated to the newest version of Amelia (Version 1.2-17, built:
2010-05-10). I noticed from looking through the mailing list archives that
this error was present in 1.2-14, and was fixed (by Matt Blackwell) in
1.2-15. It seems to be back, though perhaps this is somehow my fault. Any
help would be appreciated.
Thank you,
Gregory
Dear fellow Amelia users,
For my PhD project I am currently working on a pooled cross-sectional time-series dataset which contains many missing values. Since I'm new to Amelia, I have a couple of questions on the use of the program. I also tried to find the answers to my questions in the archive, but some issues are still unclear to me and I hope some of you could help me out. My questions are as follows:
1) My dataset on political parties spans the 1984-2006 period and I only have observations for 1984, 1988, 1992, 1996, 1999, 2002 and 2006, as data is only gathered around elections. Altogether this entails that I have far more missing values than observations. Would you still recommend using Amelia with such an enormous degree of missingness?
All my other questions deal with possible violations of the criterion that the imputation model should include at least as much information as will be used in the analysis model. This is stressed in the Amelia manual (page 10) and journal article.
2) My analysis model contains interaction effects and Euclidean distance measures, and I understand that I also have to add these to the imputation model. However, the consequence of this approach is that I end up with imputed interaction effects and Euclidean distance measures that don't make any sense, as Amelia does not know how these variables are constructed. For example: in my analysis model, the interaction effect C is meant to be A multiplied by B, but the Amelia algorithm will replace missing values of C by something different from A*B. Since this fundamentally alters the goal of my analysis, I wonder whether it is also allowed to transform the data AFTER running Amelia. In the example above, this would mean including only A and B in my imputation model and computing the interaction effect C myself after running Amelia. Is this a good procedure, or would it bias my results?
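To make the second question concrete, this is the post-imputation transformation I have in mind (the data frame df and the columns A, B, C are hypothetical names):

```r
library(Amelia)

# Impute using A and B only; the interaction C is NOT passed to amelia().
out <- amelia(df, m = 5)

# Rebuild C = A * B inside each of the m completed datasets afterwards.
imps <- lapply(out$imputations, transform, C = A * B)
```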
3) In my analysis model I focus on lagged or first-differenced effects of my X variables on Y. Do I also have to make these transformations before I run the imputation model, or can I lag/take first differences of my variables after running Amelia? The latter would be much more practical for me, because I always have regular 3-4 year intervals of missingness between two data points in my time series, which means that I will be unable to take first differences of any variable before running the imputation model (as this would generate a variable that is always missing).
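In code, the post-Amelia differencing of the third question would be roughly the following (party, year and X are hypothetical column names, and out is the amelia() output):

```r
# Take first differences of X within each party, separately for every
# one of the m completed datasets returned by amelia().
diffed <- lapply(out$imputations, function(d) {
  d <- d[order(d$party, d$year), ]
  d$dX <- ave(d$X, d$party, FUN = function(v) c(NA, diff(v)))
  d
})
```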
4) My final question is whether it is allowed to add new data to the analysis model after the data has been imputed by Amelia. I want to do this because I would like to merge parties with their supporters on the basis of left-right positions. However, in order to know the left-right positions of the parties for every year, I first have to impute the missing data on the parties' left-right positions. After running these imputations I have all the information I need to merge the parties with another dataset that contains information on their supporters. Can I do this, or would I again violate the assumption that the analysis model must contain the same information as the imputation model?
I hope my questions make sense (I'm sorry in case they don't) and hope that some of you have some advice. Your help is very much appreciated. Let me know if anything is unclear, so I can clarify it.
Best regards,
Marc
Hi,
I have a data set that is based on observations of vehicles by lane.
For example, each truck that passes the detector will be counted, and
its characteristics recorded (length, weight etc). By summing up the
counts into higher time periods, say an hour, I can use Amelia to
impute missing counts of vehicles. (Statisticians, look the other way:
I tell Amelia that the time series varies by time of day (the ts
variable runs from 0 to 24) and insert day of week as the cs
(cross-section) variable (0 through 6). While that may be a
non-standard perversion of the input parameters, it seems to work
pretty well.) I have other data for the missing periods from other
detectors, so I think it makes sense to try to use Amelia rather than
simply estimating a time series model for the missing counts.
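In code, the setup I just described looks roughly like this (the data frame and column names are made up for illustration):

```r
library(Amelia)

# Hypothetical hourly count data: "hr" is hour of day (the ts variable)
# and "dow" is day of week (the cs variable), as described above.
out <- amelia(counts.df, m = 5, ts = "hr", cs = "dow")
```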
Now that I can impute counts I want to impute missing characteristics.
For example in an hour of good observation, every truck will have a
length recorded. When the detector is kaput for some reason, I want
to impute the missing average lengths along with the missing truck
counts.
The problem is that sometimes there are no observations (a true count
of zero) for a period, and so the expected length for the period is a
"true" NA, rather than just a missing variable. This is quite common;
while the trucks are *usually* in the right hand lanes, they are
sometimes detected in the middle lanes. The middle lane detectors
therefore *usually* have a count of zero and indeterminate characteristics.
My question is how to proceed using Amelia. My naive strategy would
be to run Amelia once to impute the counts, and then run Amelia again
for each imputation (5 times), for the characteristics of the vehicles
(as a non-time dependent imputation) *only* for the non-zero periods
and lanes, and then use Zelig to compute average lengths. Does this
make sense, or have I crossed the line from imputation to imagination?
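Sketched in code, the naive strategy would be something like this (all names are hypothetical, and the characteristics step is shown with a single imputation per completed dataset):

```r
library(Amelia)

# Step 1: impute the hourly counts.
cnt.out <- amelia(counts.df, m = 5, ts = "hr", cs = "dow")

# Step 2: for each completed dataset, impute vehicle characteristics
# only for periods/lanes with a non-zero count, with no time index.
char.imps <- lapply(cnt.out$imputations, function(d) {
  obs <- d[d$count > 0, ]   # true-zero periods carry no characteristics
  amelia(obs, m = 1, idvars = c("hr", "dow"))$imputations[[1]]
})
```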
My other thought would be to aggregate up to daily periods and make it
so there should never be zero counts, but I'd really like to preserve
the hourly variation in the data.
One other note: I've coded my data by observation time (with multiple
lanes of data). I could also code it as one record per lane per
observation time, which would allow me to drop zero count lanes. I
just can't see how this would help.
Any advice would be appreciated.
Regards,
James Marca
Hi,
I have a dataset where my variable of interest is the fisheries
production (a continuous variable). This dataset contains information,
in general, from 2005 to 2007 by month and county, which characterizes
a time-series-cross-section data. What I need to do is to impute the
values for 2008 for every month and county, based on past values and
trends. There are some values for some counties only at the beginning
of 2008 (mainly for the first four months), all the rest is missing.
Since the sample design is fixed (i.e. every month all counties were
visited to collect information), I created rows for these unavailable
counties and months in 2008 (based on previously available information)
and filled the fisheries production values I wanted to impute with NA.
Then I used Amelia II to impute the values as follows:
out <- amelia(data.na, ts = "TIME", cs = "COUNTY", polytime = 2,
logs = "PROD", p2s = 2, m = 15,
lags = "PROD", leads = "PROD",
empri = 0.1 * nrow(data.na), intercs = TRUE)
where data.na is my dataset, TIME is continuous from the first to the
last available information ordered according to year and month, COUNTY
are the counties, and PROD is the variable of interest. I used logs=
because the data is highly skewed. I also used lags= and leads=, and a
ridge prior (empri=) due to the high rate of missingness.
Now, my aim here is not to make any further data analysis. The
objective of the imputation is to have an estimated production for
each county and month, only with the purpose of information, since
there was no data collection for the imputed period. That said, my
question is: to get this estimated production, could I just take the
mean of the m = 15 imputed values? If not, what would be the best
approach to get this result?
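In other words, would something as simple as this be defensible?

```r
# Cell-wise mean of PROD across the m = 15 completed datasets,
# assuming rows line up across the elements of out$imputations.
prod.mat  <- sapply(out$imputations, function(d) d$PROD)
prod.mean <- rowMeans(prod.mat)
```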
Thanks in advance,
---
Fernando Mayer
e-mail: fernandomayer [@] gmail.com
-
Amelia mailing list served by Harvard-MIT Data Center
[Un]Subscribe/View Archive: http://lists.gking.harvard.edu/?info=amelia
More info about Amelia: http://gking.harvard.edu/amelia
>
> Hi all,
>
> I'm working with a panel data set which contains many missing values for
> demographic variables (e.g. mortality rates, life expectancy, etc.). Some of
> them will be modeled as outcome variables. As is well known, they are not
> missing at random, mostly concentrated among poor and non-democratic
> countries. Thus I am assuming I have to model the missingness process, or at
> least provide some information, such as that the mean of the missing data
> should be lower or higher for a particular set of countries. I have seen that
> I can provide priors for the imputation procedure in your software, but I am
> not sure whether this is an explicit model of the non-ignorable process. Any
> suggestions?
>
> Congratulations on your amazing software!
>
> Help and advice really appreciated,
>
> Antonio.
>
Hi out there!
I hear Amelia is amazing. I've always been a Stata user, but I'm happy to convert if someone can help me with this problem I have.
I am trying to perform multiple imputation on a panel dataset where all variables have some missing values (save the unique ID numbers and the time variable). The missing rate varies from 1% to 10%.
I would like to ask what options Amelia offers me to move forward; Stata is a dead end, I think. I understand the issues with this level of missingness. Listwise deletion is not an option because 1) the missingness is very likely correlated with other observed variables (e.g., income), so the missingness is MAR at best, and 2) it would make the sample size too small to be useful. Stata doesn't offer pairwise deletion, so I'd have to code this up myself. Plus, according to a Stata listserv thread, pairwise deletion generates worse biases than listwise deletion (Allison 2002).
So here's my situation: I have a rich panel dataset from a developing country that could yield some interesting policy results. It is an unfortunate consequence of working with data from a developing country that the data has missing values. I've tried the mi functions in Stata using mvn (the multivariate normal estimation option), and I get error messages like the ones copied below. I've read in Stata's MI manual that doing univariate estimations for multiple imputations is incorrect procedure if the results are not used in independent analyses. I understand this, but it may be my only option.
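If it helps, I imagine the Amelia call for my data would look something like the following (variable names invented):

```r
library(Amelia)

# Hypothetical panel: "id" and "wave" are the fully observed identifier
# and time variables; everything else has 1-10% missingness and gets imputed.
a.out <- amelia(panel.df, m = 5, cs = "id", ts = "wave")
```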
Can Amelia help me?
Thanks,
jennie
ERROR MESSAGES
1)
Iteration 0: variance-covariance matrix (Sigma) is not positive definite
posterior distribution is not proper
2)
Iteration 0: imputed data contain missing values
This may occur when imputation variables are used as independent variables, when independent variables contain missing values, or when
variance-covariance matrix becomes not positive definite. You can specify option force if you wish to proceed anyway.
3)
Iteration 0: variance-covariance matrix (Sigma) is not positive definite
EM did not converge
Hi, I was looking at the source code to try to figure out the
difference between splinetime and polytime, and I *think* I came
across a bug. I found "polytime" inside of amcheck.r in a section
dealing with "splinetime". Here is my diff patch:
diff --git a/R/amcheck.r b/R/amcheck.r
index 246fa75..981ef9b 100644
--- a/R/amcheck.r
+++ b/R/amcheck.r
@@ -472,7 +472,7 @@ amcheck <- function(x,m=5,p2s=1,frontend=FALSE,idvars=NULL,logs=NULL,
if (!identical(splinetime,NULL)) {
#Error code: 54
#Spline of time are longer than one integer
- if (length(polytime) > 1) {
+ if (length(splinetime) > 1) {
error.code<-54
error.mess<-paste("The spline of time setting is greater than one integer.")
return(list(code=error.code,mess=error.mess))
@@ -497,7 +497,7 @@ amcheck <- function(x,m=5,p2s=1,frontend=FALSE,idvars=NULL,logs=NULL,
error.mess<-paste("You have set splines of time without setting the time series variable.")
return(list(code=error.code,mess=error.mess))
}
- if (all(!intercs,identical(polytime,0))) {
+ if (all(!intercs,identical(splinetime,0))) {
warning(paste("You've set the spline of time to zero with no interaction with \n",
"the cross-sectional variable. This has no effect on the imputation."))
}
Hope that is signal rather than noise.
Regards,
James Marca