question about characteristics of counted events - Amelia

Matt Blackwell

14 Sep

8:36 a.m.

Hi James,

A few thoughts. If the truck length variable changes over time, then you can impute it along with the counts in the same Amelia run. If there is no empirical time dependence, then Amelia will not use time to impute the truck lengths.

As for the truly missing truck lengths, you can always go back into your imputed data and manually code those as missing. R code would look something like:

    a.out <- amelia(***your call here***)
    for (i in seq_along(a.out$imputations)) {
      # rows whose (observed or imputed) count is zero
      mask <- a.out$imputations[[i]]$count == 0
      # blank only the characteristic column; "length" stands in for
      # whatever your truck-length column is actually called
      is.na(a.out$imputations[[i]]$length) <- mask
    }

You'll want to double check that it works (I haven't tested it). The reason why this will work is that the imputed cell has not added any information to the data itself; it has only added the information from the observed values of the cell. Thus, omitting that observation from the imputation entirely would bias the imputation.

I hope that helps.

Cheers,
matt.

On Tue, Sep 14, 2010 at 1:03 AM, James Marca <jmarca(a)translab.its.uci.edu> wrote:

...

Hi,

I have a data set that is based on observations of vehicles by lane. For example, each truck that passes the detector will be counted, and its characteristics recorded (length, weight etc). By summing up the counts into higher time periods, say an hour, I can use Amelia to impute missing counts of vehicles. (Statisticians, look the other way: I tell Amelia that the time series varies by time of day (the ts variable runs from 0 to 24), and I insert day of week as the cs (cross section) variable (0 through 6). While that may be a non-standard perversion of the input parameters, it seems to work pretty well.) I have other data for the missing periods from other detectors, so I think it makes sense to try to use Amelia rather than simply estimating a time series model for the missing counts.

Now that I can impute counts I want to impute missing characteristics. For example, in an hour of good observation, every truck will have a length recorded. When the detector is kaput for some reason, I want to impute the missing average lengths along with the missing truck counts.

The problem is that sometimes there are no observations (a true count of zero) for a period, and so the expected length for the period is a "true" NA, rather than just a missing variable. This is quite common; while the trucks are *usually* in the right hand lanes, they are sometimes detected in the middle lanes. The middle lane detectors therefore *usually* have a count of zero and indeterminate characteristics.

My question is how to proceed using Amelia. My naive strategy would be to run Amelia once to impute the counts, and then run Amelia again for each imputation (5 times), for the characteristics of the vehicles (as a non-time dependent imputation) *only* for the non-zero periods and lanes, and then use Zelig to compute average lengths. Does this make sense, or have I crossed the line from imputation to imagination?

My other thought would be to aggregate up to daily periods and make it so there should never be zero counts, but I'd really like to preserve the hourly variation in the data.

One other note: I've coded my data by observation time (with multiple lanes of data). I could also code it as one record per lane per observation time, which would allow me to drop zero count lanes. I just can't see how this would help.

Any advice would be appreciated.

Regards,
James Marca

- Amelia mailing list served by Harvard-MIT Data Center [Un]Subscribe/View Archive: http://lists.gking.harvard.edu/?info=amelia More info about Amelia: http://gking.harvard.edu/amelia
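The zero-count masking suggested in the reply above can be sketched for several characteristic columns at once. This is only an illustration under stated assumptions: the list `imputations` is a hand-made stand-in for `a.out$imputations`, the column names `count`, `length` and `weight` are placeholders, and `mask_zero_counts` is a hypothetical helper, not part of Amelia.

```r
# Mock stand-in for a.out$imputations: a list of completed data frames.
# Column names and values are made up for illustration only.
imputations <- list(
  imp1 = data.frame(count  = c(3, 0, 5),
                    length = c(18.2, 17.9, 20.1),
                    weight = c(30.5, 29.0, 33.2)),
  imp2 = data.frame(count  = c(3, 0, 5),
                    length = c(18.2, 19.4, 20.1),
                    weight = c(30.5, 31.7, 33.2))
)

# Columns that are undefined when no truck was seen (assumed names).
char_cols <- c("length", "weight")

# Hypothetical helper: set the characteristic columns to NA in every
# row whose count is zero, leaving the count column itself untouched.
mask_zero_counts <- function(imp, cols) {
  zero <- imp$count == 0
  imp[zero, cols] <- NA
  imp
}

masked <- lapply(imputations, mask_zero_counts, cols = char_cols)
```

With a real fit, the same helper could be applied inside a loop over `a.out$imputations`, assigning each masked data frame back into its slot so the list keeps whatever attributes Amelia attached to it.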


James Marca

15 Sep

7:18 p.m.

Matt,

Thanks for the advice. It seems to be working.

I have two follow up questions. First, I tried to use tscsPlot to compare before and after setting the "zero-count weight estimates" to NA, and it didn't appear to do the right thing. For example, on Mondays in 2008 for a site, there should have been no imputed truck weights at all after NA-ing the zero count records, but I see lots in that day's tscsPlot. Aside from cutting, pasting and hacking, is there a way to get tscsPlot to plot the segment bars with the non-NA imputed estimates and the observed?

Second question: I have a high proportion of values that I expect to be missing in the final set. The sensors look for trucks, and trucks generally move in the right hand lane. That means middle lanes aren't likely to see trucks at all. As an example, at one site for the 3rd lane from the right I have a missing matrix of

       Mode   FALSE    TRUE    NA's
    logical    1677    4879       0

for the "truck weight" column, but there are only 85 missing counts in that lane (the other roughly 4800 are *non-missing* counts of 0). Amelia works hard to impute 4800+ "missing" weights on very little information, and in the end I assign NA to most of those values. There isn't much point to all that imputation work.

If I take out the "secondary" variables (weight, length, speed) and run the imputation just for counts, the imputations converge *generally* in less than 10 iterations (usually just 3 to 5). An example iterHist looks like:

         [,1] [,2] [,3]
    [1,] 2852    0    0
    [2,]  363    0    0
    [3,]    1    0    0
    [4,]    0    0    0

With those mean variables in, the iterations bog down

...

    aout.all$iterHist[[1]]

          [,1] [,2] [,3]
     [1,] 6088    0    0
     [2,] 4836    0    0
     [3,] 3929    0    0
     [4,] 3301    0    0
     [5,] 2992    0    0
    ...
    [47,]  660    0    0
    [48,]  658    0    0
    [49,]  653    0    0
    [50,]  651    0    0
    ...

My question is whether I can safely expect that my count variables and the variables associated with non-zero count periods have settled down to good estimates early in the iteration process, or whether they are still fluctuating.

I can set the max chain length with emburn, but I'm not sure a priori how long to set it (200 is okay, 2000 is too slow). I also see there is a "tolerance" argument, but I'm not sure how to use it or what it means. Is it better to leave tolerance alone and just cut off the chains, or up the tolerance to something like 0.01?

Again, thanks for any insights you can provide.

regards,
James

On Tue, Sep 14, 2010 at 08:36:55AM -0400, Matt Blackwell wrote:

...


-- 
James E. Marca, PhD
Researcher
Institute of Transportation Studies
AIRB Suite 4000
University of California
Irvine, CA 92697-3600
jmarca(a)translab.its.uci.edu
(949) 824-6287
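The iteration histories quoted in this message can be inspected programmatically rather than by eye. The sketch below assumes, as the message itself does, that the first column of an iterHist matrix counts the parameters that are still changing at each EM iteration; `first_converged` is a hypothetical helper, and the numbers are copied from the output quoted above rather than produced by a new run.

```r
# Iteration history copied from the counts-only run quoted above.
# matrix() fills column by column, so the first four values form column 1.
iter_hist <- matrix(c(2852, 363, 1, 0,
                      0, 0, 0, 0,
                      0, 0, 0, 0), ncol = 3)

# Hypothetical helper: the first iteration at which column 1 reaches
# zero, or NA if the chain never got there within the recorded history.
first_converged <- function(hist) {
  hit <- which(hist[, 1] == 0)
  if (length(hit) == 0) NA_integer_ else hit[1]
}

first_converged(iter_hist)

# Last four rows shown for the slow run (iterations 47 to 50), column 1 only.
slow_tail <- c(660, 658, 653, 651)

# Per-iteration change in the number of unconverged parameters; small
# steps while the count is still in the hundreds suggest a long way to go.
diff(slow_tail)
```

For a real fit, the same helper could be applied to each element of the iterHist list, for example with `sapply`, to see which chains stopped because they converged and which stopped because they hit the iteration cap.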


James Marca

7:50 p.m.

On Fri, Sep 24, 2010 at 04:23:55PM -0400, Matt Blackwell wrote:

...

> > I have two follow up questions. First I tried to use tscsPlot to compare before and after setting the "zero-count weight estimates" to NA, and it didn't appear to do the right thing. For example, on Mondays in 2008 for a site, there should have been no imputed truck weights at all after NA-ing the zero count records, but I see lots in that day's tscsPlot. Aside from cutting, pasting and hacking, is there a way to get tscsPlot to plot the segment bars with the non-NA imputed estimates and the observed?

> Unfortunately, tscsPlot works by taking a bunch of draws from the imputation model, so it wouldn't know that certain observations really shouldn't be imputed. We might build in a feature to mark observations as "truly" missing, which would also make tscsPlot work for you, but this is still in development.

I got that by looking at the code after I sent the email. It was actually quite instructive to read the code and see what was being done...now I have more of a clue of what Amelia is doing...estimating the parameters of the distribution with the EM chains, and _then_ doing the random draws. This explains the visual "pause" (for lack of a better term) when the EM chains converge and the next one starts...it is doing the random draws, and if I put a hard bound at zero sometimes it does 1000 draws or whatever before going with zero.

...

> > I can set the max chain length with emburn, but I'm not sure a priori how long to set it (200 is okay, 2000 is too slow). I also see there is a "tolerance" argument, but I'm not sure how to use it or what it means. Is it better to leave tolerance alone and just cut off the chains, or up the tolerance to something like 0.01?

> Cutting off the chains is problematic because you cannot be sure that the parameters that you care about have converged. One idea might be to simply use totals instead of averages and then set those "structurally" missing cells to zero.

Yeah, that is what I was thinking---not converged is not converged. Also, as you suggest, using totals instead of averages is the way to go! I think it makes more sense, and it definitely runs faster. I am now imputing the sum total of all truck weights (and so on) observed in the period and lane. My concern was that this would introduce a stair step kind of pattern to the data, but I guess the randomness of the data plus the robustness of Amelia combine to make this not an issue at all.

Regards,
James
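The switch from averages to totals discussed above might be prepared along the following lines. This is a rough sketch, not the code used in the thread: the data frame `obs`, its column names, and the helper `avg_from_totals` are all invented for illustration. A true zero count yields a total of zero, while a detector outage leaves both count and total as NA for Amelia to impute.

```r
# Made-up hourly records for one lane: count of trucks and their mean weight.
# Row 2 is a true zero count; row 3 is a detector outage.
obs <- data.frame(
  count       = c(4, 0, NA, 2),
  mean_weight = c(30, NA, NA, 25)
)

# Total weight per period: 0 for a true zero count, NA when the count
# itself is missing, count * mean otherwise.
is_true_zero <- !is.na(obs$count) & obs$count == 0
obs$total_weight <- ifelse(is_true_zero, 0, obs$count * obs$mean_weight)

# Hypothetical helper for use after imputation: recover the average,
# leaving it undefined (NA) wherever the count is zero.
avg_from_totals <- function(total, count) {
  ifelse(count > 0, total / count, NA_real_)
}

# Example with invented completed values for the outage row (95 and 3).
avg_from_totals(total = c(120, 0, 95, 50), count = c(4, 0, 3, 2))
```

Only `count` and `total_weight` would then go into the imputation model, so the structurally empty periods carry an observed zero instead of a missing cell.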
