Matt,
Thanks for the advice. It seems to be working.
I have two follow-up questions. First, I tried to use tscsPlot to
compare before and after setting the "zero-count weight estimates" to
NA, and it didn't appear to do the right thing. For example, on
Mondays in 2008 for a site, there should have been no imputed truck
weights at all after NA-ing the zero count records, but I see lots in
that day's tscsPlot. Aside from cutting, pasting and hacking, is
there a way to get tscsPlot to plot the segment bars using only the
non-NA imputed estimates and the observed values?
Second question: I have a high proportion of values that I expect to
be missing in the final set. The sensors look for trucks, and trucks
generally move in the right hand lane. That means middle lanes aren't
likely to see trucks at all. As an example, at one site for the 3rd
lane from the right I have a missing matrix of
     Mode   FALSE    TRUE    NA's
  logical    1677    4879       0
for the "truck weight" column, but there are only 85 missing counts in
that lane (the other 4800 or so are *non-missing* counts of 0). Amelia
works hard to impute 4800+ "missing" weights on very little
information, and in the end I assign NA to most of those values. There
isn't much point to all that imputation work.
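To make the split between the two kinds of missing weight explicit, something like the following base R sketch could be used. It runs on a made-up data frame, and the column names `count` and `weight` are stand-ins for whatever the lane's columns are really called.

```r
# Toy lane data: count is NA when the detector was down,
# and 0 when the detector worked but no trucks passed.
lane <- data.frame(
  count  = c(3, 0, NA, 0, 5, NA, 0),
  weight = c(12.1, NA, NA, NA, 15.4, NA, NA)
)

# Weights missing because the count itself is missing (worth imputing)
truly_missing <- is.na(lane$weight) & is.na(lane$count)

# Weights undefined because no truck was observed (structural NA)
structural_na <- is.na(lane$weight) & !is.na(lane$count) & lane$count == 0

missing_summary <- c(
  observed      = sum(!is.na(lane$weight)),
  truly_missing = sum(truly_missing),
  structural_na = sum(structural_na)
)
print(missing_summary)
```

On the real data, the `structural_na` tally would correspond to the 4800 or so zero-count periods, and `truly_missing` to the 85 periods with missing counts.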
If I take out the "secondary" variables (weight, length, speed) and run the
imputation just for counts, the imputations converge *generally* in
less than 10 iterations (usually just 3 to 5). An example iterHist
looks like:
     [,1] [,2] [,3]
[1,] 2852    0    0
[2,]  363    0    0
[3,]    1    0    0
[4,]    0    0    0
With those secondary (mean) variables included, the iterations bog down:
aout.all$iterHist[[1]]
      [,1] [,2] [,3]
 [1,] 6088    0    0
 [2,] 4836    0    0
 [3,] 3929    0    0
 [4,] 3301    0    0
 [5,] 2992    0    0
...
[47,]  660    0    0
[48,]  658    0    0
[49,]  653    0    0
[50,]  651    0    0
...
My question is whether I can safely expect that my count variables and
the variables associated with non-zero count periods have settled down
to good estimates early in the iteration process, or whether they are
still fluctuating.
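As a rough way to put numbers on "bogging down", the first-column values posted above can be turned into a per-iteration relative drop. This is only a sketch: it assumes the first column of iterHist counts the parameters still changing by more than the tolerance, and `rel_drop` is a made-up helper name. It describes how fast that count is shrinking; it does not say which parameters are the ones still moving.

```r
# First-column values copied from the two iteration histories in this message.
counts_only     <- c(2852, 363, 1, 0)
with_means_head <- c(6088, 4836, 3929, 3301, 2992)  # iterations 1 to 5
with_means_tail <- c(660, 658, 653, 651)            # iterations 47 to 50

# Hypothetical helper: relative drop from each iteration to the next.
rel_drop <- function(x) -diff(x) / head(x, -1)

print(round(rel_drop(counts_only), 3))
print(round(rel_drop(with_means_head), 3))
print(round(rel_drop(with_means_tail), 3))
```

By iteration 50 the count is falling by less than one percent per step, which is why the long chains look stalled rather than converged.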
I can set the max chain length with emburn, but I'm not sure a priori
how long to set it (200 is okay, 2000 is too slow). I also see there
is a "tolerance" argument, but I'm not sure how to use it or what it
means. Is it better to leave tolerance alone and just cut off the
chains, or up the tolerance to something like 0.01?
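For reference, both knobs are ordinary arguments to amelia(): emburn takes a two-element vector of minimum and maximum EM chain length, and tolerance is the EM convergence threshold (0.0001 by default). A minimal sketch of a wrapper follows; the helper name `impute_capped` and the column names `hour` and `dow` are placeholders, not anything from my actual script.

```r
# Hypothetical helper: cap the EM chain length and/or loosen the tolerance.
# "hour" and "dow" are placeholder names for the ts and cs columns.
impute_capped <- function(dat, max_iter = 200, tol = 1e-4, m = 5) {
  if (!requireNamespace("Amelia", quietly = TRUE)) {
    stop("the Amelia package is required")
  }
  Amelia::amelia(
    dat,
    m         = m,
    ts        = "hour",
    cs        = "dow",
    emburn    = c(0, max_iter),  # c(min, max) iterations per EM chain
    tolerance = tol,             # convergence threshold for EM
    p2s       = 0                # suppress screen output
  )
}
```

Calling it as `impute_capped(dat, max_iter = 200)` versus `impute_capped(dat, tol = 0.01)` would let the two strategies be compared on the same data.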
Again, thanks for any insights you can provide.
regards,
James
On Tue, Sep 14, 2010 at 08:36:55AM -0400, Matt Blackwell wrote:
Hi James,
A few thoughts. If the truck length variable changes over time, then
you can impute it along with the counts in the same Amelia run. If
there is no empirical time dependence, then Amelia will not use time
to impute the truck lengths.
As for the truly missing truck lengths, you can always go back into
your imputed data and manually code those as missing. R code would
look something like:
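a.out <- amelia(***your call here***)
for (i in seq_along(a.out$imputations)) {
  # periods with a true count of zero have no trucks to measure
  mask <- a.out$imputations[[i]]$count == 0
  # blank only the characteristic column ("length" is a stand-in name)
  a.out$imputations[[i]]$length[mask] <- NA
}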
You'll want to double check that it works (I haven't tested it). The
reason why this will work is that the imputed cell has not added any
information to the data itself; it has only let the observed values in
that row contribute to the imputation model. Omitting that observation
from the imputation entirely, by contrast, would bias the imputation.
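One way to double check the masking idea without a full Amelia run is to try it on a hand-built list of data frames shaped like the imputations list. This is only a toy sketch: the values are invented, and `count` and `length` are placeholder column names.

```r
# Fake "imputations": two completed copies of a tiny data set.
imps <- list(
  data.frame(count = c(2, 0, 4), length = c(18.0, 17.2, 20.1)),
  data.frame(count = c(2, 0, 0), length = c(18.0, 16.9, 19.5))
)

for (i in seq_along(imps)) {
  mask <- imps[[i]]$count == 0   # periods with no trucks
  imps[[i]]$length[mask] <- NA   # blank only the characteristic column
}

# Every zero-count row should now have NA length, and no other row should.
checks <- vapply(
  imps,
  function(d) all(is.na(d$length) == (d$count == 0)),
  logical(1)
)
print(checks)
```

Note that the mask is computed separately for each imputation, since an imputed count can be zero in one copy and non-zero in another.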
I hope that helps.
Cheers,
matt.
On Tue, Sep 14, 2010 at 1:03 AM, James Marca
<jmarca(a)translab.its.uci.edu> wrote:
> Hi,
>
> I have a data set that is based on observations of vehicles by lane.
> For example, each truck that passes the detector will be counted, and
> its characteristics recorded (length, weight etc). By summing up the
> counts into higher time periods, say an hour, I can use Amelia to
> impute missing counts of vehicles (statisticians look the other way,
> but I tell Amelia that the time series varies by time of day (the ts
> variable runs from 0 to 24) and by inserting day of week as the cs
> (cross section) variable (0 through 6). While that may be a
> non-standard perversion of the input parameters, it seems to work
> pretty well.) I have other data for the missing periods from other
> detectors, so I think it makes sense to try to use Amelia rather than
> simply estimating a time series model for the missing counts.
>
> Now that I can impute counts I want to impute missing characteristics.
> For example, in an hour of good observation, every truck will have a
> length recorded. When the detector is kaput for some reason, I want
> to impute the missing average lengths along with the missing truck
> counts.
>
> The problem is that sometimes there are no observations (a true count
> of zero) for a period, and so the expected length for the period is a
> "true" NA, rather than just a missing variable. This is quite common;
> while the trucks are *usually* in the right hand lanes, they are
> sometimes detected in the middle lanes. The middle lane detectors
> therefore *usually* have a count of zero and indeterminate characteristics.
>
> My question is how to proceed using Amelia. My naive strategy would
> be to run Amelia once to impute the counts, and then run Amelia again
> for each imputation (5 times), for the characteristics of the vehicles
> (as a non-time dependent imputation) *only* for the non-zero periods
> and lanes, and then use Zelig to compute average lengths. Does this
> make sense, or have I crossed the line from imputation to imagination?
>
> My other thought would be to aggregate up to daily periods and make it
> so there should never be zero counts, but I'd really like to preserve
> the hourly variation in the data.
>
> One other note: I've coded my data by observation time (with multiple
> lanes of data). I could also code it as one record per lane per
> observation time, which would allow me to drop zero count lanes. I
> just can't see how this would help.
>
> Any advice would be appreciated.
>
> Regards,
> James Marca
>
--
James E. Marca, PhD
Researcher
Institute of Transportation Studies
AIRB Suite 4000
University of California
Irvine, CA 92697-3600
jmarca(a)translab.its.uci.edu
(949) 824-6287