On Fri, Sep 24, 2010 at 04:23:55PM -0400, Matt Blackwell wrote:
I have two follow-up questions. First, I tried to use tscsPlot to
compare before and after setting the "zero-count weight estimates" to
NA, and it didn't appear to do the right thing. For example, on
Mondays in 2008 for a site, there should have been no imputed truck
weights at all after NA-ing the zero count records, but I see lots in
that day's tscsPlot. Aside from cutting, pasting and hacking, is
there a way to get tscsPlot to plot the segment bars with the non-NA
imputed estimates and the observed?
Unfortunately, the tscsPlot works by taking a bunch of draws from the
imputation model, so it wouldn't know that certain observations really
shouldn't be imputed. We might build in a feature to mark observations
as "truly" missing, which would also make tscsPlot work for you, but
this is still in development.
I got that by looking at the code after I sent the email. It was
actually quite instructive to read the code and see what was being
done...now I have more of a clue of what Amelia is doing...estimating
the parameters of the distribution with the EM chains, and _then_
doing the random draws. This explains the visual "pause" (for lack of
a better term) when the EM chains converge and the next one
starts...it is doing the random draws, and if I put a hard bound at
zero sometimes it does 1000 draws or whatever before going with zero.
I can set the max chain length with emburn, but I'm not sure a priori
how long to set it (200 is okay, 2000 is too slow). I also see there
is a "tolerance" argument, but I'm not sure how to use it or what it
means. Is it better to leave tolerance alone and just cut off the
chains, or up the tolerance to something like 0.01?
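
To make the tolerance-versus-cutoff question concrete, here is a toy sketch in Python of a generic iterative estimator with both knobs. It is not Amelia's code: the names iterate_until_converged and toy_update, and the stopping rule "largest parameter change below tolerance", are assumptions made purely for illustration.

```python
def iterate_until_converged(update, theta0, tolerance=1e-4, max_iter=200):
    """Apply theta = update(theta) until the largest parameter change
    falls below `tolerance`, or until `max_iter` iterations have run.

    Returns (theta, iterations_used, converged_flag).
    """
    theta = list(theta0)
    for n in range(1, max_iter + 1):
        new_theta = update(theta)
        # Largest absolute change in any parameter during this step.
        change = max(abs(a - b) for a, b in zip(new_theta, theta))
        theta = new_theta
        if change < tolerance:
            return theta, n, True
    # Hit the iteration cap without meeting the tolerance.
    return theta, max_iter, False


def toy_update(theta):
    """Hypothetical update: move each parameter halfway to a fixed target."""
    target = [1.0, 2.0]
    return [t + 0.5 * (g - t) for t, g in zip(theta, target)]


# Tight tolerance with a generous cap, a looser tolerance, and a hard cutoff.
tight = iterate_until_converged(toy_update, [0.0, 0.0], tolerance=1e-4)
loose = iterate_until_converged(toy_update, [0.0, 0.0], tolerance=0.01)
capped = iterate_until_converged(toy_update, [0.0, 0.0], max_iter=5)
```

In this toy, loosening the tolerance still ends on a stated convergence criterion (just a weaker one), whereas the capped run stops with its converged flag set to False, so nothing is known about how far the parameters are from settling.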
Cutting off the chains is problematic because you cannot be sure that
the parameters that you care about have converged. One idea might be
to simply use totals instead of averages and then set those
"structurally" missing cells to zero.
Yeah, that is what I was thinking---not converged is not converged.
Also, as you suggest, using totals instead of averages is the way
to go! I think it makes more sense and definitely runs faster. I am now
imputing the sum total of all truck weights (and so on) observed in
the period and lane. My concern was that this would introduce a
stair-step kind of pattern to the data, but I guess the randomness of the
data plus the robustness of Amelia combine to make this not an issue
at all.
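
For the archives, a minimal sketch in Python of the recoding described above, done before the data go to the imputation step. The function name period_total and the field names count and mean_weight are made up for this example, and it assumes a missing count is recorded as None.

```python
def period_total(count, mean_weight):
    """Total truck weight for one period/lane cell.

    count:       trucks observed in the cell, or None if nothing was recorded
    mean_weight: average truck weight in the cell, or None if unavailable
    """
    if count is None:
        return None   # truly missing: leave it for the imputation to fill in
    if count == 0:
        return 0.0    # no trucks, so the total is a known zero, not missing
    if mean_weight is None:
        return None   # trucks were counted but not weighed: still missing
    return count * mean_weight


cells = [
    {"count": 12, "mean_weight": 15.5},      # ordinary observed cell
    {"count": 0, "mean_weight": None},       # zero-count cell
    {"count": None, "mean_weight": None},    # detector reported nothing
]
totals = [period_total(c["count"], c["mean_weight"]) for c in cells]
```

The point of the recoding is that a zero-count cell has an undefined average but a perfectly well-defined total, so it no longer has to be flagged as missing at all.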
Regards,
James