Hi Trevor,
1) I don't think there's a specific procedure other than reasoning through
the causes of missingness and whether they depend on the values of the
missing data. The number of missing variables within a given unit might
make this less plausible, but I don't think there are any changes to the
way we might think about it.
2) Yes, definitely include any information you can in the imputation model.
Though if you are using "intercs = TRUE" then you will be including a
within-unit fixed effect or trend that will absorb any time-constant
variables.
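Concretely, something like the following sketch, where the data frame and
column names are hypothetical stand-ins but the arguments are the actual
amelia() ones:

```r
library(Amelia)

# df: hypothetical TSCS data frame with one row per municipality-year.
# Time-constant or rarely measured variables (e.g., census controls)
# can simply be included as columns of df; with intercs = TRUE, though,
# the unit-specific trends will absorb anything time-constant.
a.out <- amelia(df,
                m = 5,                 # number of imputed datasets
                cs = "municipality",   # cross-section identifier
                ts = "year",           # time identifier
                polytime = 2,          # quadratic time trend
                intercs = TRUE,        # unit-specific trends
                idvars = "state")      # carried along, not imputed
```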
3) Might as well do them all together because otherwise the model might
think that some previously imputed data is "real."
4) Not hopeless! Though you may want to use Amelia in addition to a more
tailored missing-data solution (to handle NMAR data) if you increasingly
suspect that's a problem.
Hope that helps!
Cheers,
Matt
~~~~~~~~~~~
Matthew Blackwell
Assistant Professor of Government
Harvard University
On Fri, Sep 18, 2015 at 4:09 PM Trevor Lyons <trevor.lyons(a)gmail.com> wrote:
Hello,
I have TSCS data for 540 Brazilian municipalities from 1992 to 2012
collected from the national Treasury Department related to local budget
receipts and expenditures. I have around 15-20 variables of interest,
including total receipts and expenditures which are then broken down into
categories such as "own-source tax revenue" or "spending on health and
sanitation." With extremely few exceptions, the data are either entirely
missing or all present, which occurs in 427 of my 8,000+ observations.
Approximately 40% of municipalities have at least one missing year in the
time series (13% total missing only one year, 10% missing two, 6% missing
three, and tapering off with only 11 of 560 missing more than 6).
There are two clear patterns that I've identified:
1) There is a marked spike for 1998-1999, with over 90 missing for both,
and almost always a municipality that is missing one of these two years is
missing the other. Besides this, there are rarely any consecutive runs of
missing values, and these 90 municipalities are no more or less likely to
have missing observations outside of this time period.
2) The municipalities with more than three missing years are clustered in a
handful of states frequently associated with poverty and/or corruption.
The reason a given year is missing is relevant: it means that the
municipality failed to turn over its annual accounts data to the federal
government as required by law. As of 2001, there are even (in theory)
sanctions for not providing this data that could lead to the withholding of
grants or the removal of a mayor's ability to run for any elected office
for the following eight years, although this is only sporadically enforced.
There is a corresponding drop in the average number of missing observations
after this point, with the very real possibility that at least the
post-2001 missing data are related to administrative improprieties.
My biggest bind is that there are exceedingly few other reliably available
annual data at the municipal level up until 1999, leaving me with
population size and age distribution as my only continuously changing
controls for this time period. I am running a fixed effects model to test
for the effect of a particular policy that was implemented in different
cities beginning in 1989, and if I were to consider only the budgetary data
from 1999 onwards I would have no "pre-implementation" observations for
over half of the implementing municipalities. This leaves me highly
dependent on properly addressing the missing data issue, but I am still a
little unclear as to how I should use Amelia in this particular situation.
If I were just working with, say, 2000-2012, then I think I understand all
the necessary steps.
Most of the budgetary variables have a fairly stable tendency to increase
over time, with only certain categories of taxes and spending going through
drastic annual changes. I used a polynomial time trend unique to each unit,
setting "polytime = 2" and "intercs = TRUE". Looking at
different graphs as diagnostics, the imputations seem to perform reasonably
well, but I'm sure I have missed something.
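For reference, the call and diagnostics described here would look roughly
like the following; the data frame and variable names are stand-ins:

```r
library(Amelia)

# budget: stand-in for the municipal budget panel described above
a.out <- amelia(budget, m = 5,
                cs = "municipality", ts = "year",
                polytime = 2, intercs = TRUE)

# Graphical diagnostics: observed vs. imputed densities, overimputation
# of observed values, and the imputed time series for a single unit
plot(a.out)
overimpute(a.out, var = "total_receipts")
tscsPlot(a.out, cs = "some municipality id", var = "total_receipts")
```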
My questions are:
1) Is there a specific procedure when nearly all of the variables are
missing for a given observation, assuming that the few continuous variables
present satisfy the MAR assumption?
2) To improve both the strength of the imputation as well as the MAR
assumption behind it, can I use variables that are only measured at a few
points in time (i.e. decennial census data) or that are time-constant?
3) Is it better to perform the imputations for all of the missing
variables at once, or should they be done incrementally?
4) Is this all a fool's errand because it is unrealistic to assume my data
are MAR?
Thank you,
Trevor
--
Amelia mailing list served by HUIT
[Un]Subscribe/View Archive:
http://lists.gking.harvard.edu/?info=amelia
More info about Amelia:
http://gking.harvard.edu/amelia