I would advise getting the data from the original survey. If you have that data
and it omits the city names, you might try some sort of Monte Carlo method (or
even an exact method) to get the distribution(s) of your subjective beliefs
about the missing data values and any quantities of interest. You will
probably have to make the assumption that the survey data reflects the
population.
Quoting Clayton Nall <nall(a)fas.harvard.edu>:
For our replication project, Justin Grimmer and I have been exploring
missing data problems. We were given a data set by the authors that
was stripped of most identifiable information. Alas, it retains only
those variables necessary for the authors' OLS regressions, even
though it has a rich data history: it was produced by merging 1990
Census data with a detailed 1986 survey of municipalities. The
authors omitted any unique variable that would let us link back to the
original databases, a copy of which we have in our possession.
To identify the missing data points (the unit of analysis is
municipalities), we are trying to link the data table they used in
their regressions with the 1986 raw data or the 1990 Census. Lacking
a unique identifier, we have been unable to get an exact match, but we
have been trying to link on the following variables:
-Log of 1990 Census population, rounded to either 2 or 3 decimal places.
(The problem here is that increasing precision picks up rounding
discrepancies between the two data sets, while decreasing precision leads
to too many false matches when we merge.) Because the authors rounded
the log of population to 3 (or so) decimal places, we cannot recover
the exact population by simply exponentiating.
-Census region
-A variety of dummy variables for various survey responses. These
have been only somewhat useful since they are not a unique identifier.
-We think that the states are listed in alphabetical order in their
data table (based on the sequencing of regions) but we aren't sure how
to use this fact.
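
One possible way around the rounding problem with the log-population variable is to treat each rounded value as an interval rather than a point: a value rounded to d decimal places is consistent with any population whose log lies within half a unit of the last digit. The sketch below is in Python (standard library only) rather than R, and it assumes a natural log, exact Census population counts on the other side, and made-up field names (`log_pop`, `pop`, `region`, `id`). None of these come from the actual data sets.

```python
import math

def population_bounds(rounded_log_pop, decimals=3):
    """Range of populations whose natural log rounds to rounded_log_pop.

    Assumes natural log and ordinary rounding to `decimals` places.
    """
    half = 0.5 * 10 ** (-decimals)
    return math.exp(rounded_log_pop - half), math.exp(rounded_log_pop + half)

def candidate_matches(table_rows, census_rows, decimals=3):
    """Map each table row index to the census ids consistent with it."""
    out = {}
    for i, row in enumerate(table_rows):
        lo, hi = population_bounds(row["log_pop"], decimals)
        out[i] = [
            c["id"]
            for c in census_rows
            if c["region"] == row["region"] and lo <= c["pop"] <= hi
        ]
    return out

# Hypothetical records, invented purely for illustration.
census = [
    {"id": "A", "region": 1, "pop": 25000},
    {"id": "B", "region": 1, "pop": 25010},
    {"id": "C", "region": 2, "pop": 25000},
    {"id": "D", "region": 1, "pop": 40000},
]
table = [
    {"region": 1, "log_pop": round(math.log(25000), 3)},
    {"region": 1, "log_pop": round(math.log(40000), 3)},
]

matches = candidate_matches(table, census)
for row_index, ids in matches.items():
    print(row_index, ids)
```

In this toy example the first row stays ambiguous (two municipalities of nearly identical size in the same region), while the second has a single candidate. Rows with exactly one candidate could be accepted as matches, and the remaining rows narrowed further with the dummy variables.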
Linking on any combination of these produces either too few
identifiable matches or too many. We've been using the merge()
function in R. It seems to work okay, but it is not good at identifying
what's causing mismatches. We have gotten close to linking the 1990
Census data with the 1986 survey, but this only gets us to their first
step; it doesn't let us work backward to match with their table.
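
To see what is causing mismatches, it may help to count how often each candidate key occurs on each side before merging at all. The following is a minimal sketch in Python (standard library only), not a feature of R's merge(); the function name `diagnose_keys` and the example keys are hypothetical. It sorts every key into one of four groups: present only on the left, present only on the right, matched one-to-one, or ambiguous (duplicated on at least one side).

```python
from collections import Counter

def diagnose_keys(left_keys, right_keys):
    """Classify each key by how it would behave in an exact merge."""
    left, right = Counter(left_keys), Counter(right_keys)
    report = {"left_only": [], "right_only": [], "one_to_one": [], "ambiguous": []}
    for key in sorted(set(left) | set(right)):
        if key not in right:
            report["left_only"].append(key)    # no partner: too few matches
        elif key not in left:
            report["right_only"].append(key)
        elif left[key] == 1 and right[key] == 1:
            report["one_to_one"].append(key)   # safe, identifiable match
        else:
            report["ambiguous"].append(key)    # duplicates: too many matches
    return report

# Hypothetical keys of the form (region, rounded log population).
table_keys = [(1, 10.127), (1, 10.597), (2, 9.903)]
census_keys = [(1, 10.127), (1, 10.127), (1, 10.597), (3, 8.5)]

report = diagnose_keys(table_keys, census_keys)
for group, keys in report.items():
    print(group, keys)
```

Running this for several candidate key combinations shows whether a given combination fails because keys are missing on one side (too few matches) or because keys repeat (too many).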
Has anyone on the list dealt with such a problem in the past? Any
suggestions are greatly appreciated. We're considering going back to
the authors, but we suspect they may not have what we need.
Thanks,
Clayton and Justin