For our replication project, Justin Grimmer and I have been exploring
missing data problems. We were given a data set by the authors that
was stripped of most identifiable information. Alas, it retains only
those variables necessary for the authors' OLS regressions, even
though it has a rich data history: it was produced by merging 1990
Census data with a detailed 1986 survey of municipalities. The
authors omitted any uniquely identifying variable that would let us
link back to the original databases, copies of which we have in our
possession.
To identify the missing data points (the unit of analysis is
municipalities), we are trying to link the data table they used in
their regressions with the 1986 raw data or the 1990 Census. Lacking
a unique identifier, we have been unable to get an exact match, but we
have been trying to link on the following variables:
- Log of 1990 Census population, rounded to either 2 or 3 decimal
places. (The problem here is that increasing precision captures
rounding errors between the two data sets, while decreasing precision
leads to too many false matches when we merge.) Because the authors
rounded the log of population to 3 (or so) decimal places, we also get
rounding errors if we simply exponentiate.
- Census region.
- A variety of dummy variables for various survey responses. These
have been only somewhat useful, since they do not form a unique
identifier.
- We think the states are listed in alphabetical order in their data
table (based on the sequencing of regions), but we aren't sure how to
use this fact.
Linking on any combination of these produces either too few
identifiable matches or too many. We've been using the merge()
function in R. It seems to work okay, but it is not good at
identifying what's causing mismatches. We have gotten close to linking
the 1990 Census data with the 1986 survey, but this only gets us to
their first step; it doesn't let us work backward to match with their
table.
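For concreteness, here is the kind of tolerance-based matching we have
been experimenting with (a rough sketch; the data frames regtab and
census and their columns logpop and region are placeholder names):

# match each regression-table row to Census rows in the same region
# whose log population agrees to within half a unit in the third
# decimal place
tol <- 5e-4
candidates <- lapply(1:nrow(regtab), function(i) {
    which(census$region == regtab$region[i] &
          abs(census$logpop - regtab$logpop[i]) < tol)
})
# 0 = no match, 1 = unique link, >1 = ambiguous
table(sapply(candidates, length))

The unique links can then be checked against the alphabetical state
ordering as a sanity check, but we still end up with many ambiguous
cases.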
Has anyone on the list dealt with such a problem in the past? Any
suggestions are greatly appreciated. We're considering going back to
the authors, but we suspect they may not have what we need.
Thanks,
Clayton and Justin
Hi,
I was thinking about how to do exponential regression and came up with this
quick optimization. I was considering using it to fill in missing parts of a
Zipf distribution, but I have decided that the assumptions to do so are not
met, particularly considering that political manifestos in German (which has a
lot of compound words) have a pretty well-filled Zipf distribution whereas
manifestos in English (which has a lot of multi-word terms) do not. I am
curious as to how to adapt this estimator to time-series analysis and to make
it more robust.
Geoff
# the function to optimize: the weighted sum of absolute errors on the
# original (untransformed) scale; beta_2 is passed in as its reciprocal
# to keep the optimizer well scaled
f <- function(par, X, Y, W = rep(1, nrow(X)))
{
    beta <- par
    beta[2] <- 1 / par[2]
    sum(t(W) %*% abs(exp(X %*% beta) - Y))
}

# read in the data and set up variables (note that "table" shadows the
# base R function of the same name)
table <- read.csv("table.csv", header = TRUE)
Y <- as.matrix(table[7])
X <- cbind(1, rev(1:nrow(Y)))
W <- as.matrix(table[6])

# linear regression on log-transformed Y to get starting values,
# storing the reciprocal of beta_2 for optimization simplicity
lm.out <- lm(log(Y) ~ X[, 2], weights = as.vector(W))
par <- c(coefficients(lm.out)[1], 1 / coefficients(lm.out)[2])
b0 <- par

# minimize the absolute-error criterion, then undo the reciprocal
betahat <- optim(par, f, method = "CG", X = X, Y = Y, W = W)$par
betahat[2] <- 1 / betahat[2]

# plot the data with both fits
plot(X[, 2], Y, main = "Price of IBM vs Time",
     xlab = "day", ylab = "adjusted price")
lines(X[, 2], exp(X %*% betahat), col = "blue")
b0[2] <- 1 / b0[2]
lines(X[, 2], exp(X %*% b0), col = "red")
legend(x = 0, y = 120,
       legend = c("Naive Transformed Least Squares",
                  "Absolute Least Error Predictor"),
       fill = c("red", "blue"), bty = "n")
when you write up your replication, it would be helpful if you
explained what the model is that you are replicating. when you do
this, don't just write down the name of the model, since the same
models are often called different things.
what you should do is to write down the model in full mathematical
notation so we can all figure out what it is. the way to do this is
exactly as i've been doing in class, including all the first principles
necessary to produce the likelihood function. this will always include a
stochastic component and a systematic component, and often more.
when you write mathematical notation, be sure that each and every symbol
is defined. if you have an index, such as for an observation number, be
sure to say what it goes from and to (e.g., Y_i for observation i,
i=1,...,n).
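for instance, a complete write-up of a simple linear-normal model (a
generic example, not any particular model from class) would look like
this:

    Y_i \sim N(\mu_i, \sigma^2)      (stochastic component)
    \mu_i = x_i \beta                (systematic component)

where i = 1, ..., n indexes observations, Y_i is the outcome for
observation i, x_i is a 1 x k vector of explanatory variables, \beta
is a k x 1 vector of effect parameters, and \sigma^2 is the variance.
from these first principles the likelihood follows as

    L(\beta, \sigma^2 | y) \propto \prod_{i=1}^n \sigma^{-1}
        \exp\{ -(y_i - x_i \beta)^2 / (2 \sigma^2) \}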
Gary
Hi All
Can somebody help me understand the two types of simulation that Gary
lectured on? I am still a bit confused. I use SPSS for my logit work,
but I strongly believe that we have to move beyond calculating simple
betas and odds and instead give quantities of interest along with
uncertainty.
Suppose Beta = .02501 for education and Beta = .06531 for income in a
logistic regression equation: logit(turnout) = .02501 education +
.06531 income. I would like to see through an example how you would
simulate the impact of race on turnout
1. while holding income and education constant at their means.
2. with income in the 30,000 to 45,000 dollar bracket and less than a
high school education.
Can somebody give an example by drawing three to four samples?
Also, suppose you have used a logistic regression model to produce a
predicted probability of voting for each case in the sample of a state
or an area, counting probabilities below .50 as not voting and above
.50 as voting. How can you show the impact of changing the value of a
variable, e.g., moving everyone with less than a high school education
up to at least a high school education, on the predicted turnout of,
say, 45 percent for the sample?
That is, I would like to be able to say that changing a certain
variable (a kind of first difference) would improve total turnout from
45 percent to 50 percent, or whatever.
I know I can do that in SPSS, but it won't give me uncertainty or
confidence intervals, which most analysts don't give for this type of
"what if" analysis. I am going through Wolfinger and Rosenstone's "Who
Votes?"; excellent work, but no confidence intervals or uncertainty
around the quantities of interest they calculate through probit.
How can you use Zelig for producing such quantities of interest?
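From the lecture I think the workflow looks something like the sketch
below, but please correct me if I have the syntax wrong (the data
frame d and its variable names are made up for illustration):

library(Zelig)
# estimate the logit model
z.out <- zelig(turnout ~ educ + income + race, model = "logit", data = d)
# scenario 1: vary race, holding education and income at their means
# (setx defaults unspecified covariates to their means)
x.0 <- setx(z.out, race = 0)
x.1 <- setx(z.out, race = 1)
# scenario 2: mid-bracket income (37,500 chosen as the midpoint of
# 30,000-45,000) with less than a high school education (11 years)
x.2 <- setx(z.out, income = 37500, educ = 11)
# simulate: summary() reports expected probabilities and the first
# difference between x and x1, each with simulation-based intervals
s.out <- sim(z.out, x = x.0, x1 = x.1)
summary(s.out)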
Bilal
Hi everyone,
The gate in the basement of CGIS has gone down, but I'm still in the
training lab if anyone wants to stop by for the next 30 minutes. You can
enter through the front door of the Fung Library.
Best,
Ian
Hi everyone,
A quick reminder that I will be holding office hours today from 4:30-6:30
in CGIS N018. If you are not in some tropical paradise, and have
questions/concerns/frustrations about the replication project, please feel
free to stop by.
Best,
Ian
Hi,
I'm having trouble installing Zelig on a MacBook laptop (the kind
with the Intel CPU).
For example, running this line
> install.packages("Zelig", repos = "http://gking.harvard.edu")
gives errors
Warning in install.packages("Zelig", repos = "http://gking.harvard.edu") :
argument 'lib' is missing: using ~/Library/R/library
Warning: unable to access index for repository
http://gking.harvard.edu/bin/macosx/i686/contrib/2.2
Warning in download.packages(pkgs, destdir = tmpd, available = available, :
no package 'Zelig' at the repositories
and indeed the URL there doesn't work.
Other repositories and commands (like
source("http://gking.../install.R")) give similar errors.
Is there a Zelig binary somewhere that's compatible with intel macs?
-aram
Hi everyone,
I got an email from the FAS computing staff tonight saying they have
completed the migration of everyone's home directories, and as a result
they have to do some final upgrade on the icegov servers tonight.
They claim this will only affect three users on icegov1 (whom I have
already contacted), but just to be safe, I would recommend backing up all
your files to your local machine tonight if at all possible.
Please let me know ASAP if you are having any server issues.
Best,
Ian