Hey Everyone,
Sana had two great questions that will be helpful for the problem set, so I
wanted to forward them to the list.
Here is the first:
> In problem 2.1, we're asked to control "for the above-mentioned
> covariates". Does this include turnout, which
> was a dependent variable in the first model?
We do not want you to include turnout, but we do want you to include the
interaction term that was in the first model.
And here is the second question:
> Again about problem 2.1:
> Zelig reports back the coefficients, as well as three 'intercepts'. I
> thought the intercepts referred to the threshold parameters, but if
> that's the case, why are there three instead of two, since the
> dependent variable (level of attention) has only four categories?
Zelig is reporting three "intercepts" (which I'll call thresholds)
because it identifies the model by assuming a variance for the latent
variable, and by assuming that the intercept (\beta_0) is 0. This allows
Zelig to estimate three threshold parameters. Remember, these parameters
need to describe three walls for a four-category dependent variable: the
wall between categories 1 and 2, 2 and 3, and 3 and 4.
In the section code, we used a different identifying assumption: that the
first threshold is zero. In that case, we estimate only two threshold
parameters plus an intercept. Also remember that these two identifying
assumptions estimate the same model, just under different
parameterizations.
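If you want to see this concretely, here is a minimal sketch using polr
from the MASS package (which, as far as I know, is what Zelig calls under
the hood for ordered models; the data here are made up):

library(MASS)
set.seed(1)
## made-up data: a 4-category ordered outcome yields 3 thresholds
d <- data.frame(attention = factor(sample(1:4, 200, replace = TRUE),
                                   ordered = TRUE),
                x = rnorm(200))
summary(polr(attention ~ x, data = d))  # "Intercepts": 1|2, 2|3, 3|4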
Thanks for the great questions--please do not hesitate to email the list
with any others.
Cheers,
Justin
Hi all,
I have another R question about subsetting data. We're trying to
construct a subset of everyone in our dataset who answered yes to a
certain question (BPMED=1, n=5091), but because of the skip patterns, not
everyone was asked the question, so there are a lot of missing values.
When we try to subset based on BPMED==1, the result still includes the
rows with NA values. Does anyone know how to exclude NA values when
subsetting? Our code is shown below, in case that's more helpful than my
explanation. Basically, we want a subset that contains only the 5091
people for whom BPMED=1.
> #subset among BP medicines users
> hrs.BPmeds <- hrs[hrs$BPMED==1,]
> #verify
> dim(hrs.BPmeds)
[1] 9760 129
> table(hrs.BPmeds$BPMED)
1
5091
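We suspect the NAs come along because hrs$BPMED == 1 evaluates to NA for
the skipped respondents, and subsetting with a logical vector keeps those
positions as all-NA rows. If that's right, is something like

## which() keeps only the TRUE positions, silently dropping the NAs
hrs.BPmeds <- hrs[which(hrs$BPMED == 1), ]
## or, with the NA handling made explicit
hrs.BPmeds <- hrs[!is.na(hrs$BPMED) & hrs$BPMED == 1, ]

the right fix?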
Thanks so much for any advice you have.
Katy and Sheila
--
Katy Backes Kozhimannil, M.P.A.
Ph.D. Program in Health Policy
Resident Tutor, Adams House
Harvard University
474 Adams Mail Center
Cambridge, MA 02138
kbackes at fas.harvard.edu
Hello all,
here are some more details regarding the upcoming replication
assignment. The papers are due AT THE BEGINNING of class on April 2nd.
Please bring the following things with you:
For each group,
1) THREE CDs that contain all the data needed to replicate the paper,
the article you chose to replicate as a pdf, electronic codebooks
for the data, and your R code. If you have several datasets,
please also include a readme file that describes the overall structure
of the data.
2) TWO hardcopies of the article you chose to replicate for Justin and me.
The replication paper should mostly consist of the replication of the
key tables of the original article. Quickly describe to what extent you
were able to replicate the original findings, and indicate in what ways
you plan to improve/extend the original paper (2-3 pages).
At the end of lecture, please stick around for a couple of minutes so
that we can give you the CDs prepared by one of the other groups. We
will try to match groups according to research interests/background.
Note that extension school students do not write final papers and thus
do not replicate articles. They do, however, replicate other groups'
replications. Several groups will therefore receive comments from two
sources: another group and a long-distance student. We will send an
additional email to long-distance students with more details later. Also
note that the reaction papers to the replications are co-authored by each
group, not written individually.
Your replication of another group's replication will be due on Monday,
April 9. You are graded according to how constructive, not destructive,
you are. You should make sure that the other group successfully
replicated the major claims of the original article by running their R
code and then comment on their plans for extending/improving it. The
more concrete advice you can provide on how to improve the original
research, the better. Your comments should generally run between 3 and 5
pages.
cheers,
Holger
--
Holger Lutz Kern
Graduate Student
Department of Government
Cornell University
Institute for Quantitative Social Science
Harvard University
1737 Cambridge Street N350
Cambridge, MA 02138
www.people.cornell.edu/pages/hlk23
Hi all,
we had a small typo in the solutions for ps 5. The p-value for the LR
test was calculated incorrectly. We've posted corrected versions of the
writeup and R code.
cheers,
Holger
--
Holger Lutz Kern
Graduate Student
Department of Government
Cornell University
Institute for Quantitative Social Science
Harvard University
1737 Cambridge Street N350
Cambridge, MA 02138
www.people.cornell.edu/pages/hlk23
Okay, I know there have already been a million questions about this, but
here's one more 11th-hour attempt:
Because there are 23 actual observations, "holding pressure at its observed
values" implies that there will be 23 observations even when temperature is
pinned to a particular value. Therefore, when we calculate the probability
of failure for a given temperature (say, 31 degrees), we will get a VECTOR
of probabilities for any single beta vector.
Then, since we of course want to draw many beta-vectors from the relevant
distribution, we end up with many VECTORS of probabilities.
One of the preceding emails stated that we should average across the 23 to
get a single probability. This seems fishy, as we would then be averaging
once across the 23 observed pressures to get a single probability for a
given beta, and then again across our draws of beta. This would introduce
two layers of averaging, each with variability that in theory we would
have to propagate through.
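Concretely, the bookkeeping I have in mind looks something like this (a
sketch; the object names are made up):

library(MASS)
fit   <- glm(failure ~ temp + pressure, family = binomial, data = orings)
betas <- mvrnorm(1000, coef(fit), vcov(fit))   # draws of beta
X31   <- cbind(1, 31, orings$pressure)         # temp pinned at 31; the 23
                                               # observed pressures
probs <- plogis(betas %*% t(X31))              # 1000 x 23 matrix
ev31  <- rowMeans(probs)        # average across the 23 obs WITHIN each draw
quantile(ev31, c(.025, .975))   # spread ACROSS draws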
Is this indeed correct, or am I totally misunderstanding the situation?
Phew.
Any thoughts?
Tom
Hi all,
My partner Sheila and I are both new to R this semester, and have come
across a few challenges as we approach our replication project. We have
a few questions about some of the basics of using R for analysis and
were hoping that some of you may have thoughts or suggestions:
1) How do you change the R/Zelig default settings so that output does
not appear in scientific notation?
2) How do you subset data to run a regression on just that subset in
Zelig?
3) What is the best way to view cross tabulations of data and to add row
or column percentages? (Our partial attempts at 1) through 3) are sketched
after this list.)
4) What is a general rule of thumb for determining the most appropriate
R format for different types of variables (i.e., as.matrix, as.factor,
as.numeric)? We have run into a few different error messages regarding
formatting, and that has been difficult for us to troubleshoot without
much intuition for the different formats.
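Here is roughly what we have pieced together so far for 1) through 3) (a
sketch; the variable names are made up)--corrections very welcome:

## 1) discourage scientific notation in printed output
options(scipen = 10)

## 2) subset first (which() drops NAs), then hand the subset to zelig()
hrs.sub <- hrs[which(hrs$BPMED == 1), ]
## z.out <- zelig(y ~ x1 + x2, model = "logit", data = hrs.sub)

## 3) cross tabulation with row percentages (margin = 2 for columns)
tab <- table(hrs$BPMED, hrs$FEMALE)   # FEMALE is a made-up variable
prop.table(tab, margin = 1) * 100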
Thanks so much for any tips you can provide - we really appreciate it.
Best,
Katy
--
Katy Backes Kozhimannil, M.P.A.
Ph.D. Program in Health Policy
Resident Tutor, Adams House
Harvard University
474 Adams Mail Center
Cambridge, MA 02138
kbackes at fas.harvard.edu
Hi everyone,
I'm trying to tackle 1.6. Maybe some of you can help.
I'm basically trying to follow what Gary does on pp. 84-86 of his book. I
ran a Zelig logit twice, once using the date variable and once without it.
Then I took the coefficient estimates and plugged them into our logit
model, \pi = 1/(1 + exp(-X\beta)), thus getting two sets of \pi estimates
(again, one with the date variable and one without).
But when I took the ratio of the two \pi estimates according to the
formula on pg. 84, R = (-2) * log(logit.no.date/logit.date), I got
multiple R values, some of which are negative (which I know is wrong--R
should always be positive, since the model with more variables will always
have more explanatory power).
What's wrong with this approach?
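(I wonder whether the problem is that the formula compares the models'
likelihoods--the product of the \pi_i over all the observations, or
equivalently the sum of their logs--so that R comes out as a single
number, something like this sketch with made-up names:

logit.date    <- glm(y ~ x + date, family = binomial, data = d)
logit.no.date <- glm(y ~ x,        family = binomial, data = d)
R <- -2 * (logLik(logit.no.date) - logLik(logit.date))  # one scalar, >= 0
pchisq(as.numeric(R), df = 1, lower.tail = FALSE)       # p-value

rather than one R per observation?)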
thanks for any help--
Maya
Hi Gavril,
it should be one graph, with the x-axis displaying
the temperature and the y-axis displaying the
expected probability given this temperature,
leaving pressure at its observed values. The
expected probabilities at each temperature will be
the averages of the expected probabilities for all
23 observations.
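In code, the graph would look something like this sketch (the object
names are made up):

fit   <- glm(failure ~ temp + pressure, family = binomial, data = orings)
temps <- seq(31, 81, by = 1)
ev    <- sapply(temps, function(t) {
  X <- cbind(1, t, orings$pressure)   # pin temperature, keep the 23
                                      # observed pressures
  mean(plogis(X %*% coef(fit)))       # average over the 23 observations
})
plot(temps, ev, type = "l", xlab = "Temperature",
     ylab = "Expected probability of failure")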
Hope that helps,
Holger
Bilev, Gavril wrote:
> Hey Holger,
> sorry to bug you again - but just a quick question to make sure - in 1.4 you expect 3 different graphs, correct? 1 for every different level of Pressure (there are only 3) or do you expect us to average them somehow and combine them into 1?
> Best,
> Gav
--
Holger Lutz Kern
Graduate Student
Department of Government
Cornell University
Institute for Quantitative Social Science
Harvard University
1737 Cambridge Street N350
Cambridge, MA 02138
www.people.cornell.edu/pages/hlk23
I know this has been discussed on at least one prior thread, but I'm not
sure I understand (or even see) the conceptual difference between problems
1.3 and 1.4 on the problem set.
I know that's a vague question, so thanks for any help--
Maya
Hi Patrick,
you are almost right. However, even in a logit
model, expected and predicted values are not the
same. The shortcut you're referring to saves you
steps 4 and 5 of the expected value algorithm.
Compare that to the 4 steps in the algorithm for
the predicted value ...
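In sketch form (with made-up objects), the distinction is:

library(MASS)
fit   <- glm(y ~ x, family = binomial, data = d)
betas <- mvrnorm(1000, coef(fit), vcov(fit))  # step 1: draw betas
x0    <- c(1, 0.5)                            # a chosen covariate profile
pis   <- as.vector(plogis(betas %*% x0))      # steps 2-3: probabilities

## expected values: steps 4-5 average M Bernoulli draws per beta; in
## logit that average converges back to pi, hence the shortcut
ev <- sapply(pis, function(p) mean(rbinom(1000, 1, p)))

## predicted values: stop after ONE Bernoulli draw per beta, keeping the
## fundamental uncertainty--these are 0s and 1s, not probabilities
pv <- rbinom(length(pis), 1, pis)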
Holger
Patrick Lam wrote:
> Hi Holger,
>
> I had a question on the problem set. In 1.4 and 1.5, when you refer to
> expected probabilities versus predicted probabilities, I'm not sure what
> you mean. According to King, Tomz, and Wittenberg, we should take the
> following steps.
>
> 1) Draw a value of beta.
> 2) Multiply it by X
> 3) Transform that into a probability.
> 4) Draw M simulations (1s and 0s) from the Bernoulli distribution using
> that probability. (this accounts for fundamental uncertainty)
> 5) Average out the simulations. This is the expected probability and in
> logit, is equal to the probability derived in step 3.
> 6) Repeat for all betas
>
> My question is what you mean by expected versus predicted
> probabilities. I think you are trying to get at how the expected and
> predicted probabilities are the same in the logit case. So is the
> expected probability that you are referring to just the probabilities
> without accounting for fundamental uncertainty ( i.e. just taking the
> probabilities from step 3) and the predicted probabilities are just
> going through all the steps? Or do you mean that the expected
> probabilities are when M is large and the predicted probabilities are
> when M is small in step 4.
>
> Thanks
>
> -Patrick
--
Holger Lutz Kern
Graduate Student
Department of Government
Cornell University
Institute for Quantitative Social Science
Harvard University
1737 Cambridge Street N350
Cambridge, MA 02138
www.people.cornell.edu/pages/hlk23