It's hard to figure out why this happens just from the information given.
The fractions may differ depending on which observations you keep in the
merged dataset (only those that match in x, or also those in y; see the
all, all.x and all.y options in the merge help file). You may want to
delete all missing observations before merging (if you do not plan to use
them anyway).
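
As a rough sketch of that suggestion, on made-up data: the data frames x
and y, the key column id, and the columns a and b below are invented for
illustration and are not taken from the BRFSS file. The idea is to drop
rows with a missing merge key first, then compare how the all options
change which rows (and how many NA's) end up in the result.

```r
# Toy data; x, y, id, a and b are hypothetical names for illustration only.
x <- data.frame(id = c(1, 2, NA, 4), a = c(10, 20, 30, 40))
y <- data.frame(id = c(2, 4, 5, NA), b = c("p", "q", "r", "s"))

# Drop rows whose merge key is missing before merging.
x_ok <- x[!is.na(x$id), ]
y_ok <- y[!is.na(y$id), ]

# Which rows are kept depends on the all / all.x / all.y arguments.
inner <- merge(x_ok, y_ok, by = "id")                # ids present in both
left  <- merge(x_ok, y_ok, by = "id", all.x = TRUE)  # every row of x_ok
full  <- merge(x_ok, y_ok, by = "id", all = TRUE)    # every row of either

# Share of missing values per column in each merged result.
colMeans(is.na(inner))
colMeans(is.na(left))
colMeans(is.na(full))
```

Rows that have no match on the other side are filled with NA, so the
fraction of NA's in a column can go up with all.x = TRUE or all = TRUE
even if the inputs had none.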
-----Original Message-----
From: gov2001-l-bounces at lists.fas.harvard.edu
[mailto:gov2001-l-bounces at lists.fas.harvard.edu] On Behalf Of Keith Schnakenberg
Sent: Sunday, March 30, 2008 10:57 PM
To: gov2001-l at lists.fas.harvard.edu
Subject: [gov2001-l] NA's in subsetted sample
My problems with merge() from the earlier email seem to have been
caused by a sample that included a lot of NA's for the sort variable,
and I realized that this was a problem with the way I subset the
sample. The sample used in the study is limited to women 40 or older
who have not had a hysterectomy, so I limited the sample as follows:
#ORIGINAL DATA
brfss <- read.csv("brfss.csv")
#SUBSET
sample2=brfss[brfss$HADHYST2==2,]
sample1=sample2[brfss$AGE >= 40,]
The problem is, for some reason I have a much higher percentage of
NA's for every variable in the subsetted sample. For example, note
the differences in proportions of NA's below:
sum(as.integer(is.na(brfss$CTYCODE))/length(brfss$CTYCODE))
[1] 0.1380183
sum(as.integer(is.na(sample1$CTYCODE))/length(sample1$CTYCODE))
[1] 0.6219678
sum(as.integer(is.na(brfss$AGE))/length(brfss$AGE))
[1] 0
sum(as.integer(is.na(sample1$AGE))/length(sample1$AGE))
[1] 0.5652472
sum(as.integer(is.na(brfss$RACE))/length(brfss$RACE))
[1] 0.0001316560
sum(as.integer(is.na(sample1$RACE))/length(sample1$RACE))
[1] 0.565318
I can't think of a plausible cause for this problem. Does anyone have
some idea why this might happen?
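
One thing that may be worth checking, sketched here on a toy data frame
(toy and its values are invented; only the column names HADHYST2 and AGE
come from the message above): in the second subsetting step the logical
index is computed from brfss, while the rows are taken from sample2. When
a logical index is longer than the data frame it is applied to, R returns
all-NA rows for the TRUE positions past the last row, and an NA in a
logical index also yields an all-NA row. Either effect could raise the
share of NA's in every column at once.

```r
# Toy stand-in for brfss; the values are invented for illustration.
toy <- data.frame(HADHYST2 = c(2, 1, 2, NA, 2, 1),
                  AGE      = c(45, 50, 30, 60, 70, 41))

# Two-step subsetting in the same pattern as the message: the second
# index comes from the full data frame, not from the subset s2.
s2 <- toy[toy$HADHYST2 == 2, ]   # the NA in HADHYST2 yields an all-NA row
s1 <- s2[toy$AGE >= 40, ]        # index is longer than nrow(s2)

mean(is.na(toy$AGE))   # share of NA's in the full toy data
mean(is.na(s1$AGE))    # share of NA's after the two-step subset

# One alternative: build a single condition on the full data and let
# which() drop positions where the condition is NA.
keep  <- which(toy$HADHYST2 == 2 & toy$AGE >= 40)
fixed <- toy[keep, ]
mean(is.na(fixed$AGE))
```

Besides the added NA rows, the mismatched index in this sketch also keeps
a row with AGE below 40, because the TRUE/FALSE values no longer line up
with the rows they were computed for.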