My problems with merge() from the earlier email seem to have been
caused by a sample that included a lot of NAs for the sort variable,
and I realized that these NAs come from the way I subset the sample.
The sample used in the study is limited to women 40 or older who have
not had a hysterectomy, so I limited the sample as follows:
#ORIGINAL DATA
brfss <- read.csv("brfss.csv")
#SUBSET
sample2=brfss[brfss$HADHYST2==2,]
sample1=sample2[brfss$AGE >= 40,]
The problem is that, for some reason, the subsetted sample has a much
higher proportion of NAs than the original data for every variable I
have checked. For example, note the differences in the proportions of
NAs below (brfss is the original data, sample1 is the subset):
sum(as.integer(is.na(brfss$CTYCODE))/length(brfss$CTYCODE))
[1] 0.1380183
sum(as.integer(is.na(sample1$CTYCODE))/length(sample1$CTYCODE))
[1] 0.6219678
sum(as.integer(is.na(brfss$AGE))/length(brfss$AGE))
[1] 0
sum(as.integer(is.na(sample1$AGE))/length(sample1$AGE))
[1] 0.5652472
sum(as.integer(is.na(brfss$RACE))/length(brfss$RACE))
[1] 0.0001316560
sum(as.integer(is.na(sample1$RACE))/length(sample1$RACE))
[1] 0.565318
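As an aside, the proportions above can be computed more compactly,
since the mean of a logical vector is the proportion of TRUE values.
A minimal sketch on a made-up vector (x and prop_na are illustrative
names only, not part of the BRFSS data):

```r
# Hypothetical toy vector standing in for one column; not BRFSS data
x <- c(1, NA, 3, NA, 5)

# Proportion of NAs: is.na() gives a logical vector, and mean() of a
# logical vector is the share of TRUE values
prop_na <- function(v) mean(is.na(v))

# The form used in the calls above, for comparison
long_form <- sum(as.integer(is.na(x)) / length(x))
short_form <- prop_na(x)

# Both forms should agree
print(all.equal(long_form, short_form))
```

Applied to a whole data frame, something like sapply(sample1, prop_na)
would give the NA proportion for every column in one call.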
I can't think of a plausible cause for this problem. Does anyone have
some idea why this might happen?