[gov2001-l] NA's in subsetted sample - Gov2001

30 Mar 2008

My problems with merge() from the earlier email seem to have been
caused by a sample that included a lot of NA's for the sort variable,
and I realized that this was a problem with the way I subset the
sample. The sample used in the study is limited to women 40 or older
who have not had a hysterectomy, so I limited the sample as follows:

#ORIGINAL DATA
brfss <- read.csv("brfss.csv")

#SUBSET
sample2=brfss[brfss$HADHYST2==2,]
sample1=sample2[brfss$AGE >= 40,]

The problem is, for some reason I have a much higher percentage of  
NA's for every variable in the subsetted sample. For example, note  
the differences in proportions of NA's below:

sum(as.integer(is.na(brfss$CTYCODE))/length(brfss$CTYCODE))
[1] 0.1380183

sum(as.integer(is.na(sample1$CTYCODE))/length(sample1$CTYCODE))
[1] 0.6219678

sum(as.integer(is.na(brfss$AGE))/length(brfss$AGE))
[1] 0

sum(as.integer(is.na(sample1$AGE))/length(sample1$AGE))
[1] 0.5652472

sum(as.integer(is.na(brfss$RACE))/length(brfss$RACE))
[1] 0.0001316560

sum(as.integer(is.na(sample1$RACE))/length(sample1$RACE))
[1] 0.565318
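
The proportions above can also be computed more compactly, since mean()
of a logical vector gives the share of TRUE values. A minimal sketch;
na_prop and the toy data are hypothetical names made up for
illustration, not part of the BRFSS file:

```r
# Proportion of NA's in a vector: is.na() returns a logical vector,
# and mean() of a logical vector is the fraction of TRUE values.
na_prop <- function(x) mean(is.na(x))

# Toy vector (made up, not BRFSS data)
x <- c(1, NA, 3, NA)
na_prop(x)

# The same helper applied to every column of a data frame at once
toy <- data.frame(a = c(1, NA, 3, NA), b = c("x", "y", NA, "z"))
sapply(toy, na_prop)
```

With the real data this would be called as na_prop(brfss$CTYCODE) or
sapply(sample1, na_prop).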

I can't think of a plausible cause for this problem. Does anyone have  
some idea why this might happen?
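
To make the pattern easier to poke at without the full file, here is a
minimal sketch on a made-up data frame. The column names are copied
from the code above, but toy and its values are hypothetical, and I am
only assuming the real data behaves similarly (in particular, that
HADHYST2 contains some NA's). It repeats the same two subsetting steps
and, for comparison, does the selection in a single subset() call,
which drops rows where the condition evaluates to NA:

```r
# Made-up data: column names from the post, values hypothetical.
toy <- data.frame(
  HADHYST2 = c(2, 2, NA, 1, 2, 1),
  AGE      = c(45, 30, 50, 60, 41, 38)
)

# Same two steps as in the post: the second index is built from the
# full data frame (6 rows) but applied to the already-reduced one.
s2 <- toy[toy$HADHYST2 == 2, ]
s1 <- s2[toy$AGE >= 40, ]

# Proportion of NA's in AGE at each stage
mean(is.na(toy$AGE))
mean(is.na(s2$AGE))
mean(is.na(s1$AGE))

# For comparison: both conditions in one subset() call on the original
alt <- subset(toy, HADHYST2 == 2 & AGE >= 40)
mean(is.na(alt$AGE))
```

If the real data shows the same difference between the two approaches,
the indexing step would be the first place I would look.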