[gov2001-l] NA's in subsetted sample

30 Mar 2008

My problems with merge() from the earlier email seem to have been
caused by a sample that included a lot of NA's for the sort variable,
and I realized that this comes from the way I subset the
sample. The sample used in the study is limited to women 40 or older
who have not had a hysterectomy, so I limited the sample as follows:

#ORIGINAL DATA
brfss <- read.csv("brfss.csv")

#SUBSET
sample2=brfss[brfss$HADHYST2==2,]
sample1=sample2[brfss$AGE >= 40,]

The problem is, for some reason I have a much higher percentage of  
NA's for every variable in the subsetted sample. For example, note  
the differences in proportions of NA's below:

...

sum(as.integer(is.na(brfss$CTYCODE))/length(brfss$CTYCODE))
[1] 0.1380183
...

sum(as.integer(is.na(sample1$CTYCODE))/length(sample1$CTYCODE))
[1] 0.6219678
...
sum(as.integer(is.na(brfss$AGE))/length(brfss$AGE))
[1] 0
...

sum(as.integer(is.na(sample1$AGE))/length(sample1$AGE))
[1] 0.5652472
...
sum(as.integer(is.na(brfss$RACE))/length(brfss$RACE))
[1] 0.0001316560
...

sum(as.integer(is.na(sample1$RACE))/length(sample1$RACE))
[1] 0.565318

I can't think of a plausible cause for this problem. Does anyone have  
some idea why this might happen?
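
For anyone who wants to reproduce this without the BRFSS file, here is a minimal sketch on made-up data. The data frame toy and its values are assumptions for illustration only, not the real data. It applies the same two-step subsetting and measures the NA proportion the same way. It points at two things that may be worth checking: whether HADHYST2 itself contains NA's (a logical index that is NA returns an all-NA row from a data frame), and whether the second step should index sample2 with a condition built from brfss, which has a different number of rows.

```r
# Toy stand-in for brfss; all values are invented for illustration.
toy <- data.frame(
  HADHYST2 = c(2, 2, NA, 1, 2, NA),
  AGE      = c(45, 30, 50, 60, 41, 39)
)

# Same two-step subsetting as above.
# toy$HADHYST2 == 2 is NA wherever HADHYST2 is NA, and each NA in a
# logical row index produces a row filled with NA in the result.
s2 <- toy[toy$HADHYST2 == 2, ]
# The condition here is built from toy (6 rows) but applied to s2,
# so the index and the rows it selects no longer line up.
s1 <- s2[toy$AGE >= 40, ]

# Proportion of NA's in AGE before and after subsetting.
na_before <- mean(is.na(toy$AGE))
na_after  <- mean(is.na(s1$AGE))
print(c(before = na_before, after = na_after))

# One possible alternative: combine both conditions on the original
# data and wrap them in which(), which drops NA comparisons instead of
# turning them into NA rows.
keep  <- which(toy$HADHYST2 == 2 & toy$AGE >= 40)
fixed <- toy[keep, ]
na_fixed <- mean(is.na(fixed$AGE))
print(na_fixed)
```

If the same pattern shows up with the real data, table(is.na(brfss$HADHYST2)) should show how many NA comparisons are being turned into NA rows in the first step.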
