Hello,
Posted the below to the MatchIt mailing list and was advised there
might also be interest here. Have modified it slightly with some
debugging information on an error that comes out of cem's att()
function when called with the treatment variable as a factor:
-----
Hello,
Firstly thank you for this very useful package. I came to it via
the CEM package in the course of trying to explore how some versions
of PSM would compare on some clinical data we are analyzing to
retrospectively make an estimate of how effective a certain therapy is
for moderately severe hospitalized Covid-19 cases. While doing this
there seemed to be some unexpected behavior with respect to whether
the treatment variable is included as an integer or as a factor, which
I do not think I understand after reading the package documentation,
vignettes, and the explanation of weights (at
https://docs.google.com/document/d/1xQwyLt_6EXdNpA685LjmhjO20y5pZDZYwe2qeNo…).
Briefly, when the treatment variable is a factor, the output of
matchit with the cem method appears to use the opposite convention for
assigning weights [i.e. *un*treated weights are 1 and treated weights
are presumably (m_C/m_T)*(m^s_T/m^s_C) ] from what comes out of cem
directly (regardless of whether the treatment variable is integer or
factor in the cem input data). So then after applying match.data() to
the matchit outputs, the results of subsequent regressions (have
tested lm and glm) with the corresponding weights also differ
depending on whether the treatment variable is coded as integer or as
a factor. Is this an expected behavior?
Here is a somewhat minimal example using the LaLonde data, with
apologies for its inelegance:
####
library(cem)
library(MatchIt)
data("LeLonde")
#set up a data frame with treatment as integer (Le) and another with
#treatment.f as factor (Le.f)
Le <- data.frame(na.omit(LeLonde))
Le.f <- Le
Le.f$treated <- as.factor(Le.f$treated)
colnames(Le.f)[which(colnames(Le.f) == 'treated')] <- 'treated.f'
#for simplicity will match only on age, re74, and q1; here make lists
#of the other variables to drop in the cem commands
LeLonde_vars_to_match <- c("age", "re74", "q1")
LeLonde_vars_to_drop <- setdiff(names(Le), LeLonde_vars_to_match)
LeLonde_vars_to_drop.f <- setdiff(names(Le.f), LeLonde_vars_to_match)
#use cem directly to match between treated/untreated on age, re74, and
#q1; first with treated as integer, second with treated.f as factor
mat.cem <- cem(treatment = "treated", data = Le, drop =
LeLonde_vars_to_drop, eval.imbalance = TRUE, keep.all = TRUE)
mat.f.cem <- cem(treatment = "treated.f", data = Le.f, drop =
LeLonde_vars_to_drop.f, eval.imbalance = TRUE, keep.all = TRUE)
identical(mat.cem$w, mat.f.cem$w) #is true, weights from cem do not
#depend on whether treatment variable is integer or factor
#use matchit with method cem to match between treated/untreated on
#age, re74, and q1; first with treated as integer, second with
#treated.f as factor
mat.mi <- matchit(treated ~ age + re74 + q1, Le, method = "cem")
mat.f.mi <- matchit(treated.f ~ age + re74 + q1, Le.f, method = "cem")
identical(mat.mi$weights, mat.f.mi$weights) #is false, weights from
#matchit with method cem do depend on whether treatment variable is
#integer or factor, seemingly by different choice of whether control or
#treated weights are set to 1, as the same data entries are selected by
#the match, as suggested by plot(match.data(mat.mi)$re78,
#match.data(mat.f.mi)$re78) being linear with slope 1
identical(mat.cem$w, mat.mi$weights) #is true, indicating that the
#convention from cem coincides with using matchit with method cem when
#the treatment variable is integer
#generate matched datasets, first with treated as integer, second with
#treated.f as factor
matched.data.mi <- match.data(mat.mi)
matched.data.f.mi <- match.data(mat.f.mi)
#compute the linear models for re78 ~ treated, first with treated as
#integer, second with treated.f as factor
mod.mi <- lm(re78 ~ treated, matched.data.mi, weights = matched.data.mi$weights)
mod.f.mi <- lm(re78 ~ treated.f, matched.data.f.mi, weights =
matched.data.f.mi$weights)
identical(mod.mi$coefficients["treated"],
mod.f.mi$coefficients["treated.f1"]) #is false, the estimates depend
#on whether the treatment variable is integer or factor
mod.mi.g <- glm(re78 ~ treated, data = matched.data.mi, weights =
matched.data.mi$weights)
mod.f.mi.g <- glm(re78 ~ treated.f, data = matched.data.f.mi, weights
= matched.data.f.mi$weights)
identical(mod.mi.g$coefficients["treated"],
mod.f.mi.g$coefficients["treated.f1"]) #is false, the estimates depend
#on whether the treatment variable is integer or factor
####
There is a separate peculiarity when using cem directly with the
treatment variable as a factor, where the att() command fails with
error messages saying that treated.f is not a factor, for instance
after running the above code:
####
est.f.cem <- att(mat.f.cem, re78 ~ treated.f, data = Le.f)
#Error: variable 'treated.f' was fitted with type "factor" but type
#"numeric" was supplied
#In addition: Warning message:
#In model.frame.default(Terms, newdata, na.action =
#na.action, xlev = object$xlevels) :
# variable 'treated.f' is not a factor
####
Debugging this it seems that in line 292 of att.R in the github repo,
a temporary data variable's treatment column is assigned 0, which
makes it a numeric type, and then this goes to the predict() function
along with the cem object which still expects the original factor
type, which I think is the root of the discrepancy. Maybe adding a
line there checking if the treatment variable is a factor, and if so
casting the temporary data's treatment column to a factor if that is
so, would be appropriate?
In the short term I think a reasonable solution is just not to cast
the treatment variable as a factor in matchit when using cem or when
using cem/att - am sorry if this was advised in the documentation
somewhere and I missed it - but it might be worth considering if
matchit should be able to tolerate having the treatment variable as a
factor as well (or raise a runtime alert about possibly different
output or potential errors).
Thank you,
Alberto