Hi all,
Did anyone encounter any performance issues? We ran into a running-time bottleneck and are stuck.
Our dataset contains about 1 million entries, and iterating through it with a few manipulations takes over 24 hours. :(
Does anyone have any general pointers on making R code more efficient? For example, we gather that operations like dat[dat[['DATE']] == myDate,] are very expensive. Is this true?
It's no surprise that R, like MATLAB, trades off concise code against performance, but we need to get this replication done somehow. Perhaps doing some initial data-filtering tasks in C++ is a viable solution? :)
How do people usually deal with the problem of "too much data"?
Thank you!!
-Alexei and Ben
Dear All,
Apologies for the long message, but we didn't think that our
questions would make sense otherwise. We're having some problems
getting optim to work with our model & we'd be grateful for advice.
The paper we're trying to replicate involves a twin study that compares identical (MZ) and non-identical (DZ) twins in an attempt to estimate genetic and environmental effects. Sets of twins are modelled using a bivariate normal, with identical and non-identical twins having different but related distributions - basically, the model is that identical and non-identical twins share genetic causes 100% and 50% respectively, while everything else (environment) is assumed to be equivalent. The model is a simple one - it assumes 3 latent causal variables: genetic, shared environment and non-shared environment. We are (at least initially) interested in individual traits (e.g., Speaking at age 7, as measured by teacher assessment). If we get past this, then we'll try to replicate their longitudinal modelling. We are interested in the proportions of these effects on the variance & covariance.
So we've created a log-likelihood function based on this (see the R code pasted below) and we want to produce an MLE for the three factors. But optim doesn't work - it produces the following error message:

Error in solve.default(vcvm) :
Lapack routine dgesv: system is exactly singular

We've tried all the different methods that optim offers (except "L-BFGS-B") and all of them (except "SANN") produce the same error message (sometimes referring to vcvd). We've left SANN going all night and it hasn't produced an answer. Now, vcvd and vcvm are variance-covariance matrices, and our function does attempt to invert them using solve, but the way we have set them up should not allow singular matrices to be produced (see the code). However, we've tried a range of different starting values & we get the same errors.
Our data has some odd features. After excluding some cases, there are roughly 2000 identical (MZ) twins and roughly 3500 non-identical (DZ) twins, with lots of missing values. But many of the traits that we're interested in have a very small scale - Speaking at age 7 is just a 3-level scale (1, 2 & 3) - and of course the trait for many of the twins (about 75% of them) is exactly the same. We can argue about whether this is a good model - but our initial job is to replicate the results of the paper. There are several other traits that are more straightforward ('g', or intelligence, is measured as z-scores).
So the specific questions are:
1. Why is optim doing this, and should we use something different? The help file on optim suggests that optim might be problematic for parameters with more than one dimension (which is really what we've got here, since we use the 3 variance parameters to create a variance-covariance matrix). The help file suggests using optimize, but it's not at all clear how we should use optimize - its arguments don't appear to be equivalent to optim's (e.g., there is no par). Also, we're not really interested in mu - should we just work with the actual mean of the trait from our data, which would presumably be faster?
2. Is there a way of avoiding the for loop in the code? It seems to us that this will be really inefficient, & optim certainly takes a while to return the error message (> 2 minutes). We've tried using apply with a function we define ourselves:

mu <- c(1,1)
vcv <- matrix(c(1,0,0,1),2,2)
D <- matrix(c(1,2,3,4),2,2)
muvcv.fun <- function(x,mu,vcv) (x-mu) %*% vcv %*% (x-mu)
apply(D,1,FUN = muvcv.fun(mu = mu, vcv = vcv))

But we get the following error message:

Error in muvcv.fun(mu = mu, vcv = vcv) :
argument "x" is missing, with no default

So we're unsure what the syntax should be for passing fixed parameters to a function used inside apply (or inside optim). We've tried various options that seemed intuitive / sensible to us, although doubtless we're missing something obvious.
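For reference, the call runs if the function object itself is passed as FUN and the fixed arguments are forwarded through apply's '...':

mu <- c(1, 1)
vcv <- matrix(c(1, 0, 0, 1), 2, 2)
D <- matrix(c(1, 2, 3, 4), 2, 2)
muvcv.fun <- function(x, mu, vcv) (x - mu) %*% vcv %*% (x - mu)
# Pass muvcv.fun itself; mu and vcv are forwarded to it by apply
apply(D, 1, muvcv.fun, mu = mu, vcv = vcv)

optim forwards extra named arguments to the objective function the same way, which is what the fit.1 call below already relies on.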
3. We then have to calculate the root mean square error of approximation (RMSEA). In the original paper this is calculated by the software the authors use - Mx, which seems to be used pretty exclusively for twin studies, although RMSEA itself is fairly commonplace. The formula we've found for RMSEA is sqrt[(chi^2/df - 1)/(N - 1)], but it's not very clear how to use it. We think we have 6 parameters (the variances and covariances for the identical and non-identical twins) minus 3 for the three variance proportions that we're estimating, hence 3 df. Then N is the total sample size. But to get chi^2, do we take (obs - exp)^2/exp?
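As a sketch of one reading of that formula (with the conventional clamp at zero when chi^2 < df; the example values are hypothetical):

# RMSEA = sqrt((chi2/df - 1)/(N - 1)), clamped at 0
rmsea <- function(chi2, df, N) sqrt(max((chi2 / df - 1) / (N - 1), 0))
rmsea(chi2 = 7.8, df = 3, N = 5500)  # hypothetical numbers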
Thanks
Jeremy & Jill (Hohenstein)
Here's the R code:

# Log-likelihood for MZ & DZ twins: simple comparison of a single trait,
# based on an ACE model & the bivariate normal distribution
ll.dbnmbn <- function(par, Yd, Ym, Xd, Xm) {
  mu <- par[1]
  sa <- exp(par[2])  # re-parameterize R -> R+: sigma^2 genetic (hereditary) variance
  sc <- exp(par[3])  # shared-environment variance
  se <- exp(par[4])  # non-shared-environment variance
  # Variance-covariance matrices: DZ twins share genetic causes 50%, MZ 100%
  vcvd <- matrix(c(sa + sc + se, sa/2 + sc, sa/2 + sc, sa + sc + se), 2, 2)
  vcvm <- matrix(c(sa + sc + se, sa + sc,   sa + sc,   sa + sc + se), 2, 2)
  D <- rbind(Yd, Xd)  # 2 x N(DZ) matrix of bivariate DZ twin trait scores
  M <- rbind(Ym, Xm)  # 2 x N(MZ) matrix of bivariate MZ twin trait scores
  vcvd.inv <- solve(vcvd)  # invert each matrix once, outside the loops
  vcvm.inv <- solve(vcvm)
  d <- 0
  for (i in 1:length(Xd)) {  # sum of (x - mu)' vcvd^-1 (x - mu) over DZ twins
    d <- d + (D[, i] - mu) %*% vcvd.inv %*% (D[, i] - mu)
  }
  m <- 0
  for (i in 1:length(Xm)) {  # sum of (x - mu)' vcvm^-1 (x - mu) over MZ twins
    m <- m + (M[, i] - mu) %*% vcvm.inv %*% (M[, i] - mu)
  }
  # Log-likelihood based on the bivariate normal - related but different
  # for DZ and MZ twins
  out <- -length(Xd) * log(2 * pi * sqrt(det(vcvd))) - 0.5 * d -
          length(Xm) * log(2 * pi * sqrt(det(vcvm))) - 0.5 * m
  return(out)
}

fit.1 <- optim(par = c(0, 2.3, 1.5, 13.5), ll.dbnmbn, Yd = Yd, Ym = Ym,
               Xd = Xd, Xm = Xm, method = "BFGS",
               control = list(fnscale = -1), hessian = TRUE)
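On question 2, one loop-free formulation of the quadratic-form sums (a sketch; quad.sum is a hypothetical helper name): for a 2 x N score matrix S, scalar mu, and inverse matrix Vinv, sum(Z * (Vinv %*% Z)) with Z = S - mu equals the sum over columns of (s_i - mu)' Vinv (s_i - mu).

quad.sum <- function(S, mu, Vinv) {
  Z <- S - mu               # centre every score at mu
  sum(Z * (Vinv %*% Z))     # equals sum over i of z_i' Vinv z_i
}
# e.g., inside ll.dbnmbn: d <- quad.sum(D, mu, vcvd.inv)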
Dr Jeremy Hodgen
Senior Lecturer in Mathematics Education
King's College London
Department of Education and Professional Studies
Franklin-Wilkins Building
Waterloo Bridge Wing
150 Stamford Street
London SE1 9NH
Tel: 020 7848 3102
Fax: 020 7848 3182
E-mail: jeremy.hodgen at kcl.ac.uk
I am using Mac OS and I have been getting malloc errors while trying various tasks. I've spent about three hours just trying to solve these problems, but it seems as though there isn't much else I can do. I've tried opening R from Terminal with various startup commands, and various values for the startup parameters, with no success. I'm not really sure where to go from here. Does anybody have any suggestions?
My problems with merge() from the earlier email seem to have been caused by a sample that included a lot of NA's for the sort variable, and I realized that this was a problem with the way I subset the sample. The sample used in the study is limited to women 40 or older who have not had a hysterectomy, so I limited the sample as follows:
#ORIGINAL DATA
brfss <- read.csv("brfss.csv")
#SUBSET
sample2=brfss[brfss$HADHYST2==2,]
sample1=sample2[brfss$AGE >= 40,]
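For comparison, a sketch of the same subset built in one step against brfss, so the logical index and the rows it selects stay aligned (indexing sample2 with a logical vector as long as brfss reads past its end and returns all-NA rows):

# One-step subset: the index is built from the same data frame it
# subsets; which() also drops rows where the test comes out NA
keep <- which(brfss$HADHYST2 == 2 & brfss$AGE >= 40)
sample1 <- brfss[keep, ]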
The problem is, for some reason I have a much higher percentage of
NA's for every variable in the subsetted sample. For example, note
the differences in proportions of NA's below:
> sum(as.integer(is.na(brfss$CTYCODE))/length(brfss$CTYCODE))
[1] 0.1380183
> sum(as.integer(is.na(sample1$CTYCODE))/length(sample1$CTYCODE))
[1] 0.6219678
> sum(as.integer(is.na(brfss$AGE))/length(brfss$AGE))
[1] 0
> sum(as.integer(is.na(sample1$AGE))/length(sample1$AGE))
[1] 0.5652472
> sum(as.integer(is.na(brfss$RACE))/length(brfss$RACE))
[1] 0.0001316560
> sum(as.integer(is.na(sample1$RACE))/length(sample1$RACE))
[1] 0.565318
I can't think of a plausible cause for this problem. Does anyone have
some idea why this might happen?
Hey list,
I'm trying to compute a Wald chi-squared for a random-effects logit model. It works when I do a regular logit, but I get this error with the RE logit:
waldtest(t1c1, test=("Chisq")) ## library(lmtest)
Error in update.default(fm, update) : need an object with call component
In addition: Warning message:
In object$call : $ operator not defined for this S4 class, returning NULL
...?
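In case it helps anyone answer, here is the by-hand version of the statistic we have in mind (a sketch, assuming t1c1 is an lme4-style fit where fixef() and vcov() work; it jointly tests all slopes against zero):

b <- fixef(t1c1)[-1]                        # fixed-effect slopes, intercept dropped
V <- as.matrix(vcov(t1c1))[-1, -1, drop = FALSE]
W <- as.numeric(t(b) %*% solve(V) %*% b)    # Wald chi-squared statistic
pchisq(W, df = length(b), lower.tail = FALSE)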
Also, is it possible to calculate Spearman's rho for logit / RE logit models in R?
Thanks,
didi & shahrzad
Hi all,
We are trying to get "predict" to work after a survey glm, but we get:
Error in UseMethod("predict") : no applicable method for "predict"
Any guesses why this might be happening?
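For context, the pattern we are attempting looks like this (a sketch; fit and newdat are placeholders):

library(survey)  # predict() dispatches on class, so survey must be attached
pred <- predict(fit, newdata = newdat, type = "response")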
Best,
Goodarz
I am merging two data frames based on common county codes assigned in the same way in both datasets. Here is the code:

data <- merge(sample, arf, by.x="CTYCODE", by.y="f00012")

I do not get any errors or warnings, but the merged data frame is just a list of variable names with no rows below them.
I haven't used merge() before, and I am having difficulty figuring out what might cause this. It's kind of costly to experiment with a bunch of different arguments, as each run takes about 20 minutes. Does anybody have any advice?
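A few cheap checks that might be worth running first (a sketch): an empty merge usually means the key columns share no values or differ in type.

str(sample$CTYCODE)                  # same type/class on both sides?
str(arf$f00012)
sum(sample$CTYCODE %in% arf$f00012)  # 0 here would explain the empty result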
Hi,
I am trying to fit a fixed-effects model in R and keep getting various error messages. In Stata my code is

xtreg y x1 x2.......x20, fe(z)

where y is my outcome variable, x1-x20 are explanatory variables, and z is the level at which I want to calculate the fixed effects (here the district level).
I have tried this in Zelig using model (ls.mixed) and in nlme, but it takes hours to calculate and I usually get strange error messages; I may have the wrong syntax. Any ideas on how to convert the Stata command above into R?
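For example, is something like this the right translation (a sketch, with a hypothetical data frame df holding y, x1..x20 and the district identifier z)? Including factor(z) absorbs the district effects - the dummy-variable (LSDV) estimator:

# '.' is every column except y; drop the raw z, add it back as a factor
fit <- lm(y ~ . - z + factor(z), data = df)
summary(fit)

(The plm package's "within" estimator reportedly does the same demeaning without building all the dummies, which may be faster when there are many districts.)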
thanks!
Aaka
Dear List,
I am running a logit model on the following data, and I have truncated my data set (a panel data set with country-year as the unit of analysis) from the original 418 country-year observations (rows) to 275 observations (rows). This is the data.4 that I specify in model.4 below:
model.4 <- svydesign(ids=idcode, strata=NULL, weights=NULL, data=data.4)
However, when I attempt to run the logit model in the survey package, I enter this:

#Running Model 1 with three individual year lags in DV included
multilag.logit1 <- svyglm(dintensity ~ lag1yr1 + lag2yr1 + lag3yr1 + polity2l +
    gdpenl + oil + lpop + epop + p9010.prime + provcont + eprovcont + lsbextex +
    lsgrtex + lsbextex*p9010.prime + epop*p9010.prime + epop*lsgrtex +
    provcont*eprovcont,
    design = model.4, family = quasibinomial(link = "logit"))
and get the following error:
Error in `$<-.data.frame`(`*tmp*`, ".survey.prob.weights", value = c(
0.00239234449760766, :
replacement has 418 rows, data has 275
I am confused because nothing in my data is 418 rows long any more. Any
clue as to what R is trying to tell me?
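One thing I am now checking (a sketch): giving ids as a one-sided formula, so svydesign looks idcode up inside data.4 rather than in the workspace, where a leftover 418-row idcode vector from the untruncated data would produce exactly this mismatch.

library(survey)
# ~idcode makes svydesign find the variable inside data.4
model.4 <- svydesign(ids = ~idcode, strata = NULL, weights = NULL, data = data.4)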
Many thanks,
Cat
On Sat, Mar 29, 2008 at 1:12 PM, <johns18 at fas.harvard.edu> wrote:
> Does anyone know how to make R include missing data as its own row in
> contingency tables? thanks!
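On the quoted question, table() can keep NA as its own row/column via its useNA argument - a sketch with hypothetical df$var1 and df$var2:

table(df$var1, df$var2, useNA = "ifany")  # "ifany" adds an NA row/column when missing values exist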
--
Catherine Lena Kelly
Ph.D. Student
Harvard University Department of Government
Dear all,
We are encountering a problem with our code in a particular replication. This works for one set of our data and gives the (more or less) right numbers:
out2 <- matrix(NA, 2, 2)
out2[,1] <- c(34, 36)
varnames <- colnames(plomin)
rownames(out2) <- varnames[out2[,1]]
for (i in 1:2) { # Calculate teacher assessment correlations for MZ twins at 7
  out2[i,2] <- cor(mat.cmz1[, 2*(16+i)], mat.cmz1[, 2*(16+i)+1], use = "complete")
}
out2
But when we try to use the same code with different column numbers for a
different portion of the data, we get an error:
> out6 <- matrix(NA, 3, 2)
> out6[,1] <- c(103, 105, 107)
> varnames <- colnames(plomin)
> rownames(out6) <- varnames[out6[,1]]
> for (i in 1:3) { # Calculate teacher assessment correlations for MZ twins at 10
+   out6[i,3] <- cor(mat.cmz3[, 2*(51+i)], mat.cmz3[, 2*(50+i)+1], use = "complete")
+ }
Error in out6[i, 3] <- cor(mat.cmz3[, 2 * (51 + i)], mat.cmz3[, 2 * (50 + :
  subscript out of bounds
+ out6
We've tried rewriting the code on that line in many different ways - reordering the elements, writing it from scratch, taking out all the spaces - and we can't seem to get it to read beyond that '+' sign on that line. Does anyone have any clue as to why we would get the error in one case but not the other?
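For concreteness, the one change that makes the loop run (a sketch): out6 was built with two columns, so out6[i, 3] is out of bounds. (Note also that 2*(50+i)+1 equals 2*(51+i) - 1, the column just before the first argument, whereas the age-7 loop paired each column with the one after it - which pairing is intended may be worth checking.)

for (i in 1:3) { # teacher assessment correlations for MZ twins at 10
  out6[i, 2] <- cor(mat.cmz3[, 2*(51+i)], mat.cmz3[, 2*(50+i)+1], use = "complete")
}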
Thanks.
Best,
Jill & Jeremy