Also check out the filehash package. I have never used it myself, but I have
heard that it works well. Overall, R is not really designed for very large
datasets.
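For what it's worth, here is a minimal sketch of how filehash is typically used
to keep a large object on disk instead of in RAM; the database name and the
small stand-in data frame are just placeholders:

    library(filehash)

    dat <- data.frame(x = rnorm(10))   # stand-in for the real large data frame

    # Create an on-disk key-value database and open a connection to it
    dbCreate("bigdata.db")
    db <- dbInit("bigdata.db")

    # Store the data frame on disk and drop the in-memory copy
    dbInsert(db, "dat", dat)
    rm(dat)

    # Fetch it back only when it is actually needed
    dat <- dbFetch(db, "dat")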
Alternatively, I would simply do all the data manipulation in SAS, which in my
experience beats every other stats program when it comes to really large
datasets. One million rows is no big deal at all for SAS. You can then import
the estimation data into R and run just the models there.
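If you go that route, one common path is to export the reduced dataset from SAS
in XPORT format and read it with the foreign package; the file name below is
just an example:

    library(foreign)

    # Read a SAS XPORT (.xpt) file exported from SAS into a data frame
    estdata <- read.xport("estimation_data.xpt")
    str(estdata)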
I think Stata should also be OK with 1 million rows, but it may choke.
Jens
-----Original Message-----
From: gov2001-l-bounces at lists.fas.harvard.edu [mailto:gov2001-l-bounces at lists.fas.harvard.edu] On Behalf Of Alexei Colin
Sent: Thursday, March 27, 2008 4:45 PM
To: gov2001-l at lists.fas.harvard.edu
Subject: [gov2001-l] Performance
Hi all,
Has anyone encountered any performance issues? We have run into a
running-time bottleneck and are stuck.
Our dataset contains about 1 million entries, and iterating through it with a
few manipulations takes over 24 hours. :(
Does anyone have any general pointers on how to make
R code more efficient? For example, we gather that
operations like dat[dat[['DATE']] == myDate, ] are very
expensive. Is this true?
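Repeated subsetting of that form inside a loop does rescan the whole data frame
each time. A minimal sketch of the usual workaround is to split the data frame
by DATE once and then loop over the per-date pieces; the synthetic dat below is
just a stand-in for the real data:

    # Synthetic stand-in for the real data frame (column names assumed)
    dat <- data.frame(
      DATE  = rep(seq(as.Date("2007-01-01"), by = "day", length.out = 250), each = 4000),
      VALUE = rnorm(1e6)
    )

    # Split by DATE once (a single pass over the data) ...
    by.date <- split(dat, dat[["DATE"]])

    # ... then work on each per-date piece instead of re-subsetting
    # the full data frame for every date
    results <- lapply(by.date, function(sub) {
      mean(sub$VALUE)   # placeholder for the real per-date manipulation
    })

Vectorizing the per-date work, and not growing objects inside a loop, helps a
great deal as well.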
It's no surprise that R, just like MATLAB, exposes a
tradeoff between concise code and performance, but we need
to get this replication done somehow. Perhaps doing some of the
initial data-filtering in C++ is a viable solution? :)
How do people usually deal with the problem of "too much
data"?
Thank you!!
-Alexei and Ben
_______________________________________________
gov2001-l mailing list
gov2001-l at lists.fas.harvard.edu
http://lists.fas.harvard.edu/mailman/listinfo/gov2001-l