Jens,
Thank you for your reply! So we are not running into
a memory bottleneck, but specifically into a running-time
bottleneck. The dataset fits entirely in our laptop's
2 GB of memory (I checked; no swap is used).
So we ended up importing the data into a SQL database
and using SQL queries, driven by PHP scripts, to calculate
values for our explanatory variables from the raw data. Most
transformations from raw data to explanatory variables are
complex. For some tasks the running time is still very high,
so most of our time is going into optimization to squeeze
out as much efficiency as we can.
-Alexei and Ben
On 03/27/2008 04:02 PM, Jens Hainmueller wrote:
Alexei and Ben,
Look here:
http://user2007.org/program/posters/adler.pdf
and also check out the filehash package. I have never used it, but I hear
it works well. Overall, R isn't really designed for very large
datasets.
Alternatively, I would simply do all the data manipulation in SAS, which in my
experience beats all other stats programs when it comes to really large
datasets. One million observations is no big deal at all for SAS. You can then
import the estimation data into R and run just the models there.
I think Stata should also still be OK with one million rows, but it may choke.
Jens
-----Original Message-----
From: gov2001-l-bounces at lists.fas.harvard.edu [mailto:gov2001-l-bounces at lists.fas.harvard.edu] On Behalf Of Alexei Colin
Sent: Thursday, March 27, 2008 4:45 PM
To: gov2001-l at lists.fas.harvard.edu
Subject: [gov2001-l] Performance
Hi all,
Has anyone encountered performance issues? We
have run into a running-time bottleneck and are stuck.
Our dataset contains about 1 million entries, and iterating
through it with a few manipulations takes over
24 hours. :(
Does anyone have general pointers on making
R code more efficient? For example, we gather that
operations like dat[dat[['DATE']] == myDate, ] are very
expensive. Is this true?
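(In case a concrete example helps: here is a minimal sketch, with a made-up data frame and invented column names, of why that subsetting pattern gets expensive when repeated inside a loop, and one common base-R way around it. This is an illustration under those assumptions, not our actual replication code.)

```r
# Hypothetical data: many rows keyed by a DATE column.
set.seed(1)
dat <- data.frame(
  DATE  = sample(seq(as.Date("2000-01-01"), by = "day", length.out = 365),
                 1e5, replace = TRUE),
  value = rnorm(1e5)
)

# Slow pattern: each dat[dat$DATE == d, ] rescans every row,
# so looping over all dates costs O(rows * dates).
slow_means <- sapply(unique(dat$DATE),
                     function(d) mean(dat$value[dat$DATE == d]))

# Faster: group the rows once with split() (a single pass),
# then work on each group.
groups     <- split(dat$value, dat$DATE)
fast_means <- sapply(groups, mean)

# Or let tapply() do the whole grouped aggregation in one call.
agg <- tapply(dat$value, dat$DATE, mean)
```

The idea is simply to pay the grouping cost once instead of once per date.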
It's no surprise that R, like MATLAB, trades off
concise code against performance, but we need
to get this replication done somehow. Perhaps doing some of the
initial data-filtering in C++ is a viable solution? :)
How do people usually deal with the problem of "too much
data"?
Thank you!!
-Alexei and Ben
_______________________________________________
gov2001-l mailing list
gov2001-l at lists.fas.harvard.edu
http://lists.fas.harvard.edu/mailman/listinfo/gov2001-l