Also check out the filehash package. I have never used it myself, but I have
heard that it works well. Overall, R is not really designed for very large
datasets.
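For what it's worth, here is a minimal sketch of how filehash is typically used
to keep a large object on disk instead of in RAM; the database name and the
small stand-in data frame are just placeholders:

    library(filehash)

    dat <- data.frame(x = rnorm(10))   # stand-in for the real large data frame

    # Create an on-disk key-value database and open a connection to it
    dbCreate("bigdata.db")
    db <- dbInit("bigdata.db")

    # Store the data frame on disk and drop the in-memory copy
    dbInsert(db, "dat", dat)
    rm(dat)

    # Fetch it back only when it is actually needed
    dat <- dbFetch(db, "dat")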
Alternatively, I would simply do all the data manipulation in SAS, which in my
experience beats every other stats program when it comes to really large
datasets. One million rows is no big deal at all for SAS. You can then import
the estimation data into R and run just the models there.
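If you go that route, one common path is to export the reduced dataset from SAS
in XPORT format and read it with the foreign package; the file name below is
just an example:

    library(foreign)

    # Read a SAS XPORT (.xpt) file exported from SAS into a data frame
    estdata <- read.xport("estimation_data.xpt")
    str(estdata)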
I think Stata should also be OK with 1 million rows, but it may choke.
Jens
-----Original Message-----
From: gov2001-l-bounces at lists.fas.harvard.edu [mailto:gov2001-l-bounces at lists.fas.harvard.edu] On Behalf Of Alexei Colin
Sent: Thursday, March 27, 2008 4:45 PM
To: gov2001-l at lists.fas.harvard.edu
Subject: [gov2001-l] Performance
Hi all,
Has anyone encountered any performance issues? We have run into a
running-time bottleneck and are stuck.
Our dataset contains about 1 million entries, and iterating through it with a
few manipulations takes over 24 hours. :(
Does anyone have any general pointers on how to make
R code more efficient? For example, we gather that
operations like dat[dat[['DATE']] == myDate, ] are very
expensive. Is this true?
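Repeated subsetting of that form inside a loop does rescan the whole data frame
each time. A minimal sketch of the usual workaround is to split the data frame
by DATE once and then loop over the per-date pieces; the synthetic dat below is
just a stand-in for the real data:

    # Synthetic stand-in for the real data frame (column names assumed)
    dat <- data.frame(
      DATE  = rep(seq(as.Date("2007-01-01"), by = "day", length.out = 250), each = 4000),
      VALUE = rnorm(1e6)
    )

    # Split by DATE once (a single pass over the data) ...
    by.date <- split(dat, dat[["DATE"]])

    # ... then work on each per-date piece instead of re-subsetting
    # the full data frame for every date
    results <- lapply(by.date, function(sub) {
      mean(sub$VALUE)   # placeholder for the real per-date manipulation
    })

Vectorizing the per-date work, and not growing objects inside a loop, helps a
great deal as well.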
It's no surprise that R, just like MATLAB, exposes a
tradeoff between concise code and performance, but we need
to get this replication done somehow. Perhaps doing some of the
initial data-filtering in C++ is a viable solution? :)
How do people usually deal with the problem of "too much
data"?
Thank you!!
-Alexei and Ben
_______________________________________________
gov2001-l mailing list
gov2001-l at lists.fas.harvard.edu
http://lists.fas.harvard.edu/mailman/listinfo/gov2001-l