what slows the imputation down so much?

List overview All Threads
Download

newer

older

error when setting priors in...

How to handle "conditional...

Mark Manger

14 Mar 2008 14 Mar '08

11:42 a.m.

Hi, I apologize in advance for the lengthy question, but it's representative of many issues I face when working with large panels of economic data, so I would be extremely grateful for your suggestions, best practices, experiences etc. I'm wondering what I could do to speed up the imputation of my rather large dataset (a panel of N 2120 x T 80 = 169600 obs). At this pace, my imputations would run months. Memory is not the issue, rather I think that I have too many priors and/or too many missings on certain variables. See below, especially lnAid and lnFDI. Note that the missings are concentrated on certain T points (in early time points) rather than specific cross-sectional units. | Variable || || | Obs Mean Std. Dev. Min Max Polity || || | 168160 .8924833 6.955011 -10 10 Corruptlvl|| || | 157820 5.441431 1.799652 0 10 RuleofLaw || || | 157820 5.247434 2.204846 0 10 GovStab || ||| 157820 5.935454 2.064963 0 10 log of bilat. Aid | 76079 1.919392 2.338255 -2.302585 9.692112 log of FDI in host| 32080 3.918487 2.928901 -2.372018 10.98025 Capital openness|||| | 155200 -.2888318 1.379179 -1.766966 2.602508 Polcon V || || | 154320 .3490876 .3158385 0 .89 log of GDPcap_host| 154560 7.95649 1.053043 4.933741 10.48464 ||log of ||GDP_host | 166480 29.62135 3.04193 22.97718 43.12974 ||log of ||GDP_home | 147381 31.1144 2.128313 26.15253 37.36032 |If I don't set range priors, I get nonsensical values for most of the variables: negative GDP (real GDP, not negative log values), polity scores out of range, etc. I haven't even tried higher-order polynomials or interactions with cross-sectional units, although I would prefer to given that FDI exhibits a clear trend. Breaking up the dataset randomly into pieces by cross-sections doesn't improve speed. It seems that I have to make tradeoffs. What do you think would be the best thing to do, i.e. what is the most time-consuming issue for the EM algorithm? Constrain/shorten the sample to have a higher proportion of observed values on lnAid and lnFDI? Accept imputations that are out of range (probably not)? ||Break up the dataset "vertically" into one with Aid and one with the FDI variable, run two sets of imputations, and merge it again? Many thanks, Mark -- Mark S. Manger, PhD Assistant Professor Department of Political Science, McGill University mark.manger(a)mcgill.ca on leave 2007-08: Advanced Research Fellow, Program on US-Japan Relations Weatherhead Center for International Affairs Harvard University 61 Kirkland Street, Room 301 Cambridge, MA 02138 617-495-5998 - Amelia mailing list served by Harvard-MIT Data Center [Un]Subscribe/View Archive: http://lists.gking.harvard.edu/?info=amelia

Show replies by date

Anders Schwartz Corr

14 Mar 14 Mar

12:34 p.m.

Mark, Are you running Amelia on hmdc condor? I found running 50 separate Amelia programs simultaneously on hmdc relatively quick and easy -- the first 5 imputed datasets were returned within two or three weeks, and then after that I got about 1 more condor egg ;) per day. I could start my data analysis with the first five datasets and then additional datasets were added as they came in. Anders Note to the powers that be: it would be useful to set condor to give streaming output. It is a little difficult to know what is going on with Amelia (can't take advantage of the verbose function) when this standard condor function isn't set. Thank you for such a great service hmdc! On Fri, 14 Mar 2008, Mark Manger wrote:

...

- Amelia mailing list served by Harvard-MIT Data Center [Un]Subscribe/View Archive: http://lists.gking.harvard.edu/?info=amelia

Gary King

10:16 p.m.

...

Hi, I apologize in advance for the lengthy question, but it's representative of many issues I face when working with large panels of economic data, so I would be extremely grateful for your suggestions, best practices, experiences etc. I'm wondering what I could do to speed up the imputation of my rather large dataset (a panel of N 2120 x T 80 = 169600 obs). At this pace, my imputations would run months. Memory is not the issue, rather I think that I have too many priors and/or too many missings on certain variables. See below, especially lnAid and lnFDI. Note that the missings are concentrated on certain T points (in early time points) rather than specific cross-sectional units. | Variable || || | Obs Mean Std. Dev. Min Max Polity || || | 168160 .8924833 6.955011 -10 10 Corruptlvl|| || | 157820 5.441431 1.799652 0 10 RuleofLaw || || | 157820 5.247434 2.204846 0 10 GovStab || ||| 157820 5.935454 2.064963 0 10 log of bilat. Aid | 76079 1.919392 2.338255 -2.302585 9.692112 log of FDI in host| 32080 3.918487 2.928901 -2.372018 10.98025 Capital openness|||| | 155200 -.2888318 1.379179 -1.766966 2.602508 Polcon V || || | 154320 .3490876 .3158385 0 .89 log of GDPcap_host| 154560 7.95649 1.053043 4.933741 10.48464 | | log of ||GDP_host | 166480 29.62135 3.04193 22.97718 43.12974 | | log of ||GDP_home | 147381 31.1144 2.128313 26.15253 37.36032 | If I don't set range priors, I get nonsensical values for most of the variables: negative GDP (real GDP, not negative log values), polity scores out of range, etc. I haven't even tried higher-order polynomials or interactions with cross-sectional units, although I would prefer to given that FDI exhibits a clear trend. Breaking up the dataset randomly into pieces by cross-sections doesn't improve speed. It seems that I have to make tradeoffs. What do you think would be the best thing to do, i.e. what is the most time-consuming issue for the EM algorithm? Constrain/shorten the sample to have a higher proportion of observed values on lnAid and lnFDI? Accept imputations that are out of range (probably not)? | | Break up the dataset "vertically" into one with Aid and one with the | | FDI variable, run two sets of imputations, and merge it again? Many thanks, Mark -- Mark S. Manger, PhD Assistant Professor Department of Political Science, McGill University mark.manger(a)mcgill.ca on leave 2007-08: Advanced Research Fellow, Program on US-Japan Relations Weatherhead Center for International Affairs Harvard University 61 Kirkland Street, Room 301 Cambridge, MA 02138 617-495-5998 - Amelia mailing list served by Harvard-MIT Data Center [Un]Subscribe/View Archive: http://lists.gking.harvard.edu/?info=amelia

- Amelia mailing list served by Harvard-MIT Data Center [Un]Subscribe/View Archive: http://lists.gking.harvard.edu/?info=amelia

Anders Schwartz Corr

15 Mar 15 Mar

3:50 a.m.

Hi, Thanks for the translation. When a parallel environment is running say 10 runs (the condor_submit default) of Amelia simultaneously, Amelia saves the datasets as outdata1.csv, outdata2.csv, etc., to a single file from which the condor (the computing cluster program) job was submitted. Is it possible that Amelia is overwriting other imputed datasets that one of the other 10 Amelia runs has already created? If so, maybe condor could create a separate directory for each of the 10 Amelia runs, and/or Amelia could check the directory it is saving into to make sure it isn't overwriting any already-existing files. Also, if I set Amelia to create 5 imputed datasets, and set a random-number generating seed so I can replicate the results exactly, and then set condor to run my single Amelia file 10 times, does that mean that condor will create 10 identical sets of 5 imputed datasets? But then only one of the 10 sets would be useful. Alternatively, I suppose Amelia programmers may have intended users to simply create as many directories and different Amelia.R files (in my case 50) as needed, and rerun the condor submit program 50 times for each of 50 random seeds the user sets in each of the 50 files. Fifty simultaneous Amelia runs may sound excessive, but for a relatively large dataset (mine is 17,000x100) it is almost necessary in order to get 10 or 20 imputed datasets back within the first month of the run. Again, thanks for all the excellent work by the people at HMDC. Anders On Fri, 14 Mar 2008, Gary King wrote:

...

Hi, I apologize in advance for the lengthy question, but it's representative of many issues I face when working with large panels of economic data, so I would be extremely grateful for your suggestions, best practices, experiences etc. I'm wondering what I could do to speed up the imputation of my rather large dataset (a panel of N 2120 x T 80 = 169600 obs). At this pace, my imputations would run months. Memory is not the issue, rather I think that I have too many priors and/or too many missings on certain variables. See below, especially lnAid and lnFDI. Note that the missings are concentrated on certain T points (in early time points) rather than specific cross-sectional units. | Variable || || | Obs Mean Std. Dev. Min Max Polity || || | 168160 .8924833 6.955011 -10 10 Corruptlvl|| || | 157820 5.441431 1.799652 0 10 RuleofLaw || || | 157820 5.247434 2.204846 0 10 GovStab || ||| 157820 5.935454 2.064963 0 10 log of bilat. Aid | 76079 1.919392 2.338255 -2.302585 9.692112 log of FDI in host| 32080 3.918487 2.928901 -2.372018 10.98025 Capital openness|||| | 155200 -.2888318 1.379179 -1.766966 2.602508 Polcon V || || | 154320 .3490876 .3158385 0 .89 log of GDPcap_host| 154560 7.95649 1.053043 4.933741 10.48464 | | log of ||GDP_host | 166480 29.62135 3.04193 22.97718 43.12974 | | log of ||GDP_home | 147381 31.1144 2.128313 26.15253 37.36032 | If I don't set range priors, I get nonsensical values for most of the variables: negative GDP (real GDP, not negative log values), polity scores out of range, etc. I haven't even tried higher-order polynomials or interactions with cross-sectional units, although I would prefer to given that FDI exhibits a clear trend. Breaking up the dataset randomly into pieces by cross-sections doesn't improve speed. It seems that I have to make tradeoffs. What do you think would be the best thing to do, i.e. what is the most time-consuming issue for the EM algorithm? Constrain/shorten the sample to have a higher proportion of observed values on lnAid and lnFDI? Accept imputations that are out of range (probably not)? | | Break up the dataset "vertically" into one with Aid and one with the | | FDI variable, run two sets of imputations, and merge it again? Many thanks, Mark -- Mark S. Manger, PhD Assistant Professor Department of Political Science, McGill University mark.manger(a)mcgill.ca on leave 2007-08: Advanced Research Fellow, Program on US-Japan Relations Weatherhead Center for International Affairs Harvard University 61 Kirkland Street, Room 301 Cambridge, MA 02138 617-495-5998 - Amelia mailing list served by Harvard-MIT Data Center [Un]Subscribe/View Archive: http://lists.gking.harvard.edu/?info=amelia

- Amelia mailing list served by Harvard-MIT Data Center [Un]Subscribe/View Archive: http://lists.gking.harvard.edu/?info=amelia

Gary King

7:41 a.m.

R and therefore Amelia doesn't have parallel processing built in. But the way it works, if you want m imputations, you can run Amelia m times in separate jobs and it will run m*100% faster. But you do need to tell your parallelization software (condor in this case) to run them each completely separately so they don't overwrite different imputations with the same file names. Best, Gary On Sat, 15 Mar 2008, Anders Schwartz Corr wrote:

...

Hi, I apologize in advance for the lengthy question, but it's representative of many issues I face when working with large panels of economic data, so I would be extremely grateful for your suggestions, best practices, experiences etc. I'm wondering what I could do to speed up the imputation of my rather large dataset (a panel of N 2120 x T 80 = 169600 obs). At this pace, my imputations would run months. Memory is not the issue, rather I think that I have too many priors and/or too many missings on certain variables. See below, especially lnAid and lnFDI. Note that the missings are concentrated on certain T points (in early time points) rather than specific cross-sectional units. | Variable || || | Obs Mean Std. Dev. Min Max Polity || || | 168160 .8924833 6.955011 -10 10 Corruptlvl|| || | 157820 5.441431 1.799652 0 10 RuleofLaw || || | 157820 5.247434 2.204846 0 10 GovStab || ||| 157820 5.935454 2.064963 0 10 log of bilat. Aid | 76079 1.919392 2.338255 -2.302585 9.692112 log of FDI in host| 32080 3.918487 2.928901 -2.372018 10.98025 Capital openness|||| | 155200 -.2888318 1.379179 -1.766966 2.602508 Polcon V || || | 154320 .3490876 .3158385 0 .89 log of GDPcap_host| 154560 7.95649 1.053043 4.933741 10.48464 | | log of ||GDP_host | 166480 29.62135 3.04193 22.97718 43.12974 | | log of ||GDP_home | 147381 31.1144 2.128313 26.15253 37.36032 | If I don't set range priors, I get nonsensical values for most of | the variables: negative GDP (real GDP, not negative log values), polity scores out of range, etc. I haven't even tried higher-order polynomials or interactions with cross-sectional units, although I would prefer to given that FDI exhibits a clear trend. Breaking up the dataset randomly into pieces by cross-sections doesn't improve speed. It seems that I have to make tradeoffs. What do you think would be the best thing to do, i.e. what is the most time-consuming issue for the EM algorithm? Constrain/shorten the sample to have a higher proportion of observed values on lnAid and lnFDI? Accept imputations that are out of range (probably not)? | | Break up the dataset "vertically" into one with Aid and one with | | the | | FDI variable, run two sets of imputations, and merge it again? Many thanks, Mark -- Mark S. Manger, PhD Assistant Professor Department of Political Science, McGill University mark.manger(a)mcgill.ca on leave 2007-08: Advanced Research Fellow, Program on US-Japan Relations Weatherhead Center for International Affairs Harvard University 61 Kirkland Street, Room 301 Cambridge, MA 02138 617-495-5998 - Amelia mailing list served by Harvard-MIT Data Center [Un]Subscribe/View Archive: http://lists.gking.harvard.edu/?info=amelia

- Amelia mailing list served by Harvard-MIT Data Center [Un]Subscribe/View Archive: http://lists.gking.harvard.edu/?info=amelia

Mark Manger

11:57 a.m.

Hi, I'm intrigued by an off-list comment Anders made. I'm setting priors because I know that certain variables are constrained in a particular way, e.g. I treat a variable coded as 0-10 as continuous, so if Amelia draws an 11.3 that doesn't seem to be a reasonable value. But are these values --mathematically speaking-- better to use if that's what the imputation spits out? My understanding is that whether I set a range or distribution prior, internally Amelia just draws from a normal distribution with specific s.d. and mean anyway rather than from a closed interval. Please correct me if I'm wrong about this. I would rather set distribution priors because I seem to be getting the Error in chol(copy.theta[c(FALSE, m[ss, ]), c(FALSE, m[ss, ])]) : the leading minor of order 5 is not positive definite every once in a while with range priors, but not with distribution priors. Cheers, Mark Mark S. Manger, PhD Assistant Professor Department of Political Science, McGill University mark.manger(a)mcgill.ca on leave 2007-08: Advanced Research Fellow, Program on US-Japan Relations Weatherhead Center for International Affairs Harvard University 61 Kirkland Street, Room 301 Cambridge, MA 02138 617-495-5998 Gary King wrote:

...

Mark, Are you running Amelia on hmdc condor? I found running 50

separate > Amelia programs simultaneously on hmdc relatively quick and easy -- the > first 5 imputed datasets were returned within two or three weeks, and > then after that I got about 1 more condor egg ;) per day. I could start > my data analysis with the first five datasets and then additional > datasets were added as they came in. Anders

> Note to the powers that be: it would be useful to set condor to

give > streaming output. It is a little difficult to know what is going on with > Amelia (can't take advantage of the verbose function) when this standard > condor function isn't set. Thank you for such a great service hmdc!

> On Fri, 14 Mar 2008, Mark Manger wrote: > > Hi, > > > I apologize in advance for the lengthy question, but it's > representative > of many issues I face when working with large panels of

economic > > data, so

> I would be extremely grateful for your suggestions, best

practices,

> experiences etc. > > > I'm wondering what I could do to speed up the imputation

of my rather

> large dataset (a panel of N 2120 x T 80 = 169600 obs). At this

pace, > > my

> imputations would run months. Memory is not the issue, rather

I think > > that

> I have too many priors and/or too many missings on certain

variables. > > See

> below, especially lnAid and lnFDI. Note that the missings are > concentrated > on certain T points (in early time points) rather than specific > cross-sectional units. > | > > Variable || || | Obs Mean Std.

Dev. Min > > Max

> Polity || || | 168160 .8924833

6.955011 -10 > > 10

> Corruptlvl|| || | 157820 5.441431

1.799652 0 > > 10

> RuleofLaw || || | 157820 5.247434

2.204846 0 > > 10

> GovStab || ||| 157820 5.935454

2.064963 0 > > 10

> log of bilat. Aid | 76079 1.919392 2.338255 -2.302585 > 9.692112 > log of FDI in host| 32080 3.918487 2.928901 -2.372018 > 10.98025 > Capital openness|||| | 155200 -.2888318 1.379179

-1.766966

> 2.602508 > Polcon V || || | 154320 .3490876

.3158385 0 > > .89

> log of GDPcap_host| 154560 7.95649 1.053043 4.933741 > 10.48464 > | | log of ||GDP_host | 166480 29.62135 3.04193

22.97718

> 43.12974 > | | log of ||GDP_home | 147381 31.1144 2.128313

26.15253

> 37.36032 > > > | If I don't set range priors, I get nonsensical values for

most of > > | the

> variables: negative GDP (real GDP, not negative log values),

polity > > scores

> out of range, etc. I haven't even tried higher-order

polynomials or

> interactions with cross-sectional units, although I would

prefer to > > given

> that FDI exhibits a clear trend. Breaking up the dataset

randomly > > into

> pieces by cross-sections doesn't improve speed. > > > It seems that I have to make tradeoffs. What do you think

would be > > the

> best thing to do, i.e. what is the most time-consuming issue

for the > > EM

> algorithm? > Constrain/shorten the sample to have a higher proportion of

observed

> values on lnAid and lnFDI? > Accept imputations that are out of range (probably not)? > | | Break up the dataset "vertically" into one with Aid and one

with > > | | the | > > | FDI

> variable, run two sets of imputations, and merge it again? > > > Many thanks, > > > Mark > > > -- > Mark S. Manger, PhD > Assistant Professor > Department of Political Science, McGill University > mark.manger(a)mcgill.ca > > > on leave 2007-08: > Advanced Research Fellow, Program on US-Japan Relations > Weatherhead Center for International Affairs > Harvard University > 61 Kirkland Street, Room 301 > Cambridge, MA 02138 > 617-495-5998 > - > Amelia mailing list served by Harvard-MIT Data Center > [Un]Subscribe/View Archive: > >

http://lists.gking.harvard.edu/?info=amelia

> > - Amelia mailing list served by Harvard-MIT Data Center [Un]Subscribe/View Archive:

http://lists.gking.harvard.edu/?info=amelia

- Amelia mailing list served by Harvard-MIT Data Center [Un]Subscribe/View Archive: http://lists.gking.harvard.edu/?info=amelia

Gary King

12:15 p.m.

Matt or James might have a better guess about the error message, but you're probably asking it to do something that's impossible. on the stats question, if you're using the imputations for something linear, like a variable in a regression or an explanatory variable in a logit or probit or something like that, then its best to just use the imputation as is. the only time you need to do more is if your analysis model requires it, such as for a binary dependent variable in a logit model or ordered variable for an ordered logit, etc. then you can round, or Amelia can do this or things like it for you by telling it what type of variable you have. Gary On Sat, 15 Mar 2008, Mark Manger wrote:

...

a translation: condor distributes jobs to a 200-node research computing cluster we have built at the Harvard-MIT Data Center, which is part of the Institute for Quantitative Social Science at Harvard. unfortunately, that system isn't open to the public (altho we are working at an open source version of the software we produced to create the cluster). Anders' point more generally is that Amelia can be used well in parallel environment. Gary On Fri, 14 Mar 2008, Anders Schwartz Corr wrote: > Mark, Are you running Amelia on hmdc condor? I found running 50 separate > Amelia programs simultaneously on hmdc relatively quick and easy -- the > first 5 imputed datasets were returned within two or three weeks, and > then after that I got about 1 more condor egg ;) per day. I could start > my data analysis with the first five datasets and then additional > datasets were added as they came in. Anders > > Note to the powers that be: it would be useful to set condor to give > streaming output. It is a little difficult to know what is going on with > Amelia (can't take advantage of the verbose function) when this standard > condor function isn't set. Thank you for such a great service hmdc! > > On Fri, 14 Mar 2008, Mark Manger wrote: > > > Hi, > > > > I apologize in advance for the lengthy question, but it's > > representative > > of many issues I face when working with large panels of economic > > data, so > > I would be extremely grateful for your suggestions, best practices, > > experiences etc. > > > > I'm wondering what I could do to speed up the imputation of my rather > > large dataset (a panel of N 2120 x T 80 = 169600 obs). At this pace, > > my > > imputations would run months. Memory is not the issue, rather I think > > that > > I have too many priors and/or too many missings on certain variables. > > See > > below, especially lnAid and lnFDI. Note that the missings are > > concentrated > > on certain T points (in early time points) rather than specific > > cross-sectional units. > > | > > Variable || || | Obs Mean Std. Dev. Min > > Max > > Polity || || | 168160 .8924833 6.955011 -10 > > 10 > > Corruptlvl|| || | 157820 5.441431 1.799652 0 > > 10 > > RuleofLaw || || | 157820 5.247434 2.204846 0 > > 10 > > GovStab || ||| 157820 5.935454 2.064963 0 > > 10 > > log of bilat. Aid | 76079 1.919392 2.338255 -2.302585 > > 9.692112 > > log of FDI in host| 32080 3.918487 2.928901 -2.372018 > > 10.98025 > > Capital openness|||| | 155200 -.2888318 1.379179 -1.766966 > > 2.602508 > > Polcon V || || | 154320 .3490876 .3158385 0 > > .89 > > log of GDPcap_host| 154560 7.95649 1.053043 4.933741 > > 10.48464 > > | | log of ||GDP_host | 166480 29.62135 3.04193 22.97718 > > 43.12974 > > | | log of ||GDP_home | 147381 31.1144 2.128313 26.15253 > > 37.36032 > > > > | If I don't set range priors, I get nonsensical values for most of > > | the > > variables: negative GDP (real GDP, not negative log values), polity > > scores > > out of range, etc. I haven't even tried higher-order polynomials or > > interactions with cross-sectional units, although I would prefer to > > given > > that FDI exhibits a clear trend. Breaking up the dataset randomly > > into > > pieces by cross-sections doesn't improve speed. > > > > It seems that I have to make tradeoffs. What do you think would be > > the > > best thing to do, i.e. what is the most time-consuming issue for the > > EM > > algorithm? > > Constrain/shorten the sample to have a higher proportion of observed > > values on lnAid and lnFDI? > > Accept imputations that are out of range (probably not)? > > | | Break up the dataset "vertically" into one with Aid and one with > > | | the | > > | FDI > > variable, run two sets of imputations, and merge it again? > > > > Many thanks, > > > > Mark > > > > -- > > Mark S. Manger, PhD > > Assistant Professor > > Department of Political Science, McGill University > > mark.manger(a)mcgill.ca > > > > on leave 2007-08: > > Advanced Research Fellow, Program on US-Japan Relations > > Weatherhead Center for International Affairs > > Harvard University > > 61 Kirkland Street, Room 301 > > Cambridge, MA 02138 > > 617-495-5998 > > - > > Amelia mailing list served by Harvard-MIT Data Center > > [Un]Subscribe/View Archive: > > http://lists.gking.harvard.edu/?info=amelia > > > - > Amelia mailing list served by Harvard-MIT Data Center > [Un]Subscribe/View Archive: http://lists.gking.harvard.edu/?info=amelia > > - Amelia mailing list served by Harvard-MIT Data Center [Un]Subscribe/View Archive: http://lists.gking.harvard.edu/?info=amelia

- Amelia mailing list served by Harvard-MIT Data Center [Un]Subscribe/View Archive: http://lists.gking.harvard.edu/?info=amelia

5895

days inactive

5896

days old

amelia@lists.gking.harvard.edu

Manage subscription

6 comments

3 participants

tags (0)

participants (3)

Anders Schwartz Corr
Gary King
Mark Manger