Hi Mithilesh,
Sorry, I thought I had responded to this. It appears that R is running out of
memory for the dataset that you are using. Given the options, you are
trying to impute a dataset that has roughly 1300 variables (each level of a
nominal variable gets its own column in the amelia data matrix). There may
also be some perfect collinearities in the dataset. Is it possible to treat
longitude and latitude as continuous? This might help cut down on the number
of levels. In general, I would start with an even smaller set of covariates
(possibly only the continuous ones?) and see if amelia will run. Then add
variables through noms slowly to see what occurs. It's hard to know whether
it is the number of covariates, the number of observations, or the
combination that is causing problems in this case.
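As a rough sketch of the checks suggested above (not something run on the actual data): the toy data frame `dt_toy` below is made up and only borrows a few column names from this thread, and `count_levels` is a hypothetical helper. It counts the observed levels of each nominal variable, shows how much that count shrinks once latitude and longitude are stored as numbers and dropped from noms, and works out the size of one dense double-precision matrix with 761,592 rows and roughly 1300 columns.

```r
# Toy stand-in for `dt`: column names follow the thread, values are invented.
dt_toy <- data.frame(
  cartype   = factor(rep(c("mini", "suv", "van"), length.out = 12)),
  browser.x = factor(rep(c("chrome", "firefox"), length.out = 12)),
  latitude  = factor(rep(c("12.97", "28.61", "19.08"), length.out = 12)),
  longitude = factor(rep(c("77.59", "77.21", "72.88"), length.out = 12))
)

noms <- c("cartype", "browser.x", "latitude", "longitude")

# Hypothetical helper: number of distinct non-missing values in a column.
count_levels <- function(x) length(unique(x[!is.na(x)]))

# Levels contributed by each nominal variable, and their total.
n_levels <- sapply(dt_toy[noms], count_levels)
total_before <- sum(n_levels)

# Coordinates stored as factors: go through character to recover the numbers.
dt_toy$latitude  <- as.numeric(as.character(dt_toy$latitude))
dt_toy$longitude <- as.numeric(as.character(dt_toy$longitude))

# With the coordinates treated as continuous, they leave the noms list.
noms_reduced <- setdiff(noms, c("latitude", "longitude"))
total_after <- sum(n_levels[noms_reduced])

# Size in GiB of a single dense matrix of doubles (8 bytes per cell) with
# the row count from the original post and roughly 1300 columns.
est_gib <- 761592 * 1300 * 8 / 2^30

print(n_levels)
print(c(before = total_before, after = total_after))
print(round(est_gib, 2))
```

On the real data, the same `sapply()` call over the columns listed in noms would show which variables contribute most of the levels. The size estimate covers one copy of the matrix only; any working copies made during the run would add to it.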
Cheers,
Matt
On Sun, Jan 31, 2016 at 1:47 AM, Mithilesh Kumar <mithileshk.in(a)gmail.com>
wrote:
Hi Matt,
I am using Amelia version 1.7.4.
Actually, I had put the logical variables (those with 0/1 responses) in
noms. I have moved them to idvars so that they are ignored.
Some categorical variables have more than 1000 levels. I removed
those variables.
I am using the following variables (1498 levels) in the imputation:
*Var Levels*
"destinationcountry" = 155,
"cartype" = 39,
"browser.x" = 13,
"interactionchannel" = 4,
"paymentmethod" = 7,
"segmentname" = 3,
"geo_country" = 146,
"geo_region" = 378,
"operating_system" = 62,
"browser.y" = 18,
"language" = 35,
"latitude" = 235,
"longitude" = 243,
"device_model_id" = 146
This time I am getting the following memory limit error:
Error: cannot allocate vector of size 4.0 Gb
In addition: There were 30 warnings (use warnings() to see them)
Warnings:
1: In amcheck(x = x, m = m, idvars = numopts$idvars, priors = priors, ... :
  You've set the polynomials of time to zero with no interaction with
  the cross-sectional variable. This has no effect on the imputation.
2: In amcheck(x = x, m = m, idvars = numopts$idvars, priors = priors, ... :
  The number of categories in one of the variables marked nominal has
  greater than 10 categories. Check nominal specification.
12: In amcheck(x = x, m = m, idvars = numopts$idvars, priors = priors, ... :
  The variable NA is perfectly collinear with another variable in the data.
13: In ifelse(x[, i] == values[j], 1, 0) :
  Reached total allocation of 8077Mb: see help(memory.size)
*Code:*
dt.out <- amelia(x = dt, m = 3,
                 idvars = c("device_unique_id", "AirportTransaction", "status",
                            "is_remarketing", "post_click_conv",
                            "post_view_conv"),
                 ts = "pickupdate", cs = "destinationcountry",
                 priors = NULL, lags = NULL, empri = 0.01*nrow(dt),
                 polytime = 0, intercs = FALSE, p2s = 2, incheck = TRUE,
                 ords = NULL,
                 noms = c("cartype", "browser.x", "interactionchannel",
                          "paymentmethod", "segmentname", "geo_country",
                          "geo_region", "operating_system", "browser.y",
                          "language", "latitude", "longitude",
                          "device_model_id"))
Regards,
On Sat, Jan 30, 2016 at 11:43 PM, Matt Blackwell <
mblackwell(a)gov.harvard.edu> wrote:
Hi Mithilesh,
My guess is that you might be asking too much of the data here. You are
including a separate quadratic function of time for each cross-sectional
unit in the data (polytime = 2, intercs=TRUE) and this might be problematic
if some of the characteristics of the cross-sectional unit are constant
within unit. Can you try to run Amelia with intercs = FALSE and see if (a)
things speed up and (b) the error message disappears?
Also, what version of Amelia are you using? There was a bug with that
error message in previous versions, but it should be fixed in 1.7.4.
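One way to look for the situation described above is to count distinct values per variable, both overall and within each cross-sectional unit. The sketch below uses a made-up data frame `dt_toy` with column names borrowed from this thread, and `n_distinct` is a hypothetical helper. It is only a diagnostic idea, not a confirmed explanation of the contrasts error.

```r
# Toy stand-in for `dt`: "status" never varies, "segmentname" varies only
# between countries, "cartype" varies within each country.
dt_toy <- data.frame(
  destinationcountry = c("IN", "IN", "IN", "US", "US", "US"),
  status             = c("ok", "ok", "ok", "ok", "ok", "ok"),
  segmentname        = c("a", "a", "a", "b", "b", "b"),
  cartype            = c("mini", "suv", "mini", "van", "suv", "van"),
  stringsAsFactors   = TRUE
)

# Hypothetical helper: number of distinct non-missing values.
n_distinct <- function(x) length(unique(x[!is.na(x)]))

# Variables with fewer than two observed values in the whole dataset.
overall <- sapply(dt_toy, n_distinct)
single_valued <- names(overall)[overall < 2]

# Variables that are constant inside every cross-sectional unit.
vars <- setdiff(names(dt_toy), "destinationcountry")
const_within <- sapply(vars, function(v) {
  all(tapply(dt_toy[[v]], dt_toy$destinationcountry, n_distinct) < 2)
})
constant_in_units <- names(const_within)[const_within]

print(single_valued)      # "status"
print(constant_in_units)  # "status" "segmentname"
```

Variables flagged by either check are candidates to drop or move to idvars before retrying with intercs = TRUE.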
Cheers,
Matt
On Sat, Jan 30, 2016 at 12:53 PM, Mithilesh Kumar <
mithileshk.in(a)gmail.com> wrote:
Hi Matt,
After running with the ridge prior for 4 hours, I am getting the following error:
*Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels*
*In addition: There were 19 warnings (use warnings() to see them)*
I am using the following code:
dt.out <- amelia(x = dt, m = 3, idvars = "device_unique_id",
                 ts = "pickupdate", cs = "destinationcountry",
                 priors = NULL, lags = NULL, empri = 0.01*nrow(dt),
                 polytime = 2, intercs = TRUE, p2s = 2, incheck = TRUE,
                 ords = NULL,
                 noms = c("cartype", "AirportTransaction", "status",
                          "browser.x", "interactionchannel", "paymentmethod",
                          "segmentname", "ip_address", "geo_country",
                          "geo_region", "operating_system", "browser.y",
                          "language", "creative_freq", "creative_rec",
                          "user_group_id", "is_remarketing", "post_click_conv",
                          "post_view_conv", "advertiser_frequency",
                          "advertiser_recency", "latitude", "longitude",
                          "device_model_id"))
Regards,
On Sat, Jan 30, 2016 at 9:35 AM, Matt Blackwell <
mblackwell(a)gov.harvard.edu> wrote:
Hi Mithilesh,
It's not so much a limitation on the number of observations, but you
are asking a lot of Amelia here. If there are 28 categorical variables, each
with more than 10 categories (and you have marked them so), then you are
adding roughly 280 variables to the imputation model, which is quite a few.
But that shouldn't be too bad, given the size of your data. It seems more
likely to be the extremely high missingness rate. You might try using the
ridge prior ("empri" argument in the amelia function). See section 4.7.1 of
the vignette for more information about this setting:
https://cran.r-project.org/web/packages/Amelia/vignettes/amelia.pdf
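To put a number on the missingness before choosing a prior, per-variable missing rates and the count of complete rows can be computed with base R. The sketch below runs on a made-up data frame `dt_toy` (column names borrowed from this thread, values invented). For scale, the 172 complete cases reported in the original post are roughly 0.02% of the 761,592 rows.

```r
# Toy stand-in for `dt` with some missing values.
dt_toy <- data.frame(
  cartype       = c("mini", NA, "suv", NA),
  paymentmethod = c("card", "cash", NA, NA),
  latitude      = c(12.97, 28.61, NA, 19.08)
)

# Share of missing values in each column.
miss_rate <- colMeans(is.na(dt_toy))

# Number of rows with no missing value in any column.
n_complete <- sum(complete.cases(dt_toy))

# Columns ordered from most to least missing.
print(sort(miss_rate, decreasing = TRUE))
print(n_complete)

# Complete-case share reported in the original post, in percent.
print(round(100 * 172 / 761592, 4))
```

Columns at the top of the sorted list are the ones where the imputation model has the least information to work with.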
Cheers,
Matt
~~~~~~~~~~~
Matthew Blackwell
Assistant Professor of Government
Harvard University
url:
http://www.mattblackwell.org
On Fri, Jan 29, 2016 at 10:54 PM, Mithilesh Kumar <
mithileshk.in(a)gmail.com> wrote:
> I have 761,592 observations of 31 variables on users' behaviour towards
> online ads. Of the 31 variables, 28 are categorical. Many categorical
> variables have more than 10 categories. I am using Amelia for missing data
> imputation.
>
> It is taking a very long time. Are there ways to make it faster? What are
> Amelia's limits on the number of observations?
>
> Is there any R package that performs better on large datasets for
> missing data imputation?
>
> I checked for complete cases; there are only 172, which is very few
> compared to the total dataset.
>
> --
> Mithilesh Kumar
> --
> Amelia mailing list served by HUIT
> [Un]Subscribe/View Archive:
>
http://lists.gking.harvard.edu/?info=amelia
> More info about Amelia:
http://gking.harvard.edu/amelia
> Amelia mailing list
> Amelia(a)lists.gking.harvard.edu
>
> To unsubscribe from this list or get other information:
>
>
https://lists.gking.harvard.edu/mailman/listinfo/amelia
>
--
Mithilesh Kumar