Hi Mark, a better practice would be to put the transformed variable into
Amelia, get the best imputations you can, do your analysis, and then
transform the results to your quantity of interest. Clarify- or Zelig-style
analyses might help with that. Best of luck with your research,
Gary
--
*Gary King* - Albert J. Weatherhead III University Professor - Director,
IQSS <http://iq.harvard.edu/> - Harvard University
GaryKing.org - King(a)Harvard.edu - @KingGary <https://twitter.com/kinggary> -
617-500-7570 - Assistant <king-assist(a)iq.harvard.edu>: 617-495-9271
On Sat, Sep 15, 2018 at 4:33 PM Mark Seeto <markseeto(a)gmail.com> wrote:
> Dear Amelia group,
>
> Suppose my data set has a variable v that I want to include as a
> predictor variable in a regression model. Suppose that some
> transformation of v, for example, sqrt(v) or log(50 - v), looks more
> normally distributed than v does. However, to keep the interpretation
> of the model simpler, I want to include v itself as a predictor
> variable, not a transformation of v.
>
> What I had been doing previously was to use the "sqrts" or "logs"
> argument of amelia(), and then use v (not the transformed v) in the
> model. Or if a different transformation was required, I would create
> the transformed variable then impute (with v as an idvar) then
> back-transform, and use the back-transformed v in the model.
>
> Is this considered poor practice because I was using the transformed v
> for imputation but using v itself in the regression model? If it is,
> would I be better off simply imputing without using any transformation
> of v, assuming that v is the variable I want to include in the
> regression model?
>
> Thanks for any advice, and thanks to the Amelia team for all their work.
>
> Mark
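A minimal sketch of the workflow Gary describes, assuming a data frame d with an outcome y, the predictor v, and a covariate x (all names hypothetical): impute with v modeled on a transformed scale, fit the analysis model with v itself to each completed data set, and combine the results.

    library(Amelia)

    # Impute with v on the square-root scale; amelia() back-transforms, so
    # the completed data sets contain v on its original scale.
    a.out <- amelia(d, m = 5, sqrts = "v")

    # Fit the analysis model with untransformed v to each completed data set.
    fits <- lapply(a.out$imputations, function(imp) lm(y ~ v + x, data = imp))

    # Combine estimates across imputations (Rubin's rules) via mi.meld().
    b  <- sapply(fits, coef)
    se <- sapply(fits, function(f) sqrt(diag(vcov(f))))
    combined <- mi.meld(q = t(b), se = t(se))

From the combined estimates, quantities of interest can then be simulated in the Clarify or Zelig style Gary mentions.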
Hello,
I'm using Amelia II to impute missing data in a longitudinal setting. I'm
running into similar warnings that others have noticed regarding a variable
being perfectly collinear with another variable. I have 65 variables and 5
are deemed to be perfectly collinear. The most logical thing to do is to
remove the 5 variables and continue with the imputation process, which I do
and the model converges fine. However, I'm wondering if it makes sense to
add these 5 variables back in *after* the imputation process (these 5
variables contain no missing data). I realize that it is ideal to have all
the variables included in the original imputation model to best estimate
the missing values. However, at first glance, it doesn't seem harmful to
add back in variables that are collinear. Adding back in collinear
features might seem weird, but I'll be analyzing the data with penalized
regression and would like to keep all of the original data in the model.
I'd appreciate any feedback!
Thanks!
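A minimal sketch of the add-back workflow described above, assuming a data frame dat and hypothetical names for the five complete, collinear columns:

    library(Amelia)

    collinear <- c("x1", "x2", "x3", "x4", "x5")  # hypothetical names

    # Impute without the collinear columns.
    a.out <- amelia(dat[, setdiff(names(dat), collinear)], m = 5)

    # The dropped columns are fully observed, so binding them back onto each
    # completed data set changes no imputed value; row order is preserved.
    a.out$imputations <- lapply(a.out$imputations,
                                function(imp) cbind(imp, dat[, collinear]))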
Hello. I have an additional question on data transformation. I was thinking of first applying the EMB algorithm to my data and then, after having filled in all my missing values, transforming my data to the natural logarithm (ln). However, I'm no longer sure this is the most consistent way to proceed, because I read the following in "AMELIA II: A Program for Missing Data" (Honaker, King, and Blackwell; 2012): "Any variable that will be in the analysis model should also be in the imputation model. THIS INCLUDES ANY TRANSFORMATIONS (...)."
So, could you please advise me on the most consistent way to proceed: either (i) fill in all missing values using the EMB algorithm and only after that log-transform my data; or (ii) log-transform my data and then use the EMB algorithm to fill in all missing values?
I'm looking forward to hearing from you. Many thanks for your help. Michel
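Per the manual passage quoted above, the transformation belongs in the imputation model, which corresponds to option (ii); amelia()'s logs argument handles this directly. A minimal sketch with hypothetical variable names:

    library(Amelia)

    # Variables listed in logs are imputed on the natural-log scale and
    # returned on their original scale in the completed data sets.
    a.out <- amelia(dat, m = 5, logs = c("gdp", "trade"))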
Hello. I'm trying to load a database from an R file (extension .RData) into Amelia II using AmeliaView. However, when I choose the R file that contains my data, Amelia II seems to stop running: no database is loaded (Amelia's screen doesn't change at all), even after waiting some time to see whether the data would eventually load, which has never happened so far. Could you please give me some guidance on what I am doing wrong?
Just to give you some context about my database: it is a cross-sectional time series consisting of the daily closing position of each stock market of the G-20 countries from 2003 to 2017, totaling around 80,000 data points (i.e., around 4,000 values per country), of which about 5% are missing values.
Many thanks. Best wishes, Michel
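One hedged debugging step while AmeliaView appears to hang: open the .RData file in a plain R session first to confirm what it contains, and, as a fallback, call amelia() from the console (file and column names hypothetical):

    library(Amelia)

    objs <- load("G20_markets.RData")  # returns the names of the restored objects
    print(objs)
    str(get(objs[1]))                  # confirm the object is a data.frame

    # Fallback: skip the GUI entirely.
    a.out <- amelia(get(objs[1]), m = 5, ts = "date", cs = "country")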
Hi Akthem,
I believe this error is due to the computer or R running out of RAM. You
could try to see if the code runs without the parallel argument (probably
setting m = 1 to test one imputation). Sometimes parallel doesn't handle
large data sets well. If you get an error message there, then it might be
the case that some of the internal copying of the data.frame is causing RAM
to max out (we do try to minimize this). Let us know if that works for you.
Cheers,
Matt
~~~~~~~~~~~
Matthew Blackwell
Assistant Professor of Government
Harvard University
url: http://www.mattblackwell.org
On Wed, Jan 10, 2018 at 8:52 AM, Akthem Rehab <akthem(a)gmail.com> wrote:
> Hi All,
>
> I am using Amelia to impute a time-series data set generated from sensors
> in an industrial setting. I am doing that for 8 variables (I picked only
> continuous variables for imputation) and ~40M readings (one reading per
> second).
>
> Here is my Amelia code:
>
> Test <- amelia(Query1[1:2e6, ], m = 3, p2s = 2, cs = NULL, ts = "TIME",
>                incheck = TRUE, parallel = "snow", ncpus = 3, collect = TRUE,
>                idvars = c("D78", "D82", "D83"),
>                lags = c("C0", "C1", "C5", "C6", "C16", "C17", "C18", "C19"),
>                leads = c("C0", "C1", "C5", "C6", "C16", "C17", "C18", "C19"))
>
> The code runs fine as long as the number of readings does not exceed
> ~1.2M. After that I receive the following error:
>
> Error in unserialize(node$con) : error reading from connection
>
> Some investigation shows that this has to do with the parallel workers. I
> noticed that the memory per worker does not exceed ~4GB and then goes back
> down before generating the error.
>
> I am running Windows Server 2016 with the Oracle Distribution of R v3.3.0.
> Amelia is version 1.7.4.
>
> I tried to troubleshoot with Oracle Community support before finding out
> that the issue also occurs when the data is a data.frame and not an
> ore.frame.
>
> Here is the link to the troubleshooting thread:
> https://community.oracle.com/thread/4109587
>
> I appreciate your support.
>
> Regards,
> Akthem
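A minimal sketch of the test Matt suggests, reusing the arguments from Akthem's post with the cluster turned off:

    library(Amelia)

    # One imputation without the snow cluster, to separate parallel-worker
    # failures from a genuine out-of-memory error in Amelia itself.
    test <- amelia(Query1[1:2e6, ], m = 1, p2s = 2, ts = "TIME",
                   idvars = c("D78", "D82", "D83"),
                   lags  = c("C0", "C1", "C5", "C6", "C16", "C17", "C18", "C19"),
                   leads = c("C0", "C1", "C5", "C6", "C16", "C17", "C18", "C19"),
                   parallel = "no")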
Hi,
I am using Amelia to simulate missing vote values for French elections
(before performing multiparty electoral data analysis using Clarify).
I need to make sure that after the simulation the sum of the votes for the
different parties (vFN + vPC + vPS + vUMP + vVerts) is below one (these are
the main parties and some much smaller parties are not included).
I thought the priors I generated (see below) would do the trick, but they
do not seem to work: in the imputed data I generate now, the sum of the
votes is often above 1, which makes no sense (and then causes problems in
Clarify with the logistic transformation).
Any idea of how I could handle that?
Many thanks in advance,
Best,
Julia
Here is my code:
library(readstata13)  # provides read.dta13()
library(Amelia)

database <- read.dta13("rall.dta")

prior <- matrix(NA, nrow = nrow(database), ncol = 5)
for (i in 1:nrow(database)) {
  v3  <- database$vFN[i]
  v5  <- database$vPC[i]
  v7  <- database$vPS[i]
  v9  <- database$vUMP[i]
  v11 <- database$vVerts[i]
  # Observation-level prior for column 3 (vFN): bounded between 0 and one
  # minus the sum of the other parties' shares. The parentheses matter here:
  # 1 - v5 + v7 + v9 + v11 would compute (1 - v5) + v7 + v9 + v11 instead.
  prior[i, ] <- c(i, 3, 0, 1 - (v5 + v7 + v9 + v11), 0.999999)
}
prior <- prior[!is.na(prior[, 4]), ]

a.out <- amelia(database, m = 5, ts = "year", cs = "district", priors = prior,
                lgstc = c("vFN", "vPC", "vPS", "vUMP", "vVerts"),
                bounds = rbind(c(4, 0, Inf), c(6, 0, Inf), c(8, 0, Inf),
                               c(10, 0, Inf), c(12, 0, Inf)))

write.amelia(obj = a.out, file.stem = "R19932012/outdata", format = "dta")
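A quick hedged check on the output of the code above: for each completed data set, compute the largest sum of the five party shares, so violations of the intended constraint show up immediately.

    shares <- c("vFN", "vPC", "vPS", "vUMP", "vVerts")
    # Any value above 1 flags rows where the imputed shares break the constraint.
    sapply(a.out$imputations, function(imp)
      max(rowSums(imp[, shares]), na.rm = TRUE))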
Hi, I’m new to using Amelia. I’m trying to impute missing data in a time-series cross-sectional data set, but I'm having trouble running amelia() the way I think I should. I would greatly appreciate some guidance.
I created a data.frame() that has 8 time points each for 260 participants and a single score column for which I’m trying to impute some missing data. The data frame has 2080 (i.e., 8*260) rows by 3 columns (“month”, “ID”, “score”).
With this, I tried to run the following command:
```
a.out <- amelia(data, ts="month", cs="ID", polytime=2, intercs=TRUE, p2s=2)
```
It reported the following (I terminated the run partway through after receiving errors):
amelia starting
beginning prep functions
Variables used: score time.1 time.2 time.3 ... time.119 time.120 time.... <truncated>
running bootstrap
-- Imputation 1 --
setting up EM chain indicies
1(300713)! 2
error: inv_sympd(): matrix seems singular
(216)! 3
error: inv_sympd(): matrix seems singular
(208)!
Warning message:
In amelia.prep(x = x, m = m, idvars = idvars, empri = empri, ts = ts, :
You have a small number of observations, relative to the number of variables in the imputation model. Consider removing some variables, or reducing the order of time polynomials to reduce the number of parameters.
I don’t understand the error. I also don’t understand how amelia() determined the `time.x` variables; I know it has something to do with my number of participants, but I don’t understand how. The warning message suggests I have too many variables because of this. When I tried the “freetrade” dataset, it used far fewer `time.x` variables (i.e., 26) even though there were only 19 time points in the data set, and it didn’t have problems.
Could someone explain the error, or what the problem might be, and what I should do to correct it?
Also, when using time-series data, do I use amelia() differently depending on whether the time variable represents chronological time (e.g., January, February, March, …) or time since onset (e.g., one month since birth, two months since birth, etc.)?
Please advise.
Best regards,
Lawrence
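A hedged note on the likely cause: with intercs = TRUE, amelia() lets the time polynomial vary across cross-sections, so the number of time.x terms grows with the 260 IDs; with only 2080 rows and a single score column, the EM steps can hit singular matrices, which matches the inv_sympd() errors above. Two lower-dimensional variants, sketched:

    library(Amelia)

    # Variant 1: keep the quadratic time trend but drop the per-ID interactions.
    a.out <- amelia(data, ts = "month", cs = "ID", polytime = 2,
                    intercs = FALSE, p2s = 2)

    # Variant 2: keep per-ID trends but lower the polynomial order, as the
    # warning message suggests.
    a.out <- amelia(data, ts = "month", cs = "ID", polytime = 1,
                    intercs = TRUE, p2s = 2)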
Hi, my name is Laura and I'm new to using Amelia. I want to use Amelia to
complete precipitation and temperature data, but I cannot find references
to studies that demonstrate such an application.
Does anyone know of studies or examples of use cases of the Amelia program
for precipitation and temperature data?
I would appreciate your help!
Best regards,
Laura Cabezas