Hi Ben,
You could use CEM to match on all the other variables and then use NN to
match on sale year while exactly matching on the CEM strata. That should
accomplish your goal, I think.
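A minimal sketch of that two-step idea, written in Python rather than Stata purely for illustration. It assumes each sale already carries a stratum label from a CEM run on the other covariates; the `stratum` field, the record layout, and the sample values are all hypothetical, and this is not how cem itself is implemented.

```python
from datetime import date

def nearest_in_stratum(treated, controls):
    """For each treated sale, pick the control in the same stratum
    whose sale date is closest (matching with replacement)."""
    matches = {}
    for t in treated:
        pool = [c for c in controls if c["stratum"] == t["stratum"]]
        if not pool:
            continue  # no control shares this stratum, so the sale stays unmatched
        best = min(pool, key=lambda c: abs((c["sold"] - t["sold"]).days))
        matches[t["id"]] = best["id"]
    return matches

# Hypothetical data: stratum labels stand in for the CEM strata.
treated = [
    {"id": "T1", "stratum": 7, "sold": date(2010, 12, 15)},
    {"id": "T2", "stratum": 9, "sold": date(2011, 6, 1)},
]
controls = [
    {"id": "C1", "stratum": 7, "sold": date(2010, 2, 1)},
    {"id": "C2", "stratum": 7, "sold": date(2011, 1, 10)},
    {"id": "C3", "stratum": 8, "sold": date(2011, 6, 2)},
]
result = nearest_in_stratum(treated, controls)
print(result)
```

Here T1 pairs with C2, which sold in the following calendar year but only a few weeks later, while T2 has no control in its stratum and drops out.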
Cheers,
Matt
On Friday, July 11, 2014, Ben Hoen <bhoen(a)lbl.gov> wrote:
Sorry, I missed this until now…
I could use the SIF, but I still run into the same issue, that I want to
avoid “edge effects” (my use of this terminology) where a sale date at one
end of a coarsening bin (say December 2010) cannot match with one very
close to the end of the next coarsening bin (say January 2011). I do
realize this is the same for all variables that are coarsened. But, I am
thinking that sale date is not like the other variables in that it is only
useful for its relative proximity to another. One could argue, of course,
that sale date has some absolute pertinence too (e.g., before, during or
after the housing bubble) but for my purposes I am more concerned with how
close (in days/weeks/months) the untreated sale date is to the treated.
For that reason a NN specification for this variable would be preferable.
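The edge effect described above can be made concrete with a small sketch (Python, for illustration only; the dates are hypothetical). It compares calendar-year bins with the actual gap in days.

```python
from datetime import date

def same_year_bin(a, b):
    """True when two sale dates fall in the same calendar-year bin."""
    return a.year == b.year

def days_apart(a, b):
    """Absolute gap between two sale dates, in days."""
    return abs((a - b).days)

treated = date(2010, 12, 15)
near = date(2011, 1, 15)   # one month later, but in the next year bin
far = date(2010, 1, 15)    # eleven months earlier, in the same year bin

for label, other in [("near", near), ("far", far)]:
    print(label, days_apart(treated, other), same_year_bin(treated, other))
```

The close sale is excluded by the year bin while the distant one is admitted, which is the opposite of what a proximity-based rule would do.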
Ben
Ben Hoen
LBNL
Office: 845-758-1896
Cell: 718-812-7589
*From:* Ariel Linden [mailto:ariel.linden@gmail.com]
*Sent:* Wednesday, July 09, 2014 2:49 PM
*To:* 'Ben Hoen'; 'Matt Blackwell'
*Cc:* cem(a)lists.gking.harvard.edu
*Subject:* RE: Can cem emulate nearest neighbor matching?
Have you considered using the Stata internal form (SIF) value of the date
instead? This is a continuous variable, so you could then use CEM’s
coarsening on that value….
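As a rough illustration of what the SIF value is: a Stata daily date counts days from 01jan1960, so it can be treated like any other continuous covariate. The sketch below (Python, for illustration) converts calendar dates to that count and assigns them to fixed-width bins. The 180-day width and the bin origin are assumptions and are not CEM's default cutpoints.

```python
from datetime import date

STATA_EPOCH = date(1960, 1, 1)  # Stata daily dates count days from 01jan1960

def to_sif(d):
    """Days elapsed since 01jan1960, i.e. the Stata daily-date value."""
    return (d - STATA_EPOCH).days

def bin_index(sif, width_days):
    """Index of the fixed-width bin that contains a daily-date value."""
    return sif // width_days

dec_2010 = to_sif(date(2010, 12, 15))
jan_2011 = to_sif(date(2011, 1, 15))
print(dec_2010, jan_2011, jan_2011 - dec_2010)
print(bin_index(dec_2010, 180), bin_index(jan_2011, 180))
```

With these particular bins the December and January sales land together, but the boundaries have only moved: two sales on either side of any cutpoint are still separated.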
*From:* Ben Hoen [mailto:bhoen@lbl.gov]
*Sent:* Wednesday, July 09, 2014 1:35 PM
*To:* 'Matt Blackwell'; 'Ariel Linden'
*Cc:* cem(a)lists.gking.harvard.edu
*Subject:* Can cem emulate nearest neighbor matching?
Matt (& Ariel)
One more follow-up question...
I have been matching exactly on sale year, wanting to ensure that the
treated pv home sales occur in the same year as the matching comparables.
Ideally, I would not match exactly on sale year, but rather the difference
between the treated sale date and the comparable sale dates. So, say, for
example, I would restrict my comparables to only those with a sale date
within 6 months, either before or after (i.e., +/- 0.5 years). This is more
like a nearest neighbor matching, and would avoid missing the matches that
are within a few months of each other, but in different years.
Because this is such a different approach than coarsening the data, I had
not considered it, but now wonder if there was a work around *you* have
considered and could propose for this situation.
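The +/- 6 month rule described above amounts to a caliper on sale date. A minimal sketch in Python (illustration only), assuming 183 days as a stand-in for half a year and using hypothetical dates:

```python
from datetime import date

def comparables_within(treated_date, comparable_dates, max_days=183):
    """Keep comparable sale dates at most max_days before or after the treated sale."""
    return [d for d in comparable_dates
            if abs((d - treated_date).days) <= max_days]

treated_sale = date(2010, 12, 15)
candidates = [date(2010, 3, 1), date(2011, 1, 10), date(2011, 6, 1), date(2011, 7, 1)]
kept = comparables_within(treated_sale, candidates)
print(kept)
```

Unlike a fixed bin, the window is centered on each treated sale, so no comparable is excluded merely for falling in a different calendar year.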
Thanks, in advance,
Ben
Ben Hoen
LBNL
Office: 845-758-1896
Cell: 718-812-7589
*From:* Matt Blackwell [mailto:m.blackwell@rochester.edu]
*Sent:* Tuesday, July 08, 2014 1:36 PM
*To:* Ariel Linden
*Cc:* Ben Hoen; cem(a)lists.gking.harvard.edu
*Subject:* Re: [cem] Understanding CEM's use of a categorical variable and #0
Hi Ben,
Ah, taking a look, I've figured it out. The one you want to use is the
"bgcode," for two reasons. First, as you've guessed, CEM doesn't work with
string variables, only numerics (I had forgotten this in my last reply).
Second, the "bgnum" variable also trips up Stata because of
the size of the numbers. In the CEM internals, Stata is treating two
numbers with the same scientific notation (1.2e+12) as the same. This is
why there are more matches with that version than the other. Thus, your
best bet is the "bgcode." Hope that helps and sorry for any confusion.
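To illustrate how distinct large identifiers can collapse, here is a small Python sketch. The thread does not say exactly how the CEM internals represent these numbers, so the two-significant-digit formatting below is an assumption chosen to mirror the "1.2e+12" example, and the IDs are hypothetical.

```python
def short_label(x):
    """Format a number in scientific notation with two significant digits."""
    return format(float(x), ".1e")

# Hypothetical 13-digit identifiers that differ only in their last digits.
ids = [1200000000011, 1200000000025, 1200000000038]

labels = {short_label(i) for i in ids}
print(labels)  # every ID maps to the same label

# Recoding to small consecutive integers (the role "bgcode" plays in the
# thread) keeps the categories distinct.
codes = {value: k for k, value in enumerate(sorted(set(ids)), start=1)}
print(codes)
```

Three different block groups become one category under the short label, which would merge strata and inflate the number of matches, consistent with the behavior described above.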
Cheers,
Matt
~~~~~~~~~~~
Matthew Blackwell
Assistant Professor of Government
Harvard University
url:
http://www.mattblackwell.org
On Tue, Jul 8, 2014 at 11:16 AM, Ariel Linden <ariel.linden(a)gmail.com>
wrote:
Ben – quick response to your last question about missing values: if CEM
excludes units with missing values (Matt can clarify), you can generate a
missing value indicator for the given variable and use that in the matching
procedure. So you’d basically be matching on the pattern of missingness of
that variable. A more comprehensive approach would be to use -mi- to impute
missing values (CEM can be run on datasets that were generated for multiple
imputation – see Matt’s paper in the Stata Journal for discussion).
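A minimal sketch of the missing-indicator idea, in Python for illustration. The placeholder fill value is an assumption: whether one is needed depends on how CEM handles missing values, which the thread leaves open.

```python
def add_missing_indicator(values, fill=0.0):
    """Return (filled_values, missing_flags) for a list that uses None for missing."""
    flags = [1 if v is None else 0 for v in values]
    filled = [fill if v is None else v for v in values]
    return filled, flags

raw = [12.0, None, 55.0, None]   # hypothetical covariate with gaps
filled, is_missing = add_missing_indicator(raw)
print(filled)
print(is_missing)
```

Matching exactly on the flag keeps units with a missing value together in their own group, while the user-chosen breaks still apply to the observed values.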
*From:* cem-bounces(a)lists.gking.harvard.edu
[mailto:cem-bounces@lists.gking.harvard.edu] *On Behalf Of* Ben Hoen
*Sent:* Tuesday, July 08, 2014 10:30 AM
*To:* 'Matt Blackwell'
*Cc:* cem(a)lists.gking.harvard.edu
*Subject:* Re: [cem] Understanding CEM's use of a categorical variable and #0
Hi Matt (& Ariel)
I have dug a bit more into this and am even more confused. I am leaving,
for the time being, the issue of larger or smaller geographies, and instead
am focusing on variable form (e.g., text vs. numerical).
I ran an experiment to test a few things, and have attached the output
from that (as well as the sample dataset FYI, if you wanted to try to
duplicate the experiment).
The research question was the following: does cem care if a variable is
entered as a string, an ordinal long integer or a non-ordinal long integer
(assuming each variable has the same number of unique values)?
To test this, I ran cem three times, once each for blockgroup as a text
variable (bgname, used in cem1), as an encoded ordinal variable (bgcode,
used in cem2), and as a non-ordinal variable (bgnum, used in cem3). In
addition to the block group I include the variables we discussed
previously, namely sfla, age, acres and saleyear. In each case I am
matching exactly on blockgroup (and saleyear), by using the “(#0)” syntax.
Prior to this I used codebook to examine the three variables (and the
other variables in cem) and see that in each case the blockgroup variables
have 359 unique values and no missing values for the variable.
When running cem I find that cem1 has 333 strata, cem2 has 5283 strata,
and cem3 has 1510 strata. Of course, when there are more strata, there are
fewer matches, so cem1 produces 425 matching treated cases, cem2 277 and
cem3 384.
Why is it that cem is treating these three forms of the same variable so
differently?
Separately, I have an additional variable, which I left out, but which has
some missing values. If I show the breaks for this variable as “(0 10 25
50 90)”, does cem create an additional stratum for missing values? If not,
is there a way to do this while still maintaining (some) control of the
breaks?
Thanks for all your help with this. cem is a great program and has aided
me in my work tremendously.
Ben
Ben Hoen
LBNL
Office: 845-758-1896
Cell: 718-812-7589
*From:* Matt Blackwell [mailto:m.blackwell@rochester.edu]
*Sent:* Monday, July 07, 2014 10:36 PM
*To:* Ben Hoen
*Cc:* cem(a)lists.gking.harvard.edu
*Subject:* Re: [cem] Understanding CEM's use of a categorical variable and #0
Hi Ben,
My immediate guess would be the missing data on the county variable, which
may be interacting strangely with the string variables. Maybe try two
things: 1) create numeric versions of both and repeat the matches, and 2)
drop the missing county observations and compare the matches then.
Cheers,
Matt
On Mon, Jul 7, 2014 at 10:17 PM, Ben Hoen <bhoen(a)lbl.gov> wrote:
Just realized that blockgroup and county are both strings. See below:
That likely is NOT what cem is looking for, is it? Source of the problem?
(And yes, block group variable, which is the census number, is unique
across counties)
Ben
Ben Hoen
LBNL
Office: 845-758-1896
Cell: 718-812-7589
*From:* Matt Blackwell [mailto:m.blackwell@rochester.edu]
*Sent:* Monday, July 07, 2014 10:10 PM
*To:* Ben Hoen
*Cc:* cem(a)lists.gking.harvard.edu
*Subject:* Re: [cem] Understanding CEM's use of a categorical variable and #0
Hi Ben,
Hm, it definitely should produce more matches when you use county. One
possible issue that I can think of off the top of my head is this: is the
block group variable unique across counties/states? Or do the values of the
block group variable repeat? One thing to check is what happens if you
exact match on both the county and the block group in a single match.
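One way to run the uniqueness check suggested above, sketched in Python for illustration. The (county, blockgroup) pairs are hypothetical stand-ins for the two columns of the dataset.

```python
from collections import defaultdict

def blockgroups_spanning_counties(rows):
    """Map each block group value seen in more than one county to its sorted counties."""
    seen = defaultdict(set)
    for county, blockgroup in rows:
        seen[blockgroup].add(county)
    return {bg: sorted(c) for bg, c in seen.items() if len(c) > 1}

# Hypothetical data: block group "0001" appears in two different counties.
rows = [("A", "0001"), ("A", "0002"), ("B", "0001"), ("B", "0003")]
repeats = blockgroups_spanning_counties(rows)
print(repeats)
```

An empty result would mean the block group values are unique across counties, so exact matching on block group alone already implies matching on county.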
Hope that helps! If it doesn't, definitely let us know.
Cheers,
Matt
~~~~~~~~~~~
Matthew Blackwell
Assistant Professor of Government
Harvard University
url:
http://www.mattblackwell.org
On Mon, Jul 7, 2014 at 9:36 PM, Ben Hoen <bhoen(a)lbl.gov> wrote:
Hi all,
I have been using the program cem in Stata (Version 13 MP, with Windows 7
Pro 64 bit), and thought I understood what it was doing well enough but
today something occurred which surprised (read worried) me, in that it
acted as I would NOT have expected it to.
I am trying to match target (i.e., treated) homes to similar (i.e.,
"comparable") homes that do not have the treatment. In this case, the
"treatment" is whether the home does or does not have a photovoltaic energy
system (pv). I have 100 pv homes (treated), and ~ 5,000 non-pv homes
(comparable).
To match these homes I am using some basic characteristics of the home -
e.g., square feet of living space (sfla), size of the parcel (acres), age
of the home (age), as well as the year in which it sold (sale year) to
ensure the comparable home sold in the same year as the target home and,
finally, a geographic variable (such as the block group) to ensure the
comparable home is located in the same geography. For sale year and the
geography, they must match perfectly; i.e., the comparable homes must have
sold in the same year as the target (pv) home *and* also be located in
the same geography. For the purposes of this discussion those geographies
could be either the census block group (blockgroup) or the county (county).
All of the block groups fall within the counties, and there are many more
block groups than counties delineated in the data. For example, I have
approximately 30 block groups (each with at least one treated and one
comparable case) and 10 counties (each with at least one treated and one
comparable). In practice, though, in most geographies I have ~ 20-50 times
the number of pv homes available as comparables to match to.
Using the sample data and talking to local experts, I have established
appropriate cut points for my various characteristics and run a command
similar to the following, when blockgroup is used as the geography:
cem sfla(0 1000 2000 3000 5000) age(0 1 10 20 100) acres(0.05 0.15 0.5 1 10) saleyear(#0) blockgroup(#0), treatment(pv)
And the following, when county is used as the geography:
cem sfla(0 1000 2000 3000 5000) age(0 1 10 20 100) acres(0.05 0.15 0.5 1 10) saleyear(#0) county(#0), treatment(pv)
So, here's the confusing part:
I will have ~ 70 matching pv homes and 300 comparable homes if blockgroup
is used, but only 20 matching pv homes and 100 comparable homes if county
is used. In other words, when I allow a broader geography of comparables to
be drawn from, I get fewer matching cases. I would think the exact opposite
would be the case; if I cast a broader geographic net, I would have more
matches, not fewer.
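The expectation stated above can be checked on a toy example. If block groups nest inside counties and all other coarsenings are held fixed, every block-group stratum sits inside a county stratum, so the set of matched treated homes can only stay the same or grow. The sketch below is Python for illustration, with hypothetical records; it mimics exact matching on the listed keys and is not cem's implementation.

```python
from collections import defaultdict

def matched_treated(units, keys):
    """IDs of treated units that share a stratum (equal values on `keys`)
    with at least one control."""
    strata = defaultdict(lambda: {"treated": [], "controls": 0})
    for u in units:
        cell = strata[tuple(u[k] for k in keys)]
        if u["pv"]:
            cell["treated"].append(u["id"])
        else:
            cell["controls"] += 1
    return {i for cell in strata.values() if cell["controls"]
            for i in cell["treated"]}

# Hypothetical homes; block groups A1 and A2 lie in county A, B1 in county B.
units = [
    {"id": 1, "pv": 1, "saleyear": 2010, "county": "A", "blockgroup": "A1"},
    {"id": 2, "pv": 1, "saleyear": 2010, "county": "A", "blockgroup": "A2"},
    {"id": 3, "pv": 0, "saleyear": 2010, "county": "A", "blockgroup": "A1"},
    {"id": 4, "pv": 1, "saleyear": 2011, "county": "B", "blockgroup": "B1"},
    {"id": 5, "pv": 0, "saleyear": 2010, "county": "B", "blockgroup": "B1"},
]
by_blockgroup = matched_treated(units, ["saleyear", "blockgroup"])
by_county = matched_treated(units, ["saleyear", "county"])
print(by_blockgroup, by_county)
```

Getting fewer matches with county than with block group therefore points to a data or encoding problem rather than to the matching logic.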
Any ideas why this would occur?
Thanks, in advance, for any insight you could offer.
Ben
Berkeley Lab
Ben Hoen
Staff Research Associate
Lawrence Berkeley National Laboratory
Office: 845-758-1896
Cell: 718-812-7589
bhoen(a)lbl.gov
http://emp.lbl.gov/staff/ben-hoen
Visit our publications at:
http://emp.lbl.gov/reports/re
Sign up for our email list to receive publication notifications at:
https://spreadsheets.google.com/a/lbl.gov/spreadsheet/viewform?formkey=dGlF…
--
cem Mailing List, served by HUIT
Send messages: cem(a)lists.gking.harvard.edu
[un]subscribe Options:
http://lists.gking.harvard.edu/?info=cem
More information on cem:
http://gking.harvard.edu/cem
Cem mailing list
Cem(a)lists.gking.harvard.edu
To unsubscribe from this list or get other information:
https://lists.gking.harvard.edu/mailman/listinfo/cem