Can cem emulate nearest neighbor matching? - Cem

9 Jul 2014

Matt (& Ariel)

One more follow-up question...  

I have been matching exactly on sale year, wanting to ensure that the treated pv home
sales occur in the same year as the matching comparables.  Ideally, I would not match
exactly on sale year, but rather the difference between the treated sale date and the
comparable sale dates.  So, say, for example, I would restrict my comparables to only
those with a sale date within 6 months, either before or after (i.e., +/- 0.5 years). This
is more like a nearest neighbor matching, and would avoid missing the matches that are
within a few months of each other, but in different years.

Because this is such a different approach than coarsening the data, I had not considered
it, but now wonder if there was a work around you have considered and could propose for
this situation.

Thanks, in advance,

Ben

Ben Hoen

LBNL

Office: 845-758-1896

Cell: 718-812-7589

From: Matt Blackwell [mailto:m.blackwell@rochester.edu] 
Sent: Tuesday, July 08, 2014 1:36 PM
To: Ariel Linden
Cc: Ben Hoen; cem(a)lists.gking.harvard.edu
Subject: Re: [cem] Understaning CEM's use of a categorical variable and #0

Hi Ben, 

Ah, taking a look, I've figured it out. First, the one you want to use is the
"bgcode." This is for two reasons. First, as you've guessed, CEM
doesn't work with string variables, only numerics (I had forgotten this in my last
reply). Second, the "bgnum" variable also trips us Stata because of the size of
the numbers. In the CEM internals, Stata is treating two numbers with the same scientific
notation (1.2e+12) as the same. This is why there are more matches with that version than
the other. Thus, your best bet is the "bgcode." Hope that helps and sorry for
any confusion. 

Cheers,

Matt

~~~~~~~~~~~

Matthew Blackwell

Assistant Professor of Government

Harvard University

url: http://www.mattblackwell.org

On Tue, Jul 8, 2014 at 11:16 AM, Ariel Linden &lt;ariel.linden(a)gmail.com&gt; wrote:

Ben – quick response to your last question about missing values: if CEM excludes units
with missing values (Matt can clarify), you can generate a missing value indicator for the
given variable and use that in the matching procedure. So you’d basically be matching on
the pattern of missingness of that variable. A more comprehensive approach would be to use
–mi- to impute missing values (CEM can be run on datasets that were generated for multiple
imputation – see Matt’s paper in the Stata Journal for discussion).

From: cem-bounces(a)lists.gking.harvard.edu [mailto:cem-bounces@lists.gking.harvard.edu] On
Behalf Of Ben Hoen
Sent: Tuesday, July 08, 2014 10:30 AM
To: 'Matt Blackwell'

Cc: cem(a)lists.gking.harvard.edu
Subject: Re: [cem] Understaning CEM's use of a categorical variable and #0

Hi Matt (& Ariel)

I have dug a bit more into this and am even more confused.  I am leaving, for the time
being, the issue of larger or smaller geographies, and instead am focusing on variable
form (e.g., text vs. numerical).   

I ran an experiment to test a few things, and have attached the output from that (as well
as the sample dataset FYI, if you wanted to try to duplicate the experiment).

The research question was the following: does cem care if a variable is entered as a
string, an ordinal long integer or a non-ordinal long integer (assuming each variable has
the same number of unique values)?

To test this I ran cem three times once each for blockgroup as a text variable (bgname,
used in cem1), as an encoded ordinal variable (bgcode, used in cem2), and as a non-ordinal
variable (bgnum, used in cem3).  In addition to the block group I include the variables we
discussed previously, namely sfla, age, acres and saleyear.  In each case I am matching
exactly on blockgroup (and saleyear), by using the “(#0)” syntax.

Prior to this I used codebook to examine the three variables (and the other variables in
cem) and see that in each case the blockgroup variables have 359 unique values and no
missing values for the variable.

When running cem I find that cem1 has 333 strata, cem2 has 5283 strata, and cem3 has 1510
strata.  Of course, when there are more strata, there are fewer matches, so cem1 produces
425 matching treated cases, cem2 277 and cem3 384.

Why is it that cem is treating these three forms of the same variable so differently?

Separately, I have an additional variable, which I left out, but which has some missing
values.  If I show the breaks for this variables as: “(0 10 25 50 90)” does cem create an
additional strata for missing values?  If not, is there a way to do this, while still
maintaining (some) control of the breaks?

Thanks, for all you help with this.  cem is a great program and has aided me in my work
tremendously.

Ben

Ben Hoen

LBNL

Office: 845-758-1896

Cell: 718-812-7589

From: Matt Blackwell [mailto:m.blackwell@rochester.edu] 
Sent: Monday, July 07, 2014 10:36 PM
To: Ben Hoen
Cc: cem(a)lists.gking.harvard.edu
Subject: Re: [cem] Understaning CEM's use of a categorical variable and #0

Hi Ben, 

My immediate guess would be the missing data on the county variable, which may be
interacting strangely with the string variables. Maybe try two things: 1) creating numeric
versions of both and repeat the matches and 2) try dropping the missing county
observations and comparing the matches then. 

Cheers,

Matt

On Mon, Jul 7, 2014 at 10:17 PM, Ben Hoen &lt;bhoen(a)lbl.gov&gt; wrote:

Just realized that blockgroup and county are both strings.  See below:

That likely is NOT what cem is looking for is it?  Source of the problem?

(And yes, block group variable, which is the census number, is unique across counties)

Ben

Ben Hoen

LBNL

Office: 845-758-1896

Cell: 718-812-7589

From: Matt Blackwell [mailto:m.blackwell@rochester.edu] 
Sent: Monday, July 07, 2014 10:10 PM
To: Ben Hoen
Cc: cem(a)lists.gking.harvard.edu
Subject: Re: [cem] Understaning CEM's use of a categorical variable and #0

Hi Ben, 

Hm, it definitely should produce more matches when you use county. One possible issue that
I can think of off the top of my head is this: is the block group variable unique across
counties/states? Or do the values of the block group variable repeat? One thing to check
is to see if what happens if you exact match on both the county and the block group in a
single match. 

Hope that helps! If it doesn't, definitely let us know. 

Cheers,

Matt

~~~~~~~~~~~

Matthew Blackwell

Assistant Professor of Government

Harvard University

url: http://www.mattblackwell.org
<https://urldefense.proofpoint.com/v1/url?u=http://www.mattblackwell.org&k=p4Ly7qpEBiYPBVenR9G2iQ%3D%3D%0A&r=jLgdG6f%2BQq4pzHWI0S37ROhc5Jfy9q9oKEsPDdQXskc%3D%0A&m=itjJht7%2BmFWNAbifa5uoLvqjPfdC8XDnUU48G8V8o%2BU%3D%0A&s=695d0f21125e2ab6cc12de157ff03933eff5e50e0ff113db3f710267505cf77e>

On Mon, Jul 7, 2014 at 9:36 PM, Ben Hoen &lt;bhoen(a)lbl.gov&gt; wrote:

Hi all,

I have been using the program cem in Stata (Version 13 MP, with Windows 7 Pro 64 bit), and
thought I understood what it was doing well enough but today something occurred which
surprised (read worried) me, in that it acted as I would NOT have expected it to.

I am trying to match target (i.e,, treated) homes to similar (i.e.,
"comparable") homes that do not have the treatment. In this case, the
"treatment" is whether the home does or does not have a photovoltaic energy
system (pv). I have 100 pv homes (treated), and ~ 5,000 non-pv homes (comparable).

To match these homes I am using some basic characteristics of the home - e.g., square feet
of living space (sfla), size of the parcel (acres), age of the home (age), as well as the
year in which it sold (sale year) to ensure the comparable home sold in the same year as
the target home and, finally, a geographic variable (such as the block group) to ensure
the comparable home is located in the same geography. For sale year and the geogrpahy,
they must match perfectly; i.e., the comparable homes must have sold in the same year as
the target (pv) home and also be located in the same geography. For the purposes of this
discussion those geographies could be either the census block group (blockgroup) or the
county (county). All of the block groups fall within the counties, and there are many more
block groups than counties delineated in the data. For example, I have approximately 30
block groups (each with at least one treated and one comparable case) and 10 counties
(each with at least one treated and one comparable). In practice, though, in most
geographies I have ~ 20-50 times the number of pv homes available as comparables to match
to.

Using the sample data and talking to local experts, I have established appropriate cut
points for my various characteristics and run a command similar to the following, when
blockgroup is used as the geography:

cem sfla(0 1000 2000 3000 5000) age(0 1 10 20 100) acres(0.05 0.15 0.5 1 10) saleyear(#0)
blockgroup(#0) , treatment(pv)

And the following, when county is used as the geography:

cem sfla(0 1000 2000 3000 5000) age(0 1 10 20 100) acres(0.05 0.15 0.5 1 10) saleyear(#0)
county(#0) , treatment(pv)

So, here's the confusing part:

I will have ~ 70 matching pv homes, and 300 comparable homes if blockgroup is used, but
only 20 matching pv homes, and 100 comparables homes if county is used. In other words,
when I allow a broader geography of comparables to be drawn from, I get fewer matching
cases. i would think the exact opposite would be the case; if a cast a broader geographic
net, I would have more matches not less. 

Any ideas why this would occur?

Thanks, in advance, for any insight you could offer.

Ben
Berkeley Lab

Ben Hoen

Staff Research Associate 

Lawrence Berkeley National Laboratory

Office: 845-758-1896

Cell: 718-812-7589

bhoen(a)lbl.gov

http://emp.lbl.gov/staff/ben-hoen
<https://urldefense.proofpoint.com/v1/url?u=http://emp.lbl.gov/staff/ben-hoen&k=AjZjj3dyY74kKL92lieHqQ%3D%3D%0A&r=wldobffzOUTOxpSiBCeaJ8koG11T3tB%2FizPx3rQIeN4%3D%0A&m=YOEeVogLM2TPKRP%2BPbYrnY%2FVTGm0ZObcn2JParSlHSs%3D%0A&s=9efd544f111d8f4f87d1c1fe71296892b9a4dd539a4458113a3e19e6c60267d3>

Visit our publications at: 

http://emp.lbl.gov/reports/re
<https://urldefense.proofpoint.com/v1/url?u=http://emp.lbl.gov/reports/re&k=AjZjj3dyY74kKL92lieHqQ%3D%3D%0A&r=wldobffzOUTOxpSiBCeaJ8koG11T3tB%2FizPx3rQIeN4%3D%0A&m=YOEeVogLM2TPKRP%2BPbYrnY%2FVTGm0ZObcn2JParSlHSs%3D%0A&s=fe142ea1bc9393284c0f77085e541a15ef862edbd0cd78a36c396b1ec9e57573>

Sign up for our email list to receive publication notifications at: 

https://spreadsheets.google.com/a/lbl.gov/spreadsheet/viewform?formkey=dGlF…
<https://urldefense.proofpoint.com/v1/url?u=https://spreadsheets.google.com/a/lbl.gov/spreadsheet/viewform?formkey%3DdGlFS1U1NFlUNzQ1TlBHSzY2VGZuN1E6MQ&k=AjZjj3dyY74kKL92lieHqQ%3D%3D%0A&r=wldobffzOUTOxpSiBCeaJ8koG11T3tB%2FizPx3rQIeN4%3D%0A&m=YOEeVogLM2TPKRP%2BPbYrnY%2FVTGm0ZObcn2JParSlHSs%3D%0A&s=69dbd2f0fc1d7f8a11f4740cd616c8153b61dafd188209081b767928df00cc0b>

-
--
cem Mailing List, served by HUIT
Send messages: cem(a)lists.gking.harvard.edu
[un]subscribe Options: http://lists.gking.harvard.edu/?info=cem
More information on cem: http://gking.harvard.edu/cem
Cem mailing list
Cem(a)lists.gking.harvard.edu

To unsubscribe from this list or get other information:

https://lists.gking.harvard.edu/mailman/listinfo/cem

-
--
cem Mailing List, served by HUIT
Send messages: cem(a)lists.gking.harvard.edu
[un]subscribe Options: http://lists.gking.harvard.edu/?info=cem
More information on cem: http://gking.harvard.edu/cem
Cem mailing list
Cem(a)lists.gking.harvard.edu

To unsubscribe from this list or get other information:

https://lists.gking.harvard.edu/mailman/listinfo/cem