Re: [cem] Few matched strata (and individuals within strata) after implementing cem - Cem

5 Apr 2017

Sergio, the idea you describe in your first paragraph has been formalized
with this algorithm <http://projects.iq.harvard.edu/frontier/home>, and so
that's another option.  With CEM, you would decide how important it is to
get matches for each variable and coarsen more for less important
variables.

Gary
--
*Gary King* - Albert J. Weatherhead III University Professor - Director,
IQSS <http://iq.harvard.edu/> - Harvard University
GaryKing.org - King(a)Harvard.edu - @KingGary <https://twitter.com/kinggary> -
617-500-7570 - Assistant <king-assist(a)iq.harvard.edu>: 617-495-9271

On Wed, Apr 5, 2017 at 12:32 PM, Sergio Salis <Sergio.Salis(a)natcen.ac.uk>
wrote:

  Hi Gary,

 Thanks very much for your advice. I understand the idea is to try
 different coarsening strategies (among those which make sense) for each
 variable and see which one produces the lowest imbalance, as measured by
 the multivariate L1 distance (the univariate imbalances should also be
 looked at individually). Is this correct?
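
For readers unfamiliar with the measure mentioned above, here is a minimal Python sketch of the multivariate L1 distance as it is usually defined: half the sum, over all cells of the cross-tabulated coarsened covariates, of the absolute difference between the treated and control relative frequencies. The function name and the cell labels are hypothetical, and this is not the cem implementation itself.

```python
from collections import Counter

def l1_imbalance(treated_cells, control_cells):
    """Multivariate L1 distance between two groups.

    Each argument is a list with one entry per unit, giving the cell
    (combination of coarsened covariate values) the unit falls into.
    Returns 0 for identical distributions and 1 for complete separation.
    """
    f = Counter(treated_cells)   # treated counts per cell
    g = Counter(control_cells)   # control counts per cell
    n_t, n_c = len(treated_cells), len(control_cells)
    cells = set(f) | set(g)
    return 0.5 * sum(abs(f[c] / n_t - g[c] / n_c) for c in cells)

# Made-up cells for illustration only (not data from this thread).
treated = ["a", "a", "b", "c"]
control = ["a", "b", "b", "d"]
print(l1_imbalance(treated, control))
```

Note that the value depends on how the cells are defined, so comparisons across coarsenings are only meaningful if the imbalance is always computed on the same fixed set of bins.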

 For variables like income and assets I guess it makes sense to use
 percentiles as there is no obvious value to create cut-off points. If so,
 shall I use

 cem var1 var2 …. income(P1 P2 …. Pn), treatment(treated)

 (where P1=value of the 1st percentile, P2=value of the 2nd percentile …..
 Pn=value of the last percentile)?
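
As a rough illustration of the percentile idea, the following Python sketch computes quantile-based cutpoints, which would play the role of P1 … Pn in the command above. The income values and helper names are made up for illustration. The sketch uses quintiles rather than all percentiles, on the assumption that very fine bins would make the shortage of matches worse rather than better.

```python
import statistics
from bisect import bisect_left

# Hypothetical, right-skewed income values (not data from this thread).
income = [120, 150, 180, 200, 230, 260, 300, 340, 400, 450,
          520, 600, 700, 850, 1000, 1300, 1700, 2400, 3500, 6000]

def quantile_cutpoints(values, bins):
    """Return the bins-1 interior cutpoints that split values into
    groups of (roughly) equal frequency."""
    return statistics.quantiles(values, n=bins)

def bin_counts(values, cuts):
    """Count how many values fall into each interval defined by cuts."""
    counts = [0] * (len(cuts) + 1)
    for v in values:
        counts[bisect_left(cuts, v)] += 1
    return counts

cuts = quantile_cutpoints(income, 5)
print(cuts)                      # four interior cutpoints
print(bin_counts(income, cuts))  # units per bin
```

Different software interpolates percentiles slightly differently, so the cutpoints from Stata may not match these exactly.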

 (#10) will produce 10 equally sized bins, but I am not sure whether equal
 size means equal width (e.g. bin 1 includes those with income from 1 to
 10, bin 2 those with income from 11 to 20, etc.) or equal frequencies (in
 which case the bin boundaries are percentiles). I am also not sure what
 Sturges' rule and Scott’s algorithm are; I cannot find any description in
 the Stata help file.
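
To make the two readings of "equally sized" concrete, and to show the textbook forms of the two rules asked about, here is a small Python sketch. It does not document what cem does internally: the function names are hypothetical, the data are made up, and the formulas are the standard histogram rules (Sturges: ceil(log2 n) + 1 bins; Scott: bin width of about 3.49 × standard deviation × n^(-1/3)).

```python
import math
import statistics

# Hypothetical, right-skewed income values (not data from this thread).
income = [120, 150, 180, 200, 230, 260, 300, 340, 400, 450,
          520, 600, 700, 850, 1000, 1300, 1700, 2400, 3500, 6000]

def equal_width_cuts(values, bins):
    """Interior cutpoints for bins of equal width between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + i * width for i in range(1, bins)]

def sturges_bins(n):
    """Sturges' rule: number of bins suggested for n observations."""
    return math.ceil(math.log2(n)) + 1

def scott_bin_width(values):
    """Scott's rule: suggested bin width based on the sample std. dev."""
    return 3.49 * statistics.stdev(values) * len(values) ** (-1 / 3)

cuts = equal_width_cuts(income, 5)
# Units per equal-width bin; intervals are closed on the right.
edges = [float("-inf")] + cuts + [float("inf")]
counts = [sum(1 for v in income if a < v <= b)
          for a, b in zip(edges, edges[1:])]
print(cuts, counts)
print(sturges_bins(8208 + 1584))   # sample size reported in this thread
print(scott_bin_width(income))
```

On skewed data like this, equal-width bins put most units in the first bin and leave some bins empty, whereas equal-frequency bins spread units evenly by construction.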

 Thanks again for your help, very much appreciated.

 Sergio

 *From:* Gary King [mailto:king@harvard.edu]
 *Sent:* 05 April 2017 16:06
 *To:* Sergio Salis
 *Cc:* cem(a)lists.gking.harvard.edu
 *Subject:* Re: [cem] Few matched strata (and individuals within strata)
 after implementing cem

 Hi Sergio, you can adjust the coarsening rather than using the defaults in
 CEM. Coarser bins will generate more matched observations. You want to make
 the choices based on the substance of the variables, and on which ones are
 more important to match finely on.
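
The trade-off described above can be sketched in a few lines of Python. This is a toy version of exact matching on one coarsened variable, not the cem program: units are assigned to strata by bin, and only strata containing both treated and control units are kept. All names and ages are made up for illustration.

```python
from collections import defaultdict

def matched_counts(units, width):
    """units: list of (treated, age) pairs with treated in {0, 1}.
    Bins age into intervals of the given width and returns
    (matched treated, matched control) over strata that contain
    at least one unit of each group."""
    strata = defaultdict(lambda: [0, 0])   # [control count, treated count]
    for treated, age in units:
        strata[age // width][treated] += 1
    kept = [c for c in strata.values() if c[0] > 0 and c[1] > 0]
    return sum(c[1] for c in kept), sum(c[0] for c in kept)

# Hypothetical units: 4 treated and 6 controls.
units = [(1, 23), (1, 37), (1, 41), (1, 58),
         (0, 21), (0, 29), (0, 33), (0, 44), (0, 52), (0, 66)]

print(matched_counts(units, 5))    # fine bins
print(matched_counts(units, 20))   # coarse bins
```

With these made-up ages, 5-year bins keep 2 treated and 2 control units, while 20-year bins keep 4 treated and 5 control units, at the cost of treating ages up to 20 years apart as equivalent.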

 Gary

 On Wed, Apr 5, 2017 at 10:04 AM, Sergio Salis <Sergio.Salis(a)natcen.ac.uk>
 wrote:

 Hi all,

 I’m considering using the cem Stata programme to evaluate the impact of a
 welfare-to-work programme in the UK. However, I have never used cem before
 so I am trying to understand some basic issues before proceeding with the
 estimation.

 The first thing I’d be interested in understanding is: how does one handle
 situations where, after running cem, the number of matched strata (and of
 units within them) is very small?

 Applying the cem algorithm to data from a previous impact evaluation I get:

 Number of strata: 8883
 Number of matched strata: 132

                 0      1
       All    8208   1584
   Matched     179    141
 Unmatched    8029   1443

 If I calculate the ATT using cem matched data I get an impact estimate
 which is positive (around 5ppts; based on 320 obs only) while using
 psmatch2 on all data (i.e. not only those in cem matched strata; around
 8,237 obs are used) with kernel weights I get an estimate of around
 -5.7ppts. This means I reach opposite conclusions about the impact of the
 programme of interest using cem and psmatch2.

 I understand the cem-based estimates are based on better matched data
 (i.e. produce less biased estimates) than my psmatch2 estimate with
 kernel weights, but this comes at the expense of external validity:
 inference on the initial population is made based on a very small subset of
 data (estimates based on cem are not statistically significant while my
 original estimate was highly significant). Any advice about how one can
 handle situations of this type?

 Many thanks,

 Sergio

 NatCen Social Research
 35 Northampton Square
 London EC1V 0AX
 020 7250 1866
