In recent years, propensity score matching (PSM) has gained attention as a potential method for estimating the impact of public policy programs in the absence of experimental evaluations. In this study, we evaluate the usefulness of PSM for estimating the impact of a program change in an educational context (Tennessee’s Student Teacher Achievement Ratio Project [Project STAR]).
Because Tennessee’s Project STAR experiment involved an effective random assignment procedure, the experimental results from this policy intervention can be used as a benchmark against which to compare the impact estimates produced by propensity score matching methods. We use several different methods to assess these nonexperimental estimates of the impact of the program.
We try to determine “how close is close enough,” putting greatest emphasis on the question: Would the nonexperimental estimate have led to the wrong decision about the program when compared to the experimental estimate? We find that propensity score methods perform poorly with respect to measuring the impact of a reduction in class size on achievement test scores.
We conclude that further research is needed before policymakers rely on PSM as an evaluation tool.
Table 5: Project STAR regression-adjusted estimates of program effect using experimental controls and nonexperimental comparison groups. Robust standard errors clustered at the classroom level in parentheses. Asterisk indicates statistical significance at the 5% level. See Appendix B for discussion of the standard errors.
…How well do the propensity score matched estimates of the impact of class size approximate the experimental impact estimates? Table 5 gives us the opportunity to look at 11 different cases for which there are two sets of estimates of the impact of smaller class size on combined reading and math test scores, one generated by random assignment of students and teachers to a treatment group or control group and the other using a comparison group created by PSM.
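The general comparison-group construction referenced here can be illustrated with a minimal sketch: estimate each unit's probability of treatment from observed covariates, then match each treated unit to the control unit with the nearest propensity score. The data, the single confounder, and the hand-rolled logistic regression below are all hypothetical illustrations, not the authors' actual specification or the Project STAR data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Hypothetical confounder that drives both treatment assignment and the outcome.
x = rng.normal(size=n)
p_treat = 1 / (1 + np.exp(-(x - 0.5)))                 # assumed assignment model
t = (rng.uniform(size=n) < p_treat).astype(float)
y = 2.0 * t + 1.5 * x + rng.normal(scale=1.0, size=n)  # true effect set to 2.0

# Step 1: estimate propensity scores with a plain logistic regression,
# fit here by gradient descent (in practice a packaged estimator would be used).
X = np.column_stack([np.ones(n), x])
w = np.zeros(2)
for _ in range(2000):
    p = 1 / (1 + np.exp(-X @ w))
    w -= 0.5 * X.T @ (p - t) / n
scores = 1 / (1 + np.exp(-X @ w))

# Step 2: 1-nearest-neighbor matching on the score, with replacement.
treated = np.flatnonzero(t == 1)
control = np.flatnonzero(t == 0)
matches = control[np.abs(scores[treated, None] - scores[control]).argmin(axis=1)]

# Step 3: treatment effect on the treated = mean gap between treated
# units and their matched controls; compare to the naive (confounded) gap.
att = (y[treated] - y[matches]).mean()
naive = y[t == 1].mean() - y[t == 0].mean()
print(f"naive difference: {naive:.2f}, matched estimate: {att:.2f}")
```

Because the confounder raises both treatment probability and the outcome, the naive difference overstates the true effect, while matching on the score recovers something close to it; the paper's question is whether this recovery holds up against a genuine experimental benchmark.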
Looking first at the experimental estimates, we can see that they vary considerably across the schools. For 7 of the 11 schools, the impact of the smaller class size on test scores is positive and statistically significant. The impact estimates range from −10 to +24 percentile points. Neither of the two cases in which the impact estimate was negative in sign was statistically significantly different from zero. For the nonexperimental (propensity score matched) estimates, there is also substantial variability across schools. For 5 of the 11 schools, the impact of the smaller class size on test scores is positive and statistically significant. The impact estimates range from −22 to +33 percentile points. None of the four cases in which the impact estimate was negative in sign was statistically significantly different from zero.
We turn now to the differences between the experimental and nonexperimental impact estimates. The magnitudes and signs of these differences are, in most cases, substantial. For only two of the schools (#28 and #51) are the experimental and nonexperimental impacts less than 10 percentile points apart, and for one of those (school #28), neither impact estimate was statistically significantly different from zero. On the basis of such a casual examination, we would conclude that the nonexperimental propensity score matched estimates are not likely to give a reliable estimate of the “true” impact of smaller class sizes (of the magnitude in Project STAR) on achievement test scores.
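The “wrong decision” criterion emphasized in the abstract can be made concrete with a small sketch: for each school, ask whether a policymaker acting on the nonexperimental estimate would adopt the program if and only if the experimental estimate implies the same choice. The school labels and numbers below are purely illustrative placeholders, not values from Table 5.

```python
# Hypothetical per-school tuples: (experimental estimate, experimentally
# significant?, PSM estimate, PSM significant?), in percentile points.
schools = {
    "A": (12.0, True, 15.0, True),
    "B": (8.0, True, -3.0, False),
    "C": (-2.0, False, 20.0, True),
}

def same_decision(exp_est, exp_sig, psm_est, psm_sig):
    """Assume a policymaker adopts the program only on a positive,
    statistically significant estimate; check whether both estimates
    would lead to the same adoption decision."""
    adopt_exp = exp_sig and exp_est > 0
    adopt_psm = psm_sig and psm_est > 0
    return adopt_exp == adopt_psm

agreement = {name: same_decision(*vals) for name, vals in schools.items()}
print(agreement)  # {'A': True, 'B': False, 'C': False}
```

Under this criterion, a nonexperimental estimate can be numerically far from the experimental one yet still count as “close enough,” so long as it points to the same policy choice; disagreements like schools B and C are the failures the paper counts against PSM.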