“The Causes and Consequences of Test Score Manipulation: Evidence from the New York Regents Examinations”, Thomas S. Dee, Will Dobbie, Brian A. Jacob, Jonah Rockoff, 2019-07:

We show that the design and decentralized scoring of New York’s high school exit exams—the Regents Examinations—led to systematic manipulation of test scores just below important proficiency cutoffs.

Exploiting a series of reforms that eliminated score manipulation [by moving grading from the schools to centralized grading], we find heterogeneous effects of test score manipulation on academic outcomes. While inflating a score increases the probability of a student graduating from high school by about 17 percentage points, the probability of taking advanced coursework declines by roughly 10 percentage points.

We argue that these results are consistent with test score manipulation helping less advanced students on the margin of dropping out but hurting more advanced students that are not pushed to gain a solid foundation in the introductory material [cf. social promotion, affirmative action].

…Formal estimates suggest that teachers inflated more than 40% of scores that would have been just below the cutoffs on core academic subjects between the years 2004 and 2010, or ~6% of all tests taken during this time period. However, test score manipulation was reduced by ~80% in 2011 when the New York State Board of Regents ordered schools to stop rescoring exams with scores just below proficiency cutoffs and disappeared completely in 2012 when the Board ordered that Regents exams be graded by teachers from other schools in a small number of centrally administered locations. These results suggest that both rescoring policies and local grading are key factors in teachers’ willingness or ability to manipulate test scores around performance cutoffs.

We find that manipulation was present in all New York schools prior to the reforms, but that the extent of manipulation varied considerably across students and schools. We find higher rates of manipulation for black and Hispanic students, students with lower baseline scores, and students with worse behavioral records. Importantly, however, this is entirely due to the fact that these students are more likely to score close to the proficiency threshold—these gaps largely disappear conditional on a student scoring near a proficiency cutoff.

There is also notable across-school variation in rates of manipulation, ranging from 24% of “marginal” scores at the tenth percentile school to almost 60% of such scores at the ninetieth percentile school. This across-school variation in test score manipulation is not well explained by school-level demographics or characteristics, and there are several pieces of evidence suggesting that institutional incentives (eg. school accountability systems, teacher performance pay, and high school graduation rules) cannot explain either the across-school variation in manipulation or the system-wide manipulation. However, we do find evidence that the extent of manipulation within a school depended on the set of teachers within a school grading a particular exam. We argue that, taken together, these results suggest that “altruism” among teachers is an important motivation for teachers’ manipulation of test scores (ie. helping students avoid sanctions involved with failing an exam).

…While students on the margin of dropping out are “helped” by test score manipulation, we also find evidence that some students are “hurt” by this teacher behavior. Specifically, we find that having an exam score manipulated decreases the probability of taking the requirements for a more advanced high school diploma by 9.8 percentage points, a 26.6% decrease from the pre-reform mean, with larger effects for students with lower baseline test scores. As discussed in greater detail below, we find evidence suggesting that these negative effects stem from the fact that marginal students who are pushed over the threshold by manipulation do not gain a solid foundation to the introductory material that is required for more advanced coursework. These results are consistent with the idea that test score manipulation has heterogeneous effects on human capital accumulation.

Figure 1: Test Score Distributions for Core Regents Exams, 2004–2010. Notes: This figure shows the test score distribution around the 55 and 65 score cutoffs for New York City high school test takers between 2004–2010. Core exams include English Language Arts, Global History, US History, Math A/Integrated Algebra, and Living Environment. We include the first test in each subject for each student in our sample. Each point shows the fraction of test takers in a score bin with solid points indicating a manipulable score. The dotted line beneath the empirical distribution is a subject-by-year specific sixth-degree polynomial fitted to the empirical distribution excluding the manipulable scores near each cutoff. The shaded area represents either the missing or excess mass for manipulable scores as we define based on the scoring guidelines described in §III and detailed in online Appendix Table A3. Total manipulation is the fraction of test takers with manipulated scores. In-range manipulation is the fraction of test takers with manipulated scores normalized by the average height of the counterfactual distribution to the left of each cutoff. Standard errors are calculated using the parametric bootstrap procedure described in the text. See the online Data Appendix for additional details on the sample and variable definitions.
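
The counterfactual-distribution idea in the Figure 1 notes can be illustrated with a minimal sketch. It is not the authors' code. It fits a sixth-degree polynomial to the score bins outside a manipulable window, then compares the fitted curve to the observed fractions inside that window. The function name `bunching_estimate`, the single cutoff at 65, the 5-point window on each side, and the synthetic distribution are all assumptions made for illustration; the paper defines the manipulable scores per exam (online Appendix Table A3), handles both the 55 and 65 cutoffs, and obtains standard errors by parametric bootstrap, none of which is reproduced here. The final ratio is one reading of "in-range" manipulation (missing mass as a share of the counterfactual mass just below the cutoff), not necessarily the paper's exact normalization.

```python
import numpy as np
from numpy.polynomial import Polynomial


def bunching_estimate(scores, fractions, cutoff, half_width, degree=6):
    """Hypothetical helper: estimate missing and excess mass around one cutoff.

    scores     -- score value of each bin
    fractions  -- observed fraction of test takers in each bin
    cutoff     -- proficiency cutoff (assumed single cutoff)
    half_width -- assumed size of the manipulable window on each side
    """
    scores = np.asarray(scores, dtype=float)
    fractions = np.asarray(fractions, dtype=float)

    # Bins treated as manipulable (an assumption; the paper uses exam-specific ranges).
    manipulable = (scores >= cutoff - half_width) & (scores < cutoff + half_width)
    below = manipulable & (scores < cutoff)
    above = manipulable & (scores >= cutoff)

    # Fit the polynomial only on bins outside the manipulable window,
    # then evaluate it everywhere to get the counterfactual distribution.
    fit = Polynomial.fit(scores[~manipulable], fractions[~manipulable], degree)
    counterfactual = fit(scores)

    missing = float(np.sum(counterfactual[below] - fractions[below]))
    excess = float(np.sum(fractions[above] - counterfactual[above]))
    # Share of would-be-failing marginal scores that appear to have been moved.
    share = missing / float(np.sum(counterfactual[below]))
    return counterfactual, missing, excess, share


# Synthetic data only: a smooth (quadratic) score distribution over 20..100.
scores = np.arange(20, 101)
base = 1.0 - ((scores - 60) / 60.0) ** 2
base = base / base.sum()

# Simulate inflation: move 40% of the mass in 60..64 up to 65..69.
observed = base.copy()
low = (scores >= 60) & (scores < 65)
high = (scores >= 65) & (scores < 70)
moved = 0.4 * base[low]
observed[low] -= moved
observed[high] += moved

counterfactual, missing, excess, share = bunching_estimate(
    scores, observed, cutoff=65, half_width=5
)
print(f"missing mass: {missing:.4f}, excess mass: {excess:.4f}, share moved: {share:.2f}")
```

Because the synthetic distribution is itself a low-degree polynomial, the fit recovers it almost exactly and the estimated share matches the 40% built into the simulation. Real score distributions are lumpier, so the estimate would depend on the polynomial degree and on how the manipulable window is defined.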