The Iron Law Of Evaluation And Other Metallic Rules
Problems with social experiments and evaluating them, loopholes, causes, and suggestions; non-experimental methods systematically deliver false results, as most interventions fail or have small effects.
“The Iron Law Of Evaluation And Other Metallic Rules” is a classic review paper by American “sociologist Peter Rossi, a dedicated progressive and the nation’s leading expert on social program evaluation from the 1960s through the 1980s”; it discusses the difficulties of creating a useful social program, and proposes some aphoristic summary rules, including most famously:
The Iron Law: “The expected value of any net impact assessment of any large scale social program is zero.”
The Stainless Steel Law: “The better designed the impact assessment of a social program, the more likely is the resulting estimate of net impact to be zero.”
It expands an earlier paper by Rossi (“Issues in the evaluation of human services delivery”, Rossi 1978), where he coined the first, the “Iron Law”.
I provide an annotated HTML version with fulltext for all references, as well as a bibliography collating many negative results in social experiments I’ve found since Rossi’s paper was published (see also the closely-related Replication Crisis).
This transcript has been prepared from an original scan; all hyperlinks are my own insertion.
-
Citation: “The Iron Law Of Evaluation And Other Metallic Rules”, Rossi; Research in Social Problems and Public Policy (ISBN: 0-89232-560-7), volume 4 (1987), pages 3–20.
The Iron Law
Introduction
by Peter Rossi
[pg3]
Evaluations of social programs have a long history, as history goes in the social sciences, but it has been only in the last two decades that evaluation has come close to becoming a routine activity that is a functioning part of the policy formation process. Evaluation research has become an activity that no agency administering social programs can do without and still retain a reputation as modern and up to date. In academia, evaluation research has infiltrated into most social science departments as an integral constituent of curricula. In short, evaluation has become institutionalized.
There are many benefits to social programs and to the social sciences from the institutionalization of evaluation research. Among the more important benefits has been a considerable increase in knowledge concerning social problems and about how social programs work (and do not [pg4] work). Along with these benefits, however, there have also been attached some losses. For those concerned with the improvement of the lot of disadvantaged persons, families and social groups, the resulting knowledge has provided the bases for both pessimism and optimism. On the pessimistic side, we have learned that designing successful programs is a difficult task that is not easily or often accomplished. On the optimistic side, we have learned more and more about the kinds of programs that can be successfully designed and implemented. Knowledge derived from evaluations is beginning to guide our judgments concerning what is feasible and how to reach those feasible goals.
To draw some important implications from this knowledge about the workings of social programs is the objective of this paper. The first step is to formulate a set of “laws” that summarize the major trends in evaluation findings. Next, a set of explanations is provided for those overall findings. Finally, we explore the consequences for applied social science activities that flow from our new knowledge of social programs.
Some “Laws” Of Evaluation
A dramatic but slightly overdrawn view of two decades of evaluation efforts can be stated as a set of “laws”, each summarizing some strong tendency that can be discerned in that body of materials. Following a 19th Century practice that has fallen into disuse in social science1, these laws are named after substances of varying durability, roughly indexing each law’s robustness.
-
The Iron Law of Evaluation: “The expected value of any net impact assessment of any large scale social program is zero.”
The Iron Law arises from the experience that few impact assessments of large scale2 social programs have found that the programs in question had any net impact. The law also means that, based on the evaluation efforts of the last twenty years, the best a priori estimate of the net impact assessment of any program is zero, ie. that the program will have no effect.
-
The Stainless Steel Law of Evaluation: “The better designed the impact assessment of a social program, the more likely is the resulting estimate of net impact to be zero.”
This law means that the more technically rigorous the net impact assessment, the more likely are its results to be zero—or no effect. Specifically, this law implies that estimating net impacts through randomized controlled experiments, the avowedly best approach to estimating net impacts, is more likely to show zero effects than other less rigorous approaches. [pg5]
-
The Brass Law of Evaluation: “The more social programs are designed to change individuals, the more likely the net impact of the program will be zero.”
This law means that social programs designed to rehabilitate individuals by changing them in some way or another are more likely to fail. The Brass Law may appear to be redundant since all programs, including those designed to deal with individuals, are covered by the Iron Law. This redundancy is intended to emphasize the especially difficult task of designing and implementing effective programs that aim to rehabilitate individuals.
-
The Zinc Law of Evaluation: “Only those programs that are likely to fail are evaluated.”
Of the several metallic laws of evaluation, the Zinc Law has the most optimistic slant, since it implies that there are effective programs, but that such effective programs are never evaluated. It also implies that if a social program is effective, its effectiveness will be obvious enough that policy makers and others who sponsor and fund evaluations will decide against evaluation.
It is possible to formulate a number of additional laws of evaluation, each attached to one or another of a variety of substances, varying in strength from strong, robust metals to flimsy materials. The substances involved are limited only by one’s imagination. But, if such laws are to mirror the major findings of the last two decades of evaluation research, they would all carry the same message: a review of the history of the last two decades of efforts to evaluate major social programs in the United States sustains the proposition that, over this period, the American establishment of policy makers, agency officials, professionals and social scientists did not know how to design and implement social programs that were minimally effective, let alone spectacularly so.
How Firm Are The Metallic Laws Of Evaluation?
How seriously should we take the metallic laws? Are they simply the social science analogue of poetic license, intended to provide dramatic emphasis? Or, do the laws accurately summarize the last two decades’ evaluation experiences?
First of all, viewed against the evidence, the iron law is not entirely rigid. True, most impact assessments conform to the iron law’s dictates in showing at best marginal effects and all too often no effects at all. There are even a few evaluations that have shown effects in the wrong directions, [pg6] opposite to the desired effects. Some of the failures of large scale programs have been particularly disappointing because of the large investments of time and resources involved: Manpower retraining programs have not been shown to improve earnings or employment prospects of participants (Westat, 1976–1980). Most of the attempts to rehabilitate prisoners have failed to reduce recidivism (Lipton, Martinson, and Wilks, 1975). Most educational innovations have not been shown to improve student learning appreciably over traditional methods (Raizen and Rossi, 1981).
But, there are also many exceptions to the iron rule! The “iron” in the Iron Law has shown itself to be somewhat spongy and therefore easily, although not frequently, broken. Some social programs have shown positive effects in the desired directions, and there are even some quite spectacular successes: the American old age pension system plus Medicare has dramatically improved the lives of our older citizens. Medicaid has managed to deliver medical services to the poor to the extent that the negative correlation between income and consumption of medical services has declined dramatically since enactment. The family planning clinics subsidized by the federal government were effective in reducing the number of births in areas where they were implemented (Cutright and Jaffe, 1977). There are also human services programs that have been shown to be effective, although mainly in small scale pilot runs: for example, the Minneapolis Police Foundation experiment on the police handling of family violence showed that if the police placed the offending abuser in custody overnight, the offender was less likely to show up as an accused offender over the succeeding six months (Sherman and Berk, 1984)3. A meta-evaluation of psychotherapy showed that on the average, persons in psychotherapy—no matter what brand—were a third of a standard deviation improved over control groups that did not have any therapy (Smith, Glass, and Miller, 1980). In most of the evaluations of manpower training programs, women returning to the labor force benefited positively compared to women who did not take the courses, even though in general such programs have not been successful. Even Head Start is now beginning to show some positive benefits after many years of equivocal findings. And so it goes on, through a relatively long list of successful programs.
But even in the case of successful social programs, the sizes of the net effects have not been spectacular. In the social program field, nothing has yet been invented which is as effective in its way as the smallpox vaccine was for the field of public health. In short, as is well known (and widely deplored), we are not on the verge of wiping out the social scourges of our time: ignorance, poverty, crime, dependency, and mental illness show great promise to be with us for some time to come.
The Stainless Steel Law appears to be more likely to hold up over a [pg7] large series of cases than the more general Iron Law. This is because the fiercest competition as an explanation for the seeming success of any program—especially human services programs—ordinarily is either self- or administrator-selection of clients. In other words, if one finds that a program appears to be effective, the most likely alternative explanation to judging the program as the cause of that success is that the persons attracted to that program were likely to get better on their own or that the administrators of that program chose those who were already on the road to recovery as clients. As the better research designs—particularly randomized experiments—eliminate that competition, the less likely is a program to show any positive net effect. So the better the research design, the more likely the net impact assessment is to be zero.
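[The selection-bias argument is easy to demonstrate by simulation. The Python sketch below is my own illustration, not Rossi’s, and all of its numbers are hypothetical: a program is given a true effect of exactly zero, clients self-select in proportion to how well they would have done anyway, and the naive comparison of participants to non-participants is contrasted with a randomized one. –Editor]

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Latent "prognosis": how well each person would do with no program at all.
prognosis = rng.normal(0, 1, n)
true_effect = 0.0  # the Iron Law's a priori expectation

# Self-selection: people already on the road to recovery enroll more often.
p_enroll = 1 / (1 + np.exp(-2 * prognosis))
enrolled = rng.random(n) < p_enroll
outcome = prognosis + true_effect * enrolled + rng.normal(0, 1, n)
naive = outcome[enrolled].mean() - outcome[~enrolled].mean()

# Randomized assignment severs enrollment from prognosis.
treated = rng.random(n) < 0.5
outcome_rct = prognosis + true_effect * treated + rng.normal(0, 1, n)
rct = outcome_rct[treated].mean() - outcome_rct[~treated].mean()

print(f"self-selected comparison: {naive:+.2f}")  # substantially positive: a pure selection artifact
print(f"randomized comparison:    {rct:+.2f}")    # ~0: the true (null) effect
```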
How about the Zinc Law of Evaluation? First, it should be pointed out that this law is impossible to verify in any literal sense. The only way that one can be relatively certain that a program is effective is to evaluate it, and hence the proposition that only ineffective programs are evaluated can never be proven.
However, there is a sense in which the Zinc Law is correct. If the a priori, beyond-any-doubt expectation of decision makers and agency heads is that a program will be effective, there is little chance that the program will be evaluated at all. Our most successful social program, social security payments to the aged, has never been evaluated in a rigorous sense. It is “well known” that the program manages to raise the incomes of retired persons and their families, and “it stands to reason” that this increase in income is greater than what would have happened, absent the social security system.
Evaluation research is the legitimate child of skepticism, and where there is faith, research is not called upon to make a judgment. Indeed, the history of the income maintenance experiments bears this point out. Those experiments were not undertaken to find out whether the main purpose of the proposed program could be achieved: that is, no one doubted that payments would provide income to poor people—indeed, payments by definition are income, and even social scientists are not inclined to waste resources investigating tautologies. Furthermore, no one doubted that payments could be calculated and checks could be delivered to households. The main purpose of the experiment was to estimate the sizes of certain anticipated side effects of the payments, about which economists and policy makers were uncertain—how much of a work disincentive effect would be generated by the payments and whether the payments would affect other aspects of the households in undesirable ways—for instance, increasing the divorce rate among participants.
In short, when we look at the evidence for the metallic laws, the evidence appears not to sustain their seemingly rigid character, but the [pg8] evidence does sustain the “laws” as statistical regularities. Why this should be the case is the topic to be explored in the remainder of this paper.
Is There Something Wrong With Evaluation Research?
A possibility that deserves very serious consideration is that there is something radically wrong with the ways in which we go about conducting evaluations. Indeed, this argument is the foundation of a revisionist school of evaluation, composed of evaluators who are intent on calling into question the main body of methodological procedures used in evaluation research, especially those that emphasize quantitative and particularly experimental approaches to the estimation of net impacts. The revisionists include such persons as Michael Patton (1980) and Egon Guba (1981). Some of the revisionists are reformed number crunchers who have seen the errors of their ways and have been reborn as qualitative researchers. Others have come from social science disciplines in which qualitative ethnographic field methods have been dominant.
Although the issue of the appropriateness of social science methodology is an important one, so far the revisionist arguments fall far short of being fully convincing. At the root of the revisionist argument appears to be the difficulty revisionists have in accepting the finding that most social programs, when evaluated for impact by rigorous quantitative evaluation procedures, fail to register main effects: hence the defects must be in the method of making the estimates.4 This argument per se is an interesting one, and deserves attention: all procedures need to be continually re-evaluated. There are some obvious deficiencies in most evaluations, some of which are inherent in the procedures employed. For example, a program that is constantly changing and evolving cannot ordinarily be rigorously evaluated since the treatment to be evaluated cannot be clearly defined. Such programs either require new evaluation procedures or should not be evaluated at all.
The weakness of the revisionist approaches lies in their proposed solutions to these deficiencies. Criticizing quantitative approaches for their woodenness and inflexibility, they propose to replace current methods with procedures that have even greater and more obvious deficiencies. The qualitative procedures they propose are not exempt from issues of internal and external validity and ordinarily do not attempt to address these thorny problems. Indeed, the procedures which they advance as substitutes for the mainstream methodology are usually vaguely described, [pg9] constituting an almost mystical advocacy of the virtues of qualitative approaches, without clear discussion of the specific ways in which such procedures meet validity criteria. In addition, many appear to adopt program operator perspectives on effectiveness, reasoning that any effort to improve social conditions must have some effect, with the burden of proof placed on the evaluation researcher to find out what those effects might be.
Although many of their arguments concerning the woodenness of many quantitative researches are cogent and well taken, the main revisionist arguments for an alternative methodology are unconvincing: hence one must look elsewhere than to evaluation methodology for the reasons for the failure of social programs to pass muster before the bar of impact assessments.
Sources Of Program Failures
Starting with the conviction that the many findings of zero impact are real, we are led inexorably to the conclusion that the faults must lie in the programs. Three kinds of failure can be identified, each a major source of the observed lack of impact.
The first two types of faults that lead a program to fail stem from problems in social science theory, and the third is a problem in the organization of social programs:
-
Faults in Problem Theory: The program is built upon a faulty understanding of the social processes that give rise to the problem to which the social program is ostensibly addressed;
-
Faults in Program Theory: The program is built upon a faulty understanding of how to translate problem theory into specific programs;
-
Faults in Program Implementation: There are faults in the organizations, resource levels, and/or activities that are used to deliver the program to its intended beneficiaries.
Note that the term theory is used above in a fairly loose way to cover all sorts of empirically grounded generalized knowledge about a topic, and is not limited to formal propositions.
Every social program, implicitly or explicitly, is based on some understanding of the social problem involved and some understanding of the program. If one fails to arrive at an appropriate understanding of either, the program in question will undoubtedly fail. In addition, every program [pg10] is given to some organization to implement. Failure to provide enough resources, or to insure that the program is delivered with sufficient fidelity, can also lead to findings of ineffectiveness.
Problem Theory
Problem theory consists of the body of empirically tested understanding of the social problem that underlies the design of the program in question. For example, the problem theory that was the underpinning for the many attempts at prisoner rehabilitation tried in the last two decades was that criminality was a personality disorder. Even though there was a lot of evidence for this viewpoint, it also turned out that the theory is not relevant either to understanding crime rates or to the design of crime policy. The changes in crime rates do not reflect massive shifts in personality characteristics of the American population, nor does the personality disorder theory of crime lead to clear implications for crime reduction policies. Indeed, it is likely that large scale personality changes are beyond the reach of social policy institutions in a democratic society.
The adoption of this theory is quite understandable. For example, how else do we account for the fact that persons seemingly exposed to the same influences do not show the same criminal (or noncriminal) tendencies? But the theory is not useful for understanding the social distribution of crime rates by gender, socio-economic level, or by age.
Program Theory
Program theory links together the activities that constitute a social program and desired program outcomes. Obviously, program theory is also linked to problem theory, but is partially independent. For example, given the problem theory that diagnosed criminality as a personality disorder, a matching program theory would have as its aim personality-change-oriented therapy. But there are many specific ways in which therapy can be defined and at many different points in the history of individuals. At the one extreme of the lifeline, one might attempt preventive mental health work directed toward young children; at the other extreme, one might provide psychiatric treatment for prisoners or set up therapeutic groups in prison for convicted offenders.
Implementation
The third major source of failure is organizational in character and has to do with the failure to implement programs properly. Human services [pg11] programs are notoriously difficult to deliver appropriately to the appropriate clients. A well designed program that is based on correct problem and program theories may simply be implemented improperly, including not implementing any program at all. Indeed, in the early days of the War on Poverty, many examples were found of non-programs—the failure to implement anything at all.
Note that these three sources of failure are nested to some degree:
-
An incorrect understanding of the social problem being addressed is clearly a major failure that invalidates a correct program theory and an excellent implementation.
-
No matter how good the problem theory may be, an inappropriate program theory will lead to failure.
-
And, no matter how good the problem and program theories, a poor implementation will also lead to failure.
Sources Of Theory Failure
A major reason for failures produced through incorrect problem and program theories lies in the serious under-development of policy related social science theories in many of the basic disciplines. The major problem with much basic social science is that social scientists have tended to ignore policy related variables in building theories, because policy related variables account for so little of the variance in the behavior in question. It does not help the construction of social policy any to know that a major determinant of criminality is age, because there is little, if anything, that policy can do about the age distribution of a population, given a commitment to our current democratic, liberal values. There are notable exceptions to this generalization about social science: economics and political science have always been closely attentive to policy considerations; this indictment concerns mainly such fields as sociology, anthropology and psychology.
Incidentally, this generalization about social science and social scientists should warn us not to expect too much from changes in social policy. This implication is quite important and will be taken up later on in this paper.
But the major reason why programs fail through failures in problem and program theories is that the designers of programs are ordinarily amateurs who know even less than the social scientists! There are numerous examples of social programs that were concocted by well meaning amateurs (but amateurs nevertheless). A prime example is Community Mental Health Centers, an invention of the Kennedy administration, apparently [pg12] undertaken without any input from the National Institute of Mental Health, the agency that was given the mandate to administer the program. Similarly with the Comprehensive Employment and Training Act (CETA) and its successor, the current Job Training Partnership Act (JTPA) program, both of which were designed by rank amateurs and then given over to the Department of Labor to run and administer. Of course, some of the amateurs were advised by social scientists about the programs in question, so the social scientists are not completely blameless.
The amateurs in question are the legislators, judicial officials, and other policy makers who initiate policy and program changes. The main problem with amateurs lies not so much in their amateur status but in the fact that they may know little or nothing about the problem in question or about the programs they design. Social science may not be an extraordinarily well developed set of disciplines, but social scientists do know something about our society and how it works, knowledge that can prove useful in the design of policy and programs that may have a chance to be successfully effective.
Our social programs seemingly are designed by procedures that lie somewhere in between setting monkeys to typing mindlessly on typewriters in the hope that additional Shakespearean plays will eventually be produced, and Edisonian trial-and-error procedures in which one tactic after another is tried in the hope of finding some method that works. Although the Edisonian paradigm is not highly regarded as a scientific strategy by the philosophers of science, there is much to recommend it in a historical period in which good theory is yet to develop. It is also a strategy that allows one to learn from errors. Indeed, evaluation is very much a part of an Edisonian strategy of starting new programs, and attempting to learn from each trial.5
Problem Theory Failures
One of the more persistent failures in problem theory is to under-estimate the complexity of the social world. Most of the social problems with which we deal are generated by very complex causal processes involving interactions of a very complex sort among societal level, community level, and individual level processes. In all likelihood there are biological level processes involved as well, however much our liberal ideology is repelled by the idea. The consequence of under-estimating the complexity of the problem is often to over-estimate our abilities to affect the amount and course of the problem. This means that we are overly optimistic about how much of an effect even the best of social programs can expect to achieve. It [pg13] also means that we under-design our evaluations, running the risk of committing Type II errors: that is, not having enough statistical power in our evaluation research designs to be able to detect reliably those small effects that we are likely to encounter.
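[Rossi’s point about under-designed evaluations can be quantified with a standard power calculation. The sketch below is my own illustration, not Rossi’s; the effect size of 0.1 standard deviations and the sample sizes are hypothetical but typical of the “small effects” he describes. Even 800 subjects per arm gives only about even odds of detecting such an effect at the conventional 5% significance level. –Editor]

```python
from math import sqrt
from statistics import NormalDist

def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test for standardized effect size d."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    se = sqrt(2 / n_per_group)  # standard error of a difference in means, unit variances
    ncp = d / se                # noncentrality: how many SEs the true effect is from zero
    return (1 - z.cdf(z_crit - ncp)) + z.cdf(-z_crit - ncp)

# A "small" program effect of d = 0.1 SD, typical of social interventions:
for n in (50, 200, 800, 3200):
    print(f"n = {n:4d} per group -> power = {power_two_sample(0.1, n):.2f}")
# n =   50 per group -> power = 0.08
# n =  200 per group -> power = 0.17
# n =  800 per group -> power = 0.52
# n = 3200 per group -> power = 0.98
```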
It is instructive to consider the example of the problem of crime in our society. In the last two decades, we have learned a great deal about the crime problem through our attempts, by initiating one social program after another, to halt the rising crime rate in our society. This series of trials has largely failed to have [substantial] impacts on crime rates, but the research effort has yielded a great deal of empirical knowledge about crime and criminals. For example, we now know a great deal about the demographic characteristics of criminals and their victims. But, we still have only the vaguest ideas about why the crime rates rose so steeply in the period between 1970 and 1980 and, in the last few years, have started what appears to be a gradual decline. We have also learned that the criminal justice system has been given an impossible task to perform and, indeed, practices a wholesale form of deception in which everyone acquiesces. It has been found that most perpetrators of most criminal acts go undetected, when detected go unprosecuted, and when prosecuted go unpunished. Furthermore, most prosecuted and sentenced criminals are dealt with by plea bargaining procedures that are just in the last decade getting formal recognition as occurring at all. After decades of sub rosa existence, plea bargaining is beginning to get official recognition in the criminal code and judicial interpretations of that code.
But most of what we have learned in the past two decades amounts to a better description of the crime problem and the criminal justice system as it presently functions. There is simply no doubt about the importance of this detailed information: it is going to be the foundation of our understanding of crime; but, it is not yet the basis upon which to build policies and programs that can lessen the burden of crime in our society.
Perhaps the most important lesson learned from the descriptive and evaluative researches of the past two decades is that crime and criminals appear to be relatively insensitive to the range of policy and program changes that have been evaluated in this period. This means that the prospects for substantial improvements in the crime problem appear to be slight, unless we gain better theoretical understanding of crime and criminals. That is why the Iron Law of Evaluation appears to be an excellent generalization for the field of social programs aimed at reducing crime and leading criminals to the straight and narrow way of life. The knowledge base for developing effective crime policies and programs simply does not exist; and hence in this field, we are condemned—hopefully temporarily—to Edisonian trial and error.
[pg14]
Program Theory And Implementation Failures
As defined earlier, program theory failures are translations of a proper understanding of a problem into inappropriate programs, and program implementation failures arise out of defects in the delivery system used. Although in principle it is possible to distinguish program theory failures from program implementation failures, in practice it is difficult to do so. For example, a correct program may be incorrectly delivered, and hence would constitute a “pure” example of implementation failure, but it would be difficult to identify this case as such, unless there were some instances of correct delivery. Hence both program theory and program implementation failures will be discussed together in this section.
These kinds of failures are likely the most common causes of ineffective programs in many fields. There are many ways in which program theory and program implementation failures can occur. Some of the more common ways are listed below.
Wrong Treatment
This occurs when the treatment is simply a seriously flawed translation of the problem theory into a program. One of the best examples is the housing allowance experiment, in which the experimenters attempted to motivate poor households to move into higher quality housing by offering them a rent subsidy, contingent on their moving into housing that met certain quality standards (Struyk and Bendick, 1981). The experimenters found that only a small portion of the poor households to whom this offer was made actually moved to better housing and thereby qualified for and received housing subsidy payments. After much econometric calculation, this unexpected outcome was found to have been apparently generated by the fact that the experimenters unfortunately did not take into account that the costs of moving were far from zero. When the anticipated dollar benefits from the subsidy were compared to the net benefits, after taking into account the costs of moving, the net benefits were in a very large proportion of the cases uncomfortably close to zero and in some instances negative. Furthermore, the housing standards applied almost totally missed the point. They were technical standards that often characterized housing as sub-standard that was quite acceptable to the households involved. In other words, these were standards that were regarded as irrelevant by the clients. It was unreasonable to assume that households would undertake to move when there was no push of dissatisfaction from the housing occupied and no substantial net positive benefit in dollar [pg15] terms for doing so. Incidentally, the fact that poor families with little formal education were able to make decisions that were consistent with the outcomes of highly technical econometric calculations improves one’s appreciation of the innate intellectual abilities of that population.
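[The arithmetic behind this failure can be reconstructed with a toy calculation. The dollar figures below are my own invented illustration, not the experiment’s actual numbers: a subsidy that looks generous gross can be nearly worthless, or negative, once moving costs are netted out. –Editor]

```python
# Hypothetical illustration of the housing-allowance arithmetic (invented figures):
monthly_subsidy = 50.0  # $/month rent subsidy, contingent on moving (assumed)
horizon_months = 24     # household planning horizon (assumed)
moving_cost = 900.0     # direct + search + disruption costs of one move (assumed)
extra_rent = 15.0       # $/month extra rent for standard-meeting housing (assumed)

gross_benefit = monthly_subsidy * horizon_months
net_benefit = gross_benefit - moving_cost - extra_rent * horizon_months
print(f"gross benefit over horizon: ${gross_benefit:,.0f}")  # $1,200: looks attractive
print(f"net benefit over horizon:   ${net_benefit:,.0f}")    # $-60: close to zero, here negative
```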
Right Treatment But Insufficient Dosage
A very recent set of trial policing programs in Houston, Texas and Newark, New Jersey exemplifies how programs may fail not so much because they were administering the wrong treatment but because the treatment was frail and puny (Police Foundation, 1985). Among the goals of the programs was to produce a more positive evaluation of local police departments in the views of local residents. Several different treatments were attempted. In Houston, the police attempted to meet the presumed needs of victims of crime by having a police officer call them up a week or so after a crime complaint was received to ask “how they were doing” and to offer help in “any way”. Over a period of a year, the police managed to contact about 230 victims, but the help they could offer consisted mainly of referrals to other agencies. Furthermore, the crimes in question were mainly property thefts without personal contact between victims and offenders, with the main request for aid being requests to speed up the return of their stolen property. Anyone who knows even a little bit about property crime in the United States would know that the police do little or nothing to recover stolen property, mainly because there is no way they can do so. Since the callers from the police department could not offer any substantial aid to remedy the problems caused by the crimes in question, the treatment delivered by the program was essentially zero. It goes without saying that those contacted by the police officers did not differ from randomly selected controls—who had also been victimized but who had not been called by the police—in their evaluation of the Houston Police Department.
It seems likely that the treatment administered, namely expressions of concern for the victims of crime, administered in a personal face-to-face way, would have been effective if the police could have offered substantial help to the victims.
Counter-Acting Delivery System
It is obvious that any program consists not only of the treatment intended to be delivered, but also of the delivery system and whatever is done to clients in the delivery of services. Thus the income maintenance experiments’ treatments consist not only of the payments, but also of the entire system of monthly income reports required of the clients, [pg16] the quarterly interviews and the annual income reviews, as well as the payment system and its rules. In that particular case, it is likely that the payments dominated the payment system, but in other cases that might not be so, with the delivery system profoundly altering the impact of the treatment.
Perhaps the most egregious example was the group counseling program run in California prisons during the 1960s (Kassebaum, Ward, and Wilner, 1972). Guards and other prison employees were used as counseling group leaders, in sessions in which all participants—prisoners and guards—were asked to be frank and candid with each other! There are many reasons for the abysmal failure6 of this program to affect either criminals’ behavior within prison or during their subsequent period of parole, but among the leading contenders for the role of villain was the prison system’s use of guards as therapists.
Another example is the failure of transitional aid payments to released prisoners when the payment system was run by the state employment security agency, in contrast to the strong positive effect found when run by researchers (Rossi, Berk, and Lenihan, 1980). In a randomized experiment run by social researchers in Baltimore, the provision of 3 months of minimal support payments lowered the re-arrest rate by 8 percent, a small decrement, but a [statistically] significant one that was calculated to have a very high benefit-to-cost ratio. When the Department of Labor wisely decided that another randomized experiment should be run to see whether YOAA—“Your Ordinary American Agency”—could achieve the same results, large scale experiments in Texas and Georgia showed that putting the treatment in the hands of the employment security agencies in those two states canceled the positive effects of the treatment. The procedure which produced the failure was a simple one: the payments were made contingent on being unemployed, as the employment security agencies usually administered unemployment benefits, creating a strong work disincentive effect with the unfortunate consequence of a longer period of unemployment for experimentals as compared to their randomized controls and hence a higher than expected re-arrest rate.
Pilot And Production Runs
The last example can be subsumed under a more general point—namely, the fact that a treatment is effective in a pilot test does not mean that, when turned over to YOAA, effectiveness can be maintained. This is the lesson to be derived from the transitional aid experiments in Texas and Georgia and from programs such as The Planned Variation teaching demonstrations7. In the latter program, leading teaching specialists were asked to develop versions of their teaching methods to be implemented in actual [pg17] school systems. Despite generous support and willing cooperation from their schools, the researchers were unable to get workable versions of their teaching strategies into place until at least a year into the running of the program. There is a big difference between running a program on a small scale with highly skilled and very devoted personnel and running a program with the lesser skilled and less devoted personnel that YOAA ordinarily has at its disposal. Programs that appear to be very promising when run by the persons who developed them often turn out to be disappointments when turned over to line agencies.
Inadequate Reward System
The internally defined reward system of an organization has a strong effect on which activities are assiduously pursued and which are characterized by “benign neglect”. The fact that an agency is directed to engage in some activity does not mean that it will do so unless the reward system within that organization actively fosters compliance. Indeed, there are numerous examples of reward systems that do not foster compliance.
Perhaps one of the best examples was the experience of several police departments with the decriminalization of public intoxication. Both the District of Columbia and Minneapolis—among other jurisdictions—rescinded their ordinances that defined public drunkenness as a misdemeanor, setting up detoxification centers to which police were asked to bring persons who were found to be drunk on the streets. Under the old system, police patrols would arrest drunks and bring them into the local jail for an overnight stay. The arrests so made would “count” towards the department’s measures of policing activity. Patrolmen were motivated thereby to pick up drunks and book them into the local jail, especially in periods when other arrest opportunities were slight. In contrast, under the new system, the handling of drunks did not count towards an officer’s arrest record. The consequence: police did not bring drunks into the new detoxification centers and the municipalities eventually had to set up separate service systems to rustle up clients for the detoxification systems.8
The illustrations given above should be sufficient to make the general point that the appropriate implementation of social programs is a problematic matter. This is especially the case for programs that rely on persons to deliver the service in question. There is no doubt that federal, state, and local agencies can calculate and deliver checks with precision and efficiency. There also can be little doubt that such agencies can maintain a physical infra-structure that delivers public services efficiently, even though there are a few examples of the failure of water and sewer systems on scales that threaten public health. But there is a lot of doubt that human [pg18] services that are tailored to differences among individual clients can be done well at all on a large scale basis.
We know that public education is not doing equally well in facilitating the learning of all children. We know that our mental health system does not often succeed in treating the chronically mentally ill in a consistent and effective fashion. This does not mean that some children cannot be educated or that the chronically mentally ill cannot be treated—it does mean that our ability to do these activities on a mass scale is somewhat in doubt.
Conclusions
This paper started out with a recital of the several metallic laws stating that evaluations of social programs have rarely found them to be effective in achieving their desired goals. The discussion modified the metallic laws to express them as statistical tendencies rather than rigid and inflexible laws to which all evaluations must strictly adhere. In this latter sense, the laws simply do not hold. However, when stripped of their rigidity, the laws can be seen to be valid as statistical generalizations, fairly accurately representing what have been the end results of evaluations “on-the-average”. In short, few large-scale social programs have been found to be even minimally effective. There have been even fewer programs found to be spectacularly effective. There are no social science equivalents of the Salk vaccine.9
Were this conclusion the only message of this paper, it would tell a dismal tale indeed. But there is a more important message in the examination of the reasons why social programs fail so often. In this connection, the paper pointed out two deficiencies:
First, policy relevant social science theory that should be the intellectual underpinning of our social policies and programs is either deficient or simply missing. Effective social policies and programs cannot be designed consistently until it is thoroughly understood how changes in policies and programs can affect the social problems in question. The social policies and programs that we have tested have been designed, at best, on the basis of common sense and perhaps intelligent guesses, a weak foundation for the construction of effective policies and programs.
In order to make progress, we need to deepen our understanding of the long range and proximate causation of our social problems and our understanding about how active interventions might alleviate the burdens of those problems. This is not simply a call for more funds for social science research but also a call for a redirection of social science research toward understanding how public policy can affect those problems.
Second, in pointing to the frequent failures in the implementation of [pg19] social programs, especially those that involve labor intensive delivery of services, we may also note an important missing professional activity in those fields. The physical sciences have their engineering counterparts; the biological sciences have their health care professionals; but social science has neither an engineering nor a strong clinical component. To be sure, we have clinical psychology, education, social work, public administration, and law as our counterparts to engineering, but these are only weakly connected with basic social science. What is apparently needed is a new profession of social and organizational engineering devoted to the design of human services delivery systems that can deliver treatments with fidelity and effectiveness.
In short, the double message of this paper is an argument for further development of policy relevant basic social science and the establishment of the new profession of social engineer.
References
-
Cutright, P. and F.S. Jaffe 1977: Impact of Family Planning Programs on Fertility: The U.S. Experience. New York: Praeger.
-
Guba, E. and Y. Lincoln 1981: Effective Evaluation: Improving the Usefulness of Evaluation Results Through Responsive and Naturalistic Approaches.
-
Kassebaum, G., D. Ward, and D. Wilner 1971: Prison Treatment and Parole Survival. New York: John Wiley.
-
Lipton, D., R. Martinson, and J. Wilks 1975: The Effectiveness of Correctional Treatment. New York: Praeger. [pg20]
-
Patton, M. 1980: Qualitative Evaluation Methods. Beverly Hills, CA: Sage Publications.
-
Police Foundation 1985: Evaluation of Newark and Houston Policing Experiments10. Washington, DC.
-
Raizen, S.A. and P.H. Rossi (eds.) 1981: Program Evaluation in Education: When? How? To What Ends? Washington, DC: National Academy Press.
-
Rossi, P.H., R.A. Berk and K.J. Lenihan 1980: Money, Work and Crime. New York: Academic Press.
-
Sherman, L.W. and R.A. Berk 1984: “Deterrent effects of arrest for domestic assault”. American Sociological Review 49: 261–271.
-
Smith, M.L., G.V. Glass, and T.I. Miller 1980: The Benefits of Psychotherapy: An Evaluation. Baltimore: The Johns Hopkins University Press.
-
Struyk, R.J. and M. Bendick 1981: Housing Vouchers for the Poor. Washington, DC: The Urban Institute.
-
Westat, Inc. 1976–1980: Continuous Longitudinal Manpower Survey, Reports 1–10 (CLMS). Rockville, MD: Westat, Inc.11
See Also
External Links
-
Smith 2011, “Epidemiology, genetics and the ‘Gloomy Prospect’: embracing randomness in population health research and practice”
-
Jensen 1969, “How Much Can We Boost IQ and Scholastic Achievement?”; related:
-
Bereiter & Engelmann 1966, Teaching disadvantaged children in the preschool
-
The Coleman Report, 1966; “Does Attendance in Private Schools Predict Student Outcomes at Age 15? Evidence From a Longitudinal Study”, 2018; “Still No Effect of Resources, Even in the New Gilded Age?”, 2016; “No, US school funding is actually somewhat progressive”
-
Racial isolation in the public schools; a report: “Effects of Compensatory Education in Majority-Negro Schools” (on Higher Horizons, the Banneker Project, and the All Day Neighborhood School Program)
-
Bereiter, C., & Engelmann, S. “An academically oriented preschool for disadvantaged children: Results from the initial experimental group”. In D. W. Brison & J. Hill (Eds.), Psychology and early childhood education. Ontario Institute for Studies in Education, 1968. No. 4. Pp. 17–3
-
Gates & Taylor 1925, “An experimental study of the nature of improvement resulting from practice in mental function”
-
Gordon & Wilkerson 1966, Compensatory education for the disadvantaged
-
Hodges & Spicker 1967, “The effects of preschool experiences on culturally deprived children”
-
Goslin 1967, Teachers and Testing
-
Clark 1963, “Educational stimulation of racially disadvantaged children”
-
Powledge 1967, To change a child: A report on the Institute for Developmental Studies
-
Reymert & Hinton 1940, “The effect of a change to a relatively superior environment upon the IQs of one hundred children”
-
Wargo et al 1971, “Further examination of exemplary programs for educating disadvantaged children”; Stickney 1977, “The Fading Out of Gains in ‘Successful’ Compensatory Education Programs”
-
U.S. Commission on Civil Rights 1967. Racial isolation in the public schools. Vol. 1
-
Vandenberg, S. G. “The nature and nurture of intelligence”. In D. C. Glass (Ed.), Biology and Behavior: Genetics, 1968
-
Vernon 1954, “Symposium on the effects of coaching and practice in intelligence tests”
-
Wrightstone et al 1964, Evaluation of Higher Horizons Programs for Underprivileged Children
-
“Stop Trying to Save the World: Big ideas are destroying international development”
-
“Money And School Performance: Lessons from the Kansas City Desegregation Experiment”, Ciotti 1998; “America’s Richest School Serves Low-Income Kids. But Much of Its Hershey-Funded Fortune Isn’t Being Spent”
-
“Improving Teaching Effectiveness: Final Report: The Intensive Partnerships for Effective Teaching Through 2015–2016”, et al 2018 ( “Study: Multi-Year Gates Experiment to Improve Teacher Effectiveness Spent $575 Million, Didn’t Make an Impact”)
-
Moving to Opportunity (et al 2006 ) & “Moving to Opportunity or Isolation? Network Effects of a Randomized Housing Lottery in Urban India”, et al 2015
-
Georgia Land Lotteries: “Shocking Behavior: Random Wealth in Antebellum Georgia and Human Capital Across Generations”, Bleakley & Ferrie 2013; “Up from Poverty? The 1832 Cherokee Land Lottery and the Long-run Distribution of Wealth”, Bleakley & Ferrie 2013; “The child quality-quantity tradeoff, England, 1780–1880: a fundamental component of the economic theory of growth is missing”, 2016; “Land lotteries, long-term wealth, and political selection”, 2019
-
“Wealth, Health, and Child Development: Evidence from Administrative Data on Swedish Lottery Players”, et al 2016
-
What Money Can’t Buy: Family Income and Children’s Life Chances, 1997
-
“The Intergenerational Effects Of A Large Wealth Shock: White Southerners After The Civil War”, et al 2019
-
“The Ticket To Easy Street? The Financial Consequences Of Winning The Lottery”, et al 2011
-
“Long-Run Effects of Lottery Wealth on Psychological Well-Being”, et al 2020 (“According to our estimate, an after-tax prize of $100,000 improves life satisfaction by 0.037 standard deviation (SD) units… For happiness and mental health, our rescaled estimates are about one-third the magnitude of the corresponding gradients estimated in cross-sectional data.”); “Association Between Lottery Prize Size and Self-reported Health Habits in Swedish Lottery Players”, Östling et al 2020; “When a Town Wins the Lottery: Evidence from Spain”, Kent & Martínez-2022
-
“Bankruptcy Rates among NFL Players with Short-Lived Income Spikes”, et al 2015
-
RAND Health Insurance Experiment; Oregon Medicaid health experiment
-
Opportunity NYC; “Disappointing findings on Conditional Cash Transfers as a tool to break the poverty cycle in the United States: Family Rewards 2.0”
-
“The Adequacy of Comparison Group Designs for Evaluations of Employment-Related Programs”, 1987
-
Seeing Like a State, 1998
-
Promises I Can Keep: Why Poor Women Put Motherhood Before Marriage, 2005
-
“Randomized Controlled Trials Commissioned by the Institute of Education Sciences Since 2002: How Many Found Positive Versus Weak or No Effects?”, Coalition for Evidence-Based Policy 2013 (~10% have any effect)12
-
“Most Rigorous Large-Scale Educational RCTs are Uninformative: Should We Be Concerned?”, Lortie-Forgues & Inglis 2019
-
“The impact of neighbourhood deprivation on adolescent violent criminality and substance misuse: A longitudinal, quasi-experimental study of the total Swedish population”, Sariaslan et al 2013; “Childhood family income, adolescent violent criminality and substance misuse: quasi-experimental total population study”, et al 2014; “Does Population Density and Neighborhood Deprivation Predict Schizophrenia? A Nationwide Swedish Family-Based Study of 2.4 Million Individuals”, et al 2014
-
“Investing in schools: capital spending, facility conditions, and student achievement”, et al 2015
-
“Non-cognitive deficits and young adult outcomes: the long-run impacts of an universal child care program”, et al 2015
-
“School Desegregation and Black Achievement: an integrative review”, 1985
-
“Effects of the Tennessee Prekindergarten Program on children’s achievement and behavior through third grade”, et al 2018 ( commentary; commentary, with author comments); “Effects of a statewide pre-kindergarten program on children’s achievement and behavior through sixth grade”, et al 2022
-
“The Long Run Consequences of Living In A Poor Neighborhood”, 2003
-
“Are Neighborhood Health Associations Causal? A 10-Year Prospective Cohort Study With Repeated Measurements”, 2014; “Does neighbourhood deprivation cause poor health? Within-individual analysis of movers in a prospective cohort study”, 2015; “Using Internal Migration to Estimate the Causal Effect of Neighborhood Socioeconomic Context on Health: A Longitudinal Analysis, England, 1995–2008”, et al 2017
-
“The Geography of Family Differences and Intergenerational Mobility”, et al 2017
-
“School Starting Age and the Crime-Age Profile”, et al 2015
-
“Born on the wrong day? School entry age and juvenile crime”, 2016
-
“Does High Self-esteem Cause Better Performance, Interpersonal Success, Happiness, or Healthier Lifestyles?”, Baumeister et al 2003 (“The Man Who Destroyed America’s Ego: How a rebel psychologist challenged one of the 20th century’s biggest-and most dangerous-idea”; “‘It was quasi-religious’: the great self-esteem con—In the 1980s, Californian politician John Vasconcellos set up a task force to promote high self-esteem as the answer to all social ills. But was his science based on a lie?”)
-
“‘Scared straight’ and other juvenile awareness programs for preventing juvenile delinquency”, et al 2013
-
“When Helping Hurts”: Cabot’s 1935 Cambridge-Somerville Youth Study
-
“Evaluations of road accident blackspot treatment: A case of the iron law of evaluation studies?”, 1997
-
“Why certain systematic reviews reach uncertain conclusions”, 2003
-
“What works?—questions and answers about prison reform”, Martinson 1974; The effectiveness of correctional treatment: A survey of treatment evaluation studies, Lipton et al 1975 (“The Debate on Rehabilitating Criminals: Is It True that Nothing Works?”); The rehabilitation of criminal offenders: Problems and prospects, Sechrest et al 1979; “CDATE: updating The Effectiveness of Correctional Treatment 25 years later”, Lipton 1995/“The Effects of Behavioral/Cognitive-Behavioral Programs on Recidivism”, Pearson et al 2002; “Large randomized trial finds disappointing effects for federally-funded programs to facilitate the re-entry of prisoners into the community. A new approach is needed”; “Effectiveness of psychological interventions in prison to reduce recidivism: a systematic review and meta-analysis of randomised controlled trials”, et al 2021
-
“Juvenile Delinquency Treatment: A Meta-Analytic Inquiry into the Variability of Effects”, Lipsey 1992 (Meta-Analysis for Explanation: A Casebook, Cook et al 1994)
-
“SNAP Benefits and Crime: Evidence from Changing Disbursement Schedules”, 2017
-
“The Economics of Scale Up”, et al 2017
-
“What Really Happened At The School Where Every Graduate Got Into College”
-
“Scale-Up Experiments”, et al 2018
-
“Persistence and Fadeout in the Impacts of Child and Adolescent Interventions”, et al 2017
-
“Beware the pitfalls of short-term program effects: They often fade”
-
“Boosting School Readiness: Should Preschool Teachers Target Skills or the Whole Child?”, et al 2018
-
“Shoeing the Children: the impact of the TOMS Shoe donation program in rural El Salvador”, et al 2016
-
“500 Life-Saving Interventions and Their Cost-Effectiveness”, et al 1995
-
“Delayed impact of fair machine learning”, et al 2018 ( discussion)
-
“What Do Workplace Wellness Programs Do? Evidence from the Illinois Workplace Wellness Study”, et al 2018
-
“What You Should Know About Megaprojects and Why: An Overview”, 2014
-
“Do children benefit from internet access? Experimental evidence from a developing country”, et al 2018
-
“Cognitive and Non-Cognitive Costs of Daycare [Age] 0–2 for Children in Advantaged Families”, et al 2019
-
performance contracting:
-
Educational Performance Contracting: An Evaluation of an Experiment, Gramlich & Koshel 1975 (ISBN: 0815732392)
-
“The Dozen Things Experimental Economists Should Do (More of)”, et al 2019
-
“Alcoholics Anonymous: Much More Than You Wanted To Know”; “They Were Promised Coding Jobs in Appalachia. Now They Say It Was a Fraud. Mined Minds came into West Virginia espousing a certain dogma, fostered in the world of start-ups and TED Talks. Students found an erratic operation”
-
“Why do humans reason? Arguments for an argumentative theory”, Mercier & Sperber 2011; “How Gullible are We? A Review of the Evidence from Psychology and Social Science”, 2017
-
“A national experiment reveals where a growth mindset improves achievement”, et al 2019
-
“What we can learn from five naturalistic field experiments that failed to shift commuter behavior”, 2019
-
“Beliefs About Human Intelligence in a Sample of Teachers and Nonteachers”, 2020
-
“Health Recommendations and Selection in Health Behaviors”, 2020
-
“Technology and educational choices: Evidence from a one–laptop–per–child program (OLPC)”, 2020 (from 2019)
-
“Improving Public Sector Management at Scale? Experimental Evidence on School Governance India”, 2020
-
“RCTs to Scale: Comprehensive Evidence from Two Nudge Units”, DellaVigna & Linos 2020; “Bottlenecks for Evidence Adoption”, DellaVigna et al 2024
-
“From Natural Variation to Optimal Policy? The Importance of Endogenous Peer Group Formation”, et al 2013
-
“Long-term Health and Social Outcomes in Children and Adolescents Placed in Out-of-Home Care”, et al 2021
-
“Nothing Scales”, Jason Kerwin
-
“Do Labor Market Policies have Displacement Effects? Evidence from a Clustered Randomized Experiment”, et al 2013
-
“Texting Students and Study Supporters (Project SUCCESS): Evaluation Report”, et al 2020
-
“Criminalizing Poverty: The Consequences of Court Fees in a Randomized Experiment”, et al 2022 (“relief from fees does not affect new criminal charges, convictions, or jail bookings after 12 months.”)
-
“Can behavioral interventions be too salient? Evidence from traffic safety messages”, 2022
-
“Reproducibility in the Social Sciences”, et al 2022; “It pays to be ignorant: A simple political economy of rigorous program evaluation”, 2002
-
“Why Men Are Hard to Help”, Richard V. Reeves
-
“The consequences of job search monitoring for the long-term unemployed: Disability instead of employment?”, De et al 2023
-
“The Unintended Consequences of Academic Leniency”, et al 2023 ( Twitter)
-
“The Long-Term Effects of Income for At-Risk Infants: Evidence from Supplemental Security Income”, et al 2023
-
“D.C. sent $10,800 to dozens of new moms. Here’s how it changed their lives.”
-
“LeBron James Opened a School That Was Considered an Experiment. It’s Showing Promise. The inaugural class of third and fourth graders at the school has posted extraordinary results on its first set of test scores.”, 2019; “LeBron’s School Has Everything We’re Told Students Need. It’s Still Failing Them. Not one ‘I Promise’ 8th grader has ever passed Ohio’s math test, and English and science scores are dismal. The question is, why?”, 2024
-
“Effort Traps: Socially Structured Striving and the Reproduction of Disadvantage”, 2024
-
“Live Aid: The Terrible Truth”, Robert Keating 1986
-
“The inefficacy of land titling programs: homesteading in Haiti, 1933–1950”, 2024
-
eg. the Iron law of wages, Iron Law of Oligarchy / Pournelle’s Iron Law of Bureaucracy / Schwartz’s Iron law of institutions, or the Iron law of prohibition; Aaron Shaw offers a collection of 33 other laws, some (all?) of which seem to be real. –Editor↩︎
-
Note that the law emphasizes that it applies primarily to “large scale” social programs, mainly those that are implemented by an established governmental agency covering a region or the nation as a whole. It does not apply to small scale demonstrations or to programs run by their designers.↩︎
-
One is reminded of the old philosophy saying that one man’s modus ponens is another man’s modus tollens. –Editor↩︎
-
Unfortunately, it has proven difficult to stop large scale programs even when evaluations prove them to be ineffective. The federal job training programs seem remarkably resistant to the almost consistent verdicts of ineffectiveness. This limitation on the Edisonian paradigm arises out of the tendency for large scale programs to accumulate staff and clients that have extensive stakes in the program’s continuation.↩︎
-
This is a complex example in which there are many competing explanations for the failure of the program. In the first place, the program may be a good example of the failure of problem theory since the program was ultimately based on a theory of criminal behavior as psychopathology. In the second place, the program theory may have been at fault for employing counseling as a treatment. This example illustrates how difficult it is to separate out the three sources of program failures in specific instances.↩︎
-
Rossi greatly undersells Project Follow Through here: it was not merely an educational experiment but one of the largest ever run, and, similar to the Office of Economic Opportunity’s “performance contracting” experiment, almost all of the interventions failed (and were harmful), with the exception of the perennially-unpopular Direct Instruction intervention. –Editor↩︎
-
See also Goodhart’s law. –Editor↩︎
-
Or iodization or, much more speculatively, the banning of leaded gasoline. –Editor↩︎
-
It’s unclear what book this is; WorldCat & Amazon & Google Books have no entry for a book named “Evaluation of Newark and Houston Policing Experiments”, and Google returns only Rossi’s paper. The Police Foundation website lists 2 reports for 1985: “Neighborhood Police Newsletters: Experiments in Newark and Houston” (executive summary, technical report, appendices) and “The Houston Victim Recontact Experiment” (executive summary, technical report, appendices). Possibly these were published together in a print form and this is what Rossi is referencing? –Editor↩︎
-
This appears to be a reference to 10 separate publications. CLMS #1–7’s data and the #8 report are available online; I have not found #9–10.↩︎
-
It is worth contrasting this striking estimate of the effect usually being zero in the IES’s RCTs as a whole with the far more sanguine estimates one sees derived from academic publications in 1993’s “The Efficacy of Psychological, Educational, and Behavioral Treatment: Confirmation From Meta-Analysis” (and to a much lesser extent, et al 2003’s “One Hundred Years of Social Psychology Quantitatively Described”). One man’s modus ponens…↩︎