In his 1966 book, Rosenthal demonstrated the importance of experimenter effects in behavioral research. After lucid discussion of the experimenter as biased observer and interpreter of data, and of the effects of relatively permanent experimenter attributes on subjects' responses, a series of ingenious experiments was reported showing the effects of experimenter expectancy on both human and animal behavior. Many sound suggestions were then offered on the control and reduction of self-fulfilling prophecies in psychological research. To herald the generality and potential importance of such phenomena, the book closed dramatically with a preliminary analysis of data on teacher expectancy effects and pupil IQ gains in elementary school. Those closing pages (pp. 410–413) have since been expanded by Rosenthal and Jacobson for Psychological Reports (1966, 19, 115–118) and Scientific American (1968, 218(4), 19–23) and now appear as a not quite fully grown, glossy-backed Pygmalion.
The first 60 pages of Pygmalion recount in interesting and readable form the nature of self-fulfilling prophecies and then introduce the reader to some educational problems of disadvantaged children, including discussion of teachers' possible roles in these problems. The remainder of the book is essentially a report of original research and must be reviewed as such. It is the considered opinion of this reviewer that the research would have been judged unacceptable if submitted to an APA journal in its present form. Despite its award-winning experimental design, the study suffers from serious measurement problems and inadequate data analysis. Its reporting, furthermore, appears to violate the spirit of Rosenthal's own earlier admonitions to experimenters and stands as a casebook example of many of Darrell Huff's (How to Lie with Statistics. New York: Norton, 1954) admonitions to data analysts.
The study involved fast, medium, and slow reading classrooms at each grade from first through sixth in a single elementary school, "Oak" School, in South San Francisco. During May 1964, while Ss were in Grades K through 5, the "Harvard Test of Inflected Acquisition" was administered as part of a "Harvard-NSF Validity Study." As described to teachers, the new instrument purported to identify "bloomers" who would probably experience an unusual forward spurt in academic and intellectual performance during the following year. Actually, the measure was Flanagan's Tests of General Ability (TOGA), chosen as a nonlanguage group intelligence test that would provide verbal and reasoning subscores as well as a total IQ. TOGA was judged appropriate for the study because it would probably be unfamiliar to the teachers and because it offers three forms, for Grades K-2, 3–4, and 5–6, all of similar style and content. As school began in Fall 1964, a randomly chosen 20% of the Ss were designated as "spurters." Each of the 18 teachers received a list of from one to nine names, identifying those spurters who would be in his class. TOGA was then readministered in January 1965, May 1965, and May 1966. In addition to the main comparison between experimental and control Ss within each grade level and reading ability track, contrasts were also planned for sex and Mexican/non-Mexican subgroups. The complete experiment could be characterized as a 2×2×2×3×6×4 factorial design, with repeated measures on the last factor, but the full analysis of this table was neither planned nor possible, due to incomplete data and empty cells. Rosenthal and Jacobson chose to obtain simple gain scores from the pretest to the third testing, called the "basic" posttest, and to make their primary comparisons with these. The main statistical computations were two- and three-way analyses of variance, using the unweighted means approximation to overcome problems of unequal cell frequencies.
Also provided were supplemental analyses of data from the second and fourth TOGA administrations as well as grades in various school subjects, teacher ratings of classroom behavior, and a substudy of achievement test scores. The results were interpreted as showing "…that teachers' favorable expectations can be responsible for gains in their pupils' IQs and, for the lower grades, that these gains can be quite dramatic" (p. 98).
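To give a sense of why the full factorial table could not be analyzed, the following minimal sketch enumerates the cells of the 2×2×2×3×6×4 layout described above. The assignment of factor names to dimensions (treatment, sex, Mexican/non-Mexican, track, grade, testing occasion) is this sketch's reading of the design description, and the level labels are illustrative assumptions rather than the authors' coding.

```python
from itertools import product
from math import prod

# Factor levels as read from the design description (labels are assumptions).
factors = {
    "treatment": ["experimental", "control"],
    "sex": ["boy", "girl"],
    "ethnic_group": ["Mexican", "non-Mexican"],
    "track": ["fast", "medium", "slow"],
    "grade": [1, 2, 3, 4, 5, 6],
    "testing": [1, 2, 3, 4],  # repeated-measures factor (four TOGA administrations)
}

# Total number of cells in the complete factorial table.
total_cells = prod(len(levels) for levels in factors.values())

# Between-subject cells: every factor except the repeated testing occasion.
between_cells = total_cells // len(factors["testing"])

# Explicit enumeration of every cell, one tuple per combination of levels.
cells = list(product(*factors.values()))

print(f"total cells: {total_cells}, between-subject cells: {between_cells}")
```

With only about 20% of the pupils in one school assigned to the experimental condition, many of the between-subject cells on the experimental side would be expected to hold very few Ss or none, which is consistent with the empty cells noted above.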
A complete methodological critique is impossible from information available in the book, even though an appendix and several extensive footnotes are included, but the authors have cooperated completely in providing the reviewer and his colleagues access to the original data and permission to reanalyze them. While details of the reanalysis cannot be given here, a full report is planned. Many methodological issues can be elaborated here, however, from a review of the book alone.
Problems began with the decision to rely solely on TOGA. The test does not have adequate norms for the youngest children, especially for children from lower socio-economic backgrounds. It was administered to separate classes by the teachers themselves; this adds considerable uncertainty about standardization of procedure. All computations were based on IQ scores; the more meaningful raw scores were neither used nor provided for reanalysis, even though they would be preferable for most data analysis purposes. These concerns loom large as one examines data tables in the appendix. In Table A-3, pretest reasoning IQ means for Grade 1 (tested in K) are 47.19 and 30.79 for 16 middle and 19 low track control Ss, and 54.00 and 53.50 for 4 and 2 experimental Ss, respectively. The average for all first grade children is 58. Were these children actually functioning at imbecile and low moron levels? More likely, the test was not functioning at this age level. TOGA does not have norms below an IQ of 60. To obtain IQ scores as low as these, given reasonably distributed ages, raw scores would have to represent random or systematically incorrect responding. Presumably the published conversion tables were extrapolated, even into the chance score range. In the original data cards, one S with a pretest reasoning IQ of 17 had posttest IQs of 148, 110, and 112. Another showed reasoning IQs of 18, 44, 122, and 98. In the opposite direction, still another S had successive verbal IQs of 183, 166, 221, and 168 though TOGA does not have norms above 160. Many other IQs are equally strange. Readers should wonder why other mental ability information, already available from the school or obtainable without undue additional effort, was not used along with TOGA.
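The screening that these examples call for can be sketched in a few lines. The block below takes the three score sequences quoted above, flags every IQ falling outside the 60 to 160 range for which TOGA has norms, and reports the largest swing across the four testings. The labels for the three Ss and the helper names are hypothetical; only the scores and the norm limits come from the text.

```python
# Norm limits as stated in the review: no TOGA norms below 60 or above 160.
NORM_MIN, NORM_MAX = 60, 160

# Successive IQs across the four testings, as quoted in the review.
# The labels identifying each S are hypothetical.
records = {
    "S_a reasoning": [17, 148, 110, 112],
    "S_b reasoning": [18, 44, 122, 98],
    "S_c verbal": [183, 166, 221, 168],
}

def out_of_norm(scores, lo=NORM_MIN, hi=NORM_MAX):
    """Return the scores that fall outside the normed IQ range."""
    return [s for s in scores if s < lo or s > hi]

def max_swing(scores):
    """Largest difference between any two testings for one S."""
    return max(scores) - min(scores)

report = {
    label: {"out_of_norm": out_of_norm(scores), "max_swing": max_swing(scores)}
    for label, scores in records.items()
}

for label, summary in report.items():
    print(label, summary)
```

A check of this kind, run before any gain scores were computed, would at least have identified which IQs rested on extrapolated conversion tables.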
Tables 7–1, 7–3, and 7–4, the main results, show that the difference between experimental and control groups in mean gain from first to third testing was essentially zero for all grades except the first two, where the experimental group gain apparently did exceed that of the control group. But the authors correlated grade level with mean difference and say, "We find increasing expectancy advantage as we go from the sixth to the first grade" (p. 74). Examining these mean differences for both the total and part-scores, and noting the substantially larger standard deviations for reasoning, it is evident that the reasoning subscores in Grades 1 and 2 provided the principal effect. These are precisely the scores whose meaning is most questionable.
Other aspects of the analysis are also troublesome. Although about 20% of the initially-tested Ss were lost to the experiment, the effect of this loss on IQ averages in various subgroups is not dealt with directly. No mention is made that unweighted means analysis requires homogeneous variances and random cell-size fluctuations. Data provided in appendix tables indicate that at least the variance assumption may well have been violated in many instances. Heterogeneity is most apparent in the earliest two grades where ratios of variance for experimental and control groups frequently provide F's of 20 or more. Statistical tests are thus sometimes too conservative and sometimes too liberal. In any event, pooled error terms seem generally unjustified. Throughout the book, p values ranging from .20 to .00002 are used fallaciously as if they were a measure of strength of effect. The authors rely on simple gain tests (i.e., tests of the difference between difference scores), even though many mean pretest differences between treatment groups equal or exceed obtained posttest differences. And it is simply not true that "…post-test-only measures are less precise than the change or gain scores…" (p. 108). The important repeated measures aspect of the design is ignored. There is no hint that the regression of posttests on pretest for different treatment groups should be of interest to anyone or that the four testings should be of equal importance in the analysis. Results for the other testings, incidentally, are markedly different from those obtained with the "basic" posttest. Also, since transfer of Ss between ability tracks is not discussed, the reader is permitted the dubious assumption that no students changed track across the study's two-year span, even though some IQs changed more than 100 points!
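The point that a p value is not a measure of strength of effect can be illustrated with a small sketch. It holds a standardized mean difference fixed and varies only the group size, using a normal (z) approximation for two equal groups with known unit variance. The effect size of 0.2 and the group sizes are hypothetical values chosen for illustration; none of them comes from the Pygmalion data.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p(d, n_per_group):
    """Two-sided p value for a standardized mean difference d between two
    equal-sized groups, under a z approximation with known unit variance."""
    z = d * sqrt(n_per_group / 2)  # standard error of the difference is sqrt(2/n)
    return 2 * (1 - NormalDist().cdf(abs(z)))

effect_size = 0.2                 # hypothetical, identical in every row
group_sizes = (20, 200, 2000)     # hypothetical sample sizes per group

p_values = {n: two_sided_p(effect_size, n) for n in group_sizes}

for n, p in p_values.items():
    print(f"d = {effect_size}, n per group = {n}: p = {p:.2g}")
```

The same modest difference moves from clearly nonsignificant to overwhelmingly significant as the groups grow, so the size of p reflects sample size at least as much as the magnitude of the effect.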
Finally, the reporting style is appalling. It is too easy for the unwary reader to pick choice findings and figures from the book but too difficult for him to verify the data and analyses on which they are based. Comparisons between text and appendix tables are hampered by use of different subgroupings of the data and absence of intermediate analysis of variance tables. Score distributions are not given. Graphs and tables are frequently misleading: some show only differences between difference scores, where basic data are not available in the book (e.g., Table 7–6); some fail to indicate the small sample sizes on which impressive percentages are based (e.g., Fig. 7–2); some use microscopic scales to overemphasize practically insignificant differences (e.g., Fig. 8–1); and some display floating zero points and elastic scales, making comparisons from one graph to another difficult (e.g., Figs. 9–3 through 9–6). Text and tables, as well as different parts of the text, are not always in close agreement. Statements like "When the entire school benefited as in total IQ and reasoning IQ, all three tracks benefited…" (p. 78) are used in describing results. In reconciling results with those of other studies, however, statements like "The finding that only the younger children profited after one year from their teacher's favorable expectations helps us to understand better the [negative] results of two other experimenters…" (p. 84) appear.
The book closes with three potentially useful though cursory chapters. The first takes some glib steps toward meeting specific methodological criticisms. It also offers speculation on possible processes of intentional and unintentional influence between the teachers and students but fails to face the full realization that the teachers could not remember, and reported hardly having glanced at, the names on the original lists of "bloomers." Another discusses more general methodological aspects of Hawthorne and expectancy studies, including helpful design suggestions. The last provides a capsule summary and some general implications. It is here that the abstraction process from basic data to general statements is most apparent and it is here that the inadequacy of almost any statistical summary of these data should be clearly specified. But it is not. One fears that the experimenters have convinced themselves, in the course of the analysis and of the book, that what they believed all along is true without further question.
The social importance of the problem studied in this book cannot be overestimated. While social and behavioral scientists are always responsible for the proper conduct and reporting of research, nowhere should this responsibility be more keenly felt and exercised than in work bearing directly on urgent and volatile social issues. With these thoughts in mind, the reviewer notes with growing alarm comments in the popular press such as the following:
"Here may lie the explanation of the effects of socio-economic status on schooling. Teachers of a higher socio-economic status expect pupils of a lower socio-economic status to fail" (Robert Hutchins, Success in Schools, San Francisco Chronicle, August 11, 1968, p. 2);
"Jose, a Mexican American boy…moved in a year from being classed as mentally retarded to above average. Another Mexican American child, Maria, moved…from 'slow learner' to 'gifted child,'…The implications of these results will upset many school people, yet these are hard facts" (Herbert Kohl, Review of Pygmalion in the Classroom, The New York Review of Books, September 12, 1968, p. 31);
"The findings raise some fundamental questions about teacher training. They also cast doubt on the wisdom of assigning children to classes according to presumed ability, which may only mire the lowest groups into self-confining ruts" (Time, September 20, 1968, p. 62).
Still further comment has appeared in the Saturday Review (October 19, 1968) and in a special issue of The Urban Review (September, 1968) devoted solely to the topic of expectancy, which includes a selection from Pygmalion. The inclusion of such selections in books of readings is, of course, also inevitable.
Considering these releases, expecting wider coverage from less responsible news media in the future, and imagining the irate faces at countless school board and PTA meetings, one recalls the words of Huff (p. 45):
"The fault is in the filtering-down processes from the researcher through the sensational or ill-informed writer to the reader who fails to miss the figures that have disappeared in the process."
Teacher expectancy may be a powerful phenomenon which, if understood, could be used to gain much of positive value in education. Rosenthal and Jacobson will have made an important contribution if their work prompts others to do sound research in this area. But their study has not come close to providing adequate demonstration of the phenomenon or understanding of its process. Pygmalion, inadequately and prematurely reported in book and magazine form, has performed a disservice to teachers and schools, to users and developers of mental tests, and perhaps worst of all, to parents and children whose newly gained expectations may not prove quite so self-fulfilling.