DebatingRacialPreference - Stereotype Threat Research

STEREOTYPE THREAT:
WIDESPREAD, FUNDAMENTAL MISREPRESENTATION
OF IMPORTANT RACIAL RESEARCH

Curtis Crawford - March 2004

A January 2004 report in American Psychologist, the flagship journal of the American Psychological Association (APA), charges fundamental misrepresentation of important racial research. The accusations, leveled by Paul Sackett, Chaitra Hardison and Michael Cullen, involve newspapers, television, magazines, scholarly journals, and psychology textbooks. The allegedly misrepresented research was conducted by Claude Steele and Joshua Aronson in 1995, concerning the impact of "stereotype threat" on the Black/White test score gap.

Steele and Aronson hypothesized that the emotional pressure of a negative racial stereotype concerning intellectual ability would lower Black performance on exams like the SAT. In consequence, the scores of Black students on such exams would understate their skills, and thereby overstate the gap between White and Black intellectual ability.

To investigate this hypothesis, they gave a series of verbal reasoning tests, in which some test-takers were given instructions that "primed" the stereotype, while others were not. The priming consisted of describing the quiz as a measure of cognitive ability. The participants were Black and White Stanford University sophomores, whose verbal SAT scores (from prior testing) averaged 603 for the Blacks and 655 for the Whites.

The experimental results: When the stereotype was primed, Blacks did less well than the Whites with similar SAT scores. When the stereotype was not primed, Black performance equaled that of the Whites with similar SAT scores. (Steele, C.M. & Aronson, J. (1995), "Stereotype Threat and the Intellectual Test Performance of African Americans," Journal of Personality and Social Psychology 69, pp 797-811)

The first result clearly supports the researchers' hypothesis. It had predicted that the presence of "stereotype threat" would lower Black performance, thus widening the Black-White gap. And indeed, when the stereotype was primed, the Black participants did much worse than the White participants with similar SATs. However, the second result seems to contradict the hypothesis. It had predicted that the absence of stereotype threat would improve Black performance, narrowing the Black-White gap. In fact, when the stereotype was not primed, the Black participants did no better than the Whites with similar SATs. The size of the racial gap in the participants' intellectual skills (as indicated by their SAT scores) was duplicated in the experimental test results.

But this is not, according to the Sackett team, how the results have most often been described. As a first example, they cite the PBS program, Frontline, in which viewers were told:

At Stanford University, psychology professor Claude Steele has spent several years investigating the 150-point score gap between Whites and Blacks on standardized tests. Was the cause class difference, lower incomes, poorer schools, or something else? . . . In research conducted at Stanford, Steele administered a difficult version of the Graduate Record Exam, a standardized test like the SAT. To one set of Black and White sophomores, he indicated that the test was an unimportant research tool, to other groups that the test was an accurate measure of their verbal and reasoning ability. Blacks who believed the test was merely a research tool did the same as Whites. But Blacks who believed the test measured their abilities did half as well. Steele calls the effect "stereotype threat." (From "Secrets of the SAT," written by M. Chandler, broadcast 10/4/99, Boston: WGBH.)

Frontline gave a fair description of the experimental procedure, but misrepresented the outcome. As we have seen, the actual results were that the Blacks who believed that the test measured their abilities did much less well than the Whites whose SAT scores were similar, and the Blacks who thought the test merely a research tool did about the same as the Whites whose SAT scores were similar. The qualification "whose SAT scores were similar" makes all the difference. Its omission is the heart of the misrepresentations alleged. Leave it out, and you create the false impression that removing "stereotype threat" removed the racial test score gap among the participants. A sensational result! Restore the qualification, and you show that removing the threat left the test score gap intact.
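
The arithmetic behind this distinction can be sketched in a few lines of Python. The numbers below are invented for illustration and are not data from the 1995 experiment; the function names and the use of a pooled within-group slope (a standard analysis-of-covariance adjustment) are assumptions of this sketch, not a description of the researchers' own analysis. In the example, test scores depend only on prior scores, so the two groups differ in their raw averages but not once prior scores are held equal.

```python
from statistics import mean

# Hypothetical (prior_score, test_score) pairs, invented for illustration only.
# In both groups the test score is exactly 0.02 * prior score.
group_a = [(600, 12), (650, 13), (700, 14)]
group_b = [(550, 11), (600, 12), (650, 13)]


def raw_gap(a, b):
    """Difference in mean test score (a minus b), ignoring prior scores."""
    return mean(y for _, y in a) - mean(y for _, y in b)


def pooled_slope(groups):
    """Pooled within-group slope of test score on prior score."""
    sxy = 0.0
    sxx = 0.0
    for g in groups:
        xbar = mean(x for x, _ in g)
        ybar = mean(y for _, y in g)
        sxy += sum((x - xbar) * (y - ybar) for x, y in g)
        sxx += sum((x - xbar) ** 2 for x, _ in g)
    return sxy / sxx


def adjusted_gap(a, b):
    """Gap in mean test score after holding prior scores equal."""
    slope = pooled_slope([a, b])
    prior_gap = mean(x for x, _ in a) - mean(x for x, _ in b)
    return raw_gap(a, b) - slope * prior_gap


print("raw gap:", raw_gap(group_a, group_b))
print("adjusted gap:", adjusted_gap(group_a, group_b))
```

With these invented figures the adjusted gap vanishes while the raw gap does not. A report that says the groups "did the same" without adding "after adjusting for prior scores" describes the first quantity as though it were the second.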

The misrepresentation in Frontline might be dismissed as the kind of mistake television programs often make concerning technical issues. But in the very next example, the American Psychologist article arraigns a publication of the American Psychological Association. Richard McCarty, the APA's Executive Director for Science at the time, featured the Steele/Aronson research in his April 2001 Monitor on Psychology column. According to the Sackett team, McCarty "asserted that when the test was not labeled as a measure of intelligence, African American students performed just as well as White students." [Emphasis added.] Like Frontline, he failed to note that only after being adjusted for differences in the participants' prior SAT performance did Black and White scores come out equal.

Frontline and McCarty are the first of many to be fingered and copiously quoted. The Sackett team conducted an electronic search of popular media and scientific journals, and inspected recently published introductory psychology textbooks. They found 16 media articles that explicitly described the Steele/Aronson results, 14 (88%) of which incorrectly indicated that the racial performance gap disappeared when the "stereotype threat" was removed. One example (Newsweek, 11/6/95, p 82):

In another experiment, when Blacks were told that they were taking a test that would evaluate their intellectual skills, they scored below Whites. Blacks who were told that the test was a laboratory problem-solving task that was not diagnostic of ability scored about the same as Whites.

Among scientific journals, 11 articles and chapters were located, of which 10 (91%) "incorrectly asserted that subgroup differences disappeared in the nonstereotype-threat condition." For example (Wolfe, C.T. & Spencer, S.J. (1996). "Stereotypes and Prejudice: Their Overt and Subtle Influence in the Classroom," American Behavioral Scientist 40, 176-185):

Steele and Aronson (1995) found that when African American and white college students were given a difficult test of verbal ability presented as a diagnostic test of intellectual ability, African Americans performed more poorly on the tests than Whites. However, in another condition, when the exact same test was presented as simply a laboratory problem-solving exercise, African Americans performed equally as well as Whites on the test. One simple adjustment to the situation (changing the description of the test) eliminated the performance differences between Whites and African Americans.

Out of 27 introductory textbooks, 9 did not discuss "stereotype threat," while 9 others limited their treatment of the Steele/Aronson experiment to a correct description of within-group effects. Of the remaining 9, which compared Blacks and Whites, 5 (56%) falsely reported that the racial groups did equally well when the "stereotype threat" was removed. Thus (Kosslyn, S.M. & Rosenberg, R.S. (2001). Psychology: The Brain, the Person, the World. Boston: Allyn & Bacon, at p 284):

African-Americans and Whites did equally well when told that the test was simply a laboratory experiment, but African-American students did much worse than Whites when they thought the test measured intelligence.

To the Sackett examples let me add the following, from the well-known book by William Bowen and Derek Bok, The Shape of the River: Long-Term Consequences of Considering Race in College and University Admissions, (1998) Princeton, NJ: Princeton University Press, at p 81:

When black students were assured that the tests they were taking were not used to measure their ability, their performance no longer fell below that of whites. [Again, the all-important qualification, "with similar SAT scores," is absent, nor does it appear elsewhere in the book's report of this research.]

The misrepresented experimental result was simple: remove "stereotype threat" and Black performance matches that of Whites who have similar SAT scores. How could the fundamental transformation of this outcome documented by the Sackett team have occurred? Simply from ignorance of the significance of adjusting for SAT scores in reporting results? From wishful thinking that the Black/White test score gap does not represent real inequalities in academic skills? From preferential racism, a willingness to discredit academic requirements that impede or discourage minorities?

How do Steele and Aronson feel about the treatment of their work? When describing it in detail, they have repeatedly included the all-important qualification, "controlling" or "adjusting for SAT scores." Are they concerned if, as charged, this qualification is frequently omitted by reporters and scholars? Have they pointed to, and repudiated, clear examples of its misrepresentation?

In 1999, they joined several authors in an article that includes a brief summary of their 1995 findings. (Aronson, J., et al. (1999), "When White men can't do math," Journal of Experimental Social Psychology 35, p 30) The summary (quoted here in full) unaccountably omits the crucial qualification:

Steele and Aronson (1995) found, for example, that African-American college students were dramatically affected by stereotype threat conditions: they performed significantly worse than Whites on a standardized test when the test was presented as a diagnosis of their intellectual abilities, but about as well as Whites when the same test was presented as a nonevaluative problem solving task.

In their reply to the Sackett article, Steele and Aronson are not responsive (American Psychologist, same issue). They misrepresent its conclusions, reasons and evidence. They attempt repeatedly to change the subject. They regret, but minimize the importance of, any "mischaracterizations" of their research that might lead people to overestimate the potency of "stereotype threat." But they fail to acknowledge the core of the complaint: the omission in reports of their experiment of the qualification concerning SAT scores. Do they agree that this omission creates the false impression that their experimental removal of "stereotype threat" removed the participants' racial test score gap? No answer. Do they agree that this impression is completely false? No answer. By never uttering the alleged offense, Steele and Aronson avoid having to deny, defend or repudiate it.

According to the "misrepresentations" criticized by the Sackett team, when there was no "stereotype threat" in the Steele/Aronson experiment, the racial gap in the participants' test performance disappeared. If this had actually happened, the implications could be enormous. It could mean, if the participants were proved through further experiments to be typical, that the Black/White difference in the SAT and other verbal ability test scores is illusory. The test gap would be simply an administrative artifact, an accidental psychological effect of such tests on Black test-takers, not a valid measure of their skills.

As shown above, this description of the experimental results is false. Instead, what Steele and Aronson found was that when the stereotype was primed, Black performance sank, but White performance did not. When the stereotype was not primed, Black performance equaled that of Whites who had similar SAT scores.

The widespread misrepresentation of the results hides an experimental surprise that pleads for further investigation. The fact that Black performance (absent "stereotype threat") equaled that of Whites with similar SAT scores has a striking implication. It suggests that these Black students were not under "stereotype threat" when they took their SATs. For if they had been, their SAT scores would presumably have understated their true ability. And so, with the threat removed in this experiment, they would have done better than Whites whose SAT scores were similar. As things turned out, the SAT scores of the Black participants accurately predicted their performance in the experiment.

A clear grasp of this result prompts questions that beg for answers. If the Black participants in this study were not affected by "stereotype threat" when taking their SATs, were they typical or exceptional? Would further research demonstrate that some or even most Blacks actually are under significant "stereotype threat" when doing the SAT or similar exams? The 1995 experiment raised the possibility that "stereotype threat" affects Black SAT scores, but provided no proof as to whether, how often or how much.

Nevertheless, it did imply a methodology for testing whether Black scores on the SAT and similar exams have been reduced and thereby rendered unreliable by "stereotype threat": Choose Black and White students who have taken such an exam, and give them questions like those on the exam. Use participants' scores on the exam under scrutiny as a basis for comparison. Prime a stereotype that is negative for Blacks with half the students, using the others as controls. Then proceed as in the 1995 Steele/Aronson experiment.

If, holding the previous scores equal, the primed Black students do much worse, their previous scores were probably not lowered by "stereotype threat." If the Black students who are not primed do about the same as the White students with similar previous scores, the absence of "stereotype threat" in the previous exam is confirmed. If, on the contrary, the unprimed Black students do substantially better than the White students with similar previous scores, the presence of "stereotype threat" in the previous exam is indicated.
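
The three readings just described can be written down as a small decision rule. This is only a sketch of the reasoning in the preceding paragraph: the function names, the sign convention (Black mean minus White mean, after holding previous scores equal) and the `tolerance` that separates "about the same" from "substantially" different are all assumptions introduced here, and choosing that tolerance would in practice be a statistical question the essay does not settle.

```python
def primed_group_did_much_worse(adjusted_gap_primed, tolerance):
    """True if, holding previous scores equal, the primed Black students
    scored lower than comparable White students by more than the tolerance."""
    return adjusted_gap_primed < -tolerance


def read_unprimed_result(adjusted_gap_unprimed, tolerance):
    """Interpret the unprimed comparison, holding previous scores equal.

    adjusted_gap_unprimed: Black mean minus White mean after adjustment.
    tolerance: hypothetical threshold for calling two results 'about the same'.
    """
    if abs(adjusted_gap_unprimed) <= tolerance:
        return "absence of stereotype threat in the previous exam is confirmed"
    if adjusted_gap_unprimed > tolerance:
        return "presence of stereotype threat in the previous exam is indicated"
    # An unprimed shortfall is not one of the cases the paragraph discusses.
    return "outcome not addressed by the stated rules"


# Illustrative inputs only; these are not results from any study.
print(primed_group_did_much_worse(-3.0, tolerance=0.5))
print(read_unprimed_result(0.2, tolerance=0.5))
print(read_unprimed_result(2.5, tolerance=0.5))
```

The third branch is included only so that the function returns something for every input; the text above does not say how such an outcome should be read.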

The 1995 experiment indicated that "stereotype threat" can lower Black scores in examinations of intellectual ability, but failed to show that it has. However, the findings suggest a way of testing whether and in what proportion Black students taking such exams have been performing under "stereotype threat," and how much their scores have been affected. Has this method been put to work, and what are the results? On these exciting questions, both the Sackett article and the Steele/Aronson rebuttal are silent.