Personality psychologists wallow in effect size; the ubiquitous correlation coefficient, Pearson’s r, is central to nearly every research finding they report. As a consequence, discussions of relationships between personality variables and outcomes are routinely framed by assessments of their strength. For example, a landmark paper reviewed predictors of divorce, mortality, and occupational achievement, and concluded that personality traits have associations with these life outcomes that are as strong as or stronger than traditional predictors such as socio-economic status or cognitive ability (Roberts et al., 2007). This is just one example of how personality psychologists routinely calculate, care about, and even sometimes worry about the size of the relationships between their theoretical variables and their predicted outcomes.
Social psychologists, not so much. The typical report in experimental social psychology focuses on p-level, the probability of the magnitude of the difference between experimental groups occurring if the null hypothesis of no difference were to be true. If this probability is .05 or less, then: Success! While effect sizes (usually Cohen’s d or, less often, Pearson’s r) are reported more often they they used to be – probably because the APA Publication Manual explicitly requires it (a requirement not always enforced) – the emphasis of the discussion of the theoretical or even the practical importance of the effect typically centers around whether it exists. The size simply doesn’t matter.
Is this description an unfair caricature of social psychological research practice? That’s what I thought until recently. Even though the typical statistical education of many experimentally-oriented psychologists bypasses extensive discussion of effect size in favor of the ritual of null-hypothesis testing, I assumed that the smarter social psychologists grasped that an important part of scientific understanding involves ascertaining not just whether some relationship between two variables “exists,” but how big that relationship is and how it compares to various benchmarks of theoretical or practical utility.
It turns out I was wrong. I recently had an email exchange with a prominent social psychologist who I greatly respect. [i] I was shocked, therefore, when he wrote the following[ii]:
…the key to our research… [is not] to accurately estimate effect size. If I were testing an advertisement for a marketing research firm and wanted to be sure that the cost of the ad would produce enough sales to make it worthwhile, effect size would be crucial. But when I am testing a theory about whether, say, positive mood reduces information processing in comparison with negative mood, I am worried about the direction of the effect, not the size (indeed, I could likely change the size by using a different manipulation of mood, a different set of informational stimuli, a different contextual setting for the research — such as field versus lab). But if the results of such studies consistently produce a direction of effect where positive mood reduces processing in comparison with negative mood, I would not at all worry about whether the effect sizes are the same across studies or not, and I would not worry about the sheer size of the effects across studies. This is true in virtually all research settings in which I am engaged. I am not at all concerned about the effect size (except insofar as very small effects might require larger samples to find clear evidence of the direction of the effect — but this is more of a concern in the design phase, not in interpreting the meaning of the results). In other words, I am yet to develop a theory for which an effect size of r = .5 would support the theory, but an effect size of r = .2 (in the same direction) would fail to support it (if the effect cannot be readily explained by chance). Maybe you have developed such theories, but most of our field has not.
To this comment, I had three reactions.
First, I was startled by the claim that social psychologists don’t and shouldn’t care about effect size. I began my career during the dark days of the Mischelian era, and the crux of Mischel’s critique was that personality traits rarely correlate with outcomes greater than .30. He never denied that the correlations were significant, mind you, just that they weren’t big enough to matter to anybody on either practical or theoretical grounds. Part of the sport was to square this correlation, and state triumphantly (and highly misleadingly) that therefore personality only “explains” “9% of the variance.” Social psychologists of the era LOVED this critique[iii]! Some still do. Oh, if only one social psychologist had leapt to personality psychology’s defense in those days, and pointed out that effect size doesn’t matter as long as we have the right sign on the correlation… we could have saved ourselves a lot of trouble (Kenrick & Funder, 1988).
Second, I am about 75% joking in the previous paragraph, but the 25% that’s serious is that I actually think that Mischel made an important point – not that .30 was a small effect size (it isn’t), but that effect size should be the name of the game. To say that an effect “exists” is a remarkably simplistic statement that on close examination means almost nothing. If you work with census data, for example, EVERYTHING — every comparison between two groups, every correlation between any two variables — is statistically significant at the .000001 level. But the effect sizes are generally teeny-tiny, and of course lots of them don’t make any sense either (perhaps these should be considered “counter-intuitive” results). Should all of these findings be taken seriously?
Third, if the answer is no, then we have to decide how big an effect is in fact worth taking seriously. And not just for purposes of marketing campaigns! If, for example, a researcher wants to say something like “priming effects can overwhelm our conscious judgment” (I have read statements like that), then we need to start comparing effect sizes. Or, if we are just going to say that “holding a hot cup of coffee makes you donate more money to charity” (my favorite recent forehead-slapping finding) then the effect size is important for theoretical, not just practical purposes, because a small effect size implies that a sizable minority is giving LESS money to charity, and that’s a theoretical problem, not just a practical one. More generally, the reason a .5 effect size is more convincing, theoretically, than a .2 effect size is that the theorist can put less effort into explaining why so many participants did the opposite of what the theory predicted.
Still, it’s difficult to set a threshold for how big is big enough. As my colleague pointed out in a subsequent e-mail – and as I’ve written myself, in the past — there are many reasons to take supposedly “small” effects seriously. Psychological phenomena are determined by many variables, and to isolate one that has an effect on an interesting outcome is a real achievement, even though in particular instances it might be overwhelmed by other variables with opposite influences. Rosenthal and Rubin (1982) demonstrated how a .30 correlation was enough to be right, about two times out of three. Ahadi and Diener (1989) showed that if just a few factors affect a common outcome, the maximum size of the effect of any one of them is severely constrained. In a related vein, Abelson (1985) calculated how very small effect sizes – in particular, the relationship between batting average and performance in a single at-bat – can cumulate fairly quickly into large differences in outcomes (or ballplayer salaries). So far be it from me to imply that a “small” effect, by any arbitrary standard, is unimportant.
Now we are getting near the crux of the matter. Arbitrary standards – whether the .05 p-level threshold or some kind of minimum credible effect size – are paving stones on the road to ruin. Personality psychologists routinely calculate and report their effect sizes, and as a result have developed a pretty good view of what these numbers mean and how to interpret them. Social psychologists, to this day, still don’t pay much attention to effect sizes so haven’t developed a base of experience for evaluation. This is why my colleague Dan Ozer and I were able to make a splash as young beginning researchers, simply by pointing out that, for example, the effect size of the distance of the victim on obedience in the Milgram study was in the .30’s (Funder & Ozer, 1983). The calculation was easy, even obvious, but apparently nobody had done it before. A meta-analysis by Richard et al. (2003) found that the average effect size of published research in experimental social psychology is r = .21. This finding remains unknown, and probably would come as a surprise, to many otherwise knowledgeable experimental researchers.
But this is what happens when the overall attitude is that “effect size doesn’t matter.” Judgment lacks perspective, and we are unable to separate that which is truly important from that which is so subtle as to be virtually undetectable (and, in some cases, notoriously difficult to replicate).
My conclusion, then, is that effect size is important and the business of science should be to evaluate it, and its moderators, as accurately as possible. Evaluating effect sizes is and will continue to be difficult, because (among other issues) they may be influenced by extraneous factors, because apparently “small” effects can cumulate into huge consequences over time, and because any given outcome is influenced by many different factors, not just one or even a few. But the solution to this difficulty is not to regard effect sizes as unimportant, much less to ignore them altogether. Quite the contrary, the more prominence we give to effect sizes in reporting and thinking about research findings, the better we will get at understanding what we have discovered and how important it really is.
–
References
Abelson, R. P. (1985). “A variance explanation paradox: When a little is a lot.” Psychological Bulletin, 97, 129–133.
Ahdadi, S., & Diener, E. (1989). Multiple determinants and effect size. Journal of Personality and Social Psychology, 56, 398-406.
Funder, D.C., & Ozer, D.J. (1983). Behavior as a function of the situation. Journal of Personality and Social Psychology, 44, 107-112.
Kenrick, D.T., & Funder, D.C. (1988). Profiting from controversy: Lessons from the person-situation debate. American Psychologist, 43, 23-34.
Nisbett, R.E., (1980). The trait construct in lay and professional psychology. In L. Festinger (Ed.), Retrospections on social psychology (pp. 109-130). New York: Oxford University Press.
Richard, F.D., Bond, C.F., Jr., & Stokes-Zoota, J.J. (2003). One hundred years of social psychology quantitatively described. Review of General Psychology, 7, 331-363.
Roberts, B.W., Kuncel, N.R., Shiner, R., Caspi, A., & Goldberg L.R. (2007). The power of personality: The comparative validity of personality traits, socioeconomic status, and cognitive ability for predicting important life outcomes. Perspectives in Psychological Science, 2, 313-345.
Rosenthal, R., & Rubin, D.B. (1982). A simple, general-purpose display of magnitude of experimental effect. Journal of Educational Psychology, 74, 166-169.
[i] We served together for several years on a grant review panel, a bonding experience as well as a scientific trial by fire, and I came to admire his incisive intellect and clear judgment.
[ii] I obtained his permission to quote this passage but, understandably, he asked that he not be named in order to avoid being dragged into a public discussion he did not intend to start with a private email.
[iii] See, e.g., Nisbett, 1980, who raised the “personality correlation” to .40 but still said it was too small to matter. Only 16% of the variance, don’t you know.