Does (effect) Size Matter?

Personality psychologists wallow in effect size; the ubiquitous correlation coefficient, Pearson’s r, is central to nearly every research finding they report.  As a consequence, discussions of relationships between personality variables and outcomes are routinely framed by assessments of their strength.  For example, a landmark paper reviewed predictors of divorce, mortality, and occupational achievement, and concluded that personality traits have associations with these life outcomes that are as strong as or stronger than traditional predictors such as socio-economic status or cognitive ability (Roberts et al., 2007).  This is just one example of how personality psychologists routinely calculate, care about, and even sometimes worry about the size of the relationships between their theoretical variables and their predicted outcomes.

Social psychologists, not so much.  The typical report in experimental social psychology focuses on the p-level: the probability of obtaining a difference between experimental groups at least as large as the one observed, if the null hypothesis of no difference were true.  If this probability is .05 or less, then: Success!  While effect sizes (usually Cohen’s d or, less often, Pearson’s r) are reported more often than they used to be – probably because the APA Publication Manual explicitly requires it (a requirement not always enforced) – the discussion of the theoretical or even the practical importance of the effect typically centers on whether it exists.  The size simply doesn’t matter.

Is this description an unfair caricature of social psychological research practice?  That’s what I thought until recently.  Even though the typical statistical education of many experimentally oriented psychologists bypasses extensive discussion of effect size in favor of the ritual of null-hypothesis testing, I assumed that the smarter social psychologists grasped that an important part of scientific understanding involves ascertaining not just whether some relationship between two variables “exists,” but how big that relationship is and how it compares to various benchmarks of theoretical or practical utility.

It turns out I was wrong.  I recently had an email exchange with a prominent social psychologist whom I greatly respect.[i]  I was shocked, therefore, when he wrote the following[ii]:

 …the key to our research… [is not] to accurately estimate effect size. If I were testing an advertisement for a marketing research firm and wanted to be sure that the cost of the ad would produce enough sales to make it worthwhile, effect size would be crucial. But when I am testing a theory about whether, say, positive mood reduces information processing in comparison with negative mood, I am worried about the direction of the effect, not the size (indeed, I could likely change the size by using a different manipulation of mood, a different set of informational stimuli, a different contextual setting for the research — such as field versus lab). But if the results of such studies consistently produce a direction of effect where positive mood reduces processing in comparison with negative mood, I would not at all worry about whether the effect sizes are the same across studies or not, and I would not worry about the sheer size of the effects across studies. This is true in virtually all research settings in which I am engaged. I am not at all concerned about the effect size (except insofar as very small effects might require larger samples to find clear evidence of the direction of the effect — but this is more of a concern in the design phase, not in interpreting the meaning of the results). In other words, I am yet to develop a theory for which an effect size of r = .5 would support the theory, but an effect size of r = .2 (in the same direction) would fail to support it (if the effect cannot be readily explained by chance). Maybe you have developed such theories, but most of our field has not.

To this comment, I had three reactions.

First, I was startled by the claim that social psychologists don’t and shouldn’t care about effect size. I began my career during the dark days of the Mischelian era, and the crux of Mischel’s critique was that personality traits rarely correlate with outcomes greater than .30. He never denied that the correlations were significant, mind you, just that they weren’t big enough to matter to anybody on either practical or theoretical grounds. Part of the sport was to square this correlation, and state triumphantly (and highly misleadingly) that therefore personality only “explains” “9% of the variance.”  Social psychologists of the era LOVED this critique[iii]! Some still do. Oh, if only one social psychologist had leapt to personality psychology’s defense in those days, and pointed out that effect size doesn’t matter as long as we have the right sign on the correlation… we could have saved ourselves a lot of trouble (Kenrick & Funder, 1988).

Second, I am about 75% joking in the previous paragraph, but the 25% that’s serious is that I actually think that Mischel made an important point – not that .30 was a small effect size (it isn’t), but that effect size should  be the name of the game.  To say that an effect “exists” is a remarkably simplistic statement that on close examination means almost nothing.  If you work with census data, for example, EVERYTHING — every comparison between two groups, every correlation between any two variables — is statistically significant at the .000001 level. But the effect sizes are generally teeny-tiny, and of course lots of them don’t make any sense either (perhaps these should be considered “counter-intuitive” results). Should all of these findings be taken seriously?
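
To put the arithmetic behind this point on the table, here is a minimal sketch (my own illustration, with invented numbers rather than actual census figures; the helper p_from_r is hypothetical) of the two-tailed p-value implied by a correlation of a given size at a given N, using the standard t-test for a Pearson r:

```python
# Sketch: with census-sized samples, even a near-zero correlation is "highly significant".
from math import sqrt
from scipy import stats

def p_from_r(r, n):
    """Two-tailed p-value for a Pearson correlation of size r in a sample of n."""
    t = r * sqrt(n - 2) / sqrt(1 - r ** 2)
    return 2 * stats.t.sf(abs(t), df=n - 2)

print(p_from_r(0.01, 1_000_000))  # ~1e-23: "significant" by any conventional standard
print(p_from_r(0.01, 100))        # ~0.92: the same tiny effect, invisible at n = 100
```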

Third, if the answer is no, then we have to decide how big an effect is in fact worth taking seriously. And not just for purposes of marketing campaigns! If, for example, a researcher wants to say something like “priming effects can overwhelm our conscious judgment” (I have read statements like that), then we need to start comparing effect sizes. Or, if we are just going to say that “holding a hot cup of coffee makes you donate more money to charity” (my favorite recent forehead-slapping finding) then the effect size is important for theoretical, not just practical purposes, because a small effect size implies that a sizable minority is giving LESS money to charity, and that’s a theoretical problem, not just a practical one.  More generally, the reason a .5 effect size is more convincing, theoretically, than a .2 effect size is that the theorist can put less effort into explaining why so many participants did the opposite of what the theory predicted.
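
One way to see how many participants a given correlation implies went against the prediction is Rosenthal and Rubin’s binomial effect size display, discussed in the next paragraph. A back-of-the-envelope sketch (my own illustration; the besd helper and the particular r values are just for demonstration):

```python
# Sketch: the binomial effect size display splits a correlation r into the proportion
# of cases that come out in the predicted direction vs. the opposite direction.
def besd(r):
    return 0.50 + r / 2, 0.50 - r / 2

for r in (0.2, 0.3, 0.5):
    hit, miss = besd(r)
    print(f"r = {r:.1f}: {hit:.0%} as predicted, {miss:.0%} the other way")
# r = 0.2: 60% vs 40%.  r = 0.5: 75% vs 25% -- far fewer "wrong way" cases to explain.
```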

Still, it’s difficult to set a threshold for how big is big enough.  As my colleague pointed out in a subsequent e-mail – and as I’ve written myself, in the past – there are many reasons to take supposedly “small” effects seriously.  Psychological phenomena are determined by many variables, and to isolate one that has an effect on an interesting outcome is a real achievement, even though in particular instances it might be overwhelmed by other variables with opposite influences.  Rosenthal and Rubin (1982) demonstrated how a .30 correlation was enough to be right about two times out of three.  Ahadi and Diener (1989) showed that if just a few factors affect a common outcome, the maximum size of the effect of any one of them is severely constrained.  In a related vein, Abelson (1985) calculated how very small effect sizes – in particular, the relationship between batting average and performance in a single at-bat – can cumulate fairly quickly into large differences in outcomes (or ballplayer salaries).  So far be it from me to imply that a “small” effect, by any arbitrary standard, is unimportant.
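
For readers who want the flavor of the Ahadi-Diener and Abelson arguments, here is a deliberately simplified sketch (my own toy version with invented numbers, not a reproduction of either paper’s analysis):

```python
import numpy as np
rng = np.random.default_rng(0)

# Ahadi & Diener (1989), simplified: if an outcome is the sum of k equally important,
# independent causes, each cause can correlate at most 1/sqrt(k) with it.
for k in (1, 2, 4, 9):
    print(f"{k} equal determinants -> maximum r for any one of them = {1 / np.sqrt(k):.2f}")

# Abelson (1985), in spirit: a per-at-bat difference that explains almost no variance
# still shows up reliably over a 500-at-bat season.
n_at_bats, n_seasons = 500, 10_000
avg_270 = rng.binomial(n_at_bats, 0.270, size=n_seasons) / n_at_bats
avg_300 = rng.binomial(n_at_bats, 0.300, size=n_seasons) / n_at_bats
print(f"P(.300 hitter out-hits .270 hitter over a season) = {(avg_300 > avg_270).mean():.2f}")
```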

Now we are getting near the crux of the matter.  Arbitrary standards – whether the .05 p-level threshold or some kind of minimum credible effect size – are paving stones on the road to ruin.  Personality psychologists routinely calculate and report their effect sizes, and as a result have developed a pretty good sense of what these numbers mean and how to interpret them.  Social psychologists, to this day, still don’t pay much attention to effect sizes, and so haven’t developed a comparable base of experience for evaluating them.  This is why my colleague Dan Ozer and I were able to make a splash as young researchers simply by pointing out that, for example, the effect size of the victim’s distance on obedience in the Milgram study was in the .30s (Funder & Ozer, 1983).  The calculation was easy, even obvious, but apparently nobody had done it before.  A meta-analysis by Richard et al. (2003) found that the average effect size of published research in experimental social psychology is r = .21.  This finding remains unknown, and probably would come as a surprise, to many otherwise knowledgeable experimental researchers.
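
The kind of calculation involved really is simple. Here is a sketch of the standard textbook conversions from a test statistic (or Cohen’s d) to r (illustrative numbers only; these are not the values from the 1983 paper):

```python
from math import sqrt

def r_from_t(t, df):
    """Effect size r implied by a t statistic from a two-group comparison."""
    return sqrt(t ** 2 / (t ** 2 + df))

def r_from_d(d):
    """Effect size r implied by Cohen's d, assuming equal group sizes."""
    return d / sqrt(d ** 2 + 4)

print(round(r_from_t(2.5, 60), 2))  # 0.31
print(round(r_from_d(0.5), 2))      # 0.24 -- a "medium" d is a modest-looking r
```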

But this is what happens when the overall attitude is that “effect size doesn’t matter.”  Judgment lacks perspective, and we are unable to separate that which is truly important from that which is so subtle as to be virtually undetectable (and, in some cases, notoriously difficult to replicate).

My conclusion, then, is that effect size is important and the business of science should be to evaluate it, and its moderators, as accurately as possible.  Evaluating effect sizes is and will continue to be difficult, because (among other issues) they may be influenced by extraneous factors, because apparently “small” effects can cumulate into huge consequences over time, and because any given outcome is influenced by many different factors, not just one or even a few.  But the solution to this difficulty is not to regard effect sizes as unimportant, much less to ignore them altogether.  Quite the contrary, the more prominence we give to effect sizes in reporting and thinking about research findings, the better we will get at understanding what we have discovered and how important it really is.

References

Abelson, R. P. (1985). A variance explanation paradox: When a little is a lot. Psychological Bulletin, 97, 129-133.

Ahadi, S., & Diener, E. (1989). Multiple determinants and effect size. Journal of Personality and Social Psychology, 56, 398-406.

Funder, D.C., & Ozer, D.J. (1983). Behavior as a function of the situation. Journal of Personality and Social Psychology, 44, 107-112.

Kenrick, D.T., & Funder, D.C. (1988). Profiting from controversy: Lessons from the person-situation debate. American Psychologist, 43, 23-34.

Nisbett, R.E. (1980). The trait construct in lay and professional psychology. In L. Festinger (Ed.), Retrospections on social psychology (pp. 109-130). New York: Oxford University Press.

Richard, F.D., Bond, C.F., Jr., & Stokes-Zoota, J.J. (2003). One hundred years of social psychology quantitatively described. Review of General Psychology, 7, 331-363.

Roberts, B.W., Kuncel, N.R., Shiner, R., Caspi, A., & Goldberg, L.R. (2007). The power of personality: The comparative validity of personality traits, socioeconomic status, and cognitive ability for predicting important life outcomes. Perspectives on Psychological Science, 2, 313-345.

Rosenthal, R., & Rubin, D.B. (1982). A simple, general-purpose display of magnitude of experimental effect. Journal of Educational Psychology, 74, 166-169.

[i] We served together for several years on a grant review panel, a bonding experience as well as a scientific trial by fire, and I came to admire his incisive intellect and clear judgment.

[ii] I obtained his permission to quote this passage but, understandably, he asked that he not be named in order to avoid being dragged into a public discussion he did not intend to start with a private email.

[iii] See, e.g., Nisbett, 1980, who raised the “personality correlation” to .40 but still said it was too small to matter.  Only 16% of the variance, don’t you know.

18 thoughts on “Does (effect) Size Matter?”

  1. I agree that effect size matters but I completely disagree with the premise that social psychologists don’t care about it. A failure to report effect sizes in experiments has more to do with sloppy statistical practices. I have never had a conversation about effect size not mattering, and I don’t know any social psychologists under 40 who would claim that personality doesn’t matter.

  2. Good point. I don’t really know how many people — social psychologists or otherwise — think that effect sizes aren’t important and it’s an interesting possibility that there is a generational divide, as you say.

  3. Great post! I think this articulates what I see as one of the remaining major distinctions between the two sub-disciplines. (People could criticize personality researchers for being too cavalier about internal validity and hand-wavy about causal inference but that is something for another day!).

    A few other observations:

    1. My impression is that effect sizes are sporadically reported in many journals but not widely interpreted. Researchers will report some sort of d-metric effect size estimate or some eta-squared estimate but explicit interpretation of these coefficients in the context of the study or the existing literature is another matter. We (Kashy, Donnellan, Ackerman, & Russell, 2009) evaluated all PSPB papers published in a 6 month interval in 2007. There was decent reporting of some effect size measures but we noted this: “Very few authors explicitly discussed the magnitude of the effects they observed” (p. 1134). I bet this still holds true in 2013.

    Worse, it seems as if many effect size estimates are treated with little to no skepticism if/when they are reported. A d of 1.0 or .80 is pretty substantial compared to most “sturdy” effects in the field. So when this kind of estimate is attached to, say, a two-group experiment with 15 people per condition, it might raise some concerns. I went back to some of the fraudulent papers by Stapel and Sanna and computed effect sizes. Many were very large (mostly because these frauds did not think they needed to fake data from a large number of participants – something that is pretty damning when you think about it!). The conclusion that the effect sizes were implausibly large is a theme in the Stapel report. This suggests problems with how seriously effect sizes are treated by the research community consuming these papers.

    2. I also question whether effect sizes are taken seriously when one considers the sample sizes used in some of the high-profile studies published in top social journals and places like Psychological Science. They are just too darn small to provide much precision for the effect size estimate (see the sketch at the end of this comment). The CIs often range from huge effects to something just weak enough to register as signal from noise. [If researchers did a power analysis in advance, you would have to assume that they expected to find very large effects.]

    So hopefully your post gets researchers of all stripes interested enough in effect sizes to: 1) report and interpret effect size estimates consistently; 2) want to estimate effect sizes with precision and thus use larger samples; and 3) treat some estimates with skepticism because they seem implausibly large and usually also estimated with poor precision.
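
    To put rough numbers on the precision point, here is a sketch using the standard large-sample approximation for the standard error of Cohen’s d (hypothetical ns and ds, not a re-analysis of any particular study):

```python
from math import sqrt

def d_ci95(d, n1, n2):
    """Approximate 95% CI for Cohen's d (large-sample standard error)."""
    se = sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d - 1.96 * se, d + 1.96 * se

print(d_ci95(0.80, 15, 15))    # roughly (0.06, 1.54): anything from trivial to enormous
print(d_ci95(0.80, 150, 150))  # roughly (0.56, 1.04): now the estimate pins something down
```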

  4. Pingback: Does (effect) Size Matter? | pigee

  5. Nice post. Both you and your social psychologist respondent are correct. In my view, one of the biggest (if not the biggest) differences between personality and social psychologists is that personality psychologists are much more frequently attempting to quantify the “true” value (e.g., exactly how intelligent is John?; what is the exact relationship between intelligence and outcome X) than are social psychologists, who are more frequently interested in testing whether a hypothesized relationship between two variables exists at all. We know people vary in intelligence and that intelligence predicts some behavior, and we want to know the exact numbers. We don’t know if people will be more easily persuaded when they are in a positive mood than not. The interest in effect sizes follows the nature of the questions. When social psychologists care about true values (e.g., do attitudes predict behavior? answer = .30 or so), they also care about effect sizes.

    As another example, measuring personality traits and measuring attitudes is very much the same task. But the goals of the people studying the two topics are usually different. The personality psychologist wants to develop the best measure of the trait possible that will provide the “true” level of the trait. More often than not, attitudes researchers are trying to understand the situational (and sometimes personal) variables and cognitive processes that influence the expressed extremity of the attitude. They don’t care if the absolute level of the attitude measured is “correct.” They only care that they can push it around in theoretically predicted ways. Both groups of researchers could learn a lot from one another.

    It’s difficult to say that even small effect sizes matter (and they do!) and then condemn social psychologists for not being particularly interested in the exact size of their effects when they are doing the sort of theoretical work described by your prominent social psychologist. It’s also much more difficult to build robust theoretical models about the relationships among variables if you don’t know anything about effect sizes. I see no reason to devalue either NHST (yes, NHST) or effect size estimation. Both statistics provide useful information that can help us understand our data.

    • A late response to your interesting post: You are correct that “both statistics [NHST and effect size] provide useful information,” because both statistics provide the SAME information. Given any two of a study’s N, the p-level of its NHST finding, and its effect size, you can compute the third. It’s just arithmetic. Thus, when an experimental researcher reports a critical value of p (e.g., a finding is reported because it was less than .05) he/she is also, implicitly, automatically, and unavoidably, going on record as to the minimum effect size he/she is willing to accept as showing that the finding “exists.” The reason: if the effect size had been smaller, the researcher would not have reported it.
      A surprising implication of this logic — if one maintains that the goal of research is to show whether an effect “exists,” not to estimate its size — is that further studies that find smaller effects than would have been required to attain significance in the original study must be deemed to show that the effect does NOT exist, since they are too small to have been reported in the original context. Which goes to the heart, of course, of the whole problem with the simplistic “exists/doesn’t exist” dichotomy.
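
      To make the arithmetic concrete, a small numerical sketch (my own illustration, assuming a simple Pearson correlation and a two-tailed test; min_r_for_p is a hypothetical helper):

```python
from math import sqrt
from scipy import stats

def min_r_for_p(p, n):
    """Smallest |r| that reaches two-tailed significance at level p in a sample of n."""
    t = stats.t.isf(p / 2, df=n - 2)
    return t / sqrt(t ** 2 + n - 2)

print(round(min_r_for_p(0.05, 40), 2))   # ~0.31: at N = 40, "p < .05" implicitly means r >= ~.31
print(round(min_r_for_p(0.05, 400), 2))  # ~0.10: at N = 400, much smaller effects clear the bar
```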

      • David, of course, NHST will provide an effect size. The question I was responding to was which piece of information should form the basis for judgments of publication. Do you need a certain p-value, do you need a certain effect size, neither, both?

      • Right, except that you can’t decide “which” criterion to use for publication because one implies the other. Effect size is a function of the p-level and the N. This is true whether you actually compute the effect size or not.

      • But the point is that it is the p-value that determines publication, not the effect size that happens to coincide with the p-value. In current usage, we use the p-value, regardless of effect size or sample size (for the most part). So, you are not using the effect size to determine publication. Alternatively, we could make the effect size the criterion and then also report the p-value that happens to coincide with that effect size.

    • “In current usage, we use the p-value, regardless of effect size or sample size (for the most part).” Precisely. And thus you have put your finger on the problem. Those other two numbers exist, even if they are ignored.

      • The use of any of these criteria creates problems. And that’s the problem. Changing which number(s) to use will just change which people are unhappy with the criteria.

  6. I don’t think the division is between personality and social psychology so much as between research topics that use quantifiable/behavioral vs. arbitrary outcomes. It is easier to judge when an effect on lifetime income or free throw percentage is trivial than when an effect on a seven-point self-report emotion scale is. Perhaps we need a language of examples so that we can better talk about the size of theoretical effects, and confidence intervals around them.

  7. Brent and Roger have touched upon this already, but I think it’s REALLY important to look at confidence intervals. Point estimates can be extremely misleading when power is low, and I get frustrated sometimes that people get to ‘claim’ effect sizes that they didn’t earn. A d of .80 is very different with a +/- .75 95% confidence interval than with a +/- .10 one. We shouldn’t give the same credit for the effect size in the two cases.

    The focus on point estimates is what has led to the common misconception that large samples are ‘bad’ because they allow you to detect trivial effects. The implication is that if you keep your sample size small, you will only detect large (and therefore meaningful) effects. But of course when you consider the huge confidence intervals around those effect size estimates, you realize that many of those ‘large’ effects may not be that large at all. Maybe everyone should have to report not only their effect size point estimate, but the bottom end of their, say, 50% CI (a sketch of this appears at the end of this comment). This would give the reader the “best guess” (point estimate) and the value that we can be 75% sure the true effect size is at least as big as (bottom of the 50% CI).

    Of course this is a little idealistic. Step 1: Convince people that effect sizes are important. Step 2: Get people to calculate and report effect sizes. Step 3: Get people to calculate and report confidence intervals.
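
    A rough sketch of what reporting the bottom of a 50% CI could look like (hypothetical numbers, using a standard large-sample approximation to the standard error of d; d_lower_bound is an illustrative helper):

```python
from math import sqrt
from scipy.stats import norm

def d_lower_bound(d, n1, n2, coverage=0.50):
    """Bottom of a central CI for Cohen's d (large-sample standard error).
    With coverage=.50 this is the value the true effect exceeds with ~75% confidence."""
    se = sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d - norm.ppf(0.5 + coverage / 2) * se

print(round(d_lower_bound(0.80, 15, 15), 2))    # ~0.54: "probably at least medium-sized"
print(round(d_lower_bound(0.80, 150, 150), 2))  # ~0.72: a much stronger claim
```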

  8. Pingback: Why big effects are more important than small effects | Fifth Estate

  9. “When small effects are impressive” – Psychological Bulletin, 1992.
    “How hard is hard science, how soft is soft science” – American Psychologist, 1987.
    “The relationship of validity coefficients to the practical effectiveness of tests in selection: Discussion and tables” Journal of Applied Psychology, 1939.

    Has psychologists’ understanding of how to evaluate effect sizes made no further progress during the last 20 years, or has it even declined?

  10. Pingback: I don’t care about effect sizes — I only care about the direction of the results when I conduct my experiments | The Trait-State Continuum
