NSF Gets an Earful about Replication

I spent last Thursday and Friday (February 20 and 21) at an NSF workshop concerning the replicability of research results. It was chaired by John Cacioppo and included about 30 participants, among them such well-known contributors to the discussion as Brian Nosek, Hal Pashler, Eric Eich, and Tony Greenwald, to name a few.  Participants also included officials from NIH, NSF, the White House Office of Science and Technology Policy, and at least one private foundation. I was invited, I presume, in my capacity as Past-President of SPSP and chair of an SPSP task force on research practices, which recently published a report on non-retracted PSPB articles by investigators who retracted articles elsewhere, as well as a set of recommendations for research and educational practice that was just published in PSPR.

Committees, task forces and workshops – whatever you call them – about replicability issues have become almost commonplace.  The SPSP Task Force was preceded by a meeting and report sponsored by the European Association of Personality Psychology, and other efforts have been led by APS, the Psychonomic Society and other organizations.  Two symposia on the subject were held at the SPSP meeting in Austin just the week before.  But this discussion was perhaps special, because it was the first (to my knowledge) to be sponsored by the US government, with the explicit purpose of seeking advice about what NSF and other research agencies should do.  I think it is fair to say: When replication is discussed in a meeting with representatives from NIH, NSF and the White House, the issue is on the front burner!

The discussion covered several themes, some of which are more familiar than others. From my scribbled notes, I can list a few that received particular attention.

1.      It’s not just – or even especially – about psychology.  I was heartened to see that the government representatives saw the bulk of problems with replication as lying in fields such as molecular biology, genetics, and medicine, not in psychology.  Psychology has problems too, but is widely viewed as the best place to look for solutions since the basic issues all involve human behavior.  It makes me a bit crazy when psychologists say (or sometimes shout) that everything is fine, that critics of research practices are “witch hunting,” or that examining the degree to which our science is replicable is self-defeating.  Quite the reverse: psychology is being looked to as the source of the expertise that can improve all of science.  As a psychologist, I’m proud of this.

2.      Preregistration.  Widespread enthusiasm early on Thursday for the idea of pre-registering hypotheses waned by Friday afternoon.  In his concluding remarks, Tony Greenwald listed it among suggestions that he found “doubtful,” in part because it could increase pressure on investigators to produce the results they promised, rather than be open to what the data are really trying to tell them.  Greg Francis also observed that pre-registration buys into the idea that the aim of science is simply to dichotomously confirm or disconfirm findings, an idea that is the source of many of our problems to begin with.  In a pithy phrase, he said we should “measure effects, not test them.”

3.      The tradeoff between “ground-breaking,” innovative studies and solid science.  In a final editorial, the previous editor of Psychological Science described common reasons for “triaging” (rejecting without review) submitted articles.  It seems he and his associate editors categorized these reasons with musical references, which they found amusing.  One of the reasons was (and I quote) the “Pink Floyd rejection: Most triaged papers were of this type; they reported work that was well done and useful, but not sufficiently groundbreaking. So the findings represented just another brick in the wall of science” (emphasis in the original).  I thought this label was appalling even before I saw Greg Francis’s article, now in press at Psychonomic Bulletin and Review, which concludes that, of the articles published during this editor’s term that Francis reviewed, 82% showed signs “that unsuccessful findings were suppressed, the experiments or analyses were improper, or that the theory does not properly account for the data.” Well, they weren’t bricks in the wall; that’s for sure!

It seems clear that grant panels and journal editors have traditionally overvalued flashy findings, especially counter-intuitive ones.  This particular editor is not alone in his value system. But “counter intuitive” means prima facie implausible, and such findings should demand stronger evidence than the small N’s and sketchily-described methods that so often seem to be their basis.  More broadly, when everybody is trying to be ground-breaking at the same time, who will produce the scientific “bricks” on which knowledge can securely rest?  Given this state of affairs, it is not surprising that so many findings fail to hold up under scrutiny, and that the cutest findings are, as a rule, the most vulnerable.

But while everybody around the table wistfully agreed that solid research and replication deserved more respect than it typically receives, a Stanford Dean who was there pointed out the obvious (and I paraphrase): “Look: nobody is going to get tenure at Stanford [or, I would add, from any other ambitious research university] from replicating other people’s findings.”

Science needs to move forward or die; science needs a strong base of reliable fact from which to proceed.  The tension between these two needs is not an easy one to resolve.

4.      Do we look forward or look back?  I was glad Greg Francis was at the table because his recent article puts this issue front and center.  He named names!  Out there in the twitterverse, I’ve seen complaints that this does more harm than good.  It makes people defensive, it probably tars some people unfairly (e.g., a grad student co-author on an article p-hacked by her advisor), and in general naming names causes an uproar of a sort not exactly conducive to sober scientific discussion.  Hal Pashler ran into exactly the same issue a couple of years ago when he and a student uncovered – and named – studies reporting implausibly large “voodoo correlations” in fMRI research.

It is uncomfortable to accuse someone of having published inaccurate, unreliable or possibly even fraudulent research.  It is even more uncomfortable, I am sure, to be the person accused.  It is probably for this reason that we hear so many calls to “look forward, not back.”  This was stated explicitly in both the EAPP report and the SPSP report, to name just two, and such comments were heard at the NSF meeting as well:  Let’s clean up our act and do right from now on.  Let’s not go back and make people uncomfortable and defensive.

I understand this view and have even on occasion expressed it.  But it does worry me.  If we think the findings enshrined in our journals are important – and presumably we do; it’s the output of the whole scientific enterprise in which we are engaged – then don’t we also have an obligation to clean out the findings that turn out to be unreliable or just plain wrong?  And to do this, don’t we have to be specific?  It’s hard to avoid concluding that the answer is “yes.”  But who will do it?  Do we want people to build careers looking for p-hacked articles in the published literature?  Do we want to damage the reputations and possibly careers of people who were just doing “business as usual” the way they were taught?  Do you want to do this?  I know I don’t.

5.      What can the government do?  Addressing this question was the whole ostensible reason for the conference.  Suggestions fell into three broad categories:

a.      Reform grant review practices.  NSF reviewers are, at present, explicitly required to address whether a given grant proposal is “potentially transformative.”  They are not explicitly required to address whether the findings on which the research is based can be deemed reliable, or whether the proposed research includes any safeguards to ensure the replicability of its findings.  Maybe they should.

b.      Fund replications.  Before the meeting, one colleague told me that I should inform NSF that “if you pay us to do it, we’ll do it.”  In other words, funds should be available for replication research.  Maybe an institute should be set up to replicate key findings, or grant proposals to perform a program of important replications should receive a sympathetic hearing, or maybe ordinary grants should include a replication study or two.  None of this happens now.

c.      Fund research on research.  I was not expecting this theme to emerge as strongly as it did.  It seems NSF may be interested in ways to support “meta-science,” such as the development of techniques to detect fraudulent data, or studies of the effect of various incentive structures on scientific behavior.  Psychology includes a lot of experts on statistical methodology as well as on human behavior; there seems to be an opportunity here.

6.      Unintended consequences.  Our meeting chair was concerned about this issue, and rightly so.  He began the meeting by recalling the good intentions that underlay the initiation of IRB’s to protect human subjects, and the monsters of mission creep, irrelevant regulation, and bureaucratic nonsense so many have become.  (These are my words, not his.)  Some proposed reforms entail the danger of going down the same road.  Even worse:  as we enshrine confidence intervals over p-values, or preregistration over p-hacking, or thresholds of permissible statistical power or even just N, do we risk replacing old mindless rituals with new ones?

7.      Backlash and resistance.  This issue came up only a couple of times and I wish it had gotten more attention.  It seemed like nobody at the table (a) denied there was a replicability problem in much of the most prominent research in the major journals or (b) denied that something needed to be done.  As one participant said, “we are all drinking the same bath water.”  (I thought this phrase usually referenced Kool-Aid, but never mind.)  Possibly, nobody spoke up because of the apparent consensus in the room.  In any case, there will be resistance out there.  And we need to watch out for it.

I expect the resistance to be of the passive-aggressive sort.  Indeed, that’s what we have already seen.  The sudden awarding of APA’s major scientific award to someone in the midst of a replication controversy seems like an obvious thumb in the eye to the reform movement.  To underline the point, a prominent figure in social psychology (he actually appears in TV commercials!) tweeted that critics of the replicability of the research would never win such a major award.  Sadly, he was almost certainly correct.

One of Geoff Cumming’s graduate students, Fiona Fidler, recently wrote a thesis on the history of null hypothesis significance testing.  It’s a fascinating read, and I hope it will be turned into a book soon. One of its major themes is that NHST has been criticized thoroughly and compellingly many times over the years.  Yet it persists, even though – and, ironically, perhaps because – it has never really been explicitly defended!  Instead, the defense of NHST is largely passive.  People just keep using it.  Reviewers and editors just keep publishing it; granting agencies keep giving money to researchers who use it.  Eventually the critiques die down.  Nothing changes.

That could happen this time too.  The defenders of the status quo rarely actively defend anything. They aren’t about to publish articles explaining why NHST tells you everything you need to know, or arguing that effect sizes of r = .80 in studies with an N of 20 represent important and reliable breakthroughs, or least of all reporting data to show that major counter-intuitive findings are robustly replicable.  Instead they will just continue to publish each other’s work in all the “best” places, hire each other into excellent jobs and, of course, give each other awards.  This is what has happened every time before.

Things just might be different this time.  Doubts about statistical standard operating procedure and the replicability of major findings are rampant across multiple fields of study, not just psychology.  And these issues now have the attention of major scientific societies and even the US government.  But the strength of the resistance should not be underestimated.

This meeting was very recent and I am still mulling over my own reactions to these and other issues.  This post is, at best, a rough first draft.  I welcome your reactions, in the comments section or elsewhere in the blogosphere/twitterverse.

Why I Decline to do Peer Reviews (part one): Re-reviews

Like pretty much everyone fortunate enough to occupy a faculty position in psychology at a research university, I am frequently asked to review articles submitted for publication to scientific journals. Editors rely heavily on these reviews in making their accept/reject decisions. I know: I’ve been an editor myself, and I experienced first-hand the frustrations in trying to persuade qualified reviewers to help me assess the articles that flowed over my desk in seemingly ever-increasing numbers. So don’t get me wrong: I often do agree to do reviews – around 25 times a year, which is probably neither much above nor below the average for psychologists at my career stage. But sometimes I simply refuse, and let me explain one reason why.

The routine process of peer review is that the editor reads a submitted article, selects 2 or 3 individuals thought to have reasonable expertise in the topic, and asks them for reviews. After some delays due to reviewers’ competing obligations, trips out of town, personal emergencies or – the editor’s true bane – lengthy failures to respond at all, the requisite number of reviews eventually arrive. In a very few cases, the editor reads the reviews, reads the article, and accepts it for publication. In rather more cases, the editor rejects the article. The authors of the remaining articles get a letter inviting them to “revise and resubmit.” In such cases, in theory at least, the reviewers and/or the editor see a promising contribution. Perhaps a different, more informative statistic could be calculated, an omitted relevant article cited, or a theoretical derivation explained more clearly. But the preliminary decision clearly is – or should be – that the research is worth publishing; it could just be reported a bit better.

What happens then? What should happen, in my opinion, is that the author(s) complete their revision, the editor reads it, perhaps refreshing his or her memory by rereading the reviewers’ comments, and then makes a final accept/reject decision. After all, the editor is – presumably and hopefully – at least somewhat cognizant of the topic area and perhaps truly expert. The reviewers were selected for their specific expertise and have already commented on the article, sometimes in great detail. Armed with that, the editor should not find it too difficult to make a final decision.

Too often, this is not what happens. Instead, the editor sends the revised paper out for further review! Usually, this entails sending it to the same individuals who reviewed the article once already. Sometimes – in a surprisingly common practice that every author deeply loathes – the editor also sends the revised article to new reviewers. Everyone then weighs in, looking to make sure their favorite comments were addressed, and making further comments for further revision. The editor reads the reviews and perhaps now makes a final decision, but sometimes not. Yes, the author may be asked to revise and resubmit yet again – and while going back to the old reviewers a third time (plus yet new reviewers) is less likely, it is far from unheard of.

What is the result of this process? An editor who acts this way would no doubt say that the article is getting better, and that the journal is therefore publishing better science. Perhaps. But there are a few other results:

  1. The editor has effectively dodged much of the responsibility for his/her editorial decision. Some editors add up the reviewers’ verdicts as if they were votes; some insist on unanimous positive verdicts from long lists of reviewers; in every case the editor can and often does point to the reviewers – rather than to him or herself – as the source of negative comments or outcomes.
  2. The review process has been extended to be epically long. It is, sadly, not especially unusual for this process of reiteration to take a year or more. Any review process shorter than several months is considered lightning-fast for most journals in psychology.
  3. The reviewers have been given the opportunity to micro-manage the paper. They can and often do demand that new references be inserted (sometimes articles written by the reviewers themselves), theoretical claims be toned down (generally ones the reviewers disagree with), and different statistics be calculated. Reviewers may even insist that whole sections be inserted, removed, or completely rewritten.
  4. (As a result of point 3): The author is driven to, in a phrase we have all heard, “make the reviewers happy.” In an attempt to be published, the author will (a) insert references he/she does not actually think are germane, (b) make theoretical statements different from what he/she actually believes to be correct and (c) take out sections he or she thought were important, add sections he or she thinks are actually irrelevant, and rephrase discussions using another person’s words. The author’s name still goes on the paper, but the reviewers have become, in effect, co-authors. In a final bit of humiliating obsequiousness, the “anonymous reviewers” may be thanked in a footnote. This expressed gratitude is not always 100% sincere.

These consequences are all bad, but the worst is number 4. A central obligation of every scientist – of every scholar in every field, actually – is to say what one really thinks. (If tenure has a justification, this is it.) And yet the quest to “make the reviewers happy” leads too many authors to say things they don’t completely believe. At best, they are phrasing things differently than they would prefer, or citing a few articles that they don’t really regard as relevant. At worst, they are distorting their article into an incoherent mish-mash co-written by a committee of anonymous reviewers — none of whom came up with the original idea for the research, conducted the study, or is held accountable for whether the article finally published is right or wrong.

So that’s why, on the little box at the bottom of the peer review sheet that asks, “Would you be willing to review a revision of this article?” I check “no.” Please, editors: Evaluate the article that was submitted. If it needs a few minor tweaks, give the author a chance to make them. If it needs more than that, reject it. But don’t drag out the review process to the end of time, and don’t let a panel of reviewers – no matter how brilliant – co-author the article. They should write their own.

Don’t blame Milgram

I’m motivated to write this post because of a new book that, according to an NPR interview with its author, attacks the late Stanley Milgram for having misled us about the human propensity to obey.  He overstated his case, she claims, and also conducted unethical research.

The Milgram obedience studies of the 1960’s are probably the most famous research in the history of social psychology.  As the reader almost certainly knows, subjects were ordered to give apparently harmful – perhaps even fatal – electric shocks to an innocent victim (who was, fortunately, an unharmed research assistant).  The studies found that a surprising number of ordinary people followed orders to the hilt.

Accounts of these studies in textbooks and in popular writings usually make one of two points, and often both.  (1)  Milgram showed that anybody, or almost anybody, would obey orders to harm an innocent victim if the orders came from someone in an apparent position of authority.  (2) Milgram showed that the “power of the situation” overwhelms the “power of the person”; the experimenter’s orders were so strong that they overwhelmed personal dispositions and individual differences.  Both of these points are, indeed, dead wrong.  But their promulgation is not Milgram’s fault.

Consider each point, and what Milgram said (or didn’t say) about them.

1. Anybody, or almost anybody, would obey orders to harm an innocent victim.

Why this is wrong.  Because empirically it is wrong.  Milgram ran many variations on his basic procedure and, to his credit, reported the data in full in his 1974 book.  (Not all social psychologists are so forthcoming.)  Across 18 experimental conditions, compliance ranged from 93% (when the participant did not have to administer shocks personally) to 0% (when two authorities gave contradictory orders, when the experimenter was the victim, and when the victim demanded to be shocked). In the two most famous conditions, when the experimenter was present and the victim could be heard but not seen, the obedience rates were 63% (at Yale) and 48% (when the setting was ostensibly “Research Associates of Bridgeport”). Across all conditions the average rate of compliance was 37.5% (Milgram 1974, Tables 2, 3, 4 and 5).  It is certainly reasonable to argue that this rate is surprising, and high enough to be troubling.  But roughly 40% is far from everybody, or almost everybody.  Disobedience, even in the Milgram study, was a common occurrence.

Why the mistake is not Milgram’s fault.  Perhaps Milgram said some things that made people overestimate the rate of obedience he showed; I don’t know and I haven’t gone back to his original writings to check.  However, I doubt that the recent criticism that he misleadingly made people think that “anybody could be a Nazi” is fair, for a couple of reasons.  One reason is that he very clearly laid out the data from all of his experimental conditions in his definitive book, which allowed the calculations that are summarized above (and taken from Krueger & Funder, 2004, footnote 1).  Milgram hid nothing.

The second reason I don’t blame Milgram is that I had the opportunity to see him in person, just once, in about 1980.  (He was giving a talk at the Claremont Graduate School when I was a new faculty member at Harvey Mudd College.)  Milgram noted that his own famous movie about his research – a black & white classic still shown in many introductory psychology classes – begins with a subject who disobeys the experimenter. Milgram said he did that on purpose.  He feared that the message of his research would be taken to be that disobedience is impossible.  He wanted to counter that at the outset, he said, by showing just how it’s done:  Keep saying no.

It’s a reality-film classic, from before the genre even existed.  You see the balding, middle-aged white guy subject, wearing an office-worker’s shirt with rolled-up sleeves, become increasingly disturbed as the victim’s complaints escalate.  When he resists continuing to administer shocks, the experimenter says “you have no other choice, teacher, you must continue.”  It is a truly thrilling cinematic moment when the subject crosses his arms, leans back, and replies, “oh, I have a lot of choice.”

2.  Milgram’s study shows that the power of the situation overwhelms the power of the person, personality, or individual differences.

Why this is wrong (a):  First, the statement is empirically wrong because of the data summarized above.  Years ago, Lee Ross (1977) wrote about the complications in separating out “situational” from “dispositional” (or personal) causation.  He pointed out that to say “he ate it because it was chocolate-coated” sounds like a situational cause, but is precisely equivalent to saying “he ate it because he can’t resist chocolate,” which sounds like a dispositional cause.  The way out of this dilemma, Ross pointed out – in a resolution that has been widely accepted ever since – is that situational causation can be attributed only when everybody or almost everybody in a situation does the same thing.  Dispositional causation is indicated when people differ in their responses to the same situation.  In other words, if a response is made by 0% or 100% of the people in a situation (or close to these numbers), then you can fairly say the situation was the cause.  As this number gets closer to 50%, you have to attribute some causal power to personal, individual differences.  Recall again the overall obedience rate across all the conditions of the Milgram studies: 37.5%.  Even in the most famous, Yale/victim-in-the-next-room condition, the obedience rate of 63% is much closer to 50 than to 100.
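
One way to make the “closer to 50%” criterion concrete – my own illustration, not a calculation from Ross or Milgram – is to note that the variance of a dichotomous obey/disobey outcome is p(1 – p): zero when everyone in a condition acts alike, and largest when the split is 50/50, which is exactly where individual differences have the most to explain. A minimal sketch, plugging in the obedience rates quoted above:

```python
# Variance of a dichotomous outcome (obey = 1, disobey = 0) is p * (1 - p).
# It is 0 when a condition produces unanimous behavior -- the pattern Ross
# treats as purely "situational" causation -- and maximal (0.25) at p = .50,
# where individual differences have the most room to matter.

obedience_rates = {
    "no personal administration of shocks": 0.93,
    "Yale, victim heard but not seen": 0.63,
    "Bridgeport, victim heard but not seen": 0.48,
    "average across all 18 conditions": 0.375,
    "contradictory orders from two authorities": 0.00,
}

for condition, p in obedience_rates.items():
    variance = p * (1 - p)
    print(f"{condition:45s} obedience = {p:5.1%}   p(1-p) = {variance:.3f}")
```

By this yardstick, the best-known Milgram conditions sit near the maximum of behavioral variability, not near the unanimity that a purely situational account would require.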

Why this is wrong (b):  The claim that Milgram showed the situation is more powerful than dispositions was incoherent to begin with.  His study included not one, but two situational forces: (1) the experimenter saying “continue” and (2) the victim saying “stop.”  It also included two dispositional forces: (1) an individual’s disposition to obey and (2) the individual’s disposition to be compassionate, kind, and therefore disobedient.  In the cases – fewer than 100% – where subjects obey to the hilt, four explanations are therefore conceivable (see Funder & Fast, 2010):

  1. The situational force to obey was stronger than the situational force to be compassionate.
  2. The dispositional force to obey was stronger than the dispositional force to be compassionate.
  3. The situational force to obey was stronger than the dispositional force to be compassionate.
  4. The dispositional force to obey was stronger than the situational force to be compassionate.

Explanations (1) and (2) both make sense, and in fact – as Ross would have noted – are exactly equivalent in meaning (and equally disturbing).  Explanation 4 is heresy.  It would use the Milgram study as a demonstration of how dispositions overwhelm situations!  But explanation 3 is equivalently incoherent, and it is the conventional one found in almost every social psychology textbook.

Why the mistake is not Milgram’s fault.  Milgram himself noted the interindividual variation in his subjects’ responses, and said that it was important to find out its basis, though he didn’t succeed in finding it (for that, see recent research by David Gallardo-Pujol and his colleagues, reference below).  His movie also includes a graphic representation of what was listed above as explanation (1).  He described the competing demands of the experimenter and the victim as “fields of force,” noting that his experimental manipulations showed that as you got closer to the experimenter you were more likely to respond to his demands to obey, and as you got closer to the victim, you were more likely to respond to his demands to break off.  (The picture below is from the title slide, but the diagram was used later in the film to illustrate competing pressures from two situational forces.)

[Image: title slide from Milgram’s film, “Obedience”]

So don’t blame Milgram.  He was one of the most creative social psychologists in history, and his research program on obedience and other topics continues to be instructive (see Blass, 2004 for a fascinating personal history and summary of his work).

References

Gallardo-Pujol, D., Orekhova, L., & Benet-Martínez, V. (in preparation). Under pressure to obey and conform: When the power of the situation is not enough. University of Barcelona.

Ross, L. (1977). The intuitive psychologist and his shortcomings: Distortions in the attribution process. In L. Berkowitz (Ed.), Advances in experimental social psychology (vol. 10). New York: Academic Press.

Update 9/8/13

Some useful additional information concerning personality correlates of obedience, courtesy of Jonathan Cheek (thanks, Jon):

“Personality psychologists may appreciate the original study Elms, A. C., & Milgram, S. (1966). Personality characteristics associated with obedience and defiance toward authoritative command. Journal of Experimental Research in Personality, 1, 282-289, which is more accessibly summarized in Elms, A. C. (2009). Obedience lite. American Psychologist, 64, 32-36.”

One example of how to do it

In a previous post, I wrote about the contentious atmosphere that so often surrounds replication studies, and fantasized about a world in which one might occasionally see replication researchers and the original authors come together in “a joint effort to share methods, look at data together, and come to a collaborative understanding of an important scientific issue.”  Happily, one example that comes close to this ideal has recently been accepted for publication in Psychological Science — the same journal that published the original paper.  The authors of both the original and replication studies appear to have worked together to share information about procedures and analyses, which, while perhaps not a full collaboration, is at least cooperation of a sort that’s seen too rarely.  The result was that the original, intriguing finding did not replicate; two large new studies obtained non-significant findings in the wrong direction.  The hypothesis that anxiously attached people might prefer warm foods when their attachment concerns are activated was provocative, to say the least.  But it seems to have been wrong.

With this example now out there, I hope others follow the same path towards helping the scientific literature perform the self-correcting process that, in principle, is its principal distinctive advantage.  I also hope that, one of these days, an attempt to independently replicate a provocative finding will actually succeed!  Now that would be an important step forward.

Sanjay Srivastava  and Job van Wolferen have also commented on this replication study.

UPDATE, April 15:  This comes via Eric Eich, the editor of Psychological Science who accepted the paper by LeBel and Campbell, discussed above, that failed to replicate the study by Matthew Vess.  Eich offered Vess the opportunity to publish a rejoinder, and this is what Vess said:

Thank you for the opportunity to submit a rejoinder to LeBel and Campbell’s commentary. I have, however, decided not to submit one. While I am certainly dismayed to see the failed attempts to reproduce a published study of mine, I am in agreement with the journal’s decision to publish the replication studies in a commentary and believe that such decisions will facilitate the advancement of psychological science and the collaborative pursuit of accurate knowledge. LeBel and Campbell provide a fair and reasonable interpretation of what their findings mean for using this paradigm to study attachment and temperature associations, and I appreciated their willingness to consult me in the development of their replication efforts. Once again, thank you for the opportunity.

Hats off to Matthew Vess.  Imagine if everyone whose findings were challenged responded in such a civil and science-promoting manner.  What a wonderful world it would be.

A Replication Initiative from APS

Several of the major research organizations in psychology, including APA, EAPP (European Association of Personality Psychology) and SPSP, have been talking about the issue of replicability of published research, but APS has made the most dramatic move so far to actually do something about it.  The APS journal Perspectives on Psychological Science today announced a new policy to enable the publication of pre-registered, robust studies seeking to replicate important published findings.  The journal will add a new section for this purpose, edited by Dan Simons and Alex Holcombe.  For details, click here.

This idea has been kicked around in other places, including proposals for new journals exclusively dedicated to replication studies.  One of the most interesting aspects of the new initiative is that instead of isolating replications in an independent journal few people might see, they will appear in an already widely-read and prestigious journal with a high impact factor.

When a similar proposal — in the form of a suggested new journal — was floated in a meeting I attended a few weeks ago, it quickly stimulated controversy. Some saw the proposal as a self-defeating attack on our own discipline that would only undermine the credibility of psychological research.  Others saw it as a much-needed, self-administered corrective action; better to come from within the field than be imposed from outside. And still others — probably the largest group — raised, and got a bit bogged down in, questions about the specifics of implementation.  For example, what will stop a researcher from running a failed replication study, and only then “pre-registering” it?  How many failed replications does it take to overturn the conclusions of a published study, and what does “failed replication” mean exactly, anyway?  What degree of statistical power should replication studies be required to have, and what effect size should be used to make this calculation?  Finally, running these replication studies (as described in the PPS policy) looks to be a demanding and expensive enterprise. Who will have sufficient time, money and/or incentive to run them?   These questions all lack ready answers.

My own view is that the answers to these questions — or their ultimate unanswerability — will only be established through experimentation.  Somebody needs to try it and see what happens.  I admire APS for taking this step and am looking forward to seeing what, if anything, ultimately becomes of it.

How High is the Sky? Well, Higher than the Ground

Challenged by some exchanges in my own personal emails and over in Brent Roberts’s “pigee” blog, I’ve found myself thinking more about what is surely the weakest point in my previous post about effect size: I failed to reach a clear conclusion about how “big” an effect has to be to matter. As others have pointed out, it’s not super-coherent to claim, on the one hand, that effect size is important and must always be reported, yet to acknowledge, on the other hand, that under at least some circumstances very “small” effects can matter for practical and/or theoretical purposes.

My attempt to restore coherence has two threads, so far. First, to say that small effect sizes are sometimes important does not mean that they always are. It depends. Is .034 (in terms of r) big enough? It is, if we are talking about aspirin’s effect on heart attacks, because wide prescription can save thousands of lives a year (notice, though, that you need effect size to do this calculation). Probably not, though, for other purposes.

But honestly, I don’t know how small an effect is too small. As I said, it depends. I suspect that if social psychologists, in particular, reported and emphasized their effect sizes more often, over time an experiential base would accrue that would make interpreting them easier. But, in the meantime, maybe there is another way to think about things.

So the second thread of my response is to suggest that perhaps we should focus on the ordinal rather than absolute nature of effect sizes. While we don’t often know exactly how big an effect has to be to matter, in an absolute sense, there are many contexts in which we care which of two things matters **more**. Personality psychologists routinely publish long (and to some people, boring) lists of correlates; such lists draw attention to the personality variables that appear to be more and less related to the outcome of interest, even if the exact numerical values aren’t necessarily all that informative.

Social psychological theorizing is also often phrased in terms of relative effect size, though the actual numbers aren’t always included. The whole point of Ross & Nisbett’s classic book “The Person and the Situation” was that the effects of situational variables are larger than the effects of personality variables, and they drew theoretical implications from that comparison that — read almost any social psychology textbook or the social psych section of any intro textbook — go to the heart of how social psychology is theoretically framed at the most general level. The famous “Fundamental Attribution Error” is explicitly expressed in terms of effect size — situational variables allegedly affect behavior “more” than people think. How do you even talk about that claim without comparing effect sizes? The theme of Susan Fiske’s address at the presidential symposium at the 2012 SPSP was that “small” manipulations can have “large” effects; this is also effect size language expressing a theoretical view. Going back further, when attitude change theorists talked about direct and indirect routes to persuasion, this raised a key theoretical question about the relative influence of the two routes. More recently, Lee Jussim wrote a whole (and excellent) book about the size of expectancy effects, comparing them to the effects of prior experience, valid information, etc., and building a theoretical model from that comparison.

I could go on, but, in short, the relative size of effects matters in social psychological theorizing whether the effects are computed and reported or not. When they aren’t, of course, the theorizing is proceeding in an empirical vacuum that might not even be noticed – and this happens way too often, including in some of the examples I just listed. My point is that effect size comparisons, usually implicit, are ubiquitous in psychological theorizing, so it would probably be better if we remembered to explicitly calculate them, report them, and consider them carefully.

Does (effect) Size Matter?

Personality psychologists wallow in effect size; the ubiquitous correlation coefficient, Pearson’s r, is central to nearly every research finding they report.  As a consequence, discussions of relationships between personality variables and outcomes are routinely framed by assessments of their strength.  For example, a landmark paper reviewed predictors of divorce, mortality, and occupational achievement, and concluded that personality traits have associations with these life outcomes that are as strong as or stronger than traditional predictors such as socio-economic status or cognitive ability (Roberts et al., 2007).  This is just one example of how personality psychologists routinely calculate, care about, and even sometimes worry about the size of the relationships between their theoretical variables and their predicted outcomes.

Social psychologists, not so much.  The typical report in experimental social psychology focuses on the p-level: the probability of obtaining a difference between experimental groups at least as large as the one observed, if the null hypothesis of no difference were true.   If this probability is .05 or less, then: Success!  While effect sizes (usually Cohen’s d or, less often, Pearson’s r) are reported more often than they used to be – probably because the APA Publication Manual explicitly requires it (a requirement not always enforced) – the discussion of the theoretical or even the practical importance of an effect typically centers on whether it exists.  The size simply doesn’t matter.

Is this description an unfair caricature of social psychological research practice?  That’s what I thought until recently.  Even though the typical statistical education of many experimentally-oriented psychologists bypasses extensive discussion of effect size in favor of the ritual of null-hypothesis testing, I assumed that the smarter social psychologists grasped that an important part of scientific understanding involves ascertaining not just whether some relationship between two variables “exists,” but how big that relationship is and how it compares to various benchmarks of theoretical or practical utility.

It turns out I was wrong.  I recently had an email exchange with a prominent social psychologist who I greatly respect. [i] I was shocked, therefore, when he wrote the following[ii]:

 …the key to our research… [is not] to accurately estimate effect size. If I were testing an advertisement for a marketing research firm and wanted to be sure that the cost of the ad would produce enough sales to make it worthwhile, effect size would be crucial. But when I am testing a theory about whether, say, positive mood reduces information processing in comparison with negative mood, I am worried about the direction of the effect, not the size (indeed, I could likely change the size by using a different manipulation of mood, a different set of informational stimuli, a different contextual setting for the research — such as field versus lab). But if the results of such studies consistently produce a direction of effect where positive mood reduces processing in comparison with negative mood, I would not at all worry about whether the effect sizes are the same across studies or not, and I would not worry about the sheer size of the effects across studies. This is true in virtually all research settings in which I am engaged. I am not at all concerned about the effect size (except insofar as very small effects might require larger samples to find clear evidence of the direction of the effect — but this is more of a concern in the design phase, not in interpreting the meaning of the results). In other words, I am yet to develop a theory for which an effect size of r = .5 would support the theory, but an effect size of r = .2 (in the same direction) would fail to support it (if the effect cannot be readily explained by chance). Maybe you have developed such theories, but most of our field has not.

To this comment, I had three reactions.

First, I was startled by the claim that social psychologists don’t and shouldn’t care about effect size. I began my career during the dark days of the Mischelian era, and the crux of Mischel’s critique was that personality traits rarely correlate with outcomes greater than .30. He never denied that the correlations were significant, mind you, just that they weren’t big enough to matter to anybody on either practical or theoretical grounds. Part of the sport was to square this correlation, and state triumphantly (and highly misleadingly) that therefore personality only “explains” “9% of the variance.”  Social psychologists of the era LOVED this critique[iii]! Some still do. Oh, if only one social psychologist had leapt to personality psychology’s defense in those days, and pointed out that effect size doesn’t matter as long as we have the right sign on the correlation… we could have saved ourselves a lot of trouble (Kenrick & Funder, 1988).

Second, I am about 75% joking in the previous paragraph, but the 25% that’s serious is that I actually think that Mischel made an important point – not that .30 was a small effect size (it isn’t), but that effect size should  be the name of the game.  To say that an effect “exists” is a remarkably simplistic statement that on close examination means almost nothing.  If you work with census data, for example, EVERYTHING — every comparison between two groups, every correlation between any two variables — is statistically significant at the .000001 level. But the effect sizes are generally teeny-tiny, and of course lots of them don’t make any sense either (perhaps these should be considered “counter-intuitive” results). Should all of these findings be taken seriously?
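
To see the arithmetic behind the census point, here is a minimal sketch (the r = .01 and the sample sizes are hypothetical numbers of my own choosing) using the standard t statistic for a correlation, t = r√(N − 2)/√(1 − r²), with a normal approximation to its two-tailed p-value:

```python
import math

def p_value_for_r(r, n):
    """Approximate two-tailed p for a correlation r based on n observations,
    using t = r * sqrt(n - 2) / sqrt(1 - r**2) and a normal approximation
    to the t distribution (harmless when n is large)."""
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    return math.erfc(abs(t) / math.sqrt(2))

# A teeny-tiny effect in a census-sized sample: "significant" at any level.
print(p_value_for_r(0.01, 1_000_000))   # roughly 1e-23, far below .000001
# The very same effect in a typical lab sample: nowhere near significant.
print(p_value_for_r(0.01, 100))         # roughly .92
```

The effect is identical in both cases; only the sample size, and therefore the p-level, changes – which is exactly why existence claims alone say so little.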

Third, if the answer is no, then we have to decide how big an effect is in fact worth taking seriously. And not just for purposes of marketing campaigns! If, for example, a researcher wants to say something like “priming effects can overwhelm our conscious judgment” (I have read statements like that), then we need to start comparing effect sizes. Or, if we are just going to say that “holding a hot cup of coffee makes you donate more money to charity” (my favorite recent forehead-slapping finding) then the effect size is important for theoretical, not just practical purposes, because a small effect size implies that a sizable minority is giving LESS money to charity, and that’s a theoretical problem, not just a practical one.  More generally, the reason a .5 effect size is more convincing, theoretically, than a .2 effect size is that the theorist can put less effort into explaining why so many participants did the opposite of what the theory predicted.

Still, it’s difficult to set a threshold for how big is big enough.  As my colleague pointed out in a subsequent e-mail – and as I’ve written myself, in the past – there are many reasons to take supposedly “small” effects seriously.  Psychological phenomena are determined by many variables, and to isolate one that has an effect on an interesting outcome is a real achievement, even though in particular instances it might be overwhelmed by other variables with opposite influences.  Rosenthal and Rubin (1982) demonstrated how a .30 correlation was enough to be right about two times out of three.  Ahadi and Diener (1989) showed that if just a few factors affect a common outcome, the maximum size of the effect of any one of them is severely constrained.  In a related vein, Abelson (1985) calculated how very small effect sizes – in particular, the relationship between batting average and performance in a single at-bat – can cumulate fairly quickly into large differences in outcomes (or ballplayer salaries).  So far be it from me to imply that a “small” effect, by any arbitrary standard, is unimportant.
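
For readers who want the arithmetic behind that “right about two times out of three” claim, here is a minimal sketch of Rosenthal and Rubin’s (1982) binomial effect size display (BESD), in which a correlation of r corresponds to success rates of (50 + 50r)% versus (50 − 50r)% in a 2 × 2 table. The function name is mine, and the r values plugged in are the ones mentioned in this post:

```python
def besd(r):
    """Rosenthal & Rubin's (1982) binomial effect size display: a correlation
    of r corresponds to a success rate of (50 + 50r)% in one group versus
    (50 - 50r)% in the other."""
    return 50 + 50 * r, 50 - 50 * r

# Effect sizes discussed in this post: aspirin (.034) and the .20 / .30 / .50
# correlations used as examples above.
for r in (0.034, 0.20, 0.30, 0.50):
    higher, lower = besd(r)
    print(f"r = {r:.3f}  ->  {higher:.1f}% vs. {lower:.1f}%")

# r = .30 yields 65% vs. 35% -- "right about two times out of three."
```

On this display, r = .30 separates the two groups by 30 percentage points (65% vs. 35%), while the .034 aspirin correlation still translates into a difference of about 3.4 percentage points – small in r terms, but consequential when multiplied across millions of patients.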

Now we are getting near the crux of the matter.  Arbitrary standards – whether the .05 p-level threshold or some kind of minimum credible effect size – are paving stones on the road to ruin.  Personality psychologists routinely calculate and report their effect sizes, and as a result have developed a pretty good view of what these numbers mean and how to interpret them.  Social psychologists, to this day, still don’t pay much attention to effect sizes so haven’t developed a base of experience for evaluation. This is why my colleague Dan Ozer and I were able to make a splash as young beginning researchers, simply by pointing out that, for example, the effect size of the distance of the victim on obedience in the Milgram study was in the .30’s (Funder & Ozer, 1983).  The calculation was easy, even obvious, but apparently nobody had done it before.  A meta-analysis by Richard et al. (2003) found that the average effect size of published research in experimental social psychology is r = .21.  This finding remains unknown, and probably would come as a surprise, to many otherwise knowledgeable experimental researchers.

But this is what happens when the overall attitude is that “effect size doesn’t matter.”  Judgment lacks perspective, and we are unable to separate that which is truly important from that which is so subtle as to be virtually undetectable (and, in some cases, notoriously difficult to replicate).

My conclusion, then, is that effect size is important and the business of science should be to evaluate it, and its moderators, as accurately as possible.  Evaluating effect sizes is and will continue to be difficult, because (among other issues) they may be influenced by extraneous factors, because apparently “small” effects can cumulate into huge consequences over time, and because any given outcome is influenced by many different factors, not just one or even a few.  But the solution to this difficulty is not to regard effect sizes as unimportant, much less to ignore them altogether.  Quite the contrary, the more prominence we give to effect sizes in reporting and thinking about research findings, the better we will get at understanding what we have discovered and how important it really is.

References

Abelson, R. P. (1985). A variance explanation paradox: When a little is a lot. Psychological Bulletin, 97, 129-133.

Ahadi, S., & Diener, E. (1989). Multiple determinants and effect size. Journal of Personality and Social Psychology, 56, 398-406.

Funder, D.C., & Ozer, D.J. (1983). Behavior as a function of the situation. Journal of Personality and Social Psychology, 44, 107-112.

Kenrick, D.T., & Funder, D.C. (1988). Profiting from controversy: Lessons from the person-situation debate. American Psychologist, 43, 23-34.

Nisbett, R.E. (1980). The trait construct in lay and professional psychology. In L. Festinger (Ed.), Retrospections on social psychology (pp. 109-130). New York: Oxford University Press.

Richard, F.D., Bond, C.F., Jr., & Stokes-Zoota, J.J. (2003). One hundred years of social psychology quantitatively described. Review of General Psychology, 7, 331-363.

Roberts, B.W., Kuncel, N.R., Shiner, R., Caspi, A., & Goldberg, L.R. (2007). The power of personality: The comparative validity of personality traits, socioeconomic status, and cognitive ability for predicting important life outcomes. Perspectives on Psychological Science, 2, 313-345.

Rosenthal, R., & Rubin, D.B. (1982). A simple, general-purpose display of magnitude of experimental effect. Journal of Educational Psychology, 74, 166-169.

[i] We served together for several years on a grant review panel, a bonding experience as well as a scientific trial by fire, and I came to admire his incisive intellect and clear judgment.

[ii] I obtained his permission to quote this passage but, understandably, he asked that he not be named in order to avoid being dragged into a public discussion he did not intend to start with a private email.

[iii] See, e.g., Nisbett, 1980, who raised the “personality correlation” to .40 but still said it was too small to matter.  Only 16% of the variance, don’t you know.