NSF Gets an Earful about Replication

I spent last Thursday and Friday (February 20 and 21) at an NSF workshop concerning the replicability of research results. It was chaired by John Cacioppo and included about 30 participants, among them such well-known contributors to the discussion as Brian Nosek, Hal Pashler, Eric Eich, and Tony Greenwald.  Participants also included officials from NIH, NSF, the White House Office of Science and Technology Policy, and at least one private foundation. I was invited, I presume, in my capacity as Past-President of SPSP and chair of an SPSP task force on research practices, which recently published a report on non-retracted PSPB articles by investigators who had retracted articles elsewhere, along with a set of recommendations for research and educational practice that was just published in PSPR.

Committees, task forces and workshops – whatever you call them – about replicability issues have become almost commonplace.  The SPSP Task Force was preceded by a meeting and report sponsored by the European Association of Personality Psychology, and other efforts have been led by APS, the Psychonomic Society, and other organizations.  Two symposia on the subject were held at the SPSP meeting in Austin just the week before.  But this discussion was perhaps special, because it was the first (to my knowledge) to be sponsored by the US government, with the explicit purpose of seeking advice about what NSF and other research agencies should do.  I think it is fair to say: When replication is discussed in a meeting with representatives from NIH, NSF and the White House, the issue is on the front burner!

The discussion covered several themes, some of which are more familiar than others. From my scribbled notes, I can list a few that received particular attention.

1.      It’s not just – or even especially – about psychology.  I was heartened to see that the government representatives saw the bulk of problems with replication as lying in fields such as molecular biology, genetics, and medicine, not in psychology.  Psychology has problems too, but is widely viewed as the best place to look for solutions since the basic issues all involve human behavior.  It makes me a bit crazy when psychologists say (or sometimes shout) that everything is fine, that critics of research practices are “witch hunting,” or that examining the degree to which our science is replicable is self-defeating.  Quite the reverse: psychology is being looked to as the source of the expertise that can improve all of science.  As a psychologist, I’m proud of this.

2.      Preregistration.  Widespread enthusiasm early on Thursday for the idea of pre-registering hypotheses had waned by Friday afternoon.  In his concluding remarks, Tony Greenwald listed it among suggestions that he found “doubtful,” in part because it could increase pressure on investigators to produce the results they promised, rather than remain open to what the data are really trying to tell them.  Greg Francis also observed that pre-registration buys into the idea that the aim of science is simply to dichotomously confirm or disconfirm findings, an idea that is the source of many of our problems to begin with.  In a pithy phrase, he said we should “measure effects, not test them.”
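To make that last distinction concrete, here is a minimal sketch (my own illustration, not anything presented at the workshop) of the difference between testing an effect and measuring it: the same simulated data yield only a yes/no verdict under significance testing, but an effect-size estimate with a confidence interval under an estimation approach. The groups and numbers are hypothetical.

```python
# A minimal sketch contrasting "testing" an effect with "measuring" it.
# The data are simulated; nothing here comes from the workshop itself.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(loc=0.0, scale=1.0, size=30)   # hypothetical control group
group_b = rng.normal(loc=0.4, scale=1.0, size=30)   # hypothetical treatment group

# "Test the effect": a dichotomous verdict at alpha = .05
t, p = stats.ttest_ind(group_b, group_a)
print(f"NHST verdict: p = {p:.3f} -> {'significant' if p < .05 else 'not significant'}")

# "Measure the effect": Cohen's d with an approximate 95% confidence interval
n_a, n_b = len(group_a), len(group_b)
pooled_sd = np.sqrt(((n_a - 1) * group_a.var(ddof=1) +
                     (n_b - 1) * group_b.var(ddof=1)) / (n_a + n_b - 2))
d = (group_b.mean() - group_a.mean()) / pooled_sd
se_d = np.sqrt((n_a + n_b) / (n_a * n_b) + d**2 / (2 * (n_a + n_b)))  # approximate SE of d
print(f"Estimate: d = {d:.2f}, 95% CI [{d - 1.96 * se_d:.2f}, {d + 1.96 * se_d:.2f}]")
```

The second output carries the first one's information and more: it says how big the effect appears to be and how uncertain that estimate is, rather than only whether it crossed an arbitrary threshold.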

3.      The tradeoff between “ground-breaking,” innovative studies and solid science.  In a final editorial, the previous editor of Psychological Science described common reasons for “triaging” (rejecting without review) submitted articles.  It seems he and his associate editors categorized these reasons with musical references, which they found amusing.  One of the reasons was (and I quote) the “Pink Floyd rejection: Most triaged papers were of this type; they reported work that was well done and useful, but not sufficiently groundbreaking. So the findings represented just another brick in the wall of science” (emphasis in the original).  I thought this label was appalling even before I saw Greg Francis’s article, now in press at Psychonomic Bulletin and Review, which concludes that, of the articles published during this editor’s term that Francis examined, 82% showed signs “that unsuccessful findings were suppressed, the experiments or analyses were improper, or that the theory does not properly account for the data.”  Well, they weren’t bricks in the wall; that’s for sure!

It seems clear that grant panels and journal editors have traditionally overvalued flashy findings, especially counter-intuitive ones.  This particular editor is not alone in his value system.  But “counter-intuitive” means prima facie implausible, and such findings should demand stronger evidence than the small N’s and sketchily described methods that so often seem to be their basis.  More broadly, when everybody is trying to be ground-breaking at the same time, who will produce the scientific “bricks” on which knowledge can securely rest?  Given this state of affairs, it is not surprising that so many findings fail to hold up under scrutiny, and that the cutest findings are, as a rule, the most vulnerable.
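To put a rough number on why implausibility matters, here is a minimal sketch (my own back-of-the-envelope calculation, not anything from the workshop) of how likely a “significant” finding is to be true when the hypothesis starts out unlikely and the study is underpowered. The prior and power values are hypothetical.

```python
# A minimal illustration of why prima facie implausible findings demand
# stronger evidence: with a low prior probability that the hypothesis is
# true, even a "significant" result is nearly as likely to be a false
# positive as a true one.
def prob_true_given_significant(prior, power, alpha=0.05):
    """Probability the hypothesis is true, given p < alpha."""
    true_pos = power * prior          # true hypotheses that reach significance
    false_pos = alpha * (1 - prior)   # false hypotheses that reach significance anyway
    return true_pos / (true_pos + false_pos)

# Hypothetical numbers: a counter-intuitive claim with a 10% prior chance of
# being true, tested with the modest power typical of small-N studies...
print(prob_true_given_significant(prior=0.10, power=0.35))  # ~0.44
# ...versus a plausible claim tested with adequate power.
print(prob_true_given_significant(prior=0.50, power=0.80))  # ~0.94
```

On these assumed numbers, a p < .05 result for the counter-intuitive claim is roughly a coin flip, which is exactly why such findings call for larger samples and clearer methods, not smaller ones.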

But while everybody around the table wistfully agreed that solid research and replication deserve more respect than they typically receive, a Stanford Dean who was there pointed out the obvious (and I paraphrase): “Look: nobody is going to get tenure at Stanford [or, I would add, at any other ambitious research university] by replicating other people’s findings.”

Science needs to move forward or die; science needs a strong base of reliable fact from which to proceed.  The tension between these two needs is not an easy one to resolve.

4.      Do we look forward or look back?  I was glad Greg Francis was at the table because his recent article puts this issue front and center.  He named names!  Out there in the twitterverse, I’ve seen complaints that this does more harm than good.  It makes people defensive, it probably tars some people unfairly (e.g., a grad student co-author on an article p-hacked by her advisor), and in general naming names causes an uproar of a sort not exactly conducive to sober scientific discussion.  Hal Pashler ran into exactly the same issue a couple of years ago when he and a student uncovered – and named – studies reporting implausibly large “voodoo correlations” in fMRI research.

It is uncomfortable to accuse someone of having published inaccurate, unreliable, or possibly even fraudulent research.  It is even more uncomfortable, I am sure, to be the person accused.  It is probably for this reason that we hear so many calls to “look forward, not back.”  This point was stated explicitly in both the EAPP report and the SPSP report, to name just two, and such comments were heard at the NSF meeting as well:  Let’s clean up our act and do right from now on.  Let’s not go back and make people uncomfortable and defensive.

I understand this view and have even on occasion expressed it.  But it does worry me.  If we think the findings enshrined in our journals are important – and presumably we do; it’s the output of the whole scientific enterprise in which we are engaged – then don’t we also have an obligation to clean out the findings that turn out to be unreliable or just plain wrong?  And to do this, don’t we have to be specific?  It’s hard to avoid concluding that the answer is “yes.”  But who will do it?  Do we want people to build careers looking for p-hacked articles in the published literature?  Do we want to damage the reputations and possibly careers of people who were just doing “business as usual” the way they were taught?  Do you want to do this?  I know I don’t.

5.      What can the government do?  Addressing this question was the whole ostensible reason for the conference.  Suggestions fell into three broad categories:

a.      Reform grant review practices.  NSF reviewers are, at present, explicitly required to address whether a given grant proposal is “potentially transformative.”  They are not explicitly required to address whether the findings on which the research is based can be deemed reliable, or whether the proposed research includes any safeguards to ensure the replicability of its findings.  Maybe they should be.

b.      Fund replications.  Before the meeting, one colleague told me that I should inform NSF that “if you pay us to do it, we’ll do it.”  In other words, funds should be available for replication research.  Maybe an institute should be set up to replicate key findings, or grant proposals to perform a program of important replications should receive a sympathetic hearing, or maybe ordinary grants should include a replication study or two.  None of this happens now.

c.      Fund research on research.  I was not expecting this theme to emerge as strongly as it did.  It seems that NSF may be interested in looking at ways to support “meta-science,” such as the development of techniques to detect fraudulent data, or studies of how various incentive structures affect scientific behavior.  Psychology includes a lot of experts on statistical methodology as well as on human behavior; there seems to be an opportunity here.
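As one illustration of the kind of technique meant here (my own sketch, not anything NSF discussed or endorsed), many data-forensics checks start from the observation that genuine noisy measurements tend to have roughly uniform final digits, while fabricated numbers often do not. The function and example data below are hypothetical.

```python
# A minimal sketch of a simple fraud-screening check: test whether the
# terminal digits of reported values look uniformly distributed.
from collections import Counter
from scipy import stats

def last_digit_test(values):
    """Chi-square test of uniformity on the final digits of reported values."""
    digits = [int(str(abs(v))[-1]) for v in values]
    counts = Counter(digits)
    observed = [counts.get(d, 0) for d in range(10)]
    expected = [len(values) / 10] * 10
    return stats.chisquare(observed, expected)

# Hypothetical example: a suspiciously regular set of reported scores,
# all ending in the same digit, yields a tiny p-value.
print(last_digit_test([12, 22, 32, 42, 52, 62, 72, 82, 92, 102, 112, 122]))
```

Such a screen proves nothing by itself, of course; it only flags data sets that deserve a closer look, which is precisely the sort of tool meta-science funding could develop and validate.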

6.      Unintended consequences.  Our meeting chair was concerned about this issue, and rightly so.  He began the meeting by recalling the good intentions that underlay the creation of IRBs to protect human subjects, and the monsters of mission creep, irrelevant regulation, and bureaucratic nonsense that so many of them have become.  (These are my words, not his.)  Some proposed reforms entail the danger of going down the same road.  Even worse: as we enshrine confidence intervals over p-values, or preregistration over p-hacking, or thresholds of permissible statistical power or even just N, do we risk replacing old mindless rituals with new ones?

7.      Backlash and resistance.  This issue came up only a couple of times and I wish it had gotten more attention.  It seemed like nobody at the table (a) denied there was a replicability problem in much of the most prominent research in the major journals or (b) denied that something needed to be done.  As one participant said, “we are all drinking the same bath water.”  (I thought this phrase usually referenced Kool-Aid, but never mind.)  Possibly, nobody spoke up because of the apparent consensus in the room.  In any case, there will be resistance out there.  And we need to watch out for it.

I expect the resistance to be of the passive-aggressive sort.  Indeed, that’s what we have already seen.  The sudden awarding of APA’s major scientific award to someone in the midst of a replication controversy seems like an obvious thumb in the eye to the reform movement.  To underline the point, a prominent figure in social psychology (he actually appears in TV commercials!) tweeted that critics of the replicability of the research would never win such a major award.  Sadly, he was almost certainly correct.

One of Geoff Cumming’s graduate students, Fiona Fidler, recently wrote a thesis on the history of null hypothesis significance testing.  It’s a fascinating read, and I hope it will be turned into a book soon.  One of its major themes is that NHST has been criticized thoroughly and compellingly many times over the years.  Yet it persists, even though – and, ironically, perhaps because – it has never really been explicitly defended!  Instead, the defense of NHST is largely passive.  People just keep using it.  Reviewers keep approving it and editors keep publishing it; granting agencies keep giving money to researchers who use it.  Eventually the critiques die down.  Nothing changes.

That could happen this time too.  The defenders of the status quo rarely actively defend anything.  They aren’t about to publish articles explaining why NHST tells you everything you need to know, or arguing that effect sizes of r = .80 in studies with an N of 20 represent important and reliable breakthroughs, or, least of all, reporting data to show that major counter-intuitive findings are robustly replicable.  Instead they will just continue to publish each other’s work in all the “best” places, hire each other into excellent jobs and, of course, give each other awards.  This is what has happened every time before.
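As an aside on that r = .80 example, a quick back-of-the-envelope calculation (mine, not anything from the meeting) shows how loosely a single N = 20 study actually pins down such a correlation, which is one reason a flashy estimate is not the same thing as a reliable one.

```python
# A minimal sketch of the imprecision of r = .80 from N = 20, using the
# Fisher z transformation to get an approximate 95% confidence interval.
import numpy as np

r, n = 0.80, 20
z = np.arctanh(r)                  # Fisher z transform of the observed correlation
se = 1 / np.sqrt(n - 3)            # approximate standard error of z
lo, hi = np.tanh(z - 1.96 * se), np.tanh(z + 1.96 * se)
print(f"r = {r}, N = {n}: 95% CI roughly [{lo:.2f}, {hi:.2f}]")   # about [.55, .92]
```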

Things just might be different this time.  Doubts about statistical standard operating procedure and the replicability of major findings are rampant across multiple fields of study, not just psychology.  And these issues have the attention of major scientific societies and even the US government.  But the strength of the resistance should not be underestimated.

This meeting was very recent and I am still mulling over my own reactions to these and other issues.  This post is, at best, a rough first draft.  I welcome your reactions, in the comments section or elsewhere in the blogosphere/twitterverse.

Why I Decline to do Peer Reviews (part one): Re-reviews

Like pretty much everyone fortunate enough to occupy a faculty position in psychology at a research university, I am frequently asked to review articles submitted for publication to scientific journals. Editors rely heavily on these reviews in making their accept/reject decisions. I know: I’ve been an editor myself, and I experienced first-hand the frustration of trying to persuade qualified reviewers to help me assess the articles that flowed over my desk in seemingly ever-increasing numbers. So don’t get me wrong: I often do agree to do reviews – around 25 times a year, which is probably neither much above nor below the average for psychologists at my career stage. But sometimes I simply refuse, and I want to explain one reason why.

The routine process of peer review is that the editor reads a submitted article, selects 2 or 3 individuals thought to have reasonable expertise in the topic, and asks them for reviews. After some delays due to reviewers’ competing obligations, trips out of town, personal emergencies or – the editor’s true bane – lengthy failures to respond at all, the requisite number of reviews eventually arrive. In a very few cases, the editor reads the reviews, reads the article, and accepts it for publication. In rather more cases, the editor rejects the article. The authors of the remaining articles get a letter inviting them to “revise and resubmit.” In such cases, in theory at least, the reviewers and/or the editor see a promising contribution. Perhaps a different, more informative statistic could be calculated, an omitted relevant article cited, or a theoretical derivation explained more clearly. But the preliminary decision clearly is – or should be – that the research is worth publishing; it could just be reported a bit better.

What happens then? What should happen, in my opinion, is that the author(s) complete their revision, the editor reads it, perhaps refreshing his or her memory by rereading the reviewers’ comments, and then makes a final accept/reject decision. After all, the editor is – presumably and hopefully – at least somewhat cognizant of the topic area and perhaps truly expert. The reviewers were selected for their specific expertise and have already commented on the article, sometimes in great detail. Armed with that, the editor should not find it too difficult to make a final decision.

Too often, this is not what happens. Instead, the editor sends the revised paper out for further review! Usually, this entails sending it to the same individuals who reviewed the article once already. Sometimes – in a surprisingly common practice that every author deeply loathes – the editor also sends the revised article to new reviewers. Everyone then weighs in, looking to make sure their favorite comments were addressed, and making further comments for further revision. The editor reads the reviews and perhaps now makes a final decision, but sometimes not. Yes, the author may be asked to revise and resubmit yet again – and while going back to the old reviewers a third time (plus yet more new reviewers) is less likely, it is far from unheard of.

What is the result of this process? An editor who acts this way would no doubt say that the article is getting better and that, as a result, the journal is publishing better science. Perhaps. But there are a few other results:

  1. The editor has effectively dodged much of the responsibility for his/her editorial decision. Some editors add up the reviewers’ verdicts as if they were votes; some insist on unanimous positive verdicts from long lists of reviewers; in every case the editor can and often does point to the reviewers – rather than to him or herself – as the source of negative comments or outcomes.
  2. The review process has been stretched to epic length. It is, sadly, not especially unusual for this process of reiteration to take a year or more. Any review process shorter than several months is considered lightning-fast at most journals in psychology.
  3. The reviewers have been given the opportunity to micro-manage the paper. They can and often do demand that new references be inserted (sometimes articles written by the reviewers themselves), theoretical claims be toned down (generally ones the reviewers disagree with), and different statistics be calculated. Reviewers may even insist that whole sections be inserted, removed, or completely rewritten.
  4. (As a result of point 3): The author is driven to, in a phrase we have all heard, “make the reviewers happy.” In an attempt to be published, the author will (a) insert references he or she does not actually think are germane, (b) make theoretical statements different from what he or she actually believes to be correct, and (c) take out sections he or she thinks are important, add sections he or she thinks are actually irrelevant, and rephrase discussions using another person’s words. The author’s name still goes on the paper, but the reviewers have become, in effect, co-authors. In a final bit of humiliating obsequiousness, the “anonymous reviewers” may be thanked in a footnote. This expressed gratitude is not always 100% sincere.

These consequences are all bad, but the worst is number 4. A central obligation of every scientist – of every scholar in every field, actually – is to say what one really thinks. (If tenure has a justification, this is it.) And yet the quest to “make the reviewers happy” leads too many authors to say things they don’t completely believe. At best, they are phrasing things differently than they would prefer, or citing a few articles that they don’t really regard as relevant. At worst, they are distorting their article into an incoherent mish-mash co-written by a committee of anonymous reviewers — none of whom came up with the original idea for the research, conducted the study, or is held accountable for whether the article finally published is right or wrong.

So that’s why, when the little box at the bottom of the peer review sheet asks, “Would you be willing to review a revision of this article?”, I check “no.”  Please, editors: Evaluate the article that was submitted. If it needs a few minor tweaks, give the author a chance to make them. If it needs more than that, reject it. But don’t drag out the review process to the end of time, and don’t let a panel of reviewers – no matter how brilliant – co-author the article. They should write their own.