I spent last Thursday and Friday (February 20 and 21) at an NSF workshop concerning the replicability of research results. It was chaired by John Cacioppo and included about 30 participants, among them such well-known contributors to the discussion as Brian Nosek, Hal Pashler, Eric Eich, and Tony Greenwald, to name a few. Participants also included officials from NIH, NSF, the White House Office of Science and Technology Policy, and at least one private foundation. I was invited, I presume, in my capacity as Past-President of SPSP and chair of an SPSP task force on research practices, which recently published a report on non-retracted PSPB articles by investigators who had retracted articles elsewhere, along with a set of recommendations for research and educational practice that just appeared in PSPR.
Committees, task forces and workshops – whatever you call them – about replicability issues have become almost commonplace. The SPSP Task Force was preceded by a meeting and report sponsored by the European Association of Personality Psychology, and other efforts have been led by APS, the Psychonomic Society, and other organizations. Two symposia on the subject were held at the SPSP meeting in Austin just the week before. But this discussion was perhaps special, because it was the first (to my knowledge) to be sponsored by the US government, with the explicit purpose of seeking advice about what NSF and other research agencies should do. I think it is fair to say: when replication is discussed in a meeting with representatives from NIH, NSF, and the White House, the issue is on the front burner!
The discussion covered several themes, some of which are more familiar than others. From my scribbled notes, I can list a few that received particular attention.
1. It’s not just – or even especially – about psychology. I was heartened to see that the government representatives saw the bulk of problems with replication as lying in fields such as molecular biology, genetics, and medicine, not in psychology. Psychology has problems too, but is widely viewed as the best place to look for solutions since the basic issues all involve human behavior. It makes me a bit crazy when psychologists say (or sometimes shout) that everything is fine, that critics of research practices are “witch hunting,” or that examining the degree to which our science is replicable is self-defeating. Quite the reverse: psychology is being looked to as the source of the expertise that can improve all of science. As a psychologist, I’m proud of this.
2. Preregistration. Widespread enthusiasm early on Thursday for the idea of pre-registering hypotheses waned by Friday afternoon. In his concluding remarks, Tony Greenwald listed it among suggestions that he found “doubtful,” in part because it could increase pressure on investigators to produce the results they promised, rather than be open to what the data are really trying to tell them. Greg Francis also observed that pre-registration buys into the idea that the aim of science is simply to dichotomously confirm or disconfirm findings, an idea that is the source of many of our problems to begin with. In a pithy phrase, he said we should “measure effects, not test them.”
3. The tradeoff between “ground-breaking,” innovative studies and solid science. In a final editorial, the previous editor of Psychological Science described common reasons for “triaging” (rejecting without review) submitted articles. It seems he and his associate editors categorized these reasons with musical references, which they found amusing. One of the reasons was (and I quote) the “Pink Floyd rejection: Most triaged papers were of this type; they reported work that was well done and useful, but not sufficiently groundbreaking. So the findings represented just another brick in the wall of science” (emphasis in the original). I thought this label was appalling even before I saw Greg Francis’s article, now in press at Psychonomic Bulletin & Review, which concludes that, of the articles published during this editor’s term that Francis reviewed, 82% showed signs “that unsuccessful findings were suppressed, the experiments or analyses were improper, or that the theory does not properly account for the data.” Well, they weren’t bricks in the wall; that’s for sure!
It seems clear that grant panels and journal editors have traditionally overvalued flashy findings, especially counter-intuitive ones. This particular editor is not alone in his value system. But “counter intuitive” means prima facie implausible, and such findings should demand stronger evidence than the small N’s and sketchily-described methods that so often seem to be their basis. More broadly, when everybody is trying to be ground-breaking at the same time, who will produce the scientific “bricks” on which knowledge can securely rest? Given this state of affairs, it is not surprising that so many findings fail to hold up under scrutiny, and that the cutest findings are, as a rule, the most vulnerable.
But while everybody around the table wistfully agreed that solid research and replication deserve more respect than they typically receive, a Stanford Dean who was there pointed out the obvious (and I paraphrase): “Look: nobody is going to get tenure at Stanford [or, I would add, at any other ambitious research university] from replicating other people’s findings.”
Science needs to move forward or die; yet science also needs a strong base of reliable fact from which to proceed. The tension between these two needs is not an easy one to resolve.
4. Do we look forward or look back? I was glad Greg Francis was at the table because his recent article puts this issue front and center. He named names! Out there in the twitterverse, I’ve seen complaints that this does more harm than good. It makes people defensive, it probably tars some people unfairly (e.g., a grad student co-author on an article p-hacked by her advisor), and in general naming names causes an uproar of a sort not exactly conducive to sober scientific discussion. Hal Pashler ran into exactly the same issue a couple of years ago when he and a student uncovered – and named – studies reporting implausibly large “voodoo correlations” in fMRI research.
It is uncomfortable to accuse someone of having published inaccurate, unreliable or possibly even fraudulent research. It is even more uncomfortable, I am sure, to be the person accused. It is probably for this reason that we hear so many calls to “look forward, not back.” This was a specific statement of both the EAPP report and the SPSP report, to name just two, and such comments were heard at the NSF meeting as well: Let’s clean up our act and do right from now on. Let’s not go back and make people uncomfortable and defensive.
I understand this view and have even on occasion expressed it. But it does worry me. If we think the findings enshrined in our journals are important – and presumably we do; it’s the output of the whole scientific enterprise in which we are engaged – then don’t we also have an obligation to clean out the findings that turn out to be unreliable or just plain wrong? And to do this, don’t we have to be specific? It’s hard to avoid concluding that the answer is “yes.” But who will do it? Do we want people to build careers looking for p-hacked articles in the published literature? Do we want to damage the reputations and possibly careers of people who were just doing “business as usual” the way they were taught? Do you want to do this? I know I don’t.
5. What can the government do? Addressing this question was the whole ostensible reason for the conference. Suggestions fell into three broad categories:
a. Reform grant review practices. NSF reviewers are, at present, explicitly required to address whether a given grant proposal is “potentially transformative.” They are not explicitly required to address whether the findings on which the research is based can be deemed reliable, or whether the proposed research includes any safeguards to ensure the replicability of its findings. Maybe they should.
b. Fund replications. Before the meeting, one colleague told me that I should inform NSF that “if you pay us to do it, we’ll do it.” In other words, funds should be available for replication research. Maybe an institute should be set up to replicate key findings, or grant proposals to perform a program of important replications should receive a sympathetic hearing, or maybe ordinary grants should include a replication study or two. None of this happens now.
c. Fund research on research. I was not expecting this theme to emerge as strongly as it did. It seems like NSF may be interested in looking at ways to support “meta-science,” such as the development of techniques to detect fraudulent data, or the effect of various incentive structures on scientific behavior. Psychology includes a lot of experts on statistical methodology as well as on human behavior; there seems to be an opportunity here.
6. Unintended consequences. Our meeting chair was concerned about this issue, and rightly so. He began the meeting by recalling the good intentions that underlay the initiation of IRBs to protect human subjects, and the monsters of mission creep, irrelevant regulation, and bureaucratic nonsense so many have become. (These are my words, not his.) Some proposed reforms entail the danger of going down the same road. Even worse: as we enshrine confidence intervals over p-values, or preregistration over p-hacking, or thresholds of permissible statistical power or even just N, do we risk replacing old mindless rituals with new ones?
7. Backlash and resistance. This issue came up only a couple of times and I wish it had gotten more attention. It seemed like nobody at the table (a) denied there was a replicability problem in much of the most prominent research in the major journals or (b) denied that something needed to be done. As one participant said, “we are all drinking the same bath water.” (I thought this phrase usually referenced Kool-Aid, but never mind.) Possibly, nobody spoke up because of the apparent consensus in the room. In any case, there will be resistance out there. And we need to watch out for it.
I expect the resistance to be of the passive-aggressive sort. Indeed, that’s what we have already seen. The sudden awarding of APA’s major scientific award to someone in the midst of a replication controversy seems like an obvious thumb in the eye to the reform movement. To underline the point, a prominent figure in social psychology (he actually appears in TV commercials!) tweeted that critics of the replicability of the research would never win such a major award. Sadly, he was almost certainly correct.
One of Geoff Cumming’s graduate students, Fiona Fidler, recently wrote a thesis on the history of null hypothesis significance testing. It’s a fascinating read and I hope will be turned into a book soon. One of its major themes is that NHST has been criticized thoroughly and compellingly many times over the years. Yet it persists, even though – and, ironically, perhaps because – it has never really been explicitly defended! Instead, the defense of NHST is largely passive. People just keep using it. Reviewers and editors just keep publishing it; granting agencies keep giving money to researchers who use it. Eventually the critiques die down. Nothing changes.
That could happen this time too. The defenders of the status quo rarely actively defend anything. They aren’t about to publish articles explaining why NHST tells you everything you need to know, or arguing that effect sizes of r = .80 in studies with an N of 20 represent important and reliable breakthroughs, or least of all reporting data to show that major counter-intuitive findings are robustly replicable. Instead they will just continue to publish each other’s work in all the “best” places, hire each other into excellent jobs and, of course, give each other awards. This is what has happened every time before.
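As an aside of my own (this was not presented at the meeting), a quick back-of-the-envelope calculation shows just how imprecise an estimate like r = .80 from N = 20 really is. The sketch below uses the standard Fisher z-transformation to get an approximate 95% confidence interval for a correlation; the helper name `fisher_ci` is mine, not anyone’s official method.

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a correlation r
    observed in a sample of size n, via the Fisher z-transform."""
    z = math.atanh(r)                 # transform r to Fisher's z
    se = 1 / math.sqrt(n - 3)        # standard error of z
    lo, hi = z - z_crit * se, z + z_crit * se
    return math.tanh(lo), math.tanh(hi)  # back-transform to the r scale

lo, hi = fisher_ci(0.80, 20)
print(f"95% CI for r = .80, N = 20: [{lo:.2f}, {hi:.2f}]")  # roughly [.55, .92]
```

Even taking such a result at face value, the data are consistent with anything from a moderate correlation to a nearly perfect one. “Measuring the effect,” in Greg Francis’s sense, makes that uncertainty visible in a way a lone p-value does not.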
Things just might be different this time. Doubts about statistical standard operating procedure and the replicability of major findings are rampant across multiple fields of study, not just psychology. And these issues have the attention of major scientific societies and even the US Government. But the strength of the resistance should not be underestimated.
This meeting was very recent and I am still mulling over my own reactions to these and other issues. This post is, at best, a rough first draft. I welcome your reactions, in the comments section or elsewhere in the blogosphere/twitterverse.