I spent last Thursday and Friday (February 20 and 21) at an NSF workshop concerning the replicability of research results. It was chaired by John Cacioppo and included about 30 participants, among them such well-known contributors to the discussion as Brian Nosek, Hal Pashler, Eric Eich, and Tony Greenwald, to name a few. Participants also included officials from NIH, NSF, the White House Office of Science and Technology Policy, and at least one private foundation. I was invited, I presume, in my capacity as Past-President of SPSP and chair of an SPSP task force on research practices, which recently published a report on non-retracted PSPB articles by investigators who had retracted articles elsewhere, along with a set of recommendations for research and educational practice that just appeared in PSPR.
Committees, task forces and workshops – whatever you call them – about replicability issues have become almost commonplace. The SPSP Task Force was preceded by a meeting and report sponsored by the European Association of Personality Psychology, and other efforts have been led by APS, the Psychonomic Society, and other organizations. Two symposia on the subject were held at the SPSP meeting in Austin just the week before. But this discussion was perhaps special, because it was the first (to my knowledge) to be sponsored by the US government, with the explicit purpose of seeking advice about what NSF and other research agencies should do. I think it is fair to say: when replication is discussed in a meeting with representatives from NIH, NSF, and the White House, the issue is on the front burner!
The discussion covered several themes, some of which are more familiar than others. From my scribbled notes, I can list a few that received particular attention.
1. It’s not just – or even especially – about psychology. I was heartened to see that the government representatives saw the bulk of problems with replication as lying in fields such as molecular biology, genetics, and medicine, not in psychology. Psychology has problems too, but is widely viewed as the best place to look for solutions since the basic issues all involve human behavior. It makes me a bit crazy when psychologists say (or sometimes shout) that everything is fine, that critics of research practices are “witch hunting,” or that examining the degree to which our science is replicable is self-defeating. Quite the reverse: psychology is being looked to as the source of the expertise that can improve all of science. As a psychologist, I’m proud of this.
2. Preregistration. Widespread enthusiasm early on Thursday for the idea of pre-registering hypotheses waned by Friday afternoon. In his concluding remarks, Tony Greenwald listed it among suggestions that he found “doubtful,” in part because it could increase pressure on investigators to produce the results they promised, rather than be open to what the data are really trying to tell them. Greg Francis also observed that pre-registration buys into the idea that the aim of science is simply to dichotomously confirm or disconfirm findings, an idea that is the source of many of our problems to begin with. In a pithy phrase, he said we should “measure effects, not test them.”
3. The tradeoff between “ground-breaking,” innovative studies and solid science. In a final editorial, the previous editor of Psychological Science described common reasons for “triaging” (rejecting without review) submitted articles. It seems he and his associate editors categorized these reasons with musical references, which they found amusing. One of the reasons was (and I quote) the “Pink Floyd rejection: Most triaged papers were of this type; they reported work that was well done and useful, but not sufficiently groundbreaking. So the findings represented just another brick in the wall of science” (emphasis in the original). I thought this label was appalling even before I saw Greg Francis’s article, now in press at Psychonomic Bulletin & Review, which concludes that, of the articles published during this editor’s term that Francis reviewed, 82% showed signs “that unsuccessful findings were suppressed, the experiments or analyses were improper, or that the theory does not properly account for the data.” Well, they weren’t bricks in the wall; that’s for sure!
It seems clear that grant panels and journal editors have traditionally overvalued flashy findings, especially counter-intuitive ones. This particular editor is not alone in his value system. But “counter intuitive” means prima facie implausible, and such findings should demand stronger evidence than the small N’s and sketchily-described methods that so often seem to be their basis. More broadly, when everybody is trying to be ground-breaking at the same time, who will produce the scientific “bricks” on which knowledge can securely rest? Given this state of affairs, it is not surprising that so many findings fail to hold up under scrutiny, and that the cutest findings are, as a rule, the most vulnerable.
But while everybody around the table wistfully agreed that solid research and replication deserve more respect than they typically receive, a Stanford Dean who was there pointed out the obvious (and I paraphrase): “Look: nobody is going to get tenure at Stanford [or, I would add, at any other ambitious research university] from replicating other people’s findings.”
Science needs to move forward or die; yet science also needs a strong base of reliable fact from which to proceed. The tension between these two needs is not an easy one to resolve.
4. Do we look forward or look back? I was glad Greg Francis was at the table because his recent article puts this issue front and center. He named names! Out there in the twitterverse, I’ve seen complaints that this does more harm than good. It makes people defensive, it probably tars some people unfairly (e.g., a grad student co-author on an article p-hacked by her advisor), and in general naming names causes an uproar of a sort not exactly conducive to sober scientific discussion. Hal Pashler ran into exactly the same issue a couple of years ago when he and a student uncovered – and named – studies reporting implausibly large “voodoo correlations” in fMRI research.
It is uncomfortable to accuse someone of having published inaccurate, unreliable or possibly even fraudulent research. It is even more uncomfortable, I am sure, to be the person accused. It is probably for this reason that we hear so many calls to “look forward, not back.” This was a specific statement of both the EAPP report and the SPSP report, to name just two, and such comments were heard at the NSF meeting as well: Let’s clean up our act and do right from now on. Let’s not go back and make people uncomfortable and defensive.
I understand this view and have even on occasion expressed it. But it does worry me. If we think the findings enshrined in our journals are important – and presumably we do; it’s the output of the whole scientific enterprise in which we are engaged – then don’t we also have an obligation to clean out the findings that turn out to be unreliable or just plain wrong? And to do this, don’t we have to be specific? It’s hard to avoid concluding that the answer is “yes.” But who will do it? Do we want people to build careers looking for p-hacked articles in the published literature? Do we want to damage the reputations and possibly careers of people who were just doing “business as usual” the way they were taught? Do you want to do this? I know I don’t.
5. What can the government do? Addressing this question was the whole ostensible reason for the conference. Suggestions fell into three broad categories:
a. Reform grant review practices. NSF reviewers are, at present, explicitly required to address whether a given grant proposal is “potentially transformative.” They are not explicitly required to address whether the findings on which the research is based can be deemed reliable, or whether the proposed research includes any safeguards to ensure the replicability of its findings. Maybe they should.
b. Fund replications. Before the meeting, one colleague told me that I should inform NSF that “if you pay us to do it, we’ll do it.” In other words, funds should be available for replication research. Maybe an institute should be set up to replicate key findings, or grant proposals to perform a program of important replications should receive a sympathetic hearing, or maybe ordinary grants should include a replication study or two. None of this happens now.
c. Fund research on research. I was not expecting this theme to emerge as strongly as it did. It seems like NSF may be interested in looking at ways to support “meta-science,” such as the development of techniques to detect fraudulent data, or the effect of various incentive structures on scientific behavior. Psychology includes a lot of experts on statistical methodology as well as on human behavior; there seems to be an opportunity here.
6. Unintended consequences. Our meeting chair was concerned about this issue, and rightly so. He began the meeting by recalling the good intentions that underlay the initiation of IRBs to protect human subjects, and the monsters of mission creep, irrelevant regulation, and bureaucratic nonsense so many have become. (These are my words, not his.) Some proposed reforms entail the danger of going down the same road. Even worse: as we enshrine confidence intervals over p-values, or preregistration over p-hacking, or thresholds of permissible statistical power or even just N, do we risk replacing old mindless rituals with new ones?
7. Backlash and resistance. This issue came up only a couple of times and I wish it had gotten more attention. It seemed like nobody at the table (a) denied there was a replicability problem in much of the most prominent research in the major journals or (b) denied that something needed to be done. As one participant said, “we are all drinking the same bath water.” (I thought this phrase usually referenced Kool-Aid, but never mind.) Possibly, nobody spoke up because of the apparent consensus in the room. In any case, there will be resistance out there. And we need to watch out for it.
I expect the resistance to be of the passive-aggressive sort. Indeed, that’s what we have already seen. The sudden awarding of APA’s major scientific award to someone in the midst of a replication controversy seems like an obvious thumb in the eye to the reform movement. To underline the point, a prominent figure in social psychology (he actually appears in TV commercials!) tweeted that critics of the replicability of the research would never win such a major award. Sadly, he was almost certainly correct.
One of Geoff Cumming’s graduate students, Fiona Fidler, recently wrote a thesis on the history of null hypothesis significance testing. It’s a fascinating read and I hope will be turned into a book soon. One of its major themes is that NHST has been criticized thoroughly and compellingly many times over the years. Yet it persists, even though – and, ironically, perhaps because – it has never really been explicitly defended! Instead, the defense of NHST is largely passive. People just keep using it. Reviewers and editors just keep publishing it; granting agencies keep giving money to researchers who use it. Eventually the critiques die down. Nothing changes.
That could happen this time too. The defenders of the status quo rarely actively defend anything. They aren’t about to publish articles explaining why NHST tells you everything you need to know, or arguing that effect sizes of r = .80 in studies with an N of 20 represent important and reliable breakthroughs, or least of all reporting data to show that major counter-intuitive findings are robustly replicable. Instead they will just continue to publish each other’s work in all the “best” places, hire each other into excellent jobs and, of course, give each other awards. This is what has happened every time before.
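As an aside of my own (this was not presented at the meeting), a quick back-of-the-envelope calculation shows just how imprecise an estimate like r = .80 from N = 20 really is. The sketch below uses the standard Fisher z-transformation to get an approximate 95% confidence interval for a correlation; the helper name `fisher_ci` is mine, not anyone’s official method.

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a correlation r
    observed in a sample of size n, via the Fisher z-transform."""
    z = math.atanh(r)                 # transform r to Fisher's z
    se = 1 / math.sqrt(n - 3)        # standard error of z
    lo, hi = z - z_crit * se, z + z_crit * se
    return math.tanh(lo), math.tanh(hi)  # back-transform to the r scale

lo, hi = fisher_ci(0.80, 20)
print(f"95% CI for r = .80, N = 20: [{lo:.2f}, {hi:.2f}]")  # roughly [.55, .92]
```

Even taking such a result at face value, the data are consistent with anything from a moderate correlation to a nearly perfect one. “Measuring the effect,” in Greg Francis’s sense, makes that uncertainty visible in a way a lone p-value does not.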
Things just might be different this time. Doubts about statistical standard operating procedure and the replicability of major findings are rampant across multiple fields of study, not just psychology. And these issues have the attention of major scientific societies and even the US Government. But the strength of the resistance should not be underestimated.
This meeting was very recent and I am still mulling over my own reactions to these and other issues. This post is, at best, a rough first draft. I welcome your reactions, in the comments section or elsewhere in the blogosphere/twitterverse.