The Real Source of the Replication Crisis

“Replication police.” “P-squashers.” “Hand-wringers.” “Hostile replicators.”  And of course, who can ever forget, “shameless little bullies.”  These are just some of the labels applied to what has become known as the replication movement, an attempt to improve science (psychological and otherwise) by assessing whether key findings can be reproduced in independent laboratories.

Replication researchers have sometimes targeted findings they found doubtful.  The grounds for finding them doubtful have included (a) the effect is “counter-intuitive” or in some way seems odd (1), (b) the original study had a small N and an implausibly large effect size, (c) anecdotes (typically heard at hotel bars during conferences) abound concerning naïve researchers who can’t reproduce the finding, (d) the researcher who found the effect refuses to make data public, has “lost” the data or refuses to answer procedural questions, or (e) sometimes, all of the above.

Fair enough. If a finding seems doubtful, and it’s important, then it behooves the science (if not any particular researcher) to get to the bottom of things.  And we’ve seen a lot of attempts to do that lately. Famous findings by prominent researchers have been put through the replication wringer, sometimes with discouraging results.  But several of these findings also have been stoutly defended, and indeed the failure to replicate certain prominent effects seems to have stimulated much of the invective thrown at replicators more generally.

One target of the “replication police” has enjoyed few defenders. I speak, of course, of Daryl Bem (2). His findings on ESP stimulated an uncounted number of replications.  These were truly “hostile” – they set out to prove his findings wrong and what do you know, the replications uniformly failed. In response, by all accounts, Daryl has been the very model of civility. He provides his materials and data freely and without restriction, encourages the publication of all findings regardless of how they turn out, and has managed to refrain from telling his critics that they have “nothing in their heads” or indeed, saying anything negative about them at all that I’ve seen. Yet nobody comes to his defense (3) even though any complaint anybody might have had about the “replication police” applies, times 10, to the reception his findings have received. In particular, critiques of the replication movement never mention the ESP episode, even though Bem’s experience probably provides the best example of everything they are complaining about.

Because, and I say this with some reluctance, the backlash to the replication movement does have a point.  There IS something a bit disturbing about the sight of researchers singling out effects they don’t like (maybe for good reasons, maybe not), and putting them – and only them – under the replication microscope.  And, I also must admit, there is something less-than-1000% persuasive about the inability to find an effect by somebody who didn’t believe it existed in the first place. (4)

But amidst all the controversy, one key fact seems to be repeatedly overlooked by both sides (6).  The replication crisis did NOT arise because of studies intended to assess the reality of doubtful effects.  That only came later.  Long before were repeated and, indeed, ubiquitous stories of fans of research topics – often graduate students – who could not make the central effects work no matter how hard they tried.  These failed replications were anything but hostile.  I related a few examples in a previous post but there’s a good chance you know some of your own.  Careers have been seriously derailed as graduate students and junior researchers, naively expecting to be able to build on some of the most famous findings in their field, published in the top journals by distinguished researchers, simply couldn’t make them work.

What did they do? In maybe 99% of all cases (100% of the cases I personally know about) they kept quiet. They – almost certainly correctly – saw no value for their own careers in publicizing their inability to reproduce an effect enshrined in textbooks.  And that was before a Nobel laureate promulgated new rules for doing replications, before a Harvard professor argued that failed replications provide no information and, of course, long before people reporting replication studies started to be called “shameless little bullies.”

Most of the attention given these days to the replication movement – both pro and con – seems to center around studies specifically conducted to assess whether or not particular findings can be replicated. I am one of those who believe that such studies are, by and large, useful and important.  But we should remember that the movement did not come about because of, or in order to promote, such studies.  Instead, its original motivation was to make it a bit easier and safer to report data that go against the conventional wisdom, and thereby protect those who might otherwise waste years of their lives trying to follow up on famous findings that have already been disconfirmed, everywhere except in public.  From what we’ve seen lately, this goal remains a long way off.

  1. Naturally, counter-intuitiveness and oddity are in the eye of the beholder.
  2. Full disclosure: He was my graduate advisor. An always-supportive, kind, wise and inspirational graduate advisor.
  3. Unless this post counts.
  4. This is why Brian Nosek and the Center for Open Science made exactly the right move when they picked out a couple of issues of JPSP and two other prominent journals and began to recruit researchers to replicate ALL of the findings in them (5). Presumably nobody will have a vested interest or even a pre-existing bias as to whether or not the effects are real, making the eventual results of these replications, when they arrive, all the more persuasive.
  5. Full disclosure: A study of mine was published in one of those issues of JPSP. Gulp!
  6. Even here, which is nothing if not thorough.

Acknowledgements: Simine Vazire showed me that footnotes can be fun.  But I still use my caps key. Simine and Sanjay Srivastava gave me some advice, some of which I followed.  But in no way is this post their fault.

When Did We Get so Delicate?

Replication issues are rampant these days. The recent round of widespread concern over whether supposedly established findings can be reproduced began in biology and the related life sciences, especially medicine. Psychologists entered the fray a bit later, largely in a constructive way. Individuals and professional societies published commentaries on methodology, journals acted to revise their policies to promote data transparency and encourage replication, and the Center for Open Science took concrete steps to make doing research “the right way” easier. As a result, psychology was viewed not as the poster child of replication problems but as quite the opposite: the best place to look for solutions to these problems.

So what just happened? In the words of a headline in the Chronicle of Higher Education, the situation in psychology has suddenly turned “ugly and odd.”  Some psychologists whose findings were not replicated are complaining plaintively about feeling bullied. Others are chiming in about how terrible it is that people’s reputations are ruined when others can’t replicate their work. People doing replication studies have been labeled the “replication police,” “replication Nazis” and even, in one prominent psychologist’s already famous phrase, “shameless little bullies.” This last-mentioned writer also passed along an anonymous correspondent’s description of replication as a “McCarthyite nightmare.”  More sober commentators have expressed worries about “negative psychology” and “p-squashing.” Concern has shifted away from the difficulties faced by those who can’t make famous effects “work,” and the dilemma about whether they dare to go public when this happens. Instead, prestigious commentators are worrying about the possible damage to the reputations of the psychologists who discovered these famous effects, and promulgating new rules to follow before going public with disconfirmatory data.

First, a side comment: It’s my impression that reputations are not really damaged, much less ruined, by failures to replicate. Reputations are damaged, I fear, by defensive, outraged reactions to failures to replicate one’s work. And we’ve seen too many of those, and not enough reactions like this.

But now, the broader point: When did we get so delicate? Why are psychologists, who can and should lead the way in tackling this scientific issue head-on, and until recently were doing just that, instead becoming distracted by reputational issues and hurt feelings?

Is anybody in medicine complaining about being bullied by non-replicators, or is anyone writing blog posts about the perils of “negative biology”? Or is it just us? And if it’s just us, why is that? I would really like to know the answer to this question.

For now, if you happen to be a psychologist sitting on some data that might undermine somebody’s famous finding, the only advice I can give you is this:  Mum’s the word.  Don’t tell a soul.  Unless you are the kind of person who likes to poke sticks into hornets’ nests.

The “Fundamental Attribution Error” and Suicide Terrorism

Review of: Lankford, A. (2013) The myth of martyrdom: What really drives suicide bombers, rampage shooters, and other self-destructive killers. Palgrave Macmillan.
In Press, Behavioral and Brain Sciences (published version may differ slightly)

In 1977, the social psychologist Lee Ross coined the term “fundamental attribution error” to describe the putative tendency of people to overestimate the importance of dispositional causes of behavior, such as personality traits and political attitudes, and underestimate the importance of situational causes, such as social pressure or objective circumstances.  Over the decades since, the term has firmly rooted itself in the conventional wisdom of social psychology, to the point where it is sometimes identified as the field’s basic insight (Ross & Nisbett 2011). However, the actual research evidence purporting to demonstrate this error is surprisingly weak (see, e.g., Funder 1982; Funder & Fast 2010; Krueger & Funder 2004), and at least one well-documented error (the “false consensus bias”; Ross 1977a) implies that people overestimate the degree to which their behavior is determined by the situation.

Moreover, everyday counter-examples are not difficult to formulate. Consider the last time you tried, in an argument, to change someone’s attitude. Was it easier or harder than you expected?  Therapeutic interventions and major social programs intended to correct dispositional problems, such as tendencies towards violence or alcoholism, also are generally less successful than anticipated. Work supervisors and even parents, who have a great deal of control over the situations experienced by their employees or children, similarly find it surprisingly difficult to control behaviors as simple as showing up on time or making one’s bed. My point is not that people never change their minds, that interventions never work, or that employers and parents have no control over employees or children; it is simply that situational influences on behavior are often weaker than expected.

Even so, it would be going too far to claim that the actual “fundamental” error is the reverse, that people overestimate the importance of situational factors and underestimate the importance of dispositions.  A more judicious conclusion would be that sometimes people overestimate the importance of dispositional factors, and sometimes they overestimate the importance of situational factors, and the important thing, in a particular case, is to try to get it right. The book under review, The Myth of Martyrdom (Lankford 2013), aims to present an extended example of an important context in which many authoritative figures get it wrong, by making the reverse of the fundamental attribution error (though the book never uses this term): When trying to find the causes of suicide terrorism, too many experts ascribe causality to the political context in which terrorism occurs, or the practical aims that terrorists hope to achieve. Instead, the author argues, most, if not all, suicide terrorists are mentally disturbed, vulnerable, and angry individuals who are not so different from run-of-the-mill suicides, and who are in fact highly similar to “non-terrorist” suicidal killers such as the Columbine or Sandy Hook murderers. Personality and individual differences are important; suicide terrorists are not ordinary people driven by situational forces.

Lankford convincingly argues that misunderstanding suicide terrorists as individuals who are rationally responding to oppression or who are motivated by political or religious goals is dangerous, because it plays into the propaganda aims of terrorist organizations to portray such individuals as brave martyrs rather than weak, vulnerable and exploitable pawns. By spreading the word that suicide terrorists are mentally troubled individuals who wish to kill themselves as much or more than they desire to advance any particular cause, Lankford hopes to lessen the attractiveness of the martyr role to would-be recruits, and also remove any second-hand glory that might otherwise accrue to a terrorist group that manages to recruit suicide-prone operatives to its banner.

Lankford’s overall message is important.  However, the book is less than an ideal vehicle for it. The evidence cited consists mostly of a hodge-podge of case studies which show that some suicide terrorists, such as the lead 9/11 hijacker, had mental health issues and suicidal tendencies that long preceded their infamous acts. The book speaks repeatedly of the “unconscious” motives of such individuals, without developing a serious psychological analysis of what unconscious motivation really means or how it can be detected. It rests much of its argument on quotes from writers that Lankford happens to agree with, rather than independent analysis. It never mentions the “fundamental attribution error,” a prominent theme within social psychology that is the book’s major implicit counterpoint, whether Lankford knows this or not. The obvious parallels between suicide terrorists and genuine heroes who are willing to die for a cause are noted, but a whole chapter (Ch. 5) attempting to explain how they are different fails to make a distinction that was clear to this reader. In the end, the book is not a work of serious scholarship. It is written at the level of a popular, “trade” book, in prose that is sometimes distractingly overdramatic and even breathless. Speaking as someone who agrees with Lankford’s basic thesis, I wish it had received the serious analysis and documentation it deserves, as well as being tied to other highly relevant themes in social psychology.  Perhaps another book, more serious but less engaging to the general reader, lies in the future. I hope so.

For the ideas in this book are important.  One attraction of the concept of the “fundamental attribution error,” and the emphasis on situational causation in general, is that it is seen by some as removing limits on human freedom, implying that anybody can accomplish anything regardless of one’s abilities or stable attributes. While these are indeed attractive ideas, they are values and not scientific principles. Moreover, an overemphasis on situational causation removes personal responsibility, one example being the perpetrators of the Nazi Holocaust who claimed they were “only following orders.” Renewed attention to the personal factors that affect behavior may not only help to identify people at risk of committing atrocities, but also restore the notion that, situational factors notwithstanding, a person is in the end responsible for what he or she does.

Funder, D. C. (1982) On the accuracy of dispositional vs. situational attributions. Social Cognition 1:205–22.
Funder, D. C. & Fast, L. A. (2010) Personality in social psychology. In: Handbook of social psychology, 5th edition, ed. D. Gilbert & S. Fiske, pp. 668–97. Wiley.
Krueger, J. I. & Funder, D. C. (2004) Towards a balanced social psychology: Causes, consequences and cures for the problem-seeking approach to social behavior and cognition. Behavioral and Brain Sciences 27:313–27.
Lankford, A. (2013) The myth of martyrdom: What really drives suicide bombers, rampage shooters, and other self-destructive killers. Palgrave Macmillan.
Ross, L. (1977a) The false consensus effect: An egocentric bias in social perception and attribution processes. Journal of Experimental Social Psychology 13(3):279–301.
Ross, L. (1977b) The intuitive psychologist and his shortcomings: Distortions in the attribution process. In: Advances in experimental social psychology, vol. 10, ed. L. Berkowitz, pp. 173–220. Academic Press.
Ross, L. & Nisbett, R. E. (2011) The person and the situation: Perspectives of social psychology, 2nd edition. Pinter and Martin.

Why I Decline to do Peer Reviews (part two): Eternally Masked Reviews

In addition to the situation described in a previous post, there is another situation where I decline to do a peer review. First, I need to define a couple of terms. “Blind review” refers to the practice of concealing the identity of reviewers from authors. The reason seems pretty obvious. Scientific academia is a small world, egos are easily bruised, and vehicles for subtle or not-so-subtle vengeance (e.g., journal reviews and tenure letters) are readily at hand. If an editor wants an unvarnished critique, the reviewer’s identity needs to be protected. That’s why every journal (I know of) follows the practice of blind review.

“Masked review” is different. In this practice, the identity of the author(s) is concealed from reviewers. The well-intentioned reason is to protect authors from bias, such as bias against women, junior researchers, or researchers from non-famous institutions. Some journals use masked review for all articles; some offer the option to authors; some do not use it at all.

A few years ago, I did a review of an article submitted to Psychological Bulletin. The journal had a policy of masked review posted on its masthead, noting that the identity of the author(s) is concealed from the reviewers “during the review process.” I liked the article and wrote a positive review. The other two reviewers didn’t like it, and the article was rejected. I was surprised, when I received my copy of the rejection letter, that the authors’ identity was still redacted.

So I contacted the editor. I was sure there had been some (minor) mistake. But the editor refused to reveal who the authors were, saying that the review was masked. I pointed out the phrase in the statement of journal policy that authors’ identity would be concealed “during the review process.” I had assumed this meant, only during the review process. The editor replied that while he could see my point, he could only reveal the authors’ name(s) with the authors’ permission. This seemed odd but I said ok, go ahead, ask the authors if I can know who they are. The answer came back that I could, if I revealed my own identity!

Now, I should not have had any problem with this, right? My own review was positive, so this was probably a chance to make a new friend. I only wanted to know the authors’ identity so that I could follow their work in general, and the fate of this article in particular. Still, the implications disturbed me. If the rule is that author identity is unmasked after the review process only if the reviewer agrees to be identified to the author, then it seems that only writers of positive reviews would learn authors’ identity, because they are the only ones who would agree. Authors of negative reviews would be highly unlikely to allow their identity to be revealed because of possible adverse consequences – recall this is the very reason for “blind” review in the first place. And, the whole situation makes no sense anyway. What’s the point of continuing to mask author identity after the review is over?

At this time, ironically, I was a member of the Publications and Communications Board of the American Psychological Association, which oversees all of its journals including Psychological Bulletin. And then, through the normal rotation, I became Chair of this august body! There was a sort-of joke around the P&C Board, that every Chair got one “gimme,” a policy change that everybody would go along with to allow the Chair to feel like he or she had made a mark. The gimme I wanted was to change APA’s policy on masked review to match what the statement at Psychological Bulletin implied was its policy already: Authors’ identities would be revealed to reviewers at the conclusion of the review process.

The common sense of this small change, if that’s what it even was, seemed so obvious that arguments in its favor seemed superfluous. But I came up with a few anyway:
1. The purpose of masked review, in the words of the APA Editor’s Handbook, is “to achieve unbiased review of manuscripts.” This purpose is no longer served once review is over.
2. Reviewers are unpaid volunteers. One of the few rewards of reviewing is early and first-hand contact with the research literature, which allows one to follow the development of research programs by researchers or teams of researchers over time. This reward is to some extent – to a large extent? – removed by concealing author identity even when the review is over. Moreover, the persistent concealment of author identity signals a distrust of reviewers who have given of their time.
3. Important facts can come to light when author identity is revealed. A submitted article may be a virtual repeat of a previous article by the same authors (self-plagiarism), it may contradict earlier work by the same authors without attempting to resolve the contradiction, or it may have been written by a student or advisor of a reviewer who may or may not have noticed and may or may not have notified the editor if he or she did notice. These possibilities are all bad enough during the review process; they can permanently evade detection unless author identity is unmasked at some point.
4. The APA handbook already acknowledges that masking is incomplete at best. The action editor knows author identity, and the mask often slips in uncontrolled ways (e.g., the reviewer guessing – correctly or not). So ending masking at the end of the review process is a way to equalize the status of all authors rather than have their identity guessed correctly in some cases and incorrectly guessed in others — which itself could have odd consequences for the person who was thought to be the author, but wasn’t.

Do these arguments make sense to you? Then you and I are both in the minority. The arguments failed. The P&C Board actually did vote to change APA policy, as a personal favor I think, but the change was made contingent on comments from the Board of Editors (which comprises the editors of all the APA journals). I was not included in the Board of Editors meeting, but word came back that they did not like my proposal. Among the reasons: an author’s feelings might get hurt! And, it might hurt an author’s reputation if it ever became known that he or she had an article rejected. Because, it seems, this never happens to good scientists.

Today, the policy at Psychological Bulletin reads as follows: “The identities of authors will be withheld from reviewers and will be revealed after determining the final disposition of the manuscript only upon request and with the permission of the authors.” This is pretty much where the editor of the Bulletin came down, years ago, when I tried to find out an author’s identity. I guess I did have an impact on how this policy is now worded, if not its substance.

So here is the second reason that I (sometimes) decline to do peer reviews. If the authors’ identity is masked, I ask the editor whether the masking will be removed when the review process is over. If the answer is no, then I decline. The answer is usually no, so I get to decline a fair number of reviews.

Postscript: After writing the first draft of this blog, I was invited to review a (masked) article submitted to the Bulletin. I asked my standard question about unmasking at the conclusion of the review process. Instead of an answer, I received the following email: “As it turns out, your review will not be needed for me to make a decision, so unless you have already started, please do not complete your review.” So, I didn’t.

Can Personality Change?

Can personality change? In one respect, the answer is clearly “yes,” because ample evidence shows that, overall, personality does change. On average, as people get older (after about age 20), they also become less neurotic and more conscientious, agreeable, and open, until about age 60 or so (Soto, John, Gosling & Potter, 2011). And then, after about age 65, they become on average less conscientious, agreeable, and extraverted – a phenomenon sometimes called the La Dolce Vita effect (Lucas & Donnellan, 2011). You no longer have to go to work every day, or socialize with people you don’t really like. Old age might really have some compensating advantages, after all.

But I think when people ask “can personality change” the inevitable consequences of age are not really what they have in mind. What they are asking is: Can personality change on purpose? Can I change my own personality? Or, can I change the personality of my child, or my spouse? One of the disconcerting things about being a psychologist is that the people I meet sometimes think I can answer questions like these. (This belief persists even if I try to beg off, saying “I’m not that kind of psychologist.”)

Here is the answer I have been giving for years: No. Or, almost no. Personality is the persistent foundation of who you are. Any attempt to change it, to have a chance of success, will have to be commensurate to the factors that created your personality in the first place. By which I mean: years of experience, rewards for doing some things, and punishments for doing others, as they interacted with your own particular genetic makeup over your lifetime up until now.

I’m starting to think I was wrong. Evidence accumulating since the time of Smith, Glass and Miller’s (1980) classic meta-analysis suggests that some kinds of psychotherapy, especially if combined with the right mix of medical interventions (e.g., fluoxetine), can change personality in consequential ways. An intriguing new theoretical model (Magidson, Roberts, Collado-Rodriguez & Lejuez, 2014) provides a deceptively simple route towards personality change: Change the behaviors, and the trait will follow. For example, if you can get someone in the habit of showing up for work on time, socializing with his family, and fulfilling other obligations instead of doing cocaine (this is a real example from the article just cited), he just might develop an enhanced trait of conscientiousness that will spill over in a beneficial way to all areas of his life.

Many years ago, while a graduate student at Stanford, I took a course from Albert Bandura that was titled, “Principles of Personality Change.” Ironically given the title, the course wasn’t really about personality, it was about how techniques based on social learning theory could be used to change specific problematic behaviors. Two behaviors of particular interest were agoraphobia (the fear of going outside) and fear of snakes. Stanford had an experimental clinic that would run occasional newspaper ads offering free treatment and would always be immediately deluged with calls. In particular, Palo Alto had a surprising number of housewives who were so afraid of snakes they couldn’t go outdoors. This despite the fact that in Palo Alto there are no snakes. (The social learning theorists in charge of the clinic made no Freudian inferences from this phenomenon.) Yet it turned out to be easier to train these clients not to fear snakes, than it was to convince them that in Palo Alto, there aren’t any.

The treatment involved “systematic desensitization,” in which clients are induced to perform the feared behavior through small, incremental steps. One day Dr. Bandura told our class about a recent client who graduated to the point of being able to comfortably handle a boa constrictor. After that, she was able to go home and, for the first time, confront her landlord and get her toilet fixed. I recall asking whether that didn’t show that the snake phobia treatment had an effect on her trait of assertiveness. My recollection of Dr. Bandura’s answer is less clear, but I do recall that he didn’t care for any sort of reference to personality traits. The word “trait” was (and in some quarters still is) anathema. He preferred to talk of things like generalization gradients (landlord = snake?). But I thought then, and think now, that it is more parsimonious, clear, and just plain correct to think about an effect like this in terms of traits. The United States Marines used to use the recruiting slogan, “The Marine Corps builds men.” I think this was a parallel claim, that the kind of training one would get in the course of becoming a Marine would change general personality traits that would affect behavior in all areas of life. I once mused about doing a study to find out if this was true, but never did it.

But the time for such studies has arrived: If you change a behavior in one area, will it change behaviors in other areas? If I learn to be more assertive with my boss, will I become more assertive with my spouse, or my children, or with the car dealer that sells me a lemon? If I learn to be on time for appointments, or even just make my bed regularly, will I become a more conscientious person? It is an as yet unproven but extremely intriguing possibility that the answer in these cases, and others, might just be “yes.” If that’s true, the implications for improving human well-being are profound. Maybe we really can become the people we’d prefer to be, and help others to do the same.


Lucas, R.E., & Donnellan, M.B. (2011). Personality development across the life span: Longitudinal analyses with a national sample from Germany. Journal of Personality and Social Psychology, 101, 847-861.
Magidson, J.F., Roberts, B.W., Collado-Rodriguez, A., & Lejuez, C.W. (2014). Theory-driven intervention for changing personality: Expectancy value theory, behavioral activation, and conscientiousness. Developmental Psychology, 50, 1442-1450.
Smith, M.L., Glass, G.V., & Miller, T.I. (1980). The benefits of psychotherapy. Baltimore: Johns Hopkins University Press.
Soto, C.J., John, O.P., Gosling, S.D., & Potter, J. (2011). Age differences in personality traits from 10 to 65: Big Five domains and facets in a large cross-sectional sample. Journal of Personality and Social Psychology, 100, 330-348.


NSF Gets an Earful about Replication

I spent last Thursday and Friday (February 20 and 21) at an NSF workshop concerning the replicability of research results. It was chaired by John Cacioppo and drew about 30 participants, including such well-known contributors to the discussion as Brian Nosek, Hal Pashler, Eric Eich, and Tony Greenwald, to name a few.  Participants also included officials from NIH, NSF, the White House Office on Science and Technology and at least one private foundation. I was invited, I presume, in my capacity as Past-President of SPSP and chair of an SPSP task force on research practices which recently published a report on non-retracted PSPB articles by investigators who retracted articles elsewhere, and a set of recommendations for research and educational practice, which was just published in PSPR.

Committees, task forces and workshops – whatever you call them – about replicability issues have become almost commonplace.  The SPSP Task Force was preceded by a meeting and report sponsored by the European Association of Personality Psychology, and other efforts have been led by APS, the Psychonomic Society and other organizations.  Two symposia on the subject were held at the SPSP meeting in Austin just the week before.  But this discussion was perhaps special, because it was the first (to my knowledge) to be sponsored by the US government, with the explicit purpose of seeking advice about what NSF and other research agencies should do.  I think it is fair to say: When replication is discussed in a meeting with representatives from NIH, NSF and the White House, the issue is on the front burner!

The discussion covered several themes, some of which are more familiar than others. From my scribbled notes, I can list a few that received particular attention.

1.      It’s not just – or even especially – about psychology.  I was heartened to see that the government representatives saw the bulk of problems with replication as lying in fields such as molecular biology, genetics, and medicine, not in psychology.  Psychology has problems too, but is widely viewed as the best place to look for solutions since the basic issues all involve human behavior.  It makes me a bit crazy when psychologists say (or sometimes shout) that everything is fine, that critics of research practices are “witch hunting,” or that examining the degree to which our science is replicable is self-defeating.  Quite the reverse: psychology is being looked to as the source of the expertise that can improve all of science.  As a psychologist, I’m proud of this.

2.      Preregistration.  Widespread enthusiasm early on Thursday for the idea of preregistering hypotheses waned by Friday afternoon.  In his concluding remarks, Tony Greenwald listed it among suggestions that he found “doubtful,” in part because it could increase pressure on investigators to produce the results they promised, rather than be open to what the data are really trying to tell them.  Greg Francis also observed that preregistration buys into the idea that the aim of science is simply to dichotomously confirm or disconfirm findings, an idea that is the source of many of our problems to begin with.  In a pithy phrase, he said we should “measure effects, not test them.”
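
To make the "measure, not test" idea a bit more concrete, here is a minimal sketch of my own (nothing like it was presented at the workshop). It reports a correlation together with an approximate 95% confidence interval rather than a yes/no significance verdict. The function name correlation_ci and the example values (r = .30, N = 40) are made up for illustration; the interval uses the standard Fisher z-transformation, which is only an approximation.

```python
import math


def correlation_ci(r, n, z_crit=1.96):
    """Approximate confidence interval for a Pearson correlation.

    Uses the Fisher z-transformation; z_crit=1.96 gives roughly a 95% interval.
    The name and defaults are illustrative assumptions, not a standard API.
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("n must be greater than 3")
    z = math.atanh(r)              # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)    # approximate standard error of z
    lower = math.tanh(z - z_crit * se)
    upper = math.tanh(z + z_crit * se)
    return lower, upper


if __name__ == "__main__":
    # Hypothetical study: r = .30 observed with N = 40.
    low, high = correlation_ci(0.30, 40)
    print(f"r = .30, N = 40: approximate 95% CI [{low:.2f}, {high:.2f}]")
```

Run with a small N, the interval is wide; rerun it with a much larger N and it narrows. Reporting the interval keeps that uncertainty visible, which a bare "significant" or "not significant" does not.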

3.      The tradeoff between “ground-breaking,” innovative studies and solid science.  In a final editorial, the previous editor of Psychological Science described common reasons for “triaging” (rejecting without review) submitted articles.  It seems he and his associate editors categorized these reasons with musical references, which they found amusing.  One of the reasons was (and I quote) the “Pink Floyd rejection: Most triaged papers were of this type; they reported work that was well done and useful, but not sufficiently groundbreaking. So the findings represented just another brick in the wall of science” (emphasis in the original).  I thought this label was appalling even before I saw Greg Francis’s article, now in press at Psychonomic Bulletin and Review, which reports that, among the articles published during this editor’s term that Francis reviewed, 82% showed signs “that unsuccessful findings were suppressed, the experiments or analyses were improper, or that the theory does not properly account for the data.” Well, they weren’t bricks in the wall; that’s for sure!

It seems clear that grant panels and journal editors have traditionally overvalued flashy findings, especially counter-intuitive ones.  This particular editor is not alone in his value system. But “counter-intuitive” means prima facie implausible, and such findings should demand stronger evidence than the small Ns and sketchily-described methods that so often seem to be their basis.  More broadly, when everybody is trying to be ground-breaking at the same time, who will produce the scientific “bricks” on which knowledge can securely rest?  Given this state of affairs, it is not surprising that so many findings fail to hold up under scrutiny, and that the cutest findings are, as a rule, the most vulnerable.

But while everybody around the table wistfully agreed that solid research and replication deserve more respect than they typically receive, a Stanford Dean who was there pointed out the obvious (and I paraphrase): “Look: nobody is going to get tenure at Stanford [or, I would add, at any other ambitious research university] from replicating other people’s findings.”

Science needs to move forward or die; science needs a strong base of reliable fact from which to proceed.  The tension between these two needs is not an easy one to resolve.

4.      Do we look forward or look back?  I was glad Greg Francis was at the table because his recent article puts this issue front and center.  He named names!  Out there in the twitterverse, I’ve seen complaints that this does more harm than good.  It makes people defensive, it probably tars some people unfairly (e.g., a grad student co-author on an article p-hacked by her advisor), and in general naming names causes an uproar of a sort not exactly conducive to sober scientific discussion.  Hal Pashler ran into exactly the same issue a couple of years ago when he and a student uncovered – and named – studies reporting implausibly large “voodoo correlations” in fMRI research.

It is uncomfortable to accuse someone of having published inaccurate, unreliable or possibly even fraudulent research.  It is even more uncomfortable, I am sure, to be the person accused.  It is probably for this reason that we hear so many calls to “look forward, not back.”  This was a specific statement of both the EAPP report and the SPSP report, to name just two, and such comments were heard at the NSF meeting as well:  Let’s clean up our act and do right from now on.  Let’s not go back and make people uncomfortable and defensive.

I understand this view and have even on occasion expressed it.  But it does worry me.  If we think the findings enshrined in our journals are important – and presumably we do; they are the output of the whole scientific enterprise in which we are engaged – then don’t we also have an obligation to clean out the findings that turn out to be unreliable or just plain wrong?  And to do this, don’t we have to be specific?  It’s hard to avoid concluding that the answer is “yes.”  But who will do it?  Do we want people to build careers looking for p-hacked articles in the published literature?  Do we want to damage the reputations and possibly careers of people who were just doing “business as usual” the way they were taught?  Do you want to do this?  I know I don’t.

5.      What can the government do?  Addressing this question was the whole ostensible reason for the conference.  Suggestions fell into three broad categories:

a.      Reform grant review practices.  NSF reviewers are, at present, explicitly required to address whether a given grant proposal is “potentially transformative.”  They are not explicitly required to address whether the findings on which the research is based can be deemed reliable, or whether the proposed research includes any safeguards to ensure the replicability of its findings.  Maybe they should.

b.      Fund replications.  Before the meeting, one colleague told me that I should inform NSF that “if you pay us to do it, we’ll do it.”  In other words, funds should be available for replication research.  Maybe an institute should be set up to replicate key findings, or grant proposals to perform a program of important replications should receive a sympathetic hearing, or maybe ordinary grants should include a replication study or two.  None of this happens now.

c.      Fund research on research.  I was not expecting this theme to emerge as strongly as it did.  It seems like NSF may be interested in looking at ways to support “meta-science,” such as the development of techniques to detect fraudulent data, or the effect of various incentive structures on scientific behavior.  Psychology includes a lot of experts on statistical methodology as well as on human behavior; there seems to be an opportunity here.

6.      Unintended consequences.  Our meeting chair was concerned about this issue, and rightly so.  He began the meeting by recalling the good intentions that underlay the initiation of IRBs to protect human subjects, and the monsters of mission creep, irrelevant regulation, and bureaucratic nonsense so many have become.  (These are my words, not his.)  Some proposed reforms entail the danger of going down the same road.  Even worse:  as we enshrine confidence intervals over p-values, or preregistration over p-hacking, or thresholds of permissible statistical power or even just N, do we risk replacing old mindless rituals with new ones?

7.      Backlash and resistance.  This issue came up only a couple of times and I wish it had gotten more attention.  It seemed like nobody at the table (a) denied there was a replicability problem in much of the most prominent research in the major journals or (b) denied that something needed to be done.  As one participant said, “we are all drinking the same bath water.”  (I thought this phrase usually referenced Kool-Aid, but never mind.)  Possibly, nobody spoke up because of the apparent consensus in the room.  In any case, there will be resistance out there.  And we need to watch out for it.

I expect the resistance to be of the passive-aggressive sort.  Indeed, that’s what we have already seen.  The sudden awarding of APA’s major scientific award to someone in the midst of a replication controversy seems like an obvious thumb in the eye to the reform movement.  To underline the point, a prominent figure in social psychology (he actually appears in TV commercials!) tweeted that critics of the replicability of the research would never win such a major award.  Sadly, he was almost certainly correct.

One of Geoff Cumming’s graduate students, Fiona Fidler, recently wrote a thesis on the history of null hypothesis significance testing.  It’s a fascinating read and I hope will be turned into a book soon. One of its major themes is that NHST has been criticized thoroughly and compellingly many times over the years.  Yet it persists, even though – and, ironically, perhaps because – it has never really been explicitly defended!  Instead, the defense of NHST is largely passive.  People just keep using it.  Reviewers and editors just keep publishing it; granting agencies keep giving money to researchers who use it.  Eventually the critiques die down.  Nothing changes.

That could happen this time too.  The defenders of the status quo rarely actively defend anything. They aren’t about to publish articles explaining why NHST tells you everything you need to know, or arguing that effect sizes of r = .80 in studies with an N of 20 represent important and reliable breakthroughs, or least of all reporting data to show that major counter-intuitive findings are robustly replicable.  Instead they will just continue to publish each other’s work in all the “best” places, hire each other into excellent jobs and, of course, give each other awards.  This is what has happened every time before.

Things just might be different this time.  Doubts about statistical standard operating procedure and the replicability of major findings are rampant across multiple fields of study, not just psychology.  And these issues have the attention of major scientific societies and even the US Government.  But the strength of the resistance should not be underestimated.

This meeting was very recent and I am still mulling over my own reactions to these and other issues.  This post is, at best, a rough first draft.  I welcome your reactions, in the comments section or elsewhere in the blogosphere/twitterverse.

Why I Decline to do Peer Reviews (part one): Re-reviews

Like pretty much everyone fortunate enough to occupy a faculty position in psychology at a research university, I am frequently asked to review articles submitted for publication to scientific journals. Editors rely heavily on these reviews in making their accept/reject decisions. I know: I’ve been an editor myself, and I experienced first-hand the frustrations in trying to persuade qualified reviewers to help me assess the articles that flowed over my desk in seemingly ever-increasing numbers. So don’t get me wrong: I often do agree to do reviews – around 25 times a year, which is probably neither much above nor below the average for psychologists at my career stage. But sometimes I simply refuse, and let me explain one reason why.

The routine process of peer review is that the editor reads a submitted article, selects 2 or 3 individuals thought to have reasonable expertise in the topic, and asks them for reviews. After some delays due to reviewers’ competing obligations, trips out of town, personal emergencies or – the editor’s true bane – lengthy failures to respond at all, the requisite number of reviews eventually arrives. In a very few cases, the editor reads the reviews, reads the article, and accepts it for publication. In rather more cases, the editor rejects the article. The authors of the remaining articles get a letter inviting them to “revise and resubmit.” In such cases, in theory at least, the reviewers and/or the editor see a promising contribution. Perhaps a different, more informative statistic could be calculated, an omitted relevant article cited, or a theoretical derivation explained more clearly. But the preliminary decision clearly is – or should be – that the research is worth publishing; it could just be reported a bit better.

What happens then? What should happen, in my opinion, is that the author(s) complete their revision, the editor reads it, perhaps refreshing his or her memory by rereading the reviewers’ comments, and then makes a final accept/reject decision. After all, the editor is – presumably and hopefully – at least somewhat cognizant of the topic area and perhaps truly expert. The reviewers were selected for their specific expertise and have already commented on the article, sometimes in great detail. Armed with that, the editor should not find it too difficult to make a final decision.

Too often, this is not what happens. Instead, the editor sends the revised paper out for further review! Usually, this entails sending it to the same individuals who reviewed the article once already. Sometimes – in a surprisingly common practice that every author deeply loathes – the editor also sends the revised article to new reviewers. Everyone then weighs in, looking to make sure their favorite comments were addressed, and making further comments for further revision. The editor reads the reviews and perhaps now makes a final decision, but sometimes not. Yes, the author may be asked to revise and resubmit yet again – and while going back to the old reviewers a third time (plus yet new reviewers) is less likely, it is far from unheard of.

What is the result of this process? An editor who acts this way would no doubt say that the article is getting better and that the journal is, as a result, publishing better science. Perhaps. But there are a few other results:

  1. The editor has effectively dodged much of the responsibility for his/her editorial decision. Some editors add up the reviewers’ verdicts as if they were votes; some insist on unanimous positive verdicts from long lists of reviewers; in every case the editor can and often does point to the reviewers – rather than to himself or herself – as the source of negative comments or outcomes.
  2. The review process has been extended to be epically long. It is, sadly, not especially unusual for this process of reiteration to take a year or more. Any review process shorter than several months is considered lightning-fast for most journals in psychology.
  3. The reviewers have been given the opportunity to micro-manage the paper. They can and often do demand that new references be inserted (sometimes articles written by the reviewers themselves), theoretical claims be toned down (generally ones the reviewers disagree with), and different statistics be calculated. Reviewers may even insist that whole sections be inserted, removed, or completely rewritten.
  4. (As a result of point 3): The author is driven to, in a phrase we have all heard, “make the reviewers happy.” In an attempt to be published, the author will (a) insert references he/she does not actually think are germane, (b) make theoretical statements different from what he/she actually believes to be correct and (c) take out sections he or she thought were important, add sections he or she thinks are actually irrelevant, and rephrase discussions using another person’s words. The author’s name still goes on the paper, but the reviewers have become, in effect, co-authors. In a final bit of humiliating obsequiousness, the “anonymous reviewers” may be thanked in a footnote. This expressed gratitude is not always 100% sincere.

These consequences are all bad, but the worst is number 4. A central obligation of every scientist – of every scholar in every field, actually – is to say what one really thinks. (If tenure has a justification, this is it.) And yet the quest to “make the reviewers happy” leads too many authors to say things they don’t completely believe. At best, they are phrasing things differently than they would prefer, or citing a few articles that they don’t really regard as relevant. At worst, they are distorting their article into an incoherent mish-mash co-written by a committee of anonymous reviewers – none of whom came up with the original idea for the research, conducted the study, or is held accountable for whether the article finally published is right or wrong.

So that’s why, on the little box at the bottom of the peer review sheet that asks, “Would you be willing to review a revision of this article?” I check “no.” Please, editors: Evaluate the article that was submitted. If it needs a few minor tweaks, give the author a chance to make them. If it needs more than that, reject it. But don’t drag out the review process to the end of time, and don’t let a panel of reviewers – no matter how brilliant – co-author the article. They should write their own.