A Replication Initiative from APS

Several of the major research organizations in psychology, including APA, EAPP (European Association of Personality Psychology) and SPSP, have been talking about the issue of replicability of published research, but APS has made the most dramatic move so far to actually do something about it.  The APS journal Perspectives on Psychological Science today announced a new policy to enable the publication of pre-registered, robust studies seeking to replicate important published findings.  The journal will add a new section for this purpose, edited by Dan Simons and Alex Holcombe.  For details, click here.

This idea has been kicked around in other places, including proposals for new journals exclusively dedicated to replication studies.  One of the most interesting aspects of the new initiative is that instead of isolating replications in an independent journal few people might see, they will appear in an already widely-read and prestigious journal with a high impact factor.

When a similar proposal — in the form of a suggested new journal — was floated in a meeting I attended a few weeks ago, it quickly stimulated controversy. Some saw the proposal as a self-defeating attack on our own discipline that would only undermine the credibility of psychological research.  Others saw it as a much-needed self-administered corrective action; better to come from within the field than be imposed from outside. And still others — probably the largest group — raised and got a bit bogged down in worrying about specifics of implementation.  For example, what will stop a researcher from running a failed replication study, and only then “pre-registering” it?  How many failed replications does it take to overturn the conclusions of a published study, and what does “failed replication” mean exactly, anyway?  What degree of statistical power should replication studies be required to have, and what effect size should be used to make this calculation?  Finally, running these replication studies (as described in the PPS policy) looks to be a demanding and expensive enterprise. Who will have sufficient time, money and/or incentive to run them?   These questions all lack ready answers.
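To make the power question concrete, here is a minimal sketch of how the required sample size depends on the effect size one assumes.  It uses the standard normal approximation for comparing two independent group means, with the effect size expressed as Cohen's d.  The function name `n_per_group` and the idea of halving the published effect size are my own illustrative assumptions, not anything specified in the PPS policy.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, power=0.80, alpha=0.05):
    """Approximate participants needed per group for a two-sided,
    two-group comparison (normal approximation, effect size = Cohen's d)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile corresponding to the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


if __name__ == "__main__":
    published_d = 0.5  # hypothetical effect size reported in an original study
    for label, d in [("published effect", published_d),
                     ("half the published effect", published_d / 2)]:
        print(f"{label} (d = {d}): about {n_per_group(d)} per group for 80% power")
```

If the true effect is only half as large as the published one, the required sample roughly quadruples, which is one reason the choice of effect size for the power calculation is so contentious.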

My own view is that the answers to these questions — or their ultimate unanswerability — will only be established through experimentation.  Somebody needs to try it and see what happens.  I admire APS for taking this step and am looking forward to seeing what, if anything, ultimately becomes of it.

Speaking of replication…

A small conference sponsored by the European Association of Personality Psychology, held in Trieste, Italy last summer, addressed the issue of replicability in psychological research.  Discussions led to an article describing recommended best practices, and the article is now “in press” at the European Journal of Personality.  You can see the article if you click here.

Update November 8: Courtesy of Brent Roberts, the contents of the special issue of Perspectives on Psychological Science on replicability are available.  To go to his blog post, with links, click here.

On inference (updated x 2)

At a conference I attended last month, I heard for the first time about an Oxford philosopher who, according to his fellow philosophers, has pretty much proved that we live inside of a computer simulation. I’ll take the philosophers’ word for it when they say the inferential logic appears to be impeccable.

Which brings me to draw the following lesson: Any system of rigid (or automatic) inferential rules, followed out on a long enough chain, will eventually lead to an absurd conclusion. (If someone else has already coined this principle as a proverb, or something, I’d love to hear about it.)

For example, consider rigid applications of constitutional law. The Second Amendment of the US Constitution says that the right to bear arms shall not be infringed. Therefore, as an American citizen, I cannot be prohibited from owning a pistol, an assault rifle or (why not?) a nuclear bomb. The logic is fine; the conclusion is ridiculous.

The vulnerability of automatic systems to absurd outcomes is one reason  I dislike the term “inferential statistics.” There is really no such thing. All statistics are descriptive. Some describe the probability of a result under the null, which is not uninteresting. But this calculation can’t do your inference for you, as brainlessly comforting as that would be. Instead, you need to think about the actual result. You should consider, for example, its a priori plausibility, its theoretical context, and its consistency with other known facts, not to mention its replicability. You might even add a dollop of – dare I say it? – common sense.

Of course, your inference might be wrong. That’s the thing about inferences. But a system of rules that tries to make your inferences for you (especially if the rules include arbitrary standards like the .05 threshold) risks drawing conclusions that are out-and-out absurd.

Update November 10, 2012: On reading post-election commentary I’m realizing that what I said above has some problems, or is at best incomplete.  Right before the election a battle raged on the internet between Nate Silver, of the incomparable blog fivethirtyeight.com, and various pundits, most but not all of whom were conservative.  Silver earned the pundits’ ire by using a sophisticated statistical prediction model that produced a high-confidence prediction of an Obama victory, whereas the pundits “knew in their guts” or from “years of experience” that Romney was going to win (in a landslide, some even said).  Of course we know now that Silver was right and the pundits wrong.  This outcome is a clear victory for statistical prediction over what the pundits surely would have been willing to characterize as “a priori plausibility…theoretical context… consistency with other known facts…[and] common sense” (see above).

So where does that leave my little aphorism and its supposed implications?  In an uncomfortable spot, that’s where. Common sense and “gut reactions” (about which Gerd Gigerenzer has written an entire, brilliant book) remain indispensable, especially in situations where the data to calculate a Silver-ish model aren’t available, which is probably most situations in real life.  But relying on common sense and the gut also can make one’s conclusions vulnerable to wishful thinking, which seems to be what happened in the case of the election pundits.  When you have a wealth of relevant data and a model, based on past experience and reasonable theorizing, for combining them into a prediction, then you probably do want to rely on the model and not on so-called common sense.

However: Note the word “probably” in the preceding sentence.  Even Nate Silver’s final prediction was only issued with 91% confidence.  Maybe the remaining 9% is where common sense saves itself.  I still don’t like the term “inferential statistics” and the arbitrary .05 threshold for deciding whether something is true (which Silver doesn’t use, by the way; he always reports exact probabilities).  And, I still don’t think we live inside a computer simulation.  Who would program a universe like this?

Update November 14, 2012: It turns out that of all the pundits and prognosticators for the 2012 presidential election, only three got perfect scores on predicting the electoral college.  Two of them were statistical modelers: the previously mentioned Nate Silver (of fivethirtyeight.com) and a professor at Emory University named Drew Linzer.  But the one with the very best predictive record, with a near-perfect estimate of the margin of victory or defeat in each of the swing states, was Markos Moulitsas, proprietor of the liberal blog dailykos. What was the edge he had over the statisticians?  To quote his own answer:

 All three of us used data to arrive at our conclusions. The difference between them and me? They were wedded to their algorithmic and automatic models, but my model is manual, allowing me the freedom to evaluate each piece of data on its merits and separate the wheat from the chaff, while mixing in early vote performance to further refine my calls.

That’s the point I’ve been struggling to make, above.  We can and should be informed by our statistical calculations — and not simply ignore them, as a startling number of pundits did — but we still have to take responsibility for the conclusions we draw.  Sometimes human judgment adds value, and every once in a great while it can save us from fatal error.  Or just silly conclusions.  I still don’t believe we live inside a computer simulation.

The perilous plight of the (non)-replicator

As I mentioned in my previous post, while I’m sympathetic to many of the ideas that have been suggested about how to improve the reliability of psychological knowledge and move towards “scientific utopia,” my own thoughts are less ambitious and keep returning to the basic issue of replication.  A scientific culture that consistently produced direct replications of important results would be one that eventually purged itself of many of the problems people have been worrying about lately, including questionable research practices, p-hacking, and even data fraud.

But, as I also mentioned in my previous post, this is obviously not happening.  Many observers have commented on the institutional factors that discourage the conduct and, even more, the publication of replication studies.  These include journal policies, hiring committee practices, tenure standards, and even the natural attractiveness of fun, cute, and counter-intuitive findings.  In this post, I want to focus on a factor that has received less attention: the perilous plight of the (non)-replicator.

The situation of a researcher who has tried and failed to replicate a prominent research finding is an unenviable one.  My sense is that the typical non-replicator started out as a true believer, not a skeptic.  For example, a few years ago I spent sabbatical time at a large, well-staffed and well-equipped institute in which several researchers were interested in a very prominent finding in their field, and wished to test further hypotheses they had generated about its basis.  As good scientists, they began by making sure that they could reproduce the basic effect.  To their surprise and increasing frustration, they simply could not.  They followed the published protocol, contacted the original investigator for more details, tweaked this, tweaked that.  (As I said, they had lots of resources.)  Nothing.  Eventually they simply gave up.

Another anecdote.  A graduate student of a colleague of mine was intrigued by a finding published in Science.  You don’t see psychological research published in that ultimately prestigious journal very often, so it seemed like a safe bet that the effect was real and that further creative studies to develop its theoretical foundation would be a great project towards a dissertation and a research career.  Wrong.  After about three years of failing to replicate the original finding, the advisor finally had to insist that the student find another topic and start over.  You can imagine the damage this experience did to the student’s career prospects.

Stories like these are legion, but you don’t see many of them in the published literature. Indeed, I suspect most failures to replicate are never written up, much less submitted for publication. There are probably many reasons, but consider just one:  What happens when a researcher does decide to “go public” with a failure – or even repeated, robust failures – to replicate a prominent finding?  If some recent, highly publicized cases are any guide, several unpleasant outcomes can be anticipated.

First, the finding will be vehemently defended, sometimes not just by its originator but also by the acolytes that a surprising number of prominent researchers seem to have attracted into loyal camps.[i]  The defensive articles, written by prominent people with considerable skills, are likely to be strongly argued, eloquent, and long.  The non-replicator has a good chance of being publicly labeled as incompetent if not deliberately deceptive, and may be compared to skeptics of global warming!  Even a journalist who has the temerity to write about non-replication issues risks being dismissed as a hack.  This situation can’t be pleasant. It takes a certain kind of person to be willing to be dragged into it – and not necessarily the same kind of person who was attracted to a scientific career in the first place.

It gets worse.  The failed replicator also risks various kinds of subtle and not-so-subtle retaliation.  I was at a conference a few weeks ago where I heard, first-hand, from a researcher who found that a promotion letter that subtly but powerfully derogated the researcher’s career was not only an outlier with respect to the other letters in the file, but was written by a practitioner in a field that the researcher’s work had dared to question.  Another first-hand story concerned a researcher who, after publishing some reversals of findings that had been pushed for years by a powerful school of investigators, found that external reviews of submitted journal articles on other topics had suddenly turned harshly critical.  And, in an episode I had the opportunity to observe directly, a professor and graduate student who had a paper questioning an established finding actually accepted for publication in a prominent journal found themselves subjected to threats!  The person who “owned” the original effect said to them: you need to withdraw this paper.  I’m the most prominent researcher in the field and the New York Times will surely call me for comment.  I will be forced to publicly expose your incompetence.  Your career will be damaged; your student’s career will be ruined.  The threat concluded, darkly: I say this as a friend; I only have your best interests at heart.

Do you know other stories like this?  There is a good chance you do.  Publishing a failure to replicate a prominent finding, or even challenging the accepted state of the evidence in any way, is not for the squeamish.  No wonder the typical response of a failed replicator is simply to drop the whole thing and walk away.  The reaction makes sense, and from the point of view of individual self-interest – especially for a junior researcher — is probably the rational thing to do.  But it’s disastrous for the accumulation of reliable scientific knowledge.

This is a cultural problem that needs to be solved.  As individuals and as members of a research culture, we need to clarify two things.  First, we have to make clear that denunciations of people with contrary findings as incompetent or deceptive, retaliation through journal reviews and promotion letters, and overt threats, are, in a phrase, SERIOUSLY NOT OK.  This should go without saying, but – judging from what we’ve seen happen recently – apparently it doesn’t.

Second, and only slightly less obviously, we should try to recognize that a failure to confirm one of your findings does not have to be viewed as an attack.  Indeed, a colleague attending this same meeting pointed out that a failure to replicate is a sort of compliment: it means your work was interesting and potentially important enough to merit further investigation!  It’s much worse – and far more common – simply to be ignored.  A failure to replicate should be seen not as an attack but as an invitation to clarify what’s going on.  After all, if you couldn’t replicate one of your effects in your own lab what would you do?  Attack yourself?  No, you’d probably sit down and try to figure out what happened.  So why is it so different if it happens in someone else’s lab?  This could be the beginning of a joint effort to share methods, look at data together, and come to a collaborative understanding of an important scientific issue.

I know I’m dreaming here.  Even a psychologist knows enough about human nature to understand that such an outcome goes against all of our natural defensive inclinations.  But it’s a nice thought, and maybe if we hold it in mind even as an unattainable ideal it might help us to be not quite so vehement, a little less personal, and a bit more open minded in our responses to scientific challenge.

How can we enforce better responses to failures to replicate?  Sociology teaches us that in small communities gossip is an effective mechanism to enforce social norms.  Research psychology is effectively a small town, a few thousand people at the most spread out around the world but in regular contact nonetheless.  So the late-night gossip about defensive reactions, retaliation, and threats is one way to ensure that such conduct carries a social price.

In the longer term, we need to change our overall social norm of what’s acceptable.  We need to accept, practice, and, above all, teach constructive approaches to scientific controversy.  This is a very long road.  But, as the proverb tells us, it starts with one step.

Note: This post is based on a brief talk given at a conference on the “Decline Effect,” held at UC Santa Barbara in October, 2012.  The conference was organized by Jonathan Schooler and sponsored by the Fetzer-Franklin Foundation. As always, this post expresses my personal opinion and not necessarily that of any other institution or individual.

[i] Typically, their defense will draw on the existence of “conceptual replications,” studies that found theoretically parallel effects using different methods.  However, as Hal Pashler has noted, no matter how many conceptual replications are reported, there is no way to know how many failed efforts never saw the light of day.  This is why it is essential to find out whether the original effect was reliable.

Replication, period.

Can we believe everything (or anything) that social psychological research tells us?  Suddenly, the answer to this question seems to be in doubt.  The past few months have seen a shocking series of cases of fraud (researchers literally making their data up) by prominent psychologists at prestigious universities.  These revelations have catalyzed an increase in concern about a much broader issue, the replicability of results reported by social psychologists.  Numerous writers are questioning common research practices such as selectively reporting only studies that “work” and ignoring relevant negative findings that arise over the course of what is euphemistically called “pre-testing,” increasing N’s or deleting subjects from data sets until the desired findings are obtained and, perhaps worst of all, being inhospitable or even hostile to replication research that could, in principle, cure all these ills.

Reaction is visible.  The European Association of Personality Psychology recently held a special three-day meeting on the topic, which is to result in a set of published recommendations for improved research practice; a well-financed conference in Santa Barbara in October will address the “decline effect” (the mysterious tendency of research findings to fade away over time); and the President of the Society for Personality and Social Psychology was recently motivated to post a message to the membership expressing official concern.  These are just three reactions that I personally happen to be familiar with; I’ve also heard that other scientific organizations and even agencies of the federal government are looking into this issue, one way or another.

This burst of concern and activity might seem to be unjustified.  After all, literally making your data up is a far cry from practices such as pre-testing, selective reporting, or running multiple statistical tests.  These practices are even, in many cases, useful and legitimate.  So why did they suddenly come under the microscope as a result of cases of data fraud?  The common thread seems to be the issue of replication.  As I already mentioned, the idealistic model of healthy scientific practice is that replication is a cure for all ills.  Conclusions based on fraudulent data will fail to be replicated by independent investigators, and so eventually the truth will out.  And, less dramatically, conclusions based on selectively reported data or derived from other forms of quasi-cheating, such as “p-hacking,” will also fade away over time.
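For readers who want to see how a practice like “increasing N’s until the desired findings are obtained” can mislead, here is a minimal simulation sketch.  It assumes data drawn from a population with no true effect (mean 0, known standard deviation 1) and a simple two-sided z-test.  The function name `false_positive_rate`, the sample sizes, and the number of simulated studies are all illustrative choices of mine, not taken from any particular study.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def false_positive_rate(looks, n_sims=2000, alpha=0.05, seed=1):
    """Share of simulated null studies declared 'significant' when the
    researcher tests at each sample size in `looks` and stops at the first hit."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    max_n = max(looks)
    hits = 0
    for _ in range(n_sims):
        # The null hypothesis is true: every observation has mean 0, sd 1.
        data = [rng.gauss(0.0, 1.0) for _ in range(max_n)]
        for n in looks:
            z = fmean(data[:n]) * sqrt(n)  # z statistic for the first n subjects
            if abs(z) > z_crit:
                hits += 1
                break  # stop collecting data once p < alpha
    return hits / n_sims


if __name__ == "__main__":
    fixed = false_positive_rate([50])                    # one planned test at N = 50
    peeking = false_positive_rate([10, 20, 30, 40, 50])  # test after every 10 subjects
    print(f"fixed N:   {fixed:.3f}")
    print(f"peeking:   {peeking:.3f}")
```

With a single planned test the false-positive rate should stay near the nominal .05, while testing after every batch of subjects pushes it well above that.  No single test in the sequence is illegitimate, which is what makes the practice questionable rather than fraudulent.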

The problem is that, in the cases of data fraud, this model visibly and spectacularly failed.  The examples that were exposed so dramatically — and led tenured professors to resign from otherwise secure and comfortable positions (note:  this NEVER happens except under the most extreme circumstances) — did not come to light because of replication studies.  Indeed, anecdotally — which, sadly, seems to be the only way anybody ever hears of replication studies — various researchers had noticed that they weren’t able to repeat the findings that later turned out to be fraudulent, and one of the fakers even had a reputation of generating data that were “too good to be true.”  But that’s not what brought them down.  Faking of data was only revealed when research collaborators with first-hand knowledge — sometimes students — reported what was going on.

This fact has to make anyone wonder: what other cases are out there?  If literal faking of data is only detected when someone you work with gets upset enough to report you, then most faking will never be detected.  Just about everybody I know — including the most pessimistic critics of social psychology — believes, or perhaps hopes, that such outright fraud is very rare.  But grant that point and the deeper moral of the story still remains:  False findings can remain unchallenged in the literature indefinitely.

Here is the bridge to the wider issue of data practices that are not outright fraudulent, but increase the risk of misleading findings making it into the literature.  I will repeat: so-called “questionable” data practices are not always wrong (they just need to be questioned).  For example, explorations of large, complex (and expensive) data sets deserve and even require multiple analyses to address many different questions, and interesting findings that emerge should be reported.  Internal safeguards are possible, such as split-half replications or randomization analyses to assess the probability of capitalizing on chance.  But the ultimate safeguard to prevent misleading findings from permanent residence in (what we think is) our corpus of psychological knowledge is independent replication.  Until then, you never really know.
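As one illustration of the internal safeguards mentioned above, here is a minimal sketch of a randomization (permutation) test for a difference between two group means.  The function name `permutation_p_value` and the example scores are hypothetical, and a real analysis would need to respect the actual design of the study.

```python
import random
from statistics import fmean


def permutation_p_value(group_a, group_b, n_perm=5000, seed=0):
    """Two-sided randomization test: how often does randomly reshuffling the
    group labels produce a mean difference at least as large as the observed one?"""
    rng = random.Random(seed)
    observed = abs(fmean(group_a) - fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign scores to groups at random
        diff = abs(fmean(pooled[:n_a]) - fmean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # small tolerance for floating-point ties
            at_least_as_extreme += 1
    # Add 1 to numerator and denominator so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    treatment = [10, 11, 12, 13, 14, 15]
    control = [1, 2, 3, 4, 5, 6]
    print(f"p = {permutation_p_value(treatment, control):.4f}")
```

A check like this estimates how easily chance alone could have produced the observed difference in this one data set.  It says nothing about whether another lab would find the same thing, which is why it complements independent replication rather than replacing it.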

Many remedies are being proposed to cure the ills, or alleged ills, of modern social psychology.  These include new standards for research practice (e.g., registering hypotheses in advance of data gathering), new ethical safeguards (e.g., requiring collaborators on a study to attest that they have actually seen the data), new rules for making data publicly available, and so forth.  All of these proposals are well-intentioned but the specifics of their implementation are debatable, and ultimately raise the specter of over-regulation.  Anybody with a grant knows about the reams of paperwork one now must mindlessly sign attesting to everything from the exact percentage of their time each graduate student has worked on your project to the status of your lab as a drug-free workplace.  And that’s not even to mention the number of rules — real and imagined — enforced by the typical campus IRB to “protect” subjects from the possible harm they might suffer from filling out a few questionnaires.  Are we going to add yet another layer of rules and regulations to the average over-worked, under-funded, and (pre-tenure) insecure researcher?  Over-regulation always starts out well-intentioned, but can ultimately do more harm than good.

The real cure-all is replication.  The best thing about replication is that it does not rely on researchers doing less (e.g., running fewer statistical tests, only examining pre-registered hypotheses, etc.), but it depends on them doing more.  It is sometimes said the best remedy for false speech is more speech.  In the same spirit, the best remedy for misleading research is more research.

But this research needs to be able to see the light of day.  Current journal practices, especially among our most prestigious journals, discourage and sometimes even prohibit replication studies from publication.  Tenure committees value novel research over solid research.  Funding agencies are always looking for the next new thing — they are bored with the “same old same old” and give low priority to research that seeks to build on existing findings — much less seeks to replicate them.  Even the researchers who find failures to replicate often undervalue them.  I must have done something wrong, most conclude, stashing the study into the proverbial “file drawer” as an unpublishable, expensive and sad waste of time.  Those researchers who do become convinced that, in fact, an accepted finding is wrong, are unlikely to attempt to publish this conclusion.  Instead, the failure becomes fodder for late-night conversations, fueled by beverages at hotel bars during scientific conferences.  There, and pretty much only there, can you find out which famous findings are the ones that “everybody knows” can’t be replicated.

I am not arguing that every replication study must be published.  Editors have to use their judgment.  Pages really are limited (though less so in the arriving age of electronic publishing) and, more importantly, editors have a responsibility to direct the limited attentional resources of the research community to articles that matter.  So any replication study should be carefully evaluated for the skill with which it was conducted, the appropriate level of statistical power, and the overall importance of the conclusion.  For example, a solid set of high-powered studies showing that a widely accepted and consequential conclusion was dead wrong would be important in my book[1].  And this series of studies should, ideally, be published in the same journal that promulgated the original, misleading conclusion.  As your mother always said, clean up your own mess.

Other writers have recently laid out interesting, ambitious, and complex plans for reforming psychological research, and even have offered visions of a “research utopia.”  I am not doing that here.  I only seek to convince you of one point:  psychology (and probably all of science) needs more replications.  Simply not ruling replication studies inadmissible out of hand would be an encouraging start.  Do I ask too much?

Note: Thanks to Sanjay Srivastava for originally publishing this as a guest post on his blog.  Since I happen to be the president-elect of the Society for Personality and Social Psychology, I should also add that this essay represents my personal opinion and does not express the policies of the Society or the opinions of its other officers.

[1] So would a series of studies confirming that an important, surprising, and counter-intuitive finding was actually true.  But most aren’t, I suspect.