One example of how to do it

In a previous post, I wrote about the contentious atmosphere that so often surrounds replication studies, and fantasized a world in which one might occasionally see replication researchers and the original authors come together in “a joint effort to share methods, look at data together, and come to a collaborative understanding of an important scientific issue.” Happily, one example that comes close to this ideal has recently been accepted for publication in Psychological Science, the same journal that published the original paper. The authors of both the original and replication studies appear to have worked together to share information about procedures and analyses, which, while perhaps not a full collaboration, is at least cooperation of a sort that’s seen too rarely. The result was that the original, intriguing finding did not replicate; two large new studies obtained non-significant findings in the wrong direction. The hypothesis that anxiously attached people might prefer warm foods when their attachment concerns are activated was provocative, to say the least. But it seems to have been wrong.

With this example now out there, I hope others follow the same path towards helping the scientific literature perform the self-correcting process that, in principle, is its principal distinctive advantage.  I also hope that, one of these days, an attempt to independently replicate a provocative finding will actually succeed!  Now that would be an important step forward.

Sanjay Srivastava  and Job van Wolferen have also commented on this replication study.

UPDATE, April 15: This comes via Eric Eich, the editor of Psychological Science, who accepted the paper (discussed above) by LeBel and Campbell that failed to replicate the study by Matthew Vess. Eich offered Vess the opportunity to publish a rejoinder, and this is what Vess said:

Thank you for the opportunity to submit a rejoinder to LeBel and Campbell’s commentary. I have, however, decided not to submit one. While I am certainly dismayed to see the failed attempts to reproduce a published study of mine, I am in agreement with the journal’s decision to publish the replication studies in a commentary and believe that such decisions will facilitate the advancement of psychological science and the collaborative pursuit of accurate knowledge. LeBel and Campbell provide a fair and reasonable interpretation of what their findings mean for using this paradigm to study attachment and temperature associations, and I appreciated their willingness to consult me in the development of their replication efforts. Once again, thank you for the opportunity.

Hats off to Matthew Vess.  Imagine if everyone whose findings were challenged responded in such a civil and science-promoting manner.  What a wonderful world it would be.

A Replication Initiative from APS

Several of the major research organizations in psychology, including APA, EAPP (European Association of Personality Psychology) and SPSP, have been talking about the issue of replicability of published research, but APS has made the most dramatic move so far to actually do something about it. The APS journal Perspectives on Psychological Science today announced a new policy to enable the publication of pre-registered, robust studies seeking to replicate important published findings. The journal will add a new section for this purpose, edited by Dan Simons and Alex Holcombe. For details, click here.

This idea has been kicked around in other places, including proposals for new journals exclusively dedicated to replication studies.  One of the most interesting aspects of the new initiative is that instead of isolating replications in an independent journal few people might see, they will appear in an already widely-read and prestigious journal with a high impact factor.

When a similar proposal — in the form of a suggested new journal — was floated in a meeting I attended a few weeks ago, it quickly stimulated controversy. Some saw the proposal as a self-defeating attack on our own discipline that would only undermine the credibility of psychological research.  Others saw it as a much-needed self-administered corrective action; better to come from within the field than be imposed from outside. And still others — probably the largest group — raised and got a bit bogged down in worrying about specifics of implementation.  For example, what will stop a researcher from running a failed replication study, and only then “pre-registering” it?  How many failed replications does it take to overturn the conclusions of a published study, and what does “failed replication” mean exactly, anyway?  What degree of statistical power should replication studies be required to have, and what effect size should be used to make this calculation?  Finally, running these replication studies (as described in the PPS policy) looks to be a demanding and expensive enterprise. Who will have sufficient time, money and/or incentive to run them?   These questions all lack ready answers.

My own view is that the answers to these questions — or their ultimate unanswerability — will only be established through experimentation.  Somebody needs to try it and see what happens.  I admire APS for taking this step and am looking forward to seeing what, if anything, ultimately becomes of it.

How High is the Sky? Well, Higher than the Ground

Challenged by some exchanges in my own personal emails and over in Brent Roberts’s “pigee” blog, I’ve found myself thinking more about what is surely the weakest point in my previous post about effect size: I failed to reach a clear conclusion about how “big” an effect has to be to matter. As others have pointed out, it’s not super-coherent to claim, on the one hand, that effect size is important and must always be reported yet to acknowledge, on the other hand, that under at least some circumstances very “small” effects can matter for practical and/or theoretical purposes.

My attempt to restore coherence has two threads, so far. First, to say that small effect sizes are sometimes important does not mean that they always are. It depends. Is .034 (in terms of r) big enough? It is, if we are talking about aspirin’s effect on heart attacks, because wide prescription can save thousands of lives a year (notice, though, that you need effect size to do this calculation). Probably not, though, for other purposes.

But honestly, I don’t know how small an effect is too small. As I said, it depends. I suspect that if social psychologists, in particular, reported and emphasized their effect sizes more often, over time an experiential base would accrue that would make interpreting them easier. But, in the meantime, maybe there is another way to think about things.

So the second thread of my response is to suggest that perhaps we should focus on the ordinal rather than absolute nature of effect sizes. While we don’t often know exactly how big an effect has to be to matter, in an absolute sense, there are many contexts in which we care which of two things matters **more**. Personality psychologists routinely publish long (and to some people, boring) lists of correlates; such lists draw attention to the personality variables that appear to be more and less related to the outcome of interest, even if the exact numerical values aren’t necessarily all that informative.

Social psychological theorizing is also often phrased in terms of relative effect size, though the actual numbers aren’t always included. The whole point of Ross & Nisbett’s classic book “The Person and the Situation” was that the effects of situational variables are larger than the effects of personality variables, and they draw theoretical implications from that comparison that (read almost any social psychology textbook or social psych. section of any intro textbook) go to the heart of how social psychology is theoretically framed at the most general level. The famous “Fundamental Attribution Error” is explicitly expressed in terms of effect size: situational variables allegedly affect behavior “more” than people think. How do you even talk about that claim without comparing effect sizes? The theme of Susan Fiske’s address at the presidential symposium at the 2012 SPSP was that “small” manipulations can have “large” effects; this is also effect size language expressing a theoretical view. Going back further, when attitude change theorists talked about direct and indirect routes to persuasion, this raised a key theoretical question of relative influence of the two effects. More recently, Lee Jussim wrote a whole (and excellent) book about the size of expectancy effects, comparing them to the effects of prior experience, valid information, etc. and building a theoretical model from that comparison.

I could go on, but, in short, the relative size of effects matters in social psychological theorizing whether the effects are computed and reported, or not. When they aren’t, of course, the theorizing is proceeding in an empirical vacuum that might not even be noticed, and this happens way too often, including in some of the examples I just listed. My point is that effect size comparisons, usually implicit, are ubiquitous in psychological theorizing, so it would probably be better if we remembered to explicitly calculate them, report them, and consider them carefully.

Does (effect) Size Matter?

Personality psychologists wallow in effect size; the ubiquitous correlation coefficient, Pearson’s r, is central to nearly every research finding they report.  As a consequence, discussions of relationships between personality variables and outcomes are routinely framed by assessments of their strength.  For example, a landmark paper reviewed predictors of divorce, mortality, and occupational achievement, and concluded that personality traits have associations with these life outcomes that are as strong as or stronger than traditional predictors such as socio-economic status or cognitive ability (Roberts et al., 2007).  This is just one example of how personality psychologists routinely calculate, care about, and even sometimes worry about the size of the relationships between their theoretical variables and their predicted outcomes.

Social psychologists, not so much. The typical report in experimental social psychology focuses on p-level, the probability of a difference between experimental groups at least as large as the one observed, if the null hypothesis of no difference were true. If this probability is .05 or less, then: Success! While effect sizes (usually Cohen’s d or, less often, Pearson’s r) are reported more often than they used to be, probably because the APA Publication Manual explicitly requires it (a requirement not always enforced), the emphasis of the discussion of the theoretical or even the practical importance of the effect typically centers around whether it exists. The size simply doesn’t matter.
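For readers who have not computed these statistics by hand, here is a minimal sketch of the two effect sizes just mentioned. The scores are made up for illustration, the function names are my own, and the conversion from d to r assumes two groups of equal size.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Standardized mean difference, using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


def d_to_r(d):
    """Convert Cohen's d to a correlation r, assuming equal group sizes."""
    return d / math.sqrt(d * d + 4)


# Hypothetical scores, invented purely for illustration.
treatment = [5.1, 6.0, 5.5, 6.2, 5.8, 6.4]
control = [5.0, 5.4, 5.2, 5.9, 5.3, 5.6]

d = cohens_d(treatment, control)
print(f"d = {d:.2f}, r = {d_to_r(d):.2f}")
```

Neither number says anything about sample size. That is the difference from the p-level, which mixes the size of the effect with the number of participants.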

Is this description an unfair caricature of social psychological research practice?  That’s what I thought until recently.  Even though the typical statistical education of many experimentally-oriented psychologists bypasses extensive discussion of effect size in favor of the ritual of null-hypothesis testing, I assumed that the smarter social psychologists grasped that an important part of scientific understanding involves ascertaining not just whether some relationship between two variables “exists,” but how big that relationship is and how it compares to various benchmarks of theoretical or practical utility.

It turns out I was wrong.  I recently had an email exchange with a prominent social psychologist who I greatly respect. [i] I was shocked, therefore, when he wrote the following[ii]:

 …the key to our research… [is not] to accurately estimate effect size. If I were testing an advertisement for a marketing research firm and wanted to be sure that the cost of the ad would produce enough sales to make it worthwhile, effect size would be crucial. But when I am testing a theory about whether, say, positive mood reduces information processing in comparison with negative mood, I am worried about the direction of the effect, not the size (indeed, I could likely change the size by using a different manipulation of mood, a different set of informational stimuli, a different contextual setting for the research — such as field versus lab). But if the results of such studies consistently produce a direction of effect where positive mood reduces processing in comparison with negative mood, I would not at all worry about whether the effect sizes are the same across studies or not, and I would not worry about the sheer size of the effects across studies. This is true in virtually all research settings in which I am engaged. I am not at all concerned about the effect size (except insofar as very small effects might require larger samples to find clear evidence of the direction of the effect — but this is more of a concern in the design phase, not in interpreting the meaning of the results). In other words, I am yet to develop a theory for which an effect size of r = .5 would support the theory, but an effect size of r = .2 (in the same direction) would fail to support it (if the effect cannot be readily explained by chance). Maybe you have developed such theories, but most of our field has not.

To this comment, I had three reactions.

First, I was startled by the claim that social psychologists don’t and shouldn’t care about effect size. I began my career during the dark days of the Mischelian era, and the crux of Mischel’s critique was that personality traits rarely correlate with outcomes greater than .30. He never denied that the correlations were significant, mind you, just that they weren’t big enough to matter to anybody on either practical or theoretical grounds. Part of the sport was to square this correlation, and state triumphantly (and highly misleadingly) that therefore personality only “explains” “9% of the variance.”  Social psychologists of the era LOVED this critique[iii]! Some still do. Oh, if only one social psychologist had leapt to personality psychology’s defense in those days, and pointed out that effect size doesn’t matter as long as we have the right sign on the correlation… we could have saved ourselves a lot of trouble (Kenrick & Funder, 1988).

Second, I am about 75% joking in the previous paragraph, but the 25% that’s serious is that I actually think that Mischel made an important point – not that .30 was a small effect size (it isn’t), but that effect size should  be the name of the game.  To say that an effect “exists” is a remarkably simplistic statement that on close examination means almost nothing.  If you work with census data, for example, EVERYTHING — every comparison between two groups, every correlation between any two variables — is statistically significant at the .000001 level. But the effect sizes are generally teeny-tiny, and of course lots of them don’t make any sense either (perhaps these should be considered “counter-intuitive” results). Should all of these findings be taken seriously?
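The census point is easy to check for yourself. The sketch below is only an approximation: it uses the usual t statistic for a correlation and treats it as normally distributed, which is reasonable only when the sample is large. The correlation of .01 and the sample sizes are hypothetical values I picked for illustration.

```python
import math


def two_sided_p_for_r(r, n):
    """Approximate two-sided p-value for a correlation r with sample size n.

    Uses the t statistic for a correlation with a normal approximation to
    the t distribution, so it should only be trusted for large n.
    """
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return math.erfc(abs(t) / math.sqrt(2))


r = 0.01  # hypothetical, very small correlation
for n in (100, 10_000, 1_000_000):
    print(f"r = {r}, n = {n:>9,}: p = {two_sided_p_for_r(r, n):.3g}")
```

The effect size is identical in every row. Only the sample size changes, and with it the p-level goes from unremarkable to vanishingly small.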

Third, if the answer is no, then we have to decide how big an effect is in fact worth taking seriously. And not just for purposes of marketing campaigns! If, for example, a researcher wants to say something like “priming effects can overwhelm our conscious judgment” (I have read statements like that), then we need to start comparing effect sizes. Or, if we are just going to say that “holding a hot cup of coffee makes you donate more money to charity” (my favorite recent forehead-slapping finding) then the effect size is important for theoretical, not just practical purposes, because a small effect size implies that a sizable minority is giving LESS money to charity, and that’s a theoretical problem, not just a practical one.  More generally, the reason a .5 effect size is more convincing, theoretically, than a .2 effect size is that the theorist can put less effort into explaining why so many participants did the opposite of what the theory predicted.
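One rough way to put a number on the "sizable minority" is the so-called common-language effect size. This is a sketch under assumptions that are mine, not the original researchers': scores in both groups are normally distributed with equal variance, the groups are of equal size, and r is converted to d accordingly. It compares a randomly chosen participant from each group rather than tracking how any one person responded, which a between-subjects design cannot tell us.

```python
import math


def r_to_d(r):
    """Convert a correlation r to Cohen's d, assuming equal group sizes."""
    return 2 * r / math.sqrt(1 - r * r)


def prob_opposite_direction(r):
    """Chance that a random control participant outscores a random treated one.

    Assumes normal distributions with equal variance in both groups.
    """
    d = r_to_d(r)
    # P(treated > control) = Phi(d / sqrt(2)), and Phi(x) = (1 + erf(x / sqrt(2))) / 2,
    # so the argument to erf simplifies to d / 2.
    prob_predicted = 0.5 * (1 + math.erf(d / 2))
    return 1 - prob_predicted


for r in (0.2, 0.5):
    print(f"r = {r}: opposite direction in {prob_opposite_direction(r):.0%} of pairs")
```

Under these assumptions the share of pairs that run against the prediction shrinks considerably as r goes from .2 to .5, which is the sense in which the larger effect leaves the theorist less to explain away.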

Still, it’s difficult to set a threshold for how big is big enough.  As my colleague pointed out in a subsequent e-mail – and as I’ve written myself, in the past — there are many reasons to take supposedly “small” effects seriously.  Psychological phenomena are determined by many variables, and to isolate one that has an effect on an interesting outcome is a real achievement, even though in particular instances it might be overwhelmed by other variables with opposite influences.  Rosenthal and Rubin (1982) demonstrated how a .30 correlation was enough to be right, about two times out of three.  Ahadi and Diener (1989) showed that if just a few factors affect a common outcome, the maximum size of the effect of any one of them is severely constrained.  In a related vein, Abelson (1985) calculated how very small effect sizes – in particular, the relationship between batting average and performance in a single at-bat – can cumulate fairly quickly into large differences in outcomes (or ballplayer salaries).  So far be it from me to imply that a “small” effect, by any arbitrary standard, is unimportant.
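The Rosenthal and Rubin display is simple enough to sketch in a few lines. It assumes both the predictor and the outcome are split 50/50, in which case a correlation of r corresponds to "success" rates of .50 + r/2 in one group and .50 - r/2 in the other. The function name is my own; the three values of r are ones that come up in these posts.

```python
def besd(r):
    """Binomial effect size display: the two success rates implied by r.

    Assumes predictor and outcome are both dichotomized with 50/50 splits.
    """
    return 0.5 + r / 2, 0.5 - r / 2


for r in (0.034, 0.21, 0.30):
    high, low = besd(r)
    print(f"r = {r:.3f}: {high:.1%} vs. {low:.1%}")
```

For r = .30 the two rates are 65% and 35%, which is where "right about two times out of three" comes from.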

Now we are getting near the crux of the matter.  Arbitrary standards – whether the .05 p-level threshold or some kind of minimum credible effect size – are paving stones on the road to ruin.  Personality psychologists routinely calculate and report their effect sizes, and as a result have developed a pretty good view of what these numbers mean and how to interpret them.  Social psychologists, to this day, still don’t pay much attention to effect sizes so haven’t developed a base of experience for evaluation. This is why my colleague Dan Ozer and I were able to make a splash as young beginning researchers, simply by pointing out that, for example, the effect size of the distance of the victim on obedience in the Milgram study was in the .30’s (Funder & Ozer, 1983).  The calculation was easy, even obvious, but apparently nobody had done it before.  A meta-analysis by Richard et al. (2003) found that the average effect size of published research in experimental social psychology is r = .21.  This finding remains unknown, and probably would come as a surprise, to many otherwise knowledgeable experimental researchers.

But this is what happens when the overall attitude is that “effect size doesn’t matter.”  Judgment lacks perspective, and we are unable to separate that which is truly important from that which is so subtle as to be virtually undetectable (and, in some cases, notoriously difficult to replicate).

My conclusion, then, is that effect size is important and the business of science should be to evaluate it, and its moderators, as accurately as possible.  Evaluating effect sizes is and will continue to be difficult, because (among other issues) they may be influenced by extraneous factors, because apparently “small” effects can cumulate into huge consequences over time, and because any given outcome is influenced by many different factors, not just one or even a few.  But the solution to this difficulty is not to regard effect sizes as unimportant, much less to ignore them altogether.  Quite the contrary, the more prominence we give to effect sizes in reporting and thinking about research findings, the better we will get at understanding what we have discovered and how important it really is.

References

Abelson, R.P. (1985). A variance explanation paradox: When a little is a lot. Psychological Bulletin, 97, 129-133.

Ahadi, S., & Diener, E. (1989). Multiple determinants and effect size. Journal of Personality and Social Psychology, 56, 398-406.

Funder, D.C., & Ozer, D.J. (1983). Behavior as a function of the situation. Journal of Personality and Social Psychology, 44, 107-112.

Kenrick, D.T., & Funder, D.C. (1988). Profiting from controversy: Lessons from the person-situation debate. American Psychologist, 43, 23-34.

Nisbett, R.E. (1980). The trait construct in lay and professional psychology. In L. Festinger (Ed.), Retrospections on social psychology (pp. 109-130). New York: Oxford University Press.

Richard, F.D., Bond, C.F., Jr., & Stokes-Zoota, J.J. (2003). One hundred years of social psychology quantitatively described. Review of General Psychology, 7, 331-363.

Roberts, B.W., Kuncel, N.R., Shiner, R., Caspi, A., & Goldberg, L.R. (2007). The power of personality: The comparative validity of personality traits, socioeconomic status, and cognitive ability for predicting important life outcomes. Perspectives on Psychological Science, 2, 313-345.

Rosenthal, R., & Rubin, D.B. (1982). A simple, general-purpose display of magnitude of experimental effect. Journal of Educational Psychology, 74, 166-169.

[i] We served together for several years on a grant review panel, a bonding experience as well as a scientific trial by fire, and I came to admire his incisive intellect and clear judgment.

[ii] I obtained his permission to quote this passage but, understandably, he asked that he not be named in order to avoid being dragged into a public discussion he did not intend to start with a private email.

[iii] See, e.g., Nisbett, 1980, who raised the “personality correlation” to .40 but still said it was too small to matter.  Only 16% of the variance, don’t you know.

Speaking of replication…

A small conference sponsored by the European Association of Personality Psychology, held in Trieste, Italy last summer, addressed the issue of replicability in psychological research. Discussions led to an article describing recommended best practices, and the article is now “in press” at the European Journal of Personality. You can see the article if you click here.

Update November 8: Courtesy of Brent Roberts, the contents of the special issue of Perspectives on Psychological Science on replicability are available. To go to his blog post, with links, click here.

On inference (updated x 2)

At a conference I attended last month, I heard for the first time about an Oxford philosopher who, according to his fellow philosophers, has pretty much proved that we live inside of a computer simulation. I’ll take the philosophers’ word for it when they say the inferential logic appears to be impeccable.

Which brings me to draw the following lesson: Any system of rigid (or automatic) inferential rules, followed out on a long enough chain, will eventually lead to an absurd conclusion. (If someone else has already coined this principle as a proverb, or something, I’d love to hear about it.)

For example, consider rigid applications of constitutional law. The Second Amendment of the US Constitution says that the right to bear arms shall not be infringed. Therefore, as an American citizen, I cannot be prohibited from owning a pistol, an assault rifle or (why not?) a nuclear bomb. The logic is fine; the conclusion is ridiculous.

The vulnerability of automatic systems to absurd outcomes is one reason  I dislike the term “inferential statistics.” There is really no such thing. All statistics are descriptive. Some describe the probability of a result under the null, which is not uninteresting. But this calculation can’t do your inference for you, as brainlessly comforting as that would be. Instead, you need to think about the actual result. You should consider, for example, its a priori plausibility, its theoretical context, and its consistency with other known facts, not to mention its replicability. You might even add a dollop of – dare I say it? – common sense.

Of course, your inference might be wrong. That’s the thing about inferences. But a system of rules that tries to make your inferences for you (especially if the rules include arbitrary standards like the .05 threshold) risks drawing conclusions that are out-and-out absurd.

Update November 10, 2012: On reading post-election commentary I’m realizing that what I said above has some problems, or is at best incomplete. Right before the election a battle raged on the internet between Nate Silver, of the incomparable blog fivethirtyeight.com, and various pundits, most but not all of whom were conservative. Silver earned the pundits’ ire by using a sophisticated statistical prediction model that produced a high-confidence prediction of an Obama victory, whereas the pundits “knew in their guts” or from “years of experience” that Romney was going to win (in a landslide, some even said). Of course we know now that Silver was right and the pundits wrong. This outcome is a clear victory for statistical prediction over what the pundits surely would have been willing to characterize as “a priori plausibility…theoretical context… consistency with other known facts…[and] common sense” (see above).

So where does that leave my little aphorism and its supposed implications? In an uncomfortable spot, that’s where. Common sense and “gut reactions” (about which Gerd Gigerenzer has written an entire, brilliant book) remain indispensable, especially in situations where the data to calculate a Silver-ish model aren’t available, which is probably most situations in real life. But relying on common sense and the gut also can make one’s conclusions vulnerable to wishful thinking, which seems to be what happened in the case of the election pundits. When you have a wealth of relevant data and a model, based on past experience and reasonable theorizing, for combining them into a prediction, then you probably do want to rely on the model and not on so-called common sense.

However: Note the word “probably” in the preceding sentence.  Even Nate Silver’s final prediction was only issued with 91% confidence.  Maybe the remaining 9% is where common sense saves itself.  I still don’t like the term “inferential statistics” and the arbitrary .05 threshold for deciding whether something is true (which Silver doesn’t use, by the way; he always reports exact probabilities).  And, I still don’t think we live inside a computer simulation.  Who would program a universe like this?

Update November 14, 2012: It turns out that of all the pundits and prognosticators for the 2012 presidential election, only three got perfect scores on predicting the electoral college. Two of them were statistical modelers, the previously mentioned Nate Silver (of fivethirtyeight.com) and a professor at Emory University named Drew Linzer. But the one with the very best predictive record, with a near perfect estimate of the margin of victory or defeat in each of the swing states, was Markos Moulitsas, proprietor of the liberal blog dailykos. What was the edge he had over the statisticians? To quote his own answer:

 All three of us used data to arrive at our conclusions. The difference between them and me? They were wedded to their algorithmic and automatic models, but my model is manual, allowing me the freedom to evaluate each piece of data on its merits and separate the wheat from the chaff, while mixing in early vote performance to further refine my calls.

That’s the point I’ve been struggling to make, above.  We can and should be informed by our statistical calculations — and not simply ignore them, as a startling number of pundits did — but we still have to take responsibility for the conclusions we draw.  Sometimes human judgment adds value, and every once in a great while it can save us from fatal error.  Or just silly conclusions.  I still don’t believe we live inside a computer simulation.

The perilous plight of the (non)-replicator

As I mentioned in my previous post, while I’m sympathetic to many of the ideas that have been suggested about how to improve the reliability of psychological knowledge and move towards “scientific utopia,” my own thoughts are less ambitious and keep returning to the basic issue of replication. A scientific culture that consistently produced direct replications of important results would be one that eventually purged itself of many of the problems people have been worrying about lately, including questionable research practices, p-hacking, and even data fraud.

But, as I also mentioned in my previous post, this is obviously not happening.  Many observers have commented on the institutional factors that discourage the conduct and, even more, the publication of replication studies.  These include journal policies, hiring committee practices, tenure standards, and even the natural attractiveness of fun, cute, and counter-intuitive findings.  In this post, I want to focus on a factor that has received less attention: the perilous plight of the (non) replicator.

The situation of a researcher who has tried and failed to replicate a prominent research finding is an unenviable one.  My sense is that the typical non-replicator started out as a true believer, not a skeptic.  For example, a few years ago I spent sabbatical time at a large, well-staffed and well-equipped institute in which several researchers were interested in a very prominent finding in their field, and wished to test further hypotheses they had generated about its basis.  As good scientists, they began by making sure that they could reproduce the basic effect.  To their surprise and increasing frustration, they simply could not.  They followed the published protocol, contacted the original investigator for more details, tweaked this, tweaked that.  (As I said, they had lots of resources.)  Nothing.  Eventually they simply gave up.

Another anecdote.  A graduate student of a colleague of mine was intrigued by a finding published in Science.  You don’t see psychological research published in that ultimately prestigious journal very often, so it seemed like a safe bet that the effect was real and that further creative studies to develop its theoretical foundation would be a great project towards a dissertation and a research career.  Wrong.  After about three years of failing to replicate the original finding, the advisor finally had to insist that the student find another topic and start over.  You can imagine the damage this experience did to the student’s career prospects.

Stories like these are legion, but you don’t see many of them in the published literature. Indeed, I suspect most failures to replicate are never written up, much less submitted for publication. There are probably many reasons, but consider just one:  What happens when a researcher does decide to “go public” with a failure – or even repeated, robust failures – to replicate a prominent finding?  If some recent, highly publicized cases are any guide, several unpleasant outcomes can be anticipated.

First, the finding will be vehemently defended, sometimes not just by its originator but also by the acolytes that a surprising number of prominent researchers seem to have attracted into loyal camps.[i]  The defensive articles, written by prominent people with considerable skills, are likely to be strongly argued, eloquent, and long.  The non-replicator has a good chance of being publicly labeled as incompetent if not deliberately deceptive, and may be compared to skeptics of global warming!  Even a journalist who has the temerity to write about non-replication issues risks being dismissed as a hack.  This situation can’t be pleasant. It takes a certain kind of person to be willing to be dragged into it – and not necessarily the same kind of person who was attracted to a scientific career in the first place.

It gets worse.  The failed replicator also risks various kinds of subtle and not-so-subtle retaliation.  I was at a conference a few weeks ago where I heard, first-hand, from a researcher who found that a promotion letter that subtly but powerfully derogated the researcher’s career was not only an outlier with respect to the other letters in the file, but was written by a practitioner in a field that the researcher’s work had dared to question.  Another first-hand story concerned a researcher who, after publishing some reversals of findings that had been pushed for years by a powerful school of investigators, found that external reviews of submitted journal articles on other topics had suddenly turned harshly critical.  And, in an episode I had the opportunity to observe directly, a professor and graduate student who had a paper questioning an established finding actually accepted for publication in a prominent journal found themselves subjected to threats!  The person who “owned” the original effect said to them: you need to withdraw this paper.  I’m the most prominent researcher in the field and the New York Times will surely call me for comment.  I will be forced to publicly expose your incompetence.  Your career will be damaged; your student’s career will be ruined.  The threat concluded, darkly: I say this as a friend; I only have your best interests at heart.

Do you know other stories like this?  There is a good chance you do.  Publishing a failure to replicate a prominent finding, or even challenging the accepted state of the evidence in any way, is not for the squeamish.  No wonder the typical response of a failed replicator is simply to drop the whole thing and walk away.  The reaction makes sense, and from the point of view of individual self-interest – especially for a junior researcher – is probably the rational thing to do.  But it’s disastrous for the accumulation of reliable scientific knowledge.

This is a cultural problem that needs to be solved.  As individuals and as members of a research culture, we need to clarify two things.  First, we have to make clear that denunciations of people with contrary findings as incompetent or deceptive, retaliation through journal reviews and promotion letters, and overt threats, are, in a phrase, SERIOUSLY NOT OK.  This should go without saying, but – judging from what we’ve seen happen recently – apparently it doesn’t.

Second, and only slightly less obviously, we should try to recognize that a failure to confirm one of your findings does not have to be viewed as an attack.  Indeed, a colleague attending this same meeting pointed out that a failure to replicate is a sort of compliment: it means your work was interesting and potentially important enough to merit further investigation!  It’s much worse – and far more common – simply to be ignored.  A failure to replicate should be seen not as an attack but as an invitation to clarify what’s going on.  After all, if you couldn’t replicate one of your effects in your own lab, what would you do?  Attack yourself?  No, you’d probably sit down and try to figure out what happened.  So why is it so different if it happens in someone else’s lab?  This could be the beginning of a joint effort to share methods, look at data together, and come to a collaborative understanding of an important scientific issue.

I know I’m dreaming here.  Even a psychologist knows enough about human nature to understand that such an outcome goes against all of our natural defensive inclinations.  But it’s a nice thought, and maybe if we hold it in mind even as an unattainable ideal it might help us to be not quite so vehement, a little less personal, and a bit more open-minded in our responses to scientific challenge.

How can we enforce better responses to failures to replicate?  Sociology teaches us that in small communities gossip is an effective mechanism to enforce social norms.  Research psychology is effectively a small town: a few thousand people at the most, spread out around the world but in regular contact nonetheless.  So the late-night gossip about defensive reactions, retaliation, and threats is one way to ensure that such conduct carries a social price.

In the longer term, we need to change our overall social norm of what’s acceptable.  We need to accept, practice, and, above all, teach constructive approaches to scientific controversy.  This is a very long road.  But, as the proverb tells us, it starts with one step.

Note: This post is based on a brief talk given at a conference on the “Decline Effect,” held at UC Santa Barbara in October 2012.  The conference was organized by Jonathan Schooler and sponsored by the Fetzer-Franklin Foundation. As always, this post expresses my personal opinion and not necessarily that of any other institution or individual.


[i] Typically, their defense will draw on the existence of “conceptual replications,” studies that found theoretically parallel effects using different methods.  However, as Hal Pashler has noted, no matter how many conceptual replications are reported, there is no way to know how many failed efforts never saw the light of day.  This is why it is essential to find out whether the original effect was reliable.