Why doesn’t personality psychology have a replication crisis?

Because It’s Boring

“[Personality psychology] has reduced the chances of being wrong but palpably increased the fact of being boring. In making that transition, personality psychology became more accurate but less broadly interesting.”  — Roy Baumeister (2016, p. 6)

Many fields of research – not just social psychology but also biomedicine, cancer biology, economics, political science, and even physics – are experiencing crises of replicability.  Recent and classic results are challenged by reports that, when new investigators try to repeat them, they often simply can't.  This fact has led to gnashing of teeth and rending of garments, not to mention back-and-forth controversies pitting creativity against rigor (see the article quoted in the epigraph), and spawned memorable phrases such as "replication police" and "shameless little bullies."

But, as the quote above attests, personality psychology seems to be immune.  In particular, I am not aware of any major finding (1) in personality psychology that has experienced the kind of assault on its reliability that has been inflicted upon many findings in social psychology (2).  Why not?  Is it because personality psychology is boring?  Maybe so, and I’ll come back to that point at the end, but first let’s consider some other

Possible Reasons Personality Psychology Does Not Have a Replication Crisis

1. Personality Psychology Takes Measurement Seriously

The typical study in personality measures some attribute of persons (usually a personality trait) and also measures an outcome such as a behavior, a level of attainment, or an indicator of mental or physical health.  Even though everyone chants the mantra "correlation is not causality," research generally proceeds on the (usually reasonable) presumption that the trait can be thought of as the independent variable, and the outcome as the dependent variable.  The IV is measured with several different indicators (items) and its reliability is calculated and reported.  The same practice is followed with the DV, and converging evidence is conventionally required that both the IV and the DV are reasonably good indicators of the constructs they are supposed to represent.  Compared to other areas of psychology, the N is usually pretty large too.
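
For readers who want to see what "reliability is calculated and reported" amounts to in practice, here is a minimal sketch of one standard index, Cronbach's alpha. The scale, the item scores, and the function name are purely illustrative inventions, not anyone's actual data.

```python
# A minimal sketch of the kind of reliability check personality researchers
# routinely report: Cronbach's alpha for a multi-item trait scale.
# The item scores below are made-up numbers for illustration only.
import numpy as np

def cronbach_alpha(items):
    """items: 2-D array, rows = respondents, columns = scale items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                         # number of items
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the scale total
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Five respondents answering a four-item extraversion scale (1-5 ratings).
fake_scale = [[4, 5, 4, 5],
              [2, 1, 2, 2],
              [3, 3, 4, 3],
              [5, 4, 5, 4],
              [1, 2, 1, 2]]
print(round(cronbach_alpha(fake_scale), 2))  # internal-consistency estimate
```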

Contrast this with the typical study in social psychology.  Many have only two levels of the IV: the two experimental conditions (experimental and control); maybe there are three or four if the experiment is extra-fancy.  But the typical IV is scaled as merely high or low, or even present or absent.  For example, subjects might be asked to unscramble words that do or do not have certain classes of content embedded within them.  Neither the reliability nor the generalizability of this manipulation is assessed (would the manipulation have the same effect if used more than once?  Is the manipulation related to, or does it have the same effect as, other manipulations of ostensibly the same psychological variable?), much less its size.  The DV might get a bit more attention, in part because, unlike the IV, it usually has more than two values (e.g., walking speed), and so the reliability of its measurement (say, by two RAs) might be reported, but the wider generalizability (aka construct validity) of the DV remains unexamined.

And I won’t even mention the problems of low power that go along with small-N studies, and the resulting imprecision of the results.  That one has been hashed out elsewhere, at length, so as I said, I’m not mentioning it.

A truism within personality psychology is that good research begins with good measurement, of both the dependent and independent variables.  Not all areas of psychology pay as much attention to this principle.

2. Personality Psychology Cares about Effect Size

Results in personality psychology are always reported in terms of effect size, usually the familiar Pearson correlation coefficient.  Social psychology is different (3); social psychologists often state that they don’t care about effect size because in the context of their research the number is nearly meaningless anyway.  The argument goes like this: Because the experimental conditions are designed to be as different from each other as possible, in order to maximize chances of finding out whether anything will happen at all, and also because experiments, by design, control for things that covary in nature, the sizes of the resulting effects don’t really mean anything outside of the experimental context.  All that matters, for purposes of further theory development, is that an effect is found to exist.  The size is only important if you are doing applied work (4).

I actually think this argument has a point, but it reveals an essential limitation of the two-group experiment.  The method can be informative about the direction of causality, and about the direction of the effect (positive or negative).  But it can tell us little or nothing about how big, how robust, and, yes, how replicable the finding will turn out to be.

In contrast, close attention to measurement has produced a research literature establishing that

3. Many Key Findings of Personality Psychology are Robustly Replicable

These include:

  • Behavior is consistent across situations
  • Personality predicts longevity, job performance, relationship satisfaction and many other important life outcomes
  • Acquaintances largely agree with each other about the personality traits of the people they know well
  • People agree (with some interesting exceptions) with their acquaintances’ assessments of their personality
  • Measures of personality predict central tendencies of density distributions of behavior (for example, a trait measure of extraversion can predict how many extraverted behaviors you will display, on average)
  • Much of the information (though not all) in the 17,953 trait words in the unabridged English dictionary can be reduced to the "Big Five" basic traits: Extraversion, Neuroticism, Agreeableness, Conscientiousness, and Openness to Experience.

This is a very partial list.  But lest I be accused of bias (5), I will also note that:

4. Too Many Findings in Personality Psychology are Robust but Trivial

I actually co-authored a paper with the author of the epigraph above (Baumeister, Vohs & Funder, 2007) that, among other things, took personality psychology to task on this very point.  A lot – too much – research in personality psychology correlates one self-report with another self-report.  Can you say "method variance"?  I've done such studies myself and they have their uses, and sometimes they are all one can do, so my overall attitude is forgiving, even while I also believe that there truly is something to forgive.

Trivial findings will replicate! Correlations among different self-report scales can be expected to robustly replicate because the relationships are often built right into the content of the scales themselves.

Studies with self-report scales are common in part because they are so easy to do, but

5. Many Important Findings in Personality Psychology are Very Difficult to Repeat

Some of these findings come from longitudinal studies, in which individuals are repeatedly assessed over long periods of time.  These studies have shown that conscientious people live longer and that the consistency of individual differences is maintained over decades, and they have also charted normative trends in personality development, showing how traits such as extraversion and conscientiousness wax and wane over the lifespan.  These findings have been replicated, but only "conceptually," because no two major longitudinal studies have ever used exactly the same methods.  A skeptic would literally need decades to really double-check them.

Other findings might not take decades to reproduce, but they are still no walk in the park.  Consider a study from my lab (Fast & Funder, 2008).  This study was actually in one of the issues of JPSP targeted by the Center for Open Science replication project.  But nobody tackled it.  Why not?  Our study examined correlations between personality, as judged by peers, and the frequency with which people used words in different categories during a life-history interview.  To replicate this study, here's all you have to do: Recruit a sample of 250 undergraduates.  Recruit two peers of each of them to describe their personalities (500 peers in all).  Subject each of these 250 students to a one-hour life-history interview conducted by a licensed clinical psychologist.  Transcribe the recordings of these interviews, delete the interviewer's comments, and clean up the transcripts so that they can undergo linguistic analysis.  Run the transcripts through a linguistic analysis program (we used LIWC) and see which categories of word use are related to personality, as judged by peers.  Gathering the data for this project took two years.  Transcribing the interviews and cleaning the transcriptions took another two years, and the analyses took around a year beyond that: about five years of work, in all.  I do NOT know whether the findings would replicate, though we used internal checks to make the results we reported as robust as possible.  I would seriously love to see someone else do the same study to see if our results hold up.  What do you think the chances are that anyone ever will?
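
For the curious, here is a toy sketch of only the last analytic step in that list: correlating word-category frequencies with peer judgments. The file names and column labels are hypothetical placeholders, not the actual Fast & Funder (2008) materials, and running LIWC itself (a separate, commercial program) is not shown.

```python
# A toy sketch of the final analytic step described above: correlating
# LIWC word-category frequencies with peer-judged trait scores.
# File names and column names are hypothetical placeholders only.
import pandas as pd
from scipy.stats import pearsonr

liwc = pd.read_csv("liwc_output.csv")        # one row per participant
peers = pd.read_csv("peer_judgments.csv")    # ratings averaged across the two peers
data = liwc.merge(peers, on="participant_id")

for category in ["anger_words", "social_words", "certainty_words"]:
    r, p = pearsonr(data[category], data["peer_rated_extraversion"])
    print(f"{category}: r = {r:.2f}, p = {p:.3f}")
```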

The kinds of non-trivial studies that Baumeister, Vohs and I advocated, that gather direct measurements of observed and meaningful behavior, are difficult to do, especially with a sufficiently large N, and commensurately a lot of work to replicate.  I’d like to think  — in fact, I do think — that most of these findings would survive direct replication, but who really knows? Hardly anybody has the time, resources, or sufficient motivation to check. In the meantime, these findings remain immune to the replication controversy.

But, going back to the opening quotation, there is one more reason why personality psychology has avoided a replication crisis, and I believe this reason is the most important of all.

6. Personality Psychology Is Not Afraid to be Boring

Modern personality psychology (since 1950 or so) has never really striven to be clever, or cute, or counter-intuitive.  Its historic goal has been to be useful.  The gold standard in personality research is prediction (6).  Personality is measured in order to predict – and understand – behaviors and life outcomes, and to be useful in psychological diagnosis, personnel selection, identification of at-risk individuals, career counseling, mental health interventions, improvements in quality of life, and many other purposes.  Occasionally the findings are surprising, such as the now well-established fact that the trait of conscientiousness predicts not only longevity, but also job performance in every single occupation where it has ever been tested.  Nobody expected its implications to be so far-reaching.  The Big Five personality traits are not exactly surprising, but they aren't obvious either.  If they were, it wouldn't have taken 60 years of research to find them, and there wouldn't still be controversy about them.  Still, studies in personality psychology typically lack the kind of forehead-slapping surprise value that characterizes many of the most famous (and problematical) findings in social psychology.

According to Bargain Basement Bayesian analysis, counterintuitive findings have low prior probabilities, by definition.  And thus, in the absence of extra-strong evidence, they are unlikely to be true and therefore unlikely to be replicable.  I am about the one-hundred-thousandth blogger to observe that ignoring this principle has gotten social psychology into tons of trouble.  In contrast, the fact that personality psychology never saw counter-intuitive (or, as some might put it, "interesting") findings as its ultimate goal seems to have turned out to be a protective factor.
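
Here is the Bargain Basement version of that argument in a few lines of code. The priors, power, and alpha below are illustrative assumptions, not estimates of any real literature; the point is only that the same "significant" result is far less believable when the prior is low.

```python
# Back-of-the-envelope illustration of why counter-intuitive findings are
# risky: the probability that a "significant" result reflects a real effect
# depends heavily on the prior. All numbers here are illustrative assumptions.
def prob_finding_is_real(prior, power=0.5, alpha=0.05):
    true_positives = power * prior          # real effects that reach significance
    false_positives = alpha * (1 - prior)   # null effects that reach significance anyway
    return true_positives / (true_positives + false_positives)

print(prob_finding_is_real(prior=0.50))  # plausible hypothesis: about .91
print(prob_finding_is_real(prior=0.10))  # counter-intuitive hypothesis: about .53
print(prob_finding_is_real(prior=0.02))  # forehead-slapping surprise: about .17
```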

Conclusion

Admittedly, some of the advantages of personality psychology are bugs rather than features.  It isn't particularly salutary that so many of personality psychology's findings are trivially replicable because they amount to intercorrelations of self-report scales.  And the fact that some of the most interesting findings are nearly immune to replication studies, because they are so difficult to repeat, does not mean all of those findings are true.  Despite appearances, personality psychology probably has replicability issues too.  They are just harder to detect, which makes it even more important for personality researchers to get it right the first time.  Nobody might come this way again.

Here’s another quote from the same article excerpted at the beginning of this post:

Social psychology might think carefully about how much to follow in personality psychology’s footsteps. Our entire field might end up being one of the losers.

Well, none of us wants to be a “loser,” but the present comparison of the neighboring disciplines leads to a different conclusion.  Social psychology (and science generally) might in fact do well to draw a few lessons from personality psychology: Take measurement seriously. Use large samples.  Care about effect size.  And don’t be afraid to be boring.  More exactly, push back against the dangerous idea that findings have to be surprising or counter-intuitive to be interesting.  How “interesting,” in the end, is a flashy finding that nobody can replicate?

Footnotes

(1) Or to be honest, any study at all, but I'm trying to do a little CYA here.

(2) To name a few: elderly priming, money priming, too much choice, glucose as a remedy for ego depletion, cleanliness and moral judgment, bathing and loneliness, himmicanes, power posing, precognition (this last finding might not really belong to social psychology, but it was published in the field’s leading empirical outlet).

(3) Even when effect sizes are reported, as required by many journal policies, they are otherwise typically ignored.

(4) This is NOT a straw man.  I used to think it was.  See my earlier blog post, which includes verbatim quotes from a prominent (anonymous) social psychologist.

(5) This will happen anyway; see footnote 7.

(6) In the opinion of many, including myself, the best graduate-level textbook ever written on this topic was Personality and Prediction (Wiggins, 1973).  It says a lot about measurement.  Everybody should read it.

(7) An indicator of my bias: I have written a textbook in personality psychology, one which, by the way, I tried very hard to make not-boring.

Acknowledgment

Ryne Sherman and Simine Vazire gave me some helpful tips, but none of this post should be considered their fault.

References

Baumeister, R.F. (2016). Charting the future of social psychology on stormy seas: Winners, losers, and recommendations. Journal of Experimental Social Psychology.

Baumeister, R.F., Vohs, K.D., & Funder, D.C. (2007). Psychology as the science of self-reports and finger movements: Whatever happened to actual behavior? Perspectives on Psychological Science, 2, 396-403.

Fast, L.A., & Funder, D.C. (2008). Personality as manifest in word use: Correlations with self-report, acquaintance-report, and behavior. Journal of Personality and Social Psychology, 94, 334-346.

Wiggins, J.S. (1973). Personality and prediction: Principles of personality assessment. Reading, MA: Addison-Wesley.


What if Gilbert is Right?

I. The Story Until Now (For late arrivals to the party)
Over the decades, since about 1970, social psychologists conducted lots of studies, some of which found cute, counter-intuitive effects that gained great attention. After years of private rumblings that many of these studies – especially some of the cutest ones – couldn’t be replicated, a crisis suddenly broke out into the open (1). Failures to replicate famous and even beloved findings began to publicly appear, become well known, and be thoroughly argued-over, not always in the most civil of terms. The “replicability crisis” became a thing.
But how bad was the crisis really? The accumulation of anecdotal stories and one-off failures to replicate was perhaps clarified to some extent by a major project organized by the Center for Open Science (COS), published last November, in which labs around the world tried to replicate 100 studies and, depending on your definition, “replicated” only 36% of them (2).
In the face of all this, some optimists argued that social psychology shouldn't really feel so bad, because failed replicators might simply be incompetent, if not actually motivated to fail, and the typical cute, counter-intuitive effect is a delicate flower that can only bloom under the most ideal climate and careful cultivation. Optimists of a different variety (including myself) also pointed out that psychology shouldn't feel so bad, but for a different reason: problems of replicability are far from unique to our field. Failures to reproduce key findings have come to be seen as serious problems within biology, biochemistry, cardiac medicine, and even – disturbingly – cancer research. It was widely reported that the massive biotech company Amgen was unable to replicate 47 out of 53 seemingly promising cancer biology studies. If we have a problem, we are far from alone.

II. And Then Came Last Friday’s News (3)
Prominent psychology professors Daniel Gilbert and Tim Wilson published an article that "overturned" (4) the epic COS study. Specifically, their reanalysis concluded that the study not only failed to show persuasively that most of the studies it addressed couldn't be replicated, but that its data were actually consistent with the possibility that all of the studies were replicable! The article was widely reported not just in press releases but in outlets including the Washington Post, Wired, The Atlantic online, and the Christian Science Monitor, to name just a few.
Psychologists who had been skeptical of the "replication movement" all along – come on, we know who you are – quickly tweeted, Facebooked, and otherwise cheered the happy news. Some even began to wonder out loud whether "draconian" journal reforms adopted to enhance replicability could now be repealed. At the same time, and almost as quickly, members of the aforesaid replication movement – come on, we know who you are too (5) – took close looks at the claims by Gilbert and Co., and within 48 hours a remarkable number of blogs and posts (6) began to refute their statistical approach and challenge the way they summarized some of the purported flaws of the replication studies. I confess I found most of these responses pretty persuasive, but that's not my point for today. Instead my point is:

III. What if Gilbert is Right?
Let’s stipulate, for the moment, that Gilbert and Co. are correct that the COS project told us nothing worth knowing about the replicability of social psychological research. What then?

IV. The COS Study Is Not the Only, and Was Far From the First, Sign that We Have A Problem.
One point I have seen mentioned elsewhere – and I'll repeat it here because it's a good point – is that the COS project was far from being the only evidence that social psychology has a replicability problem. In fact, it came after, not before, widespread worry had been instigated by a series of serious and compelling failures to reproduce very prominent studies, and many personal reports of research careers delayed if not derailed by attempts to follow up on lines of research that only certain members of the in-crowd knew were dead ends. As this state of affairs became more public over the past couple of years, the stigma of failing to replicate some famous psychologist's famous finding began (not entirely!!) to fall away, and a more balanced representation of what the data really show, on all sorts of topics, began to accumulate in public file drawers, data repositories, and outlets for replication studies. The COS study, whatever its merits, came on top of all that, not as its foundation.

V. Other Fields Have Replicability Problems Too
A point I haven't, in this context, seen mentioned yet – and my real motivation for writing this post – is that – remember! – the replication crisis was never exclusive to psychology in the first place. It has affected many other fields of research as well. So, if Gilbert & Co. are right, are we to take it that the concerns in our sister sciences are also overblown? For example, was Amgen wrong? Were all those cancer biology studies perfectly replicable after all? Do biochemistry, molecular biology, and the medical research community share social psychology's blight of uncreative, incompetent, shameless little bullies aiming to pull down the best research in their respective fields?
Well, maybe so. But I doubt it. It seems extremely unlikely that the kinds of complaints issued against the studies that failed to replicate psychological findings apply in the same way in these other fields. It seems doubtful that problems in these other fields stem from geographical or temporal differences in social norms, unique aspects of student demographics, changes in wordings of scale items, exact demeanor of research assistants, or other factors of the sort pointed out by Gilbert & Co. as bedeviling attempts to replicate psychological findings. I also have no reason to think that molecular biology is full of shameless little bullies, but I stand ready to be corrected on that point.

VI. The Ultimate Source of Unreliable Scientific Research
So let’s go back to where some of us were before the COS study, when we pointed out that social psychology is not alone in having replication problems. What did this fact imply? Just this: The causes of a scientific literature full of studies that can’t be replicated are not specific to social psychology. The causes are both deeper and broader. They are deeper because they don’t concern concrete details of particular studies, or even properties of particular areas of research. They are broader because they affect all of science.
And the causes are not hard to see. Among them are:
1. An oversaturated talent market full of smart, motivated people anxious to get, or keep, an academic job.
2. A publication system in which the journals that can best get you a job, earn you tenure, or make you a star, are (or until recently have been) edited with standards such as the “JPSP threshold” (of novelty), and the explicit (former) disdain of Psychological Science for mere “bricks in the wall” that represent solid, incrementally useful, but insufficiently “groundbreaking” findings. I have been told that the same kinds of criteria have long prevailed in major journals in other fields of science as well. And of course we all know what kind of article is required to make it into Science.
3. And, even in so-called lesser journals, an insistence on significant findings as a criterion for publication, and a strong preference for reports of perfect, elegant series of studies without a single puzzling data point to be seen. “Messy” studies are left to work their way down the publication food chain, or to never appear at all.
4. An academic star system that radically, disproportionately rewards researchers whose flashy findings get widespread attention not just in our “best” journals but even in the popular media. The rewards can include jobs in the most prestigious universities, endowed chairs, distinguished scholar awards, Ted talks, and even (presumably lucrative) appearances in television commercials! (7)

It is these factors that are, in my opinion, both the ultimate sources of our problem and the best targets for reforming and improving not just psychology, but scientific research in all fields. And, to end on an optimistic note, I think I see signs that useful reforms are happening. People aren't quite as enthusiastic about cute, counter-intuitive findings as they used to be. Hiring committees are starting to wonder what it really means when a vita shows 40 articles published in 5 years, all of which have perfect patterns of results. Researchers are occasionally responding openly – and getting publicly praised for doing so – rather than reacting defensively to questions about their work. (8)

VII. Summary and Moral
The replicability crisis is not just an issue for social psychology, and its causes aren’t unique to social psychology either. Claims that we don’t have a problem, because of various factors that are themselves unique to social psychology, fail to explain why so many other fields have similar concerns. The essential causes of the replicability crisis are cultural and institutional, and transcend specific fields of research. The remedies are too.

Footnotes
(1) The catalyst for this sudden attention appears to have been the nearly simultaneous appearance in JPSP of a study reporting evidence for precognition, and the exposure of massive data fraud by a prominent Dutch social psychologist. While these two cases were unrelated to each other and each exceptional by any standard, together they illuminated the fallibility of peer review and the self-correcting processes of science that were supposed to safeguard against accepting unreliable findings.
(2) Or 47%, or 39% or 68%, again, depending on your definition.
(3) Or a bit earlier, because Science magazine’s embargo was remarkably leaky, beginning with a Harvard press release issued several days before the article it promoted.
(4) To quote their press release; the word does not appear in their article.
(5) Full disclosure. This probably includes me, but I didn’t write a blog about it (until just now).
(6) A few: Dorothy Bishop, Andrew Gelman, Daniel Lakens, Uri Simonsohn, Sanjay Srivastava, Simine Vazire
(7) I strongly recommend reading Diederik Stapel’s vivid account (generously translated by Nick Brown) of how desperately he craved becoming one of these stars, and what this craving motivated him to do.
(8) Admittedly, defensive reactions, amplified in some cases by fan clubs, are still much more common. But I’m looking for positive signs here, and I think I see a few.

Bargain Basement Bayes

One of the more salutary consequences of the “replication crisis” has been a flurry of articles and blog posts re-examining basic statistical issues such as the relations between N and statistical power, the importance of effect size, the interpretation of confidence intervals, and the meaning of probability levels. A lot of the discussion of what is now often called the “new statistics” really amounts to a re-teaching (or first teaching?) of things anybody, certainly anybody with an advanced degree in psychology, should have learned in graduate school if not as an undergraduate. It should not be news, for example, that bigger N’s give you a bigger chance of getting reliable results, including being more likely to find effects that are real and not being fooled into thinking you have found effects when they aren’t real. Nor should anybody who had a decent undergrad stats teacher be surprised to learn that p-levels, effect sizes and N’s are functions of each other, such that if you know any two of them you can compute the third, and that therefore statements like “I don’t care about effect size” are absurd when said by anybody who uses p-levels and N’s.
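
To make that last point concrete, here is a small sketch showing that, for a simple Pearson correlation, the p-value follows directly from the effect size and the N. The particular numbers are arbitrary, chosen only to show the same effect size crossing the significance line as N grows.

```python
# Illustration of the point above: for a simple Pearson correlation, the
# p-value is a function of the effect size (r) and the sample size (N), so
# claiming indifference to effect size while using p and N is incoherent.
from scipy import stats

def p_from_r_and_n(r, n):
    """Two-tailed p for a correlation r observed with sample size n."""
    t = r * ((n - 2) ** 0.5) / ((1 - r ** 2) ** 0.5)
    return 2 * stats.t.sf(abs(t), df=n - 2)

print(p_from_r_and_n(r=0.30, n=30))    # same effect, small N: p is about .11
print(p_from_r_and_n(r=0.30, n=200))   # same effect, large N: p < .001
```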

But that's not my topic for today. My topic today is Bayes' theorem, which is an important alternative to the usual statistical methods, but which is rarely taught at the undergraduate or even graduate level. (1)  I am far from expert about Bayesian statistics. This fact gives me an important advantage: I won't get bogged down in technical details; in fact that would be impossible, because I don't really understand them. A problem with discussions of Bayes' theorem that I often see in blogs and articles is that they have a way of being both technical and dogmatic. A lot of ink – virtual and real – has been spilled over the exact right way to compute Bayes Factors and over whether all statistical analyses should be conducted within a Bayesian framework. I don't think the technical and dogmatic aspects of these articles are helpful – in fact I think they are mostly harmful – for helping non-experts to appreciate what thinking in a semi-Bayesian way has to offer. So, herewith is my extremely non-technical and very possibly wrong (2) appreciation of what I call Bargain Basement Bayes.

Bayes' Formula: Forget about Bayes' formula. I have found that even experts have to look it up every time they use it. For many purposes, it's not needed at all. However, the principles behind the formula are important. The principles are these:

1. First, Bayes assumes that belief exists in degrees, and assigns numbers to those degrees of belief. If you are certain that something is false, it has a Bayes “probability” of 0. If you are certain it’s true, the probability is 1. If you have absolutely no idea, whatsoever, the probability is .5. Everything else is in between.
Traditional statisticians hate this. They don't think a single fact, or event, can even have a probability. Instead, they want to compute probabilities that refer to frequencies within a class, such as the number of times out of a hundred that a result would be greater than a certain magnitude under pure chance given a certain N. But really, who cares? The only reason anybody cares about this traditional kind of probability is that after you compute that nice "frequentist" result, you will use the information to decide what you believe. And, inevitably, you will make that decision with a certain degree of subjective confidence. Traditional statistics ignores and even denies this last step, which is precisely where it goes very, very wrong. In the end, beliefs are held, and decisions based on those beliefs are made, by people, not numbers. Sartre once said that even if there is a God, you would still have to decide whether to do what He says. Even if frequentist statistics are exactly correct (3) you still have to decide what to do with them.

2. Second, Bayes begins with what you believed to be true before you got your data. And then it asks, now that you have your data, how much should you change what you used to believe? (4)
Traditional statisticians hate this even more than they hate the idea of putting numbers on subjective beliefs. They go on about “prior probabilities” and worry about how they are determined, observe (correctly) that there is no truly objective way to estimate them, and suspect that the whole process is just a complicated form of inferential cheating. But the traditional model begins by assuming that researchers know and believe absolutely nothing about their research topic. So, as they then must, they will base everything they believe on the results of their single study. If those results show that people can react to stimuli presented in the future, or that you can get people to slow their walks to a crawl by having them unscramble the word “nldekirw” (5) then that is what we have to believe. In the words of a certain winner of the Nobel Prize, “we have no choice.”
Bayes says, oh come on. Your prior belief was that these things were impossible (in the case of ESP) or, once the possibility of elderly priming was explained, that it seemed pretty darned unlikely. That's what made the findings "counter-intuitive," after all. Conventional statistics ignores these facts. Bayes acknowledges that claims that are unlikely to be true, a priori, need extra-strong evidence to become believable. I am about the one-millionth commentator to observe that social psychology, in particular, has been for too long in thrall to the lure of the "counter-intuitive result." Bayes explains exactly how that got us into so much trouble. Counter-intuitive, by definition, means that the finding had a low Bayesian prior. Therefore, we should have insisted on iron-clad evidence before we started believing all those cute surprising findings, and we didn't. Maybe some of them are true; who knows at this point. But the clutter of small-N, underpowered single studies with now-you-see-it-now-you-don't results is in a poor position to tell us which ones they are. Really, we almost need to start over.

3. Third, Bayes is in the end all about practical decisions. Specifically, it's about decisions to believe something, and to do something or not, in the real world. It is no accident, I think, that so many Bayesians work in applied settings and focus on topics such as weather forecasting, financial planning, and medical decisions. In all of these domains, the lesson they teach tends to be that – as Kahneman and Tversky pointed out long ago – we underuse baserates (6). In medicine, in particular, the implications are just starting to be understood in the case of screening for disease. When the baserate (aka the prior probability) is low, even highly diagnostic tests have a very high probability of yielding false positives, which entail significant physical, psychological, and financial costs. Traditional statistical thinking, which ignores baserates, leads one to think that a positive result on a test with 90% accuracy means the patient has a 90% chance of having the disease. But if the prevalence in the population is 1%, the actual probability given a positive test is less than 10%. In subjective, Bayesian terms of course! Extrapolating this to the context of academic research, the principle implies that we overestimate the diagnosticity of single research studies, especially when the prior probability of the finding is low. I think this is why we were so willing to accept implausible, "counter-intuitive" results on the basis of inadequate evidence. To our current grief.
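
Here is the arithmetic behind that screening example, in a short sketch that assumes (as the text implies) that "90% accuracy" means both sensitivity and specificity equal .90.

```python
# The arithmetic behind the screening example above, assuming "90% accuracy"
# means sensitivity and specificity are both 0.90.
def posterior_given_positive(prevalence, sensitivity=0.90, specificity=0.90):
    true_pos = sensitivity * prevalence            # sick people who test positive
    false_pos = (1 - specificity) * (1 - prevalence)  # healthy people who test positive
    return true_pos / (true_pos + false_pos)       # Bayes' theorem

print(posterior_given_positive(prevalence=0.01))   # about 0.083, i.e., under 10%
```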

You don’t have to be able to remember Bayes’ formula to be a Bargain Basement Bayesian. But, as in all worthwhile bargain basements, you can get something valuable at a low cost.

Footnotes
1. In a recent graduate seminar that included students from several departments, I asked who had ever taken a course that taught anything about Bayes.  One person raised her hand.  Interestingly, she was a student in the business school.
2. Hi Simine.
3. They aren’t.
4. Bayes is sometimes called the “belief revision model,” which I think is pretty apt.
5. Wrinkled
6. Unless the data are presented in an accessible, naturalistic format such as seen in the work by Gerd Gigerenzer and his colleagues, which demonstrates how to present Bayesian considerations in terms other than the intimidating-looking formula.

Towards a De-biased Social Psychology: The effects of ideological perspective go beyond politics.

Behavioral and Brain Sciences, in press; subject to final editing before publication

This is a commentary on: Duarte, J. L., Crawford, J. T., Stern, C., Haidt, J., Jussim, L., & Tetlock, P. E. (in press). Political diversity will improve social psychological science. Behavioral and Brain Sciences.

“A liberal is a man too broadminded to take his own side in a quarrel.” — Robert Frost

Liberals may be too open-minded for their own (ideological) good; they keep finding fault with themselves, and this article is a good example. Which is not to say it's not largely correct. Social and personality psychology obviously lacks ideological diversity, and Duarte and colleagues provide strong circumstantial evidence that the causes include a hostile climate, a lack of role models, and subtle and not-so-subtle discrimination of the same sort that underlies the lack of diversity elsewhere in society.

Duarte et al. argue that our science would be better if more “conservatives” were included in the ideological mix. But the point of view that carries this label has changed greatly in recent years. Not so long ago, no conservative would dream of shutting down the government over an ideological dispute, denying the validity of settled science, or passing laws to encourage open carry of weapons on college campuses. Conservatives were conservative. Such people indeed have a lot to contribute to any discussion, including scientific ones. But many modern-day “conservatives” — especially the loudest ones — would better be described as radical, and among their radical characteristics is a pride in anti-intellectualism and willful ignorance. In a call for more conservatives, who are we actually inviting and, I truly wonder, how many even exist? I am not optimistic about the feasibility of finding enough reasonable conservatives to join our field, even if we could overcome all of the barriers the target article so vividly describes. At best, such change is a long-term goal.

In any case, we shouldn’t wait for conservatives to arrive and save us. We need to save ourselves. The target article presents mixed messages about whether de-biasing is feasible. On the one hand, it cites evidence that de-biasing is difficult or impossible. On the other hand, the entire article is an effort at de-biasing. I choose to believe the more optimistic, implicit claim of Duarte et al., which is that we can become more intellectually honest with ourselves and thereby do better science. I find the “mirror-image” test particularly promising. For any finding, we should indeed get into the habit of asking, what if the very same evidence had led to the opposite conclusion?

Politics is the least of it. In focusing on research that seeks to describe how conservatives are cognitively flawed or emotionally inadequate, or on research that treats conservative beliefs as ipso facto irrational, Duarte et al. grasp only at the low-hanging fruit. More pernicious, I believe, are the ways ideological predilections bias the conduct and evaluation of research that, on the surface, has nothing to do with politics. An awful lot of research and commentary seems to be driven by our value systems, by what we wish were true. So we do studies to show that what we wish were true is true, and attack the research of others that leads to conclusions that do not fit our world view.

Examples are legion. Consider just a few:

Personality and abilities are heritable. This finding is at last taking hold in psychology, after a century’s dominance of belief in a “blank slate.” The data were just too overwhelming. But the idea that people are different at the starting line is heartbreaking to the liberal world-view and encounters resistance even now.

Human nature is a product of evolution. Social psychologists are the last people you would expect to deny that Darwin was right — except when it comes to human behavior, and especially if it has anything to do with sex differences (Winegard et al., 2014). The social psychological alternative to biological evolution is not intelligent design, it’s culture. And as to where culture came from, that’s a problem left for another day.

The Fundamental Attribution Error is, as we all know, the unfortunate human tendency to view behavior as stemming from the characteristics — the traits and beliefs — of the people who perform it. Really, it’s the situation that matters. So, change the situation and you can change the behavior; it’s as simple as that. This belief is very attractive to a liberal world-view, and one does not have to look very far to find examples of how it is used to support various liberal attitudes towards crime and punishment, economic equality, education, and so forth. But the ideological consequences of belief in the overwhelming power of the situation are not consistent. It implies that the judges at Nuremberg committed the Fundamental Attribution Error when they refused to accept the excuse of Nazi generals that they were “only following orders.”

The consistency controversy, which bedeviled the field of personality psychology for decades and which still lingers in various forms, stems from the conviction among many social psychologists that the Fundamental Attribution Error, just mentioned, affects an entire subfield of psychology. Personality psychology, it is sometimes still said, exaggerates the importance of individual differences. But to make a very long story very short, individual differences in behavior are consistent across situations (Kenrick & Funder, 1988) and stable over decades (e.g., Nave et al., 2010). Many important life outcomes including occupational success, marital stability and even longevity can be predicted from personality traits as well as or better than from any other variables (Roberts et al., 2007). And changing behavior is difficult, as any parent trying to get a child to make his bed can tell you; altering attitudes is just as hard, as anyone who has ever tried to change anyone else’s mind in an argument can tell you. Indeed, does anybody ever change their mind about anything? Maybe so, but generally less than the situation would seem to demand. I expect that responses to the article by Duarte et al. will add one more demonstration of how hard it is to change ingrained beliefs.

REFERENCES
Kenrick, D.T., & Funder, D.C. (1988). Profiting from controversy: Lessons from the person-situation debate. American Psychologist, 43, 23-34.
Nave, C.S., Sherman, R.A., Funder, D.C., Hampson, S.E., & Goldberg, L.R. (2010). On the contextual independence of personality: Teachers’ assessments predict directly observed behavior after four decades. Social Psychological and Personality Science, 1, 327-334.
Roberts, B.W., Kuncel, N.R., Shiner, R., Caspi, A., & Goldberg, L.R. (2007). The power of personality: The comparative validity of personality traits, socioeconomic status, and cognitive ability for predicting important life outcomes. Perspectives on Psychological Science, 2, 313-345.
Winegard, B.M., Winegard, B.M., & Deaner, R.O. (2014). Misrepresentations of evolutionary psychology in sex and gender textbooks. Evolutionary Psychology, 12, 474-508.

How to Flunk Uber: A Guest Post by Bob Hogan

How to Flunk Uber

by Robert Hogan

Hogan Assessment Systems

Delia Ephron, a best-selling American author, screenwriter, and playwright, published an essay in the New York Times on August 31st, 2014 entitled “Ouch, My Personality, Reviewed”  that is a superb example of what Freud called “the psychopathology of everyday life.”  She starts the essay by noting that she recently used Uber, the car service for metrosexuals, and the driver told her that if she received one more bad review, “…no driver will pick you up.”  She reports that this feedback triggered some “obsessive” soul searching:  she wondered how she could have created such a bad score as an Uber passenger when she had only used the service 6 times.  She then reviewed her trips, noting that, although she had often behaved badly (“I do get short tempered when I am anxious”), in each case extenuating circumstances caused her behavior.  She even got a bad review after a trip during which she said very little:  “Perhaps I simply am not a nice person and an Uber driver sensed it.”

The essay is interesting because it is prototypical of people who can’t learn from experience.  For example, when Ms. Ephron reviewed the situations in which she mistreated Uber drivers, she spun each incident to show that her behavior should be understood in terms of the circumstances—the driver’s poor performance—and not in terms of her personality.  Perhaps situational explanations are the last refuge of both neurotics and social psychologists?

In addition, although the situations changed, she behaved the same way in each of them:  she complained, she nagged and micro-managed the drivers, she lost her temper, and she broadcast her unhappiness to the world.  Positive behavior may or may not be consistent across situations, but negative behavior certainly is.  And the types of negative behaviors she displayed fit the typology defined by the Hogan Development Survey (HDS), an inventory of the maladaptive behaviors that occur when people are dealing with others with less power and think no one important is watching them.

All her actions had a manipulative intent—Ms. Ephron wanted to compel a fractious driver to obey her.  Her behaviors were tactical in that they gave her short-term, one-off wins—she got her way; but the behaviors become counterproductive when she has to deal with the same people repeatedly—or when she is dealing with NYC Uber drivers.  Strategic players carefully control what Erving Goffman called "their leaky channels," the behavioral displays that provide information regarding a player's character or real self.  The tactical Ms. Ephron seems unable to control her leaky channels.

It was also interesting to learn that, although Ms. Ephron has been in psychotherapy for years, the way she mistreats “little people” seemingly never came up. This highlights the difference between intrapsychic and interpersonal theories of personality.   From an intrapsychic perspective, emotional distress creates problems in relationships; fix the emotional problems and the relationships will take care of themselves.  From an interpersonal perspective, problems in relationships create emotional distress—fix the relationships (behave better) and the emotional problems will take care of themselves.  In the first model, intrapsychic issues disrupt relationships; in the second model, disrupted relationships cause intrapsychic issues.

As further evidence that Ms. Ephron lacks a strategic understanding of social behavior, she is surprised to learn that other people keep score of her behavior.  This means that she pays no attention to her reputation.  But her reputation is the best source of data other people have concerning how to deal with her.  She might not care about her reputation, but those who deal with her do.  All the data suggest that she will have the same reputation with hair dressers, psychotherapists, and purse repair people as she does with the Uber drivers of New York.

Finally, people flunk Uber the same way as they become unemployable and then flunk life—they flunk one interaction at a time.  After every interaction there is an accounting process, after which something is added to or subtracted from peoples’ reputations.  The score accumulates over time and at some point, the Uber drivers refuse to pick them up.  Ms. Ephron is a successful artist, and her success buys her a degree of idiosyncratic credit—she is allowed to misbehave in the artistic community—but there are consequences when she misbehaves in the larger community of ordinary actors.

The Real Source of the Replication Crisis

“Replication police.” “P-squashers.” “Hand-wringers.” “Hostile replicators.”  And of course, who can ever forget, “shameless little bullies.”  These are just some of the labels applied to what has become known as the replication movement, an attempt to improve science (psychological and otherwise) by assessing whether key findings can be reproduced in independent laboratories.

Replication researchers have sometimes targeted findings they found doubtful.  The grounds for finding them doubtful have included (a) the effect is “counter-intuitive” or in some way seems odd (1), (b) the original study had a small N and an implausibly large effect size, (c) anecdotes (typically heard at hotel bars during conferences) abound concerning naïve researchers who can’t reproduce the finding, (d) the researcher who found the effect refuses to make data public, has “lost” the data or refuses to answer procedural questions, or (e) sometimes, all of the above.

Fair enough. If a finding seems doubtful, and it’s important, then it behooves the science (if not any particular researcher) to get to the bottom of things.  And we’ve seen a lot of attempts to do that lately. Famous findings by prominent researchers have been put  through the replication wringer, sometimes with discouraging results.  But several of these findings also have been stoutly defended, and indeed the failure to replicate certain prominent effects seems to have stimulated much of the invective thrown at replicators more generally.

One target of the “replication police” has enjoyed few defenders. I speak, of course, of Daryl Bem (2). His findings on ESP stimulated an uncounted number of replications.  These were truly “hostile” – they set out to prove his findings wrong and what do you know, the replications uniformly failed. In response, by all accounts, Daryl has been the very model of civility. He provides his materials and data freely and without restriction, encourages the publication of all findings regardless of how they turn out, and has managed to refrain from telling his critics that they have “nothing in their heads” or indeed, saying anything negative about them at all that I’ve seen. Yet nobody comes to his defense (3) even though any complaint anybody might have had about the “replication police” applies, times 10, to the reception his findings have received. In particular, critiques of the replication movement never mention the ESP episode, even though Bem’s experience probably provides the best example of everything they are complaining about.

Because, and I say this with some reluctance, the backlash to the replication movement does have a point.  There IS something a bit disturbing about the sight of researchers singling out effects they don’t like (maybe for good reasons, maybe not), and putting them – and only them – under the replication microscope.  And, I also must admit, there is something less-than-1000% persuasive about the inability to find an effect by somebody who didn’t believe it existed in the first place. (4)

But amidst all the controversy, one key fact seems to be repeatedly overlooked by both sides (6).  The replication crisis did NOT arise because of studies intended to assess the reality of doubtful effects.  That only came later.  Long before that came repeated and, indeed, ubiquitous stories of fans of research topics – often graduate students – who could not make the central effects work no matter how hard they tried.  These failed replications were anything but hostile.  I related a few examples in a previous post, but there's a good chance you know some of your own.  Careers have been seriously derailed as graduate students and junior researchers, naively expecting to be able to build on some of the most famous findings in their field, published in the top journals by distinguished researchers, simply couldn't make them work.

What did they do? In maybe 99% of all cases (100% of the cases I personally know about) they kept quiet. They – almost certainly correctly – saw no value for their own careers in publicizing their inability to reproduce an effect enshrined in textbooks.  And that was before a Nobel laureate promulgated new rules for doing replications, before a Harvard professor argued that failed replications provide no information and, of course, long before people reporting replication studies started to be called “shameless little bullies.”

Most of the attention given these days to the replication movement, both pro and con, seems to center around studies specifically conducted to assess whether or not particular findings can be replicated. I am one of those who believe that such studies are, by and large, useful and important.  But we should remember that the movement did not come about because of, or in order to promote, such studies.  Instead, its original motivation was to make it a bit easier and safer to report data that go against the conventional wisdom, and thereby to protect those who might otherwise waste years of their lives trying to follow up on famous findings that have already been disconfirmed, everywhere except in public.  From what we've seen lately, this goal remains a long way off.

  1. Naturally, counter-intuitiveness and oddity are in the eye of the beholder.
  2. Full disclosure: He was my graduate advisor. An always-supportive, kind, wise and inspirational graduate advisor.
  3. Unless this post counts.
  4. This is why Brian Nosek and the Center for Open Science made exactly the right move when they picked out a couple of issues of JPSP and two other prominent journals and began to recruit researchers to replicate ALL of the findings in them (5). Presumably nobody will have a vested interest or even a pre-existing bias as to whether or not the effects are real, making the eventual results of these replications, when they arrive, all the more persuasive.
  5. Full disclosure: A study of mine was published in one of those issues of JPSP. Gulp!
  6. Even here, which is nothing if not thorough.

Acknowledgements: Simine Vazire showed me that footnotes can be fun.  But I still use my caps key. Simine and Sanjay Srivastava gave me some advice, some of which I followed.  But in no way is this post their fault.

When Did We Get so Delicate?

Replication issues are rampant these days. The recent round of widespread concern over whether supposedly established findings can be reproduced began in biology and the related life sciences, especially medicine. Psychologists entered the fray a bit later, largely in a constructive way. Individuals and professional societies published commentaries on methodology, journals revised their policies to promote data transparency and encourage replication, and the Center for Open Science took concrete steps to make doing research "the right way" easier. As a result, psychology came to be viewed not as the poster child of replication problems but as quite the opposite: the best place to look for solutions to them.

So what just happened? In the words of a headline in the Chronicle of Higher Education, the situation in psychology has suddenly turned “ugly and odd.”  Some psychologists whose findings were not replicated are complaining plaintively about feeling bullied. Others are chiming in about how terrible it is that people’s reputations are ruined when others can’t replicate their work. People doing replication studies have been labeled the “replication police,” “replication Nazis” and even, in one prominent psychologist’s already famous phrase, “shameless little bullies.” This last-mentioned writer also passed along an anonymous correspondent’s description of replication as a “McCarthyite nightmare.”  More sober commentators have expressed worries about “negative psychology” and “p-squashing.” Concern has shifted away from the difficulties faced by those who can’t make famous effects “work,” and the dilemma about whether they dare to go public when this happens. Instead, prestigious commentators are worrying about the possible damage to the reputations of the psychologists who discovered these famous effects, and promulgating new rules to follow before going public with disconfirmatory data.

First, a side comment: It’s my impression that reputations are not really damaged, much less ruined, by failures to replicate. Reputations are damaged, I fear, by defensive, outraged reactions to failures to replicate one’s work. And we’ve seen too many of those, and not enough reactions like this.

But now, the broader point: When did we get so delicate? Why are psychologists, who can and should lead the way in tackling this scientific issue head-on, and until recently were doing just that, instead becoming distracted by reputational issues and hurt feelings?

Is anybody in medicine complaining about being bullied by non-replicators, or is anyone writing blog posts about the perils of “negative biology”? Or is it just us? And if it’s just us, why is that? I would really like to know the answer to this question.

For now, if you happen to be a psychologist sitting on some data that might undermine somebody’s famous finding, the only advice I can give you is this:  Mum’s the word.  Don’t tell a soul.  Unless you are the kind of person who likes to poke sticks into hornets’ nests.