Thoughts on “Ego Depletion” and Some Related Issues Concerning Replication

This brief essay was stimulated by a chapter by Baumeister (2019), which can be accessed at https://psyarxiv.com/uf3cn/.

“Fatigue,” though a common word, is far from being a boring or neglected concept. A quick search on PsycINFO reveals thousands of published articles on the subject (14,892, to be exact). A lot of this work is theoretical, more of it is applied, and all of it focuses on an experience that is common to everybody. I was particularly impressed by an article by Evans, Boggero, and Segerstrom (2016) that illuminates the connections between physical and psychological factors, and specifically addresses “how fatigue can occur even in the presence of sufficient resources.” In fact, I read their article as providing evidence that fatigue usually occurs in the presence of sufficient resources – it’s not primarily a physical phenomenon at all; it’s a psychological one. This fact has many important implications. Fascinating stuff.

The related phenomena demonstrated by many, many studies of “ego depletion” are real, and important, and I personally have no doubt whatsoever about that. When people are tired, including psychologically tired (an interesting concept in its own right), their self-control abilities wane, and prepotent responses (such as emotional lashing out, simplistic thinking, overlearned habits, and selfish impulses) tend to take over as conscious control weakens. Isn’t that pretty much what the studies show, in the aggregate? Does anybody doubt that it really happens? Has anybody out there honestly not experienced exactly these phenomena?

So why the controversy and doubt? Was it sparked by incompetent researchers with nefarious motives? No; at least, not at first. The controversy arose as one of many effects of the emergence of the replication crisis, which created an overall skepticism about many findings in social psychology, not just, or especially, ego depletion. Doubts about famous and even beloved social psychological findings first arose among researchers – many of them students – who became dismayed to discover that the neat-and-tidy looking JPSP articles they had read and admired reported findings that were surprisingly difficult to repeat in their own work. The studies certainly weren’t as easy to do as the JPSP articles would lead one to expect! I believe – and this is just a personal impression, but one based on lots of conversations in hotel bars at professional meetings going back long before anybody was seriously talking about replication issues – that doubts about many classic social psychological findings first arose in people who loved the findings and wanted to do their own studies to extend them. Examples: elderly walking, “too much choice,” the Lady Macbeth effect. These students and (mostly) young faculty almost always concluded, at least at first, that they had done something wrong and they just couldn’t figure out what. It was only when they compared notes with other researchers (often at hotel bars), when failures to replicate started to be talked about more publicly, and when work like the Reproducibility Project (Open Science Collaboration, 2015) began to be conducted, that attitudes started to shift, and people who couldn’t make a study work – and, remember, it looked so easy in the JPSP article where, always, all four studies worked! – began to think: Maybe it’s not just me.

So, a few people started to go public with their doubts, and what happened next wasn’t pretty. Researchers who reported failures to replicate famous findings were told, by professors at Yale, Harvard and Princeton (respectively), that they had “nothing in their heads,” were “shameless little bullies” and even amounted to “methodological terrorists.” (These are exact quotes.) Were these failed replicators, coming under this kind of attack, going public in order to take an easy route towards building careers on bad research? It’s hard to think so, given the responses they got – which are reminiscent of what pretty much always happens to whistleblowers.

And there was plenty to blow the whistle about. Mostly, and most obviously, overclaiming. Go back and re-read what researchers on behavioral priming used to say (maybe they still do) about how powerfully subtle cues can completely derail our behavior without our knowing it, or others writing about “wise” interventions that, with tiny tweaks, can change a lifetime of behavioral habit. Or social psychologists (many, many of them) claiming that individual differences in personality (pet peeve alert) are so transient and weak that a quick situational manipulation can wipe them out. Oh really. Then why are these studies so hard to replicate, in the cases when they can be replicated at all?

The above two paragraphs capture about where I was, say, two years ago. But my views have shifted. First, I really did see, firsthand, a couple of places where the big replication projects were being carried out, and I have to say the methods used and the quality control were, shall we say, far from optimal. My attitude shift was also catalyzed to a considerable degree by the experience I had when Kathleen Vohs invited me to participate in the SPSP presentation of her big, multi-site ego-depletion replication study (Funder, 2018). It motivated me to rethink the evaluation of effect sizes (culminating in the article I recently published with Dan Ozer; Funder & Ozer, 2019), and to observe how the fundamental misunderstanding of these numbers bedevils psychological understanding in so many ways.

For example, original reports of many findings reported effects that were just way too large to be plausible (coupled with Ns too small to yield reliable effect size estimates). So, when others tried to replicate them, of course the same effect size wasn’t there, meaning that their small-N study (which seemed like it should be enough, given the originally reported effect size) wasn’t significant, leading to the conclusion that “well, this phenomenon just doesn’t exist then.” No! Plausible effect sizes are the ones we were conditioned, over the years and too blindly, to regard as “small.” We still need to learn this lesson: real effects are “small” effects, and so we need to rethink what we consider small. And, while we’re at it, let’s recalibrate our views of the N and precision needed to decide that something doesn’t exist. I won’t go on about that here; Dan Ozer and I wrote about this in some detail, and Dan understands it on a deeper level than I do.
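To make the arithmetic concrete, here is a minimal sketch in R (using the pwr package). The specific correlations are illustrative assumptions, not estimates from any particular study: a plausibly “small” true effect of r = .20 versus an inflated original report of r = .50.

    library(pwr)

    # Power of a replication with N = 50 if the true correlation is r = .20
    pwr.r.test(n = 50, r = 0.20, sig.level = .05)$power    # roughly .29

    # N actually needed for 80% power to detect r = .20
    pwr.r.test(r = 0.20, power = 0.80, sig.level = .05)$n  # roughly 194

    # N suggested by an (inflated) original report of r = .50
    pwr.r.test(r = 0.50, power = 0.80, sig.level = .05)$n  # roughly 29

The point of the sketch is only that an N calibrated to an inflated original effect size will routinely produce “failed” replications of an effect that is perfectly real, just small.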

One other lesson I picked up from my experience at Kathleen’s SPSP symposium was less scientific and more disheartening. Some people were rooting for ego depletion to fail. They really were. I saw it on the Twitter. Some of my erstwhile friends and allies were “disappointed” (read: angry) with me for saying I think the effect is real, albeit “small,” and important. Unlike the very beginning of the replication controversy, which I think was characterized by disappointment and confusion, now it is increasingly characterized by skepticism bordering on cynicism, combined, in some cases, with a detectable smidgen of self-righteousness and even schadenfreude. [Insert the necessary qualifications, exceptions, and disavowals here. Certainly, since you are reading this essay, I don’t mean you, of all people. But there are some others out there who come close to meeting this description.]

And to say one more thing about ego depletion, specifically (because most of this little essay really concerns replication concerns in social psychology more broadly, not ego depletion): The boon and bane of the research program was and is its label. As a (semi-closeted) admirer of Freud, I’ve seen how just using the word “ego” is like the proverbial red flag in front of a bull for certain of our colleagues (none of whom know anything about Freud, by the way, but that’s another story).  Calling the phenomenon “ego depletion” got the work tons of attention but also, I suspect, made it a big, fat target. Nobody can sensibly doubt that people get mentally and even morally tired. But the word “ego” just sets some people off. If the phenomenon had been called “psychological fatigue” from the very beginning, the research wouldn’t be so famous, and it wouldn’t have come under such intense fire.

OK, one more thing about ego depletion. I have noticed that Baumeister’s chapter in the Mele book – the one that stimulated this essay – does not attempt to defend, nor does it even mention, the glucose-related findings or theory. Indeed, it seems that portion of the evidence has largely evaporated, and that part of the theory (in hindsight) was probably biologically implausible in the first place. This conclusion has contaminated views of ego depletion itself, leading to the impression that the basic phenomenon is poorly supported or even doesn’t exist. Which is wrong, of course. I suspect the only way to rescue the topic is to dump the label. Psychological and moral fatigue is real and important. It should continue to be studied and better understood. Baumeister’s recent chapter says we should find out when it does and does not occur. Yup. That, and much more.

References

Baumeister, R.F. (2019). Self-control, ego depletion and social psychology’s replication crisis. Prepared for A. Mele (Ed.), Surrounding self-control. New York: Oxford. (Appendix to Chapter 2).

Evans, D.R., Boggero, I.A., & Segerstrom, S.C. (2016). The nature of self-regulatory failure and “ego depletion”: Lessons from physical fatigue. Personality and Social Psychology Review, 20, 291-310.

Funder, D.C. (2018, March). Implications of the depletion replication study for meta-science and behavioral research. Symposium presentation, Society for Personality and Social Psychology, Atlanta.

Funder, D.C., & Ozer, D.J. (2019). Evaluating effect size in psychological research: Sense and nonsense. Advances in Methods and Practices in Psychological Science, 2, 156-168.

Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.

MIsgivings: Some thoughts about “measurement invariance”

As a newcomer to cross-cultural research a few years ago, I soon became aware of the term “measurement invariance,” which typically is given as a necessary condition for using a psychological measurement instrument, such as a personality inventory, in more than one cultural context[1]. At one of the first talks where I presented some then-new data gathered in 20 different countries, using a new instrument developed in our lab (the Riverside Situational Q-sort), a member of the audience asked, “What did you do to assess measurement invariance?” I had no real answer, and my questioner shook his head sadly.

Which, I started to realize, is kind of the generic response when these issues come up. If a researcher gathers data in multiple cultures and doesn’t assess measurement invariance, then the researcher earns scorn – from certain kinds of critics – for ignoring the issue. If the researcher does do the conventional kinds of analyses recommended to assess measurement invariance, the results are often discouraging. The RMSEAs are out of whack, the delta CFIs are bigger than .01, and oh my goodness, the item intercepts are not even close to equivalent, so scalar invariance is a total joke, not to mention the forlorn hope of attaining “strict” invariance (which sounds harsh, because it is). A participant in a symposium I recently attended exclaimed, “If you can show me some real data where strict measurement invariance was achieved across cultures, I shall buy you a beer!” He had no takers. The following message is approaching the status of conventional wisdom: the lack of equivalence in the properties of psychological measures across cultures means that they cannot be used for cross-cultural comparison, and attempts to do so are not just psychometrically ignorant, they are fatally flawed.

As I have become a bit more experienced, however, I have begun to develop some misgivings about this conclusion, and the whole business of “measurement invariance,” which I put in scare quotes because I suspect there is less there than meets the eye. Below, I shall refer to it simply as MI.

  1. The assessment of MI uses complex methods that (it appears) few researchers really understand, and it often yields dichotomous evaluative decisions based on seemingly arbitrary benchmarks. I’m one of those researchers who doesn’t fully understand conventional MI analyses, and I’ll go out on a limb to confess this only because I strongly suspect I’m not the only one. Indeed, I’m starting to recover from the imposter syndrome I used to suffer around people nodding sagely as they talk about factorial invariance, RMSEA, delta CFI, and equivalence of intercepts. And I have flashbacks to the whole debate about the arbitrariness of the .05 p-level for evaluating research results when I hear experts propound a .01 benchmark for the maximum permissible delta CFI. Where, exactly, did that come from? Does anybody really know? The only answer concerning the origin of this or other benchmarks is that some authoritative figure (or an institution such as the Educational Testing Service) published an article recommending it. But the basis of the recommendation generally remains obscure, and the (I suspect few) researchers who actually go and read said authoritative article will not necessarily be enlightened. But they will obey.
    I suggest that for most and perhaps nearly all empirical cross-cultural researchers (by which I mean the ones who actually gather data), the whole process is a black box: they dump their data into one side (such as an R program) and wait for the output on the other side, with fingers crossed. And then, almost always, they get bad news. (A minimal sketch of what the conventional sequence inside that box looks like appears just after this list.)
  2. Discussions of MI often have a prohibitionist tone. It is my impression that a “failure” (that’s the actual word used most often) to achieve MI by conventional criteria is not typically treated as a scientific finding of interest in its own right. Rather, it is more often treated as a reason, even a “violation” (another often-used word), implying that one should not take the cross-cultural data seriously or sometimes even look at them. I recently saw a paper in which, because MI was not achieved, the authors primly stated that they did not examine their data any further. No wonder they didn’t dare, in the face of a recently published warning that “widespread hidden invalidity in the measures we use… pose[s] a threat to many research findings.”
    Such a prohibitionist tone goes too far. First, the amount of non-invariance required to actually throw substantive results into question is far from clear and, as noted above, is often evaluated on the basis of mysterious and seemingly arbitrary benchmarks. Second, the implications of a “failure” of MI depend on the kind of MI one decides to insist on. Do you want to interpret correlations among measures within countries? Then configural MI is enough, and it’s indeed often found (e.g., Aluja et al., 2019). Do you want to interpret mean differences between countries? Well, then maybe (but not necessarily, see below) you do need “strict” or scalar invariance, which is a very high bar seldom attained.
    The more balanced treatments of failures of scalar invariance may say something like “all questionnaires showed some noninvariance across countries, indicating that caution needs to be exercised when investigating and interpreting mean differences.” I appreciate the careful, moderate tone of this quote, but still: what is this finely worded advice supposed to mean? That otherwise you can throw caution to the winds? I don’t think so. Assuming it means anything at all (and I’m not sure it does), I think it means that if you don’t have strict MI, your mean differences don’t mean anything. So you are prohibited from looking at them – an attitude that strikes me as, how shall I put this, anti-scientific (see point 5, below).
  3. The repeated disappointments in MI appear at odds with conclusions emerging elsewhere in cross-cultural psychology. An emerging theme, and a real surprise I think, is that cross-cultural differences in psychological attributes and processes are turning out to be smaller than was expected when this field of research really got going, a couple of decades ago. The touted fundamental differences between East and West were not only almost absurdly simplistic (Asia’s a big and diverse place, as is Europe), but also turned out to be smaller and less profound than initially assumed. China contains lots of individualists, and Europe and North America have a fair number of collectivists, and while there still might be overall differences, the distributions overlap considerably. Our own international project, with data from 64 countries, is finding that two measures of happiness, one developed in the US and the other, purported to be profoundly different, developed in Japan, yield correlates and other results that are much more similar than different around the world (and two countries in which the two measures behave especially similarly are, wait for it, the US and Japan). Our two studies of situational experience around the world, one with 20 countries and one with 64 countries, both found that individual experiences within countries were more similar to each other than experiences compared across countries, but the difference was surprisingly small and indeed, just barely reached statistical significance even with N’s in the thousands. More generally, the distinguished and pioneering cross-cultural researcher Juri Allik (2005) has written about how personality variation across countries is (unexpectedly) small compared with variation within countries (see also: Hanel, Maio & Manstead, 2018). In the face of all this, how long can we maintain a conventional wisdom that cultural variation in the basic properties of well-established measurement instruments is typically large, consequential, and maybe even fatal?
    Consider, again, the nature of cross-cultural vs. within-culture variation. One example: Perhaps, indeed, the items on the BFI-2 extraversion scale have a different meaning for someone living in Japan than they do for me. But might they not also, to some degree, have a different meaning for my next-door neighbor than they do for me? And can we assume that the former difference in meaning is really all that different from the latter difference? Juri Allik’s conclusions give reason to doubt it. Measurement instruments surely have at least somewhat different properties and implications for different individuals. But I don’t see a strong reason to presuppose that these properties and implications necessarily vary to any importantly consequential degree according to whether the individuals in question reside in the same or different countries. Maybe, sometimes, they do. But the burden of proof seems misplaced, given what we are learning about cultural variation elsewhere.
  4. Conventional assessments of MI are completely internal to the measurement instruments. That is, they assess internal validity as opposed to external validity. They focus on the structure of the latent factors of the instruments, and the degree to which this structure is maintained across contexts, and – even more stringently – the intercepts of the items on latent traits or factors.[2] This is all well and good, I suppose, but internal validity is not the same as external validity and the former is actually not even always necessary for the latter. The classic examples in personality psychology are the MMPI and the CPI (California Psychological Inventory), the scales of which have well-established validity-in-use for predicting important outcomes, but which “fail” many conventional psychometric tests of internal reliability and factorial homogeneity.
    A useful future direction, I propose, would be to move away from the almost exclusive focus on the internal properties of our measurement instruments in favor of increased emphasis on external validity. This can and should be done at both the cultural and the individual level. At the cultural level, research could assess the associations of average levels of measurements with other country-level variables (e.g., Mõttus, Allik & Realo, 2010). For example, are country-level average levels of happiness associated in sensible ways with other variables measured at the country level, such as economic and demographic indicators, or other cultural attributes such as, for example, religiosity? But lest we fall prey to the ecological fallacy, this kind of research must be complemented by investigations at the individual level, assessing to what degree and when the measure’s correlations with other psychological variables are maintained across cultural contexts. For example, does a measure of happiness correlate with other indicators of well-being within some, many, or all countries? This, to me, would be more persuasive evidence for the cross-cultural validity of a measure than even the finest demonstration of configural MI.
    Even better – and even more difficult – a measure used in more than one country could be compared in its associations with actual behavior, something that despite psychology’s self-definition as the “science of behavior” continues to get less attention than it should (Baumeister et al., 2007).  One example is the use of “anchoring vignettes,” in which respondents in different cultures report how they would respond to various situations, and then these responses are compared with their personality scores (Mõttus, 2012).  Another example is a study that assessed differences in sociability between Mexicans and Americans using naturalistic audio recordings as well as self-reports (Ramírez-Esparza et al., 2009)[3]. Research like this may lead to an eventual gold standard for cross-cultural psychology, in which behavioral data, and not just self-reports, are gathered. To do this will be difficult and expensive. But we must, sooner or later.
  5. The data are the data. This is my most important point. Researchers who go to the considerable trouble of gathering data in more than one country should not be discouraged from doing so, should not be prohibited from analyzing their data in any way they find informative, and certainly should not be disadvantaged compared to researchers who avoid cross-cultural complications by gathering data only at their home campus. Of course, interpretations should be appropriately cautious, but this warning is a truism that applies to all research of any kind. You never really know for sure what the scores on your measures mean; all you can do is try to triangulate them with other data and interpret the patterns – and even the mean differences – that emerge the best you can. This is a worthy endeavor and indeed, the essence of scientific activity. The issue of “measurement invariance” should not be allowed to inhibit it.
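For the record, here is a minimal sketch in R of the conventional MI sequence referred to in point 1 (configural, then metric, then scalar), using the lavaan package. The one-factor “happiness” model and the variable names (h1 through h4, country, and the data frame dat) are hypothetical placeholders, not an analysis of any real dataset.

    library(lavaan)

    # Hypothetical one-factor scale administered in many countries
    model <- 'happy =~ h1 + h2 + h3 + h4'

    fit_configural <- cfa(model, data = dat, group = "country")
    fit_metric     <- cfa(model, data = dat, group = "country",
                          group.equal = "loadings")
    fit_scalar     <- cfa(model, data = dat, group = "country",
                          group.equal = c("loadings", "intercepts"))

    # The benchmarks discussed above: a drop in CFI of less than .01 (and a
    # similar RMSEA) is typically taken to mean the added constraints are tolerable
    sapply(list(configural = fit_configural,
                metric     = fit_metric,
                scalar     = fit_scalar),
           fitMeasures, fit.measures = c("cfi", "rmsea"))

Whether a delta CFI of .011 rather than .009 should change anyone’s substantive conclusions is, of course, exactly the question raised above.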

 

References

Allik, J. (2005). Personality dimensions across cultures. Journal of Personality Disorders, 19, 212-232.

Aluja, A., et al. (2019). Multicultural validation of the Zuckerman-Kuhlman-Aluja Personality Questionnaire Shortened Form (ZKA-PQ/SF) across 18 countries. Assessment, doi: 10.1177/1073191119831770. [Epub ahead of print]

Baumeister, R.F., Vohs, K.D., & Funder, D.C. (2007). Psychology as the science of self-reports and finger movements.  Whatever happened to actual behavior? Perspectives on Psychological Science, 2, 396-403.

Gardiner, G., Sauerberger, K., Members of the International Situations Project, & Funder, D. (2019). Towards meaningful comparisons of personality in large-scale cross-cultural studies. In A. Realo (Ed.), In praise of an inquisitive mind: A Festschrift in honor of Jüri Allik on the occasion of his 70th birthday (pp. 123-139). Tartu: University of Tartu Press.

Hanel, P.H.P., Maio, G.R., & Manstead, A.S.R. (2018). A new way to look at the data: Similarities between groups of people are large and important. Journal of Personality and Social Psychology, 116, 541-562.

Mõttus, R., et al. (2012). Comparability of self-reported conscientiousness across 21 countries. European Journal of Personality, 26, 303-317.

Mõttus, R., Allik, J., & Realo, A. (2010). An attempt to validate national mean scores of Conscientiousness: No necessarily paradoxical findings. Journal of Research in Personality, 44, 630-640.

Plieninger, H. (2017). Mountain or molehill? A simulation study on the impact of response styles. Educational and Psychological Measurement, 77, 32-53.

Ramírez-Esparza, N., Mehl, M.R., Álvarez-Bermúdez, J., & Pennebaker, J.W. (2009). Are Mexicans more or less sociable than Americans? Insights from a naturalistic observation study. Journal of Research in Personality, 43, 1-7.

Acknowledgment

I thank several friends and colleagues for their advice, some of which I took, and some of which I ignored. For their protection I shall maintain their anonymity unless they want to go public via a comment here, on Twitter, or elsewhere.

Footnotes

[1] Other contexts for assessing measurement invariance concern possible changes in the meaning of a measurement instrument across time or for participants of different ages. I am not talking about those applications here.

[2] A new (and simpler) method of assessing the similarity in meaning of measurement instruments across cultures, developed in our lab, also is based entirely on analyses internal to the instrument itself (Gardiner et al., 2019).

[3] I have now run out of examples.

8 Words Psychologists Have Almost Ruined

Psychology has almost ruined some perfectly innocent words. In each case, the first step was to take a useful word from the English language and give it a technical meaning that did not exactly or, in some cases, even approximately, match what it meant to begin with.  This is OK as far as it goes; technical terms are useful and have to come from somewhere.  But the second step, which is taken all too often, is to forget that this was done, and act as if the word still had some or even all of its original meaning. The result: widespread confusion. Examples, starting with the most obvious:

Significant (adj.)

What it originally meant: sufficiently great or important to be worthy of attention; noteworthy[1].

How psychology uses the word: As used in “significance testing,” this word actually, and merely, means not-random. The most succinct – and accurate – interpretation of the meaning of a “significant finding” that I’ve seen is “there’s not nothing going on.”

Why it’s a problem: An undergraduate psychology student, having just taken a stats course, phones home one evening and says, “Mom, something significant happened today!”
Mom: “Oh my goodness, Sweetie, what do you mean?”
Undergraduate: “I mean, there’s less than a 5% chance that what happened was completely random!!!”
Mom: [hangs up]
Of course, this is a hypothetical situation as well as an (admittedly weak) joke. However, it does exemplify the way that “significant” findings are often interpreted and reported as if they are actually important – which is a different matter altogether (as our undergraduate’s mom seems to know).

Suggested improvement: How about we call statistically “significant” findings non-random? As in, “the difference in means had a low probability of being random (p < .05)[2].” Whether the finding is actually significant is a different issue that requires further discussion. A lot of further discussion.

Correct (v.)

What it originally meant: Put right (an error or fault).

How psychology uses the word: To remove one or more sources of variation from a score, as in, “we corrected the scores for gender, SES, and health status.” Or (and this is a real example), “before computing agreement between spouses on their ratings of marital quality, we corrected them for the overall rating of the quality of their marriage.”

Why it’s a problem: The use of the word “correct” (as a verb) implies in ordinary English that the score used to be wrong, but now it’s right. So, it sounds good to be able to say you’ve “corrected” your scores, because now they must be, well, correct.  As a result, things get “corrected” that shouldn’t be, as in cases where the source of variation being removed is an essential part of what is being measured (as in the example just mentioned), and/or when the “corrected” score has had so much of its measurement meat removed that what remains is little more than random noise.

Suggested improvement: Call “corrected” scores “adjusted” scores, instead. This invites the reader to consider whether the adjustment was justified in this case, rather than encouraging the assumption that a number that was previously wrong has been put right.
Bonus improvement: Quit using partial correlations so much. When you do use them, always also report the zero-order (non-partialled) correlation.
Extra bonus improvement: Read your Egon Brunswik about the pitfalls of experimental as well as statistical control (Brunswik, 1956).
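Here is a minimal sketch in base R of why the zero-order correlation should always accompany the “corrected” one. The variables are simulated, and the strength of the SES confound is an arbitrary assumption.

    set.seed(1)
    n   <- 200
    ses <- rnorm(n)
    x   <- 0.6 * ses + rnorm(n)   # predictor partly driven by SES
    y   <- 0.6 * ses + rnorm(n)   # outcome partly driven by SES

    # Zero-order (non-partialled) correlation
    cor(x, y)

    # "Corrected" (SES-partialled) correlation, computed from residuals
    cor(resid(lm(x ~ ses)), resid(lm(y ~ ses)))

The adjusted value is much smaller, which may be exactly what you want, or may mean you have removed an essential part of what you set out to measure; either way, the reader deserves to see both numbers.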

Reliable (adj.)

What it originally meant: Consistently good in quality or performance; able to be trusted.

How psychology uses the word: In at least four ways. (1) Test-retest reliability: Will the person get (close to) the same score on two different occasions? (2) Alternative forms reliability: Will two different versions of a measure give a person (close to) the same score? (3) Internal reliability: Do the items of a measure correlate with each other, thus indicating that they tap into the same underlying (latent) factor? (4) Inter-judge (or inter-rater) reliability: Do two (or more) raters (or judges) agree with each other?

Why it’s a problem: Notice the distance between the dictionary definition and any of its four psychological usages. A reliable measure might not be valid (students are understandably confused the first time they are taught this distinction), and the word sounds so virtuous that (I have the impression) even psychologists have more faith in a “reliable” measure than they should, based on its “reliability” alone.

Suggested improvement: Divide into at least four terms for the usages listed above. Respectively: “test-retest consistency,” “alternative forms similarity,” “item convergence,” and “inter-judge agreement.” And when aggregating items, scores, or ratings, name the exact statistic, as in “the (Cronbach’s) alpha was…” “Alpha reliability” was always redundant anyway.
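As an illustration of the narrower term, here is a minimal sketch in base R of the statistic usually hiding behind the word “reliable”: Cronbach’s alpha, computed directly from its definition. The items are simulated and purely hypothetical.

    cronbach_alpha <- function(items) {
      k <- ncol(items)
      item_vars <- apply(items, 2, var)
      total_var <- var(rowSums(items))
      (k / (k - 1)) * (1 - sum(item_vars) / total_var)
    }

    # Four simulated items sharing a common factor
    set.seed(1)
    common <- rnorm(300)
    items  <- data.frame(i1 = common + rnorm(300),
                         i2 = common + rnorm(300),
                         i3 = common + rnorm(300),
                         i4 = common + rnorm(300))
    cronbach_alpha(items)   # item convergence, and nothing more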

Variance (vs. Variation) (n.)

What it originally meant: the fact or quality of being different, divergent, or inconsistent.
Compare this with the definition of the rarely-used term,
Variation: a change or difference in condition, amount, or level, typically within certain limits.

How psychology uses the word: The word “variance” typically is used to identify what we hope to “explain” (see next entry) with our theories and statistical models. It’s the average of the squared – let me repeat that, the squared – deviations of the observed scores from their mean.

Why it’s a problem: First, look at the two definitions. Which word better describes what we want to explain with our psychological theories and statistical models? In my opinion, it’s the second word, the one we don’t use. Second, look at what the widely used term reflects. Squared deviations are computationally convenient and appropriate for certain uses, but they also change the scale of the unit being studied and are potentially distorting. For example, Indianapolis is (approximately) 1000 miles from Boston and 2000 miles from Los Angeles, which means that when you are flying from Boston to LA you are 1/3 of the way there when you get to Indianapolis. You are also 1/9 of the squared distance from Boston to LA. Which of these numbers is more meaningful, or useful for purposes of, say, calculating travel time or jet fuel consumed? Third, as the distinguished cognitive psychologist Robert Abelson once pointed out, presenting results in terms of the amount of variance explained can be misused to “highlight the explanatory weakness of an investigator’s pet variables” (Abelson, 1985, p. 129). In other words, and despite the many attempts to correct the underlying misconceptions, it can still be a devastating critique to say, for example, “your results (r = .30) only explain 9% of the variance.” Yeah, but they also explain 30% of the (unsquared) variation.
(For further development of this point, see Ozer, 1985, and Funder & Ozer, 2019.)

Suggested improvement: Except in technical contexts where (squared) variance must be used for computational purposes, characterize variation in scores in terms of the average absolute deviation from the mean, expressed in the original units. This is usually called the M.A.D. (mean absolute deviation). But my suggestion is to instead use the ordinary word “variation” to describe this… well… variation.
Bonus improvement: Stop mindlessly squaring r’s, for pity’s sake (Ozer, 1985).
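A small sketch in base R of the difference between squared and unsquared descriptions of the same numbers; the mileage figures come from the Boston/Indianapolis/Los Angeles example above, and the scores are simulated.

    # Squared vs. unsquared distance
    1000 / 3000       # 1/3 of the way there
    1000^2 / 3000^2   # 1/9 of the squared distance

    # Variance vs. mean absolute deviation (M.A.D.) for the same scores
    set.seed(1)
    scores <- rnorm(100, mean = 50, sd = 10)
    var(scores)                        # in squared score units
    mean(abs(scores - mean(scores)))   # in the original units

    # "Explains only 9% of the variance" vs. 30% of the unsquared variation
    r <- 0.30
    c(r_squared = r^2, r = r)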

Explain (v.)

What it originally meant: Make (an idea or situation) clear to someone by describing it in more detail or revealing relevant facts.

How psychology uses the word: Solely in a narrow, statistical sense, to refer to the part of the variance accounted for by another variable via a statistical model (e.g., analysis of variance, or regression).

Why it’s a problem: Psychology is in the business of explaining stuff, or trying our level best to do so. The computation of deviations (squared or not) and accounting for the source of the deviations, as a result of an experimental manipulation and/or its correlation with another variable, is part of this process. But it’s just the beginning, and the word “explained” is too pretentious a label for this data analytic step. Usage of this word also directs our attention towards “explaining” variance, as if that’s all we need to do, rather than explaining (without scare quotes) what’s really going on.

Suggested improvement: Reserve the word “explain” for when you are really explaining something. Like, with a theory. If what you are doing is just accounting for numbers, use the term “account for” instead. (Even that sounds like a bit much to me, but if it’s ok for accountants, I guess it’s ok for me.)

Predict (v.)

What it originally meant: say or estimate that (a specified thing) will happen in the future or will be a consequence of something.

How psychology uses the word: Pretty loosely, to refer to any correlation between variables that allows the value of one variable to be estimated from the value of the other.

Why it’s a problem: Maybe it’s not a big problem, but it is kind of misleading, isn’t it? Actual “prediction” isn’t done very often in basic research. I’ve seen – I’ve been guilty of writing – articles that say behavior or some other outcome can be “predicted” by a personality variable, when no actual predictions, for any particular individual, are being made or ever intended to be made. Applied work is different; among the most robust findings in personality psychology is that conscientiousness (and related traits) can predict job performance, and some of our colleagues are earning a good deal of money doing just that.

Suggested improvement: Use terms that don’t imply quite so much precognition, such as “correlated with” or “associated with” when no actual predictions are being made. Save “prediction” for contexts, such as industrial or medical settings, where an estimate of something that will happen in the future is actually being based on something that was measured today.

Error (n.)

What it originally meant: a mistake.

How psychology uses the word: In two ways. First, “error” refers to unaccounted-for (trying to follow my own advice here) variation (ibid.) in an observed variable over cases, over time, or within experimental conditions. Second, in the study of judgment, “error” refers to any deviation in human judgment from the output of a normative model, which might be anything from set inclusion logic to Bayesian statistics.

Why it’s a problem: Both uses of “error” are potentially misleading, but the consequences seem more dire in the second usage. In the first case, one occasionally sees lay summaries of research (e.g., in the media) that get confused by the word “error” and seem to assume that if there is some in the data, then the conclusions of the study are a mistake. In the second case, and far worse, a whole generation of cognitive social psychologists came to write as if, and spread the idea that, human judgment is fundamentally characterized by woeful shortcomings. The research on which this idea was based showed that in certain experimental circumstances humans can be induced to produce judgments that deviate from the calculations of a putatively normative model. However, the degree to which such models are realistic or even normative outside of a very narrow range of predefined conditions is, shall we say, a matter of controversy. As I wrote once, errors (as psychologists have historically often used the term) are not necessarily, nor probably even usually, mistakes (Funder, 1987).

Suggested improvement: Just stop using the word “error” except in the rare cases where you have incontrovertible grounds to ascribe a mistake. Possible substitutes for the first usage: random variation, noise. Possible substitute for the second usage: deviation from the prescriptive model.

Self (n.)

What it originally meant: a person’s essential being that distinguishes them from others, especially considered as the object of introspection or reflexive action.

How psychology uses the word: Usually, as part of a hyphenated label that denotes any of a large number of research areas, each of which has its own, sometimes sizable literature; such as: self-esteem (the winner by far), self-efficacy, self-awareness, self-determination, self-discrepancy, self-control… the “Wiktionary” (an online resource I’ve only recently discovered) lists 383 English words prefixed with self.  Not all of these are areas of psychological research, but my goodness, an awful lot of them are.

Why it’s a problem: Two reasons. First, the many areas of inquiry labeled with self-hyphen are only loosely related to each other, if that, which means the “study of the self” doesn’t have much if any central meaning. Second, the actual topic of the “self” is potentially very interesting – what is the essential core of a human’s individuality? But – apart from among a few lonely humanistic psychologists and fans of William James – this core topic is among the elephants in the room that psychology ignores as it focuses, instead, on the many and more limited self-hyphen topics.

Suggested improvement: Reserve the term “self” for discussions of an individual’s essential being that distinguishes them from others.  Substitute the term “own” for more limited uses, as in “own esteem,” “own efficacy,” “own perception,” etc. Warning: it is almost certainly much, much too late to implement this suggestion or, if I’m being realistic, any of the others.

Coda

Even if the improvements suggested here have no hope of taking hold – which I think is a realistic expectation – I do hope this little essay might help make us a little more thoughtful every time we hear or use these 8 pesky words.

 References

Abelson, R.P. (1985). A variance explanation paradox: When a little is a lot. Psychological Bulletin, 97, 129-133.

Brunswik, E. (1956). Perception and the representative design of psychological experiments. Berkeley: University of California Press.

Funder, D.C. (1987). Errors and mistakes: Evaluating the accuracy of social judgment. Psychological Bulletin, 101, 75-90.

Funder, D.C., & Ozer, D.J. (2019). Evaluating effect size in psychological research: Sense and nonsense. Advances in Methods and Practices in Psychological Science, 2, 156-168.

Ozer, D.J. (1985). Correlation and the coefficient of determination. Psychological Bulletin, 97, 307-315.

[1] Source of this and other definitions: Lexico (Oxford) online dictionary.

[2] Or, more precisely, “a low probability of arising by chance if there really is no difference,” but that seems kind of wordy.

 

Replication and Open Science for Undergraduates

(Draft of material for forthcoming The Personality Puzzle, 8th edition. New York: W.W. Norton).

[Note: These are two sections of a chapter on Research Methods, and the first section follows a discussion of Null Hypothesis Significance Testing (NHST) and effect size.]

Replication

Beyond the size of a research result, no matter how it is evaluated, lies a second and even more fundamental question:  Is the result dependable, something you could expect to find again and again, or did it merely occur by chance? As was discussed above, null hypothesis significance testing (NHST) is typically used to answer this question, but it is not really up to the job. A much better indication of the stability of results is replication. In other words, do the study again. Statistical significance is all well and good, but there is nothing quite so persuasive as finding the same result repeatedly, with different participants and in different labs (Asendorpf et al., 2013; Funder et al., 2014).[1]

The principle of replication seems straightforward, but it has become remarkably controversial in recent years, not just within psychology, but in many areas of science. One early spark for the controversy was an article, written by a prominent medical researcher and statistician, entitled “Why most published research findings are false” (Ioannidis, 2005). That title certainly got people’s attention! The article focused on biomedicine but addressed reasons why findings in many areas of research shouldn’t be completely trusted. These include the proliferation of small studies with weak effects, researchers reporting only selected analyses rather than everything they find, and the undeniable fact that researchers are rewarded, with grants and jobs, for studies that get interesting results. Another factor is publication bias, the fact that studies with strong results are more likely to be published than studies with weak results – leading to a published literature that makes effects seem stronger than they really are (Polanin, Tanner-Smith & Hennessy, 2016).

Worries about the truth of published findings spread to psychology a few years later, in a big way, when three things happened almost at once. First, an article in the influential journal Psychological Science outlined how researchers could make almost any data set yield significant findings through techniques such as deleting unusual responses, adjusting results to remove the influence of seemingly extraneous factors, and neglecting to report experimental conditions or even whole experiments that fail to get expected results (Simmons, Nelson & Simonsohn, 2011). Such questionable research practices (QRPs) have also become known as p-hacking, a term which refers to hacking around in one’s data until one finds the necessary degree of statistical significance, or p-level, that allows one’s findings to be published. To demonstrate how this could work, Simmons and his team massaged a real data set to “prove” that listening to the Beatles song “When I’m Sixty-Four” actually made participants younger!
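To see how quickly this adds up, here is a toy simulation in R of just one questionable practice: measuring several outcomes and reporting whichever one “works.” It illustrates the general idea only; it is not a reproduction of the Simmons et al. (2011) analyses, and for simplicity the five outcomes are simulated as independent.

    set.seed(1)
    n_studies   <- 5000
    n_per_group <- 20
    n_dvs       <- 5     # five outcome measures, none truly affected

    false_positive <- replicate(n_studies, {
      pvals <- replicate(n_dvs,
        t.test(rnorm(n_per_group), rnorm(n_per_group))$p.value)
      min(pvals) < .05   # report only the "best" outcome
    })

    # Far above the nominal .05: theoretically 1 - .95^5, about .23,
    # when the outcomes are independent
    mean(false_positive)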

Coincidentally, at almost exactly the same time, the prominent psychologist Daryl Bem published an article in a major journal purporting to demonstrate a form of ESP called “precognition,” reacting to stimuli that are presented in the future (Bem, 2011). And then, close on the heels of that stunning event, another well-known psychologist, Diederik Stapel, was exposed for having become famous on the basis of studies in which he, quite literally, faked his data (Bhattacharjee, 2013). The two cases of Bem and Stapel were different because nobody suggested that Bem faked his data, but nobody seemed to be able to repeat his findings, either, suggesting that flawed (but common) practices of data analysis were to blame (Wagenmakers, Wetzels, Borsboom & van der Maas, 2011), in particular, various kinds of p-hacking.  For example, it was suggested that Bem might have published only the studies that successfully demonstrated precognition, not the ones that failed.  One thing the two cases did have in common was that the work of both researchers had passed through the filters of scientific review that were supposed to ensure that published findings can be trusted.

[8E Cartoon 3-3]

 

And this was just the beginning. Before too long, many seemingly well-established findings in psychology were called into question when researchers found that they were unable to repeat them in their own laboratories (Open Science Collaboration, 2015). One example is a study that I described in previous editions of this very book, which purported to demonstrate a phenomenon sometimes called “elderly priming” (Anderson, 2015; Bargh, Chen & Burrows, 1996). In the study, some college student participants were “primed” with thoughts about old people by having them unscramble words such as “DELKNIRW” (wrinkled), “LDO” (old), and (my favorite) “FALODRI” (Florida). Others were given scrambles of neutral words such as “thirsty,” “clean,” and “private.” The remarkable – even forehead-slapping – finding was that when they walked away from the experiment, participants in the first group moved more slowly down the hallway than participants in the second group! Just being subtly reminded about concepts related to being old, it seemed, is enough to make a person act old.

I reported this fun finding in previous editions because the measurement of walking speed seemed like a great example of B-data, as described in Chapter 2, and also because I thought readers would enjoy learning about it.  That was my mistake!  The original study was based on just a few participants[2] and later attempts to repeat the finding, some of which used much larger samples, were unable to do so (e.g., Anderson, 2015; Doyen, Klein, Pichon & Cleeremans, 2012; Pashler, Harris & Coburn, 2011).  In retrospect, I should have known better.  Not only were the original studies very small, but the finding itself is so remarkable that extra-strong evidence should have been required before I believed it.[3]

The questionable validity of this finding and many others that researchers tried and failed to replicate stimulated lively and sometimes acrimonious exchanges in forums ranging from academic symposia and journal articles to impassioned tweets, blogs, and Facebook posts.  At one point, a prominent researcher referred to colleagues attempting to evaluate the replicability of published findings as “shameless little bullies.” But for the most part, cooler heads prevailed, and insults gave way to positive recommendations for how to make research more dependable in the future (Funder et al., 2014; Shrout & Rodgers, 2018).  These recommendations include using larger numbers of participants than has been traditional, disclosing all methods, sharing data, and reporting studies that don’t work as well as those that do.  The most important recommendation – and one that really should have been followed all along – is to never regard any one study as conclusive proof of anything, no matter who did the study, where it was published, or what its p-level was (Donnellan & Lucas, 2018).  The key attitude of science is – or should be – that all knowledge is provisional. Scientific conclusions are the best interpretations that can be made on the basis of the evidence at hand.  But they are always subject to change.

[Discussion of ethics follows, including deception and protection of research subjects, and then this section:]

Honesty and Open Science

Honesty is another ethical issue common to all research. The past few years have seen a number of scandals in physics, medicine, and psychology in which researchers fabricated their data; the most spectacular case in psychology involved the Dutch researcher Diederik Stapel, mentioned earlier. Lies cause difficulty in all sectors of life, but they are particularly worrisome in research because science is based on truth and trust. Scientific lies, when they happen, undermine the very foundation of the field. If I report about some data that I have found, you might disagree with my interpretation—that is fine, and in science this happens all the time. Working through disagreements about what data mean is an essential scientific activity. But if you cannot be sure that I really even found the data I report, then there is nothing for us to talk about. Even scientists who vehemently disagree on fundamental issues generally take each other’s honesty for granted (contrast this with the situation in politics). If they cannot, then science stops dead in its tracks.

In scientific research, complete honesty is more than simply not faking one’s data.  A lesson that emerged from the controversies about replication, discussed earlier, is that many problems arise when the reporting of data is incomplete, as opposed to false.  For example, it has been a not-uncommon practice for researchers to simply not report studies that didn’t “work,” i.e., that did not obtain the expected or hoped-for result.  And, because of publication bias, few journals are willing to publish negative results in any case.  The study failed, the reasoning goes, which means something must have gone wrong.  So why would anybody want to hear about it?  While this reasoning makes a certain amount of sense, it is also dangerous, because reporting only the studies that work can lead to a misleading picture overall.  If fifty attempts to find precognition fail, for example, and one succeeds, then reporting the single success could make it possible to believe that people can see into the future!

A related problem arises when a researcher does not report results concerning all the experimental conditions, variables, or methods in a study. Again, the not-unreasonable tendency is to report only the ones that seem most meaningful, and to omit aspects of the study that seem uninformative or confusing. In a more subtle kind of publication bias, reviewers and editors of journals might even encourage authors to focus their reports only on the most “interesting” analyses. But, also again, a misleading picture can emerge if a reader of the research does not know what methods were tried or what variables were measured that did not yield meaningful results. In short, there is so much flexibility in the ways a typical psychology study can be analyzed that it’s easy – much too easy – for researchers to inadvertently “p-hack,” which, as mentioned earlier, means that they keep analyzing their data in different ways until they get the statistically significant result that they need (Simmons, Nelson & Simonsohn, 2011).

The emerging remedy for these problems is a movement towards what is becoming known as open science, a set of practices intended to move research closer to the ideals on which science was founded. These practices include fully describing all aspects of all studies, reporting studies that failed as well as those that succeeded, and freely sharing data with other scientists. An institute called the Center for Open Science has become the headquarters for many efforts in this direction, offering internet resources for sharing information. At the same time, major scientific organizations such as the American Psychological Association are establishing new guidelines for full disclosure of data and analyses (Appelbaum et al., 2018), and there is even a new organization, the Society for the Improvement of Psychological Science (SIPS), devoted exclusively to promoting these goals.

[1] R.A. Fisher, usually credited as the inventor of NHST, wrote “we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment that will rarely fail to give us a statistically significant result” (1966, p. 14).

[2] Actually, there were two studies, each with 30 participants, which is a very small number by any standard.

[3] The astronomer Carl Sagan popularized the phrase “extraordinary claims require extraordinary evidence,” but he wasn’t the first to realize the general idea.  David Hume wrote in 1748 that “No testimony is sufficient to establish a miracle, unless the testimony be of such a kind, that its falsehood would be more miraculous than the fact which it endeavors to establish” (Rational Wiki, 2018).

 

 

Thresholds

Part One

I’ve been suffering an acute bout of cognitive dissonance lately, finding myself disagreeing with people I admire, specifically, several of the authors of this article (Benjamin et al., 2018). (The article has 72 authors and I don’t know all of them.) The gist of the article can be stated very simply and in the authors’ own words: “We propose to change the default P-value threshold for statistical significance for claims of new discoveries from .05 to .005.” This proposal is soberly and clearly argued, and the article makes some good points, the best of which is that, imperfect as this change would be, at least it’s a step in the right direction. But I respectfully disagree. Here’s why.

I’m starting to think that p-levels should all be labeled “for entertainment purposes only.” They give a very, very rough idea of the non-randomness of your data, and they are kind of interesting to look at. So they’re not completely useless, but they are imprecise at best and almost impossible to interpret at worst*, and so they should be treated as only one among many considerations when we decide what we as scientists actually believe. Other considerations (partial list): prior probabilities (also very rough!), effect size, measurement precision, conceptual coherence, consistency with related findings, and (hats off please) replicability.

Thresholds are an attempt — independently of the other considerations just listed — to let our numbers do our thinking for us. I get why: it’s an attempt to prevent us from fooling ourselves. Thresholds also give editors and reviewers a seemingly objective criterion for publication, and so make their jobs easier (or maybe even possible). But statistical thresholds are not up to the job we so badly want them to do. Even if your p-level is .005, you still have to decide whether to believe the finding it pertains to. As the existentialists say, the only choice you cannot make is the choice not to choose.

This is dissatisfying, I know. If only we had a way to objectively say which findings we should believe, and which ones we shouldn’t!  But we don’t, and pretending we do (which includes playing with the criteria by which we do it) is not, in my view, a step in the right direction.

Part Two

In an interesting email exchange with my friend Simine Vazire**, one of the authors of the article, she replied to my comments above with “I think all of your objections apply to the .05 threshold as much as the .005 threshold.  If that’s true, then your objection is not so much with lowering the threshold, but with having one at all.”

Bingo, that’s exactly right.

Simine also went on to challenge me with a thought experiment.  If I knew two labs, one of which regarded a discovery in their data as real if it attained p < .05, and the other only when the finding attained p < .001, which lab’s finding would I be more inclined to believe?

This thought experiment is indeed challenging. My answer artfully dodges the question. Rather than attending to their favorite p-values, I’d be inclined to believe the lab whose results make more sense and/or have practical implications, not to mention the one whose findings can be replicated. I would also be more skeptical of the lab with lots of “counter-intuitive” findings, which is where Bayes (and the notion of prior probability) comes in.*** Furthermore, I wouldn’t really believe that the two labs, with their ostensibly different NHST p-thresholds, were actually behaving much differently on that score, so that wouldn’t influence my choice very much if at all. This is where my cynicism comes in. I’m as doubtful that the second lab would really throw away a .002 as I am that the first lab would ignore a .06.

Full disclosure: I use rough thresholds myself; for a long time (before Ryne Sherman, then a member of our lab, wrote the R code to let us do randomization tests), our lab had the rule of not even looking at tables of (100-item) Q-sort correlates unless at least 20 of them were significant at p < .10. So that’s two thresholds in one! But this was, and remains, only one of many considerations on the way to figuring out what (we thought) the data meant. And this is actually what I mean when I say we should use p-levels for entertainment purposes only.
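For what it’s worth, here is the back-of-the-envelope logic behind that rule, in one line of R. It is only a rough guide, not the randomization test itself, because it treats the 100 tests as independent, which Q-sort correlates are not.

    # If none of the 100 correlates reflect true effects, how often would 20 or
    # more of them reach p < .10 just by chance?
    pbinom(19, size = 100, prob = 0.10, lower.tail = FALSE)   # well under .01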

I appreciate the argument that decades of discussion have done little to change how people think and to move them away from using arbitrary thresholds for significance, and I have to admit it’s even possible that lowering the threshold might do some good.  For me, though, the best outcome might be that the paper could stimulate a new conversation about whether we should have thresholds at all.  As my Riverside colleague Bob Rosenthal likes to say, surely god loves .06 almost as much as she loves .05 (or, if you prefer, .006 almost as much as .005). And when you start thinking like that, it’s pretty hard to take a threshold, any threshold, seriously.

Footnotes

*as I learned from listening to Paul Meehl’s lectures online, the whole notion of “probability” is philosophically very fraught, and the conversion of the notion of probability (however one understands it) into actual numbers calculated to several digits of precision is even more fraught.

**quoted with permission

***counter-intuitive findings have low prior probabilities, by definition.  See Bargain Basement Bayes.

Why doesn’t personality psychology have a replication crisis?

Because It’s Boring

“[Personality psychology] has reduced the chances of being wrong but palpably increased the fact of being boring. In making that transition, personality psychology became more accurate but less broadly interesting.”  — Roy Baumeister (2016, p. 6)

Many fields of research – not just social psychology but also biomedicine, cancer biology, economics, political science, and even physics – are experiencing crises of replicability.  Recent and classic results are challenged by reports that when new investigators try to repeat them, often they simply can’t.  This fact has led to gnashing of teeth and rending of garments, not to mention back-and-forth controversies pitting creativity against rigor (see the article quoted in the epigraph), and spawned memorable phrases such as “replication police” and “shameless little bullies.”

But, as the quote above attests, personality psychology seems to be immune.  In particular, I am not aware of any major finding (1) in personality psychology that has experienced the kind of assault on its reliability that has been inflicted upon many findings in social psychology (2).  Why not?  Is it because personality psychology is boring?  Maybe so, and I’ll come back to that point at the end, but first let’s consider some other

Possible Reasons Personality Psychology Does Not Have a Replication Crisis

1. Personality Psychology Takes Measurement Seriously

The typical study in personality psychology measures some attribute of persons (usually a personality trait) and also measures an outcome such as a behavior, a level of attainment, or an indicator of mental or physical health.  Even though everyone chants the mantra “correlation is not causality,” research ordinarily proceeds on the (generally reasonable) presumption that the trait can be thought of as the independent variable, and the outcome as the dependent variable.  The IV is measured with several different indicators (items) and its reliability is calculated and reported.  The same practice is followed with the DV, and converging evidence is conventionally required that both the IV and the DV are reasonably good indicators of the constructs they are supposed to represent. Compared to other areas of psychology, the N is usually pretty large too.
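
For readers who want to see what “reliability is calculated and reported” looks like in practice, here is a minimal sketch of one common index, Cronbach’s alpha; the data and the four-item scale are invented purely for illustration, and alpha is only one of several ways personality researchers index reliability.

```python
# A minimal sketch of an internal-consistency (Cronbach's alpha) calculation
# for a multi-item trait measure. The data below are invented for illustration.
import numpy as np

# rows = respondents, columns = items on a (hypothetical) short trait scale
items = np.array([
    [4, 5, 4, 3],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [4, 4, 4, 5],
])

k = items.shape[1]                              # number of items
item_variances = items.var(axis=0, ddof=1)      # variance of each item
total_variance = items.sum(axis=1).var(ddof=1)  # variance of scale totals

alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
print(f"Cronbach's alpha = {alpha:.2f}")        # about .9 for these made-up data
```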

Contrast this with the typical study in social psychology.  Many have only two levels of the IV, the two experimental conditions (experimental and control); maybe there are three or four if the experiment is extra-fancy.  But the typical IV is scaled as merely high or low, or even present or absent. For example, subjects might be asked to unscramble words that do or do not have certain classes of content embedded within them.  Neither the reliability nor the generalizability of this manipulation is assessed (would the manipulation have the same effect if used more than once? Is it related to, or does it have the same effect as, other manipulations of ostensibly the same psychological variable?), much less its size.  The DV might get a bit more attention, in part because, unlike the IV, it usually has more than two values (e.g., walking speed), so the reliability of its measurement (say, by two RAs) might be reported, but the wider generalizability (aka construct validity) of the DV remains unexamined.

And I won’t even mention the problems of low power that go along with small-N studies, and the resulting imprecision of the results.  That one has been hashed out elsewhere, at length, so as I said, I’m not mentioning it.

A truism within personality psychology is that good research begins with good measurement, of both the dependent and independent variables.  Not all areas of psychology pay as much attention to this principle.

2. Personality Psychology Cares about Effect Size

Results in personality psychology are always reported in terms of effect size, usually the familiar Pearson correlation coefficient.  Social psychology is different (3); social psychologists often state that they don’t care about effect size because in the context of their research the number is nearly meaningless anyway.  The argument goes like this: Because the experimental conditions are designed to be as different from each other as possible, in order to maximize chances of finding out whether anything will happen at all, and also because experiments, by design, control for things that covary in nature, the sizes of the resulting effects don’t really mean anything outside of the experimental context.  All that matters, for purposes of further theory development, is that an effect is found to exist.  The size is only important if you are doing applied work (4).

I actually think this argument has a point, but it reveals an essential limitation of the two-group experiment.  The method can be informative about the direction of causality, and the direction of the effect (positive or negative).  But it can tell us little or nothing about how big, robust and yes, replicable, this finding will turn out to be.

In contrast, close attention to measurement has produced a research literature establishing that

3. Many Key Findings of Personality Psychology are Robustly Replicable

These include:

  • Behavior is consistent across situations
  • Personality predicts longevity, job performance, relationship satisfaction and many other important life outcomes
  • Acquaintances largely agree with each other about the personality traits of the people they know well
  • People agree (with some interesting exceptions) with their acquaintances’ assessments of their personality
  • Measures of personality predict central tendencies of density distributions of behavior (for example, a trait measure of extraversion can predict how many extraverted behaviors you will display, on average)
  • Much of the information (not all) in the 17,953 trait words in the unabridged English dictionary can be reduced to the “Big Five” basic traits: Extraversion, Neuroticism, Agreeableness, Conscientiousness, and Openness to Experience.

This is a very partial list.  But lest I be accused of bias (5), I will also note that:

4. Too Many Findings in Personality Psychology are Robust but Trivial

I actually co-authored a paper with the author of the epigraph above (Baumeister, Vohs & Funder, 2007) that, among other things, took personality psychology to task on this very point.  A lot – too much – research in personality psychology correlates one self-report with another self-report.  Can you say “method variance?”  I’ve done such studies myself; they have their uses, and sometimes they are all one can do, so my overall attitude is forgiving, even while I also believe that there truly is something to forgive.

Trivial findings will replicate! Correlations among different self-report scales can be expected to robustly replicate because the relationships are often built right into the content of the scales themselves.

Studies with self-report scales are common in part because they are so easy to do, but

5. Many Important Findings in Personality Psychology are Very Difficult to Repeat

Some of these findings come from longitudinal studies, in which individuals are repeatedly assessed over long periods of time.  These studies have shown that conscientious people live longer and that the consistency of individual differences is maintained over decades, and have also charted secular trends in personality development, showing how traits such as extraversion and conscientiousness wax and wane over the lifespan.   These findings have been replicated, but only “conceptually” because no two major longitudinal studies have ever used exactly the same methods.  A skeptic would literally need decades to really double-check them.

Other findings might not take decades to reproduce, but they are still no walk in the park.  Consider a study from my lab (Fast & Funder, 2008).  This study was actually in one of the issues of JPSP that was targeted by the Center for Open Science replication project.  But nobody tackled it.  Why not?  Our study examined correlations between personality, as judged by peers, and the frequency with which people used words in different categories during a life-history interview.  To replicate this study, here’s all you have to do: Recruit a sample of 250 undergraduates.  Recruit two peers of each of them to describe their personalities (500 peers in all).  Subject each of these 250 students to a one-hour life-history interview conducted by a licensed clinical psychologist.  Transcribe the recordings of these interviews, delete the interviewer’s comments, and clean up each transcript so that it can undergo linguistic analysis.  Run the transcripts through a linguistic analysis program (we used LIWC) and see which categories of word use are related to personality, as judged by peers. Gathering the data in this project took two years.  Transcribing the interviews and cleaning the transcriptions took another two years, and the analyses took around a year beyond that: about five years of work, in all.  I do NOT know whether the findings would replicate, though we tried hard to use internal checks to reveal results that were as robust as possible.  I would seriously love to see someone else do the same study to see if our results hold up.  What do you think the chances are that anyone ever will?

The kinds of non-trivial studies that Baumeister, Vohs and I advocated, studies that gather direct measurements of observed and meaningful behavior, are difficult to do, especially with a sufficiently large N, and commensurately difficult to replicate.  I’d like to think — in fact, I do think — that most of these findings would survive direct replication, but who really knows? Hardly anybody has the time, resources, or sufficient motivation to check. In the meantime, these findings remain immune to the replication controversy.

But, going back to the opening quotation, there is one more reason why personality psychology has avoided a replication crisis, and I believe this reason is the most important of all.

6. Personality Psychology Is Not Afraid to be Boring

Modern personality psychology (since 1950 or so) has never really striven to be clever, or cute, or counter-intuitive.  Its historic goal has been to be useful.  The gold standard in personality research is prediction (6).  Personality is measured in order to predict – and understand – behaviors and life outcomes, in order to be useful in psychological diagnosis, personnel selection, identification of at-risk individuals, career counseling, mental health interventions, improvements in quality of life, and many other purposes. Occasionally the findings are surprising, such as the now well-established fact that the trait of conscientiousness predicts not only longevity, but also job performance in every single occupation where it has ever been tested.  Nobody expected its implications to be so far-reaching.  The Big Five personality traits are not exactly surprising, but they aren’t obvious either.  If they were, it wouldn’t have taken 60 years of research to find them, and there wouldn’t still be controversy about them. Still,  studies in personality psychology typically lack the kind of forehead-slapping surprise value that characterizes many of the most famous (and problematical) findings in social psychology.

According to Bargain Basement Bayesian analysis, counter-intuitive findings have low prior probabilities, by definition.  And thus, in the absence of extra-strong evidence, they are unlikely to be true and therefore unlikely to be replicable. I am about the one-hundred-thousandth blogger to observe that ignoring this principle has gotten social psychology into tons of trouble.  In contrast, the fact that personality psychology never saw counter-intuitive (or, as some might put it, “interesting”) findings as its ultimate goal seems to have turned out to be a protective factor.

Conclusion

Admittedly, some of the advantages of personality psychology are bugs and not features. It isn’t particularly salutary that so many of personality psychology’s findings are trivially replicable because they amount to intercorrelations of self-report scales. And the fact that some of the most interesting findings are almost immune to replication studies, because they are so difficult to repeat, does not necessarily mean all of those findings are true.  Despite appearances, personality psychology probably has replicability issues too.  They are just harder to detect, which makes it even more important for personality researchers to get it right the first time.  Nobody may ever come this way again.

Here’s another quote from the same article excerpted at the beginning of this post:

Social psychology might think carefully about how much to follow in personality psychology’s footsteps. Our entire field might end up being one of the losers.

Well, none of us wants to be a “loser,” but the present comparison of the neighboring disciplines leads to a different conclusion.  Social psychology (and science generally) might in fact do well to draw a few lessons from personality psychology: Take measurement seriously. Use large samples.  Care about effect size.  And don’t be afraid to be boring.  More exactly, push back against the dangerous idea that findings have to be surprising or counter-intuitive to be interesting.  How “interesting,” in the end, is a flashy finding that nobody can replicate?

 Footnotes

(1) Or, to be honest, any study at all, but I’m trying to do a little CYA here.

(2) To name a few: elderly priming, money priming, too much choice, glucose as a remedy for ego depletion, cleanliness and moral judgment, bathing and loneliness, himmicanes, power posing, precognition (this last finding might not really belong to social psychology, but it was published in the field’s leading empirical outlet).

(3) Even when effect sizes are reported, as required by many journal policies, they are otherwise typically ignored.

(4) This is NOT a straw man.  I used to think it was.  See my earlier blog post, which includes verbatim quotes from a prominent (anonymous) social psychologist.

(5) This will happen anyway; see footnote 7.

(6) In the opinion of many, including myself, the best graduate-level textbook ever written on this topic was Personality and Prediction (Wiggins, 1973).  It says a lot about measurement.  Everybody should read it.

(7) An indicator of my bias: I have written a textbook in personality psychology, one which, by the way, I tried very hard to make not-boring.

Acknowledgment

Ryne Sherman and Simine Vazire gave me some helpful tips, but none of this post should be considered their fault.

References

Baumeister, R.F. (2016). Charting the future of social psychology on stormy seas: Winners, losers, and recommendations. Journal of Experimental Social Psychology.

Baumeister, R.F., Vohs, K.D., & Funder, D.C. (2007). Psychology as the science of self-reports and finger movements: Whatever happened to actual behavior? Perspectives on Psychological Science, 2, 396-403.

Fast, L.A., & Funder, D.C. (2008). Personality as manifest in word use: Correlations with self-report, acquaintance-report, and behavior. Journal of Personality and Social Psychology, 94, 334-346.

Wiggins, J.S. (1973). Personality and prediction: Principles of personality assessment. Reading, MA: Addison-Wesley.

What if Gilbert is Right?

I. The Story Until Now (For late arrivals to the party)
Over the decades, since about 1970, social psychologists conducted lots of studies, some of which found cute, counter-intuitive effects that gained great attention. After years of private rumblings that many of these studies – especially some of the cutest ones – couldn’t be replicated, a crisis suddenly broke out into the open (1). Failures to replicate famous and even beloved findings began to publicly appear, become well known, and be thoroughly argued-over, not always in the most civil of terms. The “replicability crisis” became a thing.
But how bad was the crisis really? The accumulation of anecdotal stories and one-off failures to replicate was perhaps clarified to some extent by a major project organized by the Center for Open Science (COS), published last November, in which labs around the world tried to replicate 100 studies and, depending on your definition, “replicated” only 36% of them (2).
In the face of all this, some optimists argued that social psychology shouldn’t really feel so bad, because failed replicators might simply be incompetent, if not actually motivated to fail, and the typical cute, counter-intuitive effect is a delicate flower that can only bloom under the most ideal climate and careful cultivation. Optimists of a different variety (including myself) also pointed out that psychology shouldn’t feel so bad, but for a different reason: problems of replicability are far from unique to our field. Failures to reproduce key findings have come to be seen as serious problems within biology, biochemistry, cardiac medicine, and even – disturbingly – cancer research. It was widely reported that the massive biotech company Amgen was unable to replicate 47 of 53 seemingly promising cancer biology studies. If we have a problem, we are far from alone.

II. And Then Came Last Friday’s News (3)
Prominent psychology professors Daniel Gilbert and Tim Wilson published an article that “overturned” (4) the epic COS study. Specifically, their reanalysis concluded that the study not only didn’t persuasively show that most of the studies it addressed couldn’t be replicated, its data were actually consistent with the possibility that all of the studies were replicable! The article was widely reported not just in press releases but in outlets including the Washington Post, Wired, the Atlantic online, and the Christian Science Monitor, to name just a few.
Psychologists who had been skeptical of the “replication movement” all along – come on, we know who you are – quickly tweeted, Facebooked and otherwise cheered the happy news. Some even began to wonder out loud whether “draconian” journal reforms adopted to enhance replicability could now be repealed. At the same time, and almost as quickly, members of the aforesaid replication movement – come on, we know who you are too (5) – took close looks at the claims by Gilbert and Co., and within 48 hours a remarkable number of blogs and posts (6) began to refute their statistical approach and challenge the way they summarized some of the purported flaws of the replication studies. I confess I found most of these responses pretty persuasive, but that’s not my point for today. Instead my point is:

III. What if Gilbert is Right?
Let’s stipulate, for the moment, that Gilbert and Co. are correct that the COS project told us nothing worth knowing about the replicability of social psychological research. What then?

IV. The COS Study Is Not the Only, and Was Far From the First, Sign that We Have A Problem.
One point I have seen mentioned elsewhere – and I’ll repeat it here because it’s a good point – is that the COS project was far from being the only evidence that social psychology has a replicability problem. In fact, it came after, not before, widespread worry had been instigated by a series of serious and compelling failures to reproduce very prominent studies, and many personal reports of research careers delayed if not derailed by the attempt to follow up on lines of research that only certain members of the in-crowd knew were dead ends. As this state of affairs became more public over the past couple of years, the stigma of failing to replicate some famous psychologist’s famous finding began (not entirely!!) to fall away, and a more balanced representation of what the data really show, on all sorts of topics, began to accumulate in public file drawers, data repositories, and outlets for replication studies. The COS study, whatever its merits, came on top of all that, not as its foundation.

V. Other Fields Have Replicability Problems Too
A point I haven’t, in this context, seen mentioned yet – and my real motivation for writing this post – is that – remember! – the replication crisis was never exclusive to psychology in the first place. It has affected many other fields of research as well. So, if Gilbert & Co. are right, are we to take it that the concerns in our sister sciences are also overblown? For example, was Amgen wrong? Were all those cancer biology studies perfectly replicable after all? Do biochemistry, molecular biology, and the medical research community share social psychology’s blight of uncreative, incompetent, shameless little bullies aiming to pull down the best research in their respective fields?
Well, maybe so. But I doubt it. It seems extremely unlikely that the kinds of complaints issued against the studies that failed to replicate psychological findings apply in the same way in these other fields. It seems doubtful that problems in these other fields stem from geographical or temporal differences in social norms, unique aspects of student demographics, changes in wordings of scale items, exact demeanor of research assistants, or other factors of the sort pointed out by Gilbert & Co. as bedeviling attempts to replicate psychological findings. I also have no reason to think that molecular biology is full of shameless little bullies, but I stand ready to be corrected on that point.

VI. The Ultimate Source of Unreliable Scientific Research
So let’s go back to where some of us were before the COS study, when we pointed out that social psychology is not alone in having replication problems. What did this fact imply? Just this: The causes of a scientific literature full of studies that can’t be replicated are not specific to social psychology. The causes are both deeper and broader. They are deeper because they don’t concern concrete details of particular studies, or even properties of particular areas of research. They are broader because they affect all of science.
And the causes are not hard to see. Among them are:
1. An oversaturated talent market full of smart, motivated people anxious to get, or keep, an academic job.
2. A publication system in which the journals that can best get you a job, earn you tenure, or make you a star, are (or until recently have been) edited with standards such as the “JPSP threshold” (of novelty), and the explicit (former) disdain of Psychological Science for mere “bricks in the wall” that represent solid, incrementally useful, but insufficiently “groundbreaking” findings. I have been told that the same kinds of criteria have long prevailed in major journals in other fields of science as well. And of course we all know what kind of article is required to make it into Science.
3. And, even in so-called lesser journals, an insistence on significant findings as a criterion for publication, and a strong preference for reports of perfect, elegant series of studies without a single puzzling data point to be seen. “Messy” studies are left to work their way down the publication food chain, or to never appear at all.
4. An academic star system that radically, disproportionately rewards researchers whose flashy findings get widespread attention not just in our “best” journals but even in the popular media. The rewards can include jobs in the most prestigious universities, endowed chairs, distinguished scholar awards, Ted talks, and even (presumably lucrative) appearances in television commercials! (7)

It is these factors that are, in my opinion, both the ultimate sources of our problem and the best targets for reforming and improving not just psychology, but scientific research in all fields. And, to end on an optimistic note, I think I see signs that useful reforms are happening. People aren’t quite as enthusiastic about cute, counter-intuitive findings as they used to be. Hiring committees are starting to wonder what it really means when a vita shows 40 articles published in 5 years, all of which have perfect patterns of results. Researchers are occasionally openly responding – and getting publicly praised for openly responding — rather than defensively reacting, to questions about their work. (8)

VII. Summary and Moral
The replicability crisis is not just an issue for social psychology, and its causes aren’t unique to social psychology either. Claims that we don’t have a problem, because of various factors that are themselves unique to social psychology, fail to explain why so many other fields have similar concerns. The essential causes of the replicability crisis are cultural and institutional, and transcend specific fields of research. The remedies are too.

Footnotes
(1) The catalyst for this sudden attention appears to have been the nearly simultaneous appearance in JPSP of a study reporting evidence for precognition, and the exposure of massive data fraud by a prominent Dutch social psychologist. While these two cases were unrelated to each other and each exceptional by any standard, together they illuminated the fallibility of peer review and the self-correcting processes of science that were supposed to safeguard against accepting unreliable findings.
(2) Or 47%, or 39% or 68%, again, depending on your definition.
(3) Or a bit earlier, because Science magazine’s embargo was remarkably leaky, beginning with a Harvard press release issued several days before the article it promoted.
(4) To quote their press release; the word does not appear in their article.
(5) Full disclosure. This probably includes me, but I didn’t write a blog about it (until just now).
(6) A few: Dorothy Bishop, Andrew Gelman, Daniel Lakens, Uri Simonsohn, Sanjay Srivastava, Simine Vazire
(7) I strongly recommend reading Diederik Stapel’s vivid account (generously translated by Nick Brown) of how desperately he craved becoming one of these stars, and what this craving motivated him to do.
(8) Admittedly, defensive reactions, amplified in some cases by fan clubs, are still much more common. But I’m looking for positive signs here, and I think I see a few.

Bargain Basement Bayes

One of the more salutary consequences of the “replication crisis” has been a flurry of articles and blog posts re-examining basic statistical issues such as the relations between N and statistical power, the importance of effect size, the interpretation of confidence intervals, and the meaning of probability levels. A lot of the discussion of what is now often called the “new statistics” really amounts to a re-teaching (or first teaching?) of things anybody, certainly anybody with an advanced degree in psychology, should have learned in graduate school if not as an undergraduate. It should not be news, for example, that bigger N’s give you a bigger chance of getting reliable results, including being more likely to find effects that are real and not being fooled into thinking you have found effects when they aren’t real. Nor should anybody who had a decent undergrad stats teacher be surprised to learn that p-levels, effect sizes and N’s are functions of each other, such that if you know any two of them you can compute the third, and that therefore statements like “I don’t care about effect size” are absurd when said by anybody who uses p-levels and N’s.
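
Since the point that p-levels, effect sizes, and N’s are functions of each other does all the work in that last sentence, here is a minimal sketch of the arithmetic for the simple case of a Pearson correlation; the particular numbers are invented for illustration.

```python
# A minimal sketch of the point that p, effect size, and N determine each other,
# illustrated for a Pearson correlation. The numbers are invented for illustration.
from math import sqrt
from scipy.stats import t

def p_from_r(r, n):
    """Two-tailed p-value for a correlation r observed in a sample of size n."""
    df = n - 2
    t_stat = r * sqrt(df) / sqrt(1 - r**2)
    return 2 * t.sf(abs(t_stat), df)

def r_from_p(p, n):
    """The (absolute) correlation that yields a two-tailed p in a sample of size n."""
    df = n - 2
    t_crit = t.isf(p / 2, df)
    return t_crit / sqrt(t_crit**2 + df)

print(p_from_r(0.30, 30))    # same correlation, N = 30: p is about .11
print(p_from_r(0.30, 300))   # same correlation, N = 300: p is vanishingly small
print(r_from_p(0.05, 30))    # correlation needed to just reach p = .05 at N = 30 (about .36)
```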

But that’s not my topic for today. My topic today is Bayes’ theorem, which is an important alternative to the usual statistical methods, but which is rarely taught at the undergraduate or even graduate level. (1)  I am far from expert in Bayesian statistics. This fact gives me an important advantage: I won’t get bogged down in technical details; in fact that would be impossible, because I don’t really understand them. A problem with discussions of Bayes’ theorem that I often see in blogs and articles is that they have a way of being both technical and dogmatic. A lot of ink – virtual and real – has been spilled over the exact right way to compute Bayes factors and over insistence that all statistical analyses should be conducted within a Bayesian framework. I don’t think the technical and dogmatic aspects of these articles are helpful – in fact I think they are mostly harmful – for helping non-experts to appreciate what thinking in a semi-Bayesian way has to offer. So, herewith is my extremely non-technical and very possibly wrong (2) appreciation of what I call Bargain Basement Bayes.

Bayes’ formula: Forget about Bayes’ formula. I have found that even experts have to look it up every time they use it. For many purposes, it’s not needed at all. However, the principles behind the formula are important. The principles are these:

1. First, Bayes assumes that belief exists in degrees, and assigns numbers to those degrees of belief. If you are certain that something is false, it has a Bayes “probability” of 0. If you are certain it’s true, the probability is 1. If you have absolutely no idea whatsoever, the probability is .5. Everything else is in between.
Traditional statisticians hate this. They don’t think a single fact, or event, can even have a probability. Instead, they want to compute probabilities that refer to frequencies within a class, such as the number of times out of a hundred that a result would be greater than a certain magnitude under pure chance, given a certain N. But really, who cares? The only reason anybody cares about this traditional kind of probability is that after you compute that nice “frequentist” result, you will use the information to decide what you believe. And, inevitably, you will make that decision with a certain degree of subjective confidence. Traditional statistics ignores and even denies this last step, which is precisely where it goes very, very wrong. In the end, beliefs are held, and decisions based on those beliefs are made, by people, not numbers. Sartre once said that even if there is a God, you would still have to decide whether to do what He says. Even if frequentist statistics are exactly correct (3) you still have to decide what to do with them.

2. Second, Bayes begins with what you believed to be true before you got your data. And then it asks, now that you have your data, how much should you change what you used to believe? (4)
Traditional statisticians hate this even more than they hate the idea of putting numbers on subjective beliefs. They go on about “prior probabilities” and worry about how they are determined, observe (correctly) that there is no truly objective way to estimate them, and suspect that the whole process is just a complicated form of inferential cheating. But the traditional model begins by assuming that researchers know and believe absolutely nothing about their research topic. So, as they then must, they will base everything they believe on the results of their single study. If those results show that people can react to stimuli presented in the future, or that you can get people to slow their walks to a crawl by having them unscramble the word “nldekirw” (5) then that is what we have to believe. In the words of a certain winner of the Nobel Prize, “we have no choice.”
Bayes says, oh come on. Your prior belief was that these things were impossible (in the case of ESP) or, once the possibility of elderly priming was explained, that it seemed pretty darned unlikely. That’s what made the findings “counter-intuitive,” after all. Conventional statistics ignores these facts. Bayes acknowledges that claims that are unlikely to be true, a priori, need extra-strong evidence to become believable. I am about the one millionth commentator to observe that social psychology, in particular, has been for too long in thrall to the lure of the “counter intuitive result.” Bayes explains exactly how that got us into so much trouble. Counter-intuitive, by definition, means that the finding had a low Bayesian prior. Therefore, we should have insisted on iron-clad evidence before we started believing all those cute surprising findings, and we didn’t. Maybe some of them are true, who knows at this point. But the clutter of small-N, underpowered single studies with now-you-see-it-now-you-don’t results are in a poor position to tell us which they are. Really, we almost need to start over.
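
A back-of-the-envelope version of this argument can be put in odds form: posterior odds equal prior odds times the strength of the evidence. The sketch below uses invented numbers purely for illustration (a prior of .5 for a mundane claim, .05 for a counter-intuitive one, and a single study that shifts the odds by a factor of 3).

```python
# A back-of-the-envelope sketch of Bayesian updating in odds form:
# posterior odds = prior odds x (strength of evidence, i.e., a Bayes factor).
# All numbers below are invented purely for illustration.

def posterior_probability(prior_prob, bayes_factor):
    """Update a prior probability by a likelihood ratio (Bayes factor)."""
    prior_odds = prior_prob / (1 - prior_prob)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1 + posterior_odds)

evidence = 3.0  # suppose a single study shifts the odds by a factor of 3

# A mundane hypothesis (prior = .5) versus a "counter-intuitive" one (prior = .05)
print(posterior_probability(0.50, evidence))  # -> 0.75: now fairly believable
print(posterior_probability(0.05, evidence))  # -> about 0.14: still probably false
```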

3. Third, Bayes is, in the end, all about practical decisions. Specifically, it’s about decisions to believe something, and to do something or not, in the real world. It is no accident, I think, that so many Bayesians work in applied settings and focus on topics such as weather forecasting, financial planning, and medical decisions. In all of these domains, the lesson they teach tends to be – as Kahneman and Tversky pointed out long ago – that we underuse base rates (6). In medicine, in particular, the implications are just starting to be understood in the case of screening for disease. When the base rate (aka the prior probability) is low, even highly diagnostic tests have a very high probability of yielding false positives, which entail significant physical, psychological, and financial costs. Traditional statistical thinking, which ignores base rates, leads one to think that a positive result on a test with 90% accuracy means that the patient has a 90% chance of having the disease. But if the prevalence in the population is 1%, the actual probability given a positive test is less than 10%. In subjective, Bayesian terms, of course! Extrapolating this to the context of academic research, the principle implies that we overestimate the diagnosticity of single research studies, especially when the prior probability of the finding is low. I think this is why we were so willing to accept implausible, “counter-intuitive” results on the basis of inadequate evidence. To our current grief.
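
The screening numbers above are easy to verify with Bayes’ rule. Here is the arithmetic, under the assumption (mine, for illustration) that “90% accuracy” means 90% sensitivity and 90% specificity.

```python
# Verifying the screening example: a test that is "90% accurate"
# (taken here to mean 90% sensitivity and 90% specificity) applied
# to a disease with 1% prevalence.
prevalence = 0.01      # base rate (prior probability) of the disease
sensitivity = 0.90     # P(positive test | disease)
specificity = 0.90     # P(negative test | no disease)

true_positives = prevalence * sensitivity
false_positives = (1 - prevalence) * (1 - specificity)

# Bayes' rule: P(disease | positive test)
posterior = true_positives / (true_positives + false_positives)
print(f"P(disease | positive test) = {posterior:.3f}")  # about 0.083, not 0.90
```

Most positive tests come from the healthy 99% of the population, which is the whole point about base rates.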

You don’t have to be able to remember Bayes’ formula to be a Bargain Basement Bayesian. But, as in all worthwhile bargain basements, you can get something valuable at a low cost.

Footnotes
1. In a recent graduate seminar that included students from several departments, I asked who had ever taken a course that taught anything about Bayes.  One person raised her hand.  Interestingly, she was a student in the business school.
2. Hi Simine.
3. They aren’t.
4. Bayes is sometimes called the “belief revision model,” which I think is pretty apt.
5. Wrinkled
6. Unless the data are presented in an accessible, naturalistic format such as that seen in the work of Gerd Gigerenzer and his colleagues, which demonstrates how to present Bayesian considerations in terms other than the intimidating-looking formula.

Towards a De-biased Social Psychology: The effects of ideological perspective go beyond politics.

Behavioral and Brain Sciences, in press; subject to final editing before publication

This is a commentary on: Duarte, J. L., Crawford, J. T., Stern, C., Haidt, J., Jussim, L., & Tetlock, P. E.  (in press). Political diversity will improve social psychological science. Behavioral and Brain Sciences. To access the target article, click here.

“A liberal is a man too broadminded to take his own side in a quarrel.” — Robert Frost

Liberals may be too open-minded for their own (ideological) good; they keep finding fault with themselves, and this article is a good example. Which is not to say it’s not largely correct. Social and personality psychology obviously lacks ideological diversity, and Duarte and colleagues provide strong circumstantial evidence that the causes include a hostile climate, a lack of role models, and subtle and not-so-subtle discrimination of the same sort that underlies the lack of diversity elsewhere in society.

Duarte et al. argue that our science would be better if more “conservatives” were included in the ideological mix. But the point of view that carries this label has changed greatly in recent years. Not so long ago, no conservative would dream of shutting down the government over an ideological dispute, denying the validity of settled science, or passing laws to encourage open carry of weapons on college campuses. Conservatives were conservative. Such people indeed have a lot to contribute to any discussion, including scientific ones. But many modern-day “conservatives” — especially the loudest ones — would better be described as radical, and among their radical characteristics is a pride in anti-intellectualism and willful ignorance. In a call for more conservatives, who are we actually inviting and, I truly wonder, how many even exist? I am not optimistic about the feasibility of finding enough reasonable conservatives to join our field, even if we could overcome all of the barriers the target article so vividly describes. At best, such change is a long-term goal.

In any case, we shouldn’t wait for conservatives to arrive and save us. We need to save ourselves. The target article presents mixed messages about whether de-biasing is feasible. On the one hand, it cites evidence that de-biasing is difficult or impossible. On the other hand, the entire article is an effort at de-biasing. I choose to believe the more optimistic, implicit claim of Duarte et al., which is that we can become more intellectually honest with ourselves and thereby do better science. I find the “mirror-image” test particularly promising. For any finding, we should indeed get into the habit of asking, what if the very same evidence had led to the opposite conclusion?

Politics is the least of it. In focusing on research that seeks to describe how conservatives are cognitively flawed or emotionally inadequate, or on research that treats conservative beliefs as ipso facto irrational, Duarte et al. grasp only the low-hanging fruit. More pernicious, I believe, are the ways ideological predilections bias the conduct and evaluation of research that, on the surface, has nothing to do with politics. An awful lot of research and commentary seems to be driven by our value systems, by what we wish were true. So we do studies to show that what we wish were true is true, and attack the research of others that leads to conclusions that do not fit our world view.

Examples are legion. Consider just a few:

Personality and abilities are heritable. This finding is at last taking hold in psychology, after a century’s dominance of belief in a “blank slate.” The data were just too overwhelming. But the idea that people are different at the starting line is heartbreaking to the liberal world-view and encounters resistance even now.

Human nature is a product of evolution. Social psychologists are the last people you would expect to deny that Darwin was right — except when it comes to human behavior, and especially if it has anything to do with sex differences (Winegard et al., 2014). The social psychological alternative to biological evolution is not intelligent design, it’s culture. And as to where culture came from, that’s a problem left for another day.

The Fundamental Attribution Error is, as we all know, the unfortunate human tendency to view behavior as stemming from the characteristics — the traits and beliefs — of the people who perform it. Really, it’s the situation that matters. So, change the situation and you can change the behavior; it’s as simple as that. This belief is very attractive to a liberal world-view, and one does not have to look very far to find examples of how it is used to support various liberal attitudes towards crime and punishment, economic equality, education, and so forth. But the ideological consequences of belief in the overwhelming power of the situation are not consistent. It implies that the judges at Nuremberg committed the Fundamental Attribution Error when they refused to accept the excuse of Nazi generals that they were “only following orders.”

The consistency controversy, which bedeviled the field of personality psychology for decades and which still lingers in various forms, stems from the conviction among many social psychologists that the Fundamental Attribution Error, just mentioned, affects an entire subfield of psychology. Personality psychology, it is sometimes still said, exaggerates the importance of individual differences. But to make a very long story very short, individual differences in behavior are consistent across situations (Kenrick & Funder, 1988) and stable over decades (e.g., Nave et al., 2010). Many important life outcomes including occupational success, marital stability and even longevity can be predicted from personality traits as well as or better than from any other variables (Roberts et al., 2007). And changing behavior is difficult, as any parent trying to get a child to make his bed can tell you; altering attitudes is just as hard, as anyone who has ever tried to change anyone else’s mind in an argument can tell you. Indeed, does anybody ever change their mind about anything? Maybe so, but generally less than the situation would seem to demand. I expect that responses to the article by Duarte et al. will add one more demonstration of how hard it is to change ingrained beliefs.

REFERENCES
Kenrick, D.T., & Funder, D.C. (1988). Profiting from controversy: Lessons from the person-situation debate. American Psychologist, 43, 23-34.
Nave, C.S., Sherman, R.A., Funder, D.C., Hampson, S.E., & Goldberg, L.R. (2010). On the contextual independence of personality: Teachers’ assessments predict directly observed behavior after four decades. Social Psychological and Personality Science, 1, 327-334.
Roberts, B.W., Kuncel, N.R., Shiner, R., Caspi, A., & Goldberg, L.R. (2007). The power of personality: The comparative validity of personality traits, socioeconomic status, and cognitive ability for predicting important life outcomes. Perspectives on Psychological Science, 2, 313-345.
Winegard, B.M., Winegard, B.M., & Deaner, R.O. (2014). Misrepresentations of evolutionary psychology in sex and gender textbooks. Evolutionary Psychology, 12, 474-508.

How to Flunk Uber: A Guest Post by Bob Hogan

How to Flunk Uber

by Robert Hogan

Hogan Assessment Systems

Delia Ephron, a best-selling American author, screenwriter, and playwright, published an essay in the New York Times on August 31st, 2014 entitled “Ouch, My Personality, Reviewed”  that is a superb example of what Freud called “the psychopathology of everyday life.”  She starts the essay by noting that she recently used Uber, the car service for metrosexuals, and the driver told her that if she received one more bad review, “…no driver will pick you up.”  She reports that this feedback triggered some “obsessive” soul searching:  she wondered how she could have created such a bad score as an Uber passenger when she had only used the service 6 times.  She then reviewed her trips, noting that, although she had often behaved badly (“I do get short tempered when I am anxious”), in each case extenuating circumstances caused her behavior.  She even got a bad review after a trip during which she said very little:  “Perhaps I simply am not a nice person and an Uber driver sensed it.”

The essay is interesting because it is prototypical of people who can’t learn from experience.  For example, when Ms. Ephron reviewed the situations in which she mistreated Uber drivers, she spun each incident to show that her behavior should be understood in terms of the circumstances—the driver’s poor performance—and not in terms of her personality.  Perhaps situational explanations are the last refuge of both neurotics and social psychologists?

In addition, although the situations changed, she behaved the same way in each of them:  she complained, she nagged and micro-managed the drivers, she lost her temper, and she broadcast her unhappiness to the world.  Positive behavior may or may not be consistent across situations, but negative behavior certainly is.  And the types of negative behaviors she displayed fit the typology defined by the Hogan Development Survey (HDS), an inventory of the maladaptive behaviors that occur when people are dealing with others with less power and think no one important is watching them.

All her actions had a manipulative intent—Ms. Ephron wanted to compel a fractious driver to obey her.  Her behaviors were tactical in that they gave her short-term, one-off wins—she got her way; but the behaviors become counterproductive when she has to deal with the same people repeatedly—or when she is dealing with NYC Uber drivers.  Strategic players carefully control what Erving Goffman called “their leaky channels,” the behavioral displays that provide information regarding a player’s character or real self.  The tactical Ms. Ephron seems unable to control her leaky channels.

It was also interesting to learn that, although Ms. Ephron has been in psychotherapy for years, the way she mistreats “little people” seemingly never came up. This highlights the difference between intrapsychic and interpersonal theories of personality.   From an intrapsychic perspective, emotional distress creates problems in relationships; fix the emotional problems and the relationships will take care of themselves.  From an interpersonal perspective, problems in relationships create emotional distress—fix the relationships (behave better) and the emotional problems will take care of themselves.  In the first model, intrapsychic issues disrupt relationships; in the second model, disrupted relationships cause intrapsychic issues.

As further evidence that Ms. Ephron lacks a strategic understanding of social behavior, she is surprised to learn that other people keep score of her behavior.  This means that she pays no attention to her reputation.  But her reputation is the best source of data other people have concerning how to deal with her.  She might not care about her reputation, but those who deal with her do.  All the data suggest that she will have the same reputation with hair dressers, psychotherapists, and purse repair people as she does with the Uber drivers of New York.

Finally, people flunk Uber the same way they become unemployable and then flunk life—they flunk one interaction at a time.  After every interaction there is an accounting process, after which something is added to or subtracted from people’s reputations.  The score accumulates over time and, at some point, the Uber drivers refuse to pick them up.  Ms. Ephron is a successful artist, and her success buys her a degree of idiosyncrasy credit—she is allowed to misbehave in the artistic community—but there are consequences when she misbehaves in the larger community of ordinary actors.