Thresholds

Part One

I’ve been suffering an acute bout of cognitive dissonance lately, finding myself disagreeing with people I admire: specifically, several of the authors of this article. (The article has 72 authors and I don’t know all of them.)  The gist of the article can be stated very simply and in the authors’ own words: “We propose to change the default P-value threshold for statistical significance for claims of new discoveries from .05 to .005.”  This proposal is soberly and clearly argued, and the article makes some good points, the best of which is that, imperfect as this change would be, at least it’s a step in the right direction.  But I respectfully disagree.  Here’s why.

I’m starting to think that p-levels should all be labeled “for entertainment purposes only.”  They give a very, very rough idea of the non-randomness of your data, and are kind of interesting to look at. So they’re not completely useless, but they are imprecise at best and almost impossible to interpret at worst*, and so should be treated as only one among many considerations when we decide what we as scientists actually believe.  Other considerations (partial list): prior probabilities (also very rough!), effect size, measurement precision, conceptual coherence, consistency with related findings, and (hats off please) replicability.

Thresholds are an attempt (independently of the other considerations just listed) to let our numbers do our thinking for us. I get why: it’s an attempt to prevent us from fooling ourselves. They also give editors and reviewers a seemingly objective criterion for publication, and so make their jobs easier (or maybe even possible). But statistical thresholds are not up to the job we so badly want them to do.  Even if your p-level is .005, you still have to decide whether to believe it. As the existentialists say, the only choice you cannot make is the choice not to choose.

This is dissatisfying, I know. If only we had a way to objectively say which findings we should believe, and which ones we shouldn’t!  But we don’t, and pretending we do (which includes playing with the criteria by which we do it) is not, in my view, a step in the right direction.

Part Two

In an interesting email exchange with my friend Simine Vazire**, one of the authors of the article, she replied to my comments above with “I think all of your objections apply to the .05 threshold as much as the .005 threshold.  If that’s true, then your objection is not so much with lowering the threshold, but with having one at all.”

Bingo, that’s exactly right.

Simine also went on to challenge me with a thought experiment.  If I knew two labs, one of which regarded a discovery in their data as real if it attained p < .05, and the other only when the finding attained p < .001, which lab’s finding would I be more inclined to believe?

This thought experiment is indeed challenging.  My answer artfully dodges the question. Rather than attending to their favorite p-values, I’d be inclined to believe the lab whose results make more sense and/or have practical implications, not to mention the one whose findings can be replicated.  I would also be more skeptical of the lab with lots of “counter-intuitive” findings, which is where Bayes (and the notion of prior probability) comes in.***  Furthermore, I wouldn’t really believe that the two labs with their ostensibly different NHST p-thresholds were actually doing anything much different on that score, so that wouldn’t influence my choice very much, if at all. This is where my cynicism comes in.  I’m as doubtful that the second lab would really throw away a .002 as I am that the first lab would ignore a .06.
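To make the role of the prior concrete, here is a minimal sketch of the standard Bayes' rule calculation for a "significant" result. It is only an illustration: the function name `posterior_true`, the assumed statistical power of 0.8, and the example priors are hypothetical choices, not figures from the article or from either imaginary lab, and treating the threshold as the false-positive rate is a simplification.

```python
def posterior_true(prior, alpha, power=0.8):
    """P(effect is real | result cleared the threshold), by Bayes' rule.

    prior: assumed probability the effect is real before seeing data
    alpha: significance threshold, treated here as the false-positive rate
    power: assumed probability of detecting a real effect (hypothetical 0.8)
    """
    true_positives = power * prior
    false_positives = alpha * (1.0 - prior)
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Hypothetical priors: a plausible claim, a long shot, a "counter-intuitive" one.
    for prior in (0.5, 0.1, 0.01):
        for alpha in (0.05, 0.005):
            post = posterior_true(prior, alpha)
            print(f"prior={prior:<5} threshold={alpha:<6} posterior={post:.2f}")
```

Under these assumed inputs, a finding with a prior of .01 that clears .005 ends up less credible than a finding with a prior of .5 that clears only .05, which is one way to see why the threshold alone cannot settle which lab to believe.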

Full disclosure: I use rough thresholds myself; for a long time (before Ryne Sherman, then a member of our lab, wrote the R code to let us do randomization tests) our lab had the rule of not even looking at tables of (100-item) Q-sort correlates unless at least 20 of them were significant at p < .10.  So there’s two thresholds in one!  But this was, and remains, only one of many considerations on the way to figuring out what (we thought) the data meant. And this is actually what I mean when I say we should use p-levels for entertainment purposes only.
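For a rough sense of how strict a "20 of 100 at p < .10" rule is, here is a back-of-the-envelope sketch. It assumes the 100 items are independent, which Q-sort items are not, so the real chance rate will differ; that is presumably part of the appeal of a randomization test. The function name `binomial_tail` is hypothetical, and this is not the lab's R code.

```python
from math import comb


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more 'hits' out of n."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    n_items = 100      # items in the Q-sort
    per_item = 0.10    # per-item significance level
    needed = 20        # minimum number of significant correlates

    # Under the (unrealistic) independence assumption, about 10 items
    # would be significant by chance alone.
    print("expected by chance:", n_items * per_item)
    print("P(at least 20 by chance):", binomial_tail(n_items, needed, per_item))
```

Because the items are correlated with one another, the true chance of 20 or more hits is likely larger than this independence-based figure suggests, so the number is best read as a loose benchmark rather than a p-value.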

I appreciate the argument that decades of discussion have done little to change how people think and to move them away from using arbitrary thresholds for significance, and I have to admit it’s even possible that lowering the threshold might do some good.  For me, though, the best outcome might be that the paper could stimulate a new conversation about whether we should have thresholds at all.  As my Riverside colleague Bob Rosenthal likes to say, surely god loves .06 almost as much as she loves .05 (or, if you prefer, .006 almost as much as .005). And when you start thinking like that, it’s pretty hard to take a threshold, any threshold, seriously.

*as I learned from listening to Paul Meehl’s lectures on-line, the whole notion of “probability” is philosophically very fraught, and the conversion of the notion of probability (however one understands it) into actual numbers calculated to several digits of precision is even more fraught.

**quoted with permission

***counter-intuitive findings have low prior probabilities, by definition.  See Bargain Basement Bayes.