Thresholds

Part One

I’ve been suffering an acute bout of cognitive dissonance lately, finding myself disagreeing with people I admire; specifically, with several of the authors of this article. (The article has 72 authors and I don’t know all of them.) The gist of the article can be stated very simply and in the authors’ own words: “We propose to change the default P-value threshold for statistical significance for claims of new discoveries from .05 to .005.” This proposal is soberly and clearly argued, and the article makes some good points, the best of which is that, imperfect as this change would be, at least it’s a step in the right direction. But I respectfully disagree. Here’s why.

I’m starting to think that p-levels should all be labeled “for entertainment purposes only.”  They give a very very rough idea of the non-randomness of your data, and are kind of interesting to look at. So they’re not completely useless, but they are imprecise at best and almost impossible to interpret at worst*, and so should be treated as only one among many considerations when we decide what we as scientists actually believe.  Other considerations (partial list): prior probabilities (also very rough!), effect size, measurement precision, conceptual coherence, consistency with related findings, and (hats off please) replicability.

Thresholds are an attempt — independently of the other considerations just listed — to let our numbers do our thinking for us. I get why: it’s an attempt to prevent us from fooling ourselves. They also give editors and reviewers a seemingly objective criterion for publication, and so make their jobs easier (or maybe even possible). But statistical thresholds are not up to the job we so badly want them to do. Even if your p-level is .005, you still have to decide whether to believe the finding it pertains to. As the existentialists say, the only choice you cannot make is the choice not to choose.

This is dissatisfying, I know. If only we had a way to objectively say which findings we should believe, and which ones we shouldn’t!  But we don’t, and pretending we do (which includes playing with the criteria by which we do it) is not, in my view, a step in the right direction.

Part Two

In an interesting email exchange with my friend Simine Vazire**, one of the authors of the article, she replied to my comments above with “I think all of your objections apply to the .05 threshold as much as the .005 threshold.  If that’s true, then your objection is not so much with lowering the threshold, but with having one at all.”

Bingo, that’s exactly right.

Simine also went on to challenge me with a thought experiment.  If I knew two labs, one of which regarded a discovery in their data as real if it attained p < .05, and the other only when the finding attained p < .001, which lab’s finding would I be more inclined to believe?

This thought experiment is indeed challenging. My answer artfully dodges the question. Rather than attending to their favorite p-values, I’d be inclined to believe the lab whose results make more sense and/or have practical implications, not to mention the one whose findings can be replicated. I would also be more skeptical of the lab with lots of “counter-intuitive” findings, which is where Bayes (and the notion of prior probability) comes in.*** Furthermore, I wouldn’t really believe that the two labs, with their ostensibly different NHST p-thresholds, were actually doing anything much different on that score, so that wouldn’t influence my choice very much, if at all. This is where my cynicism comes in. I’m as doubtful that the second lab would really throw away a .002 as I am that the first lab would ignore a .06.
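To make the prior-probability point concrete, here is a back-of-the-envelope sketch; the power and prior values below are invented purely for illustration and come from nowhere in the article. It uses the textbook formula for the chance that a significant result reflects a real effect: power times the prior, divided by that quantity plus alpha times (one minus the prior).

    # Chance a "significant" result reflects a real effect, given the prior
    # probability that the hypothesis is true, statistical power, and alpha.
    # (Illustrative values only.)
    ppv <- function(prior, power, alpha) {
      (power * prior) / (power * prior + alpha * (1 - prior))
    }
    ppv(prior = .50, power = .80, alpha = .05)    # ~ .94
    ppv(prior = .50, power = .80, alpha = .005)   # ~ .99
    ppv(prior = .05, power = .80, alpha = .05)    # ~ .46
    ppv(prior = .05, power = .80, alpha = .005)   # ~ .89

At either threshold the answer depends heavily on the prior (and these numbers charitably assume high power and no p-hacking), which is why the prior has to be part of the judgment no matter where the cutoff sits.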

Full disclosure: I use rough thresholds myself; for a long time (before Ryne Sherman, then a member of our lab, wrote the R code to let us do randomization tests) our lab had the rule of not even looking at tables of (100-item) Q-sort correlates unless at least 20 of them were significant at p < .10. So that’s two thresholds in one! But this was, and remains, only one of many considerations on the way to figuring out what (we thought) the data meant. And this is actually what I mean when I say we should use p-levels for entertainment purposes only.
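For the curious, the arithmetic behind that old rule is easy to sketch. This is only my rough gloss, not the randomization-test code Ryne wrote, and it pretends the 100 correlates are independent, which Q-sort items certainly are not (that, of course, is why the randomization tests are the better tool).

    # Rough gloss only: treats the 100 correlates as independent,
    # which Q-sort items are not.
    n_items <- 100
    p_sig   <- .10   # chance a truly null correlate sneaks under p < .10
    n_items * p_sig                                 # expect about 10 under the null
    1 - pbinom(19, size = n_items, prob = p_sig)    # 20 or more by luck: roughly .002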

I appreciate the argument that decades of discussion have done little to change how people think and to move them away from using arbitrary thresholds for significance, and I have to admit it’s even possible that lowering the threshold might do some good.  For me, though, the best outcome might be that the paper could stimulate a new conversation about whether we should have thresholds at all.  As my Riverside colleague Bob Rosenthal likes to say, surely god loves .06 almost as much as she loves .05 (or, if you prefer, .006 almost as much as .005). And when you start thinking like that, it’s pretty hard to take a threshold, any threshold, seriously.

Footnotes

*as I learned from listening to Paul Meehl’s lectures on-line, the whole notion of “probability” is philosophically very fraught, and the conversion of the notion of probability (however one understands it) into actual numbers calculated to several digits of precision is even more fraught.

**quoted with permission

***counter-intuitive findings have low prior probabilities, by definition.  See Bargain Basement Bayes.

9 thoughts on “Thresholds”

  1. very interesting take, and I agree that p-values more generally should be labeled “for entertainment purposes only”

    >>>>If I knew two labs, one of which regarded a discovery in their data as real if it attained p < .05, and the other only when the finding attained p < .001, which lab’s finding would I be more inclined to believe?

    I'd say this is an easy thought experiment to answer: Rather than focusing on the lab's favorite p-value threshold, I’d place much more confidence in the effects from the lab that (1) reports their findings more transparently (i.e., compliance with relevant reporting guidelines, open/public materials, open/public data, & study/hypothesis pre-registration) and (2) reports successful internal direct replications (if feasible).

  2. I’m reminded of a post on twitter of a panel where a pundit makes the point that the reason why we believe in science and why science will last, as opposed to other systems of interpreting the world like religion, is that science works. By that statement, the author meant that the information gleaned from science can be used across historical periods to do things like build bridges, launch satellites, and treat certain health issues.

    If psychology wants to be a science, it will need to focus on getting its ideas to work similarly. P-values are a very weak and often uninformative way of doing that. Direct replications of an effect, regardless of p-values (i.e., consistent effect sizes), are a better way forward.

    Reifying p-values, whether .05, .005, or .00000005, is paying homage to a time in history when psychologists played at being scientists because they could use a mindless threshold to define their findings as real. We can and should do better.

  3. Regarding the thought experiment, my answer is “I don’t know.” The problem with p-values is that they are based on a test statistic (e.g., t, F, etc.). And a test statistic is (essentially) effect size / standard error. You get a small p-value when your effect size is large relative to the standard error. The problem here is that both pieces of information are valuable on their own! The effect size tells you how strongly two things are related. The standard error tells you how precisely you have measured that relationship. If we want to get serious about setting up a criterion for publication, we need to get serious about precise measurement. That means setting a threshold for standard error – not for p-values. For example, you might decide that you will not take any effect size (say, a correlation) seriously unless it is measured with a +/- .05 SE. That would give you (roughly) a 95% confidence interval around your effect size of +/- .10. If you can live with that, then that is what you should be aiming for. If you cannot, then you need to measure more precisely (i.e., get a larger sample, use repeated measures, etc.).

    I wrote an R function that does this as part of the {multicon} package. It is called n4rci(). You tell it your confidence interval width (for a Z-transformed correlation) and your alpha level, and it gives you the N you’d need to measure an effect with that much precision. For +/- .10, you need 387 people. For +/- .15, you need 173. This is why I get N=200 (or more) for every study I conduct. If I cannot measure the effect size within +/- .15, I don’t care.
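    For readers who would rather see the arithmetic than install a package: the standard error of a Fisher Z-transformed correlation is approximately 1/sqrt(N - 3), so the required N falls out directly. The sketch below is a reconstruction of that logic, not the {multicon} source, and its rounding may differ by a person or two from the figures above.

        # N needed so the (1 - alpha) confidence interval for a Fisher
        # Z-transformed correlation has half-width w; SE(z) ~ 1 / sqrt(N - 3).
        n_for_ci <- function(w, alpha = .05) {
          (qnorm(1 - alpha / 2) / w)^2 + 3
        }
        n_for_ci(.10)   # about 387, matching the figure above
        n_for_ci(.15)   # about 174 (vs. 173 above; rounding conventions differ)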

  4. Pingback: There’s a debate raging in science about what should count as “significant” - IT AND US

  5. Pingback: There's a debate raging in science about what should count as "significant"

  6. Great thoughts David. As someone who presents a lot of statistics to non-statistical audiences, I believe there is a fundamental misunderstanding and fear of statistics. Thus people retreat back to something they understand, a “hard and fast rule”: a heuristic that tells them, even if they don’t understand what someone has done, whether it’s believable or not. The beloved p-value does just that. Reciting it, loving it, believing it is not about understanding it, but more about a mental shortcut that tells the reader they *can* believe it even when they don’t understand the research or the analyses. It’s not a failure of the statistic itself… it’s just a number after all – and a useful one if you know what it really means. It’s a failure of how we train scientists. Fix the training – fix the problem.

  7. Pingback: There’s a debate raging in science about what should count as “significant” - Kasa News

  8. Always an interesting debate. There are a lot of things one could say about NHST. Most have already been said (which isn’t going to stop me from repeating them 😉).

    My opinion? p-values (NHST) aren’t the problem; it’s the way people use (& abuse) them. Setting alpha as part of your decision-making process is just fine as long as the rest of the process holds together (e.g., theory, design, data collection and treatment, model comparison, effect sizes, replication). The idea of setting a quantitative criterion for decisions is solid. The problems start when people try to circumvent that criterion (e.g., p-hacking, creative data exclusion) and/or when they forget How It Works (everything is significant with large samples, as the sketch at the end of this comment illustrates; even with low alpha, if you rely on one experiment/data collection, you can make a Type 1 error).

    I’m fairly new to this game, so I’m still sorting out how I’m going to deal with the conflicting pressures of the publishing process/best practices/what *I* believe is the right way to do it. But I’m really with Ryne on the importance of standard errors. When I look at my own data, before even thinking about how I’m going to write it up/present it, I look at SEs as the primary basis for decision-making (e.g., is this project worth pursuing at all? Is this (non)significant finding just a fluke? Do I keep going, or stop while I’m ahead? Am I thinking about this the wrong way?).

    I’d love to see a publication that includes using SE as part of the decision-making process in its method…?
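    A minimal sketch of the large-sample point above, with made-up data: given a big enough N, even a negligible correlation clears conventional significance, at either threshold.

        # A negligible true correlation (about .02) in a huge sample
        # is still comfortably "significant."
        set.seed(1)
        n <- 100000
        x <- rnorm(n)
        y <- .02 * x + rnorm(n)     # true correlation of roughly .02
        cor.test(x, y)$p.value      # almost certainly far below .05, and below .005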

  9. A useful rule of thumb: Whenever possible and feasible, replicate your own novel finding before you attempt to publish it. This rule of thumb might not solve every problem or answer every objection, but it’s a great place to start.
