Editorial
The P-value crisis and the issue of causality
There is a pervasive misunderstanding and misuse of the P-value, a P-value crisis. Goodman [1] has likened the 2016 American Statistical Association (ASA) position paper [2] about this situation to a hypothetical statement by the American Physical Society, addressed to engineers and builders, relaying the message that "mass and weight are, in reality, different from one another".
The common mistake in interpreting the P-value is to take it as the probability that the effect size or association we observe in a study is due to sheer chance, that is, the probability that what we observed is not true. For example, we like to think a P-value of 0.03 indicates that what we observed could have been spurious only 3% of the time and hence true 97% of the time. This surely is not the case. The P-value stands for the conditional probability of obtaining an effect size (a difference or association) the same as, or larger than, the one found in the current study, on the premise that there was, in reality, no difference/association, the so-called null hypothesis, and that the data meet all other assumptions of the statistical test used. So why is there this pervasive thought barrier in grasping what the P-value stands for?
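To make that conditional definition concrete, the following is a sketch of my own (it is not from any of the cited works; the permutation test, group sizes and seed are arbitrary illustrative choices). We simulate many studies in which the null hypothesis is true by construction, so every observed difference is sheer chance, and check how often P ≤ 0.05 turns up: about 5% of the time, exactly as the definition predicts, and not "5% of such findings are false".

```python
import random

random.seed(42)

def perm_p(a, b, n_perm=200):
    """Two-sided permutation P-value for a difference in group means."""
    obs = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = a + b
    hits = 0
    for _ in range(n_perm):
        random.shuffle(pooled)
        x, y = pooled[:len(a)], pooled[len(a):]
        if abs(sum(x) / len(x) - sum(y) / len(y)) >= obs:
            hits += 1
    # +1 correction: the observed labelling counts as one permutation
    return (hits + 1) / (n_perm + 1)

# 500 simulated "studies" in which the null hypothesis is TRUE by
# construction: both arms come from the same distribution.
n_studies = 500
significant = 0
for _ in range(n_studies):
    a = [random.gauss(0, 1) for _ in range(20)]
    b = [random.gauss(0, 1) for _ in range(20)]
    if perm_p(a, b) <= 0.05:
        significant += 1

rate = significant / n_studies
print(rate)  # close to 0.05: under the null, P <= 0.05 in about 5% of studies
```

The point of the sketch is the direction of the conditioning: the 5% is the frequency of small P-values *given* that the null is true, not the probability that the null is true given a small P-value.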
I propose that a simple lack of appreciation of the degree to which the human brain prioritizes causality is a major reason. This prioritization reflects the obvious interest of any researcher, or scientific reader in general, in causes. The eminent artificial intelligence expert J. Pearl pointedly notes that "human intuition is organized around causal, not statistical, relations" [3]. By the age of 3, our brains are already rather adept at causation, while mathematics and counterfactual thinking come much later [3]. So whenever we see a P-value, almost intuitively, we try to ascribe to it a true causality.
On the other hand, provided the number of observations is adequate, Fisher's P-value is always a good tool for judging a non-causal effect size or association. Whether an effect or association is indeed causal, however, most of the time depends on whether the study design included randomization followed by an intervention, as in a randomized controlled trial (RCT). If so, then the counterfactual thinking in the P-value, with its null assumption, simply goes: 'Now I have given this medication to only one of two qualitatively similar, that is randomized, groups. What is the probability that I would have observed this difference between the two groups at the end of the study if there were no real difference?' So a difference between the groups observed at the end of the study is rather likely due to the intervention, such as a new drug, we tested. In this study design, differing potential confounders, and the different prior probabilities caused by these confounders, have largely been eliminated by the randomization process, and the P-value we calculate becomes rather justified counterfactual evidence for causality. Put simply, if there is little evidence for no effect, there must be greater evidence for a real effect, provided there was only one candidate effector. It is also important to remember here that there are two instances of counterfactual thinking. One is the 'What if there is no difference?' of the null hypothesis; the other is 'What if I do not apply the intervention I am testing for causality?', usually addressed by the placebo component of the RCT. It is not commonly remembered that when Fisher taught us the importance of the null hypothesis and the P-value in statistical inference, he also heavily emphasized randomization in experimental design [4]. There surely are recent attempts to use non-randomized data to assess causality by minimizing the effects of baseline confounders between the different arms of a study, to achieve, if you will, a quasi-randomization. Propensity score analysis, Mendelian randomization in genetic association studies and the more recently proposed E-value [5] are such tools, but it would be fair to say that truly randomized studies remain the gold standard in our search for causality.
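Why randomization does this work can be shown in a toy simulation of my own devising (the "severity" confounder, sample size and seed are invented for illustration only). A drug with zero true effect looks strongly beneficial, or here harmful, in observational data because sicker patients receive it more often; a coin-flip assignment of the very same drug shows the true null effect.

```python
import random

random.seed(3)

N = 20000
mean = lambda xs: sum(xs) / len(xs)

# A confounder (say, disease severity) worsens the outcome AND, in routine
# care, makes prescription of the drug more likely. The drug itself does
# nothing: outcome depends on severity plus noise only.
severity = [random.gauss(0, 1) for _ in range(N)]
outcome = [s + random.gauss(0, 1) for s in severity]

# Observational data: sicker patients tend to receive the drug.
obs_treated = [s > 0 for s in severity]
diff_obs = mean([o for o, t in zip(outcome, obs_treated) if t]) - \
           mean([o for o, t in zip(outcome, obs_treated) if not t])

# RCT: a coin flip assigns treatment, breaking the link with severity.
rct_treated = [random.random() < 0.5 for _ in range(N)]
diff_rct = mean([o for o, t in zip(outcome, rct_treated) if t]) - \
           mean([o for o, t in zip(outcome, rct_treated) if not t])

print(round(diff_obs, 2))  # a large, entirely spurious "treatment effect"
print(round(diff_rct, 2))  # close to 0: randomization removed the confounding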
All this does not mean that randomized data are the only source of causal evidence in medical research. On occasion, vitally important examples of biological causes can emerge from mainly observational data, as in the historical example of the association of lung cancer with smoking [6]. Furthermore, a significant P-value stemming from an RCT is surely not foolproof support for causality. Too few observations leading to unstable P-values [7], too many observations leading to statistical significance being ascribed to trivial findings—in turn leading to a conclusion of an undeserved clinical significance—as well as multiple testing, are not infrequent abuses of both observational and randomized data. On balance, however, a search for true causality most of the time needs the additional presence of randomization followed by an intervention.
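The "too few observations" problem, the fickleness documented in [7], is easy to reproduce. The sketch below is mine, not from [7]: it reruns the *same* underpowered design, a real 0.5 SD effect with only 10 patients per arm, and shows that the resulting P-values scatter from clearly "significant" to nowhere near it (the permutation test and all numbers are illustrative choices).

```python
import random

random.seed(7)

def perm_p(a, b, n_perm=200):
    """Two-sided permutation P-value for a difference in group means."""
    obs = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = a + b
    hits = 0
    for _ in range(n_perm):
        random.shuffle(pooled)
        x, y = pooled[:len(a)], pooled[len(a):]
        if abs(sum(x) / len(x) - sum(y) / len(y)) >= obs:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# The SAME design run 100 times: a true effect of 0.5 SD, 10 patients per arm.
ps = []
for _ in range(100):
    a = [random.gauss(0.0, 1) for _ in range(10)]
    b = [random.gauss(0.5, 1) for _ in range(10)]
    ps.append(perm_p(a, b))

print(min(ps), max(ps))  # identical designs, wildly different P-values
```

A single small-study P-value is thus a noisy draw, which is one reason a "significant" result from an underpowered RCT deserves the caution urged above.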
Historically, the era of RCTs started [8] in close succession to Fisher's pioneering ideas about null hypothesis testing with the P-value. However, I propose that this well-deserved success of the P-value in interpreting the outcomes of RCTs was, unfortunately, soon freely extrapolated to observational, non-randomized data. Fisher's emphasis on randomization and intervention as
Rheumatology 2020;59:1467–1468 doi:10.1093/rheumatology/keaa152 Advance Access publication 9 May 2020
important elements of causation was rather neglected. This was all combined with our innate zeal for causality, our brain's perhaps sluggish counterfactual thinking and our urge to prove ourselves. These, I propose, led to the P-value crisis at hand. This crisis, in turn, prompted several suggestions, including doing away with the P-value altogether [9] or, more recently, tightening the conventional threshold of significance to 0.005 [10]. The latter expedient proposal, with its more stringent threshold, might, I am afraid, further increase our embedded, unfortunate and pervasive faith in the P-value as a criterion of causality without consideration of the study design.
In brief, I propose that the main mischief behind the current P-value crisis is our almost innate impulse to ascribe causality to every and any P-value we come across. We all like to think we know that association is not causation. However, we frequently forget what we know. This forgetfulness and, on occasion, its immoral exploitation by champions of causality have, I propose, made up the fertile ground on which the crisis we have been discussing has thrived. Reconsidering the history of the P-value might be of help.
Funding: No specific funding was received from any funding bodies in the public, commercial or not-for-profit sectors to carry out the work described in this article.
Disclosure statement: The author has declared no con-flicts of interest.
Hasan Yazici1,a
1Division of Rheumatology, Department of Internal Medicine,
Cerrahpasa Medical School, Istanbul University, Istanbul, Turkey
Accepted 8 March 2020
Correspondence to: Hasan Yazici, Academic Hospital, Nuhkuyusu cad. 94, Uskudar, Istanbul 34662, Turkey. E-mail: hasan@yazici.net
aPresent address: Academic Hospital, Medicine (Rheumatology), Uskudar, Istanbul, Turkey
References
1 Goodman SN. Aligning statistical and scientific reasoning. Science 2016;352:1180–1.
2 Wasserstein RL, Lazar NA. The ASA statement on p-values: context, process, and purpose. Am Stat 2016; 70:129–33.
3 Pearl J, Mackenzie D. The ladder of causation. In: Pearl J, Mackenzie D, eds. The book of why: the new science of cause and effect. New York: Basic Books, 2018: 23–52.
4 Hall NS. R. A. Fisher and his advocacy of randomization. J Hist Biol 2007;40:295–325.
5 VanderWeele TJ, Ding P. Sensitivity analysis in observational research: introducing the E-value. Ann Intern Med 2017;167:268–74.
6 Proctor RN. The history of the discovery of the cigarette–lung cancer link: evidentiary traditions, corporate denial, global toll. Tob Control 2012;21:87–91.
7 Halsey LG, Curran-Everett D, Vowler SL, Drummond GB. The fickle P value generates irreproducible results. Nat Methods 2015;12:179–85.
8 Medical Research Council. Streptomycin treatment of pulmonary tuberculosis: a Medical Research Council investigation. Br Med J 1948;2:769–82.
9 Trafimow D, Marks M. Editorial. Basic Appl Soc Psychol 2015;37:1–2.
10 Benjamin DJ, Berger JO, Johannesson M et al. Redefine statistical significance. Nat Hum Behav 2018;2:6–10.