      Clinical Significance: Are You Only Relying on the P-value?


            The debate on the sole use and interpretation of the P-value has continued in the medical literature for many years. A Nature commentary in 2019 suggested abandoning the conventional use of ‘statistical significance’.(1) In contrast, in 2021, the American Statistical Association (ASA) stated that the “use of P-values and significance testing, properly applied and interpreted, are important tools that should not be abandoned”.(2)

            In general, clinicians are inclined to report P-values as a standardized method of assessing the statistical significance of their research findings. In clinical medicine, however, it is crucial to take a more holistic approach to evaluating the evidence, incorporating other relevant factors to make informed decisions about the impact of research findings on patient care.

            Hypothesis testing is a powerful tool for observing data from one sample and drawing inferences about the target population; it is statistical hypothesis testing that enables decision-making.(3) For example, to determine whether the mortality rate differs between breast cancer patients receiving a surgical intervention and those treated more conservatively, a null hypothesis of no difference between comparator groups is stated, with an alternate hypothesis such as ‘Mortality rates differ between patients receiving interventional vs conservative breast cancer treatments’. Direction may be assigned to this hypothesized difference, e.g. ‘Breast cancer patients undergoing surgical intervention have longer survival times than those treated with conservative treatment alone’. A significance level of 0.05, the conventional criterion most commonly used in the clinical and biological sciences, indicates a 5% probability of concluding that a difference exists when no actual difference exists. The decision to reject or fail to reject the null hypothesis is based on the P-value from an appropriate statistical test. A P-value is “the probability of obtaining a result as extreme as (or more extreme than) the one observed if the null hypothesis is true.”(4) In other words, the P-value measures how compatible the observed data are with the null hypothesis; it is not the probability that the null hypothesis is true.
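            For readers who wish to experiment, a minimal sketch of such a test in Python follows. The counts, group sizes, and the choice of Fisher's exact test are illustrative assumptions, not data or methods from any actual study.

```python
# Hypothesis test for a difference in mortality between two treatment
# groups. All counts are hypothetical and for illustration only.
from scipy.stats import fisher_exact

# Hypothetical 2x2 table: rows = treatment group, columns = [deaths, survivors]
table = [[30, 170],   # surgical intervention: 30 deaths out of 200
         [55, 145]]   # conservative treatment: 55 deaths out of 200

_, p_value = fisher_exact(table)
print(f"P = {p_value:.3f}")

# Conventional decision rule at the 0.05 significance level
if p_value < 0.05:
    print("Reject the null hypothesis of equal mortality rates.")
else:
    print("Fail to reject the null hypothesis.")
```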

            Too often in clinical medicine, we rely only on the P-value when rejecting or failing to reject the null hypothesis. In fact, most studies do not even report the power and sample size, forcing one to consider only the P-values provided in a scientific article. A statistically significant finding does not necessarily mean the finding is clinically relevant. For example, in a hypothetical study comparing systolic blood pressure (SBP) reduction between two antihypertensive drugs, a reduction in SBP of 2 mmHg was observed, with a statistically significant result in favor of Drug A, P = 0.01. Considering that a clinically significant reduction in SBP is a minimum of 5 mmHg, the 2-mmHg reduction does not achieve clinical significance despite its statistical significance. Conversely, a non-significant result, e.g. P = 0.06, simply indicates that there is insufficient evidence against the null hypothesis.(5)
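            The following simulation sketch (all numbers made up) illustrates how a roughly 2-mmHg difference can reach statistical significance in a large sample while remaining below an assumed 5-mmHg threshold of clinical importance.

```python
# Simulated illustration that a small (clinically trivial) effect can be
# statistically significant in a large sample. All numbers are hypothetical.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(seed=1)
drug_a = rng.normal(loc=10.0, scale=8.0, size=2000)  # SBP reduction, mmHg
drug_b = rng.normal(loc=8.0, scale=8.0, size=2000)   # ~2 mmHg smaller reduction

t_stat, p_value = ttest_ind(drug_a, drug_b)
diff = drug_a.mean() - drug_b.mean()
print(f"Mean difference = {diff:.1f} mmHg, P = {p_value:.2e}")

# Assumed minimal clinically important difference of 5 mmHg (see text)
mcid = 5.0
print("Clinically significant" if diff >= mcid else
      "Statistically significant but not clinically significant")
```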

            The risk of misinterpreting P-values arises from the mistake of attributing the size or direction of an effect to the P-value and failing to recognize that it indicates only generalizability from the sample to the population.(6,7) Treating 0.05 as an exact dichotomous cut-off for “statistical significance” is problematic: a P-value of 0.051 shows “no statistical significance,” whilst a P-value of 0.049 indicates “statistical significance,” taking no account of the size or direction of the effect. Even when Sir Ronald Fisher introduced P < 0.05 as the cut-off point of statistical significance in 1925, he advised against using it as an absolute rule and argued that, in the end, the interpretation of the P-value is up to the researcher.(8,9) The P-value should be treated as a continuous variable, considered in conjunction with the effect size and interpreted bearing in mind the clinical question being tested.

            P-values should not be the only measure of significance we rely on. Additional statistical measures such as effect size and confidence intervals (CIs) should be used to quantify the magnitude of the relationship between measured variables to better understand the results. Of course, statistics alone cannot answer research questions; judgement is critical, as is defining what is considered a clinically important effect.(5)

            What else should be considered?

            1. Effect size

            Choosing a significance level, conventionally P < 0.05, and then reaching statistical significance does not imply that the effect is large, but rather that there is evidence to reject the null hypothesis.(3) Effect sizes can be classified as absolute or calculated. Absolute effect sizes are raw differences between means of continuous variables (e.g. length of stay in a hospital unit in days). Calculated effect sizes take into account the variability of the study populations and can be computed as Cohen's d, the difference between the two group means divided by the pooled standard deviation (SD), i.e., d = (mean1 − mean2)/SD.(10) An absolute Cohen's d < 0.2 indicates a trivial effect, 0.2–0.5 a small effect, 0.5–0.8 a moderate effect, and >0.8 a large effect.
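            As a minimal sketch, Cohen's d with a pooled SD can be computed as follows; the length-of-stay figures are hypothetical.

```python
# Cohen's d computed with a pooled standard deviation, using made-up
# length-of-stay data (days) for two hospital units.
import numpy as np

def cohens_d(x, y):
    """(mean1 - mean2) / pooled SD for two independent samples."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) +
                  (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

unit_a = [4, 5, 6, 5, 7, 6, 5]   # hypothetical stays, unit A
unit_b = [6, 7, 8, 7, 9, 8, 7]   # hypothetical stays, unit B
print(f"Cohen's d = {cohens_d(unit_a, unit_b):.2f}")  # |d| > 0.8 -> large effect
```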

            Effect size can also be expressed as the odds ratio (OR) when the outcome variable is categorical. Finally, when describing the strength of a linear relationship between two continuous variables, the effect size can be calculated using the correlation coefficient.(10) For absolute values of the correlation coefficient (r), albeit with arbitrary cut-offs, r < 0.2 is considered very weak, r around 0.5 moderate, and r > 0.8 very strong.(11)
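            Both measures are straightforward to compute, as the sketch below shows; all data are hypothetical.

```python
# Effect sizes for a categorical outcome (odds ratio) and for a linear
# relationship (Pearson correlation coefficient); all data are made up.
import numpy as np
from scipy.stats import pearsonr

# Odds ratio from a 2x2 table: [event, no event] in exposed vs unexposed
a, b = 20, 80   # exposed:   20 events, 80 non-events
c, d = 10, 90   # unexposed: 10 events, 90 non-events
print(f"OR = {(a * d) / (b * c):.2f}")  # (20*90)/(80*10) = 2.25

# Pearson correlation between two continuous variables
age = np.array([30, 40, 50, 60, 70, 80])        # years
sbp = np.array([118, 125, 130, 138, 144, 152])  # mmHg
r, _ = pearsonr(age, sbp)
print(f"r = {r:.2f}")  # |r| > 0.8 would be regarded as very strong
```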

            2. Confidence Intervals

            A confidence interval (CI) is an interval estimate of the observed effect size; the width of the interval indicates the precision, or variability, of the data. CIs also convey information about statistical significance. If a 95% CI for the difference in means between two groups contains the value “0”, the result is not statistically significant (the P-value will be ≥0.05).(12) For a ratio comparing two groups (OR or RR), no difference is indicated by the value “1” rather than “0”: if the 95% CI contains “1”, the result is not statistically significant and the P-value will be ≥0.05, and vice versa. Including the 95% CI when reporting clinical data is critical as an additional tool for interpreting the results. For example, in a study investigating mortality in patients with myocardial infarction (MI) treated with fibrinolysis and transferred for early angioplasty vs patients treated with standard therapy,(13) mortality was reported in 17.2% of the standard treatment group and 11.0% of the early intervention group. The relative risk (RR) with 95% CI was 0.64 (0.47–0.87), with a P-value of 0.004, clearly showing a 36% reduction in the RR, with 95% confidence that the true reduction in mortality lies somewhere between 13% and 53%.(13) Appreciating both the reduction in the RR and the CI, the investigator can decide whether it is worth implementing the early angioplasty therapy.(13)
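            A 95% CI for an RR can be obtained on the log scale, as in the sketch below. The counts are hypothetical, chosen only to mirror the proportions quoted above; they are not the actual data of reference 13.

```python
# 95% CI for a relative risk, computed on the log scale.
# Counts are hypothetical (chosen to mirror 11.0% vs 17.2% mortality).
import math

e1, n1 = 55, 500    # early intervention arm: 55/500 = 11.0% mortality
e2, n2 = 86, 500    # standard therapy arm:   86/500 = 17.2% mortality

rr = (e1 / n1) / (e2 / n2)
se_log_rr = math.sqrt(1/e1 - 1/n1 + 1/e2 - 1/n2)  # SE of log(RR)
lo = math.exp(math.log(rr) - 1.96 * se_log_rr)
hi = math.exp(math.log(rr) + 1.96 * se_log_rr)
print(f"RR = {rr:.2f}, 95% CI ({lo:.2f}-{hi:.2f})")
# Because the CI excludes 1, the difference is statistically significant.
```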

            “…imperfectly understood confidence intervals are more useful and less dangerous than imperfectly understood p values…”(14)

            3. Fragility Index in Randomized Controlled Trials (RCTs)

            Research reproducibility and replicability are growing concerns in clinical scientific research, and the misuse and misinterpretation of P-values are often the reason for the fragility of scientific results. Many researchers are either unaware of or do not consider the Fragility Index (FI). Following similar concepts that emerged in the early 1990s,(15,16) Walsh's group in 2014 proposed the FI as a statistical metric estimating the robustness of statistically significant results reported from clinical trials with binary outcomes.(17) The FI denotes the minimum number of subjects whose outcome status would have to change from a non-event to an event to convert a statistically significant result into a non-significant one.(17) Statistically significant results of randomized controlled trials (RCTs) often hinge on small numbers of events, and the FI aids in identifying less robust results.(17) An FI of 3 means that if only three subjects change status from a non-event to an event on the outcome variable, the significance of the study's results is lost. Alarmingly, a quarter of the RCTs included in a review of 399 RCTs had an FI of 3 or less, with an overall median (range) FI of only 8 (1–109), whereas a review of 56 multicenter RCTs in critical care reported a median (IQR) FI of only 2 (1–3.5).(17,18) Included here mainly for awareness, the FI is not applicable to other study designs.
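            A minimal sketch of the FI calculation, following the approach of Walsh et al.,(17) uses Fisher's exact test and entirely hypothetical counts:

```python
# Fragility Index for a two-arm trial with a binary outcome: flip
# non-events to events in the arm with fewer events until the Fisher
# exact P-value reaches 0.05. Counts below are hypothetical.
from scipy.stats import fisher_exact

def fragility_index(e1, n1, e2, n2, alpha=0.05):
    # Work on the arm with the smaller event count, as in Walsh et al.
    if e1 > e2:
        e1, n1, e2, n2 = e2, n2, e1, n1
    flips = 0
    while e1 <= n1:
        _, p = fisher_exact([[e1, n1 - e1], [e2, n2 - e2]])
        if p >= alpha:          # significance has been lost
            return flips
        e1 += 1                 # flip one non-event to an event
        flips += 1
    return flips

# Hypothetical trial: 10/100 vs 25/100 events, initially significant
print(f"FI = {fragility_index(10, 100, 25, 100)}")
```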

            Summary and Recommendations

            More often than not, journals and universities demand exact P-values to accompany results.(1) Hence, complete abandonment of P-values, as some authors suggest, is not currently an option. Researchers, clinicians, journal reviewers, and editors should not place all the accountability for research findings solely on the P-value, but should look beyond it and consider the robustness of the results in the light of the study design, effect sizes with sample size calculations, the range of values of the 95% CIs, and the limitations of the study. In addition, the results must be interpreted in the context of current scientific and clinical knowledge.

            References

            1. Amrhein V, Greenland S, McShane B. Retire statistical significance. Nature. 2019;567:305–307.

            2. Benjamini Y, De Veaux RD, Efron B, et al. ASA President's Task Force Statement on statistical significance and replicability. CHANCE. 2021;34(4):10–11.

            3. Motulsky H. Intuitive biostatistics: a nonmathematical guide to statistical thinking. 4th ed. New York: Oxford University Press; 2018.

            4. Dawson B, Trapp RG. Basic and clinical biostatistics. 4th ed. New York: McGraw-Hill; 2004.

            5. Schober P, Bossers SM, Schwarte LA. Statistical significance versus clinical importance of observed effect sizes: what do P values and confidence intervals really represent? Anesth Analg. 2018;126(3):1068–1072.

            6. Redmond AC, Keenan AM. Understanding statistics. Putting p-values into perspective. J Am Podiatr Med Assoc. 2002;92(5):297–305.

            7. Sedgwick P. Understanding P values. BMJ. 2014;349:g4550.

            8. Sterne JA, Smith GD. Sifting the evidence—what's wrong with significance tests? Phys Ther. 2001;81(8):1464–1469.

            9. Fisher RA. Statistical methods for research workers. 11th ed. rev. Edinburgh: Oliver and Boyd; 1925.

            10. Sullivan GM, Feinn R. Using effect size—or why the P value is not enough. J Grad Med Educ. 2012;4(3):279–282.

            11. BMJ. Statistics at square one. 9th ed. London: BMJ Publishing Group; 1997. Available from: https://www.bmj.com/about-bmj/resources-readers/publications/statistics-square-one.

            12. Akobeng AK. Confidence intervals and p-values in clinical decision making. Acta Paediatr. 2008;97(8):1004–1007.

            13. Cantor WJ, Fitchett D, Borgundvaag B, et al. Routine early angioplasty after fibrinolysis for acute myocardial infarction. N Engl J Med. 2009;360(26):2705–2718.

            14. Hoenig JM, Heisey DM. The abuse of power: the pervasive fallacy of power calculations for data analysis. Am Stat. 2001;55(1):19–24.

            15. Feinstein AR. The unit fragility index: an additional appraisal of “statistical significance” for a contrast of two proportions. J Clin Epidemiol. 1990;43(2):201–209.

            16. Walter SD. Statistical significance and fragility criteria for assessing a difference of two proportions. J Clin Epidemiol. 1991;44(12):1373–1378.

            17. Walsh M, Srinathan SK, McAuley DF, et al. The statistical significance of randomized controlled trial results is frequently fragile: a case for a fragility index. J Clin Epidemiol. 2014;67(6):622–628.

            18. Ridgeon EE, Young PJ, Bellomo R, et al. The fragility index in multicenter randomized controlled critical care trials. Crit Care Med. 2016;44(7):1278–1284.

            Author and article information

            Journal
            Wits Journal of Clinical Medicine (WJCM)
            Wits University Press (5th Floor University Corner, Braamfontein, 2050, Johannesburg, South Africa)
            ISSN: 2618-0189, 2618-0197
            Published: 08 July 2024; Volume 6, Issue 2, Pages 113–116
            Affiliations
            [1 ]Health Sciences Research Office, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa.
            [2 ]School of Public Health, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa.
            [3 ]Department of Surgery, School of Clinical Medicine, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa; NIHR Global Surgery Unit – Surgical Statistics Hub, South Africa.
            Author notes
            [* ] Corresponding Author: Elena.Libhaber@wits.ac.za
            Author information
            http://orcid.org/0000-0002-7043-4002
            http://orcid.org/0000-0001-6690-4326
            https://orcid.org/0000-0002-3604-7682
            Article
            DOI: 10.18772/26180197.2024.v6n2a10

            Distributed under the terms of the Creative Commons Attribution Noncommercial NoDerivatives License https://creativecommons.org/licenses/by-nc-nd/4.0/, which permits noncommercial use and distribution in any medium, provided the original author(s) and source are credited, and the original work is not modified.

            Categories
            Statistics @ WJCM

            General medicine, Medicine, Internal medicine
