A Study on Reproducibility and the Reliability of the Hosmer-Lemeshow Test in Published Research

Yang, Audrey; Yang, Karen

doi:10.51387/25-NEJSDS81

The New England Journal of Statistics in Data Science

Volume 3, Issue 1 (2025), pp. 73–81

Pub. online: 28 March 2025

Type: Case Study, Application, And/or Practice Article

Open Access

Area: NextGen

Accepted: 11 February 2025

Published: 28 March 2025

Abstract

This paper examines two elements of reproducibility in published research. First, we examine whether published results are reproducible with author-supplied data: specifically, whether authors publish their data, whether they respond to requests for data that are claimed to be available upon reasonable request, and whether the data provided can be used to reproduce the authors’ results. Second, we seek to substantiate the so far mostly theoretical concerns about the low power of the Hosmer-Lemeshow goodness-of-fit test by investigating how it is used in practice, namely in published research by authors aiming to validate their models. Using the authors’ data, we build larger alternative models and use hypothesis tests to show that the smaller models, which the Hosmer-Lemeshow test had validated, do not adequately capture information that is available in the data. This demonstrates that the Hosmer-Lemeshow goodness-of-fit test often fails to detect inadequacies in models.
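The two ingredients named in the abstract can be illustrated with a minimal sketch. It assumes the usual grouped form of the Hosmer-Lemeshow statistic (observations sorted by fitted probability and split into equal-sized groups, compared against a chi-square distribution with groups minus 2 degrees of freedom) and assumes a likelihood ratio test as the hypothesis test between the smaller and larger model; the abstract does not state which test the authors used. The function names `chi2_sf`, `hosmer_lemeshow`, and `likelihood_ratio_test` are hypothetical, and the supplementary R code may differ.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with df degrees of freedom."""
    if x <= 0:
        return 1.0
    a, half_x = df / 2.0, x / 2.0
    # Power series for the regularized lower incomplete gamma function P(a, x/2).
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-15 * abs(total):
        term *= half_x / (a + n)
        total += term
        n += 1
    log_prefactor = a * math.log(half_x) - half_x - math.lgamma(a)
    return max(0.0, 1.0 - total * math.exp(log_prefactor))


def hosmer_lemeshow(y, p, groups=10):
    """Return (statistic, p-value) for binary outcomes y and fitted probabilities p.

    Assumption: every group is non-empty and has a mean fitted probability
    strictly between 0 and 1, otherwise the denominator is zero.
    """
    pairs = sorted(zip(p, y))  # order observations by fitted probability
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        size = len(chunk)
        expected = sum(prob for prob, _ in chunk)      # expected number of events
        observed = sum(outcome for _, outcome in chunk)  # observed number of events
        mean_p = expected / size
        stat += (observed - expected) ** 2 / (size * mean_p * (1.0 - mean_p))
    return stat, chi2_sf(stat, groups - 2)


def likelihood_ratio_test(loglik_small, loglik_large, extra_params):
    """Return (statistic, p-value) for a nested-model likelihood ratio test.

    Assumption: the small model is nested in the large one, and extra_params
    is the number of additional parameters in the large model.
    """
    stat = 2.0 * (loglik_large - loglik_small)
    return stat, chi2_sf(stat, extra_params)


# Illustrative inputs only; these are made-up numbers, not data from the paper.
hl_stat, hl_p = hosmer_lemeshow(
    [0, 0, 0, 1, 1, 1], [0.1, 0.3, 0.4, 0.6, 0.7, 0.9], groups=3
)
lr_stat, lr_p = likelihood_ratio_test(-100.0, -99.0, extra_params=2)
print(hl_stat, hl_p, lr_stat, lr_p)
```

In this framing, a large Hosmer-Lemeshow p-value for the smaller model combined with a small likelihood ratio p-value against the larger model would be the pattern the abstract describes: the goodness-of-fit test does not flag a model that a direct comparison shows to be inadequate.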

Supplementary material

We include our meta-data dataset, described in Section 2.2 of the paper. We also include the R code used to run the regressions and tests.

Open access article under the CC BY license.

Keywords

Hosmer-Lemeshow test; Reverse p-hacking; Goodness-of-fit; Logistic regression; Reproducibility
