The New England Journal of Statistics in Data Science


A Study on Reproducibility and the Reliability of the Hosmer-Lemeshow Test in Published Research
Volume 3, Issue 1 (2025), pp. 73–81
Audrey Yang, Karen Yang

https://doi.org/10.51387/25-NEJSDS81
Pub. online: 28 March 2025
Type: Case Study, Application, And/or Practice Article
Open Access
Area: NextGen

Accepted: 11 February 2025
Published: 28 March 2025

Abstract

This paper discusses two aspects of reproducibility in published research. First, we examine whether published results can be reproduced with author-supplied data: specifically, whether authors publish their data, whether they respond to requests for data when the data are stated to be available upon reasonable request, and whether the data provided are usable for reproducing the authors' results. Second, we seek to substantiate the concerns, so far mostly theoretical, about the Hosmer-Lemeshow goodness-of-fit test's lack of power by investigating how the test is used in practice, namely by authors of published research who rely on it to validate their models. Using the authors' data, we build larger alternative models and perform hypothesis tests showing that the smaller models, which the Hosmer-Lemeshow test had validated, fail to capture information that is available in the data. We thereby demonstrate that the Hosmer-Lemeshow goodness-of-fit test is often incapable of detecting model inadequacies.
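For readers who want the tested quantity in concrete form, the Python sketch below computes the Hosmer-Lemeshow statistic in one common form: observations are sorted by fitted probability and split into g roughly equal-size groups (g = 10 by default), and C = sum over groups of (O_g - E_g)^2 / (n_g * pbar_g * (1 - pbar_g)) is referred to a chi-square distribution with g - 2 degrees of freedom. This is a standard-library illustration, not the authors' R code; the function names are hypothetical, ties in fitted probability are broken by outcome (one arbitrary choice among several), and statistical packages differ in how they form groups, so results need not match any particular implementation.

```python
import math
from typing import Sequence, Tuple


def chi2_sf(x: float, df: float) -> float:
    """Upper-tail probability P(X > x) for X ~ chi-square(df).

    Uses the power series for the regularized lower incomplete gamma
    function P(a, z) with a = df/2 and z = x/2 (adequate for illustration).
    """
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > total * 1e-15:
        term *= z / (a + n)
        total += term
        n += 1
    lower = math.exp(a * math.log(z) - z - math.lgamma(a)) * total
    return min(1.0, max(0.0, 1.0 - lower))


def hosmer_lemeshow(y: Sequence[int], p: Sequence[float], groups: int = 10) -> Tuple[float, float]:
    """Hypothetical helper: Hosmer-Lemeshow statistic and its p-value.

    y: observed 0/1 outcomes; p: fitted probabilities strictly inside (0, 1).
    """
    pairs = sorted(zip(p, y))  # order by fitted probability (ties broken by outcome)
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        n_g = len(chunk)
        observed = sum(yi for _, yi in chunk)  # O_g: observed events in group g
        expected = sum(pi for pi, _ in chunk)  # E_g: sum of fitted probabilities
        p_bar = expected / n_g
        stat += (observed - expected) ** 2 / (n_g * p_bar * (1.0 - p_bar))
    return stat, chi2_sf(stat, groups - 2)


if __name__ == "__main__":
    # Toy data: 10 blocks of 20 observations whose fitted probability equals
    # the block's observed event rate (0.05, 0.15, ..., 0.95).
    y, p = [], []
    for k in range(10):
        events = 2 * k + 1
        y += [1] * events + [0] * (20 - events)
        p += [0.05 * events] * 20
    stat, p_value = hosmer_lemeshow(y, p)
    print(f"HL statistic = {stat:.4f}, p-value = {p_value:.4f}")
```

A well-calibrated toy model like this one yields a statistic near zero. Note that a non-significant result only says the grouped observed and expected counts are close; it does not show that no better model exists, which is the gap the paper examines.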

Supplementary material

We include our metadata dataset, described in Section 2.2 of the paper, as well as the R code used to run the regressions and tests.
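The supplementary R code is not reproduced on this page. As an illustration only, the Python sketch below shows one standard way to carry out the kind of comparison the abstract describes: fit a smaller logistic model and a larger model that nests it, then compare them with a likelihood-ratio test, whose statistic 2 * (loglik_large - loglik_small) is referred to a chi-square distribution with degrees of freedom equal to the number of added parameters. The choice of the likelihood-ratio test, the Newton-Raphson fitter, the simulated data, and every function name here are assumptions for illustration, not the authors' procedure.

```python
import math
import random


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for X ~ chi-square(df), via the power
    series for the regularized lower incomplete gamma function."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = total = 1.0 / a
    n = 1
    while term > total * 1e-15:
        term *= z / (a + n)
        total += term
        n += 1
    lower = math.exp(a * math.log(z) - z - math.lgamma(a)) * total
    return min(1.0, max(0.0, 1.0 - lower))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, k):
            f = M[r][c] / M[c][c]
            for j in range(c, k + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * k
    for r in reversed(range(k)):
        x[r] = (M[r][k] - sum(M[r][j] * x[j] for j in range(r + 1, k))) / M[r][r]
    return x


def log1p_exp(eta):
    """Numerically stable log(1 + exp(eta))."""
    return eta + math.log1p(math.exp(-eta)) if eta > 0 else math.log1p(math.exp(eta))


def fit_logistic(X, y, max_iter=50, tol=1e-10):
    """Hypothetical helper: logistic regression by Newton-Raphson. Rows of X must
    include a leading 1.0 for the intercept. Returns (coefficients, log-likelihood)."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(max_iter):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for xi, yi in zip(X, y):
            mu = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, xi))))
            for i in range(k):
                grad[i] += (yi - mu) * xi[i]  # score: X'(y - mu)
                for j in range(k):
                    hess[i][j] += mu * (1.0 - mu) * xi[i] * xi[j]  # information: X'WX
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    loglik = 0.0
    for xi, yi in zip(X, y):
        eta = sum(b * v for b, v in zip(beta, xi))
        loglik += yi * eta - log1p_exp(eta)  # Bernoulli log-likelihood
    return beta, loglik


def likelihood_ratio_test(X_small, X_large, y):
    """Hypothetical helper comparing nested models: 2 * (ll_large - ll_small) ~ chi-square(df)."""
    _, ll_small = fit_logistic(X_small, y)
    _, ll_large = fit_logistic(X_large, y)
    stat = 2.0 * (ll_large - ll_small)
    return stat, chi2_sf(stat, len(X_large[0]) - len(X_small[0]))


def simulate_example(n=500, seed=0):
    """Hypothetical data: the outcome depends on x1 and x2, but the smaller
    model includes only x1."""
    rng = random.Random(seed)
    X_small, X_large, y = [], [], []
    for _ in range(n):
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
        prob = 1.0 / (1.0 + math.exp(-(-0.5 + x1 + 1.5 * x2)))
        y.append(1 if rng.random() < prob else 0)
        X_small.append([1.0, x1])
        X_large.append([1.0, x1, x2])
    return X_small, X_large, y


if __name__ == "__main__":
    stat, p_value = likelihood_ratio_test(*simulate_example())
    print(f"LR statistic = {stat:.2f} on 1 df, p-value = {p_value:.3g}")
```

The hand-rolled fitter and chi-square tail are only meant to keep the sketch dependency-free; in R the same comparison would typically be done with `glm(..., family = binomial)` and `anova(small, large, test = "Chisq")`.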



Copyright
© 2025 New England Statistical Society
Open access article under the CC BY license.

Keywords
Hosmer-Lemeshow test; Reverse p-hacking; Goodness-of-fit; Logistic regression; Reproducibility



  • ISSN: 2693-7166
  • Copyright © 2021 New England Statistical Society
