A Study on Reproducibility and the Reliability of the Hosmer-Lemeshow Test in Published Research
Volume 3, Issue 1 (2025), pp. 73–81
Pub. online: 28 March 2025
Type: Statistical Methodology
Open Access
Accepted: 11 February 2025
Published: 28 March 2025
Abstract
This paper examines two aspects of reproducibility in published research. First, we examine whether published results can be reproduced from author-supplied data: specifically, whether authors publish their data, whether they respond to requests when data are described as available upon reasonable request, and whether the data provided are usable for reproducing the authors’ results. Second, we seek to substantiate the concerns, so far mostly theoretical, about the low power of the Hosmer-Lemeshow goodness-of-fit test by investigating how it is used in practice, that is, in published research by authors aiming to validate their models. Using the authors’ data, we build larger alternative models and use hypothesis tests to show that the smaller models, which had been validated by the Hosmer-Lemeshow test, do not adequately capture information that is available in the data. This demonstrates that the Hosmer-Lemeshow goodness-of-fit test often fails to detect inadequacies in models.
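To make the comparison described in the abstract concrete, the following is a minimal sketch in Python of the two kinds of test involved. It is not the authors' code (their supplementary material is in R). The function names `chi2_sf`, `hosmer_lemeshow`, and `likelihood_ratio_test` are hypothetical, the grouping uses equal-sized bins of sorted fitted probabilities, and the likelihood ratio test is only one common choice for comparing nested models; the abstract does not say which hypothesis test was used.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    # Start from Q(1/2, y) for odd df or Q(1, y) for even df, then apply
    # the recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1).
    if df % 2:
        a, q = 0.5, math.erfc(math.sqrt(y))
    else:
        a, q = 1.0, math.exp(-y)
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return min(q, 1.0)


def hosmer_lemeshow(y, p, groups=10):
    """Return (statistic, df, p_value) for 0/1 outcomes y and fitted probabilities p.

    Assumes every p lies strictly between 0 and 1 and len(y) >= groups >= 3.
    """
    if len(y) != len(p) or groups < 3 or len(y) < groups:
        raise ValueError("need matching inputs and len(y) >= groups >= 3")
    # Sort by fitted probability only, so ties are not reordered by outcome.
    pairs = sorted(zip(p, y), key=lambda pair: pair[0])
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        size = len(chunk)
        observed = sum(yi for _, yi in chunk)
        expected = sum(pi for pi, _ in chunk)
        # (O - E)^2 / (n_g * pbar * (1 - pbar)), with pbar = E / n_g
        stat += (observed - expected) ** 2 / (expected * (1.0 - expected / size))
    df = groups - 2
    return stat, df, chi2_sf(stat, df)


def likelihood_ratio_test(loglik_small, loglik_large, extra_params):
    """Compare nested models given their maximized log-likelihoods."""
    stat = 2.0 * (loglik_large - loglik_small)
    return stat, chi2_sf(stat, extra_params)
```

Under these assumptions, a large Hosmer-Lemeshow p-value for the smaller model together with a small likelihood ratio p-value against the larger model would illustrate the situation the abstract describes: the goodness-of-fit test does not flag a model that a nested comparison rejects.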
Supplementary material
We include our meta-data dataset, described in Section 2.2 of the paper. We also include the R code used to run the regressions and tests.