The New England Journal of Statistics in Data Science

A Safe Hosmer-Lemeshow Test
Volume 2, Issue 2 (2024), pp. 175–189
Alexander Henzi 1, Marius Puke 1, Timo Dimitriadis (first three of 4 authors)

https://doi.org/10.51387/23-NEJSDS56
Pub. online: 18 December 2023      Type: Methodology Article      Open Access
Area: Statistical Methodology

1 The first two authors contributed equally to this work.

Accepted: 2 December 2023
Published: 18 December 2023

Abstract

This article proposes an alternative to the Hosmer-Lemeshow (HL) test for evaluating the calibration of probability forecasts for binary events. The approach is based on e-values, a new tool for hypothesis testing. An e-value is a nonnegative random variable with expected value less than or equal to one under a null hypothesis. Large e-values give evidence against the null hypothesis, and the multiplicative inverse of an e-value is a (conservative) p-value. Our test uses online isotonic regression to estimate the calibration curve as a ‘betting strategy’ against the null hypothesis. We show that the test has power against essentially all alternatives, which makes it theoretically superior to the HL test and at the same time resolves the well-known instability problem of the latter. A simulation study shows that a feasible version of the proposed eHL test can detect slight miscalibrations at practically relevant sample sizes, but pays for its universal validity and power guarantees with reduced empirical power compared to the HL test in a classical simulation setup. We illustrate our test on recalibrated predictions for credit card defaults during the Taiwan credit card crisis, where the classical HL test delivers equivocal results.
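
To make the notion of an e-value concrete, the following minimal sketch builds one for the null hypothesis that forecasts are calibrated, i.e. that each outcome is Bernoulli with the forecast probability. It multiplies likelihood ratios of an alternative probability against the forecast, which is one standard way to construct an e-value by betting. The function names and the alternative probabilities are hypothetical and chosen for illustration only; in particular, the alternatives here are supplied by hand and must be fixed before the corresponding outcome is seen, whereas the article obtains them from online isotonic regression. This is not the authors' eHL implementation.

```python
import math


def e_value(forecasts, outcomes, alternatives):
    """Product of Bernoulli likelihood ratios (alternative q vs forecast p).

    Assumption: each q is chosen before the matching outcome y is observed.
    Under the null y ~ Bernoulli(p), every factor has expectation
    p * (q / p) + (1 - p) * ((1 - q) / (1 - p)) = 1, so the product is an e-value.
    """
    log_e = 0.0
    for p, y, q in zip(forecasts, outcomes, alternatives):
        if y == 1:
            log_e += math.log(q / p)
        else:
            log_e += math.log((1.0 - q) / (1.0 - p))
    return math.exp(log_e)


def p_value_from_e(e):
    """Conservative p-value: the inverse of the e-value, capped at one."""
    return min(1.0, 1.0 / e)


# Toy data: forecasts of 0.5, while the bettor suspects the true rate is 0.8.
forecasts = [0.5, 0.5, 0.5, 0.5]
alternatives = [0.8, 0.8, 0.8, 0.8]
outcomes = [1, 1, 1, 0]

e = e_value(forecasts, outcomes, alternatives)
p = p_value_from_e(e)
```

With these toy inputs the three events each multiply the e-value by 0.8 / 0.5 = 1.6 and the single non-event multiplies it by 0.2 / 0.5 = 0.4. Betting on alternatives equal to the forecasts leaves the e-value at exactly one, which reflects that no evidence can be gained without departing from the null.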



Copyright
© 2024 New England Statistical Society
Open access article under the CC BY license.

Keywords
E-value; Probability forecast; Calibration validation; Goodness-of-fit; Isotonic regression

Funding
A. Henzi and J. Ziegel gratefully acknowledge financial support from the Swiss National Science Foundation. T. Dimitriadis gratefully acknowledges financial support from the German Research Foundation (DFG) through grant number 502572912.

  • ISSN: 2693-7166