A Safe Hosmer-Lemeshow Test
Volume 2, Issue 2 (2024), pp. 175–189
Pub. online: 18 December 2023
Type: Methodology Article
Open Access
Area: Statistical Methodology
1 The first two authors contributed equally to this work.
Accepted: 2 December 2023
Published: 18 December 2023
Abstract
This article proposes an alternative to the Hosmer-Lemeshow (HL) test for evaluating the calibration of probability forecasts for binary events. The approach is based on e-values, a new tool for hypothesis testing. An e-value is a random variable with expected value less than or equal to one under a null hypothesis. Large e-values give evidence against the null hypothesis, and the multiplicative inverse of an e-value is a p-value. Our test uses online isotonic regression to estimate the calibration curve as a ‘betting strategy’ against the null hypothesis. We show that the test has power against essentially all alternatives, which makes it theoretically superior to the HL test and at the same time resolves the well-known instability problem of the latter. A simulation study shows that a feasible version of the proposed eHL test can detect slight miscalibrations at practically relevant sample sizes, but trades its universal validity and power guarantees for reduced empirical power compared to the HL test in a classical simulation setup. We illustrate our test on recalibrated predictions for credit card defaults during the Taiwan credit card crisis, where the classical HL test delivers equivocal results.
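To make the notions above concrete, the following is a minimal sketch, in our own notation, of the e-value property and of a generic betting construction of the kind described in the abstract; it illustrates the idea only and is not necessarily the exact eHL statistic of the article. An e-value is a nonnegative random variable $E$ with $\mathbb{E}_P[E] \le 1$ for every distribution $P$ in the null hypothesis $H_0$. By Markov's inequality,
\[
P(E \ge 1/\alpha) \le \alpha \quad \text{for all } P \in H_0,
\]
so $\min(1, 1/E)$ is a conservative p-value, and rejecting $H_0$ when $E \ge 1/\alpha$ controls the type-I error at level $\alpha$. For probability forecasts $p_1, p_2, \ldots \in (0,1)$ of binary outcomes $y_1, y_2, \ldots \in \{0,1\}$, a betting-style e-value can be formed as the running product
\[
E_n = \prod_{i=1}^{n} \left(\frac{\hat{\pi}_i}{p_i}\right)^{y_i} \left(\frac{1-\hat{\pi}_i}{1-p_i}\right)^{1-y_i},
\]
where $\hat{\pi}_i \in (0,1)$ is a recalibrated probability computed from the data observed before time $i$, for instance by online isotonic regression as in the abstract. Under the null that each $y_i$ is conditionally Bernoulli$(p_i)$ given the past, every factor has conditional expectation $p_i \cdot \hat{\pi}_i/p_i + (1-p_i) \cdot (1-\hat{\pi}_i)/(1-p_i) = 1$, so $(E_n)_{n \ge 0}$ is a nonnegative martingale with $E_0 = 1$, and $E_n$ is a valid e-value at any (stopping) time.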