A safe Hosmer-Lemeshow test

This article proposes an alternative to the Hosmer-Lemeshow (HL) test for evaluating the calibration of probability forecasts for binary events. The approach is based on e-values, a recently developed tool for hypothesis testing. An e-value is a random variable with expected value less than or equal to one under a null hypothesis. Large e-values give evidence against the null hypothesis, and the multiplicative inverse of an e-value is a p-value. Our test uses online isotonic regression to estimate the calibration curve as a 'betting strategy' against the null hypothesis. We show that the test has power against essentially all alternatives, which makes it theoretically superior to the HL test and at the same time resolves the well-known instability problem of the latter. A simulation study shows that a feasible version of the proposed eHL test can detect slight miscalibrations at practically relevant sample sizes, but trades its universal validity and power guarantees for reduced empirical power compared to the HL test in a classical simulation setup. We illustrate our test on recalibrated predictions for credit card defaults during the Taiwan credit card crisis, where the classical HL test delivers equivocal results.


Introduction
Suppose that we have a sample of observations $(p_i, y_i)_{i=1}^n$ from $(P_i, Y_i)_{i=1}^n$ such that $(P_i, Y_i)$ has the same distribution as $(P, Y) \in [0,1] \times \{0,1\}$, $i = 1, \ldots, n$. The interpretation is that $P_i$ is a prediction for the probability that $Y_i = 1$. The random variables are defined on some underlying probability space $(\Omega, \mathcal{F})$, and $\mathcal{P}$ denotes the set of all probability measures on $(\Omega, \mathcal{F})$. Hosmer and Lemeshow (1980) propose a test for the null hypothesis of perfect calibration
$$H_{\mathrm{HL},n} = \{P \in \mathcal{P} \mid E_P(Y_i \mid P_i) = P_i \ P\text{-almost surely}, \ i = 1, \ldots, n\}.$$
The Hosmer-Lemeshow (henceforth HL) test is based on partitioning the interval $[0,1]$ into $g \in \mathbb{N}$ bins and counting the observed numbers of events, $o_{1k}$, and non-events, $o_{0k}$, in each bin $k$. Based on this binning and counting procedure, the HL test statistic for perfect calibration of the probability predictions is
$$C = \sum_{k=1}^{g} \left( \frac{(o_{1k} - e_{1k})^2}{e_{1k}} + \frac{(o_{0k} - e_{0k})^2}{e_{0k}} \right), \qquad (1)$$
(The first two authors contributed equally to this work.)
where $e_{1k}$ and $e_{0k}$ are the expected numbers of events and non-events in bin $k$, respectively (Hosmer et al., 2013). Under the null hypothesis, $C$ asymptotically follows a $\chi^2$-distribution with $g$ degrees of freedom, given that the sample $(P_i, Y_i)_{i=1}^n$ was not used for model estimation (and $g - 2$ degrees of freedom otherwise).
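To make the binning-and-counting procedure concrete, the following minimal sketch computes the statistic $C$ with quantile-based bins; the function name and the simulated data are our own illustration, not part of the original implementation.

```python
import numpy as np

def hosmer_lemeshow(p, y, g=10):
    """HL statistic C with quantile-based ("decile-of-risk") binning.

    o_1k / o_0k are observed events / non-events in bin k;
    e_1k / e_0k are the expected counts, i.e. sums of p and 1 - p."""
    p, y = np.asarray(p, float), np.asarray(y, int)
    edges = np.unique(np.quantile(p, np.linspace(0, 1, g + 1)))
    idx = np.clip(np.searchsorted(edges, p, side="right") - 1, 0, len(edges) - 2)
    C = 0.0
    for k in range(len(edges) - 1):
        in_bin = idx == k
        o1, e1 = y[in_bin].sum(), p[in_bin].sum()
        o0, e0 = (1 - y[in_bin]).sum(), (1 - p[in_bin]).sum()
        C += (o1 - e1) ** 2 / e1 + (o0 - e0) ** 2 / e0
    return C

rng = np.random.default_rng(1)
p = rng.uniform(0.05, 0.95, 5000)
y = rng.binomial(1, p)        # perfectly calibrated predictions
C = hosmer_lemeshow(p, y)     # roughly chi-squared with g df under the null
```

With calibrated predictions as above, $C$ fluctuates around its asymptotic mean $g$; gross miscalibration inflates it.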
Technically, the choice of the binning procedure is up to the user of the HL test; it is conventionally implemented via quantile-based binning with $g = 10$, resulting in equally populated bins ("deciles of risk"). Less commonly, the test is based on equidistantly spaced bins, where the unit interval (or the range of prediction values) is divided into $g$ bins of equal width.
While little attention is devoted to the binning procedure in practical applications, it implicitly determines the set of alternatives the test has power against (Dimitriadis et al., 2022, Section 5), so that the test result is often highly sensitive to the exact implementation of the binning; see, e.g., Hosmer et al. (1997); Bertolini et al. (2000); Kuss (2002) and our empirical application in Section 4. Nevertheless, the HL test is still the literature's favorite for checking the calibration of binary prediction models and is commonly used in current and highly influential medical and epidemiological studies; see, amongst many others, Neblett Fanfair et al. (2012); Ostrosky-Zeichner et al. (2017); Lee et al. (2020).
In this article, we suggest a safe and stable HL test based on e-values (described below) and isotonic regression (Ayer et al., 1955; Brunk, 1965), henceforth called the eHL test. Dimitriadis et al. (2021) recently proposed the use of isotonic regression to resolve the closely related instability issue stemming from binning approaches in so-called reliability diagrams in forecast evaluation. While feasible inference on the isotonic regression for classical testing procedures is hampered by complicated asymptotic distributions and an inconsistency of the bootstrap, the e-values adopted here prove to be an appealing alternative in this setting. Based on online isotonic regression studied by Kotlowski et al. (2017), we show that (an ideal version of) our eHL test has power against essentially all deviations from calibration, which makes it theoretically superior to the classical HL test.
E-values, where 'e' abbreviates 'expectation', were proposed recently as an alternative to p-values in testing problems. In a nutshell, an e-value is a realization of a non-negative random variable whose expected value is at most one under a given null hypothesis. An e-value therefore allows for a meaningful interpretation on its own, since an e-value greater than one provides evidence against the null hypothesis. Additionally, the multiplicative inverse of an e-value is a conservative p-value by Markov's inequality. From a game-theoretic perspective, the e-value has a simple financial meaning: it can be seen as the factor by which a skeptic multiplies her money when betting against the null hypothesis; see Shafer and Vovk (2019); Shafer (2021).
An important advantage of e-values over p-values is their uncomplicated behavior under combination: the arithmetic average of e-values is also an e-value, as is the product of independent or successive e-values; see Shafer (2021); Grünwald et al. (2020); Wang and Ramdas (2020). This is appealing in practice because further evidence can be added later, i.e., evidence across studies can easily be combined.
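As a quick numerical illustration of these merging rules, the sketch below uses a toy e-variable of our own construction: $\exp(Z - 1/2)$ with $Z$ standard normal has expected value exactly one (log-normal mean), so it satisfies the defining e-variable property.

```python
import numpy as np

rng = np.random.default_rng(42)

# toy e-variables: exp(Z - 1/2) with Z standard normal has mean exactly one
# (log-normal mean), hence satisfies the defining e-variable property
e1 = np.exp(rng.normal(size=100_000) - 0.5)
e2 = np.exp(rng.normal(size=100_000) - 0.5)

avg = (e1 + e2) / 2   # arithmetic average of e-values: again an e-value
prod = e1 * e2        # product of independent e-values: again an e-value
```

Both `avg` and `prod` have sample means close to one, in line with the merging results cited above.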
The proposed eHL test offers a safe alternative to a fragile state-of-the-art approach by avoiding ad-hoc choices and software instabilities. It can be regarded as an application of the Universal Inference approach of Wasserman et al. (2020). While this method allows the construction of valid tests under only weak assumptions, it has been observed that this validity often comes at the price of diminished power (Strieder and Drton, 2022; Tse and Davison, 2022). In Section 2, we show that an ideal -- but computationally infeasible -- variant of the eHL test does have guaranteed power to detect essentially all violations of calibration. Our proof relies on connections between the proposed e-value and the regret in random permutation online isotonic regression, which is studied by Kotlowski et al. (2017). It has been observed that power guarantees for anytime-valid tests can be obtained by means of regret bounds of online prediction methods; see, for example, the discussion in Casgrain et al. (2022). Our result demonstrates that such a connection also exists in the batch case of e-values for a fixed sample size $n$, due to connections with the online random permutation setting.
In Section 3, we compare a feasible version of the eHL test to the classical HL test in a simulation study. As expected, we find that the eHL test has conservative rejection rates under the null hypothesis and quickly develops power under model misspecification. While its empirical test power is lower than that of the classical HL test, we do not consider this to be problematic, as HL tests are often carried out on vast data sets and are even criticized as being "too powerful" in the sense that they reject essentially all, even acceptably well calibrated, models (Paul et al., 2013; Nattino et al., 2020). See Dimitriadis et al. (2022) for an alternative solution to this problem based on confidence bands.
We apply the eHL test in Section 4 to predictions of a logistic regression model for the binary event of credit card defaults in Taiwan in 2005, where over-issuing of credit cards led to many default payments and a subsequent credit card crisis (Yeh and Lien, 2009). The eHL test provides clear evidence against calibration of the logistic model predictions, and further illustrates that recalibration methods work well. In contrast, the classical HL test based on different natural binning choices delivers equivocal results, with p-values ranging from 0 to 0.91 for a single prediction method, implying that a researcher could have cherry-picked the binning specification, and hence the test result, at will.

Preliminaries
An e-variable for $H_{\mathrm{HL},n}$ is a non-negative random variable $E$ (that is allowed to take the value $+\infty$) such that $E_P(E) \le 1$ for all $P \in \mathcal{P}$. An e-value is a realization of an e-variable. An e-variable $E$ always yields a valid p-variable $1/E$ (a p-value is a realized p-variable) by Markov's inequality, since
$$P(1/E \le \alpha) = P(E \ge 1/\alpha) \le \alpha\, E_P(E) \le \alpha, \quad \alpha \in (0, 1]. \qquad (2)$$
We reject the null hypothesis $H_{\mathrm{HL},n}$ if we observe a large value of $E$. If we want to ensure a classical p-guarantee, then we have to determine the rejection region for a given $\alpha$ by (2). Vovk and Wang (2021) show that this is essentially the only way to transform an e-variable into a p-variable. We say that an e-variable $E$ has the alternative hypothesis $\mathcal{H}' \subseteq \mathcal{P}$ if $E_Q(E) > 1$ for all $Q \in \mathcal{H}'$.
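The p-guarantee from (2) can be checked numerically: rejecting whenever $E \ge 1/\alpha$ keeps the Type-I error below $\alpha$. The toy e-variable below is our own illustration, not part of the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(7)
alpha = 0.05

# toy e-variable with E_P(E) = 1 under the null (log-normal construction)
E = np.exp(rng.normal(size=200_000) - 0.5)

# Markov's inequality: P(E >= 1/alpha) <= alpha * E_P(E) <= alpha, so the
# rejection region {E >= 1/alpha} yields a valid level-alpha test and
# 1/E is a conservative p-variable
type_I_error = np.mean(E >= 1 / alpha)
```

Here the empirical Type-I error is far below $\alpha$, reflecting the conservativeness of the Markov bound.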

Sample size one
We first construct e-variables for the sample size one Hosmer-Lemeshow null hypothesis $H_{\mathrm{HL},1} = \{P \in \mathcal{P} \mid E_P(Y \mid P) = P \ P\text{-almost surely}\}$. In this special case, e-variables are likelihood ratios conditional on $P$. Indeed, if $q \in [0,1]$, an e-variable for $H_{\mathrm{HL},1}$ is given by
$$E_q(P, Y) = \left(\frac{q}{P}\right)^{Y} \left(\frac{1-q}{1-P}\right)^{1-Y}.$$
The variable $E_q(P, Y)$ is clearly non-negative, and for $P \in H_{\mathrm{HL},1}$,
$$E_P(E_q(P, Y) \mid P) = \frac{q}{P}\, P + \frac{1-q}{1-P}\, (1 - P) = 1.$$
To find alternative hypotheses for the e-variable $E_q$, let $\pi = E_Q(Y \mid P)$. Then,
$$E_Q(E_q(P, Y) \mid P) = \frac{q}{P}\, \pi + \frac{1-q}{1-P}\, (1 - \pi)$$
is strictly larger than one if and only if $\pi > P$ and $q > P$, or $\pi < P$ and $q < P$, i.e., if $\pi$ and $q$ lie on the same side of $P$. This shows that if $q < P$, $E_q$ has the alternative
$$\{Q \in \mathcal{P} \mid E_Q(Y \mid P) < P\}, \qquad (3)$$
and if $q > P$, $E_q$ has the alternative
$$\{Q \in \mathcal{P} \mid E_Q(Y \mid P) > P\}. \qquad (4)$$
It is possible to show that basically any e-variable for $H_{\mathrm{HL},1}$ is of the form $E = E_q(P, Y)$ for some $q$ (depending on $P$), but this requires some further arguments; it follows by the construction in Henzi and Ziegel (2022), see also Waudby-Smith and Ramdas (2021). The connection of $E_q(P, Y)$ to the e-variables in Henzi and Ziegel (2022) of type $E = 1 + \lambda D$, with $D \ge -1$ such that $E_P(D) = 0$ for $P \in H_{\mathrm{HL},1}$, follows from the fact that $\lambda$ in this representation can be bijectively mapped to $q$. In this context,
$$E = 1 + \lambda (P - Y) \qquad (5)$$
is an e-variable for $H_{\mathrm{HL},1}$ for any $\lambda$ that is $\sigma(P)$-measurable with $-1/P \le \lambda \le 1/(1-P)$. If $P = 1$, there is no restriction on $\lambda$ from above, and analogously if $P = 0$, there is no restriction from below. By choosing $\lambda = (P - q)/(P(1-P))$, we obtain $E = E_q(P, Y)$.
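The mean-one property under the null and the growth under a same-side alternative can be verified directly; the numbers $p = 0.3$, $q = 0.6$, $\pi = 0.5$ below are our own hypothetical choices.

```python
def e_q(p, y, q):
    """E_q(P, Y) = (q/p)^y ((1-q)/(1-p))^(1-y), a likelihood ratio given P."""
    return (q / p) ** y * ((1 - q) / (1 - p)) ** (1 - y)

# under the null, Y | P = p is Bernoulli(p), so the conditional mean is
# (q/p) p + ((1-q)/(1-p)) (1-p) = q + (1 - q) = 1
p, q = 0.3, 0.6
null_mean = e_q(p, 1, q) * p + e_q(p, 0, q) * (1 - p)

# under an alternative with pi = E_Q(Y | P) on the same side of p as q,
# the conditional mean exceeds one, i.e. the bet grows
pi = 0.5
alt_mean = e_q(p, 1, q) * pi + e_q(p, 0, q) * (1 - pi)
```

Since $\pi = 0.5$ and $q = 0.6$ are both above $p = 0.3$, `alt_mean` exceeds one, matching the same-side condition for the alternatives (3)-(4).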
Clearly, the e-variable $E_q(P, Y)$ may take the value infinity if either $P = 0$ and $Y = 1$, or $P = 1$ and $Y = 0$ occurs; a single observation $Y = 1$ or $Y = 0$ is sufficient to reject the hypothesis of calibration with certainty if the predicted probabilities are in $\{0, 1\}$. For the remainder of the theoretical part of this paper, we will always make the assumption $P(P \in \{0, 1\}) = 0$ to exclude these special but uninteresting cases.

Combining e-values in the iid case
We assume now that $(P_i, Y_i)_{i=1}^n$ are independent and identically distributed (iid). For testing $H_{\mathrm{HL},n}$, we suggest the e-variable
$$E^{\mathrm{id}}_{\mathrm{HL},n} = \prod_{i=1}^{n} E_{q_i}(P_i, Y_i), \qquad (6)$$
where $q_i$ is $\sigma(P_1, \ldots, P_i, Y_1, \ldots, Y_{i-1})$-measurable. For $P \in H_{\mathrm{HL},n}$, we have
$$E_P\big(E^{\mathrm{id}}_{\mathrm{HL},n}\big) = E_P\big(E^{\mathrm{id}}_{\mathrm{HL},n-1}\, E_P(E_{q_n}(P_n, Y_n) \mid P_1, \ldots, P_n, Y_1, \ldots, Y_{n-1})\big) = E_P\big(E^{\mathrm{id}}_{\mathrm{HL},n-1}\big) = \cdots = 1,$$
where we used the equivalent representation of $E_q(P, Y)$ in (5). In particular, from the above derivation it is easy to see that $(E^{\mathrm{id}}_{\mathrm{HL},n})_{n \in \mathbb{N}}$ is a test martingale. The e-variable $E^{\mathrm{id}}_{\mathrm{HL},n}$ depends on the ordering of $(P_i, Y_i)_{i=1}^n$ through the choice of $q_i$. Let $S_n$ denote all permutations of $\{1, \ldots, n\}$, and for $\sigma \in S_n$ define $E^{\sigma}_{\mathrm{HL},n}$ as $E^{\mathrm{id}}_{\mathrm{HL},n}$ for the random variables $(P_{\sigma(i)}, Y_{\sigma(i)})_{i=1}^n$ instead of $(P_i, Y_i)_{i=1}^n$. Generally, a data-dependent choice of the permutation $\sigma$ does not yield an e-variable for $H_{\mathrm{HL},n}$, so one would guess that there are opportunities to fish for (spurious) significance by choosing some specific ordering of a sample of observations $(p_i, y_i)_{i=1}^n$. If there is a natural ordering of the observations, such as a time stamp, this problem usually does not occur in applications, since a different ordering of the observations is hard to justify. Indeed, when the observations are sequential (and possibly dependent), the e-variable defined at (6) is also an e-variable for the hypothesis
$$H_{\mathrm{seq},n} = \{P \in \mathcal{P} \mid E_P(Y_i \mid P_1, \ldots, P_i, Y_1, \ldots, Y_{i-1}) = P_i \ P\text{-almost surely}, \ i = 1, \ldots, n\}.$$
Contrary to classical theory, the sequential case is easier to treat than the iid case and has been the focus of many works employing e-values, including, for example, Waudby-Smith and Ramdas (2021); Henzi and Ziegel (2022).
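The product e-variable (6) with a predictable betting strategy can be sketched as follows; the crude running-mean strategy for $q_i$ is our own placeholder for illustration, not the isotonic strategy developed below.

```python
import numpy as np

def e_q(p, y, q):
    return (q / p) ** y * ((1 - q) / (1 - p)) ** (1 - y)

rng = np.random.default_rng(3)
n = 500
P = rng.uniform(0.1, 0.9, n)
Y = rng.binomial(1, P)   # the null holds: E(Y | P) = P

# q_i may depend on P_1..P_i and Y_1..Y_{i-1} only (a predictable betting
# strategy); here a deliberately crude one: the clipped running mean of past Y's
E, run_mean = 1.0, 0.5
for i in range(n):
    q = min(max(run_mean, 0.05), 0.95)
    E *= e_q(P[i], Y[i], q)
    run_mean = (run_mean * (i + 1) + Y[i]) / (i + 2)
```

Because the null holds and the betting strategy ignores $P_i$, the accumulated capital `E` stays (far) below one here; a strategy with growth requires betting on the correct side of $P_i$, as in (3)-(4).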
Coming back to our situation with iid data, an alternative to (6) could be the symmetrized statistic
$$E_{\mathrm{HL},n,\mathrm{sym}} = \frac{1}{n!} \sum_{\sigma \in S_n} E^{\sigma}_{\mathrm{HL},n}.$$
This strategy is essentially the merging technique for independent e-values in Section 4 of Vovk and Wang (2021), and it is the object of interest in this article.

An ideal test with power guarantees
The statistic $E_{\mathrm{HL},n,\mathrm{sym}}$ is an e-variable solely under the requirement that, for $i = 1, \ldots, n$ and all permutations $\sigma$, the probabilities $q_{\sigma(i)}$ in $E^{\sigma}_{\mathrm{HL},n}$ are a measurable function of $(P_{\sigma(j)}, Y_{\sigma(j)})$, $j = 1, \ldots, i - 1$, and of $P_{\sigma(i)}$. In the following, we write $q_{\sigma,\sigma(i)} = f_i(P_{\sigma(1)}, \ldots, P_{\sigma(i)}, Y_{\sigma(1)}, \ldots, Y_{\sigma(i-1)})$, using the same algorithm $f_i$ for constructing $q_{\sigma,\sigma(i)}$ based on $P_{\sigma(1)}, \ldots, P_{\sigma(i)}, Y_{\sigma(1)}, \ldots, Y_{\sigma(i-1)}$ for all permutations $\sigma$. The challenge is then how to choose the functions $f_1, \ldots, f_n$ such that the test has power. As argued by Grünwald et al. (2020); Shafer (2021), a suitable measure of power for e-values is the growth rate $E_Q[\log(E)]$ under an alternative distribution $Q$, so that, ideally, $E$ grows exponentially fast in the sample size if the null hypothesis is violated.
Our algorithm for choosing $f_1, \ldots, f_n$ is inspired by random permutation online isotonic regression, studied extensively by Kotlowski et al. (2017). In machine learning applications, isotonic regression is an established method for the recalibration of binary classifiers; see, e.g., Zadrozny and Elkan (2002) or Flach (2012). Recently, Dimitriadis et al. (2021) related the isotonic regression approach to reliability diagrams, which are a key diagnostic tool for evaluating probability forecasts for binary events, especially in meteorology. Our results demonstrate that isotonic regression is also suitable for constructing universal tests of calibration.
To introduce the algorithm for constructing our e-variable, let $p_1, \ldots, p_i \in [0,1]$ be probability predictions and $y_1, \ldots, y_i \in \{0,1\}$ be the corresponding outcomes. Then the isotonic regression of $y_1, \ldots, y_i$ on $p_1, \ldots, p_i$ can be described as the maximizer of
$$\frac{1}{i} \sum_{j=1}^{i} \log\left( \frac{g_j^{y_j} (1 - g_j)^{1 - y_j}}{p_j^{y_j} (1 - p_j)^{1 - y_j}} \right) \qquad (7)$$
over all $g_1, \ldots, g_i$ such that $g_k \le g_l$ if $p_k \le p_l$. Notice that the quantity in (7) is simply a normalized version of the logarithmic score, and the maximizer does not depend on the fact that we normalize by $p_j^{y_j}(1 - p_j)^{1 - y_j}$. Moreover, notice that up to rescaling by $1/i$, this criterion also equals the sample version of $E_Q[\log(E)]$ when the e-variable $E$ is the likelihood ratio between the probabilities $g_j$ and $p_j$. A unique maximizer exists -- unique since we exclude the cases $p_j = 0$ and $y_j = 1$, or $p_j = 1$ and $y_j = 0$, for some $j$ -- and can be computed efficiently with the PAV algorithm (Ayer et al., 1955). This estimator only defines a recalibrated version of $p_1, \ldots, p_i$, and a method is required to define the regression at a $p_{i+1} \in [0,1]$ not contained in the sample. To obtain out-of-sample predictions with small regret in terms of log-loss, we rely on a strategy originally proposed by Vovk et al. (2015) and applied by Kotlowski et al. (2017) to derive regret bounds for isotonic regression in an online setting. The out-of-sample value at $p_{i+1}$ is defined as
$$q_{i+1} = \frac{g_{i+1,1}}{g_{i+1,1} + 1 - g_{i+1,0}}, \qquad (8)$$
where $g_{i+1,1}$ and $g_{i+1,0}$ are the $(i+1)$-th components of the isotonic regressions of $p_1, \ldots, p_i, p_{i+1}$ with observations $y_1, \ldots, y_i, 1$ or $y_1, \ldots, y_i, 0$, respectively. That is, to define the isotonic regression at the unseen $p_{i+1}$, we fit two isotonic regressions in which we include $p_{i+1}$ in the sample with artificial observations of 1 and of 0, respectively, and take the ratio (8) as the recalibrated probability.
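The two augmented fits behind (8) can be sketched with a hand-rolled PAV algorithm; this is our own minimal implementation, assuming the reading of (8) as the ratio $g_{i+1,1} / (g_{i+1,1} + 1 - g_{i+1,0})$, and not the replication code.

```python
import numpy as np

def pav(y):
    """Pool-adjacent-violators: isotonic fit to y, which must already be
    ordered by the predictions p; pools blocks while means decrease."""
    sums, cnts = [], []
    for v in y:
        sums.append(float(v)); cnts.append(1)
        while len(sums) > 1 and sums[-2] / cnts[-2] > sums[-1] / cnts[-1]:
            sums[-2] += sums[-1]; cnts[-2] += cnts[-1]
            sums.pop(); cnts.pop()
    return np.array([s / c for s, c in zip(sums, cnts) for _ in range(c)])

def isotonic_out_of_sample(p_train, y_train, p_new):
    """Out-of-sample value at p_new via the two augmented fits in (8):
    refit with artificial label 1 (giving g1) and with label 0 (giving g0)
    at p_new, and combine as g1 / (g1 + 1 - g0)."""
    order = np.argsort(np.append(p_train, p_new), kind="stable")
    pos = int(np.where(order == len(p_train))[0][0])  # rank of p_new
    g1 = pav(np.append(y_train, 1)[order])[pos]
    g0 = pav(np.append(y_train, 0)[order])[pos]
    return g1 / (g1 + 1 - g0)

p_tr = np.array([0.1, 0.2, 0.3, 0.4])
y_tr = np.array([0, 0, 1, 1])
q = isotonic_out_of_sample(p_tr, y_tr, 0.25)   # between the 0- and 1-regions
```

At $p_{\text{new}} = 0.25$, the augmented fits give $g_1 = 1$ and $g_0 = 0$, so the combined prediction is $1/(1 + 1 - 0) = 0.5$, a sensible value between the two regimes.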
The definition (8) is extended to the case $i = 0$ by setting $g_{1,1} = g_{1,0} = 0.5$. The workflow to construct $E_{\mathrm{HL},n,\mathrm{sym}}$ is then described in Algorithm 1.
Then, (i) there exists a $Q$-almost-surely unique maximizer $\pi^* \in F_{\uparrow,[0,1]}$ of $R_Q$; (ii) for a version of $\pi^*$ from part (i), letting $D(Q)$ denote the associated growth rate, we have $D(Q) \ge 0$, with equality if and only if $Q \in H_{\mathrm{HL},1}$; (iii) the e-value $E_{\mathrm{HL},n,\mathrm{sym}}$ constructed with Algorithm 1 satisfies, for a universal constant $C > 0$, a lower bound under which $E_Q[\log(E_{\mathrm{HL},n,\mathrm{sym}})]$ diverges whenever $D(Q) > 0$. The integrability assumption (9) is solely required to prove parts (i) and (ii) of the theorem, and the lower bound on $E_{\mathrm{HL},n,\mathrm{sym}}$ and on the expectation of its logarithm in fact holds for any $\pi \in F_{\uparrow,[0,1]}$. However, part (iii) only becomes useful in conjunction with (i) and (ii): the fact that $D(Q) \ge 0$, with equality if and only if $Q \in H_{\mathrm{HL},1}$, implies that the test has positive growth rate for all alternative distributions $Q$ if $n$ is large enough. This is a surprising result, since it might seem that restricting our estimator of $E_Q[Y \mid P]$ to isotonic functions of $P$ implies some restriction on the class of alternatives against which the test has power -- which is not the case.
Part (iii) of Theorem 2.1 only gives a diverging lower bound on the expected value of $\log(E_{\mathrm{HL},n,\mathrm{sym}})$. However, under assumption (9) the average growth rate satisfies the strong law of large numbers, and since $D(Q) > 0$ for $Q \notin H_{\mathrm{HL},1}$, this implies that $E_{\mathrm{HL},n,\mathrm{sym}} \to \infty$ almost surely as $n \to \infty$. In particular, for any Type-I error $\alpha$ and desired power $1 - \beta$, there exists a sample size $N$ such that the rejection probability is at least $1 - \beta$ for all $n \ge N$.

For the proof of part (i), by Helly's selection theorem there exists a subsequence $(\pi_{n_k})_{k \in \mathbb{N}}$ converging pointwise to some limit.

and equality holds if and only if
The non-negativity in part (ii) holds because $F_{\uparrow,[0,1]}$ contains the identity function, and we only have to prove that $D(Q) > 0$ if $Q \notin H_{\mathrm{HL},1}$. For this, it is sufficient to show that there exists some $\pi$ with $R_Q(\pi) > 0$. We start with some results about the existence of certain expected values. Since $Y \mid P$ is Bernoulli with expectation $\pi(P)$, the required expectations exist. Let now $\bar\pi(P)$ be a version of the conditional expectation of $\pi(P)$ with respect to the $\sigma$-lattice generated by $P$, which satisfies properties (11)-(13), where (13) states that $E_Q[\bar\pi(P)\, 1_B] = E_Q[\pi(P)\, 1_B]$ for all $B$ in the $\sigma$-field generated by $\bar\pi$. Equation (11) holds by definition of the conditional expectation given a $\sigma$-lattice, and (12) and (13) are by Brunk (1965, Equations (3.9) and (3.11)). By (13), we have $\bar\pi(P) = P$ almost surely if and only if $\pi(P) = P$ almost surely. By definition of $\pi^*(P)$, we know that $R_Q(\pi^*) \ge R_Q(\bar\pi)$. The goal is now to prove that $R_Q(\bar\pi) \ge 0$, with equality if and only if $\bar\pi(P) = P$ $Q$-almost-surely. Notice that $\bar\pi$ is only defined on the support of $P$, but one can assume without loss of generality that it is defined on the whole interval $[0,1]$ by right-continuous constant extrapolation in parts where it is not defined. Since $\bar\pi$ is increasing, there exist at most countably many disjoint intervals $A_i \subseteq [0,1]$, $i \in I$, on which $\bar\pi$ is constant with some value $c_i \in [0,1]$. Furthermore, there are at most countably many disjoint intervals $B_j$, $j \in J$, whose union equals $[0,1] \setminus \bigcup_{i \in I} A_i$. We now make a few case distinctions. Fix $i$ with $c_i > 0$ and assume that $A_i = [a_i, b_i]$; the following arguments can be easily modified for the case that $A_i$ is (half-)open. Define the function $h_i$ accordingly. Then $h_i(P)$ is square integrable due to (10), and $h_i$ is decreasing and constant outside of $[a_i, b_i]$, so that $0 \le E_Q[(\pi(P) - \bar\pi(P)) h_i(P)]$, where the last step is due to the fact that $\bar\pi(P) = c_i$ for $P \in A_i$. Hence the desired inequality follows on $A_i$. If $c_i = 0$, the above inequality is still true because (13) implies that $\pi(P) = \bar\pi(P) = 0$ for $P \in A_i$ in that case, and we define $0 \log(0) := 0$.
Similarly, for $c_i < 1$ we define $h_i$ analogously, which is square integrable and increasing. As before, $0 \le E_Q[(\pi(P) - \bar\pi(P)) h_i(P)]$, so we obtain the corresponding inequality, which also holds if $\bar\pi(P) = 1$ on $A_i$. Hence we have shown the claimed inequality, where equality holds if and only if $\bar\pi(P) = P$ $Q$-almost-surely on $A_i$, since the Kullback-Leibler divergence is non-negative.
For part (iii), the inequality of arithmetic and geometric means implies that
$$E_{\mathrm{HL},n,\mathrm{sym}} = \frac{1}{n!} \sum_{\sigma \in S_n} E^{\sigma}_{\mathrm{HL},n} \ \ge\ \exp\Big( \frac{1}{n!} \sum_{\sigma \in S_n} \log E^{\sigma}_{\mathrm{HL},n} \Big).$$
The term inside the exponential can be written as the negative of the entropic loss defined in Section 4.4 of Kotlowski et al. (2017). It follows from Kotlowski et al. (2017) that for all $K \in \mathbb{N}$,

A feasible version of the test
The ideal test described in Algorithm 1 cannot be implemented for practically relevant $n$, as it requires computing e-values over all $n!$ permutations of $\{1, \ldots, n\}$. Even for a single permutation $\sigma$, the inner loop in Algorithm 1 has computational complexity $O(n^2)$: it requires computing $2n$ isotonic regressions to generate out-of-sample predictions. We address these problems with the simplified version in Algorithm 2, which can be regarded as a version of the split likelihood ratio test of Wasserman et al. (2020).
2: $E_{\mathrm{HL},n} \leftarrow 0$
3: for $b = 1, \ldots, B$ do
4: randomly select $n_s$ pairs $(Y_i, P_i)$, $i \in S_b = \{i_1, \ldots, i_{n_s}\}$, without replacement
5: estimate the isotonic regression of $(Y_i, P_i)$, $i \in S_b$, by maximizing (7)
6: generate predictions $q_i$ for $E[Y \mid P_i]$, $i \in \{1, \ldots, n\} \setminus S_b$, from the isotonic regression
7:

A delicate point in Algorithm 2 is Step 6, where one needs to generate out-of-sample predictions from the isotonic regression fit. Naive extrapolation approaches could lead to predicted probabilities $q_i \in \{0, 1\}$ and hence to an e-value of zero if either $q_i = 0$ and $Y_i = 1$, or $q_i = 1$ and $Y_i = 0$. Let $p_1 < \cdots < p_m$ denote the distinct values of $P_i$, $i \in S_b$, and $\hat\pi_1 \le \cdots \le \hat\pi_m$ the corresponding values of the isotonic regression. A well-known result about isotonic regression states that there exists a partition of $S_b$ into index sets $I_1, \ldots, I_d$ such that $\hat\pi_j$ is the empirical mean of the $Y_i$ with indices in $I_j$, i.e., $\hat\pi_j = \frac{1}{\# I_j} \sum_{i \in I_j} Y_i$.
To remedy the problem of predictions in $\{0, 1\}$, we propose applying the smoothed Laplace predictor, equivalent to the posterior mean under Jeffreys' prior in binomial proportion estimation,
$$\tilde\pi_j = \frac{\sum_{i \in I_j} Y_i + 1/2}{\# I_j + 1}.$$
For out-of-sample predictions at $P_i \notin \{p_1, \ldots, p_m\}$, one can then apply linear interpolation, and it is now guaranteed that $q_i \in (0, 1)$.
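Putting Steps 4-6 together with the Jeffreys smoothing and interpolation, a feasible eHL test might be sketched as follows; this is our own simplified reading of Algorithm 2 (the function name and the final aggregation across splits are assumptions), not the replication code.

```python
import numpy as np

def pav_blocks(y_sorted):
    """PAV on y (already ordered by p); returns block sums and block sizes."""
    sums, cnts = [], []
    for v in y_sorted:
        sums.append(float(v)); cnts.append(1)
        while len(sums) > 1 and sums[-2] / cnts[-2] > sums[-1] / cnts[-1]:
            sums[-2] += sums[-1]; cnts[-2] += cnts[-1]
            sums.pop(); cnts.pop()
    return sums, cnts

def feasible_ehl(P, Y, s=0.5, B=10, seed=0):
    """Average over B random splits of the product e-value obtained by
    betting on the holdout with a split-sample isotonic recalibration;
    Jeffreys smoothing (sum + 1/2)/(count + 1) plus linear interpolation
    keeps every q_i strictly inside (0, 1)."""
    rng = np.random.default_rng(seed)
    n, e_vals = len(P), []
    for _ in range(B):
        train = rng.choice(n, size=int(s * n), replace=False)
        test = np.setdiff1d(np.arange(n), train)
        order = np.argsort(P[train])
        p_tr, y_tr = P[train][order], Y[train][order]
        sums, cnts = pav_blocks(y_tr)
        smoothed = [(sm + 0.5) / (c + 1) for sm, c in zip(sums, cnts)]
        knots = np.repeat(smoothed, cnts)        # per-training-point values
        q = np.interp(P[test], p_tr, knots)      # interpolated q in (0, 1)
        log_e = np.sum(Y[test] * np.log(q / P[test])
                       + (1 - Y[test]) * np.log((1 - q) / (1 - P[test])))
        e_vals.append(np.exp(log_e))
    return float(np.mean(e_vals))

rng = np.random.default_rng(11)
n = 2000
P = rng.uniform(0.1, 0.8, n)
e_null = feasible_ehl(P, rng.binomial(1, P))                          # calibrated
e_alt = feasible_ehl(P, rng.binomial(1, np.clip(P + 0.15, 0, 0.95)))  # shifted
```

For a systematically shifted conditional probability, the accumulated e-value exceeds the rejection threshold of 20 by a wide margin, while it stays small for calibrated data.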

Simulations
This section evaluates the empirical performance of the feasible version of the proposed test from Section 2.5, together with sensible values of the splitting fraction $s \in (0, 1)$. We follow the simulation setup of Hosmer et al. (1997) with a quadratic misspecification for assessing HL-type tests, which more recent contributions modify only slightly, if at all (Hosmer and Hjort, 2002; Xie et al., 2008; Allison, 2014; Canary et al., 2017; Nattino et al., 2020). Replication material for the simulations and the application in Section 4, in the statistical software R, is available at https://github.com/marius-cp/eHL.
For $i = 1, \ldots, 2n$ with $n \in \{1024, 2048, 4096, 8192\}$, we simulate covariates $X_i \overset{\text{iid}}{\sim} U(-3, 3)$ and let the response variables $Y_i \sim \text{Bernoulli}(\pi_i)$ be independent, where the true conditional event probability $\pi_i$ follows a logistic transformation of a quadratic model,
$$\pi_i = \frac{1}{1 + \exp(-\beta_0 - \beta_1 X_i - \beta_2 X_i^2)}. \qquad (17)$$
We split the simulated data into an estimation set and a validation set, both of size $n$. Based on the data in the estimation set, we estimate the parameters of a linear, and hence misspecified, logistic regression model by maximum likelihood and denote the parameter estimates by $\hat\beta_0, \hat\beta_1$. The probability of a positive outcome is then predicted by
$$\hat P_i = \frac{1}{1 + \exp(-\hat\beta_0 - \hat\beta_1 X_i)}. \qquad (18)$$
We vary the severity of the misspecification, expressed through the magnitude of $\beta_2$. Following Hosmer et al. (1997), we characterize the "lack of linearity" through the conditions $\pi(-3) = j + 0.00733745$, $\pi(-1.5) = 0.05$ and $\pi(3) = 0.95$, such that the value $j = 0$ results in the very accurate approximation $\beta_2 \approx 0$, i.e., a linear effect of $X_i$ on the log odds-ratio. We consider a sequence of 51 equally spaced values of $j$ in the interval $[0, 0.1]$. Notice that for each choice of $j$, the values of $\beta_0$ and $\beta_1$ are also determined by these conditions.
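The data-generating process and the misspecified fit can be sketched as follows; the coefficient values are illustrative only, not those implied by Hosmer et al.'s conditions, and the Newton solver is our own minimal stand-in for standard maximum-likelihood fitting.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4096

# true DGP: logit(pi) = b0 + b1*x + b2*x^2; the fitted linear-logit model
# omits the quadratic term and is therefore misspecified whenever b2 != 0
b0, b1, b2 = -0.5, 1.0, 0.3          # illustrative values only
x = rng.uniform(-3.0, 3.0, 2 * n)
pi = 1.0 / (1.0 + np.exp(-(b0 + b1 * x + b2 * x ** 2)))
y = rng.binomial(1, pi)

# equal-sized estimation and validation sets
xe, ye, xv = x[:n], y[:n], x[n:]

def fit_linear_logit(x, y, iters=25):
    """Newton-Raphson ML fit of the linear logistic model sigmoid(a + b*x)."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        W = mu * (1.0 - mu)
        beta = beta + np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - mu))
    return beta

beta_hat = fit_linear_logit(xe, ye)
p_hat = 1.0 / (1.0 + np.exp(-(beta_hat[0] + beta_hat[1] * xv)))  # misspecified predictions
```

The validation-set predictions `p_hat` then serve as the forecasts whose calibration the HL and eHL tests assess.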
Table 1: Rejection rates in percentage points of the classical HL test and our eHL test under the null hypothesis with $j = 0$ and the true regression parameters $(\beta_0, \beta_1)$ in (18), at a significance level of 5%. We treat an e-value above 20 as a rejection in the eHL test.

Table 1 reports rejection rates of the tests over 1000 simulation replications, where we set $\beta_2 = 0$ (i.e., $j = 0$) and use the true regression parameters $(\beta_0, \beta_1)$ in (18) to guarantee that the null hypothesis $H_{\mathrm{HL},n}$ holds. For the classical HL test, we use ten equally populated (quantile-spaced) bins, where the exact procedure follows the method $Q_R$ described in Appendix A. For the feasible eHL test of Section 2.5, we use the splitting fractions $s \in \{1/3, 1/2, 2/3\}$. To limit computation time, we choose a relatively small number of bootstrap replications, $B = 10$, in the eHL test, as we are mainly interested in rejection rates averaged across simulation replications, and hence stability of the test is less of a concern than, e.g., in the subsequent empirical application.
Here and in the following, we treat e-values above 20 as a test rejection at the 5% significance level. The table shows that all tests are well sized, with all eHL versions exhibiting rejection frequencies much below the nominal value of 5%, which is not unusual for tests based on e-values.
Figure 1 analyzes the tests' behaviour under the alternative hypotheses induced by $j > 0$. Notice that we use the true parameters $(\beta_0, \beta_1)$ in (18) for $j = 0$, but estimates $\hat\beta_0, \hat\beta_1$ for any $j > 0$, as the pseudo-true parameters are unknown under model misspecification. In this analysis, we further include an oracle version of the eHL test, whose e-values are optimal in the sense that they are based on $q_i = \pi_i$, i.e., the practically unknown true conditional event probabilities. The oracle eHL version with $s = 1/2$ facilitates a fair comparison with the feasible eHL test based on the same splitting factor. The left panel of the figure shows classical test rejection rates for a nominal significance level of 5%. Following the explanations in Section 2.4 together with Grünwald et al. (2020) and Shafer (2021), a suitable measure of power for e-values is the growth rate $E[\log(E_{\mathrm{HL},n})]$, which is shown in the right panel of Figure 1, where we approximate the expectation by the average log e-value over the simulation replications. We restrict attention to $n \in \{1024, 4096\}$, as the other sample sizes do not yield further insights.
We find that all tests develop power with increasing misspecification $j$. For example, the feasible eHL versions already have substantial power, for both sample sizes, against an alternative with $j \approx 0.043$, which Hosmer et al. (1997) interpret as a value inducing only 'slight' misspecification. (Notice that $j \approx 0.043$ equals 0.05 in their parametrization.) There is little difference among the feasible eHL tests when using different splitting fractions $s$, and hence we do not find arguments to deviate from the natural choice of $s = 1/2$, which we continue to use in the application.
The higher power of the classical HL test can be explained by the sample split required in the eHL test, and by the estimation error in assigning suitable values for $q_i$. The two oracle eHL tests make these steps redundant and hence achieve power comparable to the classical HL test. Perhaps surprisingly, the difference between the two oracle eHL tests with different $s$ is smaller than their respective difference to the feasible test versions based on estimated $q_i$'s, which means that tuning the test to a specific alternative through the $q_i$'s is the main empirical challenge of the eHL test.
Notice that the often overlooked bin specification in the classical HL test implicitly determines the set of alternatives the test has power against, as illustrated, e.g., in Dimitriadis et al. (2022, Section 5). As the sample split in the eHL test allows for estimating a suitable alternative, Theorem 2.1 shows that the (ideal version of the) eHL test has power against all alternatives. Turning to the growth rates of the feasible eHL tests, we find that larger choices of $s$ perform better for slight model misspecifications (small $j$), while the opposite is true for large misspecifications. This can be explained as follows: as discussed around (3)-(4), $q_i$ must be on the 'correct' side of $P_i$ to gain power, which might be violated for small $s$ (and $n$) under slight misspecifications.

Application: Credit Card Defaults in Taiwan
In this application, we analyze the (re-)calibration of probability predictions for the binary event of credit card defaults in Taiwan in the year 2005. In that time period, banks in Taiwan over-issued credit cards, also to unqualified applicants, who at the same time overused the cards for consumption, resulting in severe credit card debts and damaged consumer finance confidence (Yeh and Lien, 2009; Lo and Harvey, 2011). This crisis calls for improved, and in particular calibrated, probability predictions for credit card defaults that can be used for thorough risk management and improved financial regulation.
For our analysis, we use a data set of $m = 30\,000$ credit card holders from Taiwan in 2005 that is publicly available from the UCI Machine Learning Repository (Dua and Graff, 2019; Yeh and Lien, 2009) at https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients. We randomly split the data into an estimation set and a recalibration set $R$ with $M = 12\,000$ observations each, and a validation set $V$ containing the remaining $n = 6\,000$ observations. We use the estimation set to fit a standard logistic regression model based on all predictor variables by maximum likelihood, and compute the model predictions on the recalibration and validation sets, respectively. We run all of the following tests on the validation set.
Table 2 reports the e-values of the feasible version of our calibration test described in Section 2.5, based on $B = 10\,000$ bootstrap replications and with a splitting factor of $s = 1/2$, which is motivated by our simulation results. We further report the range between the smallest and largest p-value of the classical HL test, where the different p-values result from five different, but natural, binning procedures using $g = 5, \ldots, 20$ bins, respectively. We provide further details on these implementation choices in Appendix A.
The predictions from the logistic model result in an e-value far beyond the value of 20 in Table 2, implying that these predictions are clearly miscalibrated. In this setting, all implementation choices of the classical HL test agree and deliver p-values very close to zero. The second row of the table shows that even when using all observations in the "increased estimation set" comprising the estimation and the recalibration set, the situation barely changes, and both the eHL and HL tests agree (under all implementation choices).
As a consequence of these clear rejections, we now aim at isotonically recalibrating the probability predictions, a technique that has proved valuable in other disciplines (Guo et al., 2017; Vannitsem et al., 2018), where it is also called "post-processing". For this, we estimate an isotonic regression on the recalibration set $R$ and generate recalibrated predictions by transforming the logistic predictions on $V$ with the estimated isotonic regression function. Table 2 shows that our calibration test yields an e-value just above 20, i.e., a weak rejection when interpreted as a (conservative) test at the 5% level.
As a nonparametric method, isotonic regression is known to involve substantial estimation noise that might adversely affect the recalibrated predictions. Hence, we stabilize the estimation through the classical bagging (bootstrap aggregation) method of Breiman (1996). In detail, we draw $B = 100$ bootstrap samples $R_b$, $b = 1, \ldots, B$, of size $M$ from the recalibration set $R$ and estimate the isotonic regression on each bootstrap sample $R_b$. The final predictions are obtained by recalibrating with the average of the estimated isotonic regression functions, displayed in Figure 2.
The last row of Table 2 shows an e-value of approximately 6, implying only very weak evidence against the null hypothesis of calibration, once again illustrating the practical strength of both bagging and recalibration methods. The estimated recalibration function displayed in Figure 2 substantially deviates from the diagonal.
For these two recalibration methods, the various natural implementation choices of the HL test, further described in Appendix A, result in p-values ranging between essentially 0 and 0.91 (and 0.53, respectively). The corresponding p-value histograms in Figure 3 (and the detailed results in Table 3) show the continuum of p-values, where the null hypothesis is rejected in approximately half of the cases at the 5% level, implying that a researcher can essentially tailor the test decision to her will. As already noted by Hosmer et al. (1997), Bertolini et al. (2000) and Kuss (2002), this is a disconcerting state of affairs for a commonly used testing procedure and calls for more robust alternatives, such as the eHL test proposed in this paper. Appendix A further shows that the feasible eHL test is less affected by such instabilities arising from the repeated sample splits, at least if B is chosen sufficiently large, as in this application.

Discussion
This article proposes an e-test for perfect calibration, which is a safe testing counterpart to the widely used Hosmer-Lemeshow test. The proposed eHL test follows a simple betting interpretation (see Shafer, 2021), where the e-value can be seen as the factor by which we multiply a bet against the hypothesis of perfect calibration. Intuitively, by accumulating money through the bet, we gain evidence against the null. Here, the e-value depends on the probability prediction, its corresponding realization, and an arbitrary value, which we suggest estimating in a two-step approach by isotonic regression. Furthermore, we assess the empirical power of the test to detect quadratic model misspecifications. The simulations show that in samples of more than 2000 observations, the eHL test reliably detects levels of quadratic misspecification that Hosmer et al. (1997, p. 973) describe as slight. The intrinsic flexibility of e-values allows the application of stable data-driven methods (here, isotonic regression) instead of the typical binning-and-counting technique of the HL test. However, this flexibility comes at the cost of lower power in small samples of fewer than 2000 observations.
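To make the betting interpretation concrete, one can write the per-observation betting factor as a likelihood ratio between an alternative success probability π̂_i (in our approach, an isotonic-regression estimate of E[Y|P]) and the forecast p_i. Under the null E[Y_i | P_i] = P_i, each factor has conditional expectation one, so the running product is an e-value. The following sketch treats π̂ as given; the function name is ours and the clipping constant is an implementation assumption.

```python
import numpy as np

def betting_e_value(p, y, pi_hat, eps=1e-9):
    """Product of likelihood-ratio betting factors. Valid as an e-value
    provided pi_hat does not use the outcomes it is bet against
    (e.g., it is estimated on an independent split or sequentially)."""
    p = np.clip(np.asarray(p, float), eps, 1 - eps)
    pi = np.clip(np.asarray(pi_hat, float), eps, 1 - eps)
    y = np.asarray(y)
    # bet pi/p on the event and (1 - pi)/(1 - p) on the non-event
    factors = np.where(y == 1, pi / p, (1 - pi) / (1 - p))
    return float(np.prod(factors))
```

Betting with π̂ = p yields an e-value of exactly one (no evidence either way), whereas a π̂ closer to the truth than p makes the e-value grow exponentially in the sample size; the reciprocal 1/e is a conservative p-value.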
The feasible version of our test is based on random splits of the training data. Since the null hypothesis H_HL,n requires calibration conditional on the predictions P_1, . . ., P_n, one also obtains a valid test if the splits are performed systematically based on P_1, . . ., P_n, for example, by choosing two subsets with a similar distribution of the P_i so that the isotonic regression estimator extrapolates well. Systematic sampling approaches to increase power have already been applied by Duan et al. (2022) for testing treatment effects.
Our article focuses on the batch setting, where a fixed sample of size n is available, rather than the online setting, in which (P_i, Y_i), i ∈ N, arrive sequentially. However, the fact that powerful tests based on isotonic regression can be constructed in the batch setting suggests that similar approaches may be fruitful for online testing. Kotłowski et al. (2016, Section 7.1) describe algorithms with sublinear regret for online isotonic regression (without the random permutation setting). We believe that, in conjunction with parts (i) and (ii) of our Theorem 2.1, it is possible to derive power guarantees for sequential calibration tests in which E[Y|P] is estimated sequentially with isotonic regression. We leave such extensions for future work.
Table 3: p-values of the HL test based on various binning choices described in the text for the two recalibrated prediction methods from Section 4.

These binning and implementation choices can be consequential for the test result in some instances. This is illustrated by Table 3, which reports p-values for the classical HL test based on the five binning methods discussed above, using g = 5, . . ., 20 bins, for the two recalibration methods (isotonic and bagged isotonic recalibration) used in the application in Section 4. We find that the p-values vary substantially, both across different numbers of bins and across different binning implementations. Perhaps surprisingly, even for a fixed g, the subtleties in the four quantile-based binning choices lead to widely varying p-values.
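For reference, one quantile-based binning variant of the classical HL test might be implemented as follows. This is only one of the several reasonable implementations discussed here, and the g − 2 degrees of freedom are the common in-sample convention; the function name is ours.

```python
import numpy as np
from scipy.stats import chi2

def hl_test(p, y, g=10):
    """Hosmer-Lemeshow chi-square test with quantile-based bins.
    Only one of several reasonable binning implementations."""
    p, y = np.asarray(p, float), np.asarray(y)
    # interior quantiles as bin edges; clip so every point gets a bin
    edges = np.quantile(p, np.linspace(0, 1, g + 1))[1:-1]
    bins = np.clip(np.digitize(p, edges), 0, g - 1)
    stat = 0.0
    for b in range(g):
        m = bins == b
        if not m.any():
            continue
        e1 = p[m].sum()            # expected events in bin b
        e0 = m.sum() - e1          # expected non-events
        o1 = y[m].sum()            # observed events
        o0 = m.sum() - o1          # observed non-events
        stat += (o1 - e1) ** 2 / e1 + (o0 - e0) ** 2 / e0
    return stat, chi2.sf(stat, df=g - 2)
```

Already in this sketch, several choices are visible that vary across implementations: how quantile ties are broken, whether bins are left- or right-closed, and the degrees of freedom, each of which can move the resulting p-value.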
In contrast to the classical HL test, the theoretical version of the eHL test described in Algorithm 1 is free of tuning parameters due to the use of isotonic regression. This is unfortunately not true for the feasible eHL test described in Algorithm 2, which might be sensitive to the chosen sample splits in the bootstrap-like replications. In particular, one has to choose the number of replications B large enough that the resulting e-values are not sensitive to the random numbers (i.e., the 'random seed') that determine the sample splits.
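The following stylized illustration (not Algorithm 2 itself) shows why averaging over many random splits stabilizes the result: each split yields a valid e-value, since the calibration estimate uses only the training half, and the mean of e-values is again an e-value by linearity of expectation. For simplicity, the isotonic estimator is replaced here by a crude binned estimate of E[Y|P]; all names are ours.

```python
import numpy as np

def split_e_value(p, y, rng, n_bins=10, eps=1e-6):
    """One random split: estimate E[Y|P] on one half by a binned average
    (a simplification of the isotonic step) and bet on the other half."""
    n = len(p)
    idx = rng.permutation(n)
    tr, te = idx[: n // 2], idx[n // 2 :]
    edges = np.linspace(0, 1, n_bins + 1)[1:-1]
    b_tr, b_te = np.digitize(p[tr], edges), np.digitize(p[te], edges)
    # binned estimate of the calibration curve on the training half
    pi_hat = np.array([y[tr][b_tr == b].mean() if (b_tr == b).any() else 0.5
                       for b in range(n_bins)])[b_te]
    pi = np.clip(pi_hat, eps, 1 - eps)
    pt = np.clip(p[te], eps, 1 - eps)
    # sum log betting factors for numerical stability
    log_f = np.where(y[te] == 1, np.log(pi / pt), np.log((1 - pi) / (1 - pt)))
    return np.exp(log_f.sum())

def feasible_e_value(p, y, B=50, seed=2):
    """Average the e-values of B random splits; larger B reduces the
    dependence of the result on the random seed."""
    rng = np.random.default_rng(seed)
    return np.mean([split_e_value(p, y, rng) for _ in range(B)])
```

The averaging does not affect validity for any B; it only controls how much the reported e-value fluctuates across seeds, which is the effect studied below.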
To analyse this effect in our practical data example, Figure 4 visualizes the empirical distribution of the e-values (for tests based on different random splits) for varying numbers of bootstrap replications B ∈ {100, 500, 1000, 10000}. While there is indeed some variation in the test results for smaller values of B, the e-values are relatively stable for B = 10000, the choice we employ in the empirical application. For example, for the isotonic recalibration method, essentially all e-values lie between 16 and 27, implying (conservative) p-values between 1/27 ≈ 0.037 and 1/16 = 0.0625. Similarly, the p-values for the bagged isotonic recalibration implied by the respective e-values range between 1/8 = 0.125 and 1/4 = 0.25. In contrast, the variation of the HL test p-values in Table 3 is much more substantial and includes clear test rejections as well as many p-values above any commonly chosen significance level.

Figure 1: Rejection (left) and e-value growth (right) rates for the classical HL (cHL) test, the feasible eHL test, and an oracle eHL test for a range of splitting factors s. The oracle eHL test is based on the true π_i. The x-axis shows the severity of model misspecification, and the vertically aligned plots correspond to different sample sizes.

Figure 2: Bagged isotonic recalibration curve of the logit predictions. The blue curve shows the mean, and the red band the range between the pointwise 1% and 99% quantiles, over all bagging iterations.

Figure 3: Histograms of p-values of the classical out-of-sample HL test based on the five binning procedures given in Appendix A with 5-20 bins each, resulting in a total of 80 test results.
, and the expectation E_σ[·] is taken with respect to the uniform distribution over all permutations σ of {1, . . ., n}. It follows from Lemma 2.1, Theorem 4.3, and the proof of Theorem 4.1 of Kotłowski et al. (

Table 2: E-values of the eHL test and the range of p-values of the classical HL test, the latter stemming from 80 reasonable binning procedures as detailed in Table 3 and Appendix A.

Specifically, the binary response variable Y_i ∈ {0, 1} contains information on whether a default payment, Y_i = 1, occurred for customer i = 1, . . ., m. We observe a relatively high rate of 22.12% of default payments in the data set, reflecting the above-mentioned credit card crisis. The data set further includes 23 explanatory variables, containing information on the amount of given credit, gender, education, marital status, age, and various historical payment records for the past six months.