1 Introduction
Suppose that we have observations ${({p_{i}},{y_{i}})_{i=1}^{n}}$ of independent and identically distributed (iid) random variables ${({P_{i}},{Y_{i}})_{i=1}^{n}}$ with $({P_{i}},{Y_{i}})\in [0,1]\times \{0,1\}$, $i=1,\dots ,n$. The interpretation is that ${P_{i}}$ is a prediction for the probability that ${Y_{i}}=1$. The random variables are defined on some underlying probability space $(\Omega ,\mathcal{F})$ and $\mathcal{P}$ denotes all probability measures on $(\Omega ,\mathcal{F})$. Hosmer and Lemeshow [18] propose a test for the null hypothesis of perfect calibration,
(1.1)
\[ {\mathcal{H}_{\text{HL},n}}=\{\mathbb{P}\in \mathcal{P}\hspace{0.2778em}|\hspace{0.2778em}{\mathbb{E}_{\mathbb{P}}}({Y_{i}}|{P_{i}})={P_{i}}\hspace{0.2778em}\mathbb{P}\text{-almost surely},\hspace{0.2778em}i=1,\dots ,n\}.\]
The Hosmer-Lemeshow (henceforth HL) test is based on partitioning the interval $[0,1]$ into $g\in \mathbb{N}$ bins and counting the observed numbers of events, ${o_{1k}}$, and no event occurrences, ${o_{0k}}$, in each bin $k=1,\dots ,g$. Based on that binning and counting procedure, the HL test statistic to test for perfect calibration of the probability predictions is
(1.2)
\[ \widehat{C}={\sum \limits_{k=1}^{g}}\left[\frac{{({o_{1k}}-{\widehat{e}_{1k}})^{2}}}{{\widehat{e}_{1k}}}+\frac{{({o_{0k}}-{\widehat{e}_{0k}})^{2}}}{{\widehat{e}_{0k}}}\right],\]
where ${\widehat{e}_{1k}}$ and ${\widehat{e}_{0k}}$ are the expected event and no event occurrences in bin k, respectively [19]. Under the null hypothesis, $\widehat{C}$ asymptotically follows a ${\chi ^{2}}$-distribution with g degrees of freedom given that the sample ${({P_{i}},{Y_{i}})_{i=1}^{n}}$ was not used for model estimation (and $g-2$ degrees of freedom otherwise).

Technically, the choice of the binning procedure is up to the user of the HL test and is conventionally implemented via quantile-based binning strategies with $g=10$, resulting in equally populated bins (decile-of-risk). Less commonly, the test is based on equidistantly spaced bins, where the unit interval (or the range of prediction values) is divided into g equidistant bins. While little attention is devoted to the binning procedure in practical applications, it implicitly determines the set of alternatives the test has power against [8, Section 5], such that the test result is often highly sensitive to the exact implementation of the binning; see e.g., [17, 3, 22] and our empirical application in Section 4. Nevertheless, the HL test is still the literature’s favorite for checking the calibration of binary prediction models and commonly used in current and highly influential medical and epidemiological studies; see amongst many others [26, 28, 23].
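For concreteness, the following minimal R sketch computes the statistic (1.2) with quantile-based bins; it is an illustration only and not one of the exact implementations compared in Appendix A.

hl_test <- function(p, y, g = 10, df = g) {
  # quantile-based (equally populated) bins; df = g for an external validation sample,
  # df = g - 2 if the same data were used for model estimation
  breaks <- unique(quantile(p, probs = seq(0, 1, length.out = g + 1)))
  bins <- cut(p, breaks = breaks, include.lowest = TRUE)
  o1 <- tapply(y, bins, sum)       # observed events per bin
  e1 <- tapply(p, bins, sum)       # expected events per bin
  nk <- tapply(y, bins, length)    # bin sizes
  o0 <- nk - o1                    # observed no-event occurrences
  e0 <- nk - e1                    # expected no-event occurrences
  stat <- sum((o1 - e1)^2 / e1 + (o0 - e0)^2 / e0)
  c(statistic = stat, p.value = pchisq(stat, df = df, lower.tail = FALSE))
}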
In this article, we suggest a safe and stable HL test based on e-values (that we describe below) and isotonic regression [2, 5]. The test is henceforth called eHL test. Dimitriadis et al. [9] recently propose the use of isotonic regression to resolve the closely related instability issue stemming from binning approaches in so-called reliability diagrams in forecast evaluation. While feasible inference on the isotonic regression for classical testing procedures is hampered by complicated asymptotic distributions and an inconsistency of the bootstrap, the e-values adopted here prove to be an appealing alternative in this setting. Based on online isotonic regression studied by [20], we show that (an ideal version of) our eHL test has power against essentially all deviations from calibration, which makes it theoretically superior to the classical HL test.
E-values, where ‘e’ abbreviates the word ‘expectation’, were proposed recently as an alternative to p-values in testing problems. In a nutshell, an e-value is a realization of a non-negative random variable whose expected value is at most one under a given null hypothesis. This already signals that an e-value itself allows for meaningful interpretations since an e-value greater than one provides evidence against the null hypothesis. Additionally, the multiplicative inverse of an e-value is a conservative p-value by Markov’s inequality. From a game-theoretic perspective, the e-value has a simple financial meaning in the sense that the e-value can be seen as the factor by which a skeptic multiplies her money when betting against the null hypothesis; see [32, 31].
An important advantage of e-values over p-values is their uncomplicated behavior in combinations: the arithmetic average of e-values is again an e-value, and likewise the product of independent or successive e-values; see [31, 13, 39]. In practice, this is appealing because more evidence can be added later; that is, evidence across studies can easily be combined.
The proposed eHL test offers a safe alternative to a fragile state-of-the-art approach by avoiding ad-hoc choices and software instabilities. It can be regarded as an application of the Universal Inference approach of [40]. While this method allows one to construct valid tests under only weak assumptions, it has been observed that this validity often comes at the price of a diminished power [34, 35]. In Section 2, we show that an ideal – but computationally infeasible – variant of the eHL test does have guaranteed power to detect essentially all violations of calibration. Our proof relies on connections between the proposed e-value and the regret in random permutation online isotonic regression, which is studied by [20]. It has been observed that power guarantees for anytime-valid tests can be obtained by means of regret bounds of online prediction methods; see for example the discussion in [7]. Previously, [27] and [33] exploit that connection in the cases of sequential mean and two sample testing, respectively. Our result demonstrates that such a connection also exists in the batch case of e-values for a fixed sample size n due to connections with the online random permutation setting.
In Section 3, we compare a feasible version of the eHL test to the classical HL test in a simulation study. As expected, we find that the eHL test has conservative rejection rates under the null hypothesis and quickly develops power under model misspecification. While its empirical test power is lower than that of the classical HL test, we do not consider this to be problematic as HL tests are often carried out in cases of vast data sets and are even criticized as being “too powerful” in that they reject essentially all models, even acceptably well calibrated ones [29, 25]. See [8] for an alternative solution to this problem based on confidence bands.
We apply the eHL test in Section 4 to predictions of a logistic regression model for the binary event of credit card defaults in Taiwan in 2005, where over-issuing of credit cards led to many default payments and a subsequent credit card crisis [43]. The eHL test provides clear evidence against calibration of the logistic model predictions, and further illustrates that recalibration methods work well. In contrast, the classical HL test based on different natural binning choices delivers equivocal results with p-values ranging from 0 to 0.91 for a single prediction method, implying that a researcher could have cherry-picked the binning specification and hence the test result to her will.
2 Construction of HL e-Values
2.1 Preliminaries
An e-variable for ${\mathcal{H}_{\text{HL},n}}$ is a non-negative random variable E (that is allowed to take the value $+\infty $) such that ${\mathbb{E}_{\mathbb{P}}}(E)\le 1$ for all $\mathbb{P}\in {\mathcal{H}_{\text{HL},n}}$. An e-value is a realization of an e-variable. An e-variable E always yields a valid p-variable $1/E$ (a p-value is a realized p-variable) by Markov’s inequality, since
(2.1)
\[ \mathbb{P}\Big(\frac{1}{E}\le \alpha \Big)\le \alpha {\mathbb{E}_{\mathbb{P}}}(E)\le \alpha ,\hspace{0.1667em}\hspace{2.5pt}\mathbb{P}\in {\mathcal{H}_{\text{HL},n}}.\]
We reject the null hypothesis ${\mathcal{H}_{\text{HL},n}}$ if we observe a large value of E. If we want to ensure a classical p-guarantee then we have to determine the rejection region for a given α by (2.1). Vovk and Wang [38] show that this is essentially the only way to transform an e-variable into a p-variable. We say that an e-variable has the alternative hypothesis ${\mathcal{H}^{\prime }}\subset \mathcal{P}$ if ${\mathbb{E}_{\mathbb{Q}}}(E)\gt 1$ for all $\mathbb{Q}\in {\mathcal{H}^{\prime }}$.
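In code, the conversion in (2.1) is a one-liner: an observed e-value yields a conservative p-value, and a level-α test rejects when the e-value is at least $1/\alpha $ (e.g., at least 20 at the $5\% $ level used later in the paper).

e_to_p <- function(e) pmin(1, 1 / e)                # conservative p-value via Markov's inequality
e_reject <- function(e, alpha = 0.05) e >= 1 / alpha  # reject the null at level alpha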
2.2 Sample Size One
We first construct e-variables for the sample size one Hosmer-Lemeshow null hypothesis
\[ {\mathcal{H}_{\text{HL},1}}=\{\mathbb{P}\in \mathcal{P}\hspace{0.2778em}|\hspace{0.2778em}{\mathbb{E}_{\mathbb{P}}}(Y|P)=P\}.\]
In the special case here, e-variables are likelihood ratios conditional on P. Indeed, if $q\in [0,1]$, an e-variable for ${\mathcal{H}_{\text{HL},1}}$ is given by
\[ {E_{q}}(P,Y)\hspace{-0.1667em}=\hspace{-0.1667em}\frac{{q^{Y}}{(1-q)^{1-Y}}}{{P^{Y}}{(1-P)^{1-Y}}}=\left\{\begin{array}{l@{\hskip10.0pt}l}q/P,\hspace{1em}& \text{if}\hspace{2.5pt}Y=1\text{,}\\ {} (1-q)/(1-P),\hspace{1em}& \text{if}\hspace{2.5pt}Y=0\text{.}\end{array}\right.\]
The variable ${E_{q}}(P,Y)$ is clearly non-negative, and for $\mathbb{P}\in {\mathcal{H}_{\text{HL},1}}$,
\[\begin{aligned}{}{\mathbb{E}_{\mathbb{P}}}({E_{q}}(P,Y))& ={\mathbb{E}_{\mathbb{P}}}\left({\mathbb{E}_{\mathbb{P}}}(Y\mid P)\frac{q}{P}+{\mathbb{E}_{\mathbb{P}}}(1-Y\mid P)\frac{1-q}{1-P}\right)\\ {} & ={\mathbb{E}_{\mathbb{P}}}\left(P\frac{q}{P}+(1-P)\frac{1-q}{1-P}\right)=1.\end{aligned}\]
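The following R lines implement ${E_{q}}(P,Y)$ and check by simulation that its mean is close to one under calibration; the distribution of P and the choice of q are purely illustrative.

e_q <- function(q, p, y) ifelse(y == 1, q / p, (1 - q) / (1 - p))  # single-observation e-variable

set.seed(1)
p <- runif(1e5, 0.05, 0.95)      # illustrative distribution of the predictions
y <- rbinom(1e5, 1, p)           # calibrated outcomes: E(Y | P) = P
q <- plogis(qlogis(p) + 0.5)     # some sigma(P)-measurable alternative probability
mean(e_q(q, p, y))               # approximately 1 under the null of calibration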
To find alternative hypotheses for the e-variable ${E_{q}}$, let $\bar{\pi }={\mathbb{E}_{\mathbb{Q}}}(Y\mid P)$. Then,
\[\begin{aligned}{}{\mathbb{E}_{\mathbb{Q}}}({E_{q}}(P,Y)\mid P)& =\bar{\pi }\frac{q}{P}+(1-\bar{\pi })\frac{1-q}{1-P}\end{aligned}\]
is strictly larger than one if and only if $\bar{\pi }\gt P$ and $q\gt P$, or $\bar{\pi }\lt P$ and $q\lt P$, i.e., if $\bar{\pi }$ and q lie on the same side of P. This shows that if $q\lt P$, ${E_{q}}$ has the alternative
(2.2)
\[ {\mathcal{H}^{\prime }}=\{\mathbb{Q}\in \mathcal{P}\mid {\mathbb{E}_{\mathbb{Q}}}(Y\mid P)\lt P\},\]
and if $q\gt P$, ${E_{q}}$ has the alternative
(2.3)
\[ {\mathcal{H}^{\prime\prime }}=\{\mathbb{Q}\in \mathcal{P}\mid {\mathbb{E}_{\mathbb{Q}}}(Y\mid P)\gt P\}.\]
It is possible to show that basically any e-variable for ${\mathcal{H}_{\text{HL},1}}$ is of the form $E={E_{q}}(P,Y)$ for some q (depending on P) but this requires some more arguments; it follows by the construction in [15], see also [41]. The connection of ${E_{q}}(P,Y)$ to the e-variables in [15] of type $E=1+\lambda D$ with $D\ge -1$ such that ${\mathbb{E}_{\mathbb{P}}}(D)=0$ for $\mathbb{P}\in {\mathcal{H}_{\text{HL},1}}$, follows from the fact that λ in this representation can be bijectively mapped to q. In this context,
(2.4)
\[ E=1+\lambda (P-Y)\]
is an e-variable for ${\mathcal{H}_{\text{HL},1}}$ for any λ that is $\sigma (P)$-measurable with $-(1/P)\le \lambda \le 1/(1-P)$. If $P=1$, there is no restriction on λ from above, and analogously if $P=0$, there is no restriction from below. By choosing $\lambda =(P-q)/(P(1-P))$, we obtain that $E={E_{q}}(P,Y)$.
Clearly, the e-variable ${E_{q}}(P,Y)$ may take the value infinity if either $P=0$ and $Y=1$ or $P=1$ and $Y=0$ occurs; a single observation $Y=1$ or $Y=0$ is sufficient to reject the hypothesis of calibration with certainty if the predicted probabilities are in $\{0,1\}$. For the remainder of the theoretical part of this paper, we will always make the assumption $\mathbb{P}(P\in \{0,1\})=0$ to exclude these special but uninteresting cases.
2.3 Combining e-Values in the iid Case
For testing ${\mathcal{H}_{\text{HL},n}}$, we suggest the e-variable
(2.5)
\[ {E_{\text{HL},n}^{\text{id}}}={\prod \limits_{i=1}^{n}}{E_{{q_{i}}}}({P_{i}},{Y_{i}}),\]
where ${q_{i}}$ is $\sigma ({P_{1}},\dots ,{P_{i}},{Y_{1}},\dots ,{Y_{i-1}})$-measurable. For $\mathbb{P}\in {\mathcal{H}_{\text{HL},n}}$, the expectation ${\mathbb{E}_{\mathbb{P}}}{E_{\text{HL},n}^{\text{id}}}$ equals
\[\begin{aligned}{}& {\mathbb{E}_{\mathbb{P}}}\Big({\mathbb{E}_{\mathbb{P}}}\Big({\prod \limits_{i=1}^{n}}{E_{{q_{i}}}}({P_{i}},{Y_{i}})|{P_{1}},\dots ,{P_{n}},{Y_{1}},\dots ,{Y_{n-1}}\Big)\Big)\\ {} & ={\mathbb{E}_{\mathbb{P}}}\Big({\prod \limits_{i=1}^{n-1}}{E_{{q_{i}}}}({P_{i}},{Y_{i}})\\ {} & \hspace{1em}\times {\mathbb{E}_{\mathbb{P}}}\Big({E_{{q_{n}}}}({P_{n}},{Y_{n}})|{P_{1}},\dots ,{P_{n}},{Y_{1}},\dots ,{Y_{n-1}}\Big)\Big)\\ {} & ={\mathbb{E}_{\mathbb{P}}}\Big({\prod \limits_{i=1}^{n-1}}{E_{{q_{i}}}}({P_{i}},{Y_{i}})\Big(1+\\ {} & \hspace{1em}\hspace{2em}\frac{{P_{n}}-{q_{n}}}{{P_{n}}(1-{P_{n}})}{\mathbb{E}_{\mathbb{P}}}({P_{n}}-{Y_{n}}|{P_{1}},\dots ,{P_{n}},{Y_{1}},\dots ,{Y_{n-1}})\Big)\Big)\\ {} & ={\mathbb{E}_{\mathbb{P}}}\Big({\prod \limits_{i=1}^{n-1}}{E_{{q_{i}}}}({P_{i}},{Y_{i}})\Big(1+\frac{{P_{n}}-{q_{n}}}{{P_{n}}(1-{P_{n}})}{\mathbb{E}_{\mathbb{P}}}({P_{n}}-{Y_{n}}|{P_{n}})\Big)\Big)\\ {} & ={\mathbb{E}_{\mathbb{P}}}\Big({\prod \limits_{i=1}^{n-1}}{E_{{q_{i}}}}({P_{i}},{Y_{i}})\Big)={\mathbb{E}_{\mathbb{P}}}{E_{\text{HL},n-1}^{\text{id}}}=\cdots =1,\end{aligned}\]
where we used the equivalent representation of ${E_{q}}(P,Y)$ in (2.4). In particular, from the above derivation it is easy to see that ${({E_{\text{HL},n}^{\text{id}}})_{n\in \mathbb{N}}}$ is a test martingale.

The e-variable ${E_{\text{HL},n}^{\text{id}}}$ depends on the ordering of ${({P_{i}},{Y_{i}})_{i=1}^{n}}$ through the choice of ${q_{i}}$. Let ${S_{n}}$ denote all permutations of $\{1,\dots ,n\}$, and for $\sigma \in {S_{n}}$ define ${E_{\text{HL},n}^{\sigma }}$ as ${E_{\text{HL},n}^{\text{id}}}$ for the random variables ${({P_{\sigma (i)}},{Y_{\sigma (i)}})_{i=1}^{n}}$ instead of ${({P_{i}},{Y_{i}})_{i=1}^{n}}$. Generally,
\[ \underset{\sigma \in {S_{n}}}{\max }{E_{\text{HL},n}^{\sigma }}\]
is not an e-variable for ${\mathcal{H}_{\text{HL},n}}$, so one would guess that there are opportunities to fish for (spurious) significance by choosing some specific ordering of a sample of observations ${({p_{i}},{y_{i}})_{i=1}^{n}}$. In constructing a ‘safe’ HL test, we are particularly focused on avoiding such instabilities through order-dependencies. Nevertheless, order-dependent strategies might be considered if they preclude possibilities to fish for significance.
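As a toy illustration of this order dependence, the sketch below evaluates ${E_{\text{HL},n}^{\text{id}}}$ with a simple predictable rule for ${q_{i}}$ (a Laplace-smoothed running event frequency, chosen here only for illustration); reversing the ordering of the same data generally yields a different e-value.

e_hl_id <- function(p, y, q_rule) {
  # product e-value along the given ordering; q_i may use P_1,...,P_i and Y_1,...,Y_{i-1}
  e <- 1
  for (i in seq_along(y)) {
    q <- q_rule(p[seq_len(i)], y[seq_len(i - 1)])
    e <- e * if (y[i] == 1) q / p[i] else (1 - q) / (1 - p[i])
  }
  e
}
run_freq <- function(p_seen, y_past) (sum(y_past) + 0.5) / (length(y_past) + 1)

set.seed(2)
p <- runif(50, 0.1, 0.9)
y <- rbinom(50, 1, plogis(qlogis(p) + 1))                 # miscalibrated outcomes
c(e_hl_id(p, y, run_freq), e_hl_id(rev(p), rev(y), run_freq))  # two orderings, two e-values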
In contrast, if there is a natural ordering of the observations such as a time stamp then the problem usually does not occur in applications since a different ordering of the observations is hard to justify. Indeed, when the observations are sequential (and possibly dependent), the e-variable defined at (2.5) is also an e-variable for the hypothesis
\[\begin{aligned}{}{\mathcal{H}_{\text{HL},n,seq}}& =\{\mathbb{P}\in \mathcal{P}\hspace{0.2778em}|\hspace{0.2778em}{\mathbb{E}_{\mathbb{P}}}({Y_{i}}|{P_{1}},\dots ,{P_{i}},{Y_{1}},\dots ,{Y_{i-1}})={P_{i}}\hspace{0.2778em}\\ {} & \hspace{2em}\hspace{2em}\mathbb{P}\text{-almost surely},\hspace{0.2778em}i=1,\dots ,n\}.\end{aligned}\]
Contrary to classical theory, the sequential case is easier to treat than the iid case and has been the focus of many works employing e-values including for example [41, 15].

Coming back to our situation with iid data, an alternative to (2.5) could be
\[ {E_{\text{HL},n,\text{sym}}}=\frac{1}{n!}\sum \limits_{\sigma \in {S_{n}}}{E_{\text{HL},n}^{\sigma }}.\]
This strategy is essentially the merging technique for independent e-values in Section 4 of [38], and the object of interest in this article.

2.4 An Ideal Test with Power Guarantees
The statistic ${E_{\text{HL},n,\text{sym}}}$ is an e-variable solely under the requirement that for $i=1,\dots ,n$ and all permutations σ, the probabilities ${q_{\sigma (i)}}$ in ${E_{\text{HL},n}^{\sigma }}$ are a measurable function of $({P_{\sigma (j)}},{Y_{\sigma (j)}})$, $j=1,\dots ,i-1$, and of ${P_{\sigma (i)}}$. In the following, we write
\[ {q_{\sigma ,\sigma (i)}}={f_{i}}({P_{\sigma (1)}},\dots ,{P_{\sigma (i)}},{Y_{\sigma (1)}},\dots ,{Y_{\sigma (i-1)}}),\]
using the same algorithm ${f_{i}}$ for constructing ${q_{\sigma ,\sigma (i)}}$ based on ${P_{\sigma (1)}},\dots ,{P_{\sigma (i)}},{Y_{\sigma (1)}},\dots ,{Y_{\sigma (i-1)}}$ for all permutations σ. The challenge is then how to choose the functions ${f_{1}},\dots ,{f_{n}}$ such that the test has power. As argued by [13, 31], a suitable measure of power for e-values is the growth rate ${\mathbb{E}_{\mathbb{Q}}}[\log (E)]$ under an alternative distribution $\mathbb{Q}$, so that ideally, E grows exponentially fast in the sample size if the null hypothesis is violated.

Our algorithm for choosing ${f_{1}},\dots ,{f_{n}}$ is inspired by permutation online isotonic regression, studied extensively by [20]. In machine learning applications, isotonic regression is an established method for the recalibration of binary classifiers; see e.g. [44] or [12]. Recently, [9] related the isotonic regression approach to reliability diagrams, which are a key diagnostic tool in evaluating probability forecasts for binary events, especially in meteorology. Our results demonstrate that isotonic regression is also suitable for constructing universal tests of calibration.
To introduce the algorithm for constructing our e-variable, let ${p_{1}},\dots ,{p_{i}}\in [0,1]$ be probability predictions and ${y_{1}},\dots ,{y_{i}}\in \{0,1\}$ be the corresponding outcomes. Then the isotonic regression of ${y_{1}},\dots ,{y_{i}}$ on ${p_{1}},\dots ,{p_{i}}$ can be described as the maximizer of
(2.6)
\[ \begin{aligned}{}{\hat{R}_{n}}({g_{1}},\dots ,{g_{i}})& =\hat{R}({g_{1}},\dots ,{g_{i}};\hspace{0.1667em}{p_{1}},\dots ,{p_{i}},{y_{1}},\dots ,{y_{i}})\\ {} & ={\sum \limits_{j=1}^{i}}\log \hspace{0.1667em}\left({\left(\frac{{g_{j}}}{{p_{j}}}\right)^{{y_{j}}}}{\left(\frac{1-{g_{j}}}{1-{p_{j}}}\right)^{1-{y_{j}}}}\right),\end{aligned}\]
over all ${g_{1}},\dots ,{g_{i}}$ such that ${g_{k}}\le {g_{l}}$ if ${p_{k}}\le {p_{l}}$. Notice that the quantity in (2.6) is simply a normalized version of the logarithmic score, and the maximizer does not depend on the fact that we normalize by ${p_{j}^{{y_{j}}}}{(1-{p_{j}})^{1-{y_{j}}}}$. Moreover notice that up to rescaling by $1/i$, this criterion also equals the sample version of ${\mathbb{E}_{\mathbb{Q}}}[\log (E)]$ when the e-variable E is the likelihood ratio between the probabilities ${g_{j}}$ and ${p_{j}}$. A unique maximizer exists (unique since we exclude the cases ${p_{j}}=0$ and ${y_{j}}=1$ or ${p_{j}}=1$ and ${y_{j}}=0$ for some j) and can be computed efficiently with the PAV-Algorithm [2]. This estimator only defines a recalibrated version of ${p_{1}},\dots ,{p_{i}}$, and a method is required to define the regression at a ${p_{i+1}}\in [0,1]$ not contained in the sample. To obtain out-of-sample predictions with small regret in terms of log-loss, we rely on a strategy of [30] that was adapted to isotonic regression by [37], and applied by [20] to derive regret bounds for isotonic regression in an online setting. The out-of-sample value at ${p_{i+1}}$ is defined as follows,
(2.7)
\[ {f_{i+1}}({p_{1}},\dots ,{p_{i}},{p_{i+1}},{y_{1}},\dots ,{y_{i}})=\frac{{g_{i+1,1}}}{{g_{i+1,1}}+1-{g_{i+1,0}}},\]
where ${g_{i+1,1}}$ and ${g_{i+1,0}}$ are the $(i+1)$-th component of the isotonic regression of ${p_{1}},\dots ,{p_{i}},{p_{i+1}}$ with observations ${y_{1}},\dots ,{y_{i}},1$ or ${y_{1}},\dots ,{y_{i}},0$, respectively. That is, to define the isotonic regression at the unseen ${p_{i+1}}$, we fit two isotonic regressions in which we include ${p_{i+1}}$ in the sample with artificial observations of 1 and of 0 respectively, and take the ratio (2.7) as recalibrated probability. The definition (2.7) is extended to the case $i=0$ by setting ${g_{1,1}}={g_{1,0}}=0.5$. The workflow to construct ${E_{\text{HL},n,\text{sym}}}$ is then described in Algorithm 1.
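A minimal R sketch of the prediction rule (2.7): the value at a new ${p_{i+1}}$ is obtained from two isotonic fits in which ${p_{i+1}}$ is appended with the artificial labels 1 and 0, respectively (ties among the predictions are ignored here for simplicity).

iso_fit_at <- function(p_aug, y_aug, pos) {
  # fitted isotonic value at position 'pos' of the augmented sample (p_aug, y_aug)
  ord <- order(p_aug)
  fit <- isoreg(p_aug[ord], y_aug[ord])
  fit$yf[which(ord == pos)]
}
iso_predict_new <- function(p, y, p_new) {
  if (length(p) == 0) return(0.5)                          # convention g_{1,1} = g_{1,0} = 0.5
  g1 <- iso_fit_at(c(p, p_new), c(y, 1), length(p) + 1)    # artificial outcome 1 at p_new
  g0 <- iso_fit_at(c(p, p_new), c(y, 0), length(p) + 1)    # artificial outcome 0 at p_new
  g1 / (g1 + 1 - g0)                                       # the ratio in (2.7)
}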
To state a result about the power of ${E_{\text{HL},n,\text{sym}}}$, we need a population version of the isotonic regression estimator. For a function $\pi :[0,1]\to [0,1]$, let
\[ {R_{\mathbb{Q}}}(\pi )={\mathbb{E}_{\mathbb{Q}}}\left[\log \left({(\pi (P)/P)^{Y}}{((1-\pi (P))/(1-P))^{1-Y}}\right)\right]\]
if this expectation exists. Let ${\mathcal{F}_{\uparrow ,[0,1]}}$ be the set of nondecreasing functions $\pi :[0,1]\to [0,1]$. If $\mathbb{Q}$ is the empirical distribution of $({P_{1}},{Y_{1}}),\dots ,({P_{n}},{Y_{n}})$, then it is easy to see that ${R_{\mathbb{Q}}}$ coincides with the target function $\hat{R}$ of the usual isotonic regression in finite samples. With these definitions, we can state the following result about the power of ${E_{\text{HL},n,\text{sym}}}$.

Theorem 2.1.
Let $({P_{1}},{Y_{1}}),\dots ,({P_{n}},{Y_{n}}),(P,Y)$ be iid with distribution $\mathbb{Q}$ such that
(2.8)
\[ {\mathbb{E}_{\mathbb{Q}}}[\log {(P)^{2}}]\lt \infty \hspace{1em}\text{and}\hspace{1em}{\mathbb{E}_{\mathbb{Q}}}[\log {(1-P)^{2}}]\lt \infty .\]
Then,
-
(i) there exists a $\mathbb{Q}$-almost-surely unique maximizer ${\pi ^{\ast }}\in {\mathcal{F}_{\uparrow ,[0,1]}}$ of ${R_{\mathbb{Q}}}$;
-
(ii) for a version of ${\pi ^{\ast }}$ from part (i), let\[\begin{aligned}{}D(\mathbb{Q})& ={R_{\mathbb{Q}}}({\pi ^{\ast }})\\ {} & ={\mathbb{E}_{\mathbb{Q}}}[\log {({\pi ^{\ast }}(P)/P)^{Y}}{((1-{\pi ^{\ast }}(P))/(1-P))^{1-Y}}];\end{aligned}\]then $D(\mathbb{Q})\ge 0$, with equality if and only if $\mathbb{Q}\in {\mathcal{H}_{\mathrm{HL},1}}$;
-
(iii) the e-value ${E_{\mathrm{HL},n,\mathrm{sym}}}$ from Algorithm 1 satisfies\[\begin{aligned}{}& {E_{\mathrm{HL},n,\mathrm{sym}}}\\ {} & \hspace{1em}\ge \exp \Bigg({\sum \limits_{i=1}^{n}}\log \hspace{0.1667em}{\left(\frac{{\pi ^{\ast }}({P_{i}})}{{P_{i}}}\right)^{{Y_{i}}}}{\left(\frac{1-{\pi ^{\ast }}({P_{i}})}{1-{P_{i}}}\right)^{1-{Y_{i}}}}\\ {} & \hspace{2em}-C\sqrt{n\log {(n)^{2}}}\Bigg)\end{aligned}\]for a universal constant $C\gt 0$, and hence
\[ {\mathbb{E}_{\mathbb{Q}}}[\log ({E_{\mathrm{HL},n,\mathrm{sym}}})]\ge nD(\mathbb{Q})-C\sqrt{n\log {(n)^{2}}}.\]
The integrability assumption (2.8) is solely required to prove parts (i) and (ii) of the theorem, and the lower bound on ${E_{\text{HL},n,\text{sym}}}$ and the expectation of its logarithm in fact hold for any $\pi \in {\mathcal{F}_{\uparrow ,[0,1]}}$. However, part (iii) only becomes useful in conjunction with (i) and (ii): the fact that $D(\mathbb{Q})\ge 0$ with equality if and only if $\mathbb{Q}\in {\mathcal{H}_{\text{HL},1}}$ implies that the test has positive growth rate for all alternative distributions $\mathbb{Q}$ if n is large enough. This is a surprising result, since it might seem that restricting our estimator of ${\mathbb{E}_{\mathbb{Q}}}[Y|P]$ to isotonic functions in P implies some restriction on the class of alternatives against which the test has power — which is not the case. If $p\mapsto {\mathbb{E}_{\mathbb{Q}}}[Y|P=p]$ is non-decreasing, then ${\mathbb{E}_{\mathbb{Q}}}[Y|P]={\pi ^{\ast }}(P)$ almost surely and $D(\mathbb{Q})$ equals
\[ \underset{\pi :[0,1]\to [0,1]}{\max }{\mathbb{E}_{\mathbb{Q}}}\left[\log \left({(\pi (P)/P)^{Y}}{((1-\pi (P))/(1-P))^{1-Y}}\right)\right]\hspace{-0.1667em},\]
which follows by applying, in the expression above, the tower property of conditional expectations and strict concavity of $p\mapsto p\log (p)+(1-p)\log (1-p)$. Hence, in that case our test is asymptotically growth rate optimal in the sense that
\[ \underset{n\to \infty }{\liminf }\frac{{\mathbb{E}_{\mathbb{Q}}}[\log ({E_{\text{HL},n,\text{sym}}})]}{n}\ge D(\mathbb{Q})\]
is maximal among the growth rates of all tests for the hypothesis of calibration. In the case when $p\mapsto {\mathbb{E}_{\mathbb{Q}}}[Y|P=p]$ is not non-decreasing, our test is still optimal among all tests with non-decreasing alternative probabilities $\pi :[0,1]\to [0,1]$, by definition of the optimal isotonic approximation ${\pi ^{\ast }}$. How large the difference in growth rate to the optimal test is depends on how strongly the isotonicity assumption is violated, and is difficult to quantify in general.

Apart from the asymptotic growth rate, the power of the test also depends on the regret, which is of order $O({n^{1/2}}\log (n))$ for the algorithm presented. With a more complex exponential weights strategy given in Section 7.1 of [21], instead of the method in (2.7), one can achieve a smaller regret of order $O({n^{1/3}}\log {(n)^{2/3}})$, which matches the optimal rate up to logarithmic factors. To see that this indeed yields a valid test, notice that the hypothesis ${\mathcal{H}_{\mathrm{HL},n}}$ is about the conditional expectations of ${Y_{i}}$ given ${P_{i}}$ and, hence, one can assume that ${P_{1}},\dots ,{P_{n}}$ are given in advance to the learner. In particular, the probabilities ${q_{\sigma ,\sigma (i)}}$ then also depend on ${P_{\sigma (i+1)}},\dots ,{P_{\sigma (n)}}$, but they do not depend on ${Y_{\sigma (i+1)}},\dots ,{Y_{\sigma (n)}}$. This is not allowed in the online permutation isotonic regression setting, where the learner has to make the prediction for ${Y_{\sigma (i)}}$ without knowledge of ${P_{\sigma (i+1)}},\dots ,{P_{\sigma (n)}}$.
Part (iii) of Theorem 2.1 only gives a diverging lower bound on the expected value of $\log ({E_{\text{HL},n,\text{sym}}})$. However, under assumption (2.8) the average growth rate
\[ \frac{1}{n}{\sum \limits_{i=1}^{n}}\log \hspace{0.1667em}{\left(\frac{{\pi ^{\ast }}({P_{i}})}{{P_{i}}}\right)^{{Y_{i}}}}{\left(\frac{1-{\pi ^{\ast }}({P_{i}})}{1-{P_{i}}}\right)^{1-{Y_{i}}}}\]
satisfies the strong law of large numbers, and since $D(\mathbb{Q})\gt 0$ for $\mathbb{Q}\notin {\mathcal{H}_{\text{HL},1}}$, this implies that ${E_{\text{HL},n,\text{sym}}}\to \infty $ almost surely as $n\to \infty $. In particular, for any Type-I error α and desired power $1-\beta $, there exists a sample size N such that $\mathbb{Q}({E_{\text{HL},n,\text{sym}}}\ge 1/\alpha )\ge 1-\beta $ for $n\ge N$.

Proof of Theorem 2.1.
Part (i) is a consequence of the following facts. If ${({\pi _{n}})_{n\in \mathbb{N}}}$ is a sequence in ${\mathcal{F}_{\uparrow ,[0,1]}}$ such that ${\lim \nolimits_{n\to \infty }}{R_{\mathbb{Q}}}({\pi _{n}})={\sup _{\pi \in {\mathcal{F}_{\uparrow ,[0,1]}}}}R(\pi )$, then by Helly’s selection theorem, there exists a subsequence ${({\pi _{{n_{k}}}})_{k\in \mathbb{N}}}$ converging pointwise to some ${\pi ^{\ast }}\in {\mathcal{F}_{\uparrow ,[0,1]}}$. The function $\pi (P)\mapsto \log \left({(\pi (P)/P)^{Y}}{((1-\pi (P))/(1-P))^{1-Y}}\right)$ inside the expectation in the definition of ${R_{\mathbb{Q}}}$ is strictly concave, and the set ${\mathcal{F}_{\uparrow ,[0,1]}}$ is convex. Hence ${R_{\mathbb{Q}}}({\pi ^{\ast }})\ge {R_{\mathbb{Q}}}(\pi )$ for all $\pi \in {\mathcal{F}_{\uparrow ,[0,1]}}$, and equality holds if and only if $\pi ={\pi ^{\ast }}$ $\mathbb{Q}$-almost-surely, provided that ${R_{\mathbb{Q}}}({\pi ^{\ast }})$ is finite, which is shown below.
The nonnegativity in part (ii) holds because ${\mathcal{F}_{\uparrow ,[0,1]}}$ contains the identity function, and we only have to prove that $D(\mathbb{Q})\gt 0$ if $\mathbb{Q}\notin {\mathcal{H}_{\text{HL},1}}$. For this, it is sufficient to show that there exists some π with ${R_{\mathbb{Q}}}(\pi )\gt 0$. We start with some results about the existence of certain expected values. Since $Y|P$ is Bernoulli with expectation $\bar{\pi }(P)$, we have
\[\begin{aligned}{}& \hspace{1em}\hspace{2.5pt}{\mathbb{E}_{\mathbb{Q}}}[\log \hspace{0.1667em}{({\pi ^{\ast }}(P)/P)^{Y}}{((1-{\pi ^{\ast }}(P))/(1-P))^{1-Y}}\mid P]\\ {} & =\bar{\pi }(P)\log ({\pi ^{\ast }}(P)/P)\hspace{-0.1667em}+\hspace{-0.1667em}(1\hspace{-0.1667em}-\hspace{-0.1667em}\bar{\pi }(P))\log ((1\hspace{-0.1667em}-\hspace{-0.1667em}{\pi ^{\ast }}(P))/(1\hspace{-0.1667em}-\hspace{-0.1667em}P))\end{aligned}\]
and the nonnegativity of the Kullback-Leibler divergence implies
\[\begin{aligned}{}& \hspace{1em}\hspace{2.5pt}\bar{\pi }(P)\log ({\pi ^{\ast }}(P)/P)\hspace{-0.1667em}+\hspace{-0.1667em}(1-\bar{\pi }(P))\log ((1\hspace{-0.1667em}-\hspace{-0.1667em}{\pi ^{\ast }}(P))/(1\hspace{-0.1667em}-\hspace{-0.1667em}P))\\ {} & \le \bar{\pi }(P)\log (\bar{\pi }(P)/P)+(1-\bar{\pi }(P))\log ((1\hspace{-0.1667em}-\hspace{-0.1667em}\bar{\pi }(P))/(1\hspace{-0.1667em}-\hspace{-0.1667em}P)),\end{aligned}\]
hence we obtain
(2.9)
\[\begin{aligned}{}0& \le {\mathbb{E}_{\mathbb{Q}}}[\log \hspace{0.1667em}{({\pi ^{\ast }}(P)/P)^{Y}}{((1-{\pi ^{\ast }}(P))/(1-P))^{1-Y}}]\\ {} & \le {\mathbb{E}_{\mathbb{Q}}}[\bar{\pi }(P)\log (\bar{\pi }(P)/P)\\ {} & \hspace{2em}\hspace{2em}+(1-\bar{\pi }(P))\log ((1-\bar{\pi }(P))/(1-P))]\\ {} & \le {\mathbb{E}_{\mathbb{Q}}}[|\log (P)|+|\log (1-P)|]\\ {} & \le \sqrt{{\mathbb{E}_{\mathbb{Q}}}[\log {(P)^{2}}]}+\sqrt{{\mathbb{E}_{\mathbb{Q}}}[\log {(1-P)^{2}}]}\lt \infty .\end{aligned}\]
Let now $\tilde{\pi }(P)$ be a version of the conditional expectation of $\bar{\pi }(P)$ with respect to the sigma lattice generated by P, where $\tilde{\pi }(P)$ satisfies the following properties:
(2.10)
\[ \tilde{\pi }(P)\hspace{2.5pt}\text{is square integrable and non-decreasing as a function of}\hspace{2.5pt}P;\]
(2.11)
\[\begin{aligned}{}& {\mathbb{E}_{\mathbb{Q}}}[(\bar{\pi }(P)-\tilde{\pi }(P))h(P)]\le 0\hspace{2.5pt}\text{for all increasing}\hspace{2.5pt}h\hspace{2.5pt}\text{such that}\hspace{2.5pt}\\ {} & \hspace{1em}{\mathbb{E}_{\mathbb{Q}}}[h{(P)^{2}}]\lt \infty ;\end{aligned}\]
(2.12)
\[\begin{aligned}{}& {\mathbb{E}_{\mathbb{Q}}}[\bar{\pi }(P){1_{B}}(P)]={\mathbb{E}_{\mathbb{Q}}}[\tilde{\pi }(P){1_{B}}(P)]\hspace{2.5pt}\text{for all}\hspace{2.5pt}B\hspace{2.5pt}\text{in the}\hspace{2.5pt}\\ {} & \hspace{1em}\sigma \text{-field generated by}\hspace{2.5pt}\tilde{\pi }.\end{aligned}\]
Equation (2.10) holds by definition of the conditional expectation given a sigma lattice, and (2.11) and (2.12) are by [5, Equations (3.9) and (3.11)]. By (2.12), we have $\bar{\pi }(P)=P$ almost surely if and only if $\tilde{\pi }(P)=P$ almost surely. By definition of ${\pi ^{\ast }}(P)$, we know that
\[\begin{aligned}{}& {\mathbb{E}_{\mathbb{Q}}}[\log \hspace{0.1667em}{(\tilde{\pi }(P)/P)^{Y}}{((1-\tilde{\pi }(P))/(1-P))^{1-Y}}]\\ {} & \hspace{1em}\le {\mathbb{E}_{\mathbb{Q}}}[\log \hspace{0.1667em}{({\pi ^{\ast }}(P)/P)^{Y}}{((1-{\pi ^{\ast }}(P))/(1-P))^{1-Y}}].\end{aligned}\]
The goal is now to prove
\[ {\mathbb{E}_{\mathbb{Q}}}\left[\log \left({(\tilde{\pi }(P)/P)^{Y}}{((1-\tilde{\pi }(P))/(1-P))^{1-Y}}\right)\right]\ge 0,\]
with equality if and only if $\tilde{\pi }(P)=P$ $\mathbb{Q}$-almost-surely. Notice that $\tilde{\pi }$ is only defined on the support of P, but one can assume without loss of generality that it is defined on the whole interval $[0,1]$ by right-continuous constant extrapolation in parts where it is not defined. Since $\tilde{\pi }$ is increasing, there exist at most countably many disjoint intervals ${A_{i}}\subseteq [0,1]$, $i\in \mathcal{I}$, on which $\tilde{\pi }$ is constant with some value ${c_{i}}\in [0,1]$. Furthermore, there are at most countably many disjoint intervals ${B_{j}}$, $j\in \mathcal{J}$, whose union equals $[0,1]\setminus {\textstyle\bigcup _{i\in \mathcal{I}}}{A_{i}}$. We now make a few case distinctions.

Fix i with ${c_{i}}\gt 0$ and assume that ${A_{i}}=[{a_{i}},{b_{i}}]$; the following arguments can be easily modified for the case that ${A_{i}}$ is (half-)open. Define the function
\[ {h_{i}}(x)=\left\{\begin{array}{l@{\hskip10.0pt}l}\log (1/{a_{i}})+1,\hspace{2.5pt}\hspace{1em}& \hspace{2.5pt}\text{if}\hspace{2.5pt}x\lt {a_{i}},\\ {} \log (1/x),\hspace{1em}& \hspace{2.5pt}\text{if}\hspace{2.5pt}x\in [{a_{i}},{b_{i}}],\\ {} -1,\hspace{1em}& \hspace{2.5pt}\text{if}\hspace{2.5pt}x\gt {b_{i}}.\end{array}\right.\]
Then ${h_{i}}(P)$ is square integrable due to (2.9), ${h_{i}}$ is decreasing, and constant outside of $[{a_{i}},{b_{i}}]$, so that
\[\begin{aligned}{}0& \stackrel{\text{(2.11)}}{\le }{\mathbb{E}_{\mathbb{Q}}}[(\bar{\pi }(P)-\tilde{\pi }(P)){h_{i}}(P)]\\ {} & \stackrel{\text{(2.12)}}{=}{\mathbb{E}_{\mathbb{Q}}}[(\bar{\pi }(P)-\tilde{\pi }(P))\log (1/P){1_{{A_{i}}}}(P)]\\ {} & \stackrel{\text{(2.12)}}{=}{\mathbb{E}_{\mathbb{Q}}}[(\bar{\pi }(P)-\tilde{\pi }(P))\log ({c_{i}}/P){1_{{A_{i}}}}(P)]\\ {} & \hspace{0.1667em}\hspace{0.1667em}\hspace{0.1667em}\hspace{0.1667em}=\hspace{0.1667em}\hspace{0.1667em}\hspace{0.1667em}\hspace{0.1667em}{\mathbb{E}_{\mathbb{Q}}}[(\bar{\pi }(P)-\tilde{\pi }(P))\log (\tilde{\pi }(P)/P){1_{{A_{i}}}}(P)],\end{aligned}\]
where the last step is due to the fact that $\tilde{\pi }(P)={c_{i}}$ for $P\in {A_{i}}$. Hence
\[\begin{aligned}{}& {\mathbb{E}_{\mathbb{Q}}}[\bar{\pi }(P)\log (\tilde{\pi }(P)/P){1_{{A_{i}}}}(P)]\\ {} & \hspace{1em}\ge {\mathbb{E}_{\mathbb{Q}}}[\tilde{\pi }(P)\log (\tilde{\pi }(P)/P){1_{{A_{i}}}}(P)].\end{aligned}\]
If ${c_{i}}=0$, the above inequality is still true because (2.12) implies that $\tilde{\pi }(P)=\bar{\pi }(P)=0$ for $P\in {A_{i}}$ in that case, and we define $0\log (0):=0$.

Similarly, for ${c_{i}}\lt 1$ we define
\[ {h_{i}}(x)=\left\{\begin{array}{l@{\hskip10.0pt}l}\log (1/(1-{b_{i}}))+1,\hspace{2.5pt}\hspace{1em}& \hspace{2.5pt}\text{if}\hspace{2.5pt}x\gt {b_{i}},\\ {} \log (1/(1-x)),\hspace{1em}& \hspace{2.5pt}\text{if}\hspace{2.5pt}x\in [{a_{i}},{b_{i}}],\\ {} -1,\hspace{1em}& \hspace{2.5pt}\text{if}\hspace{2.5pt}x\lt {a_{i}},\end{array}\right.\]
which is square integrable and increasing. As before,
\[\begin{aligned}{}0& \stackrel{\text{(2.11)}}{\le }{\mathbb{E}_{\mathbb{Q}}}[(\tilde{\pi }(P)-\bar{\pi }(P)){h_{i}}(P)]\\ {} & \stackrel{\text{(2.12)}}{=}{\mathbb{E}_{\mathbb{Q}}}[(\tilde{\pi }(P)-\bar{\pi }(P))\log (1/(1-P)){1_{{A_{i}}}}(P)]\\ {} & \stackrel{\text{(2.12)}}{=}{\mathbb{E}_{\mathbb{Q}}}[(\tilde{\pi }(P)-\bar{\pi }(P))\log ((1-{c_{i}})/(1-P)){1_{{A_{i}}}}(P)]\\ {} & \hspace{0.1667em}\hspace{0.1667em}\hspace{0.1667em}\hspace{0.1667em}=\hspace{0.1667em}\hspace{0.1667em}\hspace{0.1667em}\hspace{0.1667em}{\mathbb{E}_{\mathbb{Q}}}[(1-\bar{\pi }(P)-(1-\tilde{\pi }(P)))\\ {} & \hspace{1em}\hspace{2em}\times \log ((1-\tilde{\pi }(P))/(1-P)){1_{{A_{i}}}}(P)],\end{aligned}\]
so we obtain
\[\begin{aligned}{}& {\mathbb{E}_{\mathbb{Q}}}[(1-\bar{\pi }(P))\log ((1-\tilde{\pi }(P))/(1-P)){1_{{A_{i}}}}(P)]\\ {} & \hspace{1em}\ge {\mathbb{E}_{\mathbb{Q}}}[(1-\tilde{\pi }(P))\log ((1-\tilde{\pi }(P))/(1-P)){1_{{A_{i}}}}(P)],\end{aligned}\]
which also holds if $\tilde{\pi }(P)=1$ on ${A_{i}}$. Hence we have shown that
\[\begin{aligned}{}0& \le {\mathbb{E}_{\mathbb{Q}}}[{1_{{A_{i}}}}(P)(\tilde{\pi }(P)\log (\tilde{\pi }(P)/P)\\ {} & \hspace{2em}\hspace{2em}+(1-\tilde{\pi }(P))\log ((1-\tilde{\pi }(P))/(1-P)))]\\ {} & \le {\mathbb{E}_{\mathbb{Q}}}[{1_{{A_{i}}}}(P)(\bar{\pi }(P)\log (\tilde{\pi }(P)/P)\\ {} & \hspace{2em}\hspace{2em}+(1-\bar{\pi }(P))\log ((1-\tilde{\pi }(P))/(1-P)))]\\ {} & ={\mathbb{E}_{\mathbb{Q}}}[{1_{{A_{i}}}}(P)\log \hspace{0.1667em}{(\tilde{\pi }(P)/P)^{Y}}{((1-\tilde{\pi }(P))/(1-P))^{1-Y}}],\end{aligned}\]
and equality holds if and only if $\tilde{\pi }(P)=P$ $\mathbb{Q}$-almost-surely on ${A_{i}}$, since the Kullback-Leibler divergence is non-negative.

Consider now an interval ${B_{j}}$. Since $\tilde{\pi }$ is strictly increasing on ${B_{j}}$, the sigma field generated by $\tilde{\pi }$ contains all Borel sets which are subsets of ${B_{j}}$. Then (2.12) implies that $\tilde{\pi }(P)=\bar{\pi }(P)$ $\mathbb{Q}$-almost-surely on ${B_{j}}$, hence
\[\begin{aligned}{}& {\mathbb{E}_{\mathbb{Q}}}[{1_{{B_{j}}}}(P)(\bar{\pi }(P)\log (\tilde{\pi }(P)/P)\\ {} & \hspace{2em}+(1-\bar{\pi }(P))\log ((1-\tilde{\pi }(P))/(1-P)))]\\ {} & ={\mathbb{E}_{\mathbb{Q}}}[{1_{{B_{j}}}}(P)(\tilde{\pi }(P)\log (\tilde{\pi }(P)/P)\\ {} & \hspace{2em}\hspace{1em}+(1-\tilde{\pi }(P))\log ((1-\tilde{\pi }(P))/(1-P)))]\ge 0\end{aligned}\]
with equality if and only if $\tilde{\pi }(P)=P$ $\mathbb{Q}$-almost-surely on ${B_{j}}$.

With the above derivations, we obtain that for any finite number of indices ${i_{1}},\dots ,{i_{n}}\in \mathcal{I}$, ${j_{1}},\dots ,{j_{n}}\in \mathcal{J}$ and
\[ {C_{n}}=\left({\bigcup \limits_{k=1}^{n}}{A_{{i_{k}}}}\right)\cup \left({\bigcup \limits_{l=1}^{n}}{B_{{j_{l}}}}\right),\]
the following inequalities hold,
(2.13)
\[ 0\le {\mathbb{E}_{\mathbb{Q}}}[{1_{{C_{n}}}}(P)(\tilde{\pi }(P)\log (\tilde{\pi }(P)/P)+(1-\tilde{\pi }(P))\log ((1-\tilde{\pi }(P))/(1-P)))],\]
(2.14)
\[\begin{aligned}{}& {\mathbb{E}_{\mathbb{Q}}}[{1_{{C_{n}}}}(P)(\tilde{\pi }(P)\log (\tilde{\pi }(P)/P)+(1-\tilde{\pi }(P))\log ((1-\tilde{\pi }(P))/(1-P)))]\\ {} & \hspace{1em}\le {\mathbb{E}_{\mathbb{Q}}}[{1_{{C_{n}}}}(P)(\bar{\pi }(P)\log (\tilde{\pi }(P)/P)+(1-\bar{\pi }(P))\log ((1-\tilde{\pi }(P))/(1-P)))].\end{aligned}\]
Since the integrand in (2.13) is non-negative and the integrand in (2.14) is dominated pointwise by
\[ M(P)\hspace{-0.1667em}=\hspace{-0.1667em}\bar{\pi }(P)\log (\bar{\pi }(P)/P)\hspace{-0.1667em}+\hspace{-0.1667em}(1\hspace{-0.1667em}-\hspace{-0.1667em}\bar{\pi }(P))\log ((1-\bar{\pi }(P))/(1-P))\]
with ${\mathbb{E}_{\mathbb{Q}}}[M(P)]\lt \infty $, we can choose index sequences such that ${\textstyle\bigcup _{n=1}^{N}}{C_{n}}$ increases to $[0,1]$, and apply Fatou’s Lemma and the dominated convergence theorem to obtain
(2.15)
\[\begin{aligned}{}0& \le {\mathbb{E}_{\mathbb{Q}}}[\tilde{\pi }(P)\log (\tilde{\pi }(P)/P)\\ {} & \hspace{2em}\hspace{2em}+(1-\tilde{\pi }(P))\log ((1-\tilde{\pi }(P))/(1-P))]\\ {} & \le {\mathbb{E}_{\mathbb{Q}}}[\bar{\pi }(P)\log (\tilde{\pi }(P)/P)\\ {} & \hspace{2em}\hspace{2em}+(1-\bar{\pi }(P))\log ((1-\tilde{\pi }(P))/(1-P))].\end{aligned}\]
Equality in (2.15) holds if and only if $\tilde{\pi }(P)=P$ almost surely.

For part (iii), the inequality of arithmetic and geometric mean implies that
\[\begin{aligned}{}E& \ge {\left(\prod \limits_{\sigma \in {\mathcal{S}_{n}}}{\prod \limits_{i=1}^{n}}\frac{{q_{\sigma ,\sigma (i)}^{{Y_{\sigma (i)}}}}{(1-{q_{\sigma ,\sigma (i)}})^{1-{Y_{\sigma (i)}}}}}{{P_{\sigma (i)}^{{Y_{\sigma (i)}}}}{(1-{P_{\sigma (i)}})^{1-{Y_{\sigma (i)}}}}}\right)^{1/n!}}\\ {} & =\exp \left(\frac{1}{n!}\sum \limits_{\sigma \in {\mathcal{S}_{n}}}{\sum \limits_{i=1}^{n}}\log \frac{{q_{\sigma ,\sigma (i)}^{{Y_{\sigma (i)}}}}{(1-{q_{\sigma ,\sigma (i)}})^{1-{Y_{\sigma (i)}}}}}{{P_{\sigma (i)}^{{Y_{\sigma (i)}}}}{(1-{P_{\sigma (i)}})^{1-{Y_{\sigma (i)}}}}}\right).\end{aligned}\]
The term inside the exponential can be written as
\[ L={\mathbb{E}_{\sigma }}\left[{\sum \limits_{i=1}^{n}}\log \frac{{q_{\sigma ,\sigma (i)}^{{Y_{\sigma (i)}}}}{(1-{q_{\sigma ,\sigma (i)}})^{1-{Y_{\sigma (i)}}}}}{{P_{\sigma (i)}^{{Y_{\sigma (i)}}}}{(1-{P_{\sigma (i)}})^{1-{Y_{\sigma (i)}}}}}\right],\]
which is the negative of the entropic loss defined in Section 4.4 of [20], and the expectation ${\mathbb{E}_{\sigma }}[\cdot ]$ is with respect to the uniform distribution over all permutations σ of $\{1,\dots ,n\}$. It follows from Lemma 2.1, Theorem 4.3 and the proof of Theorem 4.1 of [20] that for all $K\in \mathbb{N}$,
\[\begin{aligned}{}& L-{\sum \limits_{i=1}^{n}}\log \frac{{\hat{\pi }_{i}^{{Y_{i}}}}{(1-{\hat{\pi }_{i}})^{1-{Y_{i}}}}}{{P_{i}^{{Y_{i}}}}{(1-{P_{i}})^{1-{Y_{i}}}}}\\ {} & \hspace{1em}\ge -{\sum \limits_{k=1}^{n}}\left(\frac{2}{K}+\frac{4K}{k}\log (1+k)\right),\end{aligned}\]
where ${\hat{\pi }_{1}},\dots ,{\hat{\pi }_{n}}$ is the isotonic regression of ${Y_{1}},\dots ,{Y_{n}}$ on ${P_{1}},\dots ,{P_{n}}$, i.e. the maximizer of
\[ ({g_{1}},\dots ,{g_{n}})\mapsto \hat{R}({g_{1}},\dots ,{g_{n}};\hspace{0.1667em}{P_{1}},\dots ,{P_{n}},{Y_{1}},\dots ,{Y_{n}}),\]
as defined at (2.6). The result now follows because
\[\begin{aligned}{}& \hat{R}({\hat{\pi }_{1}},\dots ,{\hat{\pi }_{n}};\hspace{0.1667em}{P_{1}},\dots ,{P_{n}},{Y_{1}},\dots ,{Y_{n}})\\ {} & \hspace{1em}\ge \hat{R}({\pi ^{\ast }}({P_{1}}),\dots ,{\pi ^{\ast }}({P_{n}});\hspace{0.1667em}{P_{1}},\dots ,{P_{n}},{Y_{1}},\dots ,{Y_{n}})\end{aligned}\]
and ${\textstyle\sum _{k=1}^{n}}(2/K+4K\log (1+k)/k)=\mathcal{O}(\sqrt{n{(\log (n))^{2}}})$ for K of order $\sqrt{n/{(\log (n))^{2}}}$. □

Remark.
An alternative idea for constructing an e-value for ${\mathcal{H}_{\text{HL},n}}$ could be a Bayesian approach inspired by the original procedure of the HL-test. Conditional on ${P_{1}},\dots ,{P_{n}}$, the likelihood of ${Y_{1}},\dots ,{Y_{n}}$ is fully specified. For the likelihood under the alternative, one could put a meta-prior on the number g of quantile based bins. Conditional on g, the bin probabilities are then given by a Dirichlet prior. If the hyper-parameters of the Dirichlet prior are chosen independently of the data, then the resulting likelihood ratio (e-value) does not depend on the ordering of the data. However, if one desires to choose the hyper-parameters in a data-adaptive manner, then a similar procedure with averaging over permutations as in our proposed e-value, or a repeated sample splitting approach seem necessary to avoid instabilities due to data ordering in the iid case. We do not expect to obtain universal power guarantees with this approach since it is based on binning and at least some continuity assumptions on the distribution under the alternative seem necessary.
2.5 A Feasible Version of the Test
The ideal test described in Algorithm 1 cannot be implemented for practically relevant n, as it requires computing e-values over all $n!$ permutations of $\{1,\dots ,n\}$. Even for a single permutation σ, the inner loop in Algorithm 1 has computational complexity of $\mathcal{O}({n^{2}})$: it requires computing $2n$ isotonic regressions to generate out-of-sample predictions. We suggest addressing these problems with the simplified version in Algorithm 2, which can be regarded as a version of the split likelihood ratio test by [40].
A delicate point in Algorithm 2 is Step 6, where one needs to generate out-of-sample predictions from the isotonic regression fit. Naive extrapolation approaches could lead to predicted probabilities ${q_{i}}\in \{0,1\}$ and hence an e-value of zero if either ${q_{i}}=0$ and ${Y_{i}}=1$ or ${q_{i}}=1$ and ${Y_{i}}=0$.
Let ${p_{1}}\lt \cdots \lt {p_{m}}$ denote the distinct values of ${P_{i}}$, $i\in {S_{b}}$, and ${\hat{\pi }_{1}}\le \cdots \le {\hat{\pi }_{m}}$ the corresponding values of the isotonic regression. A well known result about isotonic regression states that there exists a partition of ${S_{b}}$ into index sets ${\mathcal{I}_{1}},\dots ,{\mathcal{I}_{d}}$ such that ${\hat{\pi }_{j}}$ is the empirical mean of the ${Y_{i}}$ with indices in ${\mathcal{I}_{j}}$,
\[ {\hat{\pi }_{j}}=\frac{1}{\mathrm{\# }{\mathcal{I}_{j}}}\sum \limits_{i\in {\mathcal{I}_{j}}}{Y_{i}}.\]
To remedy the problem of predictions in $\{0,1\}$, we propose to apply the smoothed Laplace predictor, equivalent to Jeffreys’ prior in binomial proportion estimation,
\[ {\check{\pi }_{j}}=\frac{1}{\mathrm{\# }{\mathcal{I}_{j}}+1}\left(0.5+\sum \limits_{i\in {\mathcal{I}_{j}}}{Y_{i}}\right)\in (0,1).\]
For out-of-sample predictions at ${P_{i}}\notin \{{p_{1}},\dots ,{p_{m}}\}$, one can then apply linear interpolation
\[ {q_{i}}=\left\{\begin{array}{l@{\hskip10.0pt}l}\displaystyle \frac{{p_{l}}-{P_{i}}}{{p_{l}}-{p_{k}}}{\check{\pi }_{k}}+\displaystyle \frac{{P_{i}}-{p_{k}}}{{p_{l}}-{p_{k}}}{\check{\pi }_{l}},\hspace{1em}& \hspace{2.5pt}\hspace{2.5pt}\text{if}\hspace{2.5pt}{P_{i}}\in [{p_{k}},{p_{l}}],\\ {} {\check{\pi }_{1}},\hspace{1em}& \hspace{2.5pt}\hspace{2.5pt}\text{if}\hspace{2.5pt}{P_{i}}\lt {p_{1}},\\ {} {\check{\pi }_{m}},\hspace{1em}& \hspace{2.5pt}\hspace{2.5pt}\text{if}\hspace{2.5pt}{P_{i}}\gt {p_{m}},\end{array}\right.\]
where it is now guaranteed that ${q_{i}}\in (0,1)$.
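Putting the pieces together, the following R sketch computes the e-value for a single random split with splitting fraction s. It is an illustration of the construction above, not the replication code; the full procedure averages such e-values over B repeated splits, which remains valid since an average of e-values is again an e-value.

ehl_one_split <- function(p, y, s = 1/2) {
  # fit a Laplace-smoothed isotonic recalibration on a fraction s of the data
  # and evaluate the product e-value on the remaining observations
  idx <- sample.int(length(y), size = floor(s * length(y)))
  p_fit <- p[idx];  y_fit <- y[idx]
  p_val <- p[-idx]; y_val <- y[-idx]

  ord <- order(p_fit)
  fit <- isoreg(p_fit[ord], y_fit[ord])                 # PAV fit on the fitting part
  blocks <- cumsum(c(TRUE, diff(fit$yf) > 0))           # level sets of the isotonic fit
  smoothed <- ave(y_fit[ord], blocks,
                  FUN = function(v) (sum(v) + 0.5) / (length(v) + 1))
  q <- approx(p_fit[ord], smoothed, xout = p_val,       # linear interpolation between fits,
              rule = 2, ties = mean)$y                  # constant outside the observed range
  prod(ifelse(y_val == 1, q / p_val, (1 - q) / (1 - p_val)))
}
# mean(replicate(10, ehl_one_split(p, y))) would mimic averaging over B = 10 random splits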
3 Simulations
This section evaluates the empirical performance of the feasible version of the proposed test in Section 2.5 together with sensible values of the splitting fraction $s\in (0,1)$. We follow the simulation setup of [17] with a quadratic misspecification in assessing HL-type tests, which is, if at all, just slightly modified in more recent contributions [16, 42, 1, 6, 25]. Replication material for the simulations and the application in Section 4 in the statistical software R is available under https://github.com/marius-cp/eHL.
For $i=1,\dots ,2n$ with $n\in \{1024,2048,4096,8192\}$, we simulate iid covariates ${X_{i}}\stackrel{\text{iid}}{\sim }\text{U}(-3,3)$ and let the response variables ${Y_{i}}\sim \text{Bernoulli}({\bar{\pi }_{i}})$ be independent, where the true conditional event probability ${\bar{\pi }_{i}}$ follows a logistic transformation of a quadratic model
(3.1)
\[ \begin{aligned}{}{\bar{\pi }_{i}}& =\bar{\pi }({X_{i}})=\mathbb{P}({Y_{i}}=1\mid {X_{i}};\hspace{0.1667em}{\beta _{0}},{\beta _{1}},{\beta _{2}})\\ {} & =\frac{\exp ({\beta _{0}}+{\beta _{1}}{X_{i}}+{\beta _{2}}{X_{i}^{2}})}{1+\exp ({\beta _{0}}+{\beta _{1}}{X_{i}}+{\beta _{2}}{X_{i}^{2}})}.\end{aligned}\]
We split the simulated data into an estimation set and a validation set, both of size n. Based on the data in the estimation set, we estimate the parameters of a linear, and hence misspecified, logistic regression model by maximum likelihood and denote the parameter estimates by $\big({\widehat{\beta }_{0}},{\widehat{\beta }_{1}}\big)$. The probability of a positive outcome is then predicted by
(3.2)
\[ {P_{i}}=\frac{\exp ({\widehat{\beta }_{0}}+{\widehat{\beta }_{1}}{X_{i}})}{1+\exp ({\widehat{\beta }_{0}}+{\widehat{\beta }_{1}}{X_{i}})}.\]
We vary the severity of the misspecification, expressed through the magnitude of ${\beta _{2}}$. Following [17], we characterize the “lack of linearity” through the conditions $\bar{\pi }(-3)=j+0.00733745$, $\bar{\pi }(-1.5)=0.05$ and $\bar{\pi }(3)=0.95$ such that the value $j=0$ results in the very accurate approximation ${\beta _{2}}\approx 0$, i.e., a linear effect of ${X_{i}}$ on the log odds-ratio. We consider a sequence of 51 equally spaced values of j in the interval $[0,0.1]$. Notice that for each choice of j, the values of ${\beta _{0}}$ and ${\beta _{1}}$ are also determined by these conditions.
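The data-generating process can be reproduced along the following lines in R; the coefficients are obtained by solving the three logit-scale conditions above, and the specific value of j used here is only an example, not the replication code.

dgp_coefs <- function(j) {
  # solve for (beta0, beta1, beta2) from the three conditions on the true probabilities
  x <- c(-3, -1.5, 3)
  as.numeric(solve(cbind(1, x, x^2), qlogis(c(j + 0.00733745, 0.05, 0.95))))
}
simulate_data <- function(n, j) {
  beta <- dgp_coefs(j)
  x <- runif(2 * n, -3, 3)
  pi_bar <- plogis(beta[1] + beta[2] * x + beta[3] * x^2)   # true probabilities as in (3.1)
  data.frame(x = x, y = rbinom(2 * n, 1, pi_bar))
}

set.seed(3)
dat <- simulate_data(1024, j = 0.043)                    # 'slight' misspecification per the text
est <- dat[1:1024, ]; val <- dat[1025:2048, ]
fit <- glm(y ~ x, family = binomial, data = est)         # misspecified linear logit model
p_val <- predict(fit, newdata = val, type = "response")  # predictions (3.2) on the validation set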
Table 1
Rejection rates in percentage points of the classical HL test and our eHL test under the null hypothesis with $j=0$ and the true regression parameters $({\beta _{0}},{\beta _{1}})$ in (3.2) at a significance level of $5\% $. We treat an e-value above 20 as a rejection in the eHL test.
$n$ | HL | eHL, $s=1/3$ | eHL, $s=1/2$ | eHL, $s=2/3$
$n=1024$ | 6.2 | 0.5 | 1.0 | 0.4
$n=2048$ | 5.0 | 0.1 | 0.4 | 0.6
$n=4096$ | 4.7 | 0.2 | 0.4 | 0.6
$n=8192$ | 4.5 | 0.0 | 0.1 | 0.5
Table 1 reports rejection rates of the tests over 1000 simulation replications, where we set ${\beta _{2}}=0$ (i.e., $j=0$), and use the true regression parameters $({\beta _{0}},{\beta _{1}})$ in (3.2) to guarantee that the null hypothesis ${\mathcal{H}_{\text{HL},n}}$ holds. For the classical HL test, we use ten equally populated (quantile-spaced) bins, where the exact procedure follows the method Q${^{R}}$ described in Appendix A. For the feasible eHL test of Section 2.5, we use the splitting fractions $s\in \{1/3,1/2,2/3\}$. To limit computation time, we choose a relatively low number of bootstrap replications $B=10$ in the eHL test as we are mainly interested in rejection rates averaged across simulation replications, and hence, stability of the test is less of a concern than, e.g., in the subsequent empirical application. Here and in the following, we treat e-values above 20 as a test rejection at the $5\% $ significance level. The table shows that all tests are well sized, where all eHL versions exhibit rejection frequencies much below the nominal value of $5\% $, which is not unusual for tests based on e-values.
Figure 1
Rejection (left) and e-value growth (right) rates for the classical HL (cHL) test, the feasible eHL test and an oracle eHL test for a range of splitting factors s. The oracle eHL test is based on the true ${\pi _{i}}$. The x-axis contains the severity of model misspecification, and the vertically aligned plots correspond to different sample sizes.
Figure 1 analyzes the tests’ behaviour under the alternative hypotheses induced by $j\gt 0$. Notice that we use the true parameters $({\beta _{0}},{\beta _{1}})$ in (3.2) for $j=0$ but estimates $\big({\widehat{\beta }_{0}},{\widehat{\beta }_{1}}\big)$ for any $j\gt 0$ as the pseudo-true parameters are unknown under model misspecification. In this analysis, we further include an oracle version of the eHL test, whose e-values are optimal in the sense that they are based on ${q_{i}}={\bar{\pi }_{i}}$, i.e., the practically unknown true conditional event probabilities. The oracle eHL version with $s=1/2$ facilitates a fair comparison with the feasible eHL test based on the same splitting factor. The left panel of the figure shows classical test rejection rates for a nominal significance level of $5\% $. Following the explanations in Section 2.4 together with [13] and [31], a suitable measure of power for e-values is the growth rate $\mathbb{E}\big[\log ({E_{\text{HL},n}})\big]$, which is shown in the right panel of Figure 1, where we approximate the expectation by the average log e-value over the simulation replications. We restrict attention to $n\in \{1024,4096\}$ as the other sample sizes do not yield further insights.
We find that all tests develop power for increasing misspecification j. E.g., the feasible eHL versions already have substantial power for both sample sizes against an alternative with $j\approx 0.043$, which [17] interpret as a value inducing only ‘slight’ misspecification. (Notice that $j\approx 0.043$ equals 0.05 in their parametrization.) There seems to be little difference among the feasible eHL tests when using different splitting fractions s, and hence, we do not find arguments to deviate from the natural choice of $s=1/2$, which we continue to use in the application.
The higher power of the classical HL test can be explained by the required sample split in the eHL test, and the estimation error in assigning suitable values for ${q_{i}}$. The two oracle eHL tests make these steps redundant and hence achieve comparable power to the classical HL test. Perhaps surprisingly, the difference between the two oracle eHL tests with different s is smaller than the respective difference to the feasible test versions based on estimated ${q_{i}}$’s, which means that tuning the test to a specific alternative through the ${q_{i}}$’s is the main empirical challenge of the eHL test.
Notice that the often overlooked bin specification in the classical HL test implicitly determines the set of alternatives the test has power against as e.g. illustrated in [8, Section 5]. As the sample split in the eHL test allows for estimating a suitable alternative, Theorem 2.1 shows that the (ideal version of the) eHL test has power against all alternatives. This power guarantee together with the eHL test’s stability come at the cost of a lower power compared to the classical HL test in specific smooth forms of misspecifications as shown in Figure 1.
Turning to the growth rates of the feasible eHL tests, we find that larger choices of s perform better for slight model misspecifications (small j) while the opposite is true for large misspecifications. This can be explained as follows: as discussed around (2.2)–(2.3), the estimated ${q_{i}}$ must lie on the ‘correct’ side of ${P_{i}}$, that is, on the same side as ${\bar{\pi }_{i}}$, for the test to gain power, which might be violated for small s (and n) under slight misspecifications.
4 Application: Credit Card Defaults in Taiwan
In this application, we analyze (re-)calibration of probability predictions for the binary event of credit card defaults in Taiwan in the year 2005. In that time period, banks in Taiwan over-issued credit cards, also to unqualified applicants, who at the same time overused the cards for consumption, resulting in severe credit card debts and damaged consumer finance confidence [43, 24]. This crisis calls for improved and in particular calibrated probability predictions for credit card defaults that can be used for a thorough risk management and improved financial regulations.
For our analysis, we use a data set of $m=30\hspace{0.1667em}000$ credit card holders from Taiwan in 2005, that is publicly available from the UCI Machine Learning Repository [10, 43] under https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients. Specifically, the binary response variable ${Y_{i}}\in \{0,1\}$ contains information on whether a default payment, ${Y_{i}}=1$, occurred for customer $i=1,\dots ,m$. We observe a relatively high rate of $22.12\% $ of default payments in the data set that reflects the above mentioned credit card crisis. The data set further includes 23 explanatory variables, containing information on the amount of given credit, gender, education, marital status, age, and various historical payment records for the past six months.
Table 2
Prediction method | eHL e-values | Range of HL p-values |
Logistic model | $7.0\cdot {10^{28}}$ | [0.00, 0.00] |
Logistic model with increased estimation set | $9.6\cdot {10^{22}}$ | [0.00, 0.00] |
Isotonic recalibration | 20.04 | [0.00, 0.91] |
Bagged isotonic recalibration | 6.14 | [0.00, 0.53] |
We randomly split the data into an estimation and a Recalibration set $\mathcal{R}$ with $M=12000$ observations each, and a Validation set $\mathcal{V}$ containing the remaining $n=6000$ observations. We use the estimation set to fit a standard logistic regression model based on all predictor variables by maximum likelihood and compute the model predictions on the recalibration and validation sets, respectively. We run all the following tests on the validation set.
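A sketch of this preparation step in R follows; the file name and the name of the response column are assumptions (the UCI file is distributed as a spreadsheet and must be exported first, and any ID column should be dropped before fitting).

d <- read.csv("credit_default.csv")   # assumed CSV export with response column 'default'

set.seed(4)
grp <- sample(rep(c("est", "recal", "val"), times = c(12000, 12000, 6000)))
est <- d[grp == "est", ]; recal <- d[grp == "recal", ]; val <- d[grp == "val", ]

fit <- glm(default ~ ., family = binomial, data = est)        # all 23 predictors
p_recal <- predict(fit, newdata = recal, type = "response")   # logistic predictions on R
p_val   <- predict(fit, newdata = val,   type = "response")   # logistic predictions on V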
Table 2 reports the e-values of the feasible version of our calibration test described in Section 2.5 based on $B=10000$ bootstrap replications and with a splitting factor of $s=1/2$ that is motivated by our simulation results. We further report the range between the smallest and largest p-value of the classical HL test, where the different p-values result from five different but natural binning procedures, each using $g=5,\dots ,20$ bins. We provide further details on these implementation choices in Appendix A.
The predictions from the logistic model result in an e-value far beyond the value of 20 in Table 2, hence implying that these predictions are clearly miscalibrated. In this setting, all implementation choices of the classical HL test agree and deliver p-values very close to zero. The second row of the table shows that even when using all observations in the “increased estimation set” comprising the estimation and the recalibration set, the situation barely changes and both the eHL and HL tests agree (under all implementation choices).
As a consequence of these clear rejections, we now aim at isotonically recalibrating the probability predictions, a technique that proved valuable in other disciplines [14, 36], where it is also called “post-processing”. For this, we estimate an isotonic regression on the recalibration set $\mathcal{R}$ and generate recalibrated predictions by transforming the logistic predictions on $\mathcal{V}$ with the estimated isotonic regression function. Table 2 shows that our calibration test has an e-value just above 20, i.e., a weak rejection when interpreted as a (conservative) test at the $5\% $ level.
Figure 2
Bagged isotonic recalibration curve of the logit predictions. The blue curve shows the mean, and the red band the range of the pointwise $1\% $ and $99\% $ quantiles, over all bagging iterations.
As a nonparametric method, the isotonic regression is known to involve substantial estimation noise that might adversely affect the recalibrated predictions. Hence, we stabilize the estimation through the classical bagging (bootstrap aggregation) method of [4]. In detail, we draw $\widetilde{B}=100$ bootstrap samples ${\mathcal{R}_{b}}$, $b=1,\dots ,\widetilde{B}$ of size M from the recalibration set $\mathcal{R}$ and estimate the isotonic regression on each bootstrap sample ${\mathcal{R}_{b}}$. The final predictions are obtained by recalibrating with the average of the estimated isotonic regression functions, displayed in Figure 2.
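A minimal R sketch of this bagging step (p_recal, y_recal are the logistic predictions and outcomes on $\mathcal{R}$, p_val the predictions on $\mathcal{V}$; base R's isoreg with linear interpolation is an illustrative implementation choice):

iso_recal_fun <- function(p, y) {
  # recalibration function from one isotonic fit, evaluated by interpolation
  ord <- order(p)
  fit <- isoreg(p[ord], y[ord])
  function(p_new) approx(p[ord], fit$yf, xout = p_new, rule = 2, ties = mean)$y
}
bagged_recalibrate <- function(p_recal, y_recal, p_val, B_tilde = 100) {
  preds <- replicate(B_tilde, {
    i <- sample.int(length(y_recal), replace = TRUE)  # bootstrap sample of the recalibration set
    iso_recal_fun(p_recal[i], y_recal[i])(p_val)
  })
  rowMeans(preds)   # average of the recalibration curves, evaluated on the validation set
}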
The last row of Table 2 shows an e-value of approximately 6 implying only very weak evidence against the null hypothesis of calibration, once again illustrating the practical strength of both, bagging and recalibration methods. The estimated re-calibration function displayed in Figure 2 reinforces the importance of recalibrating the logistic model predictions by showing that it substantially deviates from the diagonal.
Figure 3
Histograms of p-values of the classical out-of-sample HL test based on the five binning procedures given in Appendix A with 5–20 bins each, resulting in a total of 80 test results.
For these two recalibration methods, the various natural implementation choices of the HL test, further described in Appendix A, result in p-values ranging between essentially 0 and 0.91 (and 0.53 respectively). The corresponding p-value histograms in Figure 3 (and the detailed results in Table 3) show the continuum of p-values, where the null hypothesis is rejected in approximately half of the cases at the $5\% $ level, implying that a researcher can essentially tailor the test decision to her will. As already noted by [17, 3, 22], this is a disconcerting state of affairs for a commonly used testing procedure and calls for more robust alternatives, such as the eHL test proposed in this paper. Appendix A further shows that the feasible eHL test version is affected less by such instabilities arising from the repeated sample splits, at least if B is chosen sufficiently large as in this application.
5 Discussion
This article proposes an e-test for perfect calibration, which is a safe testing counterpart to the widely used Hosmer-Lemeshow test. The proposed eHL test follows a simple betting interpretation (see [31]) where the e-value can be seen as the factor by which we multiply the bet against the hypothesis of perfect calibration. Intuitively, when accumulating money by the bet, we gain evidence against the null. Here, the e-value depends on the probability prediction, its corresponding realization, and a freely chosen alternative probability, which we suggest estimating in a two-step approach by isotonic regression. The ideal version of the test has power against all alternatives. In order to achieve this power guarantee, it is important that for any deviation from calibration, that is, for any deviation of $P\mapsto \mathbb{E}(Y|P)$ from the identity, there is an isotonic function with strictly smaller loss. Whether this property can be achieved with function classes other than isotonic functions is an interesting open question.
We assess the empirical performance of the test to detect quadratic model misspecifications. The simulations show that in samples of more than 2000 observations, the eHL test reliably detects levels of quadratic misspecification that [17, p. 973] describe as slight. The intrinsic flexibility of the e-values allows the application of stable data-driven methods (here isotonic regression) instead of the typical binning and counting technique in the HL test. However, this flexibility comes at the cost of lower power in small samples of less than 2000 observations.
The feasible version of our test is based on random splits of the training data. Since the null hypothesis ${\mathcal{H}_{\text{HL},n}}$ requires calibration conditional on the predictions ${P_{1}},\dots ,{P_{n}}$, one also obtains a valid test if the splits are performed systematically based on ${P_{1}},\dots ,{P_{n}}$, for example, by choosing two subsets with similar distribution of the ${P_{i}}$ in order for the isotonic regression estimator to extrapolate well. Systematic sampling approaches to increase the power have already been applied by [11] for testing treatment effects.
Our article focuses on the batch setting where a fixed sample of size n is available, rather than the online setting in which $({P_{i}},{Y_{i}})$, $i\in \mathbb{N}$, arrive sequentially. However, the fact that powerful tests based on isotonic regression can be constructed in the batch setting suggests that similar approaches may be fruitful for online testing. Kotowski et al. [21, Section 7.1] describe algorithms with sublinear regret for online isotonic regression (without the random permutation setting). We believe that in conjunction with parts (i) and (ii) of our Theorem 2.1, it is possible to derive power guarantees for sequential calibration tests where $\mathbb{E}[Y|P]$ is estimated sequentially with isotonic regression. We leave such extensions for future work.