1 Introduction
The logrank test is arguably the most important tool for the statistical comparison of time-to-event data between two groups of participants. Our main focus is on the case where the two groups are the treatment and control groups in a randomized controlled trial; the outcomes of interest are event times, that is, the time elapsed until an event of interest occurs. The logrank test, in turn, uses a simplified version of the proportional hazards model of Cox [3]. For a fixed sample size and under this model, Cox gave a simple but profound insight: inference can be performed using the partial likelihood of having observed the events in the particular order in which they were observed. Indeed, the logrank test [20, 24], the score test associated with Cox’ partial likelihood, is optimal for a fixed sample size and a restricted alternative. Large-sample properties of the logrank test are known in very general settings [41, 32, 1]. Nevertheless, it is clear that the fixed-sample regime can be overly restrictive. Indeed, due to ethical and practical constraints in human survival-time medical trials, interim analyses may be performed to terminate the study earlier than planned if needed. Consequently, it has been of fundamental importance to develop methods for the sequential analysis of time-to-event data in general and for the logrank test in particular.
In order to justify the use of sequential decision boundaries, asymptotic approximations over the study period have been developed for the logrank statistic [42, 36, 38]. The results in this line of work show the convergence of the sequentially computed logrank statistic to a rescaled Brownian motion under very general censoring and participant-arrival patterns. When interim analyses are only performed at discrete times, the decision boundaries based on continuously monitoring the logrank statistic are known to be overly conservative. This deficiency is addressed by group-sequential and α-spending methods, which, using knowledge of the interim analysis times relative to a predefined maximum number of events, allow for tighter decision boundaries [25, 22, 14]. These sequential methods allow several interim looks at the data to stop for efficacy (if the treatment proves beneficial) or futility (if the study is no longer likely to reach statistical significance).
Despite the profound impact that these methods have had on statistical practice, the requirement of a maximum sample size limits the utility of a promising but nonsignificant study once the maximum sample size is reached. Because of their design, extending such a trial makes it impossible to control the type-I error. Moreover, the evidence gathered in new—possibly unplanned—trials cannot be added in a typical retrospective meta-analysis when the number of trials or the timing of the meta-analysis depends on the trial results. Such dependencies introduce accumulation bias and invalidate the assumptions of conventional statistical procedures in meta-analysis [34]. In order to address these deficiencies, we look for flexible anytime-valid methods that provide type-I error control in two situations: (1) optional stopping, which refers to halting the experiment earlier or later than planned under arbitrary stopping rules, and (2) meta-analysis and optional continuation, which refers to the aggregation of evidence of possibly interdependent studies. Just like these existing methods, our approach is connected to early work by H. Robbins and collaborators [5, 16]. Most notably, however, existing approaches come with fixed stopping rules, which are not desirable in the use cases that are of our present interest. The details of the present approach are very different, and to some extent, as we will see, more straightforward.
The main result of this work is the anytime-valid (AV) logrank test, an anytime-valid test for the statistical comparison of time-to-event data from two groups of participants. The AV logrank test uses the exact ratio of the sequentially computed Cox partial likelihood as its test statistic. This is in contrast to the conventional logrank test, which can be interpreted as transforming a test statistic into a p-value whose distribution is, in all applications we are aware of, not determined exactly but rather approximated by a normal or a generalized beta distribution [24]. Such approximations are valid only in asymptotic senses, and even then, some further conditions must often hold. In contrast to these approximations, the exact anytime-valid logrank test has guaranteed type-I error control even for small sample sizes, without any need to worry about whether asymptotic approximations to the sampling distribution are justified. The advantage of having an exact test becomes particularly clear in the case of unbalanced allocation, in which the control and treatment groups start with different numbers of participants. For this case, it has been documented that α-spending approaches do not provide strong type-I error guarantees due to the approximations involved [51]. The basic version of the AV logrank test is, however, exact; unbalanced allocation presents no difficulties.
From a technical point of view, we show, under general patterns of incomplete observation, that under the composite null hypothesis our test statistic is a continuous-time martingale with expected value equal to one. Statistics with this sequential property are referred to as test martingales; they form the basis of anytime-valid tests [28]. The AV logrank test is a concrete instance of such a test martingale derived from the recent theory of anytime-valid hypothesis testing based on E-processes [29, 9, 37, 46]. At each time an event takes place, it takes on the form of a likelihood ratio between two multinomial distributions in a sampling-without-replacement setting. Together with the underlying assumption of continuous time, this without-replacement property makes the test martingale and corresponding filtration different in spirit from those in most existing work (such as that of Lindon and Malek [19], who consider discrete test martingales for multinomial experiments under sampling-with-replacement). Nevertheless, just as in [48, 49], which also consider a sampling-without-replacement (but otherwise quite different) setting, we can show similar results on anytime-validity as in discrete-time, sampling-with-replacement settings.
In contrast to p-values, an analysis based on E-processes can extend existing trials as well as inform the decision to start new trials and meta-analyses, while still controlling type-I error rate. Type-I error control is retained even (i) if the E-process is monitored continuously and the trial is stopped early whenever the evidence is convincing, (ii) if the evidence of a promising trial is increased by extending the experiment and (iii) if a trial result spurs a new trial with the intention to combine them in a meta-analysis.
The AV logrank test was developed with a specific application in mind, which illustrates its usefulness. Some of the authors were involved in applying the AV logrank test to the continuous meta-analysis of seven Coronavirus disease (COVID-19) clinical trials—the results are available as a living systematic review including code and summary data to reproduce the analysis [33]. This analysis was performed concurrently with the trials in a so-called Anytime Live and Leading Interim (ALL-IN) meta-analysis [35]. We remark that even in the presence of dependencies between the existence and size of the trials, the test based on the multiplication of the values of E-processes retains type-I error control as long as all trials test the same global null hypothesis, as was the case in the above application. This is generally useful if we want to combine the results of several trials in a bottom-up retrospective meta-analysis, where ‘bottom-up retrospective’ means that the trials did not happen in coordination, and the decision to group them together was made after they had already started, so that no coordinated ‘top-down’ stopping rule can be enforced. It is even possible to obtain an interim meta-analysis result by combining interim results of ongoing trials by multiplication, stepping beyond the realm of existing sequential approaches.
1.1 Contributions and Outline
We begin with Section 2, where we review the special instance of Cox’ proportional hazards model for the two-group setting. There, we set the assumptions and notation used in the rest of the article. The definitions presented there are standard. In Section 3, we define the AV logrank test and prove that it is indeed anytime-valid. We first do this for (a) the case with only a group indicator (no other covariates) and without simultaneous events (ties). There, we also discuss its optimality properties and extend it to (b) the case with ties and to (c) the case when one wants to learn the actual effect size from the data. As is usual for anytime-valid tests, these tests keep providing nonasymptotic type-I error control even if the alternative is wildly misspecified. This is of particular interest for the important case in which the null expresses ‘both groups are indistinguishable’, highlighted in Example 3.2.
These results hinge on showing that the likelihood underlying Cox’ proportional hazards model can be used to define E-variables and test martingales. In Section 4, we show a Gaussian approximation to the AV logrank statistic that is useful in the common situation when only summary statistics are available. We then provide extensive computer simulations to compare the AV logrank test to the classic logrank test and α-spending approaches. In Section 4.1, we show that the exact AV logrank test has a similar rejection region to O’Brien-Fleming α-spending for those designs and hazard ratios where it is well-approximated by a Gaussian AV logrank test. While always needing a small amount of extra data in the design phase (the price for indefinite optional continuation), the expected sample sizes needed for true rejections remain very competitive. During the design phase of a study, we might want to design for a maximum sample size in order to achieve a certain power, but need a smaller sample size on average during the study since we can safely engage in optional stopping. In Section 5, we show that AV-logrank-type tests can be combined through multiplication to perform meta-analysis, and in Section 6, we show how the test can be used to derive confidence sequences for the hazard ratio. In Section 7, we compare the sample sizes that are needed during the design phase in order to achieve a targeted power. Throughout the paper, we present several different instantiations of the AV logrank test. In the final Section 8 we provide recommendations for which one to use in which practical situation. There we also make concluding remarks and discuss future research directions, in particular the possibility to include covariates other than group membership, i.e. the full Cox model, which is theoretically straightforward but practically challenging unless the number of covariates remains small.
We remark that once the definitions are in place, the technical results are mostly straightforward consequences from earlier work; in particular, of the work of Cox [4], Slud [39] and Andersen et al. [1]. The novelty of the present work is thus mainly in defining the AV logrank test and showing by computer simulation that, while being substantially more flexible, it is competitive with existing approaches—the classic logrank test with fixed design and in combination with α-spending.
Next to the main body of this article, we provide two appendices. We delegate to Appendix A proofs and remarks that, while important, are not needed to follow the main development. Most importantly, the particular E-variable we design is growth-rate optimal in the worst case, GROW (see Section 3.1). Grünwald et al. [9] provide several motivations for this criterion; we provide an additional one using an argument of Breiman [2], which does not seem to be widely known. This argument shows a connection between growth-rate optimality and tests with minimal expected stopping time. In Appendix B, we discuss additional details of the extension to the case of general covariates.
2 Proportional Hazards Model and Cox’ Partial Likelihood
We begin by describing the hypothesis that is being tested, the data that are available, and Cox’ proportional hazards model. We are interested in comparing the survival rates between two groups of participants, Group A and Group B. In a randomized controlled trial, Group A would signify the control group; Group B, the treatment group. We assume that the available data about m participants are a subset of $\{({X^{i}},{g^{i}},{\delta ^{i}}):i=1,\dots ,m\}$, where ${X^{i}}=\min \{{T^{i}},{C^{i}}\}$ is the minimum between the event time ${T^{i}}$ and the (possibly infinite) censoring time ${C^{i}}$; ${g^{i}}$ is a zero-one covariate depending on group membership (${g^{i}}=0$ signifies that $i\in A$; ${g^{i}}=1$, that $i\in B$); and ${\delta ^{i}}=\mathbf{1}\{{X^{i}}={T^{i}}\}$ is the indicator of whether the event was witnessed before censoring or not. The idea is that at (continuous) time t, we have observed all (and only) those tuples $({X^{i}},{g^{i}},{\delta ^{i}})$ with ${X^{i}}\le t$, i.e., those participants that have already experienced an event or been censored. Let ${m^{A}}$ be the number of members of Group A and ${m^{B}}$ the number of members of Group B—then ${m^{A}}+{m^{B}}=m$. Define $\mathbf{g}=({g^{1}},\dots ,{g^{m}})$, the vector of group memberships. We assume that ${T^{1}},\dots ,{T^{m}},{C^{1}},\dots ,{C^{m}}$ are independent and have continuous distribution functions. The continuity assumption precludes tied observations; we relax this assumption later on, in Section 3.3. For $i=1,\dots ,m$, the survival rates are quantified by the hazard functions ${\lambda ^{i}}={({\lambda _{t}^{i}})_{t\ge 0}}$ of ${T^{i}}$, given by
(2.1)
\[ {\lambda _{t}^{i}}=\underset{\Delta t\downarrow 0}{\lim }\frac{1}{\Delta t}\mathbf{P}\{t\le {T^{i}}\lt t+\Delta t\hspace{2.5pt}|\hspace{2.5pt}t\le {T^{i}}\}.\]
As is customary, the hazard function ${\lambda ^{i}}$ at t can be interpreted via the conditional probability of witnessing an event in a short time span provided that the event has not been witnessed up to t, that is,
(2.2)
\[ \mathbf{P}\{t\le {T^{i}}\lt t+\Delta t\hspace{2.5pt}|\hspace{2.5pt}t\le {T^{i}}\}={\lambda _{t}^{i}}\Delta t+o(\Delta t)\hspace{2.5pt}\text{as}\hspace{5pt}\Delta t\to 0.\]
Given our interest in comparing the survival rates between the two groups, suppose that all participants i of Group A have a common hazard function ${\lambda _{t}^{i}}={\lambda _{t}^{A}}$; members i of Group B, ${\lambda _{t}^{i}}={\lambda _{t}^{B}}$ (thus, for all participants i within the same group, the event times ${T^{i}}$ are identically distributed). Using the data, we wish to test proportional hazards hypotheses. Concretely, we test the hypothesis ${\mathcal{H}_{0}}$ that the hazard functions of the members of both groups satisfy ${\lambda _{t}^{B}}={\theta _{0}}{\lambda _{t}^{A}}$, against an alternative hypothesis ${\mathcal{H}_{1}}$ that ${\lambda _{t}^{B}}=\theta {\lambda _{t}^{A}}$ for a $\theta \ne {\theta _{0}}$. As a first application of the methods that we develop, we consider a left-sided alternative, that is,
(2.3)
\[\begin{aligned}{}& {\mathcal{H}_{0}}:{\lambda _{t}^{B}}={\theta _{0}}{\lambda _{t}^{A}}\hspace{5pt}\text{vs.}\hspace{7.5pt}{\mathcal{H}_{1}}:{\lambda _{t}^{B}}=\theta {\lambda _{t}^{A}}\\ {} & \hspace{1em}\text{for some}\hspace{5pt}\theta \le {\theta _{1}}\lt {\theta _{0}}\hspace{2.5pt}\text{and all}\hspace{2.5pt}t,\end{aligned}\]
where θ is known as the hazard ratio and is the main quantity of statistical interest, and ${\theta _{1}}$ would be, in a clinical trial, a minimal clinically relevant effect size. This alternative is what we hope for in the case of negative events, such as death, with treatments that are set out to lower (relative to the control condition) the hazard rate. Notice that the hypotheses in (2.3) are, in fact, nonparametric. Similarly, if the event is positive, e.g., recovery from an infection, we would typically set a right-sided alternative, which can also be treated with the present methods. Two-sided alternatives, as well as the full alternative hypothesis ${\mathcal{H}^{\prime }_{1}}:\theta \ne {\theta _{0}}$, are also amenable to the methods that will follow. We remark, however, that all the methods retain their type-I error guarantees irrespective of the specific alternative that we use. As will be seen in Example 3.2 below, this is of particular interest in the simple, yet important case with ${\theta _{0}}=1$ and full alternative ${\mathcal{H}^{\prime }_{1}}:\theta \ne 1$—in that case, we can think of our test as really testing ‘the treatment has no effect’ against the general alternative ‘the treatment has an effect’ (even though the proportional hazards assumption may be substantially violated).
We now turn to defining Cox’ partial likelihood ${\mathrm{PL}_{t}}$, which is at the center of our approach. To that end, we need a battery of standard definitions—we lay them out to establish the notation. Let ${y_{t}^{i}}=\mathbf{1}\{{X^{i}}\ge t\}$ be the at-risk process, that is, the indicator of whether participant i is still at risk at time t, and let ${\bar{y}_{t}^{A}}={\textstyle\sum _{i\in A}}{y_{t}^{i}}$ and ${\bar{y}_{t}^{B}}={\textstyle\sum _{i\in B}}{y_{t}^{i}}$ be the numbers of participants at risk in each of the groups at time t. Define ${\mathbf{y}_{t}}=({y_{t}^{1}},\dots ,{y_{t}^{m}})$, the vector of at-risk processes, and ${\mathcal{R}_{t}}=\{j:{y_{t}^{j}}=1\}$, the set of participants at risk at time t. Let ${T^{(1)}}\lt {T^{(2)}}\lt \cdots \lt {T^{({\bar{N}_{\infty }})}}$ be the ordered event times that were witnessed (not censored). Note that, if all participants witness the event and censoring is absent, ${\bar{N}_{\infty }}=m$. For each $k=1,\dots ,{\bar{N}_{\infty }}$, let ${I_{(k)}}$ be the index of the individual that witnessed the event at time ${T^{(k)}}$. This means, for example, that if the participant with label three was the fifth to witness the event, then ${I_{(5)}}=3$. Abbreviate by ${y_{(k)}^{i}}$, ${\bar{y}_{(k)}^{A}}$, ${\bar{y}_{(k)}^{B}}$, ${\mathcal{R}_{(k)}}$ the corresponding quantities at event time ${T^{(k)}}$, and define ${g^{(k)}}:={g^{{I_{(k)}}}}$. Cox’ partial likelihood ${\mathrm{PL}_{\theta ,t}}$ can be sequentially computed by
(2.4)
\[ {\mathrm{PL}_{\theta ,t}}=\prod \limits_{k:{T^{(k)}}\le t}\frac{{\theta ^{{g^{(k)}}}}}{{\textstyle\sum _{l\in {\mathcal{R}_{{T^{(k)}}}}}}{\theta ^{{g^{l}}}}}=\prod \limits_{k:{T^{(k)}}\le t}\frac{{\theta ^{{g^{(k)}}}}}{{\bar{y}_{(k)}^{A}}+\theta {\bar{y}_{(k)}^{B}}}.\]
Cox’ likelihood evaluated at the event times ${T^{(1)}},{T^{(2)}},\dots \hspace{0.1667em}$ coincides with that of a sequence of multinomial trials where, at event time ${T^{(k)}}$, each of the participants $i\in {\mathcal{R}_{(k)}}$ witnesses the event with probability
(2.5)
\[ \begin{aligned}{}{p_{\theta ,(k)}}(\hspace{2.5pt}i\hspace{2.5pt})& :=\mathbf{P}\{{I_{(k)}}=i\mid {\mathbf{y}_{(l)}},\mathbf{g};\hspace{2.5pt}l=1,\dots ,k\},\\ {} {p_{\theta ,(k)}}(\hspace{2.5pt}i\hspace{2.5pt})& =\frac{{\theta ^{{g^{i}}}}}{{\bar{y}_{(k)}^{A}}+\theta {\bar{y}_{(k)}^{B}}}.\end{aligned}\]
We note that if, under the null, the groups are indistinguishable (see also Example 3.2 below), $\theta ={\theta _{0}}=1$, (2.4) reduces to the likelihood of a sequence of draws without replacement from an urn that initially contains ${m^{A}}$ marbles of color A and ${m^{B}}$ marbles of color B:
(2.6)
\[ {\mathrm{PL}_{1,t}}=\prod \limits_{k:{T^{(k)}}\le t}\frac{1}{{\bar{y}_{(k)}^{A}}+{\bar{y}_{(k)}^{B}}}.\]
Cox showed that, indeed, conditionally on all the information accrued strictly before ${T^{(k)}}$, the probability that participant i witnesses an event at time ${T^{(k)}}$ is exactly ${p_{\theta ,(k)}}(\hspace{2.5pt}i\hspace{2.5pt})$ as long as the hazard ratio is θ. With these likelihood computations at hand, we are in a position to present the main contribution of this article, the AV logrank test, which uses the partial likelihood ratio as the test statistic.
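To make the sequential computation of (2.4) and (2.5) concrete, the following Python sketch evaluates the per-event probabilities and the running partial likelihood directly from the risk-set sizes; it assumes complete observation and no ties, and the function names and the toy risk-set trajectory are ours, not part of any existing package.

```python
def p_theta(theta, yA, yB, in_B):
    # Conditional probability (2.5) that the event falls on a given participant:
    # theta^g / (yA + theta * yB), with g = 1 if that participant is in group B.
    return (theta if in_B else 1.0) / (yA + theta * yB)

def partial_likelihood(theta, events):
    # Sequentially computed Cox partial likelihood (2.4); `events` lists, per
    # witnessed event, the numbers at risk in groups A and B just before the
    # event and whether the event happened in group B.
    pl = 1.0
    for yA, yB, in_B in events:
        pl *= p_theta(theta, yA, yB, in_B)
    return pl

# A made-up trajectory: the risk set of the relevant group shrinks by one after each event.
events = [(10, 10, False), (9, 10, True), (9, 9, False), (8, 9, False)]
print(partial_likelihood(1.0, events))   # null, theta = 1
print(partial_likelihood(0.7, events))   # alternative, theta = 0.7
```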
3 The AV Logrank Test
In this section the AV logrank test for (2.3) is introduced, and its type-I error guarantees and optimality properties are investigated. We give a solution to the first of the purposes laid down in the introduction: we show that the AV logrank test is anytime-valid—its type-I error guarantees are not affected by optional stopping. The fact that it is also type-I-error-safe under optional continuation, our second purpose, is proven in Section 5. Without further ado, we define the AV logrank statistic ${S_{{\theta _{0}},t}^{{\theta _{1}}}}$ (typically, ${\theta _{0}}=1$) for (2.3) as the partial likelihood ratio
(3.1)
\[ {S_{{\theta _{0}},t}^{{\theta _{1}}}}=\frac{{\mathrm{PL}_{{\theta _{1}},t}}}{{\mathrm{PL}_{{\theta _{0}},t}}}=\prod \limits_{k:{T^{(k)}}\le t}\frac{{p_{{\theta _{1}},(k)}}({I_{(k)}})}{{p_{{\theta _{0}},(k)}}({I_{(k)}})}.\]
Here, ${p_{\theta ,(k)}}$ is as defined in (2.5); the product that defines our statistic ${S_{{\theta _{0}},t}^{{\theta _{1}}}}$ runs over the events that have been witnessed up to and including time t, and the empty product is taken to be equal to one. As is conventional with likelihood ratios, high values of ${S_{{\theta _{0}},t}^{{\theta _{1}}}}$ are indicative that the alternative hypothesis is better than the null hypothesis at describing the data. Given a tolerable type-I error bound α and an arbitrary random time τ, the AV logrank test is the test that rejects the null hypothesis if ${S_{{\theta _{0}},\tau }^{{\theta _{1}}}}$ is above the threshold $1/\alpha $, that is,
(3.2)
\[ {\xi _{{\theta _{0}},\tau }^{{\theta _{1}}}}=\mathbf{1}\{{S_{{\theta _{0}},\tau }^{{\theta _{1}}}}\ge 1/\alpha \}:=\left\{\begin{array}{l@{\hskip10.0pt}l}1\hspace{1em}& \hspace{2.5pt}\text{if}\hspace{2.5pt}{S_{{\theta _{0}},\tau }^{{\theta _{1}}}}\ge 1/\alpha \\ {} 0\hspace{1em}& \hspace{2.5pt}\text{otherwise}.\end{array}\right.\]
As we will see, by its sequential properties, ${S_{{\theta _{0}},t}^{{\theta _{1}}}}$ takes large values with small probability under the null hypothesis uniformly over time, which translates into type-I error control for the test ${\xi _{{\theta _{0}},\tau }^{{\theta _{1}}}}$. This observation is behind the anytime validity of the AV logrank test, and of anytime-valid tests in general (more details and general constructions for anytime-valid sequential testing can be found in the work of Ramdas et al. [28]). We show in the following proposition that the test ${\xi _{{\theta _{0}},\tau }^{{\theta _{1}}}}$ has the desired type-I error control.
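Before stating that proposition, the following sketch (under the same assumptions and naming conventions as the sketch at the end of Section 2) shows how (3.1) can be accumulated event by event and monitored against the threshold $1/\alpha $ of (3.2).

```python
def av_logrank_statistic(theta1, events, theta0=1.0, alpha=0.05):
    # Running AV logrank statistic (3.1) and the stopping rule (3.2).
    # `events` lists (yA, yB, in_B) per witnessed event, as in the sketch
    # after (2.6).  Returns the trajectory of S and the first event index at
    # which S >= 1/alpha (None if the threshold is never crossed).
    S, path, hit = 1.0, [], None
    for k, (yA, yB, in_B) in enumerate(events, start=1):
        num = (theta1 if in_B else 1.0) / (yA + theta1 * yB)
        den = (theta0 if in_B else 1.0) / (yA + theta0 * yB)
        S *= num / den
        path.append(S)
        if hit is None and S >= 1.0 / alpha:
            hit = k   # the null may be rejected here; stopping is optional
    return path, hit
```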
Proposition 3.1.
Let ${\mathbf{P}_{0}}$ be any distribution under which the hazard ratio is equal to ${\theta _{0}}$, and let τ be any random time. The test ${\xi _{{\theta _{0}},\tau }^{{\theta _{1}}}}=\mathbf{1}\{{S_{{\theta _{0}},\tau }^{{\theta _{1}}}}\ge 1/\alpha \}$, where ${S_{{\theta _{0}},t}^{{\theta _{1}}}}$ is as in (3.1), has level α, that is,
\[ {\mathbf{P}_{0}}\{{\xi _{{\theta _{0}},\tau }^{{\theta _{1}}}}=1\}={\mathbf{P}_{0}}\{{S_{{\theta _{0}},\tau }^{{\theta _{1}}}}\ge 1/\alpha \}\le \alpha .\]
This result can be readily obtained using the sequential-multinomial interpretation of Cox’ likelihood ratio. As we will see in Section 3.1, this result can be interpreted in terms of E-variables and E-processes [9]. Define the process ${({S_{{\theta _{0}},(k)}^{{\theta _{1}}}})_{k=1,2,\dots }}$ as the value of the AV logrank statistic at the event times ${T^{(k)}}$, that is, ${S_{{\theta _{0}},(k)}^{{\theta _{1}}}}:={S_{{\theta _{0}},{T^{(k)}}}^{{\theta _{1}}}}$. In this time discretization, the AV logrank statistic is the product of the random variables
(3.3)
\[ {R_{{\theta _{0}},(k)}^{{\theta _{1}}}}={p_{{\theta _{1}},(k)}}({I_{(k)}})/{p_{{\theta _{0}},(k)}}({I_{(k)}}),\]
the one-outcome partial likelihood ratio for the kth event, where ${p_{{\theta _{0}},(k)}}$ is as in (2.5) and $k=1,2,\dots \hspace{0.1667em}$.
Proof of Proposition 3.1.
Under any distribution for which the hazard ratio is ${\theta _{0}}$, the fact that the likelihood of observing ${I_{(k)}}$ conditionally on $\{{\mathbf{y}_{(l)}}:l=1,\dots ,k\}$ equals ${p_{{\theta _{0}},(k)}}({I_{(k)}})$ implies that
(3.4)
\[ \mathbf{E}[{R_{{\theta _{0}},(k)}^{{\theta _{1}}}}\mid {\mathbf{y}_{(1)}},\dots ,{\mathbf{y}_{(k)}}]=\sum \limits_{j\in {\mathcal{R}_{(k)}}}{p_{{\theta _{0}},(k)}}(j)\frac{{p_{{\theta _{1}},(k)}}(j)}{{p_{{\theta _{0}},(k)}}(j)}=1.\]
This immediately shows that ${S_{{\theta _{0}},(k)}^{{\theta _{1}}}}={\textstyle\prod _{l\le k}}{R_{{\theta _{0}},(l)}^{{\theta _{1}}}}$ is a test martingale, a nonnegative martingale with expected value equal to one, with respect to the filtration ${\mathcal{F}_{-}}={({\mathcal{F}_{{(k)^{-}}}})_{k=1,2,\dots }}$ of sigma-algebras ${\mathcal{F}_{{(k)^{-}}}}=\sigma ({\mathbf{y}_{(l)}}:l=1,\dots ,k)$. Next, the type-I error control for the test ${\xi _{{\theta _{0}},\tau }^{{\theta _{1}}}}$ follows from Ville’s inequality [44], which asserts that, under the null hypothesis, the test martingale ${S_{{\theta _{0}},(k)}^{{\theta _{1}}}}$ takes large values only with small probability:
\[ \mathbf{P}\left\{\underset{k=1,2,\dots }{\sup }{S_{{\theta _{0}},(k)}^{{\theta _{1}}}}\ge 1/\alpha \right\}\le \mathbf{E}[{S_{{\theta _{0}},(1)}^{{\theta _{1}}}}]\alpha =\alpha .\]
The previous display is a bound on the probability of ever making a type-I error when using the AV logrank test ${\xi _{{\theta _{0}},\tau }^{{\theta _{1}}}}$. □
Proposition 3.1 is a variation of the standard statement and result establishing that tests based on E-processes are anytime-valid [29]. As remarked by a referee, one notable difference is that, within a clinical trial setting, each subsequent factor in ${S_{{\theta _{0}},t}^{{\theta _{1}}}}$ would, in the standard statement, refer to an additional participant coming in, whereas in our setting, it corresponds to an event, meaning that a patient leaves. Nevertheless, the technique used to prove Proposition 3.1 is entirely analogous to the proof of the standard result (namely, Ville’s inequality), as has previously been shown to be the case in other anytime-valid work that deals, like ours, with a drawing-without-replacement setting [48, 49].
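As a sanity check on Proposition 3.1, the following simulation sketch estimates, under the null and with complete observation, the probability that the continuously monitored statistic ever crosses $1/\alpha $; the exponential event times and all parameter values are illustrative assumptions of this sketch, not requirements of the test, and the estimate should stay below α.

```python
import numpy as np

rng = np.random.default_rng(1)

def ever_rejects(mA, mB, theta1, alpha=0.05):
    # Simulate one trial under the null (hazard ratio 1, no censoring) and
    # report whether the continuously monitored statistic ever reaches 1/alpha.
    times = rng.exponential(1.0, mA + mB)          # same hazard in both groups
    groups = np.array([0] * mA + [1] * mB)         # 0 = A, 1 = B
    order = np.argsort(times)
    yA, yB, S = mA, mB, 1.0
    for i in order:                                # walk through events in time order
        in_B = groups[i] == 1
        num = (theta1 if in_B else 1.0) / (yA + theta1 * yB)
        den = 1.0 / (yA + yB)                      # null probability, theta0 = 1
        S *= num / den
        if S >= 1.0 / alpha:
            return True
        if in_B:
            yB -= 1
        else:
            yA -= 1
    return False

rejections = sum(ever_rejects(50, 50, theta1=0.7) for _ in range(2000))
print("estimated type-I error:", rejections / 2000)   # at most ~0.05 by Proposition 3.1
```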
Under general patterns of incomplete observation—like independent censoring or independent left truncation—, the AV logrank test provides the same type-I error guarantees. To prove this we do need to go beyond the standard, discrete-time setting: we give an alternative proof of Proposition 3.1 in Appendix A using the counting-process formalism [1]. There, we show that if the compensators of the underlying counting processes have a certain general product structure—which is the case under complete observation—, the AV logrank test is anytime-valid. We then refer to Andersen et al. [1], who show that this structure is preserved under said patterns of incomplete observation.
The AV-logrank test is optimal—in a sense to be defined in the next section—among a large family of statistics. A second look at the proof of Proposition 3.1 suggests a generalization of the AV logrank statistic given in (3.1). Let, for each k, ${q_{(k)}}$ be a probability distribution on participants in the risk set ${\mathcal{R}_{(k)}}$ which is only allowed to depend on ${\mathbf{y}_{(1)}},\dots ,{\mathbf{y}_{(k)}}$. Analogously to (3.3), we define the one-outcome ratio ${R_{{\theta _{0}},(k)}^{q}}:={q_{(k)}}({I_{(k)}})/{p_{{\theta _{0}},(k)}}({I_{(k)}})$—we now use ${q_{(k)}}$ instead of ${p_{{\theta _{1}}}}$—, and
(3.5)
\[ {S_{{\theta _{0}},t}^{q}}:=\prod \limits_{k:{T^{(k)}}\le t}{R_{{\theta _{0}},(k)}^{q}}=\prod \limits_{k:{T^{(k)}}\le t}\frac{{q_{(k)}}({I_{(k)}})}{{p_{{\theta _{0}},(k)}}({I_{(k)}})}.\]
A modification of the previous argument shows, for any random time τ, a type-I error guarantee for the test ${\xi _{{\theta _{0}},\tau }^{q}}$ based on the value of ${S_{{\theta _{0}},\tau }^{q}}$, that is, ${\xi _{{\theta _{0}},\tau }^{q}}:=\mathbf{1}\{{S_{{\theta _{0}},\tau }^{q}}\ge 1/\alpha \}$ (see Proposition 3.1). Any such test is also anytime valid as long as each ${q_{(k)}}$ depends on the data only through ${\mathbf{y}_{(1)}},\dots ,{\mathbf{y}_{(k)}}$. In Section 3.2, we use this generalization to provide tests when no value of ${\theta _{1}}$ is available. This generalization raises a natural question about the optimality of the AV logrank test based on (3.1) among test statistics of the form (3.5). This is the subject of the next section.
3.1 E-variables and Optimality
The random variables ${\{{R_{{\theta _{0}},(k)}^{{\theta _{1}}}}\}_{k=1,2\dots }}$ from (3.3) and ${\{{R_{{\theta _{0}},(k)}^{q}}\}_{k=1,2\dots }}$ from (3.5) are examples of (conditional) E-variables—nonnegative random variables whose (conditional) expected value is below 1 under every distribution in the null hypothesis. E-variables and E-processes are the “correct” generalization of likelihood ratios to the case that either or both ${\mathcal{H}_{0}}$ and ${\mathcal{H}_{1}}$ are composite and can be interpreted in terms of gambling [9, 37, 28]. Under this gambling interpretation, a test martingale, a product of conditional E-variables, is the total profit made in a sequential gambling game where no earnings are expected under the null hypothesis. The analogy is thus between profit and evidence: no evidence can be gained against the null hypothesis if it is true. Just as p-values, the definition of E-variables and test martingales does not need any mention of an alternative hypothesis. However, if a composite set of alternative distributions is available, a gambler who is skeptical of the null distribution might want to maximize the speed of evidence accumulation (or of capital growth) under the alternative hypothesis. The worst-case growth rate is defined (conservatively) as the smallest expectation of the logarithm of the E-variable under the alternative. Consequently, any E-variable achieving it is called GROW, for Growth-Rate Optimal in the Worst case (see the work of Grünwald [9] and Shafer [37] for additional reasons to use this optimality criterion).
We instantiate this reasoning to our present problem. For the left-sided alternative (2.3), the choice ${R_{{\theta _{0}},(k)}^{{\theta _{1}}}}$ is conditionally GROW because it maximizes the worst-case conditional growth rate
\[ {R_{{\theta _{0}},(k)}^{q}}\mapsto \underset{\theta \le {\theta _{1}}}{\min }{\mathbf{E}_{\theta }}[\ln {R_{{\theta _{0}},(k)}^{q}}|{\mathbf{y}_{(1)}},\dots ,{\mathbf{y}_{(k)}}],\]
over all valid choices of ${q_{(k)}}$ (which can only depend on the data through ${\mathbf{y}_{(1)}},\dots ,{\mathbf{y}_{(k)}}$), that is,
\[\begin{aligned}{}& \underset{\theta \le {\theta _{1}}}{\min }{\mathbf{E}_{\theta }}[\ln {R_{{\theta _{0}},(k)}^{{\theta _{1}}}}|{\mathbf{y}_{(1)}},\dots ,{\mathbf{y}_{(k)}}]\\ {} & \hspace{1em}=\underset{q}{\max }\underset{\theta \le {\theta _{1}}}{\min }{\mathbf{E}_{\theta }}[\ln {R_{{\theta _{0}},(k)}^{q}}|{\mathbf{y}_{(1)}},\dots ,{\mathbf{y}_{(k)}}].\end{aligned}\]
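The following numerical sketch illustrates this maximin property at a single event time: among candidate choices ${q_{(k)}}={p_{{\theta ^{\prime }},(k)}}$, the worst-case conditional growth rate over the left-sided alternative is largest at ${\theta ^{\prime }}={\theta _{1}}$. The risk-set sizes, the candidate values, and the grid over the alternative are arbitrary choices of ours, made only for illustration.

```python
import numpy as np

def growth(theta_true, theta_q, yA, yB, theta0=1.0):
    # Conditional expected log of the one-outcome ratio R^q with q = p_{theta_q},
    # when the data-generating hazard ratio is theta_true (one event, risk sets yA, yB).
    pB_true = theta_true * yB / (yA + theta_true * yB)            # P(event in group B)
    ratio_A = (1.0 / (yA + theta_q * yB)) / (1.0 / (yA + theta0 * yB))
    ratio_B = (theta_q / (yA + theta_q * yB)) / (theta0 / (yA + theta0 * yB))
    return (1 - pB_true) * np.log(ratio_A) + pB_true * np.log(ratio_B)

yA, yB, theta1 = 50, 50, 0.7
thetas_in_H1 = np.linspace(0.1, theta1, 200)                      # alternative: theta <= theta1
for theta_q in (0.5, 0.7, 0.9):
    worst = min(growth(t, theta_q, yA, yB) for t in thetas_in_H1)
    print(f"q = p_{theta_q}: worst-case growth {worst:.5f}")
# For these candidates the worst case over theta <= theta1 is attained at
# theta = theta1, and it is largest for theta_q = theta1 = 0.7, as the GROW
# criterion predicts.
```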
In Appendix A.1, we show that in the limit that the risk sets are much larger than the number of events that are witnessed, this worst-case growth criterion yields a test that minimizes the worst-case expected stopping time—under the alternative hypothesis—among the tests that stop as soon as ${S_{{\theta _{0}},t}^{q}}\ge 1/\alpha $. Thus, among all possible AV logrank tests of the form (3.5), there are strong reasons to choose ${\xi _{{\theta _{0}},\tau }^{{\theta _{1}}}}$.
In a similar fashion, a test can be constructed for two-sided alternatives. Indeed, consider a testing problem of the form
(3.6)
\[ \begin{aligned}{}{\mathcal{H}_{0}}& :{\lambda ^{B}}={\lambda ^{A}}\hspace{5pt}\text{vs.}\hspace{7.5pt}\\ {} {\mathcal{H}_{1}}& :{\lambda ^{B}}=\theta {\lambda ^{A}}\hspace{5pt}\text{for some}\hspace{5pt}\theta \le {\theta _{1}}\hspace{7.5pt}\text{or}\hspace{5pt}\theta \ge 1/{\theta _{1}},\end{aligned}\]
where ${\theta _{1}}\lt 1$. For this problem, we can create a weighted, conditionally GROW, E-variable by using ${R^{2-\mathrm{sided}}}=\frac{1}{2}{R_{{\theta _{0}},(k)}^{{\theta _{1}}}}+\frac{1}{2}{R_{{\theta _{0}},(k)}^{1/{\theta _{1}}}}$.
3.2 Learning the Hazard Ratio from Data
So far, the alternative hypotheses that we have studied are of the form ${\mathcal{H}_{1}}:\theta \le {\theta _{1}}$ for some value of ${\theta _{1}}\lt 1$. In some cases, such a value of ${\theta _{1}}$ is available from the context of the analysis. For instance, ${\theta _{1}}$ can correspond to a minimal clinically relevant effect that is satisfactory in a medical trial. However, sometimes it is not clear which value of ${\theta _{1}}$ to choose. Still, statistics of the form (3.5) are useful to test a null hypothesis ${\mathcal{H}_{0}}$ as in (2.3). Indeed, for each k, we can use conditional probability mass functions ${q_{(k)}}$ that depend on the data observed at times $t\lt {T^{(k)}}$ and enable us to implicitly learn the hazard ratio θ. Two standard types of such alternatives are readily available [29]: a prequential plug-in likelihood and a Bayes predictive distribution. We proceed to describe the former, on which we base our experiments. For completeness we also briefly describe the latter in Appendix A.4.
Using only the data observed in $t\lt {T^{(k)}}$, let ${\breve{\theta }_{(k)}}$ be the smoothed maximum likelihood estimator
(3.7)
\[ {\breve{\theta }_{(k)}}=\underset{\theta \ge 0}{\operatorname{arg\,max}}\left({p_{\theta ,0}}\times \prod \limits_{{k^{\prime }}:{T^{({k^{\prime }})}}\lt {T^{(k)}}}{p_{\theta ,({k^{\prime }})}}({I_{({k^{\prime }})}})\right),\]
where ${p_{\theta ,0}}$ is a smoothing based on the likelihood of having observed two “virtual” data points prior to the observed data, that is, ${p_{\theta ,0}}=1/({\bar{y}_{0}^{A}}+1+\theta ({\bar{y}_{0}^{B}}+1))\times \theta /({\bar{y}_{0}^{A}}+\theta ({\bar{y}_{0}^{B}}+1))$. The statistic ${S_{{\theta _{0}},t}^{\mathrm{preq}}}$ is (3.5) with ${q_{(k)}}={p_{{\breve{\theta }_{(k)}},(k)}}$, and it can also be used to define an anytime-valid test. With this choice, the process ${q_{(1)}},{q_{(2)}},\dots \hspace{0.1667em}$ is a typical instance of a prequential plug-in likelihood, which is often based on suitably smoothed likelihood-based estimators [11]. The phrase ‘prequential’ is due to [6], who reinvented and investigated the method, which is in fact much older, going back to [45]; see [29] for its history. The rationale behind this method is the following. Suppose the data are actually sampled from a distribution according to which the hazard ratio is θ. For sufficiently large initial risk sets, that is, if ${\bar{y}_{0}^{A}}$ and ${\bar{y}_{0}^{B}}$ are not too small, by the law of large numbers, the smoothed maximum likelihood estimate ${\breve{\theta }_{(k)}}$ will with high probability be close to θ. Therefore, ${p_{{\breve{\theta }_{(k)}},(k)}}$ will behave more and more like the real ${p_{\theta ,(k)}}$ from which the data are sampled. Thus, the process ${S_{{\theta _{0}},t}^{\mathrm{preq}}}$ will behave more and more similarly to the “correct” partial likelihood ratio (3.1).
Although we will not do so in the experiments to follow, in principle we could also extend this approach to incorporate prior knowledge or guesses about the hazard ratio under ${\mathcal{H}_{1}}$ by adding more virtual data points when calculating the estimator ${\breve{\theta }_{(k)}}$. By the remark underneath (3.5), this will not affect anytime-valid type-I error control. Technically, we need only modify the ${p_{\theta ,0}}$ term in (3.7), which then plays a role analogous to a Bayesian prior: the more virtual data points we add in each group, the more influence they have on the likelihood; the ratio of virtual points in both groups determines our prior expectation of the hazard ratio under the alternative. As shown in Appendix A.4, we could also directly use a Bayesian prior, making our approach an instance of Robbins’ [5] method of mixtures.
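The following sketch implements the prequential plug-in construction: the smoothed maximum likelihood estimate (3.7) is recomputed before each event (here by a simple grid search, our own implementation choice, with an arbitrary grid) and then plugged into the next factor of the statistic; the function names and example inputs are ours.

```python
import numpy as np

def smoothed_mle(history, yA0, yB0, grid=np.linspace(0.01, 5.0, 500)):
    # Smoothed maximum likelihood estimate (3.7) of the hazard ratio.
    # `history` lists the events observed so far as (yA, yB, in_B); the
    # smoothing term p_{theta,0} uses the initial risk sets yA0 and yB0.
    def log_lik(theta):
        ll = -np.log(yA0 + 1 + theta * (yB0 + 1))                 # first virtual factor
        ll += np.log(theta) - np.log(yA0 + theta * (yB0 + 1))     # second virtual factor
        for yA, yB, in_B in history:
            ll += (np.log(theta) if in_B else 0.0) - np.log(yA + theta * yB)
        return ll
    return grid[np.argmax([log_lik(t) for t in grid])]

def prequential_statistic(events, yA0, yB0, theta0=1.0):
    # Before each event, plug in the estimate computed from the *previous*
    # events only, then multiply the resulting one-outcome ratios.
    S, history = 1.0, []
    for yA, yB, in_B in events:
        theta_hat = smoothed_mle(history, yA0, yB0)
        num = (theta_hat if in_B else 1.0) / (yA + theta_hat * yB)
        den = (theta0 if in_B else 1.0) / (yA + theta0 * yB)
        S *= num / den
        history.append((yA, yB, in_B))
    return S

events = [(100, 100, True), (100, 99, True), (100, 98, False)]
print(prequential_statistic(events, yA0=100, yB0=100))
```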
Example 3.2.
If we set ${\theta _{0}}=1$ in (2.3), then the null expresses that both groups are indistinguishable, corresponding to a likelihood that, at each event time ${T^{(k)}}$, is proportional to the ratio between the number of participants still at risk in each group at that time. In a randomized clinical trial this would mean that the treatment has no effect. Interestingly and importantly, the anytime-valid Type-I error guarantee holds even if the alternative is arbitrarily misspecified—it may be the case that if the alternative is true, then the proportional hazard assumption does not hold. In particular, if we take ${\mathcal{H}^{\prime }_{1}}:\theta \ne 1$ and use ${S_{{\theta _{0}},t}^{\mathrm{preq}}}$ as defined underneath (3.7), we can think of our test as really testing ‘the treatment has no effect’ against general alternative ‘the treatment has an effect’, where the test will have power as soon as there exists some $\theta \ne 1$ that will eventually lead to a higher partial likelihood than $\theta =1$—this will often be the case even if the proportional hazards assumption is substantially violated. On the other hand, we can, of course, not make any claims about optimality in the GROW or any other sense in this misspecified case.
Now assume again that the model is well-specified (data are generated from a distribution satisfying proportional hazards for some θ). Then given a target value ${\theta _{1}}$—a minimal clinically relevant effect size—the worst-case logarithmic growth rate of ${S_{{\theta _{0}},t}^{\mathrm{preq}}}$ will in general be smaller than that of the GROW ${S_{{\theta _{0}},t}^{{\theta _{1}}}}$. Nevertheless, ${S_{{\theta _{0}},t}^{\mathrm{preq}}}$ can come close to the optimal for a whole range of potentially data-generating θ and may thus sometimes be preferable over choosing ${S_{{\theta _{0}},t}^{{\theta _{1}}}}$. More precisely, the use of a prior allows us to exploit favorable situations in which θ is even smaller (more extreme) than ${\theta _{1}}$. In such situations, the GROW ${S_{{\theta _{0}},t}^{{\theta _{1}}}}$ is effectively misspecified. By using ${S_{{\theta _{0}},t}^{\mathrm{preq}}}$ that learns from the data, we may actually obtain a test martingale that grows faster than the GROW ${S_{{\theta _{0}},t}^{{\theta _{1}}}}$, which is fully committed to detecting the worst-case ${\theta _{1}}$.
Figure 1
We show the number of events at which one can stop while retaining 80% power at $\alpha =0.05$ using the process ${S_{{\theta _{0}},t}^{{\theta _{1}}}}$ with ${\theta _{0}}=1$ and ${\theta _{1}}=0.80$ when the true hazard ratio θ generating the data is different from ${\theta _{1}}$. “Oracle” means that the method is specified with knowledge of the true θ, which in reality is unknown. Note that the y-axis is logarithmic.
In Figure 1, we illustrate such a situation where we start with 1000 participants in both groups (we defer general recommendations for when to use ${S_{{\theta _{0}},t}^{{\theta _{1}}}}$ and when to use ${S_{{\theta _{0}},t}^{\mathrm{preq}}}$ to Section 8.4; for now, we merely aim to explain the difference). We generated data using different hazard ratios, and used a ‘misspecified’ ${S_{{\theta _{0}},t}^{{\theta _{1}}}}$ that always used ${\theta _{1}}=0.8$. Note that this is still the GROW (minimax optimal) test martingale for ${\mathcal{H}_{1}}:\theta \le {\theta _{1}}=0.8$. If we knew the true θ, however, we could use the test martingale ${S_{{\theta _{0}},t}^{\theta }}$, which grows faster. We will call the test based on this latter martingale the oracle exact AV logrank test because it is based on inaccessible (oracle) knowledge. We estimated the number of events needed to reject the null with 80% power for ${S_{{\theta _{0}},t}^{0.8}}$, the oracle ${S_{{\theta _{0}},t}^{\theta }}$, and the prequential plug-in ${S_{{\theta _{0}},t}^{\mathrm{preq}}}$. In all cases, we used the aggressive stopping rule that stops as soon as the statistic in question crosses the threshold $1/\alpha =20$. We see that, as the true θ gets smaller than 0.8, we need fewer events using the GROW test ${S_{{\theta _{0}},t}^{0.8}}$ (the data are favorable to us), but using the oracle exact AV logrank test we get a considerable additional reduction. The prequential plug-in ${S_{{\theta _{0}},t}^{\mathrm{preq}}}$ ‘tracks’ the oracle ${S_{{\theta _{0}},t}^{\theta }}$ by learning the true θ from the data: for θ near 0.8, it behaves worse (more data are needed) than ${S_{{\theta _{0}},t}^{0.8}}$ (which knows the right θ from the start), but for $\theta \lt 0.6$ it starts to behave better. For comparison, we also added the methods discussed in Section 4.1. Notably, the O’Brien-Fleming procedure, even though unsuitable for optional continuation, needs even more events than the misspecified AV logrank test ${S_{{\theta _{0}},t}^{0.8}}$ as soon as θ goes below 0.8. The simulations were performed using exactly the same algorithms as for Figure 4, so the y-axis at $\theta =0.8$ coincides with that of Figure 4, but now with absolute rather than relative numbers; details are described in Appendix A.5.
3.3 Tied Observations
Here, we propose a sequential test for applications where events are not monitored continuously, but only at certain observation times. In this case, more than one event may be witnessed in the time interval between two observation moments. Since the order in which these events occurred would be unknown, our previous approaches fail to offer a satisfactory sequential test. Assume that we make observations at times ${t_{0}}\lt {t_{1}}\lt {t_{2}}\lt \cdots \hspace{0.1667em}$ that are fixed before the start of the study. Even though we assume the absence of censoring in this section, this approach can be adapted to its presence under an additional common assumption: that the events reported between two observation times ${t_{k-1}}$ and ${t_{k}}$ precede any censorings, so that censored patients contribute fully to the risk sets under consideration. We assume that the available data are of the form $({O_{1}^{A}},{O_{1}^{B}}),({O_{2}^{A}},{O_{2}^{B}}),\dots \hspace{0.1667em}$, where ${O_{k}^{A}}$ and ${O_{k}^{B}}$ are the number of events witnessed in each group in the time interval $({t_{k-1}},{t_{k}}]$, and ${O_{k}}={O_{k}^{A}}+{O_{k}^{B}}$ is the total. Notice that since the observation times are discrete, we can index the observations by k instead of ${t_{k}}$. For each k, let ${\bar{y}_{k}^{A}}={\textstyle\sum _{j\in A}}{y_{{t_{k}}}^{j}}$ be the number of participants at risk at time ${t_{k}}$, define similarly ${\bar{y}_{k}^{B}}$, and let ${\bar{y}_{k}}={\bar{y}_{k}^{A}}+{\bar{y}_{k}^{B}}$ be the total. We derive an anytime-valid test—a test valid at any observation time—for the problem (2.3), where the hazard ratio under the null hypothesis is ${\theta _{0}}=1$, as in Example 3.2. The reason for this restriction in the null hypothesis—only ${\theta _{0}}=1$ is allowed—will soon become clear. Observe that, at time ${t_{k}}$ and under the null hypothesis, conditionally on $({\bar{y}_{k-1}^{A}},{\bar{y}_{k-1}^{B}})$ and the total number of events ${O_{k}}$, the number of events ${O_{k}^{B}}$ in group B follows a hypergeometric distribution. This implies that, conditionally on $({\bar{y}_{k-1}^{A}},{\bar{y}_{k-1}^{B}},{O_{k}})$, the conditional likelihood of observing ${O_{k}^{B}}$ is ${p_{k}}({O_{k}^{B}})={p_{\mathrm{Hyper}}}({O_{k}^{B}};\hspace{2.5pt}{\bar{y}_{k-1}},{\bar{y}_{k-1}^{B}},{O_{k}})$, where ${p_{\mathrm{Hyper}}}$ is the probability mass function of a hypergeometric random variable, that is,
\[ {p_{\mathrm{Hyper}.}}({o^{B}};\hspace{2.5pt}\bar{y},{\bar{y}^{B}},o)=\frac{\left(\genfrac{}{}{0.0pt}{}{{\bar{y}^{B}}}{{o^{B}}}\right)\left(\genfrac{}{}{0.0pt}{}{\bar{y}-{\bar{y}^{B}}}{o-{o^{B}}}\right)}{\left(\genfrac{}{}{0.0pt}{}{\bar{y}}{o}\right)}.\]
With this observation at hand, we can build, analogously to (3.5) from the continuous-monitoring case, anytime-valid tests based on partial likelihood ratios,
(3.8)
\[ {S_{1,k}^{q}}:=\prod \limits_{l\le k}\frac{{q_{l}}({O_{l}^{B}})}{{p_{l}}({O_{l}^{B}})},\]
where each ${q_{k}}$ is a conditional distribution on the possible values of ${O_{k}^{B}}$ that only depends on the data up to time ${t_{k-1}}$, and ${p_{k}}$ is the hypergeometric null likelihood defined above. Following the same steps as in Section 3, a sequential test based on monitoring whether ${S_{1,k}^{q}}$ crosses the threshold $1/\alpha $ is also anytime valid at level α.
Lemma 3.3.
Let ${t_{\kappa }}\in \{{t_{1}},{t_{2}},\dots \hspace{0.1667em}\}$ be an arbitrary random time. The test ${\xi _{1,\kappa }^{q}}$ given by ${\xi _{1,\kappa }^{q}}=\mathbf{1}\{{S_{1,\kappa }^{q}}\ge 1/\alpha \}$, where ${S_{1,\kappa }^{q}}$ is as in (3.8), has type-I error bounded by α, that is,
\[ {\mathbf{P}_{0}}\{{\xi _{1,\kappa }^{q}}=1\}={\mathbf{P}_{0}}\{{S_{1,\kappa }^{q}}\ge 1/\alpha \}\le \alpha ,\]
under any distribution ${\mathbf{P}_{0}}$ such that the hazard ratio is $\theta =1$.
Just as in the proof of Proposition 3.1, this lemma is shown by a combination of the martingale property of ${S_{1,k}^{q}}$ and Doob’s maximal inequality. Therefore, we omit the proof of Lemma 3.3.
In order to obtain an optimal test under a particular hazard ratio ${\theta _{1}}$—an alternative hypothesis—, it is necessary to compute the partial conditional likelihood for the data under the alternative of having observed ${O_{k}^{B}}$ given $({\bar{y}_{k-1}^{A}},{\bar{y}_{k-1}^{B}},{\bar{N}_{k-1}})$ (here ${\bar{N}_{k}}$ is the total number of observations up until time ${t_{k}}$). This conditional likelihood is given by Fisher’s noncentral hypergeometric distribution with parameter ω. Unfortunately, ω depends on the baseline hazard function λ, which is assumed to be unknown (see Appendix A.3 for details). It is for this reason that we restrict the null hypothesis to ${\theta _{0}}=1$. Luckily, since the test based on ${S_{{\theta _{0}},t}^{q}}$ remains valid even if q is only approximately correct, this problem can be skirted. As also noted by [21], when the times between observations are short, the parameter ω is well approximated by ${\theta _{1}}$, the hazard ratio under the alternative hypothesis—no knowledge of λ is needed for the approximation. With this in mind, we put forward the use of ${S_{1,k}^{{\theta _{1}}}}$ given by
\[ {S_{1,k}^{{\theta _{1}}}}:=\prod \limits_{l\le k}\frac{{p_{{\theta _{1}},l}}({O_{l}^{B}})}{{p_{1,l}}({O_{l}^{B}})},\]
where ${S_{1,k}^{{\theta _{1}}}}$ is an instance of (3.8) with ${q_{k}}({O_{k}^{B}})={p_{{\theta _{1}},k}}({O_{k}^{B}})$, and ${p_{{\theta _{1}},k}}({O_{k}^{B}})$ is Fisher’s noncentral hypergeometric distribution with parameter $\omega ={\theta _{1}}$, that is,
\[\begin{aligned}{}{p_{{\theta _{1}},k}}({o^{B}})& ={p_{\mathrm{FNCH}}}({o^{B}};\hspace{2.5pt}\bar{y},{\bar{y}^{B}},o,\omega ={\theta _{1}})\\ {} & =\frac{\left(\genfrac{}{}{0.0pt}{}{{\bar{y}^{B}}}{{o^{B}}}\right)\left(\genfrac{}{}{0.0pt}{}{\bar{y}-{\bar{y}^{B}}}{o-{o^{B}}}\right){\theta _{1}^{{o^{B}}}}}{{\textstyle\sum _{\max \{0,o-(\bar{y}-{\bar{y}^{B}})\}\le u\le \min \{{\bar{y}^{B}},o\}}}\left(\genfrac{}{}{0.0pt}{}{{\bar{y}^{B}}}{u}\right)\left(\genfrac{}{}{0.0pt}{}{\bar{y}-{\bar{y}^{B}}}{o-u}\right){\theta _{1}^{u}}}.\end{aligned}\]
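The following sketch implements these grouped-data computations directly from the probability mass functions above (the hypergeometric null is recovered by setting the odds parameter to one); the function names and the example blocks of tied observations are ours and purely illustrative.

```python
from math import comb, prod

def fnch_pmf(oB, ybar, ybarB, o, theta):
    # Fisher's noncentral hypergeometric pmf with odds parameter theta;
    # theta = 1 recovers the ordinary hypergeometric pmf used under the null.
    lo, hi = max(0, o - (ybar - ybarB)), min(ybarB, o)
    norm = sum(comb(ybarB, u) * comb(ybar - ybarB, o - u) * theta**u
               for u in range(lo, hi + 1))
    return comb(ybarB, oB) * comb(ybar - ybarB, o - oB) * theta**oB / norm

def av_logrank_ties(blocks, theta1):
    # Running statistic S_{1,k}^{theta1} for grouped data: `blocks` lists
    # (ybarA, ybarB, oA, oB) per observation interval, with risk sets taken
    # just before the interval.
    return prod(
        fnch_pmf(oB, yA + yB, yB, oA + oB, theta1) /
        fnch_pmf(oB, yA + yB, yB, oA + oB, 1.0)
        for yA, yB, oA, oB in blocks)

# Example: three observation intervals with tied events.
blocks = [(100, 100, 3, 1), (97, 99, 2, 1), (95, 98, 4, 2)]
print(av_logrank_ties(blocks, theta1=0.7))
```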
We remark that despite ${p_{{\theta _{1}},k}}$ being only approximately the correct distribution for the observations under the alternative, type-I error guarantees are not compromised (see the discussion on luckiness in Section 3.2). In any case, this approximation is accurate when the time between two consecutive observation times is not very long and when the number of tied observations is small. Two reassuring remarks are in order. First, in the special case when only one event is witnessed in each time interval between two consecutive observation moments, the statistic ${S_{1,k}^{{\theta _{1}}}}$ reduces to the continuously monitored AV logrank test (3.1) at time ${t_{k}}$. Second, the score test associated with ${S_{1,k}^{{\theta _{1}}}}$ coincides with the logrank test as it is conventionally computed in the presence of ties.
4 A Gaussian Approximation to the AV Logrank Test
In this section we present an approximation to the AV logrank test introduced in the previous section. This is based on a Gaussian approximation to the logrank statistic. The approximation is of interest for two reasons. First, in practical situations, only the logrank Z-statistic (a standardized form of the classic logrank statistic) and other summary statistics may be available—and not the full risk-set process. This is often the case in medical meta-analyses, where the full data sets (Individual Participant Data, IPD, for each trial) are often not available, but summary statistics are. The Gaussian approximation to the AV logrank statistic can then often still be used (we remark though that in the BCG trial meta-analysis [33] referred to in the Introduction, there was IPD access and the exact anytime-valid logrank test was performed). The second reason, which we address in Section 4.1, is related to the fact that α-spending and group-sequential approaches, which we use as benchmarks, are also based on Gaussian approximations to the classic logrank statistic. Consequently, the behavior of the Gaussian approximation gives further insights into how the AV logrank statistic compares to group-sequential and α-spending approaches as well. We henceforth focus on ${\theta _{0}}=1$, a central case of interest also considered in Example 3.2.
Our general strategy is close in spirit to that followed in the construction of the exact AV logrank statistic in Section 3. We build likelihood ratios using a classic Gaussian approximation for the distribution of the original logrank statistic [32]. If the distribution of this statistic were exactly normal, we could continuously monitor its likelihood ratio. We show through extensive simulation in which regimes this approximation behaves similarly to the AV logrank statistic.
We begin by recalling the definition of the Z-score associated with the classic logrank test. Let ${E_{i}^{B}}={O_{i}}{p_{i}^{B}}$ with ${p_{i}^{B}}={\bar{y}_{i}^{B}}/({\bar{y}_{i}^{A}}+{\bar{y}_{i}^{B}})$ be the expected (under the null) number of events witnessed in the time interval $({t_{i-1}},{t_{i}}]$ in group B, and let ${V_{i}^{B}}={O_{i}}\hspace{2.5pt}{p_{i}^{B}}(1-{p_{i}^{B}})\frac{{\bar{y}_{i}}-{O_{i}}}{{\bar{y}_{i}}-1}$ be its variance. After k observations, the Z-score associated with the classic logrank statistic, ${Z_{k}}$, is given by
(4.1)
\[ {Z_{k}}=\frac{{\textstyle\sum _{i\le k}}\left\{{O_{i}^{B}}-{E_{i}^{B}}\right\}}{\sqrt{{\textstyle\sum _{i\le k}}{V_{i}^{B}}}}.\]
The numerator in the definition of ${Z_{k}}$ is the classic logrank statistic ${H_{k}}={\textstyle\sum _{i\le k}}\left\{{O_{i}^{B}}-{E_{i}^{B}}\right\}$, which is typically interpreted as the cumulative difference between the observed counts ${O_{i}^{B}}$ and the expected counts ${E_{i}^{B}}$ in Group B. The factor $\frac{{\bar{y}_{i}}-{O_{i}}}{{\bar{y}_{i}}-1}$ found in ${V_{i}^{B}}$ can be interpreted as a multiplicity correction, that is, a correction for ties [15, p. 207]. When only one event is witnessed between two consecutive observation times, then ${O_{i}}=1$, ${E_{i}^{B}}={p_{i}^{B}}$, and ${V_{i}^{B}}={p_{i}^{B}}(1-{p_{i}^{B}})$. We remark that the above formulation is also found in the work of Cox [3, (26)].
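For concreteness, the following sketch computes (4.1) from grouped counts; the input format—risk sets and event counts per observation interval—is an assumption of ours, chosen to match the notation of Section 3.3.

```python
import numpy as np

def logrank_z(blocks):
    # Logrank Z-score (4.1) from grouped counts: `blocks` lists
    # (ybarA, ybarB, oA, oB) per observation time, risk sets taken just before it.
    num, var = 0.0, 0.0
    for yA, yB, oA, oB in blocks:
        y, o = yA + yB, oA + oB
        pB = yB / y
        num += oB - o * pB                                   # O^B - E^B
        var += o * pB * (1 - pB) * (y - o) / (y - 1)         # V^B with tie correction
    return num / np.sqrt(var)

blocks = [(100, 100, 3, 1), (97, 99, 2, 1), (95, 98, 4, 2)]
print(logrank_z(blocks))
```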
Fix some initial ${m^{A}}$, ${m^{B}}$ and let $\gamma ={m^{B}}/{m^{A}}$. Schoenfeld [32] showed that in the limit for ${m^{A}}\to \infty $, ${m^{B}}=\gamma {m^{A}}$, $k\to \infty $, $k/{m^{A}}\to 0$, and $\log \theta =O({({m^{A}}+{m^{B}})^{-1/2}})$, the distribution of ${Z_{k}}$ under any distribution with hazard ratio ${\theta _{1}}$ converges to that of a normal with unit variance and mean $\sqrt{{\bar{N}_{k}}}{\mu _{1}}$, where
(4.2)
\[ {\mu _{1}}=\frac{\sqrt{\gamma }}{1+\gamma }\log ({\theta _{1}}).\]
Appendix C provides more details about this result. If ${Z_{k}}$ were exactly normally distributed with this mean and unit variance, then an easy calculation shows that the logrank partial likelihood ratio statistic ${S_{1,k}^{{\theta _{1}}}}$ would be equal to a likelihood ratio of two Gaussians, given by
(4.3)
\[ {S_{k}^{G}}:=\exp \left(-\frac{1}{2}{\bar{N}_{k}}{\mu _{1}^{2}}+\sqrt{{\bar{N}_{k}}}{\mu _{1}}{Z_{k}}\right),\]
where ${\bar{N}_{k}}$ is the total number of observations up until time ${t_{k}}$ and ${\mu _{1}}$ is given by (4.2). We thus put forward ${S_{k}^{G}}$ as a Gaussian approximation to the logrank statistic ${S_{1,k}^{{\theta _{1}}}}$. For an arbitrary random observation time ${t_{K}}\in \{{t_{1}},{t_{2}},\dots \hspace{0.1667em}\}$, we refer to the test ${\xi _{K}^{G}}=\mathbf{1}\{{S_{K}^{G}}\ge 1/\alpha \}$ as the Gaussian AV logrank test for (2.3). Recall that we test ${\theta _{0}}=1$, which corresponds to the asymptotic mean of the Z-score under the null hypothesis being ${\mu _{0}}=0$. Clearly, the approximation may fail in practice for at least two reasons: first, ${m^{A}}$, ${m^{B}}$, k and ${\theta _{1}}$ may not fall under Schoenfeld’s asymptotic regime; second, Schoenfeld’s result has been derived in a nonsequential setting, whereas we aim to use it in a sequential setting with optional stopping. Thus, in Appendix C.1 extensive simulations are performed to show in which regimes the Gaussian logrank test retains type-I error guarantees. In Appendix C.2, it is shown that, under continuous monitoring, the Gaussian AV logrank test tends to be more conservative—it needs more data than the exact one. The conclusion is the following: ${S_{K}^{G}}$ can be used for designs with balanced allocation, and it approximates ${S_{1,K}^{{\theta _{1}}}}$ well for hazard ratios between 0.5 and 2.
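When only summary statistics are available, ${S_{k}^{G}}$ can be computed directly from the reported Z-scores and event counts, as in the following sketch; the interim looks shown are hypothetical numbers, ${\mu _{1}}$ is taken from (4.2), and the function name is ours.

```python
import numpy as np

def gaussian_av_logrank(z_scores, n_events, theta1, mA, mB):
    # Gaussian AV logrank statistic (4.3) computed from summary statistics:
    # the logrank Z-score and the cumulative number of events at each look.
    gamma = mB / mA
    mu1 = np.sqrt(gamma) / (1 + gamma) * np.log(theta1)   # (4.2)
    z, n = np.asarray(z_scores), np.asarray(n_events)
    return np.exp(-0.5 * n * mu1**2 + np.sqrt(n) * mu1 * z)

# Interim looks of a hypothetical trial with balanced allocation.
S = gaussian_av_logrank(z_scores=[-1.1, -1.8, -2.6], n_events=[40, 90, 150],
                        theta1=0.7, mA=500, mB=500)
print(S, S >= 1 / 0.05)   # reject at any look where S >= 20
```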
We now compare the rejection regions defined by the Gaussian AV logrank test to those of α-spending and group-sequential approaches under continuous monitoring.
4.1 Rejection Region and α-Spending
In this section we compare the rejection regions of the Z-scores for which α-spending approaches and the AV logrank test reject the null hypothesis of no effect (hazard ratio ${\theta _{0}}=1$, Example 3.2). The two main α-spending approaches discussed here are due to Pocock [25] and O’Brien and Fleming [22]. We provide two reasons why the main focus of the comparison, however, will be on the O’Brien-Fleming approach. Firstly, in retrospect, Pocock himself believes that his approach leads to boundaries that are unsuitable [26]. One main feature of the Pocock procedure is that the rejection regions are the same regardless of whether the (interim) analyses are conducted at the start or the end of the trial. In practice this leads to many trials stopped for benefit based on (too) small sample sizes and with unrealistically large treatment effects [26]. In contrast, the rejection boundary of the O’Brien-Fleming procedure is more conservative at the start than at the end of the trial. Secondly, the Pocock procedure only allows for a finite number of planned analyses and, therefore, cannot be monitored continuously, whereas this is possible with the O’Brien-Fleming α-spending approach. Hence, the fair comparison is between the two procedures (the AV logrank test and the O’Brien-Fleming α-spending approach) that allow for continuous monitoring.
We begin by specifying the rejection regions of both the Gaussian AV logrank test and the O’Brien-Fleming α-spending procedure. For the Gaussian AV logrank test, we compute the region of Z-scores for which the null hypothesis is rejected. Indeed, using (4.3), we can compute that, whenever ${m^{A}}={m^{B}}$, the null hypothesis is rejected as soon as
\[\begin{aligned}{}{Z_{n}}\ge \frac{\sqrt{n}}{4}\ln ({\theta _{1}})-\frac{2}{\sqrt{n}}\frac{\log (\alpha )}{\log ({\theta _{1}})}& \hspace{5pt}\text{if}\hspace{7.5pt}{\theta _{1}}\gt 1,\hspace{2.5pt}\text{or}\\ {} {Z_{n}}\le \frac{\sqrt{n}}{4}\ln ({\theta _{1}})-\frac{2}{\sqrt{n}}\frac{\log (\alpha )}{\log ({\theta _{1}})}& \hspace{5pt}\text{if}\hspace{7.5pt}{\theta _{1}}\lt 1.\end{aligned}\]
The O’Brien-Fleming procedure is based on a Brownian-motion approximation to the sequentially computed logrank Z-score. Indeed, for large values of ${n_{\max }}$ and $t\in [0,1]$, the process $t\mapsto \sqrt{\frac{\lfloor t{n_{\max }}\rfloor }{{n_{\max }}}}{Z_{\lfloor t{n_{\max }}\rfloor }}$ can be approximated by a Brownian motion ${B_{t}}$. We stress the fact that ${n_{\max }}$ has to be set in advance. If ${B_{t}}$ is a Brownian motion, the reflection principle, a well-known but nontrivial application of the symmetry of ${B_{t}}$, implies that
\[ \mathbf{P}\left\{\underset{t\in [0,1]}{\sup }{B_{t}}\ge c\right\}=2\mathbf{P}\{{B_{1}}\ge c\}.\]
Since ${B_{1}}$ is Gaussian with mean zero and standard deviation 1, setting $c={q_{1-\alpha /2}}$, the $(1-\alpha /2)$-quantile of a standard Gaussian distribution, yields
\[ \mathbf{P}\left\{\underset{t\in [0,1]}{\sup }{B_{t}}\ge {q_{1-\alpha /2}}\right\}=2\mathbf{P}\{{B_{1}}\ge {q_{1-\alpha /2}}\}=\alpha .\]
This implies that
\[ \mathbf{P}\{\underset{n\le {n_{\max }}}{\max }\sqrt{n}{Z_{n}}\ge \sqrt{{n_{\max }}}{q_{1-\alpha /2}}\}\approx \alpha ,\]
or, in other words, the procedure that continuously monitors whether $\sqrt{n}{Z_{n}}$ crosses the boundary $\sqrt{{n_{\max }}}{q_{1-\alpha /2}}$ guarantees approximate type-I error α. Given a hazard ratio ${\theta _{1}}$ under the alternative hypothesis, ${n_{\max }}$ can be set to achieve a desired type-II error. The left-sided procedure can be worked out similarly, and we obtain that, for ${m^{A}}={m^{B}}$, the continuous-monitoring version of the O’Brien-Fleming procedure rejects as soon as
\[\begin{aligned}{}{Z_{n}}\ge \sqrt{\frac{{n_{\max }}}{n}}{q_{1-\alpha /2}}& \hspace{7.5pt}\text{if}\hspace{7.5pt}{\theta _{1}}\gt 1\hspace{2.5pt}\text{(right-sided test)},\hspace{2.5pt}\text{or}\hspace{2.5pt}\\ {} {Z_{n}}\le -\sqrt{\frac{{n_{\max }}}{n}}{q_{1-\alpha /2}}& \hspace{7.5pt}\text{if}\hspace{7.5pt}{\theta _{1}}\lt 1\hspace{2.5pt}\text{(left-sided test)}.\end{aligned}\]
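The O’Brien-Fleming boundary can be coded analogously. The sketch below reuses `gaussian_av_boundary` from the previous sketch and prints both (right-sided) boundaries on a small grid of event counts, mirroring the comparison in Figure 2; the value 205 for ${n_{\max }}$ is the O’Brien-Fleming maximum number of events for detecting a hazard ratio of 0.7 discussed further below, and the snippet is purely illustrative.

```python
import numpy as np
from scipy.stats import norm

def obf_boundary(n, n_max, alpha=0.05):
    """Right-sided continuous-monitoring O'Brien-Fleming Z-score boundary;
    the left-sided boundary is its negative."""
    return np.sqrt(n_max / np.asarray(n, dtype=float)) * norm.ppf(1 - alpha / 2)

# Compare both boundaries on a small grid of event counts (n_max = 205 as in Section 4.1).
for n in (10, 50, 100, 205):
    print(n, obf_boundary(n, n_max=205), gaussian_av_boundary(n, theta1=0.7))
```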
Figure 2
Left-sided rejection regions for continuous-monitoring using O’Brien-Fleming α-spending or the Gaussian AV logrank test. Allocation is balanced (${m^{A}}={m^{B}}$) and $\alpha =0.05$. Also shown are the O’Brien-Fleming and Pocock α-spending boundaries for 10 interim analyses. The α-spending boundaries are designed to have $80\% $ power when detecting a hazard ratio 0.7. For more details, including the values of ${n_{\max }}$, see Section 4.1.
The two rejection regions for the Z-statistic share an important feature: they are more conservative at small sample sizes than at larger ones, requiring more extreme values of the Z-statistic at the start of the trial. This sets them apart from the Pocock spending function, which requires equally extreme values of the Z-statistic at small and large sample sizes. Figure 2 shows both the Gaussian AV logrank and the O’Brien-Fleming α-spending rejection regions. Additionally, Figure 2 shows the boundary of the Pocock α-spending function for 10 interim analyses. Note that the definition of the AV logrank test rejection region requires a very explicit value for the effect size ${\theta _{1}}={\theta _{\min }}$ of minimum clinical relevance, while that value is implicit in the definition of the α-spending rejection region: to specify a maximum sample size ${n_{\max }}$ that achieves a certain power, an effect size of minimal interest is also assumed. A fixed-sample-size analysis designed to detect a minimum hazard ratio of 0.7 would need 195 events to achieve $80\% $ power if the true hazard ratio is also 0.7. A sequential analysis using α-spending requires a slightly larger maximum number of events: 205 with the O’Brien-Fleming spending function and 245 with the Pocock α-spending function, when we design for 10 interim analyses. We investigate the number of events needed by the Gaussian AV logrank test in Appendix C.2. For the α-spending procedures, continuing beyond ${n_{\max }}$ is problematic. This is not the case for the AV logrank test: since it allows for unlimited monitoring, ${n_{\max }}$ is only a soft constraint on the study, and there is no penalty in type-I error for continuing after ${n_{\max }}$ events have been witnessed.
The benefit of a sequential approach is that, if there is evidence that the hazard ratio is more extreme than anticipated under the alternative hypothesis, we can detect that with fewer events than the maximum sample size. The left column of Figure 3 illustrates that we benefit when the true hazard ratio is more extreme than we designed for (e.g. 0.5 instead of 0.7, a larger risk reduction in the treatment group) and the data reflect that. We also benefit from a sequential analysis if the true hazard ratio is 0.7 but by chance the values of our Z-statistics are more extreme than expected. The major difference between α-spending approaches and the AV logrank test is that the AV test does not require setting a maximum sample size. It in fact allows the sample size to be increased indefinitely without ever spending all α. An α-spending approach designed to have $80\% $ power will miss out on rejecting the null hypothesis in $20\% $ (the type-II error) of the cases, as is illustrated in the bottom middle plot of Figure 3 by the sample paths that remain (dark) green. In contrast, the AV logrank test can potentially reject with $100\% $ power by continuing to sample. In the sample paths of 500 events in Figure 3, all but one sample path of Z-statistics could be rejected at a larger sample size by the AV logrank test. By extending the trial, the AV logrank test can potentially have $100\% $ power if the true hazard ratio is at least as small as the hazard ratio set for minimum clinical relevance in the design of the test. Still, type-I error is controlled. The bottom right plot of Figure 3 shows two null sample paths with a true hazard ratio of 1 that are rejected by the O’Brien-Fleming α-spending region, but not by the AV logrank test. Here, the AV logrank test is more conservative.
Figure 3
Null hypothesis rejections on simulated data. The rejection regions are the same as shown in Figure 2 (designed to detect a hazard ratio of 0.7 with $80\% $ power). Data are simulated under balanced allocation (${m_{1}}={m_{0}}=5000$) and as time-to-event data with possible ties. The logrank Z-statistic does not have a value for all n; it sometimes jumps with several additional events at a time.
It is known that α-spending methods behave poorly in case of unbalanced allocation [51]. In Appendix C.1 we show that our Gaussian approximation to the logrank test is also not an E-variable in case of unbalanced allocation. Our exact AV logrank test, however, is an E-variable under any allocation, since it is defined directly on the risk-set process (3.3). This suggests that, if the complete data set is available and allocation is unbalanced, the exact AV logrank test should be preferred over the Gaussian approximation and the α-spending methods.
5 Optional Continuation and Live Meta-Analysis
In this section, we address optional continuation and live meta-analysis: the continuous aggregation of evidence from multiple experiments. For instance, data could come from medical trials conducted in different hospitals or in different countries. In such cases, we compare a global null hypothesis ${\mathcal{H}_{0}}$ that is addressed in all trials (for instance, ${\theta _{0}}=1$) to an alternative hypothesis ${\mathcal{H}_{1}}$ that allows for different hazard ratios in each experiment. The present approach covers even the case in which the decision to start each experiment might depend on the observations made in experiments that are already in progress. Assume that there are ${k_{E}}$ experiments, ${E_{(1)}},\dots ,{E_{({k_{E}})}}$, ordered by their respective starting times ${V_{(1)}}\le \cdots \le {V_{({k_{E}})}}$, each performed on different and independent populations. Assume further that the starting time ${V_{(k)}}$ of experiment ${E_{(k)}}$ depends only on the data observed in the ongoing experiments ${E_{(1)}},\dots ,{E_{(k-1)}}$. If each experiment ${E_{(k)}}$ monitors the AV logrank statistic ${S_{{\theta _{0}},t}^{k}}$, where ${S_{{\theta _{0}},t}^{k}}=1$ for $t\le {V_{(k)}}$, then the product statistic ${S_{{\theta _{0}},t}^{\mathrm{meta}}}={\textstyle\prod _{i\le {k_{E}}}}{S_{{\theta _{0}},t}^{i}}$ is a test martingale with respect to the filtration generated by all observations. Consequently, the meta-test based on it enjoys anytime validity.
Proposition 5.1.
Let τ be any random time. The test ${\xi _{{\theta _{0}},\tau }^{\mathrm{meta}}}$ given by $\mathbf{1}\{{S_{{\theta _{0}},\tau }^{\mathrm{meta}}}\ge 1/\alpha \}$, where ${S_{{\theta _{0}},t}^{\mathrm{meta}}}={\textstyle\prod _{i\le {k_{E}}}}{S_{{\theta _{0}},t}^{i}}$, has type-I error smaller than α.
This result follows from a reduction to independent left-truncation—we refer to left-truncation in the specific sense defined by Andersen et al. [1]. Indeed, even in the presence of dependencies on other studies, the observations made in ${E_{(k)}}$ can be regarded as a left-truncated sample. Here, the time at which observation in ${E_{(k)}}$ is started is random and only participants that have not witnessed an event are recruited into the study. One may worry that these dependencies may alter the sequential properties of ${S_{{\theta _{0}},t}^{\mathrm{meta}}}$, but this is not the case. Since the truncation time for ${E_{(k)}}$ is based on data that are independent of those of experiment ${E_{(k)}}$ (it is possibly based on the observations made in all other experiments), it follows from results of Andersen et al. [1] (see Appendix A.2) that the sequential-multinomial interpretation of the partial likelihood for the truncated data remains valid. Consequently, so does the sequentially computed AV logrank statistic and the product statistic ${S_{{\theta _{0}},t}^{\mathrm{meta}}}$. By continuously monitoring ${S_{{\theta _{0}},t}^{\mathrm{meta}}}$, we effectively perform an online, cumulative and possibly live meta-analysis that remains valid irrespective of the order in which the events of the different trials are observed. Importantly, unlike in α-spending approaches, the maximum number of trials and the maximum sample size (number of events) per trial do not have to be fixed in advance; we can always decide to start a new trial, or to postpone ending a trial and wait for additional events.
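To make the aggregation in Proposition 5.1 concrete, the following Python sketch multiplies per-trial AV logrank statistics that have been evaluated on a shared time grid, with each statistic equal to 1 before its trial starts. It is a minimal illustration of the product construction, not a complete implementation.

```python
import numpy as np

def meta_av_statistic(per_trial_S, alpha=0.05):
    """Combine running AV logrank statistics from independent trials by multiplication.
    `per_trial_S` is a list of equal-length arrays on a shared (calendar-)time grid,
    with entries equal to 1 before the corresponding trial has started.
    Returns the running meta-statistic and the first index at which it reaches 1/alpha."""
    S_meta = np.prod(np.vstack(per_trial_S), axis=0)
    crossed = np.nonzero(S_meta >= 1 / alpha)[0]
    return S_meta, (int(crossed[0]) if crossed.size else None)
```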
6 Anytime-Valid Confidence Sequences
Anytime-valid (AV) confidence sequences correspond to anytime-valid tests in the same way confidence sets correspond to fixed-sample tests. Indeed, it is possible to “invert” a fixed-sample test to build a confidence set: the parameters of the null hypothesis that are not rejected by the test form a confidence set. Analogously, test martingales can be used to derive AV confidence sequences [5, 16, 12, 13]. In our setting, a $(1-\alpha )$-AV confidence sequence is a sequence of confidence sets ${\{{\mathrm{CI}_{t}}\}_{t\ge 0}}$ such that
(6.1)
\[ {\mathbf{P}_{\theta }}\{\theta \notin {\mathrm{CI}_{t}}\hspace{2.5pt}\hspace{2.5pt}\text{for some}\hspace{2.5pt}\hspace{2.5pt}t\ge 0\}\le \alpha .\]
We call it an AV confidence interval if each confidence set ${\mathrm{CI}_{t}}$ is itself an interval. A standard way to design $(1-\alpha )$-AV confidence intervals, translated to our logrank setting, is to use a prequential plug-in test martingale ${S_{{\theta _{0}},t}^{\mathrm{preq}}}$ as in Section 3.2 (or the Bayesian alternative discussed in Appendix A.4). At time t, one reports ${\mathrm{CI}_{t}}=[{\theta _{t}^{L}},{\theta _{t}^{U}}]$, the smallest interval containing all values of ${\theta _{0}}$ that have not been rejected at time t, that is, for which ${S_{{\theta _{0}},t}^{\mathrm{preq}}}\lt 1/\alpha $. Ville’s inequality readily implies that this is indeed an AV confidence interval. The same construction can be made for arbitrary instances of ${S_{{\theta _{0}},t}^{q}}$ as in (3.5).
Figure 4
Maximum, expected (Mean) number of events needed to reject the null hypothesis with $80\% $ power. ‘Conditional Mean’ makes reference to the number of events needed given that the null hypothesis is indeed rejected. The maximum number of events needed using AV logrank statistics is higher than that of a fixed-sample test, but lower in expectation (see Section 7). All simulations are performed with $\alpha =0.05$ and tests are designed to detect the hazard ratio ${\theta _{1}}$ shown on the x-axis. Data are generated using that same hazard ratio. The classical logrank test needs the following sample sizes (number of events) $n({\theta _{1}})$ for an 80%-power design to detect hazard ratio ${\theta _{1}}$: $n(0.1)=5$, $n(0.2)=10$, $n(0.3)=18$, $n(0.4)=30$, $n(0.5)=52$, $n(0.6)=95$, $n(0.7)=195$, $n(0.8)=497$ and $n(0.9)=2228$. These sample sizes represent the $100\% $ line in all plots.
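The inversion described in Section 6 can be sketched in a few lines of Python. The helper `S_at` is hypothetical: it stands for any routine that returns the current value of a valid test martingale ${S_{{\theta _{0}},t}^{q}}$ (for instance the prequential plug-in version) at a candidate value of ${\theta _{0}}$; the grid is likewise only illustrative.

```python
import numpy as np

def av_confidence_interval(S_at, theta_grid, alpha=0.05):
    """Invert a family of AV logrank test martingales into an AV confidence interval:
    keep every candidate theta0 whose current statistic has not crossed 1/alpha and
    report the smallest interval containing all of them (None if all are rejected)."""
    kept = [th for th in theta_grid if S_at(th) < 1 / alpha]
    return (min(kept), max(kept)) if kept else None

# Example grid of candidate hazard ratios, equally spaced on the log scale.
theta_grid = np.exp(np.linspace(np.log(0.2), np.log(2.0), 200))
```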
7 Power and Sample Size
In this section, we investigate the power properties of the AV logrank test by studying specific stopping times. We have seen that, by observing arbitrarily long sequences of events, the AV logrank test can achieve type-II errors that are as close to zero as desired. However, in practice it is necessary to plan for a maximum number of events ${n_{\max }}$, so that the experiment is stopped either as soon as the null hypothesis is rejected or when ${n_{\max }}$ events have been observed; in the latter case, there is no evidence to reject the null hypothesis. We assess via simulation the value of ${n_{\max }}$ needed to guarantee $20\% $ type-II error ($80\% $ power) for the exact and Gaussian AV logrank tests. We compare this to the ${n_{\max }}$ needed to achieve the same power using the continuous-monitoring O’Brien-Fleming α-spending procedure introduced in the previous section, and to the fixed-sample-size classic logrank test.
Figure 4 shows simulation results establishing three types of sample sizes. The leftmost panel (“Maximum”) shows the sample size ${n_{\max }}$ described earlier, which would be required to design the experiment. We stress the fact that, when using the classic logrank test or α-spending designs, events beyond ${n_{\max }}$ cannot be analyzed. The rightmost panel of Figure 4 (“Mean”) shows the sample sizes that capture the expected duration of the trial: it expresses the mean number of events, under the alternative hypothesis, that will be observed before the trial can be stopped. Here, for the AV logrank tests, we use the aggressive stopping rule that stops as soon as ${S_{{\theta _{0}},t}^{{\theta _{1}}}}\ge 1/\alpha =20$ or $n={n_{\max }}$. For the α-spending approaches and the AV logrank test, this number of events is always smaller than the maximum needed in the design stage. Lastly, the middle panel (“Conditional Mean”) shows an even smaller number for those tests that have a flexible sample size: the expected stopping time given that the trial is stopped before the maximum ${n_{\max }}$ was reached, which only happens if the null is rejected. For comparison purposes, all sample sizes are shown relative to (i.e., divided by) the fixed sample size needed by the classical logrank test to obtain $80\% $ power.
Note that for small sample sizes (that is, for small hazard ratios), both the classic logrank test and O’Brien-Fleming α-spending are not recommended due to lack of type-I error control. They are based on Schoenfeld’s Gaussian approximation, which underestimates the number of events required for hazard ratios far away from 1: for example, simulations show that for ${\theta _{1}}=0.1$, 6 or 7 events are necessary rather than the 5 predicted by the approximation. We give further details in Appendix A.5 (see also Figure 4). In summary, at all hazard ratios at which the Gaussian approximation to the classic logrank test is accurate (say for ${\theta _{1}}\ge 0.3$), the mean number of events needed by the AV logrank tests is about the same as or noticeably smaller than that needed when using a fixed-sample-size analysis.
8 Discussion, Future Work and Recommendations for Practice
We introduced the AV logrank test, a version of the logrank test that retains type-I error guarantees under optional stopping and continuation. Extensive simulations reveal that, if we do engage in optional stopping, it is competitive with the classic logrank test (which neither allows in-trial optional stopping nor optional continuation) and α-spending procedures (which allow forms of optional stopping but not optional continuation). We provided an approximate (Gaussian) test for applications in which only summary statistics are available and also showed how the AV logrank test can be used in combination with prequential learning approaches, when no effect size of minimal clinical relevance can be specified. The GROW AV logrank tests (exact and Gaussian) are available in our safestats R package [43].
We end this paper by discussing how we avoid the issue of so-called doomed trials (Section 8.3) and providing a practical guideline for what method to use in what situation (Section 8.4). But first we discuss two highly important potential extensions: including covariates other than group membership and staggered entries.
8.1 Covariates: The Full Cox Model
From a purely theoretical perspective, it is straightforward to extend the AV logrank test to the situation in which time-dependent covariates are present, making the underlying model equivalent to the full Cox proportional hazards model. We sketch how to do this, extending the notation of Section 3. Assume therefore the presence of d covariates and let, for each participant i, ${\mathbf{z}_{t}^{i}}=({z_{t,1}^{i}},\dots ,{z_{t,d}^{i}})$ be the covariate vector consisting of left-continuous time-dependent covariates ${z_{t,1}^{i}},\dots ,{z_{t,d}^{i}}$. Denote by ${\mathbf{z}_{(k)}^{i}}$ the value of the covariates of participant i at the time ${T_{(k)}}$ at which the kth event is witnessed. We let the random variable ${I_{(k)}}$ denote the index of the participant to whom the kth event happens, and consider the extended process ${I_{(1)}},{I_{(2)}},\dots \hspace{0.1667em}$, where the information that is available at time ${T_{(k)}}$ is ${I_{(1)}},{I_{(2)}},\dots ,{I_{(k)}}$ and ${\mathbf{z}_{(1)}},\dots ,{\mathbf{z}_{(k)}}$. The conditional partial likelihood, now denoted ${\mathbf{P}_{\boldsymbol{\beta },\theta }}$ with $\theta \gt 0$, $\boldsymbol{\beta }\in {\mathbb{R}^{d}}$, and ${\beta _{\theta }}=\ln \theta \in \mathbb{R}$, is defined as follows:
(8.1)
\[ \begin{aligned}{}& {\mathbf{P}_{\boldsymbol{\beta },\theta }}\{{I_{(k)}}=i\mid {\mathbf{z}_{(l)}^{j}},{y_{(l)}^{j}};\hspace{2.5pt}j=1,\dots ,n;\hspace{2.5pt}l=1,\dots ,k\}\\ {} & \hspace{1em}:={\mathbf{P}_{\boldsymbol{\beta },\theta }}\{{I_{(k)}}=i\mid {\mathbf{z}_{(k)}^{j}},{y_{(k)}^{j}};\hspace{2.5pt}j=1,\dots ,n\},\\ {} & {\mathbf{P}_{\boldsymbol{\beta },\theta }}\{{I_{(k)}}=i\mid {\mathbf{z}_{(k)}^{j}},{y_{(k)}^{j}};\hspace{2.5pt}j=1,\dots ,n\}\\ {} & \hspace{1em}:={p_{\boldsymbol{\beta },\theta ,(k)}}(\hspace{2.5pt}i\hspace{2.5pt}):=\frac{\exp (\langle \boldsymbol{\beta },{\mathbf{z}_{(k)}^{i}}\rangle +{g^{i}}{\beta _{\theta }})}{{\textstyle\sum _{j\in {\mathcal{R}_{(k)}}}}\exp (\langle \boldsymbol{\beta },{\mathbf{z}_{(k)}^{j}}\rangle +{g^{j}}{\beta _{\theta }})}.\end{aligned}\]
This is consistent with Cox’s [3] proportional hazards regression model: the probability that the ith participant witnesses an event, given that he or she is still at risk, is proportional to the exponentiated weighted covariates, with group membership being one of the covariates. In case $\boldsymbol{\beta }=0$, this is easily seen to coincide with the definition of ${\mathbf{P}_{\theta }}$ via (2.5) with $\theta ={\mathrm{e}^{{\beta _{\theta }}}}$.
For simplicity we focus on the central case in which ${\mathcal{H}_{0}}$ expresses that there is no treatment effect, i.e. we have the composite null ${\mathcal{H}_{0}}:\boldsymbol{\beta }\in {\mathbb{R}^{d}};\theta =1$. First we simplify even further and consider a simple alternative ${\mathcal{H}_{1}}:\boldsymbol{\beta }={\boldsymbol{\beta }_{1}}$, $\theta ={\theta _{1}}$ for a specific fixed vector ${\boldsymbol{\beta }_{1}}\in {\mathbb{R}^{d}}$ and ${\theta _{1}}\ne 1$, with partial likelihoods given as above. We now want to extend (3.1) so that the corresponding test (3.2) is anytime-valid under all elements of ${\mathcal{H}_{0}}$, i.e. the appropriate extension of Proposition 3.1 holds (technically, for this we need the partial likelihood ratio to define an e-process). The factors in the numerator in (3.1) may simply be replaced by the corresponding partial likelihood factors ${p_{{\boldsymbol{\beta }_{1}},{\theta _{1}},(k)}}(\hspace{2.5pt}i\hspace{2.5pt})$. Since the null is now composite, it is not immediately clear what to use for the denominator. However, there exist two (by now standard) methods to set the denominator such that the ratio does become an e-process [29]: the Universal Inference (UI) method [47] and the Reverse Information Projection (RIPr, [9]). In the present setting, UI is much more straightforward to use: it amounts to setting the denominator, at each time t, equal to the partial likelihood under the running maximum likelihood estimator ${\hat{\boldsymbol{\beta }}_{({k_{t}})}}$ based on the ${k_{t}}$ events that happened before time t, i.e. for given t, ${k_{t}}$ is the largest k such that ${T^{(k)}}\lt t$. The partial likelihood ratio (3.1) then becomes
(8.2)
\[ \prod \limits_{k:{T^{(k)}}\le t}\frac{{p_{{\boldsymbol{\beta }_{1}},{\theta _{1}},(k)}}({I_{(k)}})}{{p_{{\hat{\boldsymbol{\beta }}_{({k_{t}})}},1,(k)}}({I_{(k)}})}.\]
We can now easily extend this approach to deal with the full ${\mathcal{H}_{1}}:\theta \ne 1$ (testing whether there is any treatment effect at all, see Example 3.2) by learning $({\boldsymbol{\beta }_{1}},{\theta _{1}})$ using a prequential approach as in Section 3.2. It is straightforward to extend Proposition 3.1 to show that with this choice, (8.2) provides an anytime-valid test. The advantage of the UI method is that it can be readily implemented by small modifications of existing software for finding the MLE in the Cox model; the disadvantage is that we may expect it to have suboptimal power unless the number of covariates is small compared to the number of observed events [40]. Alternatively, we may employ a sequential version of the RIPr method. This will also lead to an anytime-valid test, but, at least currently, we do not know how to calculate it efficiently; for completeness, we describe this approach in Appendix B, where we also provide additional details about covariates in combination with ties. In future work, we aim to investigate and compare the UI and RIPr approaches in detail.
8.2 Staggered Entry
Earlier approaches to sequential time-to-event analysis were also studied under scenarios of staggered entry, where each patient has their own event time (e.g., time from surgery to death), but patients do not enter the follow-up simultaneously (so that the risk set of, say, a two-days-after-surgery event changes when new participants enter and survive two days). Sellke and Siegmund [36] and Slud [38] show that, in general, martingale properties cannot be preserved under such staggered-entry settings, but that asymptotic results are hopeful [36] as long as certain scenarios are excluded [38]. When all participants’ risk is on the same (calendar) time scale (e.g., infection risk in a pandemic; staggered entry then amounts to left-truncation, which we can deal with), or when new patients enter in large groups (allowing us to stratify), staggered entry poses no problem for our methods. Research is still ongoing into which scenarios allow our inference to remain fully AV on the patient time scale under staggered entry, and which ones need extra care.
8.3 Your Trial Is Not Doomed
In their summary of conditional power approaches in sequential analysis, Proschan, Lan, and Wittes [27] write that low conditional power makes a trial futile. Continuing a trial in such a case could only be worth the effort to rule out an effect of clinical relevance, when the effect can be estimated with enough precision. However, if “both conditional and revised unconditional power are low, the trial is doomed because a null result is both likely and uninformative” [27, p. 63]. While this is the case for all existing sequential approaches that set a maximum sample size, it is not the case for AV tests. Any trial can be extended and thereby possibly achieve 100% power, or show via an anytime-valid confidence sequence that the effect is too small to be of interest. This is especially useful for time-to-event data, where the sample size can be increased by extending the follow-up time of the trial, without recruiting more participants. Moreover, new participants can always be enrolled, either within the same trial or by spurring new trials that can be combined indefinitely in a cumulative meta-analysis.
8.4 Recommendations for Practical Use
Throughout this paper, we encountered different instantiations of the AV logrank test. Specifically, for alternatives as introduced in Section 2 that are ‘disconnected’ from ${\mathcal{H}_{0}}$, such as in (2.3) (for left-sided alternatives) or (3.6) (for two-sided alternatives), we presented partial likelihood ratios ${S_{{\theta _{0}},t}^{{\theta _{1}}}}$ with the numerator based on a point alternative ${\theta _{1}}$ as in (3.1) (one-sided case) or on a mixture of two point alternatives (below (3.6)). For the full alternative ${\mathcal{H}_{1}}:\theta \ne {\theta _{0}}$ we presented partial likelihood ratios ${S_{{\theta _{0}},t}^{\mathrm{preq}}}$ with the numerator based on the prequential plug-in likelihood as in Section 3.2. However, all these versions continue to provide anytime-valid type-I error bounds under ${\mathcal{H}_{0}}$, no matter which potential ‘real’ alternatives we may encounter; these may not even satisfy proportional hazards (Example 3.2). So, what instantiation of the test should we choose in practice?
This depends on the situation. If we have a clearly defined one-sided minimum clinically relevant effect and we are convinced that the proportional hazards assumption holds, then this provides a strong motivation for using the point-alternative version ${S_{{\theta _{0}},t}^{{\theta _{1}}}}$, since it has GROW status (Section 3.1): by Breiman’s result detailed in Appendix A.1 it will give us the smallest expected stopping time in the worst case over all alternatives. In contrast, if we are interested in whether there is a treatment effect at all (i.e. ${\mathcal{H}_{1}}:\theta \ne {\theta _{0}}$), but also if misspecification might be an issue (proportional hazards might be substantially violated, and then ‘effect size’ is not really well-defined), we should use the prequential version ${S_{{\theta _{0}},t}^{\mathrm{preq}}}$, learning an alternative from the data.
Yet, even if we have a minimally clinically relevant effect size in mind and we may assume proportional hazards, there are cases in which we advocate using the prequential version after all, for the reason that, by varying ${\theta _{0}}$ in ${S_{{\theta _{0}},t}^{\mathrm{preq}}}$ via the construction in Section 6, it provides us with anytime-valid confidence sequences whose width is guaranteed to shrink toward the actual effect size; in contrast, it is easily seen that if we vary ${\theta _{0}}$ in ${S_{{\theta _{0}},t}^{{\theta _{1}}}}$, the resulting AV confidence sequences may remain wide forever. Well-behaved (i.e., shrinking) AV confidence sequences can be useful or even essential if we aim to communicate the results (confidence sequences are much more intuitive to practitioners than accept/reject decisions accompanied by partial likelihood ratios). Also, in some cases it cannot be ruled out that, and it would be really useful to know whether, the actual effect is much stronger than the minimum clinically relevant one. A case in point is the covid vaccination trials, in which the original test had as alternative ${\mathcal{H}_{1}}$ that vaccine efficacy was above $50\% $, which was viewed as the minimum clinically relevant value, but the data suggested it was really more than $80\% $, which changed the outlook altogether (for example, suggesting vaccination campaigns for the entire adult population rather than just the elderly). A test based on the point alternative ${S_{{\theta _{0}},t}^{{\theta _{1}}}}$ with ${\theta _{1}}$ corresponding to $50\% $ efficacy could not easily have been transformed post hoc into a test of whether the efficacy is larger than $80\% $, whereas with an AV confidence sequence based on ${S_{{\theta _{0}},t}^{\mathrm{preq}}}$ this can be established right away [35].
Finally, we note that we only recommend use of the Gaussian AV logrank test if one really has to, for example in the situation described in Section 4, when only logrank Z-statistics are available; and even then, only in the regime where it provides anytime-valid type-I error guarantees, as explored in that section.
A Omitted Proofs and Details
In this section we provide proofs and remarks omitted from previous sections. In Appendix A.1 we relate growth-rate optimality to the minimum expected stopping time. In Appendix A.2, we show that the AV logrank statistic is a continuous-time martingale, and that this remains true under general patterns of incomplete observation, such as left truncation and filtering, as a consequence of the results of Andersen et al. [1]. In Appendix A.3, we prove the claims made in Section 3.3 about the martingale structure of the AV logrank test in the presence of ties. Appendix A.4 shows how to learn the alternative from the data using a Bayesian prior rather than the prequential plug-in approach explained in Section 3.2 in the main text. Lastly, in Appendix A.5, we give further details on the simulations used to compute the planned maximum sample sizes for a given targeted power. Under the alternative and optional stopping, the observed sample size is in many cases lower.
A.1 Expected Stopping Time, GROW and Wald’s Identity
Here we motivate the GROW criterion by showing that it minimizes, in a worst-case sense, the expected number of events needed before there is sufficient evidence to stop. Let ${\mathbf{P}_{0}}$ represent our null model, and let, as before, the alternative hypothesis be ${\mathcal{H}_{1}}:\theta \le {\theta _{1}}$ for some ${\theta _{1}}\lt {\theta _{0}}$. Suppose we perform a level-α test based on a test martingale ${S_{{\theta _{0}},t}^{q}}$ using the stopping rule ${\tau ^{q}}$ that stops as soon as ${S_{{\theta _{0}},t}^{q}}$ exceeds the threshold $1/\alpha $, that is, ${\tau ^{q}}={\inf _{t}}\{t\hspace{0.1667em}:\hspace{0.1667em}{S_{{\theta _{0}},t}^{q}}\ge 1/\alpha \}$. In the main text we elaborated on how ${S_{{\theta _{0}},t}^{{\theta _{1}}}}$ is optimal with respect to the GROW criterion. We now show that the problem of minimizing the worst-case expected number of events ${\mathbf{E}_{\theta }}[{\bar{N}_{{\tau ^{q}}}}]$ over q is approximately equivalent to finding the GROW test martingale. To do so, we make simplifying assumptions that reduce the problem to an i.i.d. experiment. This allows us to employ a standard argument based on an identity of Wald [45], originally due to Breiman [2]. For this we assume that the initial risk sets (i.e., ${\bar{y}_{0}^{A}}$ and ${\bar{y}_{0}^{B}}$) are large enough so that, for all sample sizes we will ever encounter, ${\bar{y}_{t}^{A}}/{\bar{y}_{t}^{B}}\approx {\bar{y}_{0}^{A}}/{\bar{y}_{0}^{B}}$. This allows us to treat the likelihood of the participant(s) ${I_{(k)}}$ having witnessed the event at time ${T^{(k)}}$ as independent of t, that is, as an i.i.d. experiment.
The argument of Breiman [2] relates the expected number of events to the expected value of our stopped AV logrank statistic. Suppose first that we happen to know that the data come from a specific θ in the alternative hypothesis. Then ${S_{{\theta _{0}},{\tau ^{q}}}^{q}}$ is the product of ${\bar{N}_{{\tau ^{q}}}}$ factors, one ratio ${R_{{\theta _{0}},(i)}^{q}}={q_{(i)}}({I_{(i)}})/{p_{{\theta _{0}},(i)}}({I_{(i)}})$ for each event i. Wald’s identity applied to its logarithm implies
(A.1)
\[ {\mathbf{E}_{\theta }}[{\bar{N}_{{\tau ^{q}}}}]=\frac{{\mathbf{E}_{\theta }}[\ln {S_{{\theta _{0}},{\tau ^{q}}}^{q}}]}{{\mathbf{E}_{\theta }}[\ln {R_{{\theta _{0}},(1)}^{q}}]}.\]
For simplicity we will further assume that the number of participants at risk is large enough so that the probability that we run out of data before we can reject is negligible. Because of the choice of the stopping rule ${\tau ^{q}}$, the right-hand side of the last display can then be further rewritten as
\[ \frac{{\mathbf{E}_{\theta }}[\ln {S_{{\theta _{0}},{\tau ^{q}}}^{q}}]}{{\mathbf{E}_{\theta }}[\ln {R_{{\theta _{0}},(1)}^{q}}]}=\frac{\ln (1/\alpha )+\text{VERY}\hspace{2.5pt}\text{SMALL}}{{\mathbf{E}_{\theta }}\left[\ln \left({q_{(1)}}({I_{(1)}})/{p_{(1),{\theta _{0}}}}({I_{(1)}})\right)\right]},\]
where very small is a number between 0 and $|\log ({\theta _{1}}/{\theta _{0}})|$. The equality follows because we reject as soon as ${S_{{\theta _{0}},t}^{q}}\ge 1/\alpha $, so ${S_{{\theta _{0}},{\tau ^{q}}}^{q}}$ cannot be smaller than $1/\alpha $, and it cannot be larger by more than a factor equal to the maximum likelihood ratio at a single outcome (if we did not ignore the probability of stopping because we run out of data, there would be an additional small term in the numerator). With (A.1) at hand, we can relate our choice of q to the expected number of events witnessed before stopping. If, for a fixed θ, we try to find the q that minimizes the expected number of events ${\mathbf{E}_{\theta }}[{\bar{N}_{{\tau ^{q}}}}]$, and, as is customary in sequential analysis, we approximate the minimum by ignoring the very small part, we see that the expression is minimized by maximizing the denominator ${\mathbf{E}_{\theta }}\left[\ln \left({q_{(1)}}({I_{(1)}})/{p_{(1),{\theta _{0}}}}({I_{(1)}})\right)\right]$ over q. The maximum is achieved by ${q_{(1)}}={p_{\theta ,(1)}}$; the denominator then becomes the Kullback-Leibler divergence between two Bernoulli distributions. It follows that, under θ, the expected number of outcomes until rejection is minimized by ${q_{(1)}}={p_{\theta ,(1)}}$. Thus, in this case, we use the GROW ${S_{{\theta _{0}},t}^{\theta }}$ as test statistic. However, we still need to consider the fact that the real ${\mathcal{H}_{1}}$ is composite: as statisticians, we do not know the actual θ; we only know $0\lt \theta \le {\theta _{1}}$. A worst-case approach uses the q achieving
\[ \underset{q}{\max }\underset{\theta \le {\theta _{1}}}{\min }{\mathbf{E}_{\theta }}\left[\ln \left({q_{(1)}}({I_{(1)}})/{p_{(1),{\theta _{0}}}}({I_{(1)}})\right)\right]\]
since, repeating the reasoning leading to (A.1), this q should be close to achieving the min-max number of events until rejection, given by
\[ \underset{q}{\min }\underset{\theta \le {\theta _{1}}}{\max }{\mathbf{E}_{\theta }}[{\bar{N}_{{\tau ^{q}}}}]\]
But this just tells us to use the GROW E-variable relative to ${\mathcal{H}_{1}}$, which is what we were arguing for.
A.2 Continuous Time and Anytime Validity
In this section, we show the anytime validity of the AV logrank test. This is done via Ville’s inequality, for which it suffices to show that ${S_{{\theta _{0}}}^{q}}={({S_{{\theta _{0}},t}^{q}})_{t\ge 0}}$ is a nonnegative (super)martingale. To do so, we use the counting process formalism; a few definitions are in order. Only in this section, we assume knowledge of counting process theory [see 1, 8]. Denote, for $i=1,\dots ,m$, by ${\tilde{N}_{t}^{i}}=\mathbf{1}\{{T^{i}}\le t\}$ the counting process associated to each participant, and let ${y_{t}^{i}}$ be the at-risk process. For each participant, the censored process ${N_{t}^{i}}$, which is observed, is given by $\mathrm{d}{N_{t}^{i}}={y_{t}^{i}}\mathrm{d}{\tilde{N}_{t}^{i}}$; we use this convention to signify that ${N_{t}^{i}}={\textstyle\int _{0}^{t}}{y_{s}^{i}}\mathrm{d}{\tilde{N}_{s}^{i}}$. We define the sigma-algebra ${\mathcal{F}_{t}}:=\sigma ({N_{s}^{j}}:0\le s\le t,\hspace{2.5pt}j=1,\dots ,m)$, which, as usual, can be interpreted as the information in the study up to time t.
One of the results of counting process theory is that the processes ${N_{t}^{i}}-{\textstyle\int _{0}^{t}}{y_{s}^{i}}{\lambda _{s}^{i}}\mathrm{d}s$ are martingales, where, recall, ${y_{t}^{i}}=\mathbf{1}\{{X^{i}}\ge t\}$ is the at-risk process and ${\lambda _{t}^{i}}$ is the hazard function associated to ${T^{i}}$. In that case, ${A_{t}^{i}}={\textstyle\int _{0}^{t}}{y_{s}^{i}}{\lambda _{s}^{i}}\mathrm{d}s$ is called the compensator of ${N_{t}^{i}}$. The result that the AV logrank statistic is a martingale hinges specifically on this structure. Thus, any pattern of observation that preserves this martingale structure also preserves the martingale property of the AV logrank statistic, and consequently its type-I error guarantees. Andersen et al. [1, III.4] show exactly this under general patterns of incomplete observation, provided that the observation mechanisms are independent of the observations. With this in mind, in the following we only assume that the counting processes ${N_{t}^{i}}$ have compensators ${A_{t}^{i}}$ given by $\mathrm{d}{A_{t}^{i}}={y_{t}^{i}}{\lambda _{t}^{i}}\mathrm{d}t$.
The filtration $\mathcal{F}={({\mathcal{F}_{s}})_{s\ge 0}}$ is right-continuous and we can safely identify predictable processes with left-continuous processes. For some ${\theta _{0}}$, denote by ${\mathbf{P}_{0}}$ the distribution under which, for each $i=1,\dots ,m$, the hazard function for ${T^{i}}$ is ${\lambda _{t}^{i}}={\theta _{0}^{{g^{i}}}}{\lambda _{t}^{A}}$, where ${g^{i}}=0$ if $i\in A$ and ${g^{i}}=1$ if $i\in B$. Recall from Section 2 that, under the null, if participant i belongs to Group B, then ${\lambda _{t}^{i}}={\lambda _{t}^{B}}={\theta _{0}}{\lambda _{t}^{A}}$; otherwise, ${\lambda _{t}^{i}}={\lambda _{t}^{A}}$. Let ${q_{t}^{1}},\dots ,{q_{t}^{m}}$ be predictable processes such that ${\textstyle\sum _{i\le m}}{q_{t}^{i}}{y_{t}^{i}}=1$ a.s. for all t, that is, ${\{{q_{t}^{i}}\}_{i\in {\mathcal{R}_{t}}}}$ is, at each t, a probability distribution over the participants at risk at time t. Define the ratios ${r_{t}^{i}}={q_{t}^{i}}/{p_{{\theta _{0}},t}^{i}}$ and the predictable process ${S_{{\theta _{0}},{t^{-}}}^{q}}={\lim \nolimits_{s\uparrow t}}{S_{{\theta _{0}},s}^{q}}$. As such, at each t, the change $\mathrm{d}{S_{{\theta _{0}},t}^{q}}={S_{{\theta _{0}},t}^{q}}-{S_{{\theta _{0}},{t^{-}}}^{q}}$ of the AV logrank statistic ${S_{{\theta _{0}}}^{q}}$, given in (3.5), can be computed as
\[ \mathrm{d}{S_{{\theta _{0}},t}^{q}}=\sum \limits_{i\le m}{S_{{\theta _{0}},{t^{-}}}^{q}}({r_{t}^{i}}-1)\mathrm{d}{N_{t}^{i}},\]
because no two events happen simultaneously with positive probability. Since ${S_{{\theta _{0}},{t^{-}}}^{q}}$ is predictable, it is enough to prove that the process ${M_{t}}$ defined by $\mathrm{d}{M_{t}}={\textstyle\sum _{i\le m}}(1-{r_{t}^{i}})\mathrm{d}{N_{t}^{i}}$ is a martingale [see 8, Theorem 1.5.1]. Recall that ${\bar{y}_{t}^{A}}={\textstyle\sum _{i\in A}}{y_{t}^{i}}$ and ${\bar{y}_{t}^{B}}={\textstyle\sum _{i\in B}}{y_{t}^{i}}$. Then both ${\bar{y}^{A}}$ and ${\bar{y}^{B}}$ are left-continuous processes.
Lemma A.1.
Let ${\{{q_{t}^{i}}\}_{i\le m}}$ be a collection of nonnegative left-continuous processes ${q^{i}}={({q_{t}^{i}})_{t\ge 0}}$ such that ${\textstyle\sum _{i\le m}}{y_{t}^{i}}{q_{t}^{i}}=1$ for all t. Let ${\{{p_{{\theta _{0}},t}^{i}}\}_{i\le m}}$ be the collection of processes given by
\[ {p_{{\theta _{0}},t}^{i}}=\frac{{\theta _{0}^{{g^{i}}}}{y_{t}^{i}}}{{\bar{y}_{t}^{A}}+{\theta _{0}}{\bar{y}_{t}^{B}}}.\]
The process $M={({M_{t}})_{t\ge 0}}$ given by $\mathrm{d}{M_{t}}={\textstyle\sum _{i\le m}}(1-{r_{t}^{i}})\mathrm{d}{N_{t}^{i}}$ is a martingale under ${\mathbf{P}_{0}}$ with respect to the filtration $\mathcal{F}={({\mathcal{F}_{t}})_{t\ge 0}}$.
Proof.
It suffices to show that the compensator ${A_{t}}$ of ${M_{t}}$, given by $\mathrm{d}{A_{t}}={\textstyle\sum _{i\le m}}(1-{r_{t}^{i}}){y_{t}^{i}}{\lambda _{t}^{i}}\mathrm{d}t$, is zero; equivalently, that ${\textstyle\sum _{i\le m}}({r_{t}^{i}}-1){y_{t}^{i}}{\lambda _{t}^{i}}=0$. Define ${\bar{q}_{t}^{A}}={\textstyle\sum _{i\in A}}{y_{t}^{i}}{q_{t}^{i}}$ and ${\bar{q}_{t}^{B}}={\textstyle\sum _{i\in B}}{y_{t}^{i}}{q_{t}^{i}}$. Notice that by assumption ${\bar{q}_{t}^{A}}+{\bar{q}_{t}^{B}}=1$, and recall that, under the null, ${\lambda _{t}^{B}}={\theta _{0}}{\lambda _{t}^{A}}$. We can compute
\[\begin{aligned}{}\sum \limits_{i\le m}({r_{t}^{i}}-1){y_{t}^{i}}{\lambda _{t}^{i}}& =\sum \limits_{i\in A}{y_{t}^{i}}{\lambda _{t}^{A}}({r_{t}^{i}}-1)+\sum \limits_{i\in B}{y_{t}^{i}}{\lambda _{t}^{B}}({r_{t}^{i}}-1)\\ {} & ={\lambda _{t}^{A}}[({\bar{y}_{t}^{A}}+{\theta _{0}}{\bar{y}_{t}^{B}}){\bar{q}_{t}^{A}}-{\bar{y}_{t}^{A}}\\ {} & \hspace{1em}+({\bar{y}_{t}^{A}}+{\theta _{0}}{\bar{y}_{t}^{B}}){\bar{q}_{t}^{B}}-{\theta _{0}}{\bar{y}_{t}^{B}}]\\ {} & ={\lambda _{t}^{A}}[({\bar{y}_{t}^{A}}+{\theta _{0}}{\bar{y}_{t}^{B}})\stackrel{=1}{\overbrace{({\bar{q}_{t}^{A}}+{\bar{q}_{t}^{B}})}}-({\bar{y}_{t}^{A}}+{\theta _{0}}{\bar{y}_{t}^{B}})]\\ {} & =0,\end{aligned}\]
where we used the assumption that ${\textstyle\sum _{i\le m}}{y_{t}^{i}}{q_{t}^{i}}={\bar{q}_{t}^{A}}+{\bar{q}_{t}^{B}}=1$. As the compensator ${A_{t}}$ of ${M_{t}}$ is zero at each t, we conclude that ${M_{t}}$ is a martingale, as was to be shown. □
Our previous discussion and the preceding lemma have the following corollary as a consequence.
Corollary A.2.
${S_{{\theta _{0}}}^{q}}={({S_{{\theta _{0}},t}^{q}})_{t\ge 0}}$ is a nonnegative martingale with expected value equal to one.
Hence, Ville’s inequality holds for ${S_{{\theta _{0}}}^{q}}$, which implies that
\[ {\mathbf{P}_{0}}\{{S_{{\theta _{0}},t}^{q}}\ge 1/\alpha \hspace{2.5pt}\text{for some}\hspace{5pt}t\ge 0\}\le \alpha .\]
This implies the anytime validity of the test ${\xi _{{\theta _{0}}}^{q}}={({\xi _{{\theta _{0}},t}^{q}})_{t\ge 0}}$ given by the AV logrank test ${\xi _{{\theta _{0}},t}^{q}}=\mathbf{1}\{{S_{{\theta _{0}},t}^{q}}\ge 1/\alpha \}$.
A.3 Ties
The purpose of this section is twofold. Firstly, we prove Lemma 3.3. Secondly, we show that the conditional likelihood given in Section 3.3 indeed approximates the true conditional partial likelihood ratio under any distribution such that the hazard ratio is ${\theta _{1}}$.
Our general strategy in this case is similar to the one undertaken in the continuous-monitoring case: we build a test martingale with respect to a filtration ${\mathcal{G}^{\mathrm{\star }}}$, and use Ville’s inequality to derive anytime-valid type-I error guarantees. Define, for each $k=1,2,\dots \hspace{0.1667em}$, the sigma-algebra ${\mathcal{G}_{k}}$ generated by all observations made in times ${t_{1}},\dots ,{t_{k}}$, that is, ${\mathcal{G}_{k}}=\sigma ({N_{{t_{l}}}^{i}},{\tilde{N}_{{t_{l}}}^{i}}\hspace{2.5pt}:\hspace{2.5pt}i=1,\dots ,m;\hspace{2.5pt}l=1,\dots ,k)$, and the corresponding filtration $\mathcal{G}={({\mathcal{G}_{k}})_{k=1,2,\dots }}$. Under Cox’s proportional hazard model, conditionally on ${\mathcal{G}_{k-1}}$, our observations $\Delta {\bar{N}_{k}^{A}}$ and $\Delta {\bar{N}_{k}^{B}}$ are binomially distributed with parameters depending on the hazard function (see Lemma A.3 below). By conditioning both on ${\mathcal{G}_{k-1}}$ and on the total number of events $\Delta {\bar{N}_{k}}=\Delta {\bar{N}_{k}^{A}}+\Delta {\bar{N}_{k}^{B}}$, we use the likelihood of having observed $\Delta {\bar{N}_{k}^{B}}$, which follows Fisher’s noncentral hypergeometric distribution, as detailed in Corollary A.4. We gather these observations in the following two lemmas.
Lemma A.3.
Conditionally on ${\mathcal{G}_{k-1}}$, the following hold:
-
1. The number of events $\Delta {\bar{N}_{k}^{A}}$ has a binomial distribution with parameters ${\bar{y}_{k}^{A}}$ and ${p_{k}^{A}}$ where ${p_{k}^{A}}=1-\exp (-{\textstyle\int _{{t_{k-1}}}^{{t_{k}}}}{\lambda _{s}^{A}}\mathrm{d}s)$.
-
2. The number of events $\Delta {\bar{N}_{k}^{B}}$ has a binomial distribution with parameters ${\bar{y}_{k}^{B}}$ and ${p_{k}^{B}}$ where ${p_{k}^{B}}=1-\exp (-\theta {\textstyle\int _{{t_{k-1}}}^{{t_{k}}}}{\lambda _{s}^{A}}\mathrm{d}s)$ and θ is the hazard ratio.
Next, we use a standard result: given two binomially distributed random variables X and Y, the distribution of X conditionally on $X+Y$ is Fisher’s noncentral hypergeometric distribution. We apply this to $\Delta {\bar{N}_{k}^{A}}$ and $\Delta {\bar{N}_{k}^{B}}$ from the previous lemma and spell out the corresponding parameters in the following corollary.
Corollary A.4.
Let ${\mathcal{G}_{k-1}^{\mathrm{\star }}}={\mathcal{G}_{k-1}}\vee \sigma (\Delta {\bar{N}_{k}})$, and let ${p_{k}^{A}}$ and ${p_{k}^{B}}$ as in Lemma A.3. Define the odd ratios ${\omega _{k}^{A}}={p_{k}^{A}}/(1-{p_{k}^{A}})$, ${\omega _{k}^{B}}={p_{k}^{B}}/(1-{p_{k}^{B}})$ and the ratio ${\omega _{k}}={\omega _{k}^{B}}/{\omega _{k}^{A}}$. Then, conditionally on ${\mathcal{G}_{k-1}^{\mathrm{\star }}}$, the likelihood of having observed $\Delta {\bar{N}_{k}^{B}}$ events in group B is given by Fisher’s noncentral hypergeometric distribution ${p_{\mathrm{FNCH}}}(\Delta {\bar{N}_{k}^{B}}\hspace{2.5pt};\hspace{2.5pt}{\bar{y}_{k-1}^{B}},{\bar{y}_{k-1}^{A}},\Delta {\bar{N}_{k}},{\omega _{k}})$, where
\[\begin{aligned}{}& {p_{\mathrm{FNCH}}}({n^{B}};\hspace{2.5pt}{\bar{y}^{B}},{\bar{y}^{A}},n,\omega )\\ {} & \hspace{1em}=\frac{\left(\genfrac{}{}{0.0pt}{}{{\bar{y}^{B}}}{{n^{B}}}\right)\left(\genfrac{}{}{0.0pt}{}{{\bar{y}^{A}}}{n-{n^{B}}}\right){\omega ^{{n^{B}}}}}{{\textstyle\sum _{\max \{0,{n^{B}}-{\bar{y}^{B}}\}\le u\le \min \{{\bar{y}^{B}},{n^{B}}\}}}\left(\genfrac{}{}{0.0pt}{}{{\bar{y}^{B}}}{u}\right)\left(\genfrac{}{}{0.0pt}{}{{\bar{y}^{A}}}{{n^{B}}-u}\right){\omega ^{u}}}.\end{aligned}\]
Naively, one could use a partial likelihood ratio just as in the absence of ties to derive a sequential test. This, however, is not satisfactory because, in general, the parameter ${\omega _{k}}$ depends heavily on the unknown baseline hazard function ${\lambda ^{A}}$. Contrary to the general case, when the hazard ratio θ is one, the parameter ${\omega _{k}}=1$, and Fisher’s noncentral hypergeometric distribution reduces to the conventional hypergeometric distribution. With this observation at hand, if ${\{{q_{k}}\}_{k=1,2,\dots }}$ is a sequence of conditional distributions ${q_{k}}(\hspace{2.5pt}\cdot \hspace{2.5pt})$ on the possible values of $\Delta {\bar{N}_{k}^{B}}$, we can build a sequential test for (2.3), with its corresponding type-I error guarantee. We give the details in the following corollary, and subsequently point to a useful choice for q that approximates the real likelihood.
The choice of q for our statistic presented in Section 3.3 follows from an approximation of the parameter ω for small $\Delta {t_{k}}={t_{k}}-{t_{k-1}}$. As noted by [21], if ${\textstyle\int _{{t_{k-1}}}^{{t_{k}}}}{\lambda ^{A}}(s)\mathrm{d}s$ is small, then ${p_{k}^{A}}\approx {\lambda _{{t_{k-1}}}^{A}}\Delta {t_{k}}$ and ${p_{k}^{B}}\approx \theta {p_{k}^{A}}$. With these two approximations, ${\omega _{k}}\approx \theta $. This means that the choice ${q_{k}}(\Delta {\bar{N}_{k}^{B}})={p_{{\theta _{1}},k}}(\Delta {\bar{N}_{k}^{B}}):={p_{\mathrm{FNCH}}}(\Delta {\bar{N}_{k}^{B}};\hspace{2.5pt}{\bar{y}_{k}^{B}},\hspace{2.5pt}{\bar{y}_{k}^{A}},\hspace{2.5pt}\Delta {\bar{N}_{k}},\hspace{2.5pt}\omega ={\theta _{1}})$ approximates the real conditional likelihood under any alternative for which the true hazard ratio is ${\theta _{1}}$. Hence, the sequentially computed statistic
\[ {S_{k}^{{\theta _{1}}}}=\prod \limits_{l\le k}\frac{{p_{{\theta _{1}},l}}(\Delta {\bar{N}_{l}^{B}})}{{p_{1,l}}(\Delta {\bar{N}_{l}^{B}})}\]
approximates the true partial likelihood ratio of the data observed up to time ${t_{k}}$ in the presence of ties, and we recommend its use.
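The factors of ${S_{k}^{{\theta _{1}}}}$ can be evaluated directly from the formula for ${p_{\mathrm{FNCH}}}$ given above. The following Python sketch does so using SciPy’s `nchypergeom_fisher` distribution (available in SciPy 1.6 and later); it is meant as an illustration under the stated approximation, not as the safestats implementation.

```python
from scipy.stats import nchypergeom_fisher

def ties_factor(dN_B, dN, yB, yA, theta1):
    """One factor of S_k^{theta1} in the presence of ties: the Fisher noncentral
    hypergeometric likelihood of dN_B events in group B (out of dN events in total,
    with yB and yA participants at risk) under odds theta1, divided by the same
    likelihood under odds 1 (the central hypergeometric distribution)."""
    num = nchypergeom_fisher.pmf(dN_B, yA + yB, yB, dN, theta1)
    den = nchypergeom_fisher.pmf(dN_B, yA + yB, yB, dN, 1.0)
    return num / den

# The running statistic is the product of such factors over event times, e.g.
# S *= ties_factor(dN_B=2, dN=3, yB=40, yA=45, theta1=0.7).
```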
A.4 Bayesian Approach for Numerator
In Section 3.2 we showed how one can deal with alternatives such as ${\mathcal{H}_{1}}:\theta \ne {\theta _{0}}$ by learning θ from data, leading to a process ${q_{(1)}},{q_{(2)}},\dots $ with ${q_{(k)}}$ determined by a plug-in estimate of θ. Alternatively, it is also possible to use a Bayes predictive distribution based on a prior W on θ. If ${\mathbf{W}_{(k)}}=\mathbf{W}\hspace{2.5pt}|\hspace{2.5pt}{\mathbf{y}_{(1)}},\dots ,{\mathbf{y}_{(k)}}$ is the Bayes posterior on θ based on a prior W and the data up to time $t\lt {T^{(k)}}$, then
\[ {p_{\mathbf{W},(k)}}(\hspace{2.5pt}i\hspace{2.5pt}):=\int {p_{\theta ,(k)}}(\hspace{2.5pt}i\hspace{2.5pt})\mathrm{d}{\mathbf{W}_{(k)}}(\theta ),\]
where ${\mathbf{W}_{(1)}}=\mathbf{W}$. Hence, ${p_{\mathbf{W},(k)}}$ is the Bayesian predictive distribution. The resulting statistic ${S_{{\theta _{0}},t}^{\mathbf{W}}}$ is the result of multiplying the conditional probability mass functions ${p_{\mathbf{W},(k)}}$, and we obtain that
(A.2)
\[ {S_{{\theta _{0}},t}^{\mathbf{W}}}=\prod \limits_{k:{T^{(k)}}\le t}\frac{{p_{\mathbf{W},(k)}}({I_{(k)}})}{{p_{{\theta _{0}},(k)}}({I_{(k)}})}\]
is a Bayes factor between the Bayes marginal distribution based on W and ${\theta _{0}}$. This technique has been employed in sequential analysis; it is known as the method of mixtures [5, 31] and has been compared to the plug-in method discussed in the main text by [30]. We do not know of a prior for which (A.2) or the constituent products have an analytic expression, but it can certainly be implemented using, for example, Gibbs sampling.
As shown in Section 3 and discussed in Section 3.2, the use of any ${S_{{\theta _{0}},t}^{q}}$ instead of ${S_{{\theta _{0}},t}^{{\theta _{1}}}}$ does not compromise on safety: a test based on monitoring ${S_{{\theta _{0}}}^{q}}$ is anytime-valid, whether q makes reference to plug-in estimators or Bayes predictive distributions, no matter what prior W was chosen. The type-I error guarantee always holds, also when the prior is “misspecified”, putting most of its mass in a region of the parameter space far from the actual θ from which the data were sampled. Thus, our set-up is intimately related to the concept of luckiness in the machine learning theory literature [10] rather than to “pure” Bayesian statistics.
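Although we do not know of a prior for which (A.2) has an analytic expression, a discrete (grid) prior makes the Bayes predictive distribution a finite mixture that can be computed exactly. The following Python sketch implements the method of mixtures for the no-ties risk-set process under this assumption; the grid, prior, and inputs are illustrative.

```python
import numpy as np

def bayes_mixture_av_logrank(events_in_B, yA0, yB0, theta_grid, prior=None, theta0=1.0):
    """Method-of-mixtures AV logrank statistic with a discrete prior on the hazard ratio.
    events_in_B: sequence of 0/1 indicators saying whether each successive event fell in
    group B. The Bayes predictive probability of the next event falling in group B is the
    posterior-weighted average of theta * yB / (yA + theta * yB)."""
    thetas = np.asarray(theta_grid, dtype=float)
    w = np.full(len(thetas), 1 / len(thetas)) if prior is None else np.asarray(prior, float)
    yA, yB, S = yA0, yB0, 1.0
    for in_B in events_in_B:
        pB = thetas * yB / (yA + thetas * yB)   # P(event in B) under each grid point
        lik = pB if in_B else 1 - pB            # likelihood of the observed group label
        p_pred = float(w @ lik)                 # Bayes predictive probability
        p0 = theta0 * yB / (yA + theta0 * yB)
        S *= p_pred / (p0 if in_B else 1 - p0)  # one factor of S^W_{theta0}
        w = w * lik / p_pred                    # posterior update
        yB -= in_B
        yA -= 1 - in_B
    return S
```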
A.5 Details of Sample Size Comparison Simulations
In this section we lay out the procedure that we used to estimate the expected and maximum number of events required to achieve a predefined power as shown in Figure 4 and Figure 1 in Section 7. First we describe how we sampled the survival processes under a specific hazard ratio. We then describe how we estimated the maximum and expected sample size required to achieve a predefined power (80% in our case) for any of the test martingales that we considered (that of the exact AV logrank, its Gaussian approximation, and the prequential plugin variant). Finally, we explain how the same quantities for the classical logrank test and the O’Brien-Fleming procedure were obtained.
In order to simulate the order in which the events in a survival process happen, we used the sequential-multinomial risk-set process from Section 3. As before, we consider the general testing problem with ${\theta _{0}}=1$ and a minimal clinically relevant effect size ${\theta _{1}}\lt 1$, and we denote the true data-generating parameter by θ, typically $\theta \le {\theta _{1}}$. Under θ, the odds of the next event at the ith event time happening in Group B are $\theta {\bar{y}_{(i)}^{B}}:{\bar{y}_{(i)}^{A}}$; the odds change at each time step. Thus, simulating in which group the next event happens only takes a biased coin flip. For the problem of testing (3.6) with ${\theta _{0}}$ we fix the tolerated type-I error at $\alpha =0.05$ and the type-II error at $\beta =0.2$. For each test martingale ${S_{{\theta _{0}}}^{q}}$ of interest we first consider the stopping rule ${\tau ^{q}}=\inf \{k:{S_{{\theta _{0}},(k)}^{q}}\ge 1/\alpha \}$, that is, we stop as soon as ${S_{{\theta _{0}},(k)}^{q}}$ crosses the threshold $1/\alpha $. Recall that in the worst case $\theta ={\theta _{1}}$, the expected stopping time ${\tau ^{q}}$ is lowest when we use ${S_{{\theta _{0}},(k)}^{{\theta _{1}}}}$; see Appendix A.1.
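The following Python sketch illustrates this sampling scheme: it simulates one trajectory of the risk-set process under a hazard ratio θ and tracks the one-sided exact AV logrank statistic until it crosses $1/\alpha $. The initial risk-set sizes and the random seed are arbitrary illustrative choices, not the exact settings used for the figures.

```python
import numpy as np

rng = np.random.default_rng(2023)  # arbitrary seed

def simulate_stopping_time(theta, theta1, theta0=1.0, m_A=5000, m_B=5000, alpha=0.05):
    """Simulate one trajectory of the sequential-multinomial risk-set process under
    hazard ratio `theta` and return the number of events at which the one-sided exact
    AV logrank statistic first reaches 1/alpha (np.inf if it never does)."""
    yA, yB, S = m_A, m_B, 1.0
    for k in range(m_A + m_B):
        p_true = theta * yB / (yA + theta * yB)   # P(next event in group B) under theta
        in_B = bool(rng.random() < p_true)        # biased coin flip
        p1 = theta1 * yB / (yA + theta1 * yB)     # likelihood of the observation under theta1
        p0 = theta0 * yB / (yA + theta0 * yB)     # ... and under the null theta0
        S *= (p1 / p0) if in_B else ((1 - p1) / (1 - p0))
        yB -= in_B
        yA -= (not in_B)
        if S >= 1 / alpha:
            return k + 1
        if yA == 0 or yB == 0:
            break
    return np.inf
```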
To estimate the maximum number of events needed to achieve a predefined power with a given test martingale, we turned our attention to a modified stopping rule ${\tilde{\tau }^{q}}$. Under ${\tilde{\tau }^{q}}$ we stop at the first of two moments: either when our test martingale ${S_{{\theta _{0}},(k)}^{q}}$ crosses the threshold $1/\alpha $ (i.e., at τ) or once we have witnessed a predefined maximum number of events ${n_{\max }}$. More compactly, this means using the stopping rule ${\tilde{\tau }^{q}}$ given by ${\tilde{\tau }^{q}}=\min ({\tau ^{q}},{n_{\max }})$. In those cases in which the test based on the stopping rule ${\tau ^{q}}$ achieves a power higher than $1-\beta $, a maximum number of events ${n_{\max }}$ smaller than the initial size of the combined risk groups can be selected to achieve approximate power $1-\beta $ using the rule ${\tilde{\tau }^{q}}$.
A quick computation shows that ${n_{\max }}$ has the following property: it is the smallest number of events n such that, under the alternative, the probability of not having stopped after n events is at most β, that is, ${\mathbf{P}_{\theta }}\{{\tau ^{q}}\gt n\}\le \beta $ or, equivalently, ${\mathbf{P}_{\theta }}\{{\tau ^{q}}\le n\}\ge 1-\beta $.
More succinctly, ${n_{\mathrm{max}}}$ is the (approximate) $(1-\beta )$-quantile of the stopping time ${\tau ^{q}}$, which can be estimated experimentally in a straightforward manner.
To estimate ${n_{\max }}$ for given initial risk-set sizes ${m_{1}}$ and ${m_{0}}$, we sampled ${10^{4}}$ realizations of the survival process (under θ) using the method described at the beginning of this section. This gave us the same number of realizations of the stopping time ${\tau ^{q}}$. We then computed the $(1-\beta )$-quantile of the simulated first-passage-time distribution of ${\tau ^{q}}$ and reported it as an estimate of the number of events ${n_{\max }}$ in the ‘Maximum’ column of Figure 4.
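Continuing the sketch above, ${n_{\max }}$ can then be estimated as the empirical $(1-\beta )$-quantile of the simulated stopping times; in the sketch, runs that never reject are capped at the total number of participants. The number of repetitions matches the ${10^{4}}$ used in our simulations; everything else is illustrative.

```python
taus = np.array([simulate_stopping_time(theta=0.7, theta1=0.7) for _ in range(10_000)])
taus = np.minimum(taus, 10_000)                      # cap runs that exhausted the risk sets
n_max_hat = int(np.ceil(np.quantile(taus, 0.80)))    # estimate of the (1 - beta)-quantile
```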
We assessed the uncertainty in the estimate of ${n_{\mathrm{max}}}$ using the bootstrap. We performed 1000 bootstrap rounds on the sampled empirical distribution of ${\tau ^{q}}$ and found that the number of realizations that we sampled (${10^{4}}$) was high enough that plotting the uncertainty estimates was not meaningful relative to the scale of our plots. For this reason we omitted the error bars in Figure 4 and Figure 1.
In the “Mean” column of Figure 4 and Figure 1 we plot an estimate of the expected value of ${\tilde{\tau }^{q}}=\min ({\tau ^{q}},{n_{\max }})$. For this, we used the empirical mean of the stopping times, capped at ${n_{\max }}$, over the simulated sample; by construction, roughly 20% of these capped stopping times equal ${n_{\max }}$ itself. In the “Conditional Mean” column, we plot an estimate of the expected value of ${\tilde{\tau }^{q}}$ given ${\tilde{\tau }^{q}}\lt {n_{\max }}$, i.e., the expected stopping time given that we stop early (and hence reject the null).
For comparison, we also show the number of events that one would need under the Gaussian non-sequential approximation of Schoenfeld [32], and under the continuous-monitoring version of the O’Brien-Fleming procedure. In order to judge Schoenfeld’s approximation, we report the number of events required to achieve 80% power. This is equivalent to treating the logrank statistic as if it were normally distributed, and rejecting the null hypothesis using a z-test for a fixed number of events. The power analysis of this procedure is classic, and the number of events required is ${n_{\max }^{S}}=4{({z_{\alpha }}+{z_{\beta }})^{2}}/{\log ^{2}}{\theta _{1}}$, where ${z_{\alpha }}$ and ${z_{\beta }}$ are the α- and β-quantiles of the standard normal distribution. In the case of the continuous-monitoring version of the O’Brien-Fleming procedure, we estimated the number of events ${n_{\max }^{OF}}$ needed to achieve 80% power as follows. For each experimental setting $({m^{A}},{m^{B}},\theta )$, we generated ${10^{4}}$ realizations of the survival process under θ and computed the corresponding trajectories of the logrank statistic. For each possible value n of ${n_{\max }^{OF}}$, we computed the fraction of trajectories for which the O’Brien-Fleming procedure correctly stopped when used with the maximum number of events set to n. We report as an estimate of the true ${n_{\max }^{OF}}$ the first value of n for which this fraction is higher than 80%, our predefined power.
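Schoenfeld’s formula is easy to evaluate; the following Python sketch does so, rounding up to the next integer (an assumption on our part about the rounding convention). For instance, `schoenfeld_n_events(0.7)` returns 195, matching the fixed sample size $n(0.7)=195$ listed in the caption of Figure 4.

```python
import numpy as np
from scipy.stats import norm

def schoenfeld_n_events(theta1, alpha=0.05, beta=0.20):
    """Schoenfeld's approximate number of events for a one-sided level-alpha test with
    power 1 - beta against hazard ratio theta1 under balanced allocation, rounded up.
    Note: (z_alpha + z_beta)^2 equals (z_{1-alpha} + z_{1-beta})^2."""
    z = norm.ppf(1 - alpha) + norm.ppf(1 - beta)
    return int(np.ceil(4 * z**2 / np.log(theta1) ** 2))
```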
B Covariates: The Full Cox Proportional Hazards E-Variable
In the concluding Section 8.1, we indicated how to extend our approach to the full Cox proportional hazards model with covariates, using the Universal Inference (‘running MLE’) approach. Here we show that we may also proceed using the Reverse Information Projection (RIPr) approach pioneered by [9]. Our starting point is the partial likelihood (8.1).
Since the RIPr approach inevitably requires the use of prior distributions on the parameters $\boldsymbol{\beta }\in {\mathbb{R}^{d}}$ appearing in ${\mathcal{H}_{0}}$, it is more convenient to also treat composite alternatives ${\mathcal{H}_{1}}$ in a Bayesian manner as described in Appendix A.4 rather than using the prequential plug-in approach of Section 3.2, so for simplicity, below we will only describe this approach. To be sure though, we may use the plug-in approach here as well, if we are so inclined.
B.1 E-Variables and Martingales
Let W be a prior distribution on $\boldsymbol{\beta }\in {\mathbb{R}^{d}}$ for some $d\gt 0$. (W may be degenerate, i.e., put mass one on a specific parameter vector ${\beta _{1}}$, and this will allow us to deal with the analogue of the one-sided tests of Section 2 rather than just ‘learning the alternative distribution’ tests of Section 3.2.) For each such W, we let ${q_{\mathbf{W},\theta ,(k)}}$, our analogue of (3.3), be the probability distribution on ${\mathcal{R}_{(k)}}$ defined by
\[ {q_{\mathbf{W},\theta ,(k)}}(\hspace{2.5pt}i\hspace{2.5pt}):=\int {p_{\boldsymbol{\beta },\theta ,(k)}}(\hspace{2.5pt}i\hspace{2.5pt})\mathrm{d}\mathbf{W}(\boldsymbol{\beta }),\]
with ${p_{\boldsymbol{\beta },\theta ,(k)}}$ as in (8.1). Consider a measure ρ on ${\mathbb{R}^{d}}$ (e.g., Lebesgue or some counting measure), let $\mathcal{W}$ be the set of all distributions on ${\mathbb{R}^{d}}$ which have a density relative to ρ, and let ${\mathcal{W}^{\circ }}\subset \mathcal{W}$ be any convex subset of $\mathcal{W}$ (we may take ${\mathcal{W}^{\circ }}=\mathcal{W}$, for example). We define ${\tilde{q}_{\gets \mathbf{W},{\theta _{0}},(k)}}$ to be the reverse information projection [18] (RIPr) of ${q_{\mathbf{W},{\theta _{1}},(k)}}$ on $\{{q_{{\mathbf{W}^{\circ }},{\theta _{0}},(k)}}:{\mathbf{W}^{\circ }}\in {\mathcal{W}^{\circ }}\}$, defined as the probability distribution on ${\mathcal{R}_{(k)}}$ such that
\[\begin{aligned}{}& \mathrm{KL}({q_{\mathbf{W},{\theta _{1}},(k)}}\| {\tilde{q}_{\gets \mathbf{W},{\theta _{0}},(k)}})\\ {} & \hspace{1em}=\underset{{\mathbf{W}^{\circ }}\in {\mathcal{W}^{\circ }}}{\inf }\hspace{2.5pt}\mathrm{KL}({q_{\mathbf{W},{\theta _{1}},(k)}}\| {q_{{\mathbf{W}^{\circ }},{\theta _{0}},(k)}}).\end{aligned}\]
We know from Li [18] and Grünwald et al. [9] that ${\tilde{q}_{\gets \mathbf{W},{\theta _{0}},(k)}}$ exists for each k. Grünwald et al. [9] show, in the context of E-variables for $2\times 2$ contingency tables, that the infimum in the previous display is in fact achieved by some distribution ${\mathbf{W}^{\mathrm{\star }}}$ with finite support on ${\mathbb{R}^{d}}$ if the random variables ${y_{(k)}^{1}},\dots ,{y_{(k)}^{m}}$ constituting our random process have a finite range. For given hazard ratios ${\theta _{0}},{\theta _{1}}\gt 0$, let
(B.1)
\[ {R_{\mathbf{W},{\theta _{0}},(k)}^{{\theta _{1}}}}=\frac{{q_{\mathbf{W},{\theta _{1}},(k)}}({I_{(k)}})}{{q_{\gets \mathbf{W},{\theta _{0}},(k)}}({I_{(k)}})}.\]
Theorem B.1 (Corollary of Theorem 1 from Grünwald et al. [9]).
For every prior W on ${\mathbb{R}^{d}}$, for all $\boldsymbol{\beta }\in {\mathbb{R}^{d}}$,
\[\begin{aligned}{}& {\mathbf{E}_{\boldsymbol{\beta },{\theta _{0}}}}[{R_{\mathbf{W},{\theta _{0}},(k)}^{{\theta _{1}}}}\mid {\mathbf{z}_{(l)}^{i}},{y_{(l)}^{i}};\hspace{2.5pt}i=1,\dots ,m;\hspace{2.5pt}l=1,\dots ,k]\\ {} & \hspace{1em}=\sum \limits_{i\in {\mathcal{R}_{(k)}}}{q_{\boldsymbol{\beta },{\theta _{0}},(k)}}(i)\frac{{q_{\mathbf{W},{\theta _{1}},(k)}}(i)}{{q_{\gets \mathbf{W},{\theta _{0}},(k)}}(i)}\le 1\end{aligned}\]
so that ${R_{\mathbf{W},{\theta _{0}},(k)}^{{\theta _{1}}}}$ is an E-variable conditionally on ${\mathbf{z}_{(l)}^{i}}$, ${y_{(l)}^{i}}$ with $i=1,\dots ,m$; $l=1,\dots ,k$.
Note that the result does not require the prior W to be well specified in any way: under any null distribution indexed by $(\boldsymbol{\beta },{\theta _{0}})$, even if $\boldsymbol{\beta }$ is completely unrelated to W, ${R_{\mathbf{W},{\theta _{0}},(k)}^{{\theta _{1}}}}$ is an E-variable conditional on past data.
In particular, since the result holds for arbitrary priors, it holds, at the kth event time, for the Bayesian posterior ${\mathbf{W}_{k+1}}={\mathbf{W}_{1}}\mid {\mathbf{z}_{(l)}^{i}},{y_{(l)}^{i}}$; $i=1,\dots ,m$; $l=1,\dots ,k$, based on an arbitrary prior ${\mathbf{W}_{1}}$ with density ${w_{1}}$; i.e., the density of ${\mathbf{W}_{k+1}}$ is given by
\[ {w_{k+1}}(\boldsymbol{\beta })\propto \prod \limits_{l\le k}{q_{\boldsymbol{\beta },\theta ,(l)}}({I_{(l)}}){w_{1}}(\boldsymbol{\beta }).\]
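For concreteness, here is a minimal sketch (our own, purely illustrative) of this posterior update on a discretized grid of β values for a single covariate; the specific form of the partial-likelihood term below is an assumption on our part (the usual Cox-type term with hazard ratio θ for the treatment group and coefficient β for the covariate), not a quotation of (8.1).

```python
# Illustrative sketch only: sequential Bayesian posterior over a coefficient
# grid, mirroring w_{k+1}(beta) ∝ prod_l q_{beta,theta,(l)}(I_(l)) w_1(beta).
# The partial-likelihood term is an assumed Cox-type form (single covariate z,
# treatment indicator g, hazard ratio theta) used only for illustration.
import numpy as np


def partial_lik_term(beta, theta, event_idx, risk_z, risk_g):
    """Probability that individual `event_idx` in the risk set has the event."""
    scores = theta ** risk_g * np.exp(beta * risk_z)
    return scores[event_idx] / scores.sum()


def posterior_update(weights, beta_grid, theta, event_idx, risk_z, risk_g):
    """One event time: multiply prior weights by the likelihood term, renormalize."""
    lik = np.array([partial_lik_term(b, theta, event_idx, risk_z, risk_g)
                    for b in beta_grid])
    new_w = weights * lik
    return new_w / new_w.sum()


# Toy usage: uniform prior on a grid, one event (index 2) among a risk set of 5.
beta_grid = np.linspace(-2.0, 2.0, 41)
weights = np.full(len(beta_grid), 1.0 / len(beta_grid))
risk_z = np.array([0.3, -1.2, 0.5, 0.0, 1.1])   # covariate values in the risk set
risk_g = np.array([1, 0, 1, 0, 1])              # treatment indicators
weights = posterior_update(weights, beta_grid, theta=0.7, event_idx=2,
                           risk_z=risk_z, risk_g=risk_g)
```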
In parallel to the discussion in Section 3.1, we can therefore, for each prior ${\mathbf{W}_{1}}$, construct a test martingale ${S_{k}}:={\textstyle\prod _{l\le k}}{R_{{\mathbf{W}_{l}},{\theta _{0}},(l)}^{{\theta _{1}}}}$ that “learns” $\boldsymbol{\beta }$ from the data, analogously to (A.2), and computes a new RIPr at each event time k.
B.2 Finding the RIPr
While it is not clear how to calculate the RIPr ${q_{\gets \mathbf{W},{\theta _{0}},(k)}}$ in general, it can be well approximated with the efficient algorithm designed by Li [18] and Li and Barron [17]. Their algorithm is computationally feasible as long as we restrict ${\mathcal{W}^{\circ }}$ to the set ${\mathcal{W}_{\delta }^{\circ }}$ of all priors W for which ${\min _{i\in {\mathcal{R}_{(k)}}}}{q_{\mathbf{W},{\theta _{0}},(k)}}(i)\ge \delta $, for some $\delta \gt 0$. In that case, when run for M steps, the algorithm achieves an approximation error of $O(\ln (1/\delta )/M)$, where each step is linear in the dimension d. Since the approximation error is only logarithmic in $1/\delta $, we can take a very small value of δ, which makes the restriction mild. Exploring whether the Li-Barron algorithm really allows us to compute the RIPr for the Cox model, and hence ${R_{{\mathbf{W}_{k}},{\theta _{0}},(k)}^{{\theta _{1}}}}$, in practice is a major goal for future work.
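To give a feel for what such an approximation could look like, the following sketch (our own; a schematic greedy-mixture routine in the spirit of Li and Barron, not a validated implementation of their algorithm) approximates the best mixture of candidate null distributions for a given target distribution on a finite outcome space.

```python
# Schematic greedy mixture approximation in the spirit of Li-Barron:
# approximately minimize KL(P || Q) over mixtures Q of candidate distributions,
# where P plays the role of q_{W,theta1,(k)} and the candidates play the role
# of q_{W°,theta0,(k)} for point-mass priors W° on a grid. Illustration only.
import numpy as np


def kl(p, q):
    """KL(P || Q) on a finite outcome space (assumes q > 0 wherever p > 0)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


def greedy_mixture(p, candidates, n_steps=50,
                   alpha_grid=np.linspace(0.01, 1.0, 100)):
    """Greedily build a mixture of `candidates` that approximates p in KL."""
    q = min(candidates, key=lambda c: kl(p, c)).copy()  # best single component
    for _ in range(n_steps):
        best = (kl(p, q), q)
        for c in candidates:
            for a in alpha_grid:
                mix = (1 - a) * q + a * c
                d = kl(p, mix)
                if d < best[0]:
                    best = (d, mix)
        q = best[1]
    return q


# Toy usage with made-up distributions on 4 outcomes.
p = np.array([0.4, 0.3, 0.2, 0.1])
candidates = [np.array([0.7, 0.1, 0.1, 0.1]),
              np.array([0.1, 0.6, 0.2, 0.1]),
              np.array([0.25, 0.25, 0.25, 0.25])]
q_approx = greedy_mixture(p, candidates)
print(q_approx, kl(p, q_approx))
```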
B.3 Ties
While, in the absence of covariates, our E-variables allow for ties, corresponding to a likelihood ratio of Fisher’s noncentral hypergeometric distributions (see Section 3.3), the situation is not so simple in the presence of covariates. Although deriving the appropriate extension of the noncentral hypergeometric partial likelihood is possible, one ends up with a hard-to-calculate formula [23]. Various approximations have been proposed in the literature [3, 7]. If these preserve the E-variable and martingale properties, they retain type-I error control under optional stopping and we could use them without problems. We do not know whether this is the case, however; for the time being, we recommend handling ties by putting the tied events in a worst-case order, leading to the smallest value of the E-variable of interest, as this is bound to preserve the type-I error guarantees.
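As an illustration of this worst-case recommendation, the sketch below (our own, hypothetical) enumerates all orderings of a small set of tied events, evaluates the product of per-event likelihood-ratio terms for each ordering under an assumed Cox-type partial likelihood, and keeps the smallest value; for large ties one would need a cheaper heuristic.

```python
# Hypothetical sketch: handle a tie by taking the ordering of the tied events
# that minimizes the resulting E-variable contribution. The per-event term is
# an assumed Cox-type partial likelihood ratio (theta1 vs theta0), used only
# to illustrate the "worst-case order" recommendation.
from itertools import permutations
import numpy as np


def event_prob(theta, beta, event, risk_z, risk_g, risk_ids):
    scores = {j: theta ** risk_g[j] * np.exp(beta * risk_z[j]) for j in risk_ids}
    return scores[event] / sum(scores.values())


def worst_case_tie_factor(tied, theta1, theta0, beta, risk_z, risk_g, risk_ids):
    worst = np.inf
    for order in permutations(tied):
        factor, risk = 1.0, set(risk_ids)
        for ev in order:  # process the tied events one by one in this order
            factor *= (event_prob(theta1, beta, ev, risk_z, risk_g, risk) /
                       event_prob(theta0, beta, ev, risk_z, risk_g, risk))
            risk.remove(ev)  # the event leaves the risk set
        worst = min(worst, factor)
    return worst


# Toy usage: two tied events (ids 0 and 2) among a risk set of 4.
risk_z = {0: 0.3, 1: -1.0, 2: 0.8, 3: 0.0}
risk_g = {0: 1, 1: 0, 2: 1, 3: 0}
print(worst_case_tie_factor([0, 2], theta1=0.7, theta0=1.0, beta=0.2,
                            risk_z=risk_z, risk_g=risk_g, risk_ids=[0, 1, 2, 3]))
```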
C Gaussian AV Logrank Test
In this section we heuristically derive the Gaussian AV logrank test of Section 4, and investigate the validity of the Gaussian approximation. In Appendix C.1, we show by simulation that this approximation is only valid when the allocation of participants to each group under investigation is balanced, that is, when ${m^{A}}={m^{B}}$. In Appendix C.2 we investigate numerically the sample size needed to reject the null hypothesis under both the exact AV logrank test and its Gaussian approximation.
We start with the derivation of (4.3). For this we use (local) asymptotic normality of the Z-score (4.1). Under the null distribution, ${Z_{k}}$ from (4.1) has an asymptotic standard Gaussian distribution. Under any alternative distribution under which the hazard ratio is θ, Schoenfeld [32] showed that, in the absence of ties, the Z-statistic also asymptotically follows a Gaussian distribution with unit variance, but this time with mean ${\mu _{1}^{\mathrm{\star }}}$ given by
\[ {\mu _{1}^{\mathrm{\star }}}=\frac{{\textstyle\sum _{i\le k}}{E_{i}^{B}}(1-{E_{i}^{B}})}{\sqrt{{\textstyle\sum _{i\le k}}{E_{i}^{B}}(1-{E_{i}^{B}})}}\log (\theta ).\]
Here, ‘asymptotically’ means in the limit ${m^{A}}\to \infty $, ${m^{B}}=\gamma {m^{A}}$ for some fixed γ, $k\to \infty $, and $\log \theta =O({({m^{A}}+{m^{B}})^{-1/2}})$. Inspecting the proof of Schoenfeld’s asymptotic result, we find that it relies on two conditions: (1) that the hazard ratio ${\theta _{1}}$ under the alternative is close enough to one so that a first-order Taylor approximation around ${\theta _{0}}=1$ is adequate; and (2) that the expected number of events ${E_{k}^{B}}$ stays approximately constant over time, that is, close to the initial allocation proportion ${E_{1}^{B}}={m^{B}}/({m^{B}}+{m^{A}})$. This indicates that the asymptotic approximation is reasonable when ${\theta _{1}}$ is close to 1 and both initial risk sets are large in comparison to the number of events witnessed. Notice that in this regime of large risk sets the multiplicity correction in ${V_{k}}$ is also negligible.
Note that ${\mu _{1}^{\mathrm{\star }}}$ depends on more than the summary statistic ${Z_{k}}$. In the case that the number of observed events is much smaller than the initial risk set sizes (i.e., additionally we require $k/{m^{A}}\to 0$), the mean ${\mu _{1}^{\mathrm{\star }}}$ under the alternative can be further approximated by
(C.1)
\[ {\mu _{1}^{\mathrm{\star }}}\approx \sqrt{{\bar{N}_{k}}}{\mu _{1}}=\sqrt{{\bar{N}_{k}}}\sqrt{\frac{{m^{B}}{m^{A}}}{{({m^{B}}+{m^{A}})^{2}}}}\log (\theta ),\]
where ${\bar{N}_{k}}$ is the total number of observations up until time ${t_{k}}$, so that the resulting approximation only depends on summary statistics. It is exactly this value ${\mu _{1}}$ that we use in the Gaussian AV logrank test.
This raises the question whether a sequential Gaussian approximation is sensible for the logrank statistic: a priori it is not at all clear whether Schoenfeld’s asymptotic fixed-sample result still provides a reasonable approximation for the partial likelihood ratio under optional stopping. We now investigate this question empirically (as remarked by a referee, it may be that the techniques of [50] on time-uniform central limit theory could be extended to investigate this more rigorously, but the details being far from evident, we have refrained from doing so). Define the logrank statistic per observation time, that is, the Z-statistic ${Z_{i}}$ computed from the ${O_{i}}$ events observed at the ith observation time.
We investigate whether the exact AV logrank statistic behaves similarly to the Gaussian likelihood ratio
\[\begin{aligned}{}{S^{\prime \hspace{0.1667em}G}_{k}}& =\prod \limits_{i\le k}\frac{{\phi _{{\mu _{1}}\sqrt{{O_{i}}}}}({Z_{i}})}{{\phi _{{\mu _{0}}}}({Z_{i}})}\\ {} & =\exp \left(-\frac{1}{2}\sum \limits_{i\le k}\left\{{O_{i}}{\mu _{1}^{2}}-2{\mu _{1}}\sqrt{{O_{i}}}{Z_{i}}\right\}\right),\end{aligned}\]
where, for ${\theta _{0}}=1$, we have ${\mu _{0}}=0$ and ${\mu _{1}}=\log (\theta )\sqrt{{m^{B}}{m^{A}}/{({m^{A}}+{m^{B}})^{2}}}$, and ${\phi _{\mu }}$ is the Gaussian density with unit variance and mean μ. Note that this statistic still depends on elements of the full data set, so more approximations are needed. Writing out the Gaussian densities, we use that in the limit of large risk sets ${p_{i}^{B}}\approx {m^{B}}/({m^{A}}+{m^{B}})$ and that, consequently, ${V_{i}}\approx \sqrt{{O_{i}}\frac{{m^{A}}{m^{B}}}{{({m^{A}}+{m^{B}})^{2}}}}$. These approximations are valid under Schoenfeld’s second condition. With these approximations at hand, the Z-statistic is approximated by
\[ {Z_{k}}\approx \frac{{\textstyle\sum _{i\le k}}\left\{{O_{i}^{B}}-{E_{i}^{B}}\right\}}{\sqrt{{\textstyle\sum _{i\le k}}{O_{i}}\frac{{m^{A}}{m^{B}}}{{({m^{A}}+{m^{B}})^{2}}}}},\]
and consequently
\[ {S^{\prime \hspace{0.1667em}G}_{k}}\approx {S_{k}^{G}}=\exp \left(-\frac{1}{2}{\bar{N}_{k}}{\mu _{1}^{2}}+\sqrt{{\bar{N}_{k}}}{\mu _{1}}{Z_{k}}\right),\]
where ${S_{k}^{G}}$ is as in (4.3). In Figure 5 we show, in the case of balanced allocation, that the Gaussian approximation ${S_{k}^{G}}$ at a single event time is very similar to the exact ${S_{{\theta _{0}},(k)}^{{\theta _{1}}}}$ for alternative hazard ratios ${\theta _{1}}$ between 0.5 and 2.
Figure 5
For balanced allocation (${m^{A}}={m^{B}}$), ${S^{\prime \hspace{0.1667em}G}_{1}}$ is very similar to ${S_{(1)}}$ when $0.5\le {\theta _{1}}\le 2$. Here ${\theta _{0}}=1$, ${\mu _{0}}=0$, and ${\mu _{1}}={\mu _{1}}({\theta _{1}})$ as in (C.1). Note that both axes are logarithmic.
Figure 6
Expected value of the increments of the Gaussian AV logrank statistic as a function of the hazard ratio ${\theta _{1}}$. For balanced allocation ${R_{i}^{G}}$ is an E-variable, but it is not for unbalanced allocation. The risk set can also start out balanced but become unbalanced; this is unlikely under the null hypothesis (see Appendix C.1). Note that the x-axis is logarithmic.
C.1 Safety Only for Balanced Allocation
In order to assess whether the Gaussian AV logrank test is indeed AV, that is, whether its type-I error guarantees hold, we inspect whether the expected value of each of its multiplicative increments is below 1. In relation to our discussion in Section 3.1, this would imply that all multiplicative increments are conditional E-variables and that the resulting test is, at least approximately, a test martingale. Figure 6 shows the expectation of these increments as a function of the hazard ratio for several initial allocation ratios. In the case of balanced 1:1 allocation, the increments of ${S_{k}^{G}}$ are E-variables, since their expectation is 1 or smaller. However, in the case of unbalanced 2:1 or 3:1 allocation and designs with hazard ratio ${\theta _{1}}\lt 1$, they are not. Of course, even if the initial allocation is balanced, it can become unbalanced over time. Figure 6 shows that for designs outside the range $0.5\le {\theta _{1}}\le 2$ the deviations from expectation 1 can be problematic. Hence we do not recommend using the Gaussian approximation to the logrank statistic for unbalanced designs, nor for designs with ${\theta _{1}}\lt 0.5$ or ${\theta _{1}}\gt 2$. For balanced designs with $0.5\le {\theta _{1}}\le 2$, we found that in practice the approximation is safe to use, the reason being that scenarios in which the allocation becomes highly unbalanced after some time (e.g. ${y_{i}^{B}}=80$, ${y_{i}^{A}}=20$) are extremely unlikely to occur under the null.
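The following sketch (our own) makes this check concrete for the per-event increment of the approximate statistic ${S^{\prime \hspace{0.1667em}G}_{k}}$ derived above, with a single event and no ties; this is not necessarily the exact quantity plotted in Figure 6, but it displays the same qualitative behavior.

```python
# Illustrative check (our own, under the single-event/no-ties setup above):
# exact expected value under the null (theta0 = 1) of the Gaussian increment
# exp(-mu1^2/2 + mu1 * Z_i), with Z_i = (O_i^B - E_i^B) / sqrt(p0 (1 - p0))
# and p0 = m^B / (m^A + m^B) the initial allocation proportion.
import numpy as np


def expected_increment(theta1, mA, mB, yA, yB):
    """E_null of the Gaussian increment for current risk sets yA, yB."""
    p0 = mB / (mA + mB)
    mu1 = np.log(theta1) * np.sqrt(p0 * (1 - p0))
    sigma0 = np.sqrt(p0 * (1 - p0))
    eB = yB / (yA + yB)                      # P(event in group B) under the null
    z = lambda oB: (oB - eB) / sigma0
    lr = lambda oB: np.exp(-0.5 * mu1 ** 2 + mu1 * z(oB))
    return (1 - eB) * lr(0) + eB * lr(1)


# Balanced 1:1 vs unbalanced 2:1 initial allocation, design theta1 = 0.5.
# The first value comes out just below 1, the second just above 1 (cf. Figure 6).
print(expected_increment(0.5, mA=100, mB=100, yA=100, yB=100))
print(expected_increment(0.5, mA=100, mB=200, yA=100, yB=200))
```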
C.2 Sample Size
In this section we compare the distribution of the stopping time ${\tau ^{G}}:=\inf \{k:{\xi _{k}^{G}}=1\}$ of the Gaussian approximation to that of $\tau :=\inf \{k:{\xi _{k}}=1\}$ for the exact test. We use tests with tolerable type-I error $\alpha =0.05$ and, thus, threshold $1/\alpha =20$ for both tests. In the previous section we showed that the Gaussian approximation to the AV logrank statistic is valid when the initial allocation is 1:1 and for values $0.5\le {\theta _{1}}\le 2$, where ${\theta _{1}}$ is the hazard ratio under the alternative. In these scenarios, we simulated a survival process from a distribution according to which the true data-generating hazard ratio is $\theta ={\theta _{1}}$ and sampled realizations of ${\tau ^{G}}$ and τ on the same data set. The results of the simulation are shown in Figure 7, where we plot the realizations of ${\tau ^{G}}$ against those of τ. We see that in most cases both tests reject at the same time (${\tau ^{G}}=\tau $), and that the approximation becomes better as ${\theta _{1}}$ moves closer to ${\theta _{0}}=1$ (Schoenfeld’s first condition). When the two tests do not reject at the same time, the Gaussian approximation errs on the conservative side. Deviations from a constant, large and balanced risk set do not seem to occur often for this range of hazard ratios; after all, the risk set needs to be large for the number of events required to detect hazard ratios in the range $0.5\le {\theta _{1}}\le 2$ to be observed at all.
Figure 7
Stopping times for the Gaussian and exact AV logrank tests under continuous monitoring (no ties) with threshold $1/\alpha =20$. The stopping times under the Gaussian approximation often coincide with the exact ones and are otherwise more conservative (see Appendix C.2). Note that both axes are logarithmic.
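To make the comparison in this appendix concrete, here is a minimal sketch (our own; exponential event times, no censoring and continuous monitoring are assumptions on our part, not the exact simulation setup used above) of the stopping time ${\tau ^{G}}$ of the Gaussian AV logrank test; the exact stopping time τ would be obtained analogously by replacing the Gaussian statistic with the exact AV logrank statistic.

```python
# Illustrative sketch (our own): stopping time of the Gaussian AV logrank test
# S_k^G = exp(-N_k mu1^2 / 2 + sqrt(N_k) mu1 Z_k) under continuous monitoring
# with threshold 1/alpha. Exponential event times, no censoring, no ties.
import numpy as np

rng = np.random.default_rng(1)


def gaussian_av_logrank_stopping_time(mA, mB, theta, theta1, alpha=0.05):
    # Event times: hazard 1 in group A, hazard theta in group B.
    times = np.concatenate([rng.exponential(1.0, mA),
                            rng.exponential(1.0 / theta, mB)])
    groups = np.concatenate([np.zeros(mA, dtype=int), np.ones(mB, dtype=int)])
    order = np.argsort(times)

    p0 = mB / (mA + mB)
    mu1 = np.log(theta1) * np.sqrt(p0 * (1 - p0))

    yA, yB = mA, mB
    num, var = 0.0, 0.0              # running sums of (O^B - E^B) and of variances
    for k, idx in enumerate(order, start=1):
        eB = yB / (yA + yB)          # expected fraction of the event in group B (null)
        oB = groups[idx]
        num += oB - eB
        var += eB * (1 - eB)
        z = num / np.sqrt(var)
        s_gauss = np.exp(-0.5 * k * mu1 ** 2 + np.sqrt(k) * mu1 * z)
        if s_gauss >= 1 / alpha:
            return k                 # reject after k events
        if oB:
            yB -= 1
        else:
            yA -= 1
    return None                      # never rejected


print(gaussian_av_logrank_stopping_time(mA=500, mB=500, theta=0.7, theta1=0.7))
```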