We introduce the anytime-valid (AV) logrank test, a version of the logrank test that provides type-I error guarantees under optional stopping and optional continuation. The test is sequential without the need to specify a maximum sample size or stopping rule, and allows for cumulative meta-analysis with type-I error control. The method can be extended to define anytime-valid confidence intervals. The logrank test is an instance of the martingale tests based on E-variables that have been recently developed. We demonstrate type-I error guarantees for the test in a semiparametric setting of proportional hazards, show explicitly how to extend it to ties and confidence sequences and indicate further extensions to the full Cox regression model. Using a Gaussian approximation on the logrank statistic, we show that the AV logrank test (which itself is always exact) has a similar rejection region to O’Brien-Fleming

The logrank test is arguably the most important tool for the statistical comparison of time-to-event data between two groups of participants. Our main focus is the case in which the two groups are the treatment and control groups of a randomized controlled trial; the outcomes of interest are event times, that is, the time elapsed until an event of interest occurs. The logrank test, in turn, uses a simplified version of the proportional hazards model of Cox [

In order to legitimate the use of sequential boundary decisions, asymptotic approximations over the study period have been developed for the logrank statistic [

Despite the profound impact that these methods have had in statistical practice, the requirement of a maximum sample size limits the utility of a promising but nonsignificant study once the maximum sample size is reached. Because of the design of such trials, extending them makes it impossible to control the type-I error. Moreover, the evidence gathered in new (possibly unplanned) trials cannot be added in a typical retrospective meta-analysis when the number of trials or the timing of the meta-analysis depends on the trial results. Such dependencies introduce accumulation bias and invalidate the assumptions of conventional statistical procedures in meta-analysis [

The main result of this work is the anytime-valid (AV) logrank test, an anytime-valid test for the statistical comparison of time-to-event data from two groups of participants. The AV logrank test uses the exact ratio of the sequentially computed Cox partial likelihood as test statistic. This is in contrast to the conventional logrank test, which can be interpreted as transforming a test statistic into a p-value whose distribution is, in all applications we are aware of, not determined exactly but rather approximated by a normal or a generalized beta distribution [

From a technical point of view, we show, under general patterns of incomplete observation, that under the composite null hypothesis our test statistic is a continuous-time martingale with expected value equal to one. Statistics with this sequential property are referred to as test martingales; they form the basis of anytime-valid tests [

In contrast to

The AV logrank test was developed with a specific application in mind, one that illustrates its usefulness. Some of the authors were involved in applying the AV logrank test to the continuous meta-analysis of seven coronavirus disease (COVID-19) clinical trials; the results are available as a living systematic review including code and summary data to reproduce the analysis [

We begin with Section

These results hinge on showing that the likelihood underlying Cox’ proportional hazards model can be used to define

We remark that once the definitions are in place, the technical results are mostly straightforward consequences of earlier work; in particular, of the work of Cox [

In addition to the main body of this article, we provide two appendices. We relegate to Appendix

We begin by describing the hypothesis that is being tested, the data that are available, and Cox’ proportional hazards model. We are interested in comparing the survival rates between two groups of participants, Group

We now turn to defining Cox’ partial likelihood

In this section the AV logrank test for (

This result can be readily obtained using the sequential-multinomial interpretation of Cox’ likelihood ratio. As we will see, in Section
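The sequential-multinomial view can be sketched in code. The following is a minimal illustration, not the authors' implementation. It assumes that, given the risk sets, each event falls in the treatment group with probability θ·y1/(y0 + θ·y1), where y1 and y0 are the numbers at risk in the treatment and control groups, and it multiplies the per-event likelihood ratios of an alternative `theta1` against the null value `theta0 = 1`. All function and variable names are hypothetical, and ties and censoring are ignored.

```python
from math import exp, log


def event_prob_treatment(theta, y_treat, y_ctrl):
    """Probability that the next event falls in the treatment group,
    given the risk-set sizes and the hazard ratio theta (assumed form)."""
    return theta * y_treat / (y_ctrl + theta * y_treat)


def av_logrank_path(events, y_treat, y_ctrl, theta1, theta0=1.0):
    """Running partial-likelihood ratio of theta1 against theta0.

    events: sequence of 1 (event in treatment) or 0 (event in control),
            in order of occurrence, without ties and without censoring.
    Returns the value of the statistic after each event.
    """
    log_stat, path = 0.0, []
    for group in events:
        p1 = event_prob_treatment(theta1, y_treat, y_ctrl)
        p0 = event_prob_treatment(theta0, y_treat, y_ctrl)
        if group == 1:
            log_stat += log(p1 / p0)
            y_treat -= 1  # the participant leaves the treatment risk set
        else:
            log_stat += log((1 - p1) / (1 - p0))
            y_ctrl -= 1  # the participant leaves the control risk set
        path.append(exp(log_stat))
    return path


# Illustration: 10 participants per arm, alternative theta1 = 0.5.
path = av_logrank_path([0, 0, 1, 0, 0, 0, 1, 0], 10, 10, theta1=0.5)
alpha = 0.05
rejected = any(s >= 1 / alpha for s in path)  # reject once 1/alpha is reached
```

Working on the log scale avoids numerical underflow over long event sequences; the rejection rule compares the running statistic with 1/α at every event, which is what optional stopping permits.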

Under any distribution under which the hazard ratio is

Proposition

Under general patterns of incomplete observation, such as independent censoring or independent left truncation, the AV logrank test provides the same type-I error guarantees. To prove this we do need to go beyond the standard, discrete-time setting: we give an alternative proof of Proposition

The AV-logrank test is optimal—in a sense to be defined in the next section—among a large family of statistics. A second look at the proof of Proposition

The random variables

We instantiate this reasoning to our present problem. For the left-sided alternative (

In a similar fashion, a test can be constructed for two sided alternatives. Indeed, consider a testing problem of the form

So far, the alternative hypotheses that we have studied are of the form

Using only the data observed in

Although we will not do so in the experiments to follow, in principle we could also extend this approach to incorporate prior knowledge or guesses about the hazard ratio under

If we set

Now assume again that the model is well-specified (data are generated from a distribution satisfying proportional hazards for some

We show the number of events at which one can stop retaining 80% power at

In Figure

Here, we propose a sequential test for applications where events are not monitored continuously, but only at certain observation times. In this case, more than one event may be witnessed in the time interval between two observation moments. Since the order in which these observations are made would be unknown, our previous approaches fail to offer a satisfactory sequential test. Assume that we make observations at times

Just as in the proof of Proposition

In order to obtain an optimal test under a particular hazard ratio

In this section we present an approximation to the AV logrank test introduced in the previous section. This is based on a Gaussian approximation to the logrank statistic. The approximation is of interest for two reasons. First, in practical situations, only the logrank

Our general strategy is close in spirit to that followed in the construction of the exact AV logrank statistic in Section

We begin by recalling the definition of the

Fix some initial

We now compare the rejection regions defined by the Gaussian logrank test to those of continuously monitoring using

In this section we compare the rejection regions of the

We begin by specifying the rejection regions for both the Gaussian AV logrank test and that of the O’Brien-Fleming

The O’Brien-Fleming procedure is based on a Brownian-motion approximation to the sequentially computed logrank statistic

One might think that a maximum sample size is implicit also in the AV logrank test: if we start with

Left-sided rejection regions for continuous-monitoring using O’Brien-Fleming

The two regions of the

The benefit of a sequential approach is that if there is evidence that the hazard ratio is more extreme than was anticipated under the alternative hypothesis, we can detect that with fewer events than the maximum sample size. The left column of Figure

Null hypothesis rejections on simulated data. The rejection regions are the same as shown in Figure

It is known that

In this section, we address optional continuation and live meta-analysis—the continuous aggregation of evidence from multiple experiments. For instance, data could come from medical trials conducted in different hospitals or in different countries. In such cases, we compare a global null hypothesis

This result follows from a reduction to independent left-truncation—we refer to left-truncation in the specific sense defined by Andersen et al. [

Anytime-valid (AV) confidence sequences correspond to anytime-valid tests in the same way that confidence sets correspond to fixed-sample tests. Indeed, it is possible to “invert” a fixed-sample test to build a confidence set: the parameters of the null hypothesis that are not rejected by the test form a confidence set. Analogously, test martingales can be used to derive AV confidence sequences [
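The inversion idea can be illustrated with a small sketch. It is an assumption-laden example rather than the construction used in this article: for every candidate null value on a finite grid it forms a mixture likelihood ratio (uniform weights over the same grid, an illustrative choice not taken from the source) and keeps the candidates whose statistic has never reached 1/α. The per-event probability θ·y1/(y0 + θ·y1) and all names are hypothetical, and the guarantee only concerns candidates on the grid.

```python
from math import exp, log


def confidence_sequence(events, risk_sets, grid, alpha=0.05):
    """Running set of hazard ratios on `grid` that have not been rejected.

    events:    sequence of 1 (treatment) / 0 (control), one entry per event.
    risk_sets: sequence of (y_treat, y_ctrl) just before each event.
    Returns one sorted list of surviving grid values per event.
    """
    log_lik = {theta: 0.0 for theta in grid}
    alive = set(grid)
    sequence = []
    for group, (y1, y0) in zip(events, risk_sets):
        for theta in grid:
            p = theta * y1 / (y0 + theta * y1)
            log_lik[theta] += log(p if group == 1 else 1.0 - p)
        # Log of the uniform mixture of likelihoods (log-sum-exp for stability).
        top = max(log_lik.values())
        log_mix = top + log(sum(exp(v - top) for v in log_lik.values()) / len(grid))
        # A candidate is dropped for good once its mixture ratio reaches 1/alpha.
        alive = {t for t in alive if log_mix - log_lik[t] < log(1.0 / alpha)}
        sequence.append(sorted(alive))
    return sequence


# Illustration: 30 consecutive events in the control group, 50 per arm at start.
events = [0] * 30
risk_sets = [(50, 50 - i) for i in range(30)]
grid = [0.1, 0.25, 0.5, 1.0, 2.0, 4.0]
cs = confidence_sequence(events, risk_sets, grid)
```

Because rejected candidates never re-enter, the sets are nested over time, mirroring the running intersection that makes the sequence valid at every stopping time.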

Maximum, expected (Mean) number of events needed to reject the null hypothesis with

In this section, we investigate the power properties of the AV logrank test—we will study specific stopping times. We have seen that by observing arbitrarily long sequences of events the logrank test can achieve type-II errors that are as close to zero as desired. However, in practice it is necessary to plan for a maximum number of events

We introduced the AV logrank test, a version of the logrank test that retains type-I error guarantees under optional stopping and continuation. Extensive simulations reveal that, if we do engage in optional stopping, it is competitive with the classic logrank test (which neither allows in-trial optional stopping nor optional continuation) and

We end this paper by discussing how we avoid the issue of so-called

From a purely theoretical perspective, it is straightforward to extend the AV logrank test to the situation when time-dependent covariates are present, making the underlying model equivalent to the full Cox proportional hazards model. We sketch how to do this extending the notation of Section

For simplicity we focus on the central case that

Earlier approaches to sequential time-to-event analysis were also studied under scenarios of staggered entry, where each patient has their own event time (e.g., time to death since surgery), but patients do not enter the follow-up simultaneously (such that the risk set of, say, a two-day-after-surgery event changes when new participants enter and survive two days). Sellke and Siegmund [

In their summary of conditional power approaches in sequential analysis Proschan, Lan, and Wittes [

Throughout this paper, we encountered different instantiations of the AV logrank test. Specifically, for alternatives as introduced in Section

This depends on the situation. If we have a clearly defined one-sided minimum clinically relevant effect and we are convinced that the proportional hazards assumption holds, then this provides a strong motivation for using the point alternative version

In this section we provide proofs and remarks omitted from previous sections. In Appendix

Here we motivate the GROW criterion by showing that it minimizes, in a worst-case sense, the expected number of events needed before there is sufficient evidence to stop. Let

The argument of Breiman [

With (

In this section, we show the anytime validity of the AV logrank test. This is done via Ville’s inequality for which it suffices to show that
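For reference, Ville's inequality in its standard form can be written as follows. The notation is generic and need not match the symbols used elsewhere in this article: M denotes a nonnegative supermartingale with initial value one (with right-continuous paths in continuous time).

```latex
% Ville's inequality for a nonnegative supermartingale (M_t)_{t \ge 0} with M_0 = 1.
% Generic notation; the symbol M is an assumption of this illustration.
\[
  \mathbb{P}\Bigl( \exists\, t \ge 0 : M_t \ge \tfrac{1}{\alpha} \Bigr) \;\le\; \alpha
  \qquad \text{for every } \alpha \in (0, 1).
\]
```

A test that rejects as soon as such a statistic reaches 1/α therefore has type-I error at most α, regardless of when or why monitoring stops.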

One of the results of the counting process theory is that the processes

The filtration

It suffices to show that the compensator

Our previous discussion and the preceding lemma yield the following corollary.

Hence, Ville’s inequality holds for

The purpose of this section is twofold. Firstly, we prove Lemma

Our general strategy in this case is similar to the one undertaken in the continuous-monitoring case: we build a test martingale with respect to a filtration

The result is standard, and it follows from explicitly solving for

Next, we use a standard result: given two binomially distributed random variables
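The standard fact about two independent binomials can be checked numerically: conditional on their sum, the first count follows Fisher's noncentral hypergeometric distribution, whose parameter is the odds ratio of the two success probabilities. The sketch below uses hypothetical names and implements the probability mass function directly from its definition. It also forms a likelihood ratio between two values of the odds parameter; how that parameter relates to the hazard ratio is not reproduced here.

```python
from math import comb


def binom_pmf(k, n, p):
    """Binomial probability mass function."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def fisher_nchg_pmf(k, n1, n2, s, odds):
    """P(X1 = k | X1 + X2 = s) under Fisher's noncentral hypergeometric law."""
    lo, hi = max(0, s - n2), min(n1, s)
    if not lo <= k <= hi:
        return 0.0

    def weight(j):
        return comb(n1, j) * comb(n2, s - j) * odds**j

    return weight(k) / sum(weight(j) for j in range(lo, hi + 1))


def conditional_from_binomials(k, n1, p1, n2, p2, s):
    """The same conditional probability, computed from the two binomials."""
    lo, hi = max(0, s - n2), min(n1, s)
    if not lo <= k <= hi:
        return 0.0

    def joint(j):
        return binom_pmf(j, n1, p1) * binom_pmf(s - j, n2, p2)

    return joint(k) / sum(joint(j) for j in range(lo, hi + 1))


def tie_likelihood_ratio(k, n1, n2, s, odds1, odds0=1.0):
    """Likelihood ratio of odds1 against odds0 for k of the s tied events."""
    return fisher_nchg_pmf(k, n1, n2, s, odds1) / fisher_nchg_pmf(k, n1, n2, s, odds0)


# Illustration: the odds ratio of the two binomials links both computations.
n1, n2, s = 8, 6, 4
p1, p2 = 0.2, 0.35
odds = (p1 / (1 - p1)) / (p2 / (1 - p2))
gap = max(
    abs(fisher_nchg_pmf(k, n1, n2, s, odds) - conditional_from_binomials(k, n1, p1, n2, p2, s))
    for k in range(s + 1)
)
```

In this illustration `gap` should be zero up to floating-point error, since the two functions compute the same conditional probability.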

Naively, one could use a partial likelihood ratio just as in the absence of ties to derive a sequential test. This, however, is not satisfactory, because, in general, the parameter

The choice of

In Section

As shown in Section

In this section we lay out the procedure that we used to estimate the expected and maximum number of events required to achieve a predefined power as shown in Figure

In order to simulate the order in which the events in a survival process happen, we used the sequential-multinomial risk-set process from Section ^{th} event time happening in Group
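A simulation of this kind can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used for the figures: there is no censoring and there are no ties, each event falls in the treatment group with probability θ·y1/(y0 + θ·y1), and monitoring stops the first time the likelihood ratio of a hypothetical alternative `theta1` against hazard ratio one reaches 1/α. All names are hypothetical.

```python
import random
from math import log


def simulate_stopping_time(theta_true, theta1, m_treat, m_ctrl, alpha=0.05, rng=None):
    """Simulate one trial and return the number of events at which the
    statistic first reaches 1/alpha, or None if it never does."""
    rng = rng or random.Random()
    y1, y0 = m_treat, m_ctrl
    log_stat, threshold = 0.0, log(1 / alpha)
    n_events = 0
    while y1 + y0 > 0:
        p_true = theta_true * y1 / (y0 + theta_true * y1)  # data-generating law
        p_alt = theta1 * y1 / (y0 + theta1 * y1)           # alternative
        p_null = y1 / (y0 + y1)                            # hazard ratio one
        if rng.random() < p_true:
            log_stat += log(p_alt / p_null)
            y1 -= 1
        else:
            log_stat += log((1 - p_alt) / (1 - p_null))
            y0 -= 1
        n_events += 1
        if log_stat >= threshold:
            return n_events
    return None


def rejection_rate(theta_true, theta1, m_per_arm, runs, alpha=0.05, seed=0):
    """Fraction of simulated trials in which the null hypothesis is rejected."""
    rng = random.Random(seed)
    stops = [
        simulate_stopping_time(theta_true, theta1, m_per_arm, m_per_arm, alpha, rng)
        for _ in range(runs)
    ]
    return sum(t is not None for t in stops) / runs


# Illustration: rejection frequency under the null and under the alternative.
type_one_rate = rejection_rate(1.0, 0.5, m_per_arm=50, runs=2000)
power_estimate = rejection_rate(0.5, 0.5, m_per_arm=200, runs=200)
```

Collecting the returned stopping times instead of the rejection indicator gives an empirical stopping-time distribution, from which a mean or a high quantile of the number of events can be read off.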

To estimate the maximum number of events needed to achieve a predefined power with a given test martingale, we turned our attention to a modified stopping rule

A quick computation shows that

To estimate

We assessed the uncertainty in the estimation

In the “mean” column of Figure

For comparison, we also show the number of events that one would need under the Gaussian non-sequential approximation of Schoenfeld [
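For orientation, the widely used form of Schoenfeld's fixed-sample approximation can be computed as follows. The sketch assumes a one-sided test at level α and an allocation proportion `alloc` (1:1 by default); the function name and the default values are hypothetical and are not claimed to match the settings behind the figures.

```python
from math import ceil, log
from statistics import NormalDist


def schoenfeld_events(theta, alpha=0.05, power=0.8, alloc=0.5):
    """Approximate number of events needed by a fixed-sample logrank test
    to detect hazard ratio theta (one-sided level alpha, given power)."""
    z = NormalDist().inv_cdf  # standard normal quantile function
    return (z(1 - alpha) + z(power)) ** 2 / (alloc * (1 - alloc) * log(theta) ** 2)


# Illustration: hazard ratio 0.5, one-sided alpha = 0.05, 80% power.
n_events = ceil(schoenfeld_events(0.5))
```

The required number of events grows like 1/(log θ)², so hazard ratios close to one demand far more events than strong effects.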

In the concluding Section

Since the RIPr approach inevitably requires the use of prior distributions on the parameters

Let

Note that the result does not require the prior

In particular, since the result holds for arbitrary priors, it holds, at the

While it is not clear how to calculate the RIPr

Without covariates, our E-variables that allow for ties correspond to a likelihood ratio of Fisher’s noncentral hypergeometric distributions (see Section

In this section we heuristically derive the Gaussian AV logrank test of Section

We start with the derivation of (

This raises the question whether a sequential Gaussian approximation is sensible for the logrank statistic—a priori it is not at all clear whether Schoenfeld’s asymptotic fixed-sample result still provides a reasonable approximation for the partial likelihood ratio under optional stopping. We now investigate this question empirically (as remarked by a referee, it may be that the techniques of [

For balanced allocation (

Expected value of the increments of the Gaussian AV logrank statistic as a function of the hazard ratio

In order to assess whether the Gaussian AV logrank test is indeed AV, that is, whether the type-I error guarantees hold, we inspect whether the expected value of each of its multiplicative increments is below 1. In relation to our discussion in Section

In this section we compare the stopping time distribution

Stopping times for the Gaussian and exact AV logrank tests under continuous monitoring (no ties) with threshold

This work is part of the research program with project number 617.001.651, which is financed by the Dutch Research Council (NWO). We thank Henri van Werkhoven, Richard Gill, Wouter Koolen, Aaditya Ramdas, Rosanne Turner and two anonymous referees for useful remarks.