The Anytime-Valid Logrank Test: Error Control Under Continuous Monitoring with Unlimited Horizon

We introduce the anytime-valid (AV) logrank test, a version of the logrank test that provides type-I error guarantees under optional stopping and optional continuation. The test is sequential without the need to specify a maximum sample size or stopping rule, and allows for cumulative meta-analysis with type-I error control. The method can be extended to define anytime-valid confidence intervals. The logrank test is an instance of the martingale tests based on E-variables that have been recently developed. We demonstrate type-I error guarantees for the test in a semiparametric setting of proportional hazards and show how to extend it to ties, Cox' regression and confidence sequences. Using a Gaussian approximation on the logrank statistic, we show that the AV logrank test (which itself is always exact) has a similar rejection region to O'Brien-Fleming alpha-spending but with the potential to achieve 100% power by optional continuation. Although our approach to study design requires a larger sample size, the *expected* sample size is competitive by optional stopping.


INTRODUCTION
The logrank test is arguably the most important tool for the statistical comparison of time-to-event data between two groups of participants. Our main focus is when the two groups refer to the treatment and control groups in a randomized controlled trial; the outcomes of interest are event times, that is, the time elapsed until an event of interest occurs. The logrank test, in turn, uses a simplified version of the proportional hazards model of Cox [1972]. For a fixed sample size and under this model, Cox gave a simple but profound insight: inference can be performed using the partial likelihood of having observed the events in the particular order in which they were observed. To this end, the logrank test [Mantel, 1966, Peto and Peto, 1972], the score test associated with Cox' partial likelihood, is optimal for fixed sample size and a restricted alternative. Large-sample properties of the logrank test are known in very general settings [Tsiatis, 1981, Schoenfeld, 1981, Andersen et al., 1993]. Nevertheless, it is clear that the fixed-sample assumption can be overly restrictive. Indeed, due to ethical and practical constraints in human survival-time medical trials, interim analyses may be performed to terminate the study earlier than planned if needed. Consequently, it has been of fundamental importance to develop methods for the sequential analysis of time-to-event data in general and for the logrank test in particular.
In order to legitimize the use of sequential boundary decisions, uniform asymptotic approximations over the study period have been developed for the logrank statistic [Tsiatis, 1982, Sellke and Siegmund, 1983, Slud, 1984]. The results in this line of work show the convergence of the sequentially computed logrank statistic to a rescaled Brownian motion under very general censoring and participant-arrival patterns. When interim analyses are only performed at discrete times, the decision boundaries based on continuously monitoring the logrank statistic are known to be overly conservative. This deficiency is addressed by group-sequential and α-spending methods, which, using knowledge of the interim analysis times relative to a predefined maximum number of events, allow for tighter decision boundaries [Pocock, 1977, O'Brien and Fleming, 1979, Kim and DeMets, 1987]. These sequential methods allow several interim looks at the data to stop for efficacy (if the treatment proves to be beneficial) or futility (if the study is no longer likely to reach statistical significance).
Despite the profound impact that these methods have had in statistical practice, the requirement of a maximum sample size limits the utility of a promising but nonsignificant study once the maximum sample size is reached. By design, extending such a trial makes it impossible to control its type-I error. Moreover, the evidence gathered in new (possibly unplanned) trials cannot be added in a typical retrospective meta-analysis when the number of trials or the timing of the meta-analysis depends on the trial results. Such dependencies introduce accumulation bias and invalidate the assumptions of conventional statistical procedures in meta-analysis [Ter Schure and Grünwald, 2019]. In order to address these deficiencies, we look for flexible anytime-valid methods that provide type-I error control in two situations: (1) optional stopping, which refers to halting the experiment earlier or later than planned under arbitrary stopping rules, and (2) meta-analysis and optional continuation, which refers to the aggregation of evidence of possibly interdependent studies. Just as the existing methods, our approach is connected to early work by H. Robbins and collaborators [Darling and Robbins, 1967, Lai, 1976]. Most notably, existing approaches come with fixed stopping rules, which are not desirable in the use cases that are of our present interest. The details of the present approach are very different and, to some extent, as we will see, more straightforward.
The main result of this work is the anytime-valid (AV) logrank test, an anytime-valid test for the statistical comparison of time-to-event data from two groups of participants. The AV logrank test uses the exact ratio of the sequentially computed Cox partial likelihood as test statistic. The advantage of having an exact test manifests, for instance, in the case of unbalanced allocation, when the control and treatment groups start with different numbers of participants. In this case, α-spending approaches do not provide strong type-I error guarantees due to the approximations involved [Wu and Xiong, 2017]. The basic version of the AV logrank test is, however, exact; unbalanced allocation presents no difficulties.
From a technical point of view, we show, under general patterns of incomplete observation, that under the composite null hypothesis our test statistic is a continuous-time martingale with expected value equal to one. Statistics with this sequential property are referred to as test martingales; they form the basis of anytime-valid tests [Ramdas et al., 2020]. The AV logrank test is a concrete instance of such a test martingale derived from the recent theory of anytime-valid hypothesis testing based on E-processes [Henzi and Ziegel, 2021, Grünwald et al., 2020, Shafer, 2021, Wang and Ramdas, 2020]. At each time an event takes place, it takes on the form of a likelihood ratio between two multinomial distributions in a sampling-without-replacement setting. Relatedly, Lindon and Malek [2022] consider discrete test martingales for multinomial experiments that can be interpreted in a sampling-with-replacement setting. In contrast to p-values, an analysis based on E-processes can extend existing trials as well as inform the decision to start new trials and meta-analyses, while still controlling the type-I error rate. Type-I error control is retained even (i) if the E-process is monitored continuously and the trial is stopped early whenever the evidence is convincing, (ii) if the evidence of a promising trial is increased by extending the experiment, and (iii) if a trial result spurs a new trial with the intention to combine them in a meta-analysis.
The AV logrank test was developed with a specific application in mind, which illustrates its usefulness. Some of the authors were involved in applying the AV logrank test to the continuous meta-analysis of seven Coronavirus disease (COVID-19) clinical trials; the results are available as a living systematic review including code and summary data to reproduce the analysis [Ter Schure et al., 2022]. This analysis was performed concurrently with the trials in a so-called Anytime Live and Leading Interim (ALL-IN) meta-analysis [Ter Schure and Grünwald, 2022]. We remark that even in the presence of dependencies between the existence and size of the trials, the test based on the multiplication of the values of E-processes retains type-I error control as long as all trials test the same global null hypothesis, as was the case in the above application. This is generally useful if we want to combine the results of several trials in a bottom-up retrospective meta-analysis, where no top-down stopping rule can be enforced. It is even possible to obtain an interim meta-analysis result by combining interim results of ongoing trials by multiplication, stepping beyond the realm of existing sequential approaches.

Contributions and outline
We begin with Section 2, where we review the special instance of Cox' proportional hazards model for the two-group setting. There, we set the assumptions and notation used in the rest of the article. The definitions presented there are standard. In Section 3, we define the AV logrank test and prove that it is indeed anytime valid. We first do this for (a) the case with only a group indicator (no other covariates) and without simultaneous events (ties). There, we also discuss its optimality properties and extend it to (b) the case with ties and to (c) the case when one wants to learn the actual effect size from the data and/or use prior knowledge about the effect size via a Bayesian prior. The resulting version of the test keeps providing nonasymptotic type-I error control even if the priors are wildly misspecified, that is, if they predict very different data from the data we actually observe. These results hinge on showing that the likelihood underlying Cox' proportional hazards model can be used to define E-variables and test martingales. In Section 4, we show a Gaussian approximation to the AV logrank statistic that is useful in the common situation when only summary statistics are available. We then provide extensive computer simulations to compare the AV logrank test to the classic logrank test and α-spending approaches. In Section 4.1, we show that the exact AV logrank test has a similar rejection region to O'Brien-Fleming α-spending for those designs and hazard ratios where it is well-approximated by a Gaussian AV logrank test. While always needing a small amount of extra data in the design phase (the price for indefinite optional continuation), the expected sample sizes needed for true rejections remain very competitive. During the design phase of a study, we might want to design for a maximum sample size in order to achieve a certain power, but need a smaller sample size on average during the study since we can safely engage in optional stopping. In Section 5, we show that AV-logrank-type tests can be combined through multiplication to perform meta-analysis, and in Section 6, we show how the test can be used to derive confidence sequences for the hazard ratio. In Section 7, we compare the sample sizes that are needed during the design phase in order to achieve a targeted power. Lastly, in Section 8 we make concluding remarks and discuss future research directions.
We remark that once the definitions are in place, the technical results are mostly straightforward consequences of earlier work; in particular, of the work of Cox [1975], Slud [1992] and Andersen et al. [1993]. The novelty of the present work is thus mainly in defining the AV logrank test and showing by computer simulation that, while being substantially more flexible, it is competitive with existing approaches: the classic logrank test with fixed design and in combination with α-spending.
In addition to the main body of this article, we provide two appendices. We relegate to Appendix A proofs and remarks that, while important, are not needed to follow the main development. Most importantly, the particular E-variable we design is growth-rate optimal in the worst case, GROW (see Section 3.1). Grünwald et al. [2020] provide several motivations for this criterion; we provide an additional one using an argument of Breiman [1961], which does not seem to be widely known. This argument shows a connection between growth-rate optimality and tests with minimal expected stopping time. In Appendix B, we provide an extension to the case when covariates other than group membership are present. This extension, based on the full Cox model, requires solving a challenging optimization problem and its implementation is therefore deferred to future work.

PROPORTIONAL HAZARDS MODEL AND COX' PARTIAL LIKELIHOOD
We begin by describing the hypothesis that is being tested, the data that are available, and Cox' proportional hazards model. We are interested in comparing the survival rates between two groups of participants, Group A and Group B. In a randomized controlled trial, Group A would signify the control group; Group B, the treatment group. We assume that the available data about $m$ participants are of the form $\{(X^i, g^i, \delta^i) : i = 1, \dots, m\}$, where $X^i = \min\{T^i, C^i\}$ is the minimum between the event time $T^i$ and the (possibly infinite) censoring time $C^i$; $g^i$ is a zero-one covariate depending on group membership ($g^i = 0$ signifies that $i \in A$; $g^i = 1$, that $i \in B$); and $\delta^i = 1\{X^i = T^i\}$ is the indicator of whether the event was witnessed before censoring or not. Let $m^A$ be the number of members of Group A and $m^B$ the number of members of Group B, so that $m^A + m^B = m$. Define $g = (g^1, \dots, g^m)$, the vector of group memberships. We assume that $T^1, \dots, T^m, C^1, \dots, C^m$ are independent and have continuous distribution functions. The continuity assumption precludes tied observations; we relax this assumption later on, in Section 3.3. For $i = 1, \dots, m$, the survival rates are quantified by the hazard functions $\lambda^i = (\lambda^i_t)_{t \ge 0}$ for $T^i$, given by
$$\lambda^i_t := \lim_{\Delta t \downarrow 0} \frac{P\{t \le T^i < t + \Delta t \mid T^i \ge t\}}{\Delta t}. \tag{2.1}$$
As is customary, the hazard function $\lambda^i$ at $t$ can be interpreted via the conditional probability of witnessing an event in a short time span provided that the event has not been witnessed up to $t$, that is, for small $\Delta t$,
$$P\{t \le T^i < t + \Delta t \mid T^i \ge t\} \approx \lambda^i_t \, \Delta t. \tag{2.2}$$
Given our interest in comparing the survival rates between the two groups, suppose that all participants $i$ of Group A have a common hazard function $\lambda^A$ and that all participants of Group B have a common hazard function $\lambda^B$. Using the data, we wish to test proportional hazards hypotheses. Concretely, we test the hypothesis $H_0$ that the hazard functions of the members of both groups satisfy $\lambda^B_t = \theta_0 \lambda^A_t$ against an alternative hypothesis $H_1$ that $\lambda^B_t = \theta \lambda^A_t$ for a $\theta \ne \theta_0$.
As a first application of the methods that we develop, we consider the statistical hypothesis testing problem between the null hypothesis that the hazard ratio equals $\theta_0$ (typically $\theta_0 = 1$, in which case the hazard functions of the two groups are the same) and the left-sided alternative, that is,
$$H_0: \lambda^B_t = \theta_0 \lambda^A_t \quad \text{vs.} \quad H_1: \lambda^B_t = \theta \lambda^A_t \ \text{ for some } \theta \le \theta_1 < \theta_0 \text{ and all } t, \tag{2.3}$$
where $\theta$ is known as the hazard ratio and is the main quantity of statistical interest, and $\theta_1$ would be, in a clinical trial, a minimal clinically relevant effect size. The alternative is what we hope for in case of negative events, such as death, with treatments that are set out to lower (relative to the control condition) the hazard rate.
Notice that the hypotheses in (2.3) are, in fact, nonparametric.
Similarly, if the event is positive, e.g., recovery from an infection, we would typically set a right-sided alternative, which can also be treated with the present methods. Two-sided alternatives and the full alternative hypothesis $H_1: \theta \ne 1$ are also amenable to the methods that will follow. We remark, however, that all the methods retain their type-I error guarantees irrespective of the specific alternative that we use. We now turn to defining Cox' partial likelihood $\mathrm{PL}_{\theta,t}$, which is at the center of our approach. To that end, we need a battery of standard definitions, which we lay out to establish the notation. Let $y^i_t = 1\{X^i \ge t\}$ be the at-risk process, that is, the indicator of whether participant $i$ is still at risk at time $t$, and let $\bar y^A_t = \sum_{i \in A} y^i_t$ and $\bar y^B_t = \sum_{i \in B} y^i_t$ be the number of participants at risk in each of the groups at time $t$. Define $y_t = (y^1_t, \dots, y^m_t)$, the vector of at-risk processes, and $R_t = \{j : y^j_t = 1\}$, the set of participants at risk at time $t$. Let $T_{(1)} < T_{(2)} < \dots < T_{(\bar N_\infty)}$ be the ordered event times that were witnessed (not censored). Note that, if all participants witness the event and censoring is absent, $\bar N_\infty = m$. For each $k = 1, \dots, \bar N_\infty$, let $I_{(k)}$ be the index of the individual that witnessed the event at time $T_{(k)}$. This means, for example, that if the participant with label three was the fifth to witness the event, then $I_{(5)} = 3$. We write $y_{(k)}$, $\bar y^A_{(k)}$, $\bar y^B_{(k)}$ and $R_{(k)}$ for the corresponding quantities at event time $T_{(k)}$, and define $g_{(k)} := g^{I_{(k)}}$. Cox' partial likelihood $\mathrm{PL}_{\theta,t}$ can be sequentially computed by
$$\mathrm{PL}_{\theta,t} := \prod_{k : T_{(k)} \le t} \frac{\theta^{g_{(k)}}}{\sum_{j \in R_{(k)}} \theta^{g^j}} = \prod_{k : T_{(k)} \le t} \frac{\theta^{g_{(k)}}}{\bar y^A_{(k)} + \theta \, \bar y^B_{(k)}}. \tag{2.4}$$
Cox' likelihood evaluated at the event times $T_{(1)}, T_{(2)}, \dots$ coincides with that of a sequence of multinomial trials where, at event time $T_{(k)}$, each of the participants $i \in R_{(k)}$ witnesses the event with probability
$$p_{\theta,(k)}(i) := \frac{\theta^{g^i}}{\sum_{j \in R_{(k)}} \theta^{g^j}} = \frac{\theta^{g^i}}{\bar y^A_{(k)} + \theta \, \bar y^B_{(k)}}. \tag{2.5}$$
Cox showed that, indeed, conditionally on all the information accrued strictly before $T_{(k)}$, the probability that participant $i$ observes an event at time $T_{(k)}$ is exactly $p_{\theta,(k)}(i)$ as long as the hazard ratio is $\theta$. With these likelihood computations at hand, we are in a position to show the main contribution of this article, the AV logrank test, which uses the partial likelihood ratio as the test statistic.
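The sequential-multinomial reading of (2.4) and (2.5) can be made concrete with a few lines of Python. This is only a minimal sketch: the function names `cox_event_probs` and `partial_likelihood`, the dictionary encoding of a risk set, and the toy risk set are illustrative assumptions and are not taken from the article.

```python
def cox_event_probs(risk_set, theta):
    """Return {participant: p_theta(participant)} for one event time, as in (2.5).

    risk_set maps each at-risk participant to its group indicator g
    (0 for Group A, 1 for Group B).
    """
    weights = {i: theta ** g for i, g in risk_set.items()}
    total = sum(weights.values())  # equals ybar_A + theta * ybar_B
    return {i: w / total for i, w in weights.items()}


def partial_likelihood(events, theta):
    """Multiply p_theta(I_k) over events given as (risk_set, who_had_event), as in (2.4)."""
    pl = 1.0
    for risk_set, who in events:
        pl *= cox_event_probs(risk_set, theta)[who]
    return pl


# Hypothetical risk set: participants 1-2 in Group A, participants 3-5 in Group B.
risk_set = {1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
probs = cox_event_probs(risk_set, theta=0.5)

# Participant 3 has the first event; participant 1 has the second among those remaining.
events = [(risk_set, 3), ({1: 0, 2: 0, 4: 1, 5: 1}, 1)]
pl = partial_likelihood(events, theta=0.5)
```

With hazard ratio 0.5, each Group A member carries weight 1 and each Group B member weight 0.5, so the probabilities over the risk set sum to one, as a multinomial trial requires.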

THE AV LOGRANK TEST
In this section the AV logrank test for (2.3) is introduced; its type-I error guarantees and optimality properties are investigated. We give a solution to the first of the purposes laid down in the introduction: we show that the AV logrank test is anytime valid, that is, its type-I error guarantees are not affected by optional stopping. The fact that it is also type-I-error-safe under optional continuation, our second purpose, is proven in Section 5. Without further ado, we define the AV logrank statistic $S^{\theta_1}_{\theta_0,t}$ for (2.3), where typically $\theta_0 = 1$, as the partial likelihood ratio
$$S^{\theta_1}_{\theta_0,t} := \frac{\mathrm{PL}_{\theta_1,t}}{\mathrm{PL}_{\theta_0,t}} = \prod_{k : T_{(k)} \le t} \frac{p_{\theta_1,(k)}(I_{(k)})}{p_{\theta_0,(k)}(I_{(k)})}. \tag{3.1}$$
Here, $p_{\theta,(k)}$ is as defined in (2.5); the product that defines our statistic $S^{\theta_1}_{\theta_0,t}$ runs over the events that have been witnessed up to and including time $t$, and the empty product is taken to be equal to one. As is conventional with likelihood ratios, high values of $S^{\theta_1}_{\theta_0,t}$ are indicative that the alternative hypothesis is better than the null hypothesis at describing the data. Given a tolerable type-I error bound $\alpha$ and an arbitrary random time $\tau$, the AV logrank test is the test that rejects the null hypothesis if $S^{\theta_1}_{\theta_0,\tau}$ reaches the threshold $1/\alpha$, that is,
$$\xi^{\theta_1}_{\theta_0,\tau} := 1\{S^{\theta_1}_{\theta_0,\tau} \ge 1/\alpha\}. \tag{3.2}$$
As we will see, by its sequential properties, $S^{\theta_1}_{\theta_0,t}$ takes large values with small probability under the null hypothesis uniformly over time, which translates into type-I error control for the test $\xi^{\theta_1}_{\theta_0,\tau}$. This observation is behind the anytime validity of the AV logrank test, and of anytime-valid tests in general (more details and general constructions to the effect of anytime-valid sequential testing can be found in the work of Ramdas et al. [2020]). We show in the following proposition that the test $\xi^{\theta_1}_{\theta_0,\tau}$ has the desired type-I error control.
Proposition 3.1. Let $P_0$ be any distribution under which the hazard ratio is equal to $\theta_0$, and let $\tau$ be any random time. The test $\xi^{\theta_1}_{\theta_0,\tau} = 1\{S^{\theta_1}_{\theta_0,\tau} \ge 1/\alpha\}$, where $S^{\theta_1}_{\theta_0,t}$ is as in (3.1), has level $\alpha$, that is, $P_0\{\xi^{\theta_1}_{\theta_0,\tau} = 1\} \le \alpha$.
This result can be readily obtained using the sequential-multinomial interpretation of Cox' likelihood ratio. As we will see in Section 3.1, this result can be interpreted in terms of E-variables and E-processes [Grünwald et al., 2020]. Define the process $(S^{\theta_1}_{\theta_0,(k)})_{k=1,2,\dots}$ as the value of the AV logrank statistic at the event times $T_{(k)}$, that is, $S^{\theta_1}_{\theta_0,(k)} := S^{\theta_1}_{\theta_0,T_{(k)}}$. In this time discretization, the AV logrank statistic is the product of the random variables
$$R^{\theta_1}_{\theta_0,(k)} := \frac{p_{\theta_1,(k)}(I_{(k)})}{p_{\theta_0,(k)}(I_{(k)})}, \tag{3.3}$$
the one-outcome partial likelihood ratio for the $k$th event, where $p_{\theta,(k)}$ is as in (2.5) and $k = 1, 2, \dots$.
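Proof of Proposition 3.1. Under any distribution $P_0$ under which the hazard ratio is $\theta_0$, the fact that the likelihood of observing $I_{(k)}$ conditionally on $\{y_{(l)} : l = 1, \dots, k\}$ equals $p_{\theta_0,(k)}(I_{(k)})$ implies that
$$E_{P_0}\big[R^{\theta_1}_{\theta_0,(k)} \mid y_{(1)}, \dots, y_{(k)}\big] = \sum_{i \in R_{(k)}} p_{\theta_0,(k)}(i) \, \frac{p_{\theta_1,(k)}(i)}{p_{\theta_0,(k)}(i)} = \sum_{i \in R_{(k)}} p_{\theta_1,(k)}(i) = 1. \tag{3.4}$$
This immediately shows that $S^{\theta_1}_{\theta_0,(k)} = \prod_{l \le k} R^{\theta_1}_{\theta_0,(l)}$ is a test martingale, a nonnegative martingale with expected value equal to one, with respect to the filtration generated by the data observed up to each event time. Next, the type-I error control for the test $\xi^{\theta_1}_{\theta_0,\tau}$ follows from Ville's inequality, which asserts that, under the null hypothesis, the test martingale $S^{\theta_1}_{\theta_0,(k)}$ takes large values with small probability. Ville's inequality [Ville, 1939] (also known as Doob's maximal inequality) implies that
$$P_0\Big\{\sup_{k = 1, 2, \dots} S^{\theta_1}_{\theta_0,(k)} \ge 1/\alpha\Big\} \le \alpha.$$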
The previous display is a bound on ever making a type-I error when using the AV logrank test $\xi^{\theta_1}_{\theta_0,\tau}$.
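The argument above can be traced numerically. The following sketch computes the running AV logrank statistic along a sequence of events and checks the conditional-expectation identity that makes it a test martingale. It assumes complete observation (no censoring, no ties), so risk sets shrink only through events; the function names and the toy event sequence are hypothetical.

```python
def av_logrank_path(event_groups, n_a, n_b, theta1, theta0=1.0):
    """Running AV logrank statistic after each event (no censoring, no ties).

    event_groups[k] is the group (0 = A, 1 = B) of the participant who has
    the k-th event; n_a and n_b are the initial risk-set sizes.
    """
    y_a, y_b = n_a, n_b
    s, path = 1.0, []
    for g in event_groups:
        p1 = theta1 ** g / (y_a + theta1 * y_b)  # p under the alternative
        p0 = theta0 ** g / (y_a + theta0 * y_b)  # p under the null
        s *= p1 / p0                             # one-outcome likelihood ratio
        path.append(s)
        if g == 0:
            y_a -= 1
        else:
            y_b -= 1
    return path


def null_expectation(y_a, y_b, theta1, theta0=1.0):
    """Conditional expectation of the one-outcome ratio when the hazard ratio is theta0."""
    prob_a = y_a / (y_a + theta0 * y_b)           # null probability: event in Group A
    prob_b = theta0 * y_b / (y_a + theta0 * y_b)  # null probability: event in Group B
    ratio_a = (y_a + theta0 * y_b) / (y_a + theta1 * y_b)
    ratio_b = (theta1 / theta0) * ratio_a
    return prob_a * ratio_a + prob_b * ratio_b


alpha = 0.05
path = av_logrank_path([0, 0, 1, 0], n_a=10, n_b=10, theta1=0.5)
reject = any(s >= 1 / alpha for s in path)  # stop and reject once 1/alpha is reached
```

Because the rejection rule only asks whether the path has ever reached 1/alpha, it can be evaluated after every event without adjusting the threshold, which is the practical content of Ville's inequality here.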
Under general patterns of incomplete observation, like independent censoring or independent left truncation, the AV logrank test provides the same type-I error guarantees. To prove this, we give an alternative proof of Proposition 3.1 in Appendix A using the counting-process formalism [Andersen et al., 1993]. There, we show that if the compensators of the underlying counting processes have a certain general product structure, which is the case under complete observation, the AV logrank test is anytime valid. We then refer to Andersen et al. [1993], who show that this structure is preserved under said patterns of incomplete observation.
The AV logrank test is optimal, in a sense to be defined in the next section, among a large family of statistics. A second look at the proof of Proposition 3.1 suggests a generalization of the AV logrank statistic given in (3.1). Let, for each $k$, $q_{(k)}$ be a probability distribution on participants in the risk set $R_{(k)}$ which is only allowed to depend on $y_{(1)}, \dots, y_{(k)}$. Analogously to (3.3), we define the one-outcome ratio $R^q_{\theta_0,(k)} := q_{(k)}(I_{(k)}) / p_{\theta_0,(k)}(I_{(k)})$ (we now use $q_{(k)}$ instead of $p_{\theta_1,(k)}$) and
$$S^q_{\theta_0,t} := \prod_{k : T_{(k)} \le t} R^q_{\theta_0,(k)}. \tag{3.5}$$
A modification of the previous argument shows, for any random time $\tau$, a type-I error guarantee for the test $\xi^q_{\theta_0,\tau}$ based on the value of $S^q_{\theta_0,\tau}$, that is, $\xi^q_{\theta_0,\tau} := 1\{S^q_{\theta_0,\tau} \ge 1/\alpha\}$ (see Proposition 3.1). Any such test is also anytime valid as long as each $q_{(k)}$ depends on the data only through $y_{(1)}, \dots, y_{(k)}$. In Section 3.2, we use this generalization to provide tests when no value of $\theta_1$ is available. This generalization raises a natural question about the optimality of the AV logrank test based on (3.1) among test statistics of the form (3.5). This is the subject of the next section.

E-variables and optimality
The random variables $\{R^{\theta_1}_{\theta_0,(k)}\}_{k=1,2,\dots}$ from (3.3) and $\{R^q_{\theta_0,(k)}\}_{k=1,2,\dots}$ from (3.5) are examples of (conditional) E-variables: nonnegative random variables whose (conditional) expected value is at most 1 uniformly over the null hypothesis. E-variables and E-processes are the "correct" generalization of likelihood ratios to the case that either or both $H_0$ and $H_1$ are composite and can be interpreted in terms of gambling [Grünwald et al., 2020, Shafer, 2021, Ramdas et al., 2020]. Under this gambling interpretation, a test martingale, a product of conditional E-variables, is the total profit made in a sequential gambling game where no earnings are expected under the null hypothesis. The analogy is thus between profit and evidence: no evidence can be gained against the null hypothesis if it is true. Just as with p-values, the definition of E-variables and test martingales does not need any mention of an alternative hypothesis. However, if a composite set of alternative distributions is available, a gambler who is skeptical of the null distribution might want to maximize the speed of evidence accumulation (or of capital growth) under the alternative hypothesis. The worst-case growth rate is defined (conservatively) as the smallest expectation of the logarithm of the E-variable under the alternative. Consequently, any E-variable maximizing it is called GROW, for Growth-Rate Optimal in the Worst case (see the work of Grünwald et al. [2020] and Shafer [2021] for additional reasons to use this optimality criterion).
We instantiate this reasoning to our present problem. For the left-sided alternative (2.3), the choice $R^{\theta_1}_{\theta_0,(k)}$ is conditionally GROW because it maximizes the worst-case conditional growth rate over all valid choices of $q_{(k)}$ (which can only depend on the data through $y_{(1)}, \dots, y_{(k)}$), that is,
$$\min_{\theta \le \theta_1} E_\theta\big[\ln R^{\theta_1}_{\theta_0,(k)} \mid y_{(1)}, \dots, y_{(k)}\big] = \max_{q_{(k)}} \min_{\theta \le \theta_1} E_\theta\big[\ln R^{q}_{\theta_0,(k)} \mid y_{(1)}, \dots, y_{(k)}\big].$$
In Appendix A.1, we show that in the limit that the risk sets are much larger than the number of events that are witnessed, this worst-case growth criterion yields a test that minimizes the worst-case expected stopping time (under the alternative hypothesis) among the tests that stop as soon as $S^q_{\theta_0,t} \ge 1/\alpha$. Thus, among all possible AV logrank tests of the form (3.5), there are strong reasons to choose $\xi^{\theta_1}_{\theta_0,\tau}$.
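The growth-rate criterion can be explored numerically for a single event. The sketch below computes the conditional expected logarithm of the one-outcome ratio when the data come from hazard ratio `theta_true` while the ratio is built with `theta_alt`. It is an illustration rather than a proof, it restricts attention to choices of q of the form $p_{\theta,(k)}$, and the function name and risk-set sizes are assumptions made for the example.

```python
import math


def expected_log_growth(y_a, y_b, theta_true, theta_alt, theta0=1.0):
    """E_theta_true[ln(p_theta_alt(I) / p_theta0(I))] for one event and a given risk set."""
    denom_true = y_a + theta_true * y_b
    prob_a = y_a / denom_true               # event falls in Group A
    prob_b = theta_true * y_b / denom_true  # event falls in Group B
    log_ratio_a = math.log((y_a + theta0 * y_b) / (y_a + theta_alt * y_b))
    log_ratio_b = math.log(theta_alt / theta0) + log_ratio_a
    return prob_a * log_ratio_a + prob_b * log_ratio_b


# Growth of the statistic built with theta_1 = 0.8 when the truth is exactly 0.8 ...
grow_at_boundary = expected_log_growth(100, 100, theta_true=0.8, theta_alt=0.8)
# ... versus statistics that bet on a different hazard ratio under the same truth.
too_bold = expected_log_growth(100, 100, theta_true=0.8, theta_alt=0.6)
too_timid = expected_log_growth(100, 100, theta_true=0.8, theta_alt=0.9)
# A more extreme truth (0.5 < 0.8) makes the theta_1 = 0.8 statistic grow faster.
favorable = expected_log_growth(100, 100, theta_true=0.5, theta_alt=0.8)
```

In this example the worst case over $\theta \le \theta_1$ is attained at $\theta = \theta_1$, and at that boundary betting on $\theta_1$ itself gives the largest expected log growth among the three candidates, which is consistent with the GROW property stated above.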
In a similar fashion, a test can be constructed for two-sided alternatives. Indeed, consider a testing problem of the form where θ 1 < 1. For this problem, we can create a weighted, conditionally GROW, E-variable by using

Learning the hazard ratio from data
So far, the alternative hypotheses that we have studied are of the form $H_1: \theta \le \theta_1$ for some value of $\theta_1 < 1$. In some cases, such a value of $\theta_1$ is available from the context of the analysis. For instance, $\theta_1$ can correspond to a minimal clinically relevant effect that is satisfactory in a medical trial. However, sometimes it is not clear which value of $\theta_1$ to choose. Still, statistics of the form (3.5) are useful to test a null hypothesis $H_0$ as in (2.3). Indeed, for each $k$, we can use conditional probability mass functions $q_{(k)}$ that depend on data observed at times $t < T_{(k)}$ and enable us to implicitly learn the hazard ratio $\theta$. We describe two such alternatives: a prequential plug-in likelihood and a Bayes predictive distribution.

Prequential plugin test approach
Using only the data observed at times $t < T_{(k)}$, let $\hat\theta_{(k)}$ be the smoothed maximum likelihood estimator of $\theta$, where the smoothing term $p_{\theta,0}$ is based on the likelihood of having observed two "virtual" data points prior to the observed data. Using $q_{(k)} := p_{\hat\theta_{(k)},(k)}$ in (3.5) gives the statistic $S^{\mathrm{preq}}_{\theta_0,t}$, and it can also be used to define an anytime-valid test. With this choice, the process $q_{(1)}, q_{(2)}, \dots$ is a typical instance of a prequential plug-in likelihood [Dawid, 1984], which is often based on suitable smoothed likelihood-based estimators [Grünwald and Roos, 2019]. The rationale behind this method is the following. Suppose the data are actually sampled from a distribution according to which the hazard ratio is $\theta$. For sufficiently large initial risk sets, that is, if $\bar y^A_0$ and $\bar y^B_0$ are not too small, by the law of large numbers, the smoothed maximum likelihood estimate $\hat\theta_{(k)}$ will with high probability be close to $\theta$. Therefore, $p_{\hat\theta_{(k)},(k)}$ will behave more and more like the real $p_{\theta,(k)}$ from which data are sampled. Thus, the process $S^{\mathrm{preq}}_{\theta_0,t}$ will behave more and more similarly to the "correct" partial likelihood ratio (3.1).
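A prequential plug-in statistic of this kind might be sketched as follows. Two caveats apply. First, the exact form of the smoothing term is not reproduced here; the code assumes, purely for illustration, two virtual events (one per group), each drawn from a virtual risk set with one participant per group. Second, the maximizer is found with a plain ternary search on $\beta = \ln\theta$, which relies on the concavity of the log partial likelihood in $\beta$; any other one-dimensional optimizer would do. All names are hypothetical, and censoring and ties are assumed absent.

```python
import math


def smoothed_log_pl(beta, history):
    """Smoothed log partial likelihood at theta = exp(beta).

    history holds (g, y_a, y_b) for past events: the group of the event and
    the risk-set sizes at that event. ASSUMPTION: the smoothing term is the
    likelihood of two virtual events, one per group, each with a risk set of
    one Group A and one Group B participant.
    """
    theta = math.exp(beta)
    ll = math.log(1 / (1 + theta)) + math.log(theta / (1 + theta))
    for g, y_a, y_b in history:
        ll += g * beta - math.log(y_a + theta * y_b)
    return ll


def plug_in_theta(history, lo=-5.0, hi=5.0, iters=100):
    """Maximize the concave smoothed log partial likelihood by ternary search."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if smoothed_log_pl(m1, history) < smoothed_log_pl(m2, history):
            lo = m1
        else:
            hi = m2
    return math.exp((lo + hi) / 2)


def prequential_path(event_groups, n_a, n_b, theta0=1.0):
    """Running plug-in statistic; each factor uses only strictly earlier events."""
    y_a, y_b, s = n_a, n_b, 1.0
    history, path = [], []
    for g in event_groups:
        theta_hat = plug_in_theta(history)  # estimate from past events only
        q = theta_hat ** g / (y_a + theta_hat * y_b)
        p0 = theta0 ** g / (y_a + theta0 * y_b)
        s *= q / p0
        path.append(s)
        history.append((g, y_a, y_b))
        if g == 0:
            y_a -= 1
        else:
            y_b -= 1
    return path


path = prequential_path([0, 0, 0, 1, 0], n_a=50, n_b=50)
```

Under the assumed smoothing, the estimate before any event is 1, so the first factor carries no evidence; after one event in Group A with balanced risk sets the estimate drops to 0.5 and the statistic starts to move.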

Bayesian approach
Instead of $q_{(k)}$ based on a plug-in estimate of $\theta$, it is also possible to use a Bayes predictive distribution based on a prior $W$ on $\theta$. If $W_{(k)} = W \mid y_{(1)}, \dots, y_{(k)}$ is the Bayes posterior on $\theta$ based on a prior $W$ and the data up to time $t < T_{(k)}$, then
$$q_{(k)}(i) := p_{W,(k)}(i) = \int p_{\theta,(k)}(i) \, \mathrm{d}W_{(k)}(\theta),$$
where $W_{(1)} = W$. Hence, $p_{W,(k)}$ is the Bayesian predictive distribution. The resulting statistic $S^W_{\theta_0,t}$ is the result of multiplying the conditional probability mass functions $p_{W,(k)}$, and we obtain that
$$S^W_{\theta_0,t} = \prod_{k : T_{(k)} \le t} \frac{p_{W,(k)}(I_{(k)})}{p_{\theta_0,(k)}(I_{(k)})} = \frac{\int \mathrm{PL}_{\theta,t} \, \mathrm{d}W(\theta)}{\mathrm{PL}_{\theta_0,t}} \tag{3.7}$$
is a Bayes factor between the Bayes marginal distribution based on $W$ and $\theta_0$. This technique has been employed in sequential analysis; it is known as the method of mixtures [Darling and Robbins, 1967, Robbins and Siegmund, 1970]. We do not know of a prior for which (3.7) or the constituent products have an analytic expression, but it can certainly be implemented using, for example, Gibbs sampling.
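For intuition, the method of mixtures can be tried with a prior supported on finitely many hazard ratios, in which case the integrals become sums and no sampling is needed. This discrete-prior shortcut is an assumption made for the sketch and is not the Gibbs-sampling route mentioned in the text; names and numbers are hypothetical, and censoring and ties are assumed absent. The code computes the statistic in two ways, as a product of posterior-predictive ratios and as a prior-weighted average of simple partial likelihood ratios.

```python
def event_prob(g, y_a, y_b, theta):
    """p_theta of the observed participant, whose group indicator is g."""
    return theta ** g / (y_a + theta * y_b)


def bayes_factor_sequential(event_groups, n_a, n_b, prior, theta0=1.0):
    """Product of Bayes-predictive ratios, updating the posterior after each event."""
    y_a, y_b, s = n_a, n_b, 1.0
    post = dict(prior)  # prior maps theta -> weight, weights summing to one
    for g in event_groups:
        likes = {th: event_prob(g, y_a, y_b, th) for th in post}
        predictive = sum(post[th] * likes[th] for th in post)
        s *= predictive / event_prob(g, y_a, y_b, theta0)
        post = {th: post[th] * likes[th] / predictive for th in post}
        if g == 0:
            y_a -= 1
        else:
            y_b -= 1
    return s


def bayes_factor_mixture(event_groups, n_a, n_b, prior, theta0=1.0):
    """Prior-weighted average of the partial likelihood ratios for each theta."""
    total = 0.0
    for th, w in prior.items():
        y_a, y_b, lr = n_a, n_b, 1.0
        for g in event_groups:
            lr *= event_prob(g, y_a, y_b, th) / event_prob(g, y_a, y_b, theta0)
            if g == 0:
                y_a -= 1
            else:
                y_b -= 1
        total += w * lr
    return total


prior = {0.5: 0.25, 0.8: 0.5, 1.25: 0.25}  # hypothetical three-point prior
events = [0, 1, 0, 0, 1, 0]
s_seq = bayes_factor_sequential(events, 20, 20, prior)
s_mix = bayes_factor_mixture(events, 20, 20, prior)
```

The two computations agree up to floating-point error, which is the telescoping identity behind reading the product of predictive distributions as a Bayes factor.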
As shown in Section 3, the use of any $S^q_{\theta_0,t}$ instead of $S^{\theta_1}_{\theta_0,t}$ does not compromise on safety: a test based on monitoring $S^q_{\theta_0,t}$ is anytime valid, whether $q$ makes reference to plug-in estimators or Bayes predictive distributions, no matter what prior $W$ was chosen. The type-I error guarantee always holds, also when the prior is "misspecified", putting most of its mass in a region of the parameter space far from the actual $\theta$ from which the data were sampled. Thus, our set-up is intimately related to the concept of luckiness in the machine learning theory literature [Grünwald and Mehta, 2019] rather than to "pure" Bayesian statistics. Indeed, given a target value $\theta_1$ (a minimal clinically relevant effect size), the worst-case logarithmic growth rate of $S^q_{\theta_0,t}$ will in general be smaller than that of the GROW $S^{\theta_1}_{\theta_0,t}$. Nevertheless, $S^q_{\theta_0,t}$ can come close to the optimal for a whole range of potentially data-generating $\theta$ and may thus sometimes be preferable over choosing $S^{\theta_1}_{\theta_0,t}$. More precisely, the use of a prior allows us to exploit favorable situations in which $\theta$ is even smaller (more extreme) than $\theta_1$. In such situations, the GROW $S^{\theta_1}_{\theta_0,t}$ is effectively misspecified. By using an $S^q_{\theta_0,t}$ that learns from the data, we may actually obtain a test martingale that grows faster than the GROW $S^{\theta_1}_{\theta_0,t}$, which is fully committed to detecting the worst-case $\theta_1$.
In Figure 1, we illustrate such a situation where we start with 1000 participants in both groups. We generated data using different hazard ratios, and used a 'misspecified' $S^{\theta_1}_{\theta_0,t}$ that always used $\theta_1 = 0.8$. Note that this is still the GROW (minimax optimal) test martingale for $H_1: \theta \le \theta_1 = 0.8$; however, if we knew the true $\theta$, we could use the test martingale $S^{\theta}_{\theta_0,t}$, which grows faster. We will call the test based on this latter martingale the oracle exact AV logrank test because it is based on inaccessible (oracle) knowledge. We estimated the number of events needed to reject the null with 80% power for $S^{0.8}_{\theta_0,t}$, the oracle $S^{\theta}_{\theta_0,t}$, and the prequential plug-in $S^{\mathrm{preq}}_{\theta_0,t}$. In all cases, we used the aggressive stopping rule that stops as soon as the statistic in question crosses the threshold $1/\alpha = 20$. We see that, as the true $\theta$ gets smaller than 0.8, we need fewer events using the GROW test $S^{0.8}_{\theta_0,t}$ (the data are favorable to us), but using the oracle exact AV logrank test we get a considerable additional reduction. The prequential plug-in $S^{\mathrm{preq}}_{\theta_0,t}$ 'tracks' the oracle $S^{\theta}_{\theta_0,t}$ by learning the true $\theta$ from the data: for $\theta$ near 0.8, it behaves worse (more data are needed) than $S^{0.8}_{\theta_0,t}$ (which knows the right $\theta$ from the start), but for $\theta < 0.6$ it starts to behave better. For comparison we also added the methods discussed in Section 4.1. Notably, the O'Brien-Fleming procedure, even though unsuitable for optional continuation, needs even more events than the misspecified AV logrank test $S^{0.8}_{\theta_0,t}$ as soon as $\theta$ goes below 0.8. The simulations were performed using exactly the same algorithms as for Figure 4, so the y-axis at $\theta = 0.8$ coincides with that of Figure 4, but now with absolute rather than relative numbers; details are described in Appendix A.4.

Tied observations
Here, we propose a sequential test for applications where events are not monitored continuously, but only at certain observation times. In this case, more than one event may be witnessed in the time interval between two observation moments. Since the order in which these observations are made would be unknown, our previous approaches fail to offer a satisfactory sequential test. Assume that we make observations at times $t_0 < t_1 < t_2 < \dots$ that are fixed before the start of the study. Even though we assume the absence of censoring in this section, this approach can be adapted to its presence under an additional common assumption: that the events reported between two observation times $t_{k-1}$ and $t_k$ precede any censorings, so that censored patients contribute fully to the risk sets under consideration. We assume that the available data are of the form $\{(O^A_k, O^B_k) : k = 1, 2, \dots\}$, where $O^A_k$ and $O^B_k$ are the numbers of events witnessed in Groups A and B in the interval between $t_{k-1}$ and $t_k$, and $O_k = O^A_k + O^B_k$ is the total. Notice that since the observation times are discrete, we can index the observations by $k$ instead of $t_k$. For each $k$, let $\bar y^A_k = \sum_{j \in A} y^j_{t_k}$ be the number of Group A participants at risk at time $t_k$, define similarly $\bar y^B_k$, and let $\bar y_k = \bar y^A_k + \bar y^B_k$ be the total. We derive an anytime-valid test (a test valid at any observation time) for the problem (2.3), where the hazard ratio under the null hypothesis is $\theta_0 = 1$. The reason for this restriction in the null hypothesis (only $\theta_0 = 1$ is allowed) will soon become clear. Observe that, at time $t_k$, under the null hypothesis and conditionally on $(\bar y^A_{k-1}, \bar y^B_{k-1})$ and the total number of events $O_k$, the number of events $O^B_k$ in Group B follows a hypergeometric distribution. This implies that, conditionally on $(\bar y^A_{k-1}, \bar y^B_{k-1}, O_k)$, the probability of observing $O^B_k = o^B$ is $p_{\mathrm{Hyper}}(o^B; \bar y^A_{k-1}, \bar y^B_{k-1}, O_k)$, where $p_{\mathrm{Hyper}}$ is the probability mass function of a hypergeometric random variable, that is,
$$p_{\mathrm{Hyper}}(o^B; \bar y^A, \bar y^B, o) := \frac{\binom{\bar y^B}{o^B} \binom{\bar y^A}{o - o^B}}{\binom{\bar y^A + \bar y^B}{o}}.$$
With this observation at hand, we can build, analogously to (3.5) from the continuous-monitoring case, anytime-valid tests based on partial likelihood ratios
$$S^q_{1,k} := \prod_{l \le k} \frac{q_l(O^B_l)}{p_{\mathrm{Hyper}}(O^B_l; \bar y^A_{l-1}, \bar y^B_{l-1}, O_l)},$$
where each $q_k$ is a conditional distribution on the possible values of $O^B_k$ that only depends on the data up to time $t_{k-1}$. Following the same steps as in Section 3, a sequential test based on monitoring whether $S^q_{1,k}$ crosses the threshold $1/\alpha$ is also anytime valid at level $\alpha$.

Figure 1: Number of events at which one can stop while retaining 80% power at $\alpha = 0.05$ using the process $S^{\theta_1}_{\theta_0,t}$ with $\theta_0 = 1$ and $\theta_1 = 0.80$ when the true hazard ratio $\theta$ generating the data is different from $\theta_1$. "Oracle" means that the method is specified with knowledge of the true $\theta$, which in reality is unknown. Note that the y-axis (number of events for 80% power) is logarithmic.
Just as in the proof of Proposition 3.1, this lemma is shown by a combination of the martingale property of $S^{\theta_1}_{1,k}$ and Doob's maximal inequality. Therefore, we omit the proof of Lemma 3.2.
In order to obtain an optimal test under a particular hazard ratio $\theta_1$ (an alternative hypothesis), it is necessary to compute the conditional partial likelihood, under the alternative, of having observed $O^B_k$. This conditional likelihood is given by Fisher's noncentral hypergeometric distribution with parameter $\omega$. Unfortunately, $\omega$ depends on the baseline hazard function $\lambda$, which is assumed to be unknown (see Appendix A.3 for details). It is for this reason that we restrict the null hypothesis to $\theta_0 = 1$. Luckily, since the test based on $S^q_{\theta_0,t}$ remains valid even if $q$ is only approximately correct, this problem can be skirted. As also noted by Mehrotra and Roth [2001], when the times between observations are short, the parameter $\omega$ is well approximated by $\theta_1$, the hazard ratio under the alternative hypothesis; no knowledge of $\lambda$ is needed for the approximation. With this in mind, we put forward the use of $S^{\theta_1}_{1,k}$, the statistic obtained by taking each $q_k$ equal to $p_{(\theta_1),k}$, Fisher's noncentral hypergeometric distribution with parameter $\omega = \theta_1$.

We remark that despite $p_{(\theta_1),k}$ being only approximately the correct distribution for the observations under the alternative, type-I error guarantees are not compromised (see the discussion on luckiness in Section 3.2). In any case, this approximation is accurate when the time between two consecutive observation times is not very long and when the number of tied observations is small. Two reassuring remarks are in order. First, in the special case when only one event is observed in each time interval between two consecutive observation moments, the statistic $S^{\theta_1}_{1,k}$ reduces to the continuously monitored AV logrank test (3.1) at time $t_k$. Second, the score test associated to $S^{\theta_1}_{1,k}$ coincides with the logrank test as it is conventionally computed in the presence of ties.
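The tied-data statistic described above can be sketched in a few lines. The sketch below is illustrative only: the function names and the tuple layout of `intervals` are assumptions made here, the null is the ordinary hypergeometric distribution ($\omega = 1$), and the alternative uses Fisher's noncentral hypergeometric distribution with $\omega$ replaced by $\theta_1$, as the text proposes. Binomial coefficients are computed exactly, so it is only practical for moderate risk-set sizes.

```python
from math import comb


def fisher_nc_hypergeom_pmf(x, n_b, n_a, total, omega):
    """P(X = x) under Fisher's noncentral hypergeometric distribution.

    X counts group-B events among `total` events, with n_b and n_a
    participants at risk in groups B and A. omega = 1 gives the
    ordinary hypergeometric distribution.
    """
    lo, hi = max(0, total - n_a), min(total, n_b)
    if not lo <= x <= hi:
        return 0.0

    def weight(y):
        return comb(n_b, y) * comb(n_a, total - y) * omega ** y

    return weight(x) / sum(weight(y) for y in range(lo, hi + 1))


def av_logrank_ties(intervals, theta1):
    """Running values of the tied-data statistic for theta0 = 1.

    `intervals` is a list of (n_a, n_b, o_a, o_b): numbers at risk at the
    start of each interval and events observed in groups A and B
    (a hypothetical data layout chosen for this sketch).
    """
    s, path = 1.0, []
    for n_a, n_b, o_a, o_b in intervals:
        total = o_a + o_b
        alt = fisher_nc_hypergeom_pmf(o_b, n_b, n_a, total, theta1)
        null = fisher_nc_hypergeom_pmf(o_b, n_b, n_a, total, 1.0)
        s *= alt / null  # one likelihood-ratio factor per interval
        path.append(s)
    return path
```

With a single event per interval, each factor reduces to the likelihood ratio of the untied partial likelihood, in line with the first of the two remarks above.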

A GAUSSIAN APPROXIMATION TO THE AV LOGRANK TEST
In this section we present an approximation to the AV logrank test introduced in the previous section. It is based on a sequential-Gaussian approximation to the logrank statistic. The approximation is of interest for two reasons. First, in practical situations, only the logrank Z-statistic (a standardized form of the classic logrank statistic) and other summary statistics may be available, and not the full risk-set process. This is often the case in medical trials, where the full data sets are confidential. If we also know the number of events $N_k$ and the initial number of participants in both groups, $m_A$ and $m_B$, the Gaussian approximation to the AV logrank statistic can still be used. The second reason, which we address in Section 4.1, is that α-spending and group-sequential approaches, which we use as benchmarks, are also based on Gaussian approximations to the classic logrank statistic. Consequently, the behavior of the Gaussian approximation gives further insight into how the AV logrank statistic compares to group-sequential and α-spending approaches. We henceforth focus on the main case of interest, $\theta_0 = 1$.
Our general strategy is close in spirit to that followed in the construction of the exact AV logrank statistic in Section 3. We build likelihood ratios using a classic approximation for the distribution of the original logrank statistic [Schoenfeld, 1981]. If the distribution of this statistic were exactly normal, we could continuously monitor its likelihood ratio. We show through extensive simulation in which regimes this approximation behaves similarly to the AV logrank statistic.
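We begin by recalling the definition of the Z-score associated to the classic logrank test. Let

$$E^B_i = O_i\, \frac{\bar y^B_i}{\bar y_i}$$

be the expected (under the null) number of events witnessed in the time interval $(t_{i-1}, t_i]$ in group B, and let

$$V^B_i = O_i\, \frac{\bar y^B_i}{\bar y_i}\, \frac{\bar y^A_i}{\bar y_i}\, \frac{\bar y_i - O_i}{\bar y_i - 1}, \qquad Z_k = \frac{\sum_{i \le k} (O^B_i - E^B_i)}{\sqrt{\sum_{i \le k} V^B_i}}.$$

The numerator in the definition of $Z_k$ is the classic logrank statistic $\sum_{i \le k} (O^B_i - E^B_i)$, which is typically interpreted as the cumulative difference between the observed counts $O^B_i$ and the expected counts $E^B_i$ in group B. The factor $(\bar y_i - O_i)/(\bar y_i - 1)$ found in $V^B_i$ can be interpreted as a multiplicity correction, that is, a correction for ties [Klein and Moeschberger, 2003, p. 207]. When only one event is witnessed between two consecutive observation times, then $O_i = 1$, the correction factor equals one, and $V^B_i = E^B_i (1 - E^B_i)$. We remark that the above formulation is also found in the work of Cox [1972, (26)].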
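As an illustration, the Z-score can be computed from interval-level summaries as follows. This is a minimal sketch that assumes the standard tie-corrected hypergeometric variance, $O_i\,(\bar y^B_i/\bar y_i)(\bar y^A_i/\bar y_i)(\bar y_i - O_i)/(\bar y_i - 1)$; the function name and the data layout are choices made here for illustration.

```python
from math import sqrt


def logrank_z(intervals):
    """Logrank Z-score with the usual correction for ties.

    `intervals` is a list of (n_a, n_b, o_a, o_b): numbers at risk and
    events observed in groups A and B for each observation interval
    (a hypothetical layout chosen for this sketch).
    """
    numerator, variance = 0.0, 0.0
    for n_a, n_b, o_a, o_b in intervals:
        n, o = n_a + n_b, o_a + o_b
        if o == 0 or n < 2:
            continue  # intervals without events carry no information
        expected_b = o * n_b / n              # expected events in B under the null
        numerator += o_b - expected_b         # observed minus expected
        variance += o * (n_b / n) * (n_a / n) * (n - o) / (n - 1)
    if variance == 0.0:
        raise ValueError("no informative intervals: the Z-score is undefined")
    return numerator / sqrt(variance)
```

A negative value indicates fewer events in group B than expected under the null, which is the direction of interest when $\theta_1 < 1$.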
We put forward the Gaussian approximation $S^G_k$ to the logrank statistic $S^{\theta_1}_{1,k}$ (its derivation is given in Appendix C). It is given by (4.2) and depends on the data only through the Z-score $Z_k$ and $N_k$, the total number of observations up until time $t_k$, together with the design quantities $\theta_1$, $m_A$ and $m_B$. For an arbitrary random observation time $t_K \in \{t_1, t_2, \dots\}$, we refer to the test $\xi^G_K = 1\{S^G_K \ge 1/\alpha\}$ as the Gaussian AV logrank test for (2.3). Recall that we test $\theta_0 = 1$, which corresponds to the asymptotic mean of the Z-score under the null hypothesis being $\mu_0 = 0$. In Appendix C.1, extensive simulations are performed to show in which regimes the Gaussian logrank test retains type-I error guarantees. In Appendix C.2, it is shown that, under continuous monitoring, the Gaussian AV logrank test tends to be more conservative: it needs more data than the exact one. The conclusion is the following: $S^G_K$ can be used for designs with balanced allocation, and it approximates $S^{\theta_1}_{1,K}$ well for hazard ratios between 0.5 and 2.
We now compare the rejection regions defined by the Gaussian logrank test to those obtained under continuous monitoring with α-spending and group-sequential approaches.
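To make the idea of a Gaussian likelihood ratio concrete, the sketch below computes one from summary statistics. It rests on an assumption that is not taken from this section: that under hazard ratio $\theta_1$ the Z-score after $n$ events is approximately $N(\mu_1 \sqrt{n}, 1)$ with Schoenfeld-type drift $\mu_1 = \log(\theta_1)\sqrt{m_A m_B}/(m_A + m_B)$, and standard normal under the null. The authors' expression for $S^G_k$ may differ in detail; the function name is hypothetical.

```python
from math import exp, log, sqrt


def gaussian_av_logrank(z, n_events, theta1, m_a, m_b):
    """Likelihood ratio of N(mu1 * sqrt(n), 1) against N(0, 1) at the observed Z.

    ASSUMPTION: mu1 is the Schoenfeld-type drift of the Z-score per
    sqrt(event); this is an illustrative reconstruction, not necessarily
    the exact form of the paper's statistic.
    """
    mu1 = log(theta1) * sqrt(m_a * m_b) / (m_a + m_b)
    return exp(mu1 * sqrt(n_events) * z - 0.5 * n_events * mu1 ** 2)


def gaussian_av_reject(z, n_events, theta1, m_a, m_b, alpha=0.05):
    """Reject the null as soon as the statistic reaches 1/alpha."""
    return gaussian_av_logrank(z, n_events, theta1, m_a, m_b) >= 1 / alpha
```

Only the Z-score, the event count and the initial group sizes enter the computation, which is why such a statistic remains usable when the full risk-set process is confidential.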

Rejection region and α-spending
In this section we compare the regions of Z-scores for which α-spending approaches and the AV logrank test reject the null hypothesis of no effect (hazard ratio $\theta_0 = 1$). The two main α-spending approaches discussed here are due to Pocock [1977] and O'Brien and Fleming [1979]. We provide two reasons why the main focus of the comparison, however, will be on the O'Brien-Fleming approach. Firstly, in retrospect, Pocock himself believes that his approach leads to boundaries that are unsuitable [Pocock, 2006]. One main feature of the Pocock procedure is that the rejection regions are the same regardless of whether the (interim) analyses are conducted at the start or the end of the trial. In practice this leads to many trials being stopped for benefit based on (too) small sample sizes and with unrealistically large treatment effects [Pocock, 2006]. In contrast, the rejection boundary of the O'Brien-Fleming procedure is more conservative at the start than at the end of the trial. Secondly, the Pocock procedure only allows for a finite number of planned analyses and, therefore, cannot be monitored continuously, whereas this is possible with the O'Brien-Fleming α-spending approach. Hence, the fair comparison is between the two procedures (the AV logrank test and the O'Brien-Fleming α-spending approach) that allow for continuous monitoring.
We begin by specifying the rejection regions for both the Gaussian AV logrank test and the O'Brien-Fleming α-spending procedure. For the Gaussian AV logrank test we compute the region of Z-scores that rejects the null hypothesis. Indeed, using (4.2), whenever $m_A = m_B$ the event $S^G_k \ge 1/\alpha$ can be rewritten as the Z-score crossing a boundary that depends on the number of events observed so far.

The O'Brien-Fleming procedure is based on a Brownian-motion approximation to the sequentially computed logrank Z-score. Indeed, for large values of $n_{\max}$ and $t \in [0, 1]$, the process $t \mapsto \sqrt{\lfloor t n_{\max} \rfloor / n_{\max}}\; Z_{\lfloor t n_{\max} \rfloor}$ can be approximated by a Brownian motion $B_t$. We stress the fact that $n_{\max}$ has to be set in advance. If $B_t$ is a Brownian motion, the reflection principle, a well-known but nontrivial application of the symmetry of $B_t$, implies that

$$P\Big\{ \sup_{0 \le t \le 1} B_t \ge c \Big\} = 2\, P\{ B_1 \ge c \}.$$

Since $B_1$ is Gaussian with mean zero and standard deviation 1, setting $c = q_{1-\alpha/2}$, the $(1 - \alpha/2)$-quantile of a standard Gaussian distribution, gives

$$P\Big\{ \sup_{0 \le t \le 1} B_t \ge q_{1-\alpha/2} \Big\} = \alpha,$$

or, in other words, the procedure that continuously monitors whether the Z-score after $n$ events crosses the boundary $\sqrt{n_{\max}/n}\; q_{1-\alpha/2}$ guarantees approximate type-I error $\alpha$. Given a hazard ratio $\theta_1$ under the alternative hypothesis, $n_{\max}$ can be set to achieve a desired type-II error. The left-handed procedure can be worked out similarly, and we obtain that, for $m_A = m_B$, the continuous-monitoring version of the O'Brien-Fleming procedure rejects as soon as the Z-score crosses the corresponding boundary.

The two regions of Z-statistic values share an important feature: they are more conservative at small sample sizes than at larger ones, requiring more extreme values of the Z-statistic at the start of the trial. This sets them apart from the Pocock spending function, which requires equally extreme values of the Z-statistic at small and large sample sizes. Figure 2 shows both the Gaussian AV logrank and the O'Brien-Fleming α-spending rejection regions. Additionally, Figure 2 shows the boundary of the Pocock α-spending function for 10 interim analyses. Note that the definition of the AV logrank test rejection region requires an explicit value for the effect size $\theta_1 = \theta_{\min}$ of minimum clinical relevance, while that value is implicit in the definition of the α-spending rejection region: to specify a maximum sample size $n_{\max}$ that achieves a certain power, an effect size of minimal interest is also assumed. A fixed-sample-size analysis designed to detect a minimum hazard ratio of 0.7 would need 195 events to achieve 80% power if the true hazard ratio is also 0.7. A sequential analysis using α-spending requires a slightly larger maximum number of events: 205 with the O'Brien-Fleming spending function and 245 with the Pocock α-spending function, when we design for 10 interim analyses. We investigate the number of events needed by the Gaussian AV logrank test in Appendix C.2. For the α-spending procedures, continuing beyond $n_{\max}$ is problematic. This is not the case for the AV logrank test: because it allows for unlimited monitoring, $n_{\max}$ is only a soft constraint on the study, and there is no penalty in type-I error for continuing after $n_{\max}$ events have been witnessed.
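The continuous-monitoring O'Brien-Fleming boundary on the Z-score can be evaluated directly. The sketch below assumes the Brownian scaling $\sqrt{n/n_{\max}}\, Z_n \approx B_{n/n_{\max}}$, so that the right-handed boundary after $n$ events is $\sqrt{n_{\max}/n}\; q_{1-\alpha/2}$; the function name is hypothetical.

```python
from statistics import NormalDist


def obf_boundary(n, n_max, alpha=0.05):
    """Right-handed boundary on the Z-score after n of at most n_max events.

    ASSUMPTION: Brownian scaling of the logrank Z-score, which gives a
    boundary proportional to 1 / sqrt(n / n_max).
    """
    if not 0 < n <= n_max:
        raise ValueError("n must lie in (0, n_max]")
    q = NormalDist().inv_cdf(1 - alpha / 2)  # (1 - alpha/2)-quantile of N(0, 1)
    return q * (n_max / n) ** 0.5
```

The boundary is twice as extreme at a quarter of the planned events as at $n_{\max}$, which reflects the conservativeness early in the trial noted in the text. For the left-handed procedure, the mirrored value `-obf_boundary(n, n_max)` applies.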
The benefit of a sequential approach is that if there is evidence that the hazard ratio is more extreme than was anticipated under the alternative hypothesis, we can detect that with fewer events than the maximum sample size. The left column of Figure 3 illustrates that we benefit because the true hazard ratio could be more extreme than we designed for (e.g. 0.5 instead of 0.7; a larger risk reduction in the treatment group) and the data reflect that. We also benefit from a sequential analysis if the true hazard ratio is 0.7 but by chance the values of our Z-statistics are more extreme than expected. The major difference between α-spending approaches and the AV logrank test is that the AV test does not require setting a maximum sample size. In fact, it allows the sample size to be increased indefinitely without ever spending all of α. An α-spending approach designed to have 80% power will miss out on rejecting the null hypothesis in 20% of the cases (the type-II error), as is illustrated in the bottom middle plot of Figure 3 by the sample paths that remain (dark) green. In contrast, the AV logrank test can potentially reject with 100% power by continuing to sample. Among the sample paths of 500 events in Figure 3, all but one sample path of Z-statistics could be rejected at a larger sample size by the AV logrank test. By extending the trial, the AV logrank test can potentially have 100% power if the true hazard ratio is at least as small as the hazard ratio set for minimum clinical relevance in the design of the test. Still, type-I error is controlled. The bottom right plot of Figure 3 shows two null sample paths with a true hazard ratio of 1 that are rejected by the O'Brien-Fleming α-spending region, but not by the AV logrank test. Here, the AV logrank test is more conservative.
It is known that α-spending methods behave poorly in case of unbalanced allocation [Wu and Xiong, 2017]. In Appendix C.1 we show that our Gaussian approximation to the logrank test is also not an E-variable in case of unbalanced allocation. Our exact AV logrank test, however, is an E-variable under any allocation since it is defined directly on the risk-set process (3.3). This suggests that if the complete data set is available and allocation is unbalanced, the exact AV logrank test should be preferred over the Gaussian approximation and the α-spending methods.

OPTIONAL CONTINUATION AND LIVE META-ANALYSIS
In this section, we address optional continuation and live meta-analysis, the continuous aggregation of evidence from multiple experiments. For instance, data could come from medical trials conducted in different hospitals or in different countries. In such cases, we compare a global null hypothesis $H_0$ that is addressed in all trials (for instance, $\theta_0 = 1$) to an alternative hypothesis $H_1$ that allows for different hazard ratios in each experiment. The present approach covers even the case in which the decision to start each experiment might depend on the observations made in experiments that are already in progress. Assume that there are $k_E$ experiments, $E^{(1)}, \dots, E^{(k_E)}$, ordered by their respective starting times. The meta-analysis statistic $S^{\mathrm{meta}}_{\theta_0,t}$ is the product of the AV logrank statistics of the individual experiments, and monitoring it retains the type-I error guarantee.

This result follows from a reduction to independent left-truncation; we refer to left-truncation in the specific sense defined by Andersen et al. [1993]. Indeed, even in the presence of dependencies on other studies, the observations made in $E^{(k)}$ can be regarded as a left-truncated sample. Here, the time at which observation in $E^{(k)}$ is started is random and only participants that have not witnessed an event are recruited into the study. One may worry that these dependencies may alter the sequential properties of $S^{\mathrm{meta}}_{\theta_0,t}$, but this is not the case. Since the truncation time for $E^{(k)}$ is based on data that are independent of those of experiment $E^{(k)}$ (it is possibly based on the observations made in all other experiments), it follows from results of Andersen et al. [1993] (see Appendix A.2) that the sequential-multinomial interpretation of the partial likelihood for the truncated data remains valid. Consequently, so do the sequentially computed AV logrank statistic and the product statistic $S^{\mathrm{meta}}_{\theta_0,t}$. By continuously monitoring $S^{\mathrm{meta}}_{\theta_0,t}$, we effectively perform an online, cumulative and possibly live meta-analysis that remains valid irrespective of the order in which the events of the different trials are observed. Importantly, unlike in α-spending approaches, the maximum number of trials and the maximum sample size (number of events) per trial do not have to be fixed in advance; we can always decide to start a new trial, or to postpone ending a trial and wait for additional events.

Figure caption (beginning not recoverable): [...] 2 (designed to detect a hazard ratio of 0.7 with 80% power). Data are simulated under balanced allocation (m 1 = m 0 = 5000) and as time-to-event data with possible ties. The logrank Z-statistic does not have a value for all n; it sometimes jumps with several additional events at a time.
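The product statistic for a live meta-analysis is simple to track in practice. A minimal sketch, assuming each trial reports the current value of its own AV logrank statistic (trials that have not started contribute a factor of 1); the function names are hypothetical.

```python
from math import prod


def meta_statistic(trial_values):
    """Product of the current AV logrank statistics of all trials so far."""
    return prod(trial_values)


def meta_reject(trial_values, alpha=0.05):
    """Reject the global null once the product reaches 1/alpha."""
    return meta_statistic(trial_values) >= 1 / alpha
```

For example, three trials with individual values 4, 2.5 and 2 would not each reject at $\alpha = 0.05$, but their product equals 20 and does. New trials can be appended to the list at any time without adjusting the threshold.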

ANYTIME-VALID CONFIDENCE SEQUENCES
Anytime-valid (AV) confidence sequences correspond to anytime-valid tests in the same way that fixed-sample confidence intervals correspond to fixed-sample tests. Indeed, it is possible to "invert" a fixed-sample test to build a confidence interval: the parameters of the null hypothesis that are not rejected by the test form a confidence interval. Analogously, test martingales can be used to derive AV confidence sequences [Darling and Robbins, 1967, Lai, 1976, Howard et al., 2018a,b]. In our setting, a $(1 - \alpha)$-AV confidence sequence is a sequence of confidence intervals $\{\mathrm{CI}_t\}_{t \ge 0}$ such that

$$P_\theta\{\theta \notin \mathrm{CI}_t \text{ for some } t \ge 0\} \le \alpha. \qquad (6.1)$$

A standard way to design $(1 - \alpha)$-AV confidence sequences, translated to our logrank setting, is to use a prequential plug-in test martingale $S^{\mathrm{preq}}_{\theta_0,t}$ or the Bayesian version $S^W_{\theta_0,t}$ as in Section 3.2. At time $t$, one reports $\mathrm{CI}_t = [\theta^L_t, \theta^U_t]$, the smallest interval such that $S^{\mathrm{preq}}_{\theta_0,t} > 1/\alpha$ for every $\theta_0$ outside it. Ville's inequality readily implies that this is indeed an AV confidence sequence. The same construction can be made for arbitrary instances of $S^q_{\theta_0,t}$ as in (3.5).
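The inversion can be sketched on a grid of candidate null values. The code below is an illustration under stated assumptions, not the authors' implementation: it assumes untied events, it takes the per-event null probability from the Cox partial likelihood (an event in group B has weight $\theta$ and one in group A has weight 1, normalized over the risk set), and it uses a Bayesian mixture with a uniform discrete prior on `prior_grid` as the test martingale. All names and the data layout are hypothetical.

```python
from math import exp, log


def log_partial_lik(events, theta):
    """Log partial likelihood of the observed event order at hazard ratio theta.

    `events` is a list of (n_a, n_b, in_b): risk-set sizes just before the
    event and whether the event occurred in group B.
    """
    total = 0.0
    for n_a, n_b, in_b in events:
        total += (log(theta) if in_b else 0.0) - log(n_a + theta * n_b)
    return total


def mixture_statistic(events, theta0, prior_grid):
    """Mixture of likelihood ratios against theta0, uniform prior on prior_grid."""
    null = log_partial_lik(events, theta0)
    ratios = (exp(log_partial_lik(events, th) - null) for th in prior_grid)
    return sum(ratios) / len(prior_grid)


def confidence_interval(events, null_grid, prior_grid, alpha=0.05):
    """Smallest grid interval containing every theta0 that is not rejected."""
    kept = [th0 for th0 in null_grid
            if mixture_statistic(events, th0, prior_grid) <= 1 / alpha]
    return (min(kept), max(kept)) if kept else None
```

Because the interval is computed on a finite grid, its endpoints are only as precise as the spacing of `null_grid`; a finer grid narrows the approximation at the cost of more evaluations.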

POWER AND SAMPLE SIZE
In this section, we investigate the power properties of the AV logrank test by studying specific stopping times. We have seen that by observing arbitrarily long sequences of events the AV logrank test can achieve type-II errors that are as close to zero as desired. However, in practice it is necessary to plan for a maximum number of events $n_{\max}$, so that the experiment is stopped either as soon as the null hypothesis is rejected or when $n_{\max}$ events have been observed. In the latter case, there is no evidence to reject the null hypothesis. We assess via simulation the value of $n_{\max}$ needed to guarantee 20% type-II error (80% power) for the exact and Gaussian AV logrank tests. We compare this to the $n_{\max}$ needed to achieve the same power using the continuous-monitoring O'Brien-Fleming α-spending procedure introduced in the previous section, and the fixed-sample-size classic logrank test.

Figure 4 shows simulation results establishing three types of sample sizes. The leftmost panel ("Maximum") shows the sample size $n_{\max}$ described earlier, which would be required to design the experiment. We stress the fact that, using the classic logrank test or α-spending designs, events beyond $n_{\max}$ cannot be analyzed. The rightmost panel of Figure 4 ("Mean") shows the sample sizes that capture the expected duration of the trial. It expresses the mean number of events, under the alternative hypothesis, that will be observed before the trial can be stopped. Here, for the AV logrank tests, we use the aggressive stopping rule that stops as soon as $S^{\theta_1}_{\theta_0,t} \ge 1/\alpha = 20$ or $n = n_{\max}$. In the case of α-spending approaches and the AV logrank test, this number of events is always smaller than the maximum needed in the design stage. Lastly, the middle panel ("Conditional Mean") shows an even smaller number for those tests that have a flexible sample size: the expected stopping time given that the trial is stopped before the maximum $n_{\max}$ was reached, which only happens if the null is rejected. For comparison purposes, all sample sizes are shown relative to (i.e., divided by) the fixed sample size needed by the classical logrank test to obtain 80% power.

Note that for small sample sizes (that is, for small hazard ratios), both the classic logrank test and O'Brien-Fleming α-spending are not recommended due to lack of type-I error control. They are based on Schoenfeld's Gaussian approximation, which underestimates the number of events required for hazard ratios far away from 1. For example, simulations show that for $\theta_1 = 0.1$, $n = 6$ or 7 events will be necessary. We give further details in Appendix A.4 (see also Figure 4). In summary, at all hazard ratios at which the Gaussian approximation to the classic logrank test is accurate (say for $\theta_1 \ge 0.3$), the mean number of events needed by the AV logrank tests is about the same as or noticeably smaller than that needed when using a fixed-sample-size analysis.
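The stopping rule described above can be explored with a small simulation. The sketch below is not the simulation study behind Figure 4. It assumes untied events, no censoring, and proportional hazards, under which the next event falls in group B with probability $\theta\,\bar y^B/(\bar y^A + \theta\,\bar y^B)$, and it uses the exact per-event likelihood ratio of $\theta_1$ against $\theta_0 = 1$. Function names and parameters are hypothetical.

```python
import random


def simulate_stopping(theta_true, theta1, m, n_max, alpha=0.05, rng=random):
    """Simulate one trial; return (number of events at stopping, rejected?).

    Both groups start with m participants; requires n_max < 2 * m.
    """
    n_a = n_b = m
    s = 1.0
    for n in range(1, n_max + 1):
        # Probability that the next event occurs in group B.
        p_b = theta_true * n_b / (n_a + theta_true * n_b)
        in_b = rng.random() < p_b
        # Likelihood ratio of this event, theta1 versus theta0 = 1.
        ratio = (n_a + n_b) / (n_a + theta1 * n_b)
        if in_b:
            ratio *= theta1
            n_b -= 1
        else:
            n_a -= 1
        s *= ratio
        if s >= 1 / alpha:
            return n, True
    return n_max, False


def summarize(runs):
    """Rejection rate and mean number of events over simulated trials."""
    rate = sum(rejected for _, rejected in runs) / len(runs)
    mean_events = sum(n for n, _ in runs) / len(runs)
    return rate, mean_events
```

Calling `summarize([simulate_stopping(0.7, 0.7, 5000, 400) for _ in range(1000)])` gives a rough estimate of the power and of the mean number of events for a design with $\theta_1 = 0.7$; replacing the true hazard ratio by 1 estimates the type-I error.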

DISCUSSION, CONCLUSION AND FUTURE WORK
We introduced the AV logrank test, a version of the logrank test that retains type-I error guarantees under optional stopping and continuation. Extensive simulations reveal that, if we do engage in optional stopping, it is competitive with the classic logrank test (which allows neither in-trial optional stopping nor optional continuation) and with α-spending procedures (which allow forms of optional stopping but not optional continuation). We provided an approximate test for applications in which only summary statistics are available and also showed how the AV logrank test can be used in combination with (informative) priors and prequential learning approaches when no effect size of minimal clinical relevance can be specified. Two of our extensions invite further research. We introduced anytime-valid confidence sequences for the hazard ratio, and will study their performance in comparison to other approaches in future work. We also introduced an extension to Cox' proportional hazards regression, which retains type-I error guarantees even if the alternative model is equipped with arbitrary priors. In future work, we plan to implement this extension, which requires the use of sophisticated methods for estimating mixture models. The GROW AV logrank tests (exact and Gaussian) are already available in our safestats R package [Turner et al., 2022]. We end with two final points of discussion: staggered entries and doomed trials.

Staggered entry
Earlier approaches to sequential time-to-event analysis were also studied under scenarios of staggered entry, where each patient has their own event time (e.g., time to death since surgery), but patients do not enter the follow-up simultaneously (such that the risk set of, say, a two-day-after-surgery event changes when new participants enter and survive two days). Sellke and Siegmund [1983] and Slud [1984] show that, in general, martingale properties cannot be preserved under such staggered entry settings, but that asymptotic results are hopeful [Sellke and Siegmund, 1983] as long as certain scenarios are excluded [Slud, 1984]. When all participants' risk is on the same (calendar) time scale (e.g., infection risk in a pandemic; staggered entry now amounts to left-truncation, which we can deal with), or new patients enter in large groups (allowing us to stratify), staggered entry poses no problem for our methods. But research is still ongoing into those scenarios in which our inference is fully AV for patient time under staggered entry, and those that need extra care.

Your trial is not doomed
In their summary of conditional power approaches in sequential analysis, Proschan, Lan, and Wittes [2006] write that low conditional power makes a trial futile. Continuing a trial in such a case could only be worth the effort to rule out an effect of clinical relevance, when the effect can be estimated with enough precision. However, if "both conditional and revised unconditional power are low, the trial is doomed because a null result is both likely and uninformative" [Proschan et al., 2006, p. 63]. While this is the case for all existing sequential approaches that set a maximum sample size, this is not the case for AV tests. Any trial can be extended and possibly achieve 100% power or, in an anytime-valid confidence sequence, show that the effect is too small to be of interest. This is especially useful for time-to-event data, where the sample size can increase by extending the follow-up time of the trial, without recruiting more participants. Moreover, new participants can always be enrolled either within the same trial or by spurring new trials that can be combined indefinitely in a cumulative meta-analysis.

Number of events for 80% power
Figure 4: Maximum and expected (Mean) number of events needed to reject the null hypothesis with 80% power. 'Conditional Mean' refers to the number of events needed given that the null hypothesis is indeed rejected. The maximum number of events needed using AV logrank statistics is higher than that of a fixed-sample test, but lower in expectation (see Section 7). All simulations are performed with $\alpha = 0.05$ and tests are designed to detect the hazard ratio $\theta_1$ shown on the x-axis. Data are generated using that same hazard ratio. The classical logrank test needs the following sample sizes (number of events) $n(\theta_1)$ for an 80%-power design to detect hazard ratio $\theta_1$: n(0.1) = 5, n(0.2) = 10, n(0.3) = 18, n(0.4) = 30, n(0.5) = 52, n(0.6) = 95, n(0.7) = 195, n(0.8) = 497 and n(0.9) = 2228. These sample sizes represent the 100% line in all plots.

APPENDIX A. OMITTED PROOFS AND DETAILS
In this section we provide proofs and remarks omitted from previous sections. In Appendix A.1 we relate growth-rate optimality to the minimum expected stopping time. In Appendix A.2, we show that the AV logrank statistic is a continuous-time martingale, and that this is also true for general patterns of incomplete observation, such as left truncation and filtering, as a consequence of the results of Andersen et al. [1993]. In Appendix A.3, we prove the claims made in Section 3.3 about the martingale structure of the AV logrank test in the presence of ties. Lastly, in Appendix A.4, we give further details on the simulations used to compute the planned maximum sample sizes for a given targeted power. Under the alternative and optional stopping, the observed sample size is in many cases lower.

A.1 Expected Stopping Time, GROW and Wald's Identity
Here we motivate the GROW criterion by showing that it minimizes, in a worst-case sense, the expected number of events needed before there is sufficient evidence to stop. Let $P_0$ represent our null model, and let, as before, the alternative hypothesis be $H_1 : \theta \le \theta_1$ for some $\theta_1 < \theta_0$. Suppose we perform a level-$\alpha$ test based on a test martingale $S^q_{\theta_0,t}$ using the stopping rule $\tau^q$ that stops as soon as $S^q_{\theta_0,t}$ reaches the threshold $1/\alpha$, that is, $\tau^q = \inf\{t : S^q_{\theta_0,t} \ge 1/\alpha\}$. In the main text we elaborated on how $S^{\theta_1}_{\theta_0,t}$ is optimal with respect to the GROW criterion. We now show that the problem of minimizing the worst-case expected number of events $E_\theta[N_{\tau^q}]$ over $q$ is approximately equivalent to finding the GROW test martingale. To do so, we make simplifying assumptions that reduce the problem to an i.i.d. experiment. This allows us to employ a standard argument based on an identity of Wald [1947], originally due to Breiman [1961]. For this we assume that the initial risk sets (i.e., $\bar y^A_0$ and $\bar y^B_0$) are large enough so that, for all sample sizes we will ever encounter, $\bar y^A_t / \bar y^B_t \approx \bar y^A_0 / \bar y^B_0$. This allows us to treat the likelihood of the participant(s) $I_{(k)}$ having witnessed the event at time $T_{(k)}$ as independent of $t$, that is, as an i.i.d. experiment.
The argument of Breiman [1961] relates the expected number of events to the expected value of our stopped AV logrank statistic. Suppose first that we happen to know that the data come from a specific $\theta$ in the alternative hypothesis. Then $S^q_{\theta_0,\tau}$ is the product of $N_\tau$ factors, the ratios $R^q_{\theta_0,(i)} = q_{(i)}(I_{(i)}) / p_{\theta_0,(i)}(I_{(i)})$ at the $i$th event. Wald's identity applied to its logarithm implies

$$E_\theta[N_\tau] = \frac{E_\theta[\ln S^q_{\theta_0,\tau}]}{E_\theta[\ln R^q_{\theta_0,(1)}]}.$$

For simplicity we will further assume that the number of participants at risk is large enough so that the probability that we run out of data before we can reject is negligible. Because of the choice of the stopping rule $\tau^q$, the right-hand side of the last display can then be further rewritten as

$$E_\theta[N_{\tau^q}] = \frac{\ln(1/\alpha) + \text{VERY SMALL}}{E_\theta\big[\ln\big(q_{(1)}(I_{(1)}) / p_{\theta_0,(1)}(I_{(1)})\big)\big]}, \qquad \text{(A.1)}$$

where VERY SMALL is between 0 and $|\log(\theta_1/\theta_0)|$. The equality follows because we reject as soon as $S^q_{\theta_0,t} \ge 1/\alpha$, so $S^q_{\theta_0,\tau}$ cannot be smaller than $1/\alpha$, and it cannot be larger by more than a factor equal to the maximum likelihood ratio at a single outcome (if we did not ignore the probability of stopping because we run out of data, there would be an additional small term in the numerator).
With (A.1) at hand, we can relate our choice of $q$ to the expected number of events witnessed before stopping. If, for a fixed $\theta$, we try to find the $q$ that minimizes the expected number of events $E_\theta[N_{\tau^q}]$ and, as is customary in sequential analysis, we approximate the minimum by ignoring the VERY SMALL part, we see that the expression is minimized by maximizing the denominator $E_\theta[\ln(Q_{(1)}/P_{\theta_0,(1)})]$ over $q$. The maximum is achieved by $Q_{(1)} = P_{\theta,(1)}$; the expression in the denominator then becomes the Kullback-Leibler divergence between two Bernoulli distributions. It follows that, under $\theta$, the expected number of outcomes until rejection is minimized by $Q_{(1)} = P_\theta$. Thus, in this case, we use the GROW $S^\theta_{\theta_0,t}$ as test statistic. However, we still need to consider the fact that the real $H_1$ is composite: as statisticians, we do not know the actual $\theta$; we only know $0 < \theta \le \theta_1$. A worst-case approach uses the $q$ achieving

$$\max_q \min_{\theta \le \theta_1} E_\theta\big[\ln\big(q_{(1)}(I_{(1)}) / p_{\theta_0,(1)}(I_{(1)})\big)\big],$$

since, repeating the reasoning leading to (A.1), this $q$ should be close to achieving the min-max number of events until rejection, given by $\min_q \max_{\theta \le \theta_1} E_\theta[N_{\tau^q}]$. But this just tells us to use the GROW E-variable relative to $H_1$, which is what we were arguing for.
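The approximation obtained by ignoring the VERY SMALL term is easy to evaluate numerically. A minimal sketch under the i.i.d. simplification used above: each event falls in group B with probability $\theta r/(1 + \theta r)$, where $r$ is the (assumed constant) ratio of the numbers at risk in groups B and A, and $q$ is taken equal to the true $\theta$. The function name and the parameter `ratio_b_to_a` are hypothetical.

```python
from math import log


def approx_expected_events(theta, theta0=1.0, alpha=0.05, ratio_b_to_a=1.0):
    """Approximate E[N_tau] as log(1/alpha) / KL(P_theta || P_theta0).

    Uses the Bernoulli model for "the event occurred in group B" and
    ignores the overshoot term, so it is a rough guide only.
    """
    p1 = theta * ratio_b_to_a / (1 + theta * ratio_b_to_a)
    p0 = theta0 * ratio_b_to_a / (1 + theta0 * ratio_b_to_a)
    kl = p1 * log(p1 / p0) + (1 - p1) * log((1 - p1) / (1 - p0))
    return log(1 / alpha) / kl
```

The expected number of events grows quickly as $\theta$ approaches $\theta_0$, because the Kullback-Leibler divergence in the denominator shrinks to zero.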

A.2 Continuous time and anytime validity
In this section, we show the anytime validity of the AV logrank test. This is done via Ville's inequality, for which it suffices to show that $S^q_{\theta_0} = (S^q_{\theta_0,t})_{t \ge 0}$ is a nonnegative (super)martingale. To do so, we use the counting process formalism. A few definitions are in order. Only in this section, we assume knowledge of counting process theory [see Andersen et al., 1993, Fleming and Harrington, 2011]. Denote, for $i = 1, \dots, m$, by $\tilde N^i_t = 1\{T^i \le t\}$ the counting process associated to each participant, and let $y^i_t$ be the at-risk process. For each participant, the censored process $N^i_t$, which is observed, is given by $dN^i_t = y^i_t\, d\tilde N^i_t$; we use this convention to signify that $N^i_t = \int_0^t y^i_s\, d\tilde N^i_s$. We define the sigma-algebra $\mathcal F_t := \sigma(N^j_s : 0 \le s \le t,\ j = 1, \dots, m)$, which, as usual, can be interpreted as the information in the study up to time $t$.
One of the results of counting process theory is that the processes $dN^i_t - y^i_t\, d\lambda^i_t$ are martingales, where, recall, $y^i_t = 1\{X^i \ge t\}$ is the at-risk process and $\lambda^i_t$ is the hazard function associated to $T^i$. In that case, $y^i_t\, d\lambda^i_t$ is called the compensator of $N^i_t$. The result that the AV logrank test is a martingale hinges specifically on this structure. Thus, any pattern of incomplete observation that preserves this martingale structure also preserves the martingale property of the AV logrank test, and consequently its type-I error guarantees. Andersen et al. [1993, III.4] show exactly this under general patterns of incomplete observation, provided that the mechanisms are independent of the observations. With this in mind, in the following, we only assume that the counting processes $N^i_t$ have compensators $A^i_t$ given by $dA^i_t = y^i_t\, d\lambda^i_t$. The filtration $\mathcal F = (\mathcal F_s)_{s \ge 0}$ is right-continuous and we can safely identify predictable processes with left-continuous processes.

For some $\theta_0$, denote by $P_0$ the distribution under which, for each $i = 1, \dots, m$, the hazard function of participant $i$ is $\lambda^i_t = \lambda_t$ if $i \in A$ and $\lambda^i_t = \theta_0 \lambda_t$ if $i \in B$, where $\lambda_t$ is the baseline hazard. Let $q^1_t, \dots, q^m_t$ be predictable processes such that $\sum_{i \le m} q^i_t y^i_t = 1$ a.s. for all $t$, that is, $\{q^i_t\}_{i \in R_t}$ at each $t$ is a probability distribution over the participants at risk at time $t$. Define $r^i_t$ to be each of the ratios $r^i_t = q^i_t / p^i_{\theta_0,t}$. Define the predictable process $S^q_{\theta_0,t^-} = \lim_{s \uparrow t} S^q_{\theta_0,s}$. As such, at each $t$, the change $dS^q_{\theta_0,t} = S^q_{\theta_0,t} - S^q_{\theta_0,t^-}$ of the AV logrank statistic $S^q_{\theta_0}$ at time $t$, given in (3.5), can be computed as

$$dS^q_{\theta_0,t} = S^q_{\theta_0,t^-} \sum_{i \le m} (r^i_t - 1)\, dN^i_t,$$

because no two events happen simultaneously with positive probability. Since $S^q_{\theta_0,t^-}$ is predictable, it is enough to prove that the process $M_t$ defined by $dM_t = \sum_{i \le m} (1 - r^i_t)\, dN^i_t$ is a martingale [see Fleming and Harrington, 2011, Theorem 1.5.1]. Recall that $\bar y^A_t = \sum_{i \in A} y^i_t$ and $\bar y^B_t = \sum_{i \in B} y^i_t$. Then both $\bar y^A$ and $\bar y^B$ are left-continuous processes.
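Lemma A.1. Let $\{q^i_t\}_{i \le m}$ be a collection of nonnegative left-continuous processes $q^i = (q^i_t)_{t \ge 0}$ such that $\sum_{i \le m} y^i_t q^i_t = 1$ for all $t$. Let $\{p^i_{\theta_0,t}\}_{i \le m}$ be the collection of processes given by

$$p^i_{\theta_0,t} = \frac{1}{\bar y^A_t + \theta_0 \bar y^B_t} \ \text{ if } i \in A, \qquad p^i_{\theta_0,t} = \frac{\theta_0}{\bar y^A_t + \theta_0 \bar y^B_t} \ \text{ if } i \in B,$$

and let $r^i_t = q^i_t / p^i_{\theta_0,t}$. The process $M = (M_t)_{t \ge 0}$ given by $dM_t = \sum_{i \le m} (1 - r^i_t)\, dN^i_t$ is a martingale under $P_0$ with respect to the filtration $\mathcal F = (\mathcal F_t)_{t \ge 0}$.
Proof. It suffices to show that the compensator $A_t$ of $M_t$, given by $dA_t = \sum_{i \le m} (1 - r^i_t)\, y^i_t \lambda^i_t\, dt$, is zero. Define $\bar q^A_t = \sum_{i \in A} y^i_t q^i_t$ and $\bar q^B_t = \sum_{i \in B} y^i_t q^i_t$. Notice that by assumption $\bar q^A_t + \bar q^B_t = 1$, and recall that, under the null, $\lambda^i_t = p^i_{\theta_0,t}\, (\bar y^A_t + \theta_0 \bar y^B_t)\, \lambda_t$. Hence

$$\sum_{i \le m} (1 - r^i_t)\, y^i_t \lambda^i_t = (\bar y^A_t + \theta_0 \bar y^B_t)\, \lambda_t \sum_{i \le m} y^i_t\, (p^i_{\theta_0,t} - q^i_t) = (\bar y^A_t + \theta_0 \bar y^B_t)\, \lambda_t\, \big(1 - \bar q^A_t - \bar q^B_t\big) = 0,$$

where we used that $\sum_{i \le m} y^i_t p^i_{\theta_0,t} = 1$ and the assumption that $\sum_{i \le m} y^i_t q^i_t = \bar q^A_t + \bar q^B_t = 1$. As the compensator $A_t$ of $M_t$ is zero at each $t$, we conclude that $M_t$ is a martingale, as was to be shown.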
The previous discussion and the preceding lemma together yield the following corollary.
Corollary A.2. $S^q_{\theta_0} = (S^q_{\theta_0,t})_{t \ge 0}$ is a nonnegative martingale with expected value equal to one.
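The key step behind Lemma A.1 and Corollary A.2 can be checked numerically at a single event time. The sketch below is only an illustration: it assumes that the null probability of the next event hitting an at-risk participant has the partial-likelihood form $1/(\bar{y}^A + \theta_0 \bar{y}^B)$ in group A and $\theta_0/(\bar{y}^A + \theta_0 \bar{y}^B)$ in group B, and the function names are hypothetical. Under that assumption, the expected multiplicative increment $q^i/p^i$ equals one for any distribution $q$ on the risk set.

```python
from fractions import Fraction


def event_probs(n_a, n_b, theta):
    """Assumed probabilities that the next event hits each at-risk participant:
    1/(n_a + theta*n_b) for group A, theta/(n_a + theta*n_b) for group B."""
    theta = Fraction(theta)
    denom = n_a + theta * n_b
    return [Fraction(1) / denom] * n_a + [theta / denom] * n_b


def expected_increment(q, p):
    """E_p[q_i / p_i]: expected one-step multiplicative increment under p."""
    return sum(p_i * (q_i / p_i) for q_i, p_i in zip(q, p))


# Three participants at risk in group A, two in group B.
p_null = event_probs(3, 2, 1)              # null hazard ratio theta_0 = 1
q_alt = event_probs(3, 2, Fraction(1, 2))  # one possible choice of q
print(expected_increment(q_alt, p_null))   # 1
```

Exact rational arithmetic is used so that the identity holds without rounding error; the same check goes through for any other probability vector `q` on the risk set.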

A.3 Ties
The purpose of this section is twofold. Firstly, we prove Lemma 3.2. Secondly, we show that the conditional likelihood given in Section 3.3 indeed approximates the true conditional partial likelihood ratio under any distribution such that the hazard ratio is $\theta_1$.
Our general strategy in this case is similar to the one undertaken in the continuous-monitoring case: we build a test martingale with respect to a filtration $\mathbb{G}$, and use Ville's inequality to derive anytime-valid type-I error guarantees. Define, for each $k = 1, 2, \dots$, the sigma-algebra $\mathcal{G}_k$ generated by all observations made at times $t_1, \dots, t_k$, that is, $\mathcal{G}_k = \sigma(N^i_{t_l}, \tilde{N}^i_{t_l} : i = 1, \dots, m;\ l = 1, \dots, k)$, and the corresponding filtration $\mathbb{G} = (\mathcal{G}_k)_{k = 1, 2, \dots}$. Under Cox's proportional hazards model, conditionally on $\mathcal{G}_{k-1}$, our observations $\Delta N^A_k$ and $\Delta N^B_k$ are binomially distributed with parameters depending on the hazard function (see Lemma A.3 below). By conditioning both on $\mathcal{G}_{k-1}$ and on the total number of events $\Delta N_k = \Delta N^A_k + \Delta N^B_k$, we use the likelihood of having observed $\Delta N^B_k$, which follows Fisher's noncentral hypergeometric distribution, as detailed in Corollary A.4. We gather these observations in the following lemma and corollary.

Proof of Lemma A.3. The result is standard, and it follows from explicitly solving for $\lambda$ in (2.1) and computing the conditional probability in (2.2) for each group.
Next, we use a standard result: given two binomially distributed random variables $X$ and $Y$, the distribution of $X$ conditionally on $X + Y$ is Fisher's noncentral hypergeometric distribution. We apply this to $\Delta N^A_k$ and $\Delta N^B_k$ from the previous lemma and spell out the corresponding parameters in the following corollary.
Then, conditionally on $\mathcal{G}_{k-1}$ and on the total number of events $\Delta N_k$, the likelihood of having observed $\Delta N^B_k$ events in group B is given by Fisher's noncentral hypergeometric distribution [...]

Naively, one could use a partial likelihood ratio, just as in the absence of ties, to derive a sequential test. This, however, is not satisfactory because, in general, the parameter $\omega_k$ depends heavily on the unknown baseline hazard function $\lambda^A$. In contrast to the general case, when the hazard ratio $\theta$ is one, the parameter $\omega_k$ equals 1, and Fisher's noncentral hypergeometric distribution reduces to the conventional hypergeometric distribution. With this observation at hand, if $\{q_k\}_{k = 1, 2, \dots}$ is a sequence of conditional distributions $q_k(\cdot)$ on the possible values of $\Delta N^B_k$, we can build a sequential test for (2.3), with its corresponding type-I error guarantee. We give the details in the following corollary, and subsequently point at a useful choice for $q$ that approximates the real likelihood.
The choice of $q$ for our statistic presented in Section 3.3 follows from an approximation of the parameter $\omega$ for small $\Delta t_k = t_k - t_{k-1}$. As noted by Mehrotra and Roth [2001], if [...] approximates the real conditional likelihood under any alternative for which the true hazard ratio is $\theta_1$. Hence, the sequentially computed statistic approximates the true partial likelihood ratio of the data observed up to time $t_k$ in the presence of ties, and we recommend its use.
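The reduction to the conventional hypergeometric distribution at odds ratio one can be illustrated with a small sketch. It treats the odds ratio `omega` as a given number (its expression in terms of the binomial parameters is not reproduced here), and the function names are hypothetical.

```python
from math import comb


def fisher_nchg_pmf(x, n_a, n_b, total, omega):
    """P(X = x) under Fisher's noncentral hypergeometric distribution, where X
    counts the group-B events among `total` events, n_a and n_b are the risk
    set sizes, and omega is the odds ratio (assumed given)."""
    lo, hi = max(0, total - n_a), min(n_b, total)
    if not lo <= x <= hi:
        return 0.0

    def weight(j):
        return comb(n_b, j) * comb(n_a, total - j) * omega ** j

    return weight(x) / sum(weight(j) for j in range(lo, hi + 1))


def hypergeom_pmf(x, n_a, n_b, total):
    """Conventional hypergeometric pmf for x inside the support."""
    return comb(n_b, x) * comb(n_a, total - x) / comb(n_a + n_b, total)


# With omega = 1 the two distributions coincide (Vandermonde's identity).
for x in range(0, 4):
    assert abs(fisher_nchg_pmf(x, 10, 8, 3, 1.0) - hypergeom_pmf(x, 10, 8, 3)) < 1e-12
```

For `omega` different from one the normalizing sum has no simple closed form, which is one reason the central case is the convenient null distribution.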

A.4 Details of sample size comparison simulations
In this section we lay out the procedure that we used to estimate the expected and maximum number of events required to achieve a predefined power, as shown in Figure 4 and Figure 1 in Section 7. First, we describe how we sampled the survival processes under a specific hazard ratio. We then describe how we estimated the maximum and expected sample size required to achieve a predefined power (80% in our case) for any of the test martingales that we considered (that of the exact AV logrank, its Gaussian approximation, and the prequential plug-in variant). Finally, we explain how the same quantities for the classical logrank test and the O'Brien-Fleming procedure were obtained.
In order to simulate the order in which the events in a survival process happen, we used the sequential-multinomial risk-set process from Section 3. As before, we consider the general testing problem with $\theta_0 = 1$ and a minimal clinically relevant effect size $\theta_1 < 1$, and we denote the true data-generating parameter by $\theta$; typically, $\theta \le \theta_1$. Under $\theta$, the odds of the next event at the $i$th event time happening in Group B are $\theta \bar{y}^B_{(i)} : \bar{y}^A_{(i)}$; the odds change at each time step. Thus, simulating in which group the next event happens only takes a biased coin flip. For the problem of testing (3.6) with $\theta_0 = 1$, we fix the tolerable type-I error at $\alpha = 0.05$ and the type-II error at $\beta = 0.2$. For each test martingale $S^q_{\theta_0}$ of interest we first consider the stopping rule $\tau^q = \inf\{k : S^q_{\theta_0,(k)} \ge 1/\alpha\}$, that is, we stop as soon as $S^q_{\theta_0,(k)}$ crosses the threshold $1/\alpha$. Recall that in the worst case, $\theta = \theta_1$, the expected stopping time $\tau^q$ is lowest when we use $S^{\theta_1}_{\theta_0,(k)}$; see Appendix A.1.
To estimate the maximum number of events needed to achieve a predefined power with a given test martingale, we turned our attention to a modified stopping rule $\tilde{\tau}^q$. Under $\tilde{\tau}^q$ we stop at the first of two moments: either when our test martingale $S^q_{\theta_0,(k)}$ crosses the threshold $1/\alpha$ (i.e., at $\tau^q$) or once we have witnessed a predefined maximum number of events $n_{\max}$. More compactly, this means using the stopping rule $\tilde{\tau}^q = \min(\tau^q, n_{\max})$. In those cases in which the test based on the stopping rule $\tau^q$ achieves a power higher than $1 - \beta$, a maximum number of events $n_{\max}$ smaller than the initial size of the combined risk groups can be selected to achieve approximate power $1 - \beta$ using the rule $\tilde{\tau}^q$.
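The sampling scheme and the stopping rule $\tau^q$ described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the figures: there is no censoring, $\theta_0 = 1$, and the multiplicative increment is taken to be the ratio between the probability of the observed event under $\theta_1$ and under $\theta_0$ for the current risk sets. The function name and its arguments are hypothetical.

```python
import random


def simulate_stopping_time(m_a, m_b, theta, theta1, alpha=0.05, rng=None):
    """Simulate the order of events under true hazard ratio `theta` and return
    the index of the first event at which the statistic reaches 1/alpha, or
    None if every participant has had an event before that happens."""
    rng = rng or random.Random()
    y_a, y_b = m_a, m_b   # current risk set sizes
    s = 1.0               # running value of the test statistic
    k = 0
    while y_a + y_b > 0:
        k += 1
        # Biased coin flip: the odds that the next event is in B are theta*y_b : y_a.
        in_b = rng.random() < theta * y_b / (y_a + theta * y_b)
        # Assumed increment: probability of the observed event under theta1
        # divided by its probability under theta0 = 1.
        weight = theta1 if in_b else 1.0
        s *= weight * (y_a + y_b) / (y_a + theta1 * y_b)
        if in_b:
            y_b -= 1
        else:
            y_a -= 1
        if s >= 1.0 / alpha:
            return k
    return None


tau = simulate_stopping_time(1000, 1000, theta=0.7, theta1=0.7, rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps individual runs reproducible, which is convenient when the same realization is reused to compare several test martingales.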
A quick computation shows that $n_{\max}$ has the following property: it is the smallest number of events $n$ such that the probability of stopping within the first $n$ events is at least $1 - \beta$ under the alternative hypothesis, that is,

$$n_{\max} = \min\{n : P_\theta(\tau^q \le n) \ge 1 - \beta\}.$$

More succinctly, $n_{\max}$ is the (approximate) $(1 - \beta)$-quantile of the stopping time $\tau^q$, which can be estimated experimentally in a straightforward manner.
To estimate $n_{\max}$ for initial risk set sizes $m_1, m_0$, we sampled $10^4$ realizations of the survival process (under $\theta$) using the method described at the beginning of this section. This allowed us to obtain the same number of realizations of the stopping time $\tau^q$. We then computed the $(1 - \beta)$-quantile of the simulated first-passage-time distribution of $\tau^q$, and reported it as an estimate of the number of events $n_{\max}$ in the 'maximum' column in Figure 4.
We assessed the uncertainty in the estimation of $n_{\max}$ using the bootstrap. We performed 1000 bootstrap rounds on the sampled empirical distribution of $\tau^q$, and found that the number of realizations that we sampled ($10^4$) was high enough that plotting the uncertainty estimates was not meaningful relative to the scale of our plots. For this reason we omitted the error bars in Figure 4 and Figure 1.
In the "mean" column of Figure 4 and Figure 1 we plot an estimate of the expected value of the capped stopping time $\tilde{\tau}^q = \min(\tau^q, n_{\max})$. For this, we used the empirical mean over the sample that we obtained by simulation, in which the stopping times smaller than $n_{\max}$ are kept as they are and the remaining 20% of the stopping times are set to $n_{\max}$ itself. In the "conditional mean" column, we plot an estimate of the expectation of $\tau^q$ given $\tau^q < n_{\max}$, i.e., of the stopping time given that we stop early (and hence reject the null).
For comparison, we also show the number of events that one would need under the Gaussian non-sequential approximation of Schoenfeld [1981], and under the continuous-monitoring version of the O'Brien-Fleming procedure. In order to judge Schoenfeld's approximation, we report the number of events required to achieve 80% power. This is equivalent to treating the logrank statistic as if it were normally distributed, and rejecting the null hypothesis using a z-test for a fixed number of events. The power analysis of this procedure is classic, and the number of events required is $n^S_{\max} = 4(z_\alpha + z_\beta)^2 / \log^2 \theta_1$, where $z_\alpha$ and $z_\beta$ are the $\alpha$- and $\beta$-quantiles of the standard normal distribution. In the case of the continuous-monitoring version of O'Brien-Fleming's procedure, we estimated the number of events $n^{OF}_{\max}$ needed to achieve 80% power as follows. For each experimental setting $(m_A, m_B, \theta)$, we generated $10^4$ realizations of the survival process under $\theta$ and computed the corresponding trajectories of the logrank statistic. For each possible value $n$ of $n^{OF}_{\max}$, we computed the fraction of trajectories for which the O'Brien-Fleming procedure correctly stopped when used with the maximum number of events set to $n$. We report as an estimate of the true $n^{OF}_{\max}$ the first value of $n$ for which this fraction is higher than 80%, our predefined power.
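The three reported summaries (maximum, mean, and conditional mean) can be computed from a sample of simulated stopping times along the following lines. The sketch assumes that runs which never reject are encoded as `math.inf`; that convention and the function name are illustrative choices rather than a description of the original simulation code.

```python
import math


def summarize_stopping_times(taus, beta=0.2):
    """Return (n_max, mean of min(tau, n_max), mean of tau given tau < n_max).

    taus: simulated stopping times; math.inf marks runs that never rejected.
    """
    ordered = sorted(taus)
    # Empirical (1 - beta)-quantile: smallest n with P(tau <= n) >= 1 - beta.
    # The small tolerance guards against floating-point error in the product.
    idx = math.ceil((1 - beta) * len(ordered) - 1e-9) - 1
    n_max = ordered[idx]
    capped = [min(t, n_max) for t in taus]
    early = [t for t in taus if t < n_max]
    mean_capped = sum(capped) / len(capped)
    cond_mean = sum(early) / len(early) if early else float("nan")
    return n_max, mean_capped, cond_mean


# Toy sample: ten runs stopping after 1, 2, ..., 10 events.
print(summarize_stopping_times(list(range(1, 11))))  # (8, 5.2, 4.0)
```

In the toy sample, 80% of the runs stop within 8 events, so `n_max` is 8; the two slowest runs are capped at 8 when computing the mean, and the conditional mean averages only the seven runs that stop strictly earlier.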
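As a worked instance of the fixed-sample formula above, the following sketch evaluates $4(z_\alpha + z_\beta)^2 / \log^2 \theta_1$ for a one-sided test. The function name is hypothetical, and the normal quantiles are taken from Python's standard library.

```python
from math import ceil, log
from statistics import NormalDist


def schoenfeld_events(theta1, alpha=0.05, beta=0.2):
    """Number of events for a one-sided fixed-sample logrank z-test with
    balanced allocation, following 4 (z_alpha + z_beta)^2 / log(theta1)^2."""
    z_alpha = NormalDist().inv_cdf(alpha)  # alpha-quantile of N(0, 1)
    z_beta = NormalDist().inv_cdf(beta)    # beta-quantile of N(0, 1)
    return 4 * (z_alpha + z_beta) ** 2 / log(theta1) ** 2


# Hazard ratio 0.7, alpha = 0.05, power 80%: roughly 194.4, rounded up to 195.
print(ceil(schoenfeld_events(0.7)))  # 195
```

Because both quantiles are negative for $\alpha, \beta < 0.5$, only their sum matters after squaring, so the same number results if upper quantiles are used instead.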

APPENDIX B. COVARIATES: THE FULL COX PROPORTIONAL HAZARDS E-VARIABLE
We extend the AV logrank test to the situation in which time-dependent covariates are present, proceeding as in Section 3 and using the same notation. Assume now the presence of $d$ covariates and let, for each participant $i$, $z^i_t = (z^i_{t,1}, \dots, z^i_{t,d})$ be the covariate vector consisting of left-continuous time-dependent covariates $z^i_{t,1}, \dots, z^i_{t,d}$. Denote by $z^i_{(k)}$ the value of the covariates of participant $i$ at the time $T_{(k)}$ when the $k$th event is witnessed. We let the random variable $I_{(k)}$ denote the index of the patient to which the $k$th event happens, and consider the extended process $I_{(1)}, I_{(2)}, \dots$, where the information that is available at time $T_{(k)}$ is $I_{(1)}, I_{(2)}, \dots, I_{(k)}$ and $z_{(1)}, \dots, z_{(k)}$. The conditional partial likelihood underlying the process is now denoted $P_{\beta,\theta}$ with $\theta > 0$, $\beta \in \mathbb{R}^d$, and $\beta_\theta = \ln \theta \in \mathbb{R}$, defined as follows: [...] This is consistent with Cox' (1972) proportional hazards regression model: the probability that the $i$th participant witnesses an event, assuming he/she is still at risk, is proportional to the exponentiated weighted covariates, with group membership being one of the covariates. In case $\beta = 0$, this is easily seen to coincide with the definition of $P_\theta$ via (2.5) with $\theta = e^{\beta_\theta}$.

B.1 E-Variables and Martingales
Let $W$ be a prior distribution on $\beta \in \mathbb{R}^d$ for some $d > 0$. ($W$ may be degenerate, i.e., put mass one on a specific parameter vector $\beta_1$.) For each such $W$, we let $q_{W,\theta,(k)}$ be the probability distribution on $\mathcal{R}_{(k)}$ defined by [...] Consider a measure $\rho$ on $\mathbb{R}^d$ (e.g., Lebesgue or some counting measure), let $\mathcal{W}$ be the set of all distributions on $\mathbb{R}^d$ which have a density relative to $\rho$, and let $\mathcal{W}^\circ \subset \mathcal{W}$ be any convex subset of $\mathcal{W}$ (we may take $\mathcal{W}^\circ = \mathcal{W}$, for example). We define $q_{\leftarrow W,\theta_0,(k)}$ to be the reverse information projection [Li, 1999] (RIPr) of $q_{W,\theta,(k)}$ on $\{q_{W,\theta_0,(k)} : W \in \mathcal{W}^\circ\}$, defined as the probability distribution on $\mathcal{R}_{(k)}$ such that [...] We know from Li [1999] and Grünwald et al. [2020] that $q_{\leftarrow W,\theta_0,(k)}$ exists for each $k$. Grünwald et al. [2020] show, in the context of E-variables for $2 \times 2$ contingency tables, that the infimum in the previous display is in fact achieved by some distribution $W$ with finite support on $\mathbb{R}^d$ if the random variables $y^1_{(k)}, \dots, y^m_{(k)}$ constituting our random process have a finite range. For given hazard ratios $\theta$ [...] be our analogue of (3.3).
Note that the result does not require the prior $W$ to be well specified in any way: under any $(\beta, \theta_0)$ in the null distribution, even if $\beta$ is completely unrelated to $W$, $R^{\theta_1}_{W,\theta_0,(k)}$ is an E-variable conditional on past data.
In parallel to the discussion in Section 3.1, we can therefore, for each prior $W_1$, construct a test martingale $S_k := \prod_{l \le k} R^{\theta_1}_{W_l,\theta_0,(l)}$ that "learns" $\beta$ from the data, analogously to (3.7), and computes a new RIPr at each event time $k$.

B.2 Finding the RIPr
While it is not clear how to calculate the RIPr $q_{\leftarrow W,\theta_0,(k)}$ in general, it can be well approximated with the efficient algorithm designed by Li [1999] and Li and Barron [1999]. Their algorithm is computationally feasible as long as we restrict $\mathcal{W}^\circ_\delta$ to be the set of all priors $W$ for which $\min_{i \in \mathcal{R}_{(k)}} q_{W,\theta_0,(k)}(i) \ge \delta$, for some $\delta > 0$. In that case, when run for $M$ steps, the algorithm achieves an approximation error of $O(\ln(1/\delta)/M)$, where each step is linear in the dimension $d$. Since the approximation error is logarithmic in $1/\delta$, we can take a very small value of $\delta$, which makes the requirement less restrictive. Exploring whether the Li-Barron algorithm really allows us to compute the RIPr for the Cox model, and hence $R^{\theta_1}_{W_k,\theta_0,(k)}$ in practice, is a major goal for future work.

B.3 Ties
Whereas without covariates our E-variables that allow for ties correspond to a likelihood ratio of Fisher's noncentral hypergeometric distributions (see Section 3.3), the situation is not so simple in the presence of covariates. Although deriving the appropriate extension of the noncentral hypergeometric partial likelihood is possible, one ends up with a hard-to-calculate formula [Peto, 1972]. Various approximations have been proposed in the literature [Cox, 1972, Efron, 1977]. In case these preserve the E-variable and martingale properties, they would retain type-I error probabilities under optional stopping and we could use them without problems. We do not know whether this is the case, however; for the time being, we recommend handling ties by putting the events in a worst-case order, leading to the smallest values of the E-variable of interest, as this is bound to preserve the type-I error guarantees.

APPENDIX C. GAUSSIAN AV LOGRANK TEST
In this section we derive the Gaussian AV logrank test of Section 4, and investigate the validity of the Gaussian approximation. In Appendix C.1, we show that this approximation is only valid when the allocation of participants to each group under investigation is balanced, that is, when $m_A = m_B$. In Appendix C.2 we investigate numerically the sample size needed to reject the null hypothesis under both the exact AV logrank test and its Gaussian approximation.
We start with the derivation of (4.2). For this we use (local) asymptotic normality of the Z-score (4.1). Under the null distribution, $Z_k$ from (4.1) has an asymptotic standard Gaussian distribution. Under any alternative distribution under which the hazard ratio is $\theta$, Schoenfeld [1981] showed that, in the absence of ties, the Z-statistic also follows a Gaussian distribution with unit variance, but this time with mean $\mu_1$ given by $\log(\theta)$.
Note that $\mu_1$ depends on more than the summary statistic $Z_k$. In the case that the number of observed events is much smaller than the initial risk set sizes, the mean $\mu_1$ under the alternative can be further approximated by [...] where $N_k$ is the total number of observations up until time $t_k$, and the resulting approximation only depends on summary statistics. It is exactly this value $\mu_1$ that we use in the Gaussian AV logrank test. The asymptotic result of Schoenfeld relies on two conditions: (1) that the hazard ratio $\theta_1$ under the alternative is close enough to one so that a first-order Taylor approximation around $\theta_0 = 1$ is adequate; (2) that the expected number of events $E^B_k$ stays approximately constant over time, that is, close to the initial allocation proportion $E^B_1 = m_B/(m_B + m_A)$. This indicates that the asymptotic approximation is reasonable when $\theta_1$ is close to 1 and the initial risk sets are both large in comparison to the number of events witnessed. Notice that in this regime of large risk sets the multiplicity correction in $V_k$ is also negligible.
This raises the question whether a sequential Gaussian approximation is sensible for the logrank statistic; a priori it is not at all clear whether Schoenfeld's asymptotic fixed-sample result has a nonasymptotic counterpart. Define the logrank statistic per observation time [...] We investigate whether the exact AV logrank statistic behaves similarly to the Gaussian likelihood ratio [...] Here $\theta_0 = 1$, $\mu_0 = 0$, and $\mu_1 = \mu_1(\theta_1)$ as in (C.1). Note that both axes are logarithmic.
For $\theta_0 = 1$ we have $\mu_0 = 0$ and $\mu_1 = \log(\theta)\, m_B m_A/(m_A + m_B)^2$, and $\varphi_\mu$ is the Gaussian density with unit variance and mean $\mu$. Note that the statistic still depends on elements of the full data set; more approximations are needed. Write out the Gaussian densities, and use that in the limit of large risk sets $p^B_i \approx m_B/(m_A + m_B)$ and that consequently [...] These approximations are valid under Schoenfeld's second assumption. With these approximations at hand, the Z-statistic is approximated by [...] where $S^G_k$ is as in (4.2). In Figure 5 we show, in the case of balanced allocation, that the values of the Gaussian approximation $S^G_k$ at a single event time are very similar to those of the exact $S^{\theta_1}_{\theta_0,(k)}$ for alternative hazard ratios $\theta_1$ between 0.5 and 2.

C.1 Safety only for balanced allocation
In order to assess whether the Gaussian AV logrank test is indeed AV, that is, whether the type-I error guarantees hold, we inspect whether the expected value of each of its multiplicative increments is below 1. In relation to our discussion in Section 3.1, this would imply that all multiplicative increments are conditional E-variables and that the resulting test is, at least approximately, a test martingale. Figure 6 shows the expectation of these increments as a function of the hazard ratio for several initial allocation ratios. In the case of balanced 1:1 allocation, $S^G_k$ is an E-variable, since its expectation is 1 or smaller. However, in the case of unbalanced 2:1 or 3:1 allocation and designs with hazard ratio $\theta_1 < 1$, $S^G_k$ is not an E-variable. Of course, even if the initial allocation is balanced, it can become unbalanced. Figure 6 shows that for designs outside the range $0.5 \le \theta_1 \le 2$ the deviations from expectation 1 can be problematic. Hence we do not recommend using the Gaussian approximation on the logrank statistic for unbalanced designs or for designs with $\theta_1 < 0.5$ or $\theta_1 > 2$. For balanced designs with $0.5 \le \theta_1 \le 2$, we found that in practice they are safe to use, the reason being that scenarios in which the allocation becomes highly unbalanced after some time (e.g., $y^B_i = 80$, $y^A_i = 20$) are extremely unlikely to occur under the null.

C.2 Sample size
In this section we compare the stopping time distribution $\tau^G := \inf\{k : \xi^G_k = 1\}$ of the Gaussian approximation to that of $\tau = \inf\{k : \xi_k = 1\}$. We use tests with tolerable type-I error $\alpha = 0.05$; thus, the threshold is $1/\alpha = 20$ for both tests. In the previous section we showed that the Gaussian approximation to the AV logrank statistic is valid when the initial allocation is 1:1 and for values $0.5 \le \theta_1 \le 2$, where $\theta_1$ is the hazard ratio under the alternative. In these scenarios, we simulated a survival process from a distribution according to which the true data-generating hazard ratio is $\theta = \theta_1$, and sampled realizations of $\tau^G$ and $\tau$ for the same data set. The results of the simulation are shown in Figure 7, where we plot the realizations of $\tau^G$ against those of $\tau$.

Figure 2: Left-sided rejection regions for continuous monitoring using O'Brien-Fleming α-spending or the Gaussian AV logrank test. Allocation is balanced ($m_A = m_B$) and $\alpha = 0.05$. Also shown are the O'Brien-Fleming and Pocock α-spending boundaries for 10 interim analyses. The α-spending boundaries are designed to have 80% power when detecting a hazard ratio of 0.7. For more details, including the values of $n_{\max}$, see Section 4.1.

Figure 3: Null hypothesis rejections on simulated data. The rejection regions are the same as shown in Figure 2 (designed to detect a hazard ratio of 0.7 with 80% power). Data are simulated under balanced allocation ($m_1 = m_0 = 5000$) and as time-to-event data with possible ties. The logrank Z-statistic does not have a value for all $n$; it sometimes jumps with several additional events at a time.

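Lemma A.3. Conditionally on $\mathcal{G}_{k-1}$, the following hold:
1. The number of events $\Delta N^A_k$ has a binomial distribution with parameters $\bar{y}^A_k$ and $p^A_k$, where $p^A_k = 1 - \exp\left(-\int_{t_{k-1}}^{t_k} \lambda^A_s\,ds\right)$.
2. The number of events $\Delta N^B_k$ has a binomial distribution with parameters $\bar{y}^B_k$ and $p^B_k$, where $p^B_k = 1 - \exp\left(-\theta \int_{t_{k-1}}^{t_k} \lambda^A_s\,ds\right)$ and $\theta$ is the hazard ratio.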
Corollary A.4. Let $\tilde{\mathcal{G}}_{k-1} = \mathcal{G}_{k-1} \vee \sigma(\Delta N_k)$, and let $p^A_k$ and $p^B_k$ be as in Lemma A.3. Define the odds ratio $\omega$

Figure 7: Stopping times for the Gaussian and exact AV logrank tests under continuous monitoring (no ties) with threshold 1/α = 20.The stopping times under the Gaussian approximation often coincide with the exact ones, and are often more conservative (see Appendix C.2).Note that both axes are logarithmic.
where $O^A_k$ and $O^B_k$ are the numbers of events witnessed in each group in the time interval $(t_{k-1}, t_k]$, and $O_k$