On Bayesian Sequential Clinical Trial Designs

Clinical trials usually involve sequential patient entry. When designing a clinical trial, it is often desirable to include a provision for interim analyses of accumulating data with the potential for stopping the trial early. We review Bayesian sequential clinical trial designs based on posterior probabilities, posterior predictive probabilities, and decision-theoretic frameworks. A pertinent question is whether Bayesian sequential designs need to be adjusted for the planning of interim analyses. We answer this question from three perspectives: a frequentist-oriented perspective, a calibrated Bayesian perspective, and a subjective Bayesian perspective. We also provide new insights into the likelihood principle, which is commonly tied to statistical inference and decision making in sequential clinical trials. Some theoretical results are derived, and numerical studies are conducted to illustrate and assess these designs.

1. Introduction

1.1. Background. In most clinical trials, patient enrollment is staggered, and patients' data are collected sequentially. When designing a clinical trial, it is often desirable to include a provision for interim analyses of accumulating data with the potential for modifying the conduct of the study (Pocock, 1977; Armitage, 1991).
For example, in a randomized-controlled trial, if an interim analysis demonstrates that the investigational drug is superior to the standard of care, the trial could be stopped early on grounds of ethics and trial efficiency (Geller and Pocock, 1987). The BNT162b2 COVID-19 vaccine trial is a recent case in which four interim analyses were planned with the possibility of declaring vaccine efficacy before the planned end of the trial (Polack et al., 2020).
It is well known that frequentist sequential designs need to be adjusted for the planning of interim analyses to maintain desirable frequentist properties (Jennison and Turnbull, 1990). For Bayesian sequential designs, however, there has been some controversy regarding whether similar adjustments are required.
In this article, we review different perspectives on Bayesian sequential designs and answer the question of whether Bayesian sequential designs need to be adjusted for interim analyses. Our review is not meant to be comprehensive with regard to methodological details including the type of trial (e.g., single-arm or randomized-controlled), type of outcome (e.g., binary, continuous, or time-to-event), or distributional assumption. Instead, we focus on the fundamentals of Bayesian sequential designs. A single-arm trial example (to be introduced in Section 1.2) will be used throughout to demonstrate these designs, but we present an extension for randomized-controlled trials in Section 2.7. We consider early stopping rules for efficacy, as futility stopping does not increase the type I error rate of a design (it actually reduces the type I error rate). Discussion on futility stopping is deferred to Section 6.
There is a rich literature on sequential designs (e.g., Jennison and Turnbull, 1990; Whitehead, 1997; Jennison and Turnbull, 2000), but the majority is centered around frequentist approaches. There are also comprehensive reviews on Bayesian trial designs in general (e.g., Spiegelhalter et al., 1994; Berry, 2006; Berry et al., 2010), but most do not extensively address sequential trials. Lastly, there are many insightful discussions on Bayesian sequential designs, such as Cornfield (1966b); Berry (1985, 1987); Freedman and Spiegelhalter (1989); Jennison and Turnbull (1990); Freedman et al. (1994); Emerson et al. (2007); Harrell (2020a); Ryan et al. (2020); Stallard et al. (2020). However, a systematic review on the fundamentals of Bayesian sequential designs has been lacking, and we attempt to fill this important gap. Furthermore, as mentioned earlier, in existing works, different authors seem to have vastly different opinions on how Bayesian sequential designs should be formulated. It turns out that different authors mean quite different things by "Bayesian sequential designs need/do not need to be adjusted for interim analyses". We aim to disentangle the practical and philosophical implications behind these different perspectives.
Our contributions include the following. (i) In Bayesian sequential designs, a pertinent question is whether adjustments for the planning of interim analyses are necessary. We attempt to answer this question from multiple perspectives. From a frequentist-oriented perspective, such adjustments are necessary for achieving desirable frequentist properties such as controlling the type I error rates; from a calibrated Bayesian perspective, such adjustments may be needed to achieve desirable operating characteristics under plausible scenarios (we will discuss the differences between achieving desirable operating characteristics versus achieving desirable frequentist properties); lastly, from a subjective Bayesian perspective, such adjustments are unnecessary, and the design only needs to reflect subjective beliefs. We comment on the three perspectives and make our recommendation. (ii) We put forward a proposal for a calibrated Bayesian approach to sequential designs. Specifically, we propose false discovery rate (FDR) and false positive rate (FPR) as potential metrics to evaluate sequential designs. We derive theoretical results regarding the FDR and FPR of a Bayesian sequential design and present simulation studies to demonstrate the practical usage of the calibrated Bayesian approach. (iii) We summarize Bayesian sequential designs based on posterior probabilities, posterior predictive probabilities, and decision-theoretic frameworks. We discuss the connections between designs using posterior credible intervals and those using formal Bayesian hypothesis testing. (iv) It is often believed that according to the likelihood principle (LP), decision making in a sequential trial should not depend on unrealized events. However, our investigation shows that the LP gives little guidance in assessing the overall performance of a decision procedure. In particular, the LP does not preclude one from utilizing additional information (including unrealized events) for decision making. 
Therefore, our view is that the LP should not be used as an argument for or against Bayesian or frequentist sequential designs. To illustrate our findings, we present an example of a Bayesian decision-theoretic design in which different decisions will be made based on the same observed data but different interim analysis plans.
1.2. An Illustrative Example. To illustrate the discussion, consider a single-arm trial that aims to establish the therapeutic effect of an investigational drug. Suppose that a total of $K$ analyses, including $(K - 1)$ interim analyses and a final analysis, are planned during the course of the trial. At the $j$th analysis, data of $n_j$ patients are accumulated, denoted by $y_1, y_2, \ldots, y_{n_j}$ and assumed independently and normally distributed with mean $\theta$ and variance $\sigma^2$. Here, $\theta$ is parameterized such that a positive value of $\theta$ is indicative of a therapeutic effect, and $\sigma^2$ is assumed known for simplicity.
The planned maximum sample size is denoted by $n_K$ and can be determined based on a power requirement or the amount of available resources. As a simple example, assume patients are enrolled in groups of equal size $g$, thus $n_j = jg$. If $g = 1$, this is the fully sequential case, known as continuous monitoring; if $g > 1$, it is the group sequential case, which is more feasible in practice. The primary research question of the trial can be formulated as the following hypothesis test,

$$H_0: \theta \le 0 \quad \text{vs.} \quad H_1: \theta > 0. \tag{1}$$

At each analysis, the hypothesis test is performed. If a certain stopping rule is triggered, say the z-statistic $z_j > c_j$ for some stopping boundary $c_j$, $H_0$ is rejected, and the trial is terminated for efficacy. Here,

$$z_j = \frac{\sum_{i=1}^{n_j} y_i}{\sigma \sqrt{n_j}}.$$

This is referred to as data-dependent or optional stopping. When $\sigma$ is unknown, one would replace the z-statistics with the corresponding t-statistics; little would change in the overall setup. A question central to sequential designs is the specification of the stopping boundaries $c_1, \ldots, c_K$.
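The monitoring procedure described above can be sketched in a few lines. This is a minimal illustration, assuming the one-sided test of $\theta \le 0$ against $\theta > 0$ with known $\sigma$ and the usual z-statistic (cumulative sum divided by $\sigma\sqrt{n_j}$); the function name `monitor` and the constant boundary value 2.5 are hypothetical choices, not taken from the paper.

```python
import numpy as np

def monitor(y, group_size, boundaries, sigma=1.0):
    """Group sequential monitoring of a single-arm trial with known sigma.

    Returns (analysis index at which the trial ends, z-statistic there,
    whether H0 was rejected). Assumes len(y) >= len(boundaries) * group_size.
    """
    y = np.asarray(y, dtype=float)
    z_j = float("nan")
    for j, c_j in enumerate(boundaries, start=1):
        n_j = j * group_size                               # patients accrued so far
        z_j = float(y[:n_j].sum() / (sigma * np.sqrt(n_j)))
        if z_j > c_j:                                      # efficacy boundary crossed
            return j, z_j, True
    return len(boundaries), z_j, False                     # reached the final analysis

# Illustrative data: true effect 0.2, four analyses of 50 patients each.
rng = np.random.default_rng(0)
y = rng.normal(loc=0.2, scale=1.0, size=200)
print(monitor(y, group_size=50, boundaries=[2.5, 2.5, 2.5, 2.5]))
```

How the boundaries should be chosen is exactly the design question; the constant value used here is only a placeholder.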
1.3. Overview of Frequentist and Bayesian Sequential Designs. Frequentist sequential designs are concerned with controlling the overall type I error rate of the sequential testing procedure. The type I error rate refers to the probability of falsely rejecting $H_0$ at any analysis (in hypothetical repetitions of the trial), given that $H_0$ is true. In the single-arm trial example, the maximum type I error rate is attained when $\theta = 0$ and is given by

$$\alpha = \Pr\Big( \bigcup_{j=1}^{K} \{ z_j > c_j \} \,\Big|\, \theta = 0 \Big). \tag{2}$$

If each test is performed at a constant nominal level, $\alpha$ will inflate as $K$ grows and will eventually converge to 1 as $K \to \infty$ (Armitage et al., 1969). Therefore, adjustments to the stopping boundaries are necessary to ensure that the type I error rate is maintained at a desirable level. Examples of such adjustments include the Pocock or O'Brien-Fleming procedure (Pocock, 1977; O'Brien and Fleming, 1979), the error spending approach (Slud and Wei, 1982; Lan and DeMets, 1983), and the stochastic curtailment approach (Lan et al., 1982). We provide a brief review of some frequentist sequential designs in Appendix A.
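The inflation under repeated testing at a constant nominal level can be checked by Monte Carlo. The sketch below assumes equal group sizes, $\theta = 0$, and a one-sided nominal level of 0.05 (boundary 1.645) at every analysis; the function name and simulation size are illustrative assumptions.

```python
import numpy as np

def type1_error_constant_level(K, c=1.645, n_sims=200_000, seed=1):
    """Monte Carlo estimate of Pr(z_j > c for some j <= K | theta = 0).

    With equal group sizes, the standardized group sums are iid N(0, 1)
    under theta = 0, and z_j is their cumulative sum divided by sqrt(j).
    """
    rng = np.random.default_rng(seed)
    increments = rng.standard_normal((n_sims, K))
    z = np.cumsum(increments, axis=1) / np.sqrt(np.arange(1, K + 1))
    return float(np.mean((z > c).any(axis=1)))

for K in (1, 2, 5, 10):
    print(K, round(type1_error_constant_level(K), 3))
```

The estimate for a single analysis should sit near the nominal 0.05, and the estimates grow with the number of looks.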
Without accounting for the sequential nature of the hypothesis test, Bayesian designs can suffer the same problem of type I error inflation, which can be unsettling for statisticians who care about controlling the type I error rates. Therefore, in many Bayesian sequential trial designs, the stopping boundaries are also determined to control the type I error rate at a desirable level (Zhu and Yu, 2017;Shi and Yin, 2019). As an example, the recent BNT162b2 COVID-19 vaccine trial was designed using a Bayesian approach with four planned interim analyses (Polack et al., 2020).
The stopping boundaries were chosen such that the overall type I error rate was controlled at 2.5%. Indeed, regulatory agencies generally recommend demonstration of adequate control of the type I error rate for any trial design to be acceptable (Food and Drug Administration, 2010, 2019). On the other hand, the type I error rate is a frequentist concept, the calculation of which involves an average over unrealized events such as hypothetical repetitions of the trial. Bayesian inference can be performed based solely on the observed data from the actual (and lone) trial and does not have to be concerned with type I error rate control, since the same trial is not assumed to repeat, hypothetically or in practice. Some think that the type I error rate is not the quantity that one should pay most attention to (Harrell, 2020b). Also, according to the likelihood principle (LP), unrealized events should be irrelevant to the statistical evidence about a parameter (Berger and Wolpert, 1988). Therefore, some Bayesian statisticians have written that the choice of the stopping rules does not need to depend on the planning of interim analyses (Berry, 1985, 1987). For example, one may stop the trial at any analysis provided that Pr(θ > 0 | data) exceeds some threshold, or if stopping minimizes the posterior expected loss. We will elaborate on these issues in the upcoming sections.
The remainder of the paper is structured as follows. In Section 2, motivated by a sequential design based on posterior probabilities, we summarize the philosophy of Bayesian sequential designs into three categories. In Section 3, we review selected Bayesian sequential designs based on posterior predictive probabilities and decision-theoretic frameworks. In Section 4, we comment on the LP, which is commonly tied to statistical inference and decision making in sequential clinical trials. In Section 5, we present some numerical studies. Finally, in Section 6, we conclude and discuss some other considerations including futility stopping rules and two-sided tests. A brief review of frequentist designs and the proof of the theoretical results are provided in the Appendix.

2. Three Perspectives on Bayesian Sequential Designs
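Consider the single-arm trial in Section 1.2. In Bayesian sequential designs, the early stopping rules are typically based on the posterior probability (PP) of $\theta$ being greater than some threshold (e.g., Thall and Simon, 1994; Heitjan, 1997). Assume the time and frequency of interim analyses are given in advance. Let $\pi(\theta)$ denote the prior distribution of $\theta$. At analysis $j$, the posterior distribution of $\theta$ is given by

$$p(\theta \mid \boldsymbol{y}_j) \propto \pi(\theta)\, f(\boldsymbol{y}_j \mid \theta),$$

where $\boldsymbol{y}_j = (y_1, \ldots, y_{n_j})$ is the vector of accumulating data up to analysis $j$, and $f(\boldsymbol{y}_j \mid \theta)$ denotes the sampling distribution of $\boldsymbol{y}_j$. When the prior for $\theta$ is a conjugate normal distribution, $\theta \sim N(\mu, \nu^2)$, the above posterior is available in closed form,

$$\theta \mid \boldsymbol{y}_j \sim N\left( \frac{\mu/\nu^2 + n_j \bar{y}_j/\sigma^2}{1/\nu^2 + n_j/\sigma^2},\; \frac{1}{1/\nu^2 + n_j/\sigma^2} \right),$$

where $\bar{y}_j$ is the sample mean of the first $n_j$ observations. If

$$\Pr(\theta > 0 \mid \boldsymbol{y}_j) > \gamma_j \tag{3}$$

for some threshold $\gamma_j$, $H_0$ is rejected, the trial is stopped, and efficacy of the drug is declared. This is equivalent to

$$z_j > c_j, \qquad c_j = q_{1-\gamma_j} \sqrt{1 + \frac{\sigma^2}{n_j \nu^2}} - \frac{\mu \sigma}{\nu^2 \sqrt{n_j}}, \tag{4}$$

where $z_j$ is the z-statistic defined in Section 1.2, and $q_{1-\gamma_j}$ is the upper $(1 - \gamma_j)$ quantile of the standard normal distribution. It remains to specify the prior $\pi(\theta)$ and threshold values $\{\gamma_1, \ldots, \gamma_K\}$. We present three perspectives next and our comments and recommendation later in Section 2.4.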
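To make the posterior probability rule concrete, the following sketch computes the conjugate normal posterior, the posterior probability of efficacy, and the equivalent boundary on the z-scale. It assumes the standard normal-normal update with known $\sigma$ and a $N(\mu, \nu^2)$ prior; the boundary expression is obtained by solving $\Pr(\theta > 0 \mid \text{data}) > \gamma$ for the z-statistic, and the function names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def posterior_params(ybar, n, mu, nu, sigma):
    """Posterior mean and sd of theta under a N(mu, nu^2) prior, known sigma."""
    prec = 1.0 / nu**2 + n / sigma**2
    mean = (mu / nu**2 + n * ybar / sigma**2) / prec
    return mean, np.sqrt(1.0 / prec)

def pp_efficacy(ybar, n, mu, nu, sigma):
    """Posterior probability Pr(theta > 0 | data)."""
    m, s = posterior_params(ybar, n, mu, nu, sigma)
    return norm.cdf(m / s)

def z_boundary(n, gamma, mu, nu, sigma):
    """Smallest z-statistic at which Pr(theta > 0 | data) reaches gamma."""
    q = norm.ppf(gamma)  # upper (1 - gamma) quantile of N(0, 1)
    return q * np.sqrt(1.0 + sigma**2 / (n * nu**2)) - mu * sigma / (nu**2 * np.sqrt(n))

# Boundaries for five equally spaced looks with a sceptical N(0, 0.054^2) prior.
n = 200 * np.arange(1, 6)
print(np.round(z_boundary(n, gamma=0.95, mu=0.0, nu=0.054, sigma=1.0), 3))
```

A quick consistency check is that a sample mean sitting exactly on the boundary yields a posterior probability equal to the threshold.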
2.1. The Frequentist-oriented Perspective. Without accounting for multiple looks at the data, the stopping rule in Equation (4) can lead to type I error rate inflation. As an example, consider a $N(0, 1^2)$ prior on $\theta$ and constant threshold values $\gamma_1 = \cdots = \gamma_K = 0.95$. Suppose the outcome variance $\sigma^2 = 1$, the maximum sample size $n_K = 1000$, and patients are enrolled in equal group sizes. Using Equation (2), the type I error rates are $\alpha = 0.05, 0.08, 0.13, 0.17, 0.30$, and $0.39$ for $K = 1, 2, 5, 10, 100$, and $1000$, respectively. Therefore, in view of regulatory guidance (Food and Drug Administration, 2010, 2019), one should adjust $\pi(\theta)$ and $\{\gamma_1, \ldots, \gamma_K\}$ according to the planning of interim analyses to achieve desirable type I error rate control (and possibly other frequentist properties). We refer to this as a frequentist-oriented approach.
With an intended type I error rate, the parameters in a Bayesian sequential design can be chosen in multiple ways. For prespecified threshold values, type I error rate control can be achieved by using a conservative prior. Freedman and Spiegelhalter (1989) and Freedman et al. (1994) demonstrated that by tuning the prior distribution of $\theta$, one could achieve stopping boundaries similar to or more conservative than Pocock's or O'Brien-Fleming's boundaries. In our case, we can simply set $\mu = 0$ and adjust $\nu^2$ according to the planning of interim analyses. From Equation (4), when $\mu = 0$, the stopping boundaries monotonically increase as $\nu^2$ decreases. For example, consider the single-arm trial with an outcome variance of $\sigma^2 = 1$, a maximum sample size of 1000, $K = 5$ analyses, and equal group sizes. Then, with threshold values $\gamma_j \equiv 0.95$, a $N(0, 0.054^2)$ prior for $\theta$ controls the type I error rate at 0.05. The corresponding stopping boundaries for the $z_j$'s are shown in Table 3.
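One way to find such a prior standard deviation is a simulation-based search. The sketch below is an assumption-laden illustration rather than the authors' procedure: it takes $\mu = 0$, $\sigma = 1$, five equally spaced analyses up to 1000 patients, expresses the posterior probability rule as a boundary on the z-scale, and bisects on $\nu$ using one fixed set of simulated null paths (so the estimated type I error is monotone in $\nu$). All function names are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def pp_boundaries(n, gamma, nu, sigma=1.0):
    """z-scale boundaries of the rule Pr(theta > 0 | data) > gamma, N(0, nu^2) prior."""
    return norm.ppf(gamma) * np.sqrt(1.0 + sigma**2 / (n * nu**2))

def simulate_null_z(K, n_sims=200_000, seed=2):
    """z-statistic paths under theta = 0 with equal group sizes."""
    rng = np.random.default_rng(seed)
    steps = rng.standard_normal((n_sims, K))
    return np.cumsum(steps, axis=1) / np.sqrt(np.arange(1, K + 1))

def type1_error(nu, z, n, gamma=0.95):
    """Share of null paths that cross the boundary at any analysis."""
    return float(np.mean((z > pp_boundaries(n, gamma, nu)).any(axis=1)))

def calibrate_nu(z, n, target=0.05, lo=1e-3, hi=1.0, iters=40):
    """Largest prior sd (up to bisection error) keeping the estimate <= target."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)              # bisect on the log scale
        if type1_error(mid, z, n) > target:
            hi = mid
        else:
            lo = mid
    return lo

n = 200 * np.arange(1, 6)
z = simulate_null_z(K=5)
nu_star = calibrate_nu(z, n)
print(round(nu_star, 4), np.round(pp_boundaries(n, 0.95, nu_star), 3))
```

Reusing the same simulated paths for every candidate value removes Monte Carlo noise from the comparison between candidates, which is what makes plain bisection safe here.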
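Alternatively, for a given prior $\pi(\theta)$, type I error rate control can be attained by adjusting the threshold values $\{\gamma_1, \ldots, \gamma_K\}$. For the single-arm trial example, one may equate the stopping boundaries in Equation (4) to the corresponding boundaries in any frequentist sequential design. For example, suppose $\{c_1, \ldots, c_K\}$ are O'Brien-Fleming boundaries, then $\gamma_j$ may be set at

$$\gamma_j = \Phi\left( \frac{c_j + \mu\sigma/(\nu^2 \sqrt{n_j})}{\sqrt{1 + \sigma^2/(n_j \nu^2)}} \right),$$

where $\Phi$ denotes the standard normal distribution function. For more complicated trials (e.g., randomized-controlled, binary outcome), tuning $\pi(\theta)$ and $\{\gamma_1, \ldots, \gamma_K\}$ to achieve desirable type I error rate control is more challenging.

2.2. The Subjective Bayesian Perspective. From a subjective Bayesian perspective (Goldstein, 2006; Robinson, 2019), the prior $\pi(\theta)$ should be specified to reflect a subjective belief on $\theta$ before the trial, and the threshold values $\{\gamma_1, \ldots, \gamma_K\}$ should be chosen to represent personal tolerance of risk. For example, a positive (or negative) prior mean for $\theta$ represents that the investigator's prior belief on the treatment effect is optimistic (or pessimistic). Similarly, the prior variance for $\theta$ reflects the investigator's uncertainty about the prior opinion. In practice, $\pi(\theta)$ could be elicited from preclinical data and historical clinical trials with a similar setting. On the other hand, the choice of the threshold values can be justified from a decision-theoretic perspective. See, e.g., Robert (2007) (Chapter 5.2). At analysis $j$, the possible decision is denoted by $\phi_j$, where $\phi_j = 1$ (or 0) indicates rejecting $H_0$ and stopping the trial (or failing to reject $H_0$ and continuing enrollment if $j < K$). Assume the loss associated with decision $\phi_j$ is

$$\ell_j(\phi_j, \theta) = \begin{cases} \xi_{0j}\, \mathbb{1}(\theta > 0), & \text{if } \phi_j = 0, \\ \xi_{1j}\, \mathbb{1}(\theta \le 0), & \text{if } \phi_j = 1, \end{cases} \tag{5}$$

where $\xi_{0j} > 0$ and $\xi_{1j} > 0$ are the losses of a false non-rejection and a false rejection, respectively. Then, the posterior expected loss of $\phi_j$ is $L_j(\phi_j, \boldsymbol{y}_j) = \int \ell_j(\phi_j, \theta)\, p(\theta \mid \boldsymbol{y}_j)\, d\theta$, and the decision that minimizes $L_j(\phi_j, \boldsymbol{y}_j)$ is

$$\phi_j^{*} = \begin{cases} 1, & \text{if } \Pr(\theta > 0 \mid \boldsymbol{y}_j) > \xi_{1j}/(\xi_{0j} + \xi_{1j}), \\ 0, & \text{otherwise.} \end{cases}$$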
By setting $\gamma_j$ at $\xi_{1j}/(\xi_{0j} + \xi_{1j})$, the stopping rule in Equation (4) minimizes the posterior expected loss. In practice, one could specify the loss function $\ell_j(\phi_j, \theta)$ based on personal tolerance of risk and then derive the $\gamma_j$'s subsequently. For example, if one wants to be conservative about rejections early in the trial, one could consider increasing the loss of false rejections at early interim analyses (Rosner and Berry, 1995). Of course, the particular loss function in Equation (5)  We see that by taking this particular subjective Bayesian approach, one does not need to take frequentist properties into account. For example, suppose that $\xi_{1j} = 19 \cdot \xi_{0j}$ for all $j$, then one can reject $H_0$ and stop the trial at any analysis as long as $\Pr(\theta > 0 \mid \boldsymbol{y}_j) > 0.95$. As Edwards et al. (1963) stated, "it is entirely appropriate to collect data until a point has been proven or disproven, or until the data collector runs out of time, money, or patience." This point has also been made by Harrell (2020a).
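The link between the losses and the threshold can be illustrated directly. The sketch below assumes a simple two-valued loss: a false rejection (rejecting when $\theta \le 0$) costs `xi1`, a missed rejection (not rejecting when $\theta > 0$) costs `xi0`, and correct decisions cost nothing. The function name and argument names are hypothetical.

```python
def bayes_decision(pp, xi0, xi1):
    """Minimize posterior expected loss given pp = Pr(theta > 0 | data).

    Returns (decision, gamma): decision is 1 (reject H0 and stop) or 0,
    and gamma = xi1 / (xi0 + xi1) is the implied posterior probability threshold.
    """
    loss_reject = xi1 * (1.0 - pp)      # expected loss of rejecting H0
    loss_continue = xi0 * pp            # expected loss of not rejecting H0
    gamma = xi1 / (xi0 + xi1)
    decision = 1 if loss_reject < loss_continue else 0
    return decision, gamma

# A false rejection regarded as 19 times as costly as a missed rejection.
print(bayes_decision(0.96, xi0=1.0, xi1=19.0))
print(bayes_decision(0.94, xi0=1.0, xi1=19.0))
```

Comparing the two expected losses is algebraically the same as comparing the posterior probability with the implied threshold, so the loss ratio alone fixes the stopping rule.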
Such a procedure is vulnerable to type I error rate inflation, which would bother many practitioners. However, it has been argued that the type I error rate is not the quantity that one should pay most attention to (Harrell, 2020b), because its calculation is conditioned on an assumption rather than something knowable. Subjective Bayesians argue that what matters is the probability of "regulator's regret", Pr(θ ≤ 0 | data), conditioned on the available data. Also, the calculation of the type I error rate involves an average over unrealized events that may arise for hypothetical values of θ. However, based on the LP, unobserved events are irrelevant to the evidence about θ (Berry, 1985, 1987). We provide more discussion in Section 4. A related critique of the subjective Bayesian approach is the issue of "sampling to a foregone conclusion" (Cornfield, 1966a). However, Berry (1985, 1987) argued that this is not a threat, because the sequence of posterior probabilities, $\{\Pr(\theta > 0 \mid y_1, \ldots, y_n) : n = 1, 2, \ldots\}$, is a martingale. If the posterior probability of $\{\theta > 0\}$ is less than 0.95 given $n$ observations, say 0.94, then after the next observation, it may increase or decrease with an expected value of 0.94. In other words, one cannot guarantee reaching Pr(θ > 0 | data) > 0.95 with more data. Specifically, when the sampling distribution of the $y_i$'s is normal, the expected number of additional observations required to raise Pr(θ > 0 | data) any prescribed amount is infinite. This is analogous to the expected hitting time of a Brownian motion, which is infinite (see, e.g., Chapter 8.2 in Ross, 1996).
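The martingale property can be checked numerically for the normal model. The sketch below assumes the current posterior of $\theta$ is $N(m, s^2)$ and that the next observation is drawn from the posterior predictive distribution $N(m, s^2 + \sigma^2)$; it then averages the updated posterior probability over many such draws. Function names and the chosen numbers are illustrative.

```python
import numpy as np
from scipy.stats import norm

def pp(m, s):
    """Pr(theta > 0) when theta ~ N(m, s^2)."""
    return norm.cdf(m / s)

def expected_next_pp(m, s, sigma=1.0, n_draws=400_000, seed=3):
    """Monte Carlo mean of the posterior probability after one more observation."""
    rng = np.random.default_rng(seed)
    # Next outcome drawn from the posterior predictive distribution.
    y_next = rng.normal(m, np.sqrt(s**2 + sigma**2), size=n_draws)
    prec = 1.0 / s**2 + 1.0 / sigma**2
    m_new = (m / s**2 + y_next / sigma**2) / prec   # conjugate update of the mean
    s_new = np.sqrt(1.0 / prec)                     # updated posterior sd
    return float(pp(m_new, s_new).mean())

s = 0.1
m = norm.ppf(0.94) * s            # current posterior probability is 0.94
print(pp(m, s), expected_next_pp(m, s))
```

The second number should agree with the first up to simulation error, which is the sense in which further sampling cannot be expected to push the posterior probability upward.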
2.3. The Calibrated Bayesian Perspective. Although Bayesian probabilities represent degrees of belief in some formal sense, for practitioners and regulatory agencies, it can be pertinent to examine the operating characteristics of Bayesian designs in repeated practices. One could calibrate the prior and threshold values in a Bayesian sequential design to achieve desirable operating characteristics under a range of plausible scenarios, and we refer to this as a calibrated Bayesian approach (Rubin, 1984;Little, 2006). We provide more background on the calibrated Bayesian perspective in Appendix B.1.
We distinguish between operating characteristics and frequentist properties: we use the former to refer to the long-run average behaviors of a statistical procedure in a series of (possibly different) trials, and use the latter to refer to those in (imaginary) repetitions of the same trial. In other words, operating characteristics represent averages over a joint data-parameter distribution, while frequentist properties represent averages over a data distribution given a fixed parameter. See, e.g., Rubin (1984); Bayarri and Berger (2004). Frequentist properties are a special class of operating characteristics.
What kinds of operating characteristics could be examined? Consider the single-arm trial example. Imagine an infinite series of such trials with true but unknown treatment effects $\{\theta^{(1)}, \theta^{(2)}, \ldots\}$, which constitute some population distribution $\pi_0(\theta)$.
For each trial, patient outcomes follow $\boldsymbol{y}_K \sim f_0(\boldsymbol{y}_K \mid \theta)$ and are observed sequentially, where $\boldsymbol{y}_K = (y_1, \ldots, y_{n_K})$. Suppose a Bayesian design with stopping rules given by Equation (3) is applied to every trial with a prior model $\pi(\theta)$, a sampling model $f(\boldsymbol{y}_K \mid \theta)$, and threshold values $\{\gamma_1, \ldots, \gamma_K\}$. Similar to the rationale of type I error rate control, we propose to control the FDR and FPR of the design in the infinite series of trials for a range of plausible $f_0(\boldsymbol{y}_K \mid \theta)\pi_0(\theta)$. This is because false rejections of the null may result in continuation of a drug development program that will ultimately fail, increasing the cost associated with the failure. The FDR is the relative frequency of false rejections among all trials in which $H_0$ is rejected, and the FPR is the relative frequency of false rejections among all trials with nonpositive treatment effects $\theta$.
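Mathematically, let

$$\Gamma = \big\{ \boldsymbol{y}_K : \Pr(\theta > 0 \mid \boldsymbol{y}_j) > \gamma_j \text{ for some } j \in \{1, \ldots, K\} \big\} \tag{6}$$

denote the rejection region of the design. That is, $H_0$ is rejected if $\boldsymbol{y}_K \in \Gamma$. Then,

$$\mathrm{FDR} = \Pr(\theta \le 0 \mid \boldsymbol{y}_K \in \Gamma), \qquad \mathrm{FPR} = \Pr(\boldsymbol{y}_K \in \Gamma \mid \theta \le 0),$$

where the probabilities are taken with respect to $f_0(\boldsymbol{y}_K \mid \theta)\pi_0(\theta)$. Our definitions of the FDR and FPR are slightly different from, but closely related to, their typical definitions in a frequentist sense (see, e.g., Storey, 2003).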
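The calibration of the design parameters is typically done through computer simulations. For a given $f_0(\boldsymbol{y}_K \mid \theta)\pi_0(\theta)$, one simulates $S$ trials, $\{(\theta^{(s)}, \boldsymbol{y}_K^{(s)}) : s = 1, \ldots, S\}$ (for some large $S$), by drawing $\theta^{(s)} \sim \pi_0(\theta)$ and $\boldsymbol{y}_K^{(s)} \sim f_0(\boldsymbol{y}_K \mid \theta^{(s)})$. Then, the FDR and FPR are respectively approximated by

$$\widehat{\mathrm{FDR}} = \frac{\sum_{s=1}^{S} \mathbb{1}\big(\theta^{(s)} \le 0,\ \boldsymbol{y}_K^{(s)} \in \Gamma\big)}{\sum_{s=1}^{S} \mathbb{1}\big(\boldsymbol{y}_K^{(s)} \in \Gamma\big)}, \qquad \widehat{\mathrm{FPR}} = \frac{\sum_{s=1}^{S} \mathbb{1}\big(\theta^{(s)} \le 0,\ \boldsymbol{y}_K^{(s)} \in \Gamma\big)}{\sum_{s=1}^{S} \mathbb{1}\big(\theta^{(s)} \le 0\big)}.$$

The prior and threshold values in the Bayesian design can be chosen such that the FDR and FPR do not exceed some prespecified levels for every plausible $f_0(\boldsymbol{y}_K \mid \theta)\pi_0(\theta)$.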
Note that the simulations here are different from those for frequentist-oriented approaches. For the latter, hypothetical repetitions of the same trial are simulated with an assumed true treatment effect.
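A simulation of this kind might look as follows. This is a minimal sketch under stated assumptions: the population of treatment effects is taken to be $N(0, \texttt{pi0\_sd}^2)$, outcomes are normal with known $\sigma$, the design prior is $N(0, \texttt{prior\_sd}^2)$, and the same threshold is used at all five equally spaced analyses. The function name, argument names, and numerical settings are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def simulate_fdr_fpr(pi0_sd, prior_sd, gamma=0.95, K=5, group_size=200,
                     sigma=1.0, n_trials=200_000, seed=4):
    """Estimate FDR and FPR of the posterior probability design over a series of trials."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(0.0, pi0_sd, size=n_trials)       # one true effect per trial
    n = group_size * np.arange(1, K + 1)
    # Sum of outcomes within each group: N(g * theta, g * sigma^2).
    group_sums = rng.normal(group_size * theta[:, None],
                            sigma * np.sqrt(group_size), size=(n_trials, K))
    ybar = np.cumsum(group_sums, axis=1) / n              # running sample means
    prec = 1.0 / prior_sd**2 + n / sigma**2               # posterior precision
    post_mean = (n * ybar / sigma**2) / prec              # prior mean is 0
    pp = norm.cdf(post_mean * np.sqrt(prec))              # Pr(theta > 0 | data)
    reject = (pp > gamma).any(axis=1)                     # stopped for efficacy at some look
    null = theta <= 0
    false_rej = np.sum(reject & null)
    fdr = false_rej / max(np.sum(reject), 1)
    fpr = false_rej / max(np.sum(null), 1)
    return float(fdr), float(fpr)

print(simulate_fdr_fpr(pi0_sd=0.1, prior_sd=0.1))   # design prior matches the population
print(simulate_fdr_fpr(pi0_sd=0.1, prior_sd=1.0))   # design prior much more diffuse
```

Unlike a type I error simulation, each simulated trial here has its own treatment effect, so the results describe a portfolio of trials rather than repetitions of one trial.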
In certain contexts, there are theoretical guarantees on the operating characteristics of Bayesian sequential designs. Specifically, the following proposition provides such an example.
Proposition 2.1. Let $\Gamma$ in Equation (6) represent the rejection region of a Bayesian design. Assume the joint model for $(\boldsymbol{y}_K, \theta)$ in the Bayesian design is the same as the actual joint distribution of $(\boldsymbol{y}_K, \theta)$ in a series of trials, i.e., $f(\boldsymbol{y}_K \mid \theta)\pi(\theta) = f_0(\boldsymbol{y}_K \mid \theta)\pi_0(\theta)$. Then, the FDR and FPR of the Bayesian design are upper bounded regardless of the time ($n_j$'s) and frequency ($K$) of interim analyses. The proof is given in Appendices B.2 and B.3. Therefore, from a calibrated Bayesian perspective, the prior on $\theta$ could be elicited to resemble the actual distribution of $\theta$ in repeated practices, and the threshold values reflect acceptable FDR and FPR levels.
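A short argument shows why bounds of this type hold. The derivation below is our own sketch under the proposition's assumption that the design model matches the true joint distribution; the notation $\gamma_{\min} = \min_j \gamma_j$, $R$ for the rejection event, and $t$ for the analysis at which the trial stops is introduced here for illustration, and the resulting bounds may differ in form from those stated in the original proposition.

```latex
% On the rejection event R, the trial stopped at analysis t because
% Pr(theta > 0 | y_t) > gamma_t, hence Pr(theta <= 0 | y_t) < 1 - gamma_min.
\begin{align*}
\mathrm{FDR} &= \Pr(\theta \le 0 \mid R)
  = \mathrm{E}\big[\Pr(\theta \le 0 \mid \boldsymbol{y}_t) \,\big|\, R\big]
  \le 1 - \gamma_{\min}, \\[4pt]
\gamma_{\min}\,\Pr(R) &\le \Pr(R,\ \theta > 0) \le \Pr(\theta > 0)
  \quad\Longrightarrow\quad \Pr(R) \le \frac{\Pr(\theta > 0)}{\gamma_{\min}}, \\[4pt]
\mathrm{FPR} &= \Pr(R \mid \theta \le 0)
  = \frac{\mathrm{FDR}\cdot \Pr(R)}{\Pr(\theta \le 0)}
  \le \frac{1 - \gamma_{\min}}{\gamma_{\min}} \cdot
      \frac{\Pr(\theta > 0)}{\Pr(\theta \le 0)}.
\end{align*}
```

Neither bound involves the number or timing of the analyses, which is the point of the proposition; for example, with a constant threshold of 0.95 the FDR cannot exceed 0.05 under these assumptions.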
In general, requiring a design to have good operating characteristics (under plausible scenarios) is more lenient than requiring it to have good frequentist properties (for all possible parameter values). For example, the type I error rate is essentially the FPR when $\pi_0(\theta)$ is a point mass. Stringent type I error rate control requires that the FPR be controlled for all possible $\pi_0(\theta)$, even when $\pi_0(\theta)$ is a point mass at 0, while the calibrated Bayesian approach only requires the FPR to be controlled for plausible $\pi_0(\theta)$. In this sense, the calibrated Bayesian approach can be thought of as a middle ground between the frequentist-oriented approach and the subjective Bayesian approach.
2.4. Our Comments on the Three Perspectives. We have reviewed three perspectives on Bayesian sequential designs, which are summarized in Table 1. Although the three perspectives seem contradictory, they are not mutually exclusive. For example, if the investigator is conservative about a new drug and is cautious about false rejections, then he/she may take a subjective Bayesian approach with a large loss for a false positive decision. This can lead to low FDR and FPR, or even a low type I error rate. In other words, subjective Bayesians may produce desirable operating characteristics for calibrated Bayesians, or desirable frequentist properties for frequentist-oriented Bayesians.
In some contexts, a specific approach can be more applicable and acceptable compared to the others. For example, for large-scale confirmatory trials (e.g., COVID-19 vaccine trials), type I error rate control is enforced by regulators, and thus only the frequentist-oriented perspective is accepted. Indeed, there are some challenges with the subjective and calibrated Bayesian approaches in those settings. See, e.g., Berry et al. (2010); Spiegelhalter et al. (1994). With a large number of enrolled patients, a large population that could potentially benefit from the treatment, and multiple decision makers with distinctive prior opinions and tolerances for risk, the process  (1994) noted, "when the decision is whether or not to discontinue the trial, coupled with whether or not to recommend one treatment in preference to the other, the consequences of any particular course of action are so uncertain that they make the meaningful specification of utilities rather speculative." From a calibrated Bayesian perspective, one could elicit the prior for θ based on historical trials of similar drugs and/or conditions. However, there may be concerns that high or low rates of historical success (e.g., pembrolizumab for solid tumors with a high success rate) may bias the inference for a new trial and trigger incentives for investigators to concentrate clinical research toward attractive areas and selected conditions. On the other hand, the prior for θ could also be based on all historical trials regardless of drugs and conditions. However, the distribution of treatment effects can be highly variable over time, and different types of trials have vastly different endpoints, which are difficult to summarize into a common distribution.
As a result, utilization of Bayesian designs for phase III trials requires a case-by-case discussion that involves extensive examination of prior elicitation, inference procedures, and simulation results, which has been highlighted by several guidance documents from the U.S. Food and Drug Administration (Food and Drug Administration, 2010, 2019, 2020).
The subjective Bayesian perspective can be useful in trials for rare diseases and pediatric trials for small populations. In those situations, simple loss functions may be elicited, and prior distributions can be derived by eliciting expert opinion (Kidwell et al., 2022). The elicitation process usually involves interviewing multiple subject experts such as physicians and their team members, and summaries of the interviews can be reported in the form of statistics like medians, modes, and percentiles. A prior distribution can then be estimated by fitting a parametric distribution to match the summary statistics.
Lastly, the calibrated Bayesian perspective is suitable in exploratory settings, such as animal studies for drug screening and early-phase trials (e.g., dose finding). For those trials, stringent type I error rate control is optional and often at the discretion of the sponsors. Eliciting the prior for θ from previous studies and focusing on FDR/FPR control allow an efficient selection of promising drugs for further development.
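Then, the prior probability for each hypothesis is also specified, $\Pr(H_0) = 1 - \omega$ and $\Pr(H_1) = \omega$. At analysis $j$, the posterior probability of $H_1$ is

$$\Pr(H_1 \mid \boldsymbol{y}_j) = \frac{\omega \int f(\boldsymbol{y}_j \mid \theta)\, \pi^{(1)}(\theta)\, d\theta}{(1 - \omega) \int f(\boldsymbol{y}_j \mid \theta)\, \pi^{(0)}(\theta)\, d\theta + \omega \int f(\boldsymbol{y}_j \mid \theta)\, \pi^{(1)}(\theta)\, d\theta},$$

where $\pi^{(0)}(\theta)$ and $\pi^{(1)}(\theta)$ denote the prior densities of $\theta$ under $H_0$ and $H_1$, respectively. This probability can be used to decide whether to stop the trial early. For example, if $\Pr(H_1 \mid \boldsymbol{y}_j) > \gamma_j$, $H_0$ is rejected, and the trial is stopped. This approach is equivalent to specifying a mixture prior distribution for $\theta$,

$$\pi(\theta) = (1 - \omega)\, \pi^{(0)}(\theta) + \omega\, \pi^{(1)}(\theta),$$

and then stopping the trial at analysis $j$ if $\Pr(\theta > 0 \mid \boldsymbol{y}_j) > \gamma_j$. Note that under the mixture prior,

$$\Pr(\theta > 0 \mid \boldsymbol{y}_j) = \Pr(H_1 \mid \boldsymbol{y}_j).$$

This relationship has been noted by Zhou et al. (2021). Although these two approaches are equivalent, when the primary goal is hypothesis testing, the prior for $\theta$ is usually specified as a mixture of two truncated distributions; when the primary goal is parameter estimation, the prior for $\theta$ is usually specified as a single continuous distribution.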
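The equivalence can be verified in one line. The sketch below assumes the one-sided hypotheses $H_0: \theta \le 0$ and $H_1: \theta > 0$, with component priors supported on $\{\theta \le 0\}$ and $\{\theta > 0\}$; the symbols $\pi^{(0)}$, $\pi^{(1)}$, $m_0$ and $m_1$ (marginal likelihoods under each hypothesis) are notation introduced here for illustration.

```latex
% Mixture prior: pi(theta) = (1 - omega) pi^{(0)}(theta) + omega pi^{(1)}(theta),
% with m_h(y_j) = \int f(y_j | theta) pi^{(h)}(theta) d theta for h = 0, 1.
\begin{align*}
\Pr(\theta > 0 \mid \boldsymbol{y}_j)
  &= \frac{\int_{\theta > 0} f(\boldsymbol{y}_j \mid \theta)\, \pi(\theta)\, d\theta}
          {\int f(\boldsymbol{y}_j \mid \theta)\, \pi(\theta)\, d\theta}
   = \frac{\omega\, m_1(\boldsymbol{y}_j)}
          {(1 - \omega)\, m_0(\boldsymbol{y}_j) + \omega\, m_1(\boldsymbol{y}_j)}
   = \Pr(H_1 \mid \boldsymbol{y}_j).
\end{align*}
```

The numerator reduces to the $H_1$ component alone because $\pi^{(0)}$ places no mass on positive values of $\theta$.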
A special case is when $H_0$ is a point hypothesis, say when we test $H_0: \theta = 0$ vs. $H_1: \theta \ne 0$. From a hypothesis testing perspective, the prior for $\theta$ should be a mixture of a point mass at $\theta = 0$ (denoted by $\delta_0(\theta)$) and a continuous distribution, $\pi(\theta) = (1 - \omega)\delta_0(\theta) + \omega\pi^{(1)}(\theta)$. Such a prior distribution is rarely used when the primary goal is parameter estimation. Lastly, Johnson and Cook (2009) and Johnson and Rossell (2010) recommended the use of non-local prior densities, which incorporate a minimally significant separation between the null and alternative hypotheses, for Bayesian hypothesis testing and applications in trial monitoring.
2.6. Analysis at the Conclusion of a Sequential Trial. From a Bayesian perspective, after a clinical trial has been completed, all the information about $\theta$ is contained in its posterior distribution. Let $t$ denote the stopping time of a sequential trial. For example, based on the stopping rule in Equation (4),

$$t = \min\big( \{ j \in \{1, \ldots, K-1\} : z_j > c_j \} \cup \{K\} \big).$$

Then, $\boldsymbol{y}_t = (y_1, \ldots, y_{n_t})$ is the vector of accumulating data up to the time of stopping.
At the time of stopping, the posterior distribution of $\theta$ is given by

$$p(\theta \mid \boldsymbol{y}_t) \propto \pi(\theta)\, f(\boldsymbol{y}_t \mid \theta).$$

One may be worried that the stopping time $t$ is not included in the conditioning of $p(\theta \mid \boldsymbol{y}_t)$. However, assuming that $\theta$ and $t$ are independent conditional on $\boldsymbol{y}_t$, we have

$$p(\theta \mid \boldsymbol{y}_t, t) = p(\theta \mid \boldsymbol{y}_t).$$

Most often (and in all the designs that we have reviewed), $\theta$ affects $t$ only through the observations $\boldsymbol{y}_t$, in which case the conditional independence assumption is satisfied, the equation holds, and the stopping rule plays no role in the posterior distribution of $\theta$. See, e.g., Hendriksen et al. (2021). However, we note that in some situations, $\theta$ could affect $t$ other than just via $\boldsymbol{y}_t$. For example, if an interim analysis happens because an external trial found a positive treatment effect, which is more likely if $\theta$ is positive and large, this would affect $t$ via external data other than via the current data.
The posterior mean, $E(\theta \mid \boldsymbol{y}_t)$, is a commonly used point estimator for $\theta$. In addition, a $100(1 - \alpha)\%$ credible interval for $\theta$ can be constructed as $(\theta_L, \theta_U)$, where $\theta_L$ and $\theta_U$ are the lower and upper $(\alpha/2)$ quantiles of $p(\theta \mid \boldsymbol{y}_t)$, respectively.
This credible interval has its asserted coverage in repeated practices if the model specification is correct (see Appendix B.1), but the coverage may deteriorate in the presence of model misspecification. Lastly, the posterior probability of the alternative hypothesis, $\Pr(\theta > 0 \mid \boldsymbol{y}_t)$, is also reported.
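These end-of-trial summaries are straightforward to compute in the conjugate normal case. The sketch below assumes a $N(\mu, \nu^2)$ prior, normal outcomes with known $\sigma$, and that the stopping rule depends on $\theta$ only through the observed data, so the usual posterior applies unchanged. The function name and example data are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def posterior_summary(y_t, mu, nu, sigma, alpha=0.05):
    """Posterior mean, equal-tailed credible interval, and Pr(theta > 0 | y_t)."""
    y_t = np.asarray(y_t, dtype=float)
    n_t = y_t.size                                   # sample size at stopping
    prec = 1.0 / nu**2 + n_t / sigma**2
    mean = (mu / nu**2 + y_t.sum() / sigma**2) / prec
    sd = np.sqrt(1.0 / prec)
    lower, upper = norm.ppf([alpha / 2, 1 - alpha / 2], loc=mean, scale=sd)
    return {"mean": float(mean),
            "ci": (float(lower), float(upper)),
            "pp_efficacy": float(norm.cdf(mean / sd))}

# Illustrative data: 100 outcomes observed when the trial stopped.
rng = np.random.default_rng(5)
print(posterior_summary(rng.normal(0.3, 1.0, size=100), mu=0.0, nu=1.0, sigma=1.0))
```

Note that the function takes only the data available at stopping; neither the stopping boundaries nor the number of earlier looks enters the calculation.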
2.7. Randomized-controlled Trial and Minimum Clinically Important Difference. So far, we have been using a single-arm trial to illustrate the designs. In practice, multi-arm trials such as randomized-controlled trials are also very common.
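We briefly outline an extension of the designs for a randomized-controlled trial. For simplicity, assume the trial outcomes are normally distributed. At analysis $j$, observed data are $y_{r1}, y_{r2}, \ldots, y_{r n_{rj}} \sim N(\theta_r, \sigma_r^2)$ for arm $r$, where $r = 1$ and $0$ represent the investigational drug and control arms, respectively. The goal may be to test

$$H_0: \theta_1 - \theta_0 \le 0 \quad \text{vs.} \quad H_1: \theta_1 - \theta_0 > 0.$$

Assume $\sigma_1^2$ and $\sigma_0^2$ are known. One can specify a prior distribution for $\theta = \theta_1 - \theta_0$, say $\theta \sim N(\mu, \nu^2)$. The posterior distribution of $\theta$ at analysis $j$ is given by

$$\theta \mid \text{data} \sim N\left( \frac{\mu/\nu^2 + (\bar{y}_{1j} - \bar{y}_{0j})/\tau_j^2}{1/\nu^2 + 1/\tau_j^2},\; \frac{1}{1/\nu^2 + 1/\tau_j^2} \right),$$

where $\bar{y}_{rj}$ is the sample mean in arm $r$ at analysis $j$ and $\tau_j^2 = \sigma_1^2/n_{1j} + \sigma_0^2/n_{0j}$ is the sampling variance of $\bar{y}_{1j} - \bar{y}_{0j}$. Then, one can proceed similarly as before. An alternative approach is to specify independent priors separately for $\theta_1$ and $\theta_0$ and then use these to obtain a posterior distribution of $\theta$. This will lead to slightly different designs. See Stallard et al. (2020). When $\sigma_1^2$ and $\sigma_0^2$ are unknown, one needs to specify priors for these parameters as well and calculate the marginal posterior distribution of $\theta$.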
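For the two-arm case, a posterior probability of efficacy can be sketched as follows. The sketch assumes known arm variances, a $N(\mu, \nu^2)$ prior placed directly on the difference in means, and a likelihood based on the difference in sample means; the optional argument `delta` is a hypothetical margin so that the same function can evaluate the probability that the effect exceeds a chosen threshold (with 0 as the default). Names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def two_arm_pp(ybar1, n1, ybar0, n0, mu, nu, sigma1=1.0, sigma0=1.0, delta=0.0):
    """Pr(theta_1 - theta_0 > delta | data) under a N(mu, nu^2) prior on the difference."""
    var_diff = sigma1**2 / n1 + sigma0**2 / n0        # variance of ybar1 - ybar0
    prec = 1.0 / nu**2 + 1.0 / var_diff               # posterior precision
    mean = (mu / nu**2 + (ybar1 - ybar0) / var_diff) / prec
    sd = np.sqrt(1.0 / prec)
    return float(norm.sf((delta - mean) / sd))        # upper tail beyond delta

# Illustrative interim look: 100 patients per arm, observed difference 0.3.
print(two_arm_pp(0.5, 100, 0.2, 100, mu=0.0, nu=1.0))
print(two_arm_pp(0.5, 100, 0.2, 100, mu=0.0, nu=1.0, delta=0.2))
```

Raising the margin can only lower the reported probability, so a rule based on a positive margin is at least as demanding as the rule based on zero.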
In some trials, such as proof-of-concept trials, it may be of interest to evaluate the evidence of the treatment effect being greater than a minimum clinically important difference, denoted by $\Delta$ (Chuang-Stein et al., 2011; Fisch et al., 2015). In this case, one may replace the stopping rule in Equation (3) by

$$\Pr(\theta > \Delta \mid y_j) > \gamma_j. \qquad (9)$$

Alternatively, the efficacy stopping rule can be based on both Equations (3) and (9). Here, Equation (3) speaks to "does the drug work at all", while Equation (9) addresses "does the drug have a clinically relevant effect". In proof-of-concept trials, Equation (9) may be a necessary criterion for a drug to be promoted into full development et al., 1994). First, with a chosen probability model, the data affect posterior inference only through the likelihood function. In this way, Bayesian inference obeys the LP (Gelman et al., 2013, p. 7). This can be philosophically appealing. Frequentist inference, on the other hand, may be affected by unrealized events. We will elaborate on this point in Section 4. Second, the stopping rule of an experiment is irrelevant to the construction and interpretation of a Bayesian credible interval. In contrast, a frequentist interval estimate of treatment effect following a group sequential trial crucially depends on the stopping rule. As Freedman et al. (1994) pointed out, such an interval may be quite unintuitive. Depending on the choice of sample space ordering, the interval may not always include the sample mean and can include zero difference even for data that lead to a recommendation to stop the trial at the first interim analysis (see Rosner and Tsiatis, 1988). Third, stringent frequentist inference can be challenging or unsatisfactory if the prescribed stopping rule is not followed.
For example, a trial may be stopped due to unforeseeable circumstances such as the outbreak of COVID-19; in some cases, it may be desirable to extend a trial beyond the planned sample size. Some have argued that the relevance of stopping rules makes it almost impossible to conduct any frequentist inference in a strict sense (Berger, 1980; Berry, 1985; Berger and Wolpert, 1988; Wagenmakers, 2007). Oftentimes, statisticians are presented with a dataset without knowing how the stopping of the study was decided and why the study was not stopped earlier. Both factors can affect the frequentist properties of a statistical procedure, while in practice it is infeasible to keep track of them. Lastly, when reliable historical information is available, it can be formally incorporated into the design and analysis of the current trial via Bayesian methods. This may lead to improvements in trial efficiency in terms of higher power and savings in sample size (see Shi and Yin, 2019).

3. Other Types of Bayesian Sequential Designs
3.1. Designs Based on Posterior Predictive Probabilities. In the upcoming sections, we review some other types of Bayesian sequential designs whose early stopping rules are not directly based on $\Pr(\theta > 0 \mid y_j) > \gamma_j$. Similar to the idea of stochastic curtailment (Lan et al., 1982), posterior predictive probabilities can be used to determine whether to stop a trial early. See, e.g., Dmitrienko and Wang (2006), Lee and Liu (2008), and Saville et al. (2014). Suppose that at the final analysis, efficacy of the drug will be declared if $\Pr(\theta > 0 \mid y_K) > 1 - \eta$. At analysis $j \in \{1, \ldots, K-1\}$, the posterior predictive distribution of future observations $y^*_{j,K} = (y^*_{n_j+1}, \ldots, y^*_{n_K})$ is

$$p(y^*_{j,K} \mid y_j) = \int f(y^*_{j,K} \mid \theta)\, p(\theta \mid y_j)\, d\theta,$$

and the posterior predictive probability of success (PPOS) is

$$\mathrm{PPOS}_j = \int 1\left\{ \Pr(\theta > 0 \mid y_j, y^*_{j,K}) > 1 - \eta \right\} p(y^*_{j,K} \mid y_j)\, dy^*_{j,K}.$$

One may stop the trial early if $\mathrm{PPOS}_j > \gamma_j$ for some threshold $\gamma_j$. To specify the prior for $\theta$ and the threshold values $\{\gamma_1, \ldots, \gamma_{K-1}\}$ and $\eta$, one may take one of the approaches in Sections 2.1-2.3.
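For the single-arm trial example, the posterior predictive distribution of $\bar{y}^*_{j,K}$ given $y_j$ is normal, where $\bar{y}^*_{j,K} = (y^*_{n_j+1} + \cdots + y^*_{n_K})/(n_K - n_j)$ is the mean of the future observations. The criterion $\Pr(\theta > 0 \mid y_j, y^*_{j,K}) > 1 - \eta$ is equivalent to $\bar{y}^*_{j,K}$ exceeding a cutoff that depends on $y_j$, $\eta$, and $n_K$, so the PPOS can be derived in closed form as a normal tail probability. The PPOS depends on $\eta$ and $n_K$. In general, the stopping rules based on PPOS and PP are different, although for given $\eta$ and $n_K$, one may select the threshold for $\mathrm{PPOS}_j$ such that the resulting rule is equivalent to $\{\mathrm{PP}_j > \gamma_j\}$. As a result, one may also impose type I error rate control on PPOS stopping rules based on the arguments in Section 2.1. As noted by Saville et al. (2014), if at the $j$th interim analysis the amount of data remaining to be collected, $(n_K - n_j)$, is infinite, then $\mathrm{PPOS}_j = \mathrm{PP}_j$ regardless of $\eta$.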
Typically, the PPOS is close to the PP at the beginning of a trial and moves toward either 0 or 1 as the trial nears completion.
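The relationship between the PPOS and the PP can be explored numerically. The sketch below is a hedged illustration, assuming the single-arm normal model with known `sigma` and a conjugate $N(\mu, \nu^2)$ prior; under these assumptions the mean of the future observations has a normal predictive distribution, and success at the final analysis reduces to that mean exceeding a cutoff. The function names and numerical inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()


def posterior(ybar, n, sigma, mu, nu):
    """Posterior mean and variance of theta under the conjugate normal model."""
    prec = 1.0 / nu**2 + n / sigma**2
    return (mu / nu**2 + n * ybar / sigma**2) / prec, 1.0 / prec


def pp(ybar_j, n_j, sigma=1.0, mu=0.0, nu=1.0):
    """Posterior probability Pr(theta > 0 | y_j)."""
    m, v = posterior(ybar_j, n_j, sigma, mu, nu)
    return STD.cdf(m / sqrt(v))


def ppos(ybar_j, n_j, n_K, eta=0.05, sigma=1.0, mu=0.0, nu=1.0):
    """Posterior predictive probability that Pr(theta > 0 | y_K) > 1 - eta."""
    m_j, v_j = posterior(ybar_j, n_j, sigma, mu, nu)
    n_fut = n_K - n_j
    prec_j = 1.0 / v_j
    prec_K = prec_j + n_fut / sigma**2
    q = STD.inv_cdf(1.0 - eta)
    # Success at the final analysis <=> mean of future observations > cutoff.
    cutoff = (q * sqrt(prec_K) - prec_j * m_j) * sigma**2 / n_fut
    # Predictive sd of the future mean: posterior uncertainty plus sampling noise.
    pred_sd = sqrt(v_j + sigma**2 / n_fut)
    return 1.0 - STD.cdf((cutoff - m_j) / pred_sd)


pp_now = pp(0.1, 200)
ppos_huge = ppos(0.1, 200, n_K=10**9)      # nearly unlimited future data
ppos_late = ppos(0.1, 990, n_K=1000)       # only 10 observations remain
print(pp_now, ppos_huge, ppos_late)
```

With a very large remaining sample size the PPOS approaches the PP, and close to the end of the trial it moves toward 0 or 1, in line with the two remarks in the text.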
3.2. Decision-theoretic Designs. As described in Section 2.2, the decisions in a sequential clinical trial can be made by minimizing the expected loss under a decision-theoretic framework. This approach has been considered by Berry and Ho (1988), Lewis and Berry (1994), Stallard et al. (1999), and Ventz and Trippa (2015), among others. The idea is that, at each interim analysis, the decision to stop the trial early and reject $H_0$ is associated with some loss if the decision is wrong. On the other hand, continuing the trial results in more cost in terms of patient recruitment. With more data, however, the chance of making a wrong decision may decrease. By considering both factors, decision-theoretic designs combine the strengths of designs based on posterior and posterior predictive probabilities.
We illustrate the idea of decision-theoretic designs through the single-arm trial example. Let $\phi_j$ denote a possible decision at analysis $j$. For $j = 1, \ldots, K-1$, $\phi_j = 1$ (or 0) represents rejecting $H_0$ and stopping the trial early (or failing to reject and continuing enrollment). For $j = K$, $\phi_K = 1$ (or 0) represents rejecting (or failing to reject) $H_0$ at the final analysis, and the trial is stopped in either case. Let $\ell_j(\phi_j, \theta, y_j)$ denote the loss of making decision $\phi_j$ at analysis $j$ given parameter $\theta$ and data $y_j$. The posterior expected loss is then

$$L_j(\phi_j, y_j) = \int_\theta \ell_j(\phi_j, \theta, y_j)\, p(\theta \mid y_j)\, d\theta.$$

The optimal decision is $\hat{\phi}_j(y_j) = \arg\min_{\phi_j} L_j(\phi_j, y_j)$, and the associated expected loss is $\hat{L}_j(y_j) = \min_{\phi_j} L_j(\phi_j, y_j)$, i.e., the Bayes risk.
Suppose that the loss of making decision $\phi_j = 1$ at analysis $j$ ($j = 1, \ldots, K-1$) is

$$\ell_j(\phi_j = 1, \theta, y_j) = \xi_{1j} \cdot 1(\theta \leq 0),$$

where $\xi_{1j}$ is the loss of mistakenly rejecting $H_0$ and stopping the trial if $\theta \leq 0$. On the other hand, if $\phi_j = 0$, the trial continues, $(n_{j+1} - n_j)$ patients will be enrolled until the next analysis, and we assume a unit loss for recruiting each patient. We then have

$$\ell_j(\phi_j = 0, \theta, y_j) = (n_{j+1} - n_j) + \int_{y^*_{j,j+1}} \hat{L}_{j+1}(y_j, y^*_{j,j+1})\, p(y^*_{j,j+1} \mid y_j)\, dy^*_{j,j+1}.$$

Here, the integral is the Bayes risk at analysis $(j+1)$ marginalized over the posterior predictive distribution of $y^*_{j,j+1} = (y^*_{n_j+1}, \ldots, y^*_{n_{j+1}})$, that is, the observations between analyses $j$ and $j+1$.
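We also assume the loss of making decision $\phi_K$ at the final analysis is

$$\ell_K(\phi_K, \theta, y_K) = \begin{cases} \xi_{1K} \cdot 1(\theta \leq 0), & \phi_K = 1, \\ \xi_0 \cdot 1(\theta > 0), & \phi_K = 0. \end{cases}$$

Here, $\xi_{1K}$ is the loss of mistakenly rejecting $H_0$ at the final analysis if $\theta \leq 0$ (a type I error), and $\xi_0$ is the loss of failing to reject $H_0$ if $\theta > 0$ (a type II error).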
At analysis $j$, the optimal decision $\hat{\phi}_j(y_j)$ can be solved by backward induction (DeGroot, 1970, Chapter 12). First, we calculate $\hat{L}_K(y_K)$ for all possible data $y_K$ that can arise at the final analysis. Next, using Equations (10) and (11), we can calculate $\hat{L}_{K-1}(y_{K-1})$ for all possible data $y_{K-1}$ that can arise at analysis $(K-1)$. Proceeding backward in this way gives $\hat{L}_{K-2}(y_{K-2}), \ldots, \hat{L}_j(y_j)$. This procedure requires many minimizations and integrations, which may not be analytically tractable.
Simulation-based approaches have been proposed to mitigate these computational challenges (Müller et al., 2007). Lewis and Berry (1994) demonstrated that by tuning the loss functions, decision-theoretic designs can achieve desirable type I error rate control. Ventz and Trippa (2015) considered constrained optimal designs with explicit frequentist requisites.
Alternatively, the loss functions and prior can be chosen by taking the subjective or calibrated Bayesian approach.
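To make the backward induction concrete, the following is a minimal sketch for the smallest nontrivial case, $K = 2$. It assumes the single-arm normal model with a conjugate $N(\mu, \nu^2)$ prior and the loss structure described in prose above: a loss `xi1` for rejecting $H_0$ when $\theta \leq 0$, a loss `xi0` for failing to reject when $\theta > 0$, and a unit loss per enrolled patient. The integral over future data is approximated by a simple midpoint rule on predictive quantiles, which is our choice for illustration and not the authors' method; all function names and numerical settings are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()


def posterior(ybar, n, sigma=1.0, mu=0.0, nu=1.0):
    """Posterior mean and variance of theta under the conjugate normal model."""
    prec = 1.0 / nu**2 + n / sigma**2
    return (mu / nu**2 + n * ybar / sigma**2) / prec, 1.0 / prec


def final_risk(ybar, n, xi1, xi0, **prior):
    """Minimum posterior expected loss over {reject, do not reject} at the final analysis."""
    m, v = posterior(ybar, n, **prior)
    p_pos = STD.cdf(m / sqrt(v))                     # Pr(theta > 0 | data)
    return min(xi1 * (1.0 - p_pos), xi0 * p_pos)


def interim_losses(ybar1, n1, n2, xi1, xi0, grid=2000, **prior):
    """Posterior expected losses of stopping vs. continuing at analysis 1 of K = 2."""
    sigma = prior.get("sigma", 1.0)
    m1, v1 = posterior(ybar1, n1, **prior)
    stop = xi1 * (1.0 - STD.cdf(m1 / sqrt(v1)))      # loss only if theta <= 0
    n_fut = n2 - n1
    pred = NormalDist(m1, sqrt(v1 + sigma**2 / n_fut))  # predictive law of the future mean
    future = 0.0
    for i in range(grid):                            # midpoint rule on predictive quantiles
        ybar_fut = pred.inv_cdf((i + 0.5) / grid)
        ybar2 = (n1 * ybar1 + n_fut * ybar_fut) / n2
        future += final_risk(ybar2, n2, xi1, xi0, **prior)
    cont = n_fut + future / grid                     # recruitment cost + expected final risk
    return stop, cont


# Illustrative losses xi0 = 400 and xi1 = 19 * xi0; sample sizes are assumptions.
strong = interim_losses(ybar1=0.5, n1=200, n2=400, xi1=7600, xi0=400)
neutral = interim_losses(ybar1=0.0, n1=200, n2=400, xi1=7600, xi0=400)
print(strong, neutral)
```

The optimal interim decision is whichever of the two losses is smaller. For $K > 2$ the same recursion would be applied repeatedly from the final analysis backward, which is where the computational burden noted in the text arises.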
We summarize in Table 2 the various methods and measures that give rise to different types of sequential designs, including frequentist designs reviewed in Appendix A.

4. The Likelihood Principle
Statistical inference and decision making in sequential clinical trials are typically tied to the LP. We provide some discussions in this section.
Let $Y$ denote a random variable with density $f_\theta(y)$. The likelihood function for $\theta$, given the observed outcome $y$ of the random variable $Y$, is $L_y(\theta) = f_\theta(y)$, that is, the density evaluated at $y$ and considered as a function of $\theta$. The (strong) LP, as in Birnbaum (1962) and Berger and Wolpert (1988), can be summarized as follows:

The Likelihood Principle. All the statistical evidence about $\theta$ arising from an experiment is contained in the likelihood function for $\theta$ given $y$. Two likelihood functions for $\theta$ (from the same or different experiments) contain the same statistical evidence about $\theta$ if they are proportional to one another.

Birnbaum (1962) showed that the LP can be deduced from two widely accepted principles: the sufficiency principle and the conditionality principle. There have been debates regarding Birnbaum's proof and the validity of the LP in general. A detailed treatment of the LP is outside the scope of this paper. We refer interested readers to Berger and Wolpert (1988), Robins and Wasserman (2000), Evans (2013), Mayo (2014), Gandenberger (2015a), and Peña and Berger (2017).
What would be the consequences if we accept the LP? Since the LP deals only with the observed y, data that did not obtain and experiments not carried out have no impact on the evidence about θ (Berry, 1987;Berger and Wolpert, 1988). Also, as in Berger and Wolpert (1988), the LP implies that the reason for stopping an experiment (the stopping rule) should be irrelevant to the evidence about θ. In a clinical trial, the implication is that early stopping would not affect the evidential meaning of the trial outcome.
As an illustration, consider the example given by Berry (1987). Imagine that a single-arm trial as described in Section 1.2 has been conducted, and 200 outcomes have been recorded that result in a z-statistic of $z_1 = 1.75$. These results are be-

The conflict here does not mean we have to either reject the LP or reject frequentist procedures. As explained previously (e.g., Berger and Wolpert, 1988; Gandenberger, 2015b, 2017), the LP is not a decision procedure and gives little guidance in assessing the overall performance of a decision procedure. The LP implies that only the observed data are relevant to the evidence about $\theta$, but the consequences of making a specific decision may depend on other aspects of an experiment. First, while the evidence about $\theta$ is trial-specific, a decision procedure is applied to many trials. For example, from a regulatory agency's perspective, the action to approve a drug reflects not only the consequences of administering this drug to patients, but also the downstream consequences of that decision rule for other drugs in the future (Gandenberger, 2017). Therefore, frequentist measures such as the type I error rate can be factored into the decision procedure. Second, even for a single trial, it is not unreasonable to associate the consequences of a decision with unrealized data patterns.

investigator D did not plan to conduct any additional interim analysis. Suppose the z-statistic at the interim analysis is $z_1 = 1.75$. Then, using the design and loss functions described in Section 3.2 with $\xi_0 = 400$ and $\xi_{1j} \equiv 19\xi_0$ for all $j$, the optimal decisions for investigators C and D are continuing enrollment and stopping the trial, respectively. Specifically, Figure 1 shows the posterior expected losses for possible decisions that can be made by the two investigators. We can see that the existence of a planned future interim analysis has an impact on the posterior expected loss associated with continuing the trial.
In summary, if a dichotomous decision must be made, the LP does not preclude one from utilizing other information in addition to the observed data. Therefore, our view is that the LP should not be used as an argument for or against Bayesian or frequentist sequential designs.
Still, the conflict does suggest that if we accept the LP, then frequentist measures such as type I/II error rates and p-values may not be used as measures of statistical evidence for or against a hypothesis in a clinical trial (Berger and Wolpert, 1988).
This point has been raised by many others as well. For example, Royall (1997)

In summary, in an ideal world, one may use frequentist measures to design a trial. However, when reporting statistical analysis results as evidence after trial completion, Bayesian measures that conform to the LP should be preferred.
It should also be noted that not all Bayesian procedures are in compliance with the LP. For example, eliciting the prior for θ based on the sampling plan, such as using the Jeffreys prior (Jeffreys, 1946), results in violation of the LP (Berger and Wolpert, 1988, p. 21). We have mentioned in Section 2.1 that one may control the type I error rate of a Bayesian sequential design by calibrating the prior or threshold values. To avoid violation of the LP, however, we recommend taking the latter approach and not selecting the prior based on trial planning. Intuitively, changing the threshold values only affects decision making, while changing the prior affects both the evidence about θ (e.g., point and interval estimations) and decision making.

5.1. Illustration of the Frequentist-oriented Approach. As an illustration of the frequentist-oriented approach, we calculate the stopping boundaries for the z-statistics given by some of the aforementioned Bayesian sequential designs with the type I error rate controlled at $\alpha = 0.05$. That is, we compute the $\{c_1, \ldots, c_K\}$ values for which we would stop the trial at analysis $j$ if $z_j > c_j$. We consider the single-arm trial example described in Section 1.2. Suppose that a total of $K = 5$ (interim and final) analyses are planned, the maximum sample size is $n_K = 1000$, and patients are enrolled in groups of size 200 ($n_j = 200j$). The variance for the outcomes is set at $\sigma^2 = 1$ and is assumed known. Specifically: (i) For stopping boundaries based on posterior probabilities (Equation 4), we consider the following two versions. In the first version, we use $\gamma_j \equiv 0.95$ and find that a $N(0, 0.054^2)$ prior for $\theta$ leads to $\alpha = 0.05$. In the second version, we place a $N(0, 1^2)$ prior on $\theta$ and find that setting $\gamma_j \equiv 0.983$ leads to $\alpha = 0.05$.
The stopping boundaries are summarized in Table 3. For comparison, we also include the stopping boundaries produced by the Pocock and O'Brien-Fleming procedures (Pocock, 1977; O'Brien and Fleming, 1979) and the linear error spending function (Kim and DeMets, 1987b). See Appendices A.1 and A.2 for more details.

Table 3. Stopping boundaries for the z-statistics given by several Bayesian and frequentist sequential designs. The single-arm trial in Section 1.2 is considered with $K = 5$ analyses, a maximum sample size of $n_K = 1000$, and equal group sizes ($n_j = 200j$). The design parameters are calibrated such that the type I error rate at $\theta = 0$ is $\alpha = 0.05$ for every design.

expected sample size. This is due to its large stopping boundaries at early analyses and progressively smaller stopping boundaries at later analyses. In contrast, the Pocock boundaries and the boundaries based on posterior probabilities (version 2) lead to the lowest expected sample size but also have the lowest power. For more discussion on the frequentist evaluation of sequential designs, refer to Jennison and Turnbull (2000).
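The translation from a posterior probability threshold to a z-statistic boundary can be sketched directly. The code below is an illustration under the assumptions of the single-arm normal model with known `sigma` and a $N(\mu, \nu^2)$ prior, for which $\Pr(\theta > 0 \mid y_j) > \gamma$ can be rearranged into $z_j > c_j$; the function name is hypothetical, and the output has not been checked against the entries of Table 3.

```python
from math import sqrt
from statistics import NormalDist


def pp_boundaries(gamma, nu, n_list, sigma=1.0, mu=0.0):
    """Boundaries c_j such that Pr(theta > 0 | y_j) > gamma  <=>  z_j > c_j."""
    q = NormalDist().inv_cdf(gamma)
    bounds = []
    for n in n_list:
        prec = 1.0 / nu**2 + n / sigma**2           # posterior precision at sample size n
        # Rearranged from: posterior mean * sqrt(prec) > q, with z = ybar * sqrt(n) / sigma.
        c = (q * sqrt(prec) - mu / nu**2) * sigma / sqrt(n)
        bounds.append(c)
    return bounds


n_list = [200 * j for j in range(1, 6)]
version1 = pp_boundaries(gamma=0.95, nu=0.054, n_list=n_list)   # concentrated prior
version2 = pp_boundaries(gamma=0.983, nu=1.0, n_list=n_list)    # diffuse prior
print([round(c, 3) for c in version1])
print([round(c, 3) for c in version2])
```

Under these assumptions, a concentrated prior yields boundaries that start high and decrease over the analyses, whereas a diffuse prior with a raised threshold yields nearly constant boundaries, which is consistent with the comparison to the Pocock boundaries in the text.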

5.2. Illustration of the Calibrated Bayesian Approach.
To demonstrate the calibrated Bayesian approach, we conduct simulation studies to explore the operating characteristics of a Bayesian design under a variety of plausible scenarios. Consider the single-arm trial example in Section 1.2 with a maximum sample size of $n_K = 1000$ and the Bayesian design with stopping rules given by Equation (3). Suppose the actual effect size of the trial, $\theta$, is a random draw from $N(\mu_0, \nu_0^2)$. As the trial progresses, patient outcomes become available sequentially and follow a normal distribution, $y_1, y_2, \ldots \sim N(\theta, \sigma^2)$. The trial statistician, on the other hand, uses a $N(\mu, \nu^2)$ prior to draw inference about $\theta$, which may or may not be identical to the actual population distribution of $\theta$. For simplicity, assume the sampling model used by the statistician, $f(y_K \mid \theta)$, is correctly specified. At prespecified times and frequency, the statistician conducts interim analyses of accumulating data. If the stopping rule is triggered, $H_0$ is rejected, the trial is stopped, and efficacy of the drug is declared.
We consider 72 simulation scenarios, one for each combination of ν 0 ∈ {0.1, 0.5, 1}, ν ∈ {0.1, 0.5, 1, 10}, and K ∈ {1, 2, 5, 10, 100, 1000}. For simplicity, we fix the other parameters: µ 0 = µ = 0, and σ = 1. Here, a larger (or smaller) value of ν 0 indicates that the actual effect size is more likely to be larger (or smaller). We do not consider ν 0 > 1 as in practice, a standardized effect size that is much larger than what could be drawn from a N(0, 1 2 ) distribution is not common. A larger (or smaller) value of ν represents that the assumed prior for θ is more diffuse (or more concentrated around zero). When ν 0 = ν, the population distribution of θ over different trials is the same as the prior for θ used for analysis. Lastly, K is the total number of (interim and final) analyses. We assume that patients are enrolled in groups of equal size n K /K.
For each scenario, we simulate $S = 10{,}000$ hypothetical trials by first generating $\theta^{(1)}, \ldots, \theta^{(S)} \sim N(\mu_0, \nu_0^2)$. Next, for each $\theta^{(s)}$, trial outcomes are sequentially generated from $N(\theta^{(s)}, \sigma^2)$. Interim analyses are performed after every $n_K/K$ outcomes have been observed, and the trial is stopped if the stopping rule as in Equation (4) is satisfied with $\gamma_j \equiv \gamma = 0.95$. We record the FDR and FPR as defined in Equation (7). In addition, we record the percentage of 95% credible intervals for $\theta$, calculated as in Section 2.6, that cover the true values. Table 4 summarizes the simulation results. Although the FDR and FPR increase with the number of analyses, according to Proposition 2.1, the FDR and FPR are upper bounded when the statistician's model is correctly specified. These theoretical results are corroborated by the simulations: when $\nu_0 = \nu$, the FDR is roughly bounded by $1 - \gamma = 5\%$ (due to Monte Carlo errors and a finite number of simulations, the FDR may sometimes exceed 5%), and the FPR is always below $(1-\gamma)/\gamma = 5.3\%$.
In addition, when ν 0 = ν, the coverage of the 95% credible intervals for θ is around 95% regardless of K.
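A scaled-down version of this simulation can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes $\mu_0 = \mu = 0$, equal group sizes, and the working definitions FDR = Pr($\theta \leq 0$ | $H_0$ rejected) and FPR = Pr($H_0$ rejected | $\theta \leq 0$), since Equation (7) is not reproduced in this section. The function name, the number of simulated trials, and the seed are arbitrary choices.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_design(nu0, nu, K, n_max=1000, gamma=0.95, sigma=1.0, S=2000, seed=1):
    """Simulate S trials; return (FDR, FPR, coverage of 95% credible intervals)."""
    rng = random.Random(seed)
    q_gamma = NormalDist().inv_cdf(gamma)
    q_ci = NormalDist().inv_cdf(0.975)
    group = n_max // K                       # equal group sizes (K should divide n_max)
    n_reject = n_null = n_false = n_cover = 0
    for _trial in range(S):
        theta = rng.gauss(0.0, nu0)          # true effect from the population N(0, nu0^2)
        total, n, rejected = 0.0, 0, False
        for _stage in range(K):
            # Sum of one group of outcomes, drawn directly as a normal variate.
            total += rng.gauss(group * theta, sigma * sqrt(group))
            n += group
            prec = 1.0 / nu**2 + n / sigma**2
            mean = (total / sigma**2) / prec  # posterior mean under the N(0, nu^2) prior
            if mean * sqrt(prec) > q_gamma:   # Pr(theta > 0 | data) > gamma
                rejected = True
                break
        half = q_ci / sqrt(prec)              # interval computed at the time of stopping
        n_cover += (mean - half <= theta <= mean + half)
        n_null += (theta <= 0)
        n_reject += rejected
        n_false += (rejected and theta <= 0)
    fdr = n_false / n_reject if n_reject else 0.0
    fpr = n_false / n_null if n_null else 0.0
    return fdr, fpr, n_cover / S


matched = simulate_design(nu0=1.0, nu=1.0, K=5)        # prior equals the population
mismatched = simulate_design(nu0=0.1, nu=10.0, K=100)  # prior far more diffuse
print(matched)
print(mismatched)
```

Comparing the two calls gives a quick sense of how the operating characteristics react when the analysis prior departs from the population distribution of $\theta$; the exact figures will differ from Table 4 because of the smaller number of simulated trials.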
In the presence of model misspecification, however, Bayesian statements may not attain their asserted coverage, and the discrepancy becomes larger with more frequent applications of data-dependent stopping rules. These results are consistent with the findings in Rubin (1984) and Rosenbaum and Rubin (1984). When the assumed prior is more diffuse than the actual distribution of $\theta$, the FDR and FPR are inflated, and the degree of FDR and FPR inflation becomes greater when $K$ is larger. For example, when $\nu_0 = 0.1$, $\nu = 10$, and $K = 1000$, the FDR and FPR are around 20%. For

Table 4. Operating characteristics of the Bayesian design with stopping rules given by Equation (3)

to be small. In addition, when $\nu_0 \neq \nu$, the coverage of the 95% credible intervals for $\theta$ is below 95% and decreases as $K$ increases. Interestingly, an overly conservative prior (that is, one more concentrated around zero) results in low coverage of the credible intervals, while a diffuse prior has less impact on the coverage.
From a calibrated Bayesian point of view, simulation studies of this type can be used to guide the choice of π(θ) and {γ 1 , . . . , γ K }. Suppose the trial statistician decides to use a constant threshold value γ j ≡ γ = 0.95 and wants to select ν such that the FDR and FPR of the design are controlled at below 5% for plausible ν 0 and K scenarios (assume µ 0 = µ = 0). To achieve this goal for all possible ν 0 and K considered here, ν should be set at ≤ 0.1. However, if one plans to conduct no more than K = 10 analyses, then setting ν ≤ 1 is sufficient.
We do not present additional numerical studies for the subjective Bayesian approach, in which case the prior and threshold values may be chosen based on a subjective belief rather than simulations.

6. Discussion
We have summarized three perspectives on Bayesian sequential designs, namely the frequentist-oriented perspective, the subjective Bayesian perspective, and the calibrated Bayesian perspective, and have discussed their implications. We have reviewed Bayesian sequential designs based on posterior probabilities, posterior predictive probabilities, and decision-theoretic frameworks. We have also commented on the role of the LP in sequential trial designs. While the LP implies that unrealized events are irrelevant to the statistical evidence about the treatment effect, it gives little guidance in assessing a decision procedure and thus does not preclude the use of additional information in decision-making.
So far, we have only considered early stopping for efficacy. In practice, it may be desirable to allow for early stopping when interim results suggest the investigational drug is unlikely to have a clinically meaningful treatment effect (Snapinn et al., 2006). This is known as early stopping for futility. A sequential trial design can include a provision for either early efficacy stopping, early futility stopping, or both. Consider the single-arm trial example. One could stop the trial at analysis $j$ in favor of the null hypothesis if $\Pr(\theta > 0 \mid y_j) < \tau_j$ for some threshold $\tau_j$. Futility stopping rules do not inflate the type I error rate; in fact, they decrease it. However, futility stopping rules also decrease the power and increase the false negative rate (FNR) and false omission rate (FOR) of a design. The futility boundaries could be specified to either satisfy certain power and type I error rate requirements (similar to Pampallona and Tsiatis (1994)

There have been several criticisms of testing a point null hypothesis (Berger and Sellke, 1987), such as the plausibility of $\theta$ being equal to 0 exactly. As a result, we have focused on a one-sided test with a composite null hypothesis (Equation 1).
Most of our discussions are still applicable to tests like Equation (12), although from a Bayesian hypothesis testing perspective, the prior for θ should include a discrete mass at the location indicated by the point hypothesis.
From a frequentist perspective, the issue of type I error rate inflation (or multiplicity) can arise from repeatedly testing a single hypothesis over time, or testing multiple hypotheses simultaneously (Simon, 1994). From a subjective Bayesian perspective, however, repeated hypothesis testing is not necessarily a problem (see Section 2.2), and multiplicity adjustments are needed only when there are multiple tests. It is worth noting that frequentist and Bayesian philosophies on multiple testing are also quite different (Berry and Hochberg, 1999;Sjölander and Vansteelandt, 2019).
Several R packages have been developed to facilitate the use of frequentist and Bayesian sequential designs in clinical trials. These include gsDesign (Anderson, 2022) and gsbDesign (Gerber and Gsponer, 2016).

Appendix A. Frequentist Sequential Designs
We provide a brief review of frequentist sequential designs. Consider the single-arm trial example in Section 1.2. The maximum type I error rate of this sequential testing procedure is given by Equation (2). Frequentist group sequential designs are concerned with the specification of the stopping boundaries $\{c_1, \ldots, c_K\}$ such that Equation (2) holds for prespecified $\alpha$, $K$, and $\{n_1, \ldots, n_K\}$. The solution to Equation (2) is not unique, so restrictions on the stopping boundaries have been considered.
We give some examples next.
A.1. The Pocock and O'Brien-Fleming Procedures. In the case of equal group sizes (that is, $n_j = jg$ for some $g$), Pocock (1977) proposed to use equal stopping boundaries by setting $c_1 = \cdots = c_K = c_P(K, \alpha)$, while O'Brien and Fleming (1979) suggested decreasing boundaries with $c_j = c_{OBF}(K, \alpha)\sqrt{K/j}$. In either case, the stopping boundaries can be solved through a numerical search. Note that $z = (z_1, \ldots, z_K)^\top$ follows a multivariate normal distribution with $E(z_j) = \theta\sqrt{n_j}/\sigma$, $\mathrm{Var}(z_j) = 1$, and $\mathrm{Cov}(z_j, z_{j'}) = \sqrt{n_j/n_{j'}}$ for $j < j'$. Therefore,

$$\alpha = 1 - \Phi_K(c; 0, \Sigma),$$

where $\Phi_K(\cdot\,; \cdot, \cdot)$ is the cumulative distribution function of a multivariate Gaussian random variable, $c = (c_1, c_2, \ldots, c_K)^\top$, and $\Sigma$ is the covariance matrix of $z$.
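The numerical search can be approximated without a multivariate normal routine. The sketch below replaces the exact calculation with a Monte Carlo approximation, which is our substitution for illustration: it simulates z-statistic paths under $\theta = 0$ with equal group sizes and takes an empirical quantile of the maximum rescaled statistic. The function names, the number of simulated paths, and the seed are hypothetical choices, and the resulting constants carry Monte Carlo error.

```python
import random
from math import sqrt


def simulate_z_paths(K, S, seed=0):
    """Draw S paths (z_1, ..., z_K) under theta = 0 with equal group sizes."""
    rng = random.Random(seed)
    paths = []
    for _ in range(S):
        total, path = 0.0, []
        for j in range(1, K + 1):
            total += rng.gauss(0.0, 1.0)     # standardized increment from group j
            path.append(total / sqrt(j))     # z_j, so that Cov(z_j, z_j') = sqrt(j / j')
        paths.append(path)
    return paths


def boundary_constant(paths, alpha, shape):
    """Constant c with Pr(z_j > c * shape(j, K) for some j | theta = 0) close to alpha."""
    K = len(paths[0])
    stats = sorted(
        max(z / shape(j, K) for j, z in enumerate(path, start=1)) for path in paths
    )
    return stats[int((1.0 - alpha) * len(stats))]   # empirical (1 - alpha) quantile


paths = simulate_z_paths(K=5, S=20000)
c_pocock = boundary_constant(paths, 0.05, lambda j, K: 1.0)            # c_j = c
c_obf = boundary_constant(paths, 0.05, lambda j, K: sqrt(K / j))       # c_j = c * sqrt(K/j)
print(round(c_pocock, 2), round(c_obf, 2))
```

An exact implementation would instead evaluate $\Phi_K$ numerically inside a root-finding loop; the Monte Carlo version is shown only because it needs nothing beyond the standard library.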
A.2. The Error Spending Approach. Slud and Wei (1982) first considered the idea of specifying the error rate spent at each analysis, defined as $\kappa_j = \Pr(z_1 \leq c_1, \ldots, z_{j-1} \leq c_{j-1}, z_j > c_j \mid \theta = 0)$. This represents the probability of rejecting $H_0$ at stage $j$ but not at any previous stages, given that $\theta = 0$. We have $\alpha = \sum_{j=1}^{K} \kappa_j$. Once the $\kappa_j$'s are specified, one can successively calculate the stopping boundaries. Lan and DeMets (1983) further extended this idea and suggested using a function to characterize the rate at which the error rate is spent. This function, denoted by $h(u)$ ($0 \leq u \leq 1$), satisfies $h(0) = 0$ and $h(1) = \alpha$. The $\kappa_j$'s can be chosen such that $\kappa_j = h(n_j/n_K) - h(n_{j-1}/n_K)$ (with the understanding that $n_0 = 0$). Common choices of $h(u)$ include

$$h_1(u) = \alpha \log\left(1 + (e - 1)u\right), \qquad h_2(u) = 2\left[1 - \Phi\left(q_{\alpha/2}/\sqrt{u}\right)\right], \qquad h_3(u) = \alpha u^{\rho} \ \ (\rho > 0).$$

Here, $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution, and $q_{\alpha/2} = \Phi^{-1}(1 - \alpha/2)$ is the upper $(\alpha/2)$ quantile of the standard normal distribution, $\Phi(q_{\alpha/2}) = 1 - \alpha/2$. It has been shown that in the case of equal group sizes, $h_1(u)$ and $h_2(u)$ produce stopping boundaries similar to those given by Pocock's and O'Brien-Fleming's procedures, respectively. Function $h_3$ is known as the power spending function and has been studied by Kim and DeMets (1987b). The error spending approach introduces greater flexibility to sequential designs, as the frequency and timing of the interim analyses do not need to be specified in advance.
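As a small worked example, the error rate spent at each analysis can be computed from a spending function. The sketch below uses the standard Lan-DeMets Pocock-like and O'Brien-Fleming-like functions and the power family with exponent `rho`; the function names are hypothetical, and converting the $\kappa_j$ values into boundaries $c_j$ would additionally require the joint distribution of the z-statistics, which is not attempted here.

```python
from math import e, log, sqrt
from statistics import NormalDist

STD = NormalDist()


def h_pocock_like(u, alpha):
    """Spending function alpha * log(1 + (e - 1) * u)."""
    return alpha * log(1.0 + (e - 1.0) * u)


def h_obf_like(u, alpha):
    """Spending function 2 * [1 - Phi(q_{alpha/2} / sqrt(u))], with h(0) = 0."""
    if u <= 0.0:
        return 0.0
    q = STD.inv_cdf(1.0 - alpha / 2.0)
    return 2.0 * (1.0 - STD.cdf(q / sqrt(u)))


def h_power(u, alpha, rho=1.0):
    """Power spending function alpha * u**rho; rho = 1 gives linear spending."""
    return alpha * u**rho


def error_spent(h, n_list, alpha):
    """kappa_j = h(n_j / n_K) - h(n_{j-1} / n_K), with n_0 = 0."""
    n_K = n_list[-1]
    fractions = [0.0] + [n / n_K for n in n_list]
    return [h(b, alpha) - h(a, alpha) for a, b in zip(fractions, fractions[1:])]


n_list = [200, 400, 600, 800, 1000]
kappa_pocock = error_spent(h_pocock_like, n_list, 0.05)
kappa_obf = error_spent(h_obf_like, n_list, 0.05)
kappa_linear = error_spent(h_power, n_list, 0.05)
print(kappa_pocock, kappa_obf, kappa_linear)
```

Because each function satisfies $h(0) = 0$ and $h(1) = \alpha$, the increments sum to $\alpha$ regardless of how many analyses are performed or when they occur.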
A.3. Stochastic Curtailment Based on Conditional Power. Lan et al. (1982) proposed the idea of stochastic curtailment: at any point in a sequential clinical trial, if the result at the end of the trial is inevitable, the study can be terminated early. Consider the single-arm trial example. Suppose that at the final analysis, $H_0$ will be rejected if the final z-statistic $z_K > q_\eta$, where $q_\eta$ is the upper $\eta$ quantile of the standard normal distribution. Then, at analysis $j \in \{1, \ldots, K-1\}$, the probability that $H_0$ will be rejected upon completion of the study, given $\theta$, is given by

$$\mathrm{CP}_j(\theta) = \Pr(z_K > q_\eta \mid y_j, \theta),$$

where $y_j = (y_1, \ldots, y_{n_j})$ is the vector of accumulating data up to analysis $j$. This is known as the conditional power. A simple calculation shows that

$$\mathrm{CP}_j(\theta) = \Phi\left( \frac{z_j\sqrt{n_j} - q_\eta\sqrt{n_K} + (n_K - n_j)\theta/\sigma}{\sqrt{n_K - n_j}} \right).$$

If, based on current data, $H_0$ will likely be rejected at the final analysis even if the investigational drug has no treatment effect ($\theta = 0$), then the trial may be stopped early. Mathematically, one may stop the trial early if $\mathrm{CP}_j(0) > \gamma$ for some threshold $\gamma$. This is equivalent to $z_j > q_\eta\sqrt{n_K/n_j} + q_{1-\gamma}\sqrt{(n_K - n_j)/n_j}$.
If desirable, one may use different thresholds γ j 's at different interim analyses. An important consideration is the type I error rate of this procedure, but Lan et al. (1982) showed that the error rate is upper bounded by η/γ, regardless of the number of interim analyses. Therefore, if η and γ are chosen such that η/γ ≤ α, the type I error rate is maintained at or below α, even if interim analyses are conducted at arbitrary times. The stopping boundaries based on this argument are typically conservative. However, if the timing of the interim analyses is specified in advance, tighter stopping boundaries can be constructed by calculating the exact type I error rate numerically.
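The conditional power and the implied stopping boundary can be checked numerically. The following minimal sketch assumes the single-arm normal model with known `sigma`; the function names and the choice $\eta = 0.04$, $\gamma = 0.8$ (so that $\eta/\gamma = 0.05$) are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()


def conditional_power(z_j, n_j, n_K, eta, theta=0.0, sigma=1.0):
    """Pr(z_K > q_eta | z_j, theta) for the single-arm normal model."""
    q_eta = STD.inv_cdf(1.0 - eta)                  # upper eta quantile
    n_fut = n_K - n_j
    numerator = z_j * sqrt(n_j) - q_eta * sqrt(n_K) + n_fut * theta / sigma
    return STD.cdf(numerator / sqrt(n_fut))


def curtailment_boundary(n_j, n_K, eta, gamma):
    """Smallest z_j at which the conditional power under theta = 0 reaches gamma."""
    q_eta = STD.inv_cdf(1.0 - eta)
    q_one_minus_gamma = STD.inv_cdf(gamma)          # upper (1 - gamma) quantile
    return q_eta * sqrt(n_K / n_j) + q_one_minus_gamma * sqrt((n_K - n_j) / n_j)


eta, gamma = 0.04, 0.8
boundaries = [curtailment_boundary(n, 1000, eta, gamma) for n in (200, 400, 600, 800)]
cp_at_boundary = conditional_power(boundaries[0], 200, 1000, eta)
print([round(c, 3) for c in boundaries], cp_at_boundary)
```

Evaluating the conditional power exactly at the boundary returns $\gamma$, which confirms that the two formulations of the stopping rule agree.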
A.4. Analysis at the Conclusion of a Sequential Trial. Once a sequential trial has been completed, it is often of interest to construct a point estimate and a confidence interval for the treatment effect θ. Consider again the single-arm trial example.
The results of the trial can be represented by a bivariate random vector $(t, z_t)$, where $t$ denotes the time of stopping, and $z_t$ is the corresponding test statistic. Following Armitage et al. (1969) or Jennison and Turnbull (2000, Chapter 8), the density of $(t, z_t)$ is

$$f(1, z_1 \mid \theta) = \phi\left(z_1 - \theta\sqrt{n_1}/\sigma\right),$$

and for $t = 2, \ldots, K$,

$$f(t, z_t \mid \theta) = \int_{-\infty}^{c_{t-1}} f(t-1, u \mid \theta) \cdot \frac{\sqrt{n_t}}{\sqrt{n_t - n_{t-1}}} \cdot \phi\left( \frac{z_t\sqrt{n_t} - u\sqrt{n_{t-1}} - (n_t - n_{t-1})\theta/\sigma}{\sqrt{n_t - n_{t-1}}} \right) du,$$

with $\phi(\cdot)$ denoting the standard normal density.
The sample mean estimator, $\hat{\theta} = \bar{y}_t$, is a straightforward point estimator for $\theta$. It can be shown that $\hat{\theta}$ is also the maximum likelihood estimator (MLE). However, it is known that the MLE following a sequential trial is biased, and one may correct it by subtracting an estimate of its bias. See, e.g., Whitehead (1986) for more details.
To construct a confidence interval for $\theta$, one needs to define an ordering of the sample space (Tsiatis et al., 1984; Kim and DeMets, 1987a; Rosner and Tsiatis, 1988). For example, based on the stage-wise ordering, $(t', z'_{t'})$ is above $(t, z_t)$ if either (i) $t' = t$ and $z'_{t'} > z_t$, or (ii) $t' < t$. In this case, $(t', z'_{t'})$ is indicative of a larger value of $\theta$ compared to $(t, z_t)$. It can be shown that $\Pr[\text{Observing an outcome above } (t, z_t) \mid \theta]$ is a continuous and monotonically increasing function of $\theta$ for every possible trial outcome $(t, z_t)$ (Kim and DeMets, 1987a). Thus, one can find unique values $\theta_L$ and $\theta_U$ which satisfy

$$\Pr[\text{Observing an outcome above } (t, z_t) \mid \theta_L] = \alpha/2, \qquad \Pr[\text{Observing an outcome above } (t, z_t) \mid \theta_U] = 1 - \alpha/2.$$

Appendix B. The Calibrated Bayesian Perspective
We present more details about the calibrated Bayesian perspective described in Section 2.3. We consider the setup of an infinite series of single-arm trials (described in Section 1.2) with true but unknown treatment effects $\theta^{(1)}, \theta^{(2)}, \ldots \sim \pi_0(\theta)$. For each trial, patient outcomes $y_K \sim f_0(y_K \mid \theta)$ are observed sequentially. The Bayesian design with stopping rules given by Equation (3) is applied to every trial with a prior model $\pi(\theta)$, a sampling model $f(y_K \mid \theta)$, and threshold values $\{\gamma_1, \ldots, \gamma_K\}$. We are interested in the operating characteristics of the Bayesian design over this infinite series of trials, in particular its FDR and FPR.
B.1. Background. We first provide more background on the calibrated Bayesian perspective. Rubin (1984) called a statistical procedure (conservatively) calibrated if the resulting probability statements (at least) have their asserted coverage in repeated practices. Clearly, calibrated procedures are desirable, and Rubin recommended examining operating characteristics to select calibrated Bayesian procedures. Rubin's points were echoed by Little (2006).
The interpretation is that, among the possible θ values from π 0 (θ) that might have generated the observed y K from f 0 (y K | θ), 95% of them belong to I(y K ). Therefore, when the procedure of calculating I(y K ) from f (y K | θ)π(θ) is repeatedly applied to data drawn from f 0 (y K | θ)π 0 (θ), 95% of the calculated credible intervals will cover the true parameter values. We see that posterior probabilities correspond to frequencies of actual events. Similarly, when we claim Pr(θ > 0 | y K ) > 0.95, it means that among the possible θ values that might have generated y K , more than 95% are positive. Rubin (1984) and Rosenbaum and Rubin (1984) also demonstrated that when the model specification is correct, the coverage and interpretation of Bayesian statements are still valid under data-dependent stopping rules. For example, if we conclude Pr(θ > 0 | y j ) > 0.95 at any interim analysis j, it means that more than 95% of the possible θ values that might have generated y j are positive, even if the trial is optionally stopped at analysis j based on the observed data.
Of course, in the presence of model misspecification, the coverage of Bayesian statements is not warranted. In particular, Rubin (1984) and Rosenbaum and Rubin (1984) noted that data-dependent stopping rules increase the sensitivity of Bayesian inference to model specification. Therefore, especially for sequential trial designs, one might want to examine their operating characteristics for a range of plausible f 0 (y | θ)π 0 (θ) (which may deviate from f (y | θ)π(θ)) to select appropriate design parameters.