1 Introduction
1.1 Background
In most clinical trials, patient enrollment is staggered, and patients’ data are collected sequentially. When designing a clinical trial, it is often desirable to include a provision for interim analyses of accumulating data with the potential for modifying the conduct of the study [58, 2]. For example, in a randomized-controlled trial, if an interim analysis demonstrates that the investigational drug is superior to the standard of care, the trial could be stopped early on grounds of ethics and trial efficiency [33]. The BNT162b2 COVID-19 vaccine trial is a recent case in which four interim analyses were planned with the possibility of declaring vaccine efficacy before the planned end of the trial [59].
It is well known that frequentist sequential designs need to be adjusted for the planning of interim analyses to maintain desirable frequentist properties [42]. For Bayesian sequential designs, however, there has been some controversy regarding whether similar adjustments are required [69]. Some have argued that such adjustments are necessary (e.g., [25, 26]), while others have claimed the opposite (e.g., [9, 10, 37]).
In this article, we review different perspectives on Bayesian sequential designs and answer the question of whether Bayesian sequential designs need to be adjusted for interim analyses. Our review is not meant to be comprehensive with regard to methodological details including the type of trial (e.g., single-arm or randomized-controlled), type of outcome (e.g., binary, continuous, or time-to-event), or distributional assumption. Instead, we focus on the fundamentals of Bayesian sequential designs. A single-arm trial example (to be introduced in Section 1.2) will be used throughout to demonstrate these designs, but we present an extension for randomized-controlled trials in Section 2.7. We consider early stopping rules for efficacy, as futility stopping does not increase the type I error rate of a design (it actually reduces the type I error rate). Discussion on futility stopping is deferred to Section 6.
There is a rich literature on sequential designs (e.g., [42, 83, 43]), but the majority is centered around frequentist approaches. There are also comprehensive reviews on Bayesian trial designs in general (e.g., [76, 11, 14]), but most do not extensively address sequential trials. Lastly, there are many insightful discussions on Bayesian sequential designs, such as [18, 9, 10, 28, 42, 29, 22, 37, 69, 78]. However, a systematic review on the fundamentals of Bayesian sequential designs has been lacking, and we attempt to fill this important gap. Furthermore, as mentioned earlier, in existing works, different authors seem to have vastly different opinions on how Bayesian sequential designs should be formulated. It turns out that different authors mean quite different things by “Bayesian sequential designs need/do not need to be adjusted for interim analyses”. We aim to disentangle the practical and philosophical implications behind these different perspectives.
Our contributions include the following. (i) In Bayesian sequential designs, a pertinent question is whether adjustments for the planning of interim analyses are necessary. We attempt to answer this question from multiple perspectives. From a frequentist-oriented perspective, such adjustments are necessary for achieving desirable frequentist properties such as controlling the type I error rates; from a calibrated Bayesian perspective, such adjustments may be needed to achieve desirable operating characteristics under plausible scenarios (we will discuss the differences between achieving desirable operating characteristics versus achieving desirable frequentist properties); lastly, from a subjective Bayesian perspective, such adjustments are unnecessary, and the design only needs to reflect subjective beliefs. We comment on the three perspectives and make our recommendation. (ii) We put forward a proposal for a calibrated Bayesian approach to sequential designs. Specifically, we propose false discovery rate (FDR) and false positive rate (FPR) as potential metrics to evaluate sequential designs. We derive theoretical results regarding the FDR and FPR of a Bayesian sequential design and present simulation studies to demonstrate the practical usage of the calibrated Bayesian approach. (iii) We summarize Bayesian sequential designs based on posterior probabilities, posterior predictive probabilities, and decision-theoretic frameworks. We discuss the connections between designs using posterior credible intervals and those using formal Bayesian hypothesis testing. (iv) It is often believed that according to the likelihood principle (LP), decision making in a sequential trial should not depend on unrealized events. However, our investigation shows that the LP gives little guidance in assessing the overall performance of a decision procedure. In particular, the LP does not preclude one from utilizing additional information (including unrealized events) for decision making. 
Therefore, our view is that the LP should not be used as an argument for or against Bayesian or frequentist sequential designs. To illustrate our findings, we present an example of a Bayesian decision-theoretic design in which different decisions will be made based on the same observed data but different interim analysis plans.
1.2 An Illustrative Example
To illustrate the discussion, consider a single-arm trial that aims to establish the therapeutic effect of an investigational drug. Suppose that a total of K analyses, including $(K-1)$ interim analyses and a final analysis, are planned during the course of the trial. At the jth analysis, data of ${n_{j}}$ patients are accumulated, denoted by ${y_{1}},{y_{2}},\dots ,{y_{{n_{j}}}}$ and assumed independently and normally distributed with mean θ and variance ${\sigma ^{2}}$. Here, θ is parameterized such that a positive value of θ is indicative of a therapeutic effect, and ${\sigma ^{2}}$ is assumed known for simplicity. The planned maximum sample size is denoted by ${n_{K}}$ and can be determined based on a power requirement or the amount of available resources. As a simple example, assume patients are enrolled in groups of equal size g, thus ${n_{j}}=jg$. If $g=1$, it leads to the fully sequential case, known as continuous monitoring; if $g\gt 1$, it is called the group sequential case, which is more feasible in practice. The primary research question of the trial can be formulated as the following hypothesis test,
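(1.1)
\[ {H_{0}}:\theta \le 0\hspace{1em}\text{versus}\hspace{1em}{H_{1}}:\theta \gt 0.\]
At each analysis, the hypothesis test is performed. If a certain stopping rule is triggered, say the z-statistic ${z_{j}}\gt {c_{j}}$ for some stopping boundary ${c_{j}}$, ${H_{0}}$ is rejected, and the trial is terminated for efficacy. Here,
\[ {z_{j}}=\frac{\sqrt{{n_{j}}}\hspace{0.1667em}{\bar{y}_{j}}}{\sigma },\hspace{1em}\text{where}\hspace{1em}{\bar{y}_{j}}=\frac{1}{{n_{j}}}{\sum _{i=1}^{{n_{j}}}}{y_{i}}.\]
This is referred to as data-dependent or optional stopping. When σ is unknown, one would replace the z-statistics with the corresponding t-statistics; little would change in the overall setup. A question central to sequential designs is the specification of those stopping boundaries.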
1.3 Overview of Frequentist and Bayesian Sequential Designs
Frequentist sequential designs are concerned with controlling the overall type I error rate of the sequential testing procedure. The type I error rate refers to the probability of falsely rejecting ${H_{0}}$ at any analysis (in hypothetical repetitions of the trial), given that ${H_{0}}$ is true. In the single-arm trial example, the maximum type I error rate is attained when $\theta =0$ and is given by
(1.2)
\[ \alpha =\Pr ({z_{1}}\gt {c_{1}}\hspace{2.5pt}\text{or}\hspace{2.5pt}{z_{2}}\gt {c_{2}}\hspace{2.5pt}\text{or}\hspace{2.5pt}\cdots \hspace{2.5pt}\text{or}\hspace{2.5pt}{z_{K}}\gt {c_{K}}\mid \theta =0).\]
If each test is performed at a constant nominal level, α will inflate as K grows and will eventually converge to 1 as $K\to \infty $ [3]. Therefore, adjustments to the stopping boundaries are necessary to ensure that the type I error rate is maintained at a desirable level. Examples of such adjustments include the Pocock or O’Brien-Fleming procedure [58, 55], the error spending approach [74, 48], and the stochastic curtailment approach [49]. We provide a brief review of some frequentist sequential designs in Section S.1 of the Supplementary Material.
Without accounting for the sequential nature of the hypothesis test, Bayesian designs can suffer the same problem of type I error inflation, which can be unsettling for statisticians who care about controlling the type I error rates. Therefore, in many Bayesian sequential trial designs, the stopping boundaries are also determined to control the type I error rate at a desirable level [85, 71]. As an example, the recent BNT162b2 COVID-19 vaccine trial was designed using a Bayesian approach with four planned interim analyses [59]. The stopping boundaries were chosen such that the overall type I error rate was controlled at 2.5%. Indeed, regulatory agencies generally recommend demonstration of adequate control of the type I error rate for any trial design to be acceptable [25, 26]. On the other hand, the type I error rate is a frequentist concept, the calculation of which involves an average over unrealized events such as hypothetical repetitions of the trial. Bayesian inference can be performed based solely on the observed data from the actual (and lone) trial and does not have to be concerned with type I error rate control, since the same trial is not assumed to repeat, hypothetically or in practice. Some think that the type I error rate is not the quantity that one should pay most attention to [38]. Also, according to the likelihood principle (LP), unrealized events should be irrelevant to the statistical evidence about a parameter [7]. Therefore, some Bayesian statisticians have written that the choice of the stopping rules does not need to depend on the planning of interim analyses [9, 10].
For example, one may stop the trial at any analysis provided that $\Pr (\theta \gt 0\mid \text{data})$ exceeds some threshold, or if stopping minimizes the posterior expected loss. We will elaborate on these issues in the upcoming sections.
The remainder of the paper is structured as follows. In Section 2, motivated by a sequential design based on posterior probabilities, we summarize the philosophy of Bayesian sequential designs into three categories. In Section 3, we review selected Bayesian sequential designs based on posterior predictive probabilities and decision-theoretic frameworks. In Section 4, we comment on the LP, which is commonly tied to statistical inference and decision making in sequential clinical trials. In Section 5, we present some numerical studies. Finally, in Section 6, we conclude and discuss some other considerations including futility stopping rules and two-sided tests. A brief review of frequentist designs, proof of the theoretical results, and the code for reproducing the simulation studies are provided in the Supplementary Material.
2 Three Perspectives on Bayesian Sequential Designs
Consider the single-arm trial in Section 1.2. In Bayesian sequential designs, the early stopping rules are typically based on the posterior probability (PP) of θ being greater than some threshold (e.g., [80, 39]). Assume the time and frequency of interim analyses are given in advance. Let $\pi (\theta )$ denote the prior distribution of θ. At analysis j, the posterior distribution of θ is given by Bayes’ rule,
\[ p(\theta \mid {\boldsymbol{y}_{j}})=\frac{f({\boldsymbol{y}_{j}}\mid \theta )\pi (\theta )}{\textstyle\int f({\boldsymbol{y}_{j}}\mid \theta )\pi (\theta )\text{d}\theta },\]
where ${\boldsymbol{y}_{j}}=({y_{1}},\dots ,{y_{{n_{j}}}})$ is the vector of accumulating data up to analysis j, and $f({\boldsymbol{y}_{j}}\mid \theta )$ denotes the sampling distribution of ${\boldsymbol{y}_{j}}$. When the prior for θ is a conjugate normal distribution, $\theta \sim \text{N}(\mu ,{\nu ^{2}})$, the above posterior is available in closed form,
\[ \theta \mid {\boldsymbol{y}_{j}}\sim \text{N}\left(\frac{\mu {\nu ^{-2}}+{\bar{y}_{j}}{n_{j}}{\sigma ^{-2}}}{{\nu ^{-2}}+{n_{j}}{\sigma ^{-2}}},\frac{1}{{\nu ^{-2}}+{n_{j}}{\sigma ^{-2}}}\right).\]
If
(2.1)
\[ \Pr (\theta \gt 0\mid {\boldsymbol{y}_{j}})\gt {\gamma _{j}}\]
for some threshold ${\gamma _{j}}$, ${H_{0}}$ is rejected, the trial is stopped, and efficacy of the drug is declared. This is equivalent to
(2.2)
\[ {z_{j}}\gt {c_{j}},\hspace{1em}\text{where}\hspace{1em}{c_{j}}={q_{1-{\gamma _{j}}}}\sqrt{1+\frac{{\nu ^{-2}}}{{n_{j}}{\sigma ^{-2}}}}-\frac{\mu {\nu ^{-2}}}{\sqrt{{n_{j}}{\sigma ^{-2}}}},\]
and ${q_{1-{\gamma _{j}}}}$ is the upper $(1-{\gamma _{j}})$ quantile of the standard normal distribution. It remains to specify the prior $\pi (\theta )$ and threshold values $\{{\gamma _{1}},\dots ,{\gamma _{K}}\}$. We present three perspectives next and our comments and recommendation later in Section 2.4.
2.1 The Frequentist-oriented Perspective
Without accounting for multiple looks at the data, the stopping rule in Equation (2.2) can lead to type I error rate inflation. As an example, consider a $\text{N}(0,{1^{2}})$ prior on θ and constant threshold values ${\gamma _{1}}=\cdots ={\gamma _{K}}=0.95$. Suppose the outcome variance ${\sigma ^{2}}=1$, the maximum sample size ${n_{K}}=1000$, and patients are enrolled in equal group sizes. Using Equation (1.2), the type I error rates are $\alpha =0.05,0.08,0.13,0.17,0.30$, and 0.39 for $K=1,2,5,10,100$, and 1000, respectively. Therefore, due to regulatory guidance [25, 26], one should adjust $\pi (\theta )$ and $\{{\gamma _{1}},\dots ,{\gamma _{K}}\}$ according to the planning of interim analyses to achieve desirable type I error rate control (and possibly other frequentist properties). We refer to this as a frequentist-oriented approach.
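The inflation described above can be checked by simulation. The following minimal Python sketch estimates Equation (1.2) by Monte Carlo for the $\text{N}(0,{1^{2}})$ prior, ${\sigma ^{2}}=1$, ${n_{K}}=1000$, and constant thresholds of 0.95. The function names `boundary` and `type_one_error`, the number of simulated trials, and the seed are illustrative assumptions and are not taken from the Supplementary Material; the estimates carry Monte Carlo error.

```python
import math
import random
from statistics import NormalDist


def boundary(gamma, n, sigma2=1.0, mu=0.0, nu2=1.0):
    """Stopping boundary c_j of Equation (2.2) for the z-statistic."""
    q = NormalDist().inv_cdf(gamma)      # upper (1 - gamma) quantile
    info = n / sigma2                    # n_j * sigma^{-2}
    return q * math.sqrt(1.0 + (1.0 / nu2) / info) - (mu / nu2) / math.sqrt(info)


def type_one_error(K, n_max=1000, gamma=0.95, n_sims=20000, seed=1):
    """Monte Carlo estimate of Equation (1.2) with theta = 0 and sigma^2 = 1."""
    rng = random.Random(seed)
    g = n_max // K                       # equal group size
    bounds = [boundary(gamma, j * g) for j in range(1, K + 1)]
    rejections = 0
    for _ in range(n_sims):
        total = 0.0                      # running sum of all outcomes so far
        for j in range(1, K + 1):
            # The sum of g independent N(0, 1) outcomes is N(0, g).
            total += rng.gauss(0.0, math.sqrt(g))
            z = total / math.sqrt(j * g)
            if z > bounds[j - 1]:        # first boundary crossing: stop and reject
                rejections += 1
                break
    return rejections / n_sims


alphas = {K: type_one_error(K) for K in (1, 2, 5, 10)}
for K, alpha in alphas.items():
    print(f"K = {K:2d}: estimated type I error rate = {alpha:.3f}")
```

Only the group sums are simulated, since the z-statistics depend on the data through the cumulative sum alone. Larger values of K can be examined in the same way at a higher computational cost.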
With an intended type I error rate, the parameters in a Bayesian sequential design can be chosen in multiple ways. For prespecified threshold values, type I error rate control can be achieved by using a conservative prior. [28] and [29] demonstrated that by tuning the prior distribution of θ, one could achieve stopping boundaries similar to or more conservative than Pocock’s or O’Brien-Fleming’s boundaries. In our case, we can simply set $\mu =0$ and adjust ${\nu ^{2}}$ according to the planning of interim analyses. From Equation (2.2), when $\mu =0$, the stopping boundaries monotonically increase as ${\nu ^{2}}$ decreases. For example, consider the single-arm trial with an outcome variance of ${\sigma ^{2}}=1$, a maximum sample size of 1000, $K=5$ analyses, and equal group sizes. Then, with threshold values ${\gamma _{j}}\equiv 0.95$, a $\text{N}(0,{0.054^{2}})$ prior for θ controls the type I error rate at 0.05. The corresponding stopping boundaries for ${z_{j}}$’s are shown in Table 3.
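To make the role of the prior variance concrete, the sketch below evaluates the stopping boundaries in Equation (2.2) for the example above under a vague $\text{N}(0,{1^{2}})$ prior and the conservative $\text{N}(0,{0.054^{2}})$ prior, and checks numerically that the boundary agrees with the posterior probability rule. The helper names `posterior_prob_positive` and `boundary` are hypothetical, and the code is an illustration under the stated conjugate normal assumptions rather than the authors' implementation.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def posterior_prob_positive(ybar, n, sigma2, mu, nu2):
    """Pr(theta > 0 | data) under the conjugate N(mu, nu2) prior."""
    precision = 1.0 / nu2 + n / sigma2
    post_mean = (mu / nu2 + ybar * n / sigma2) / precision
    post_sd = math.sqrt(1.0 / precision)
    return STD_NORMAL.cdf(post_mean / post_sd)


def boundary(gamma, n, sigma2, mu, nu2):
    """Stopping boundary c_j of Equation (2.2) for the z-statistic."""
    q = STD_NORMAL.inv_cdf(gamma)        # upper (1 - gamma) quantile
    info = n / sigma2
    return q * math.sqrt(1.0 + (1.0 / nu2) / info) - (mu / nu2) / math.sqrt(info)


sigma2, mu, gamma = 1.0, 0.0, 0.95
sample_sizes = [200 * j for j in range(1, 6)]    # K = 5, n_K = 1000
vague = [boundary(gamma, n, sigma2, mu, nu2=1.0) for n in sample_sizes]
tight = [boundary(gamma, n, sigma2, mu, nu2=0.054 ** 2) for n in sample_sizes]

for n, c_vague, c_tight in zip(sample_sizes, vague, tight):
    print(f"n = {n:4d}: c (nu = 1) = {c_vague:.3f}, c (nu = 0.054) = {c_tight:.3f}")
```

With $\mu =0$, the tighter prior yields a larger boundary at every analysis, and the difference is greatest at the early analyses, where the prior carries more weight relative to the data.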
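Alternatively, for a given prior $\pi (\theta )$, type I error rate control can be attained by adjusting the threshold values $\{{\gamma _{1}},\dots ,{\gamma _{K}}\}$. For the single-arm trial example, one may equate the stopping boundaries in Equation (2.2) to the corresponding boundaries in any frequentist sequential design. For example, if $\{{c_{1}},\dots ,{c_{K}}\}$ are O’Brien-Fleming boundaries, then solving Equation (2.2) for ${\gamma _{j}}$ shows that ${\gamma _{j}}$ may be set at
\[ {\gamma _{j}}=\Phi \left(\frac{{c_{j}}+\mu {\nu ^{-2}}/\sqrt{{n_{j}}{\sigma ^{-2}}}}{\sqrt{1+{\nu ^{-2}}/({n_{j}}{\sigma ^{-2}})}}\right),\]
where Φ denotes the cumulative distribution function of the standard normal distribution.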
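Inverting Equation (2.2) for the thresholds can be sketched as follows. The boundaries take the O’Brien-Fleming shape ${c_{j}}=C\sqrt{K/j}$; the constant $C=2.04$ is an illustrative assumption that in practice would be calibrated to the target type I error rate, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def boundary(gamma, n, sigma2, mu, nu2):
    """Stopping boundary c_j of Equation (2.2) for the z-statistic."""
    q = STD_NORMAL.inv_cdf(gamma)
    info = n / sigma2
    return q * math.sqrt(1.0 + (1.0 / nu2) / info) - (mu / nu2) / math.sqrt(info)


def gamma_from_boundary(c, n, sigma2, mu, nu2):
    """Threshold gamma_j for which Equation (2.2) reproduces the boundary c."""
    info = n / sigma2
    q = (c + (mu / nu2) / math.sqrt(info)) / math.sqrt(1.0 + (1.0 / nu2) / info)
    return STD_NORMAL.cdf(q)


K, C = 5, 2.04                           # C is an assumed, illustrative constant
sigma2, mu, nu2 = 1.0, 0.0, 1.0          # N(0, 1) prior, unit outcome variance
sample_sizes = [200 * j for j in range(1, K + 1)]
obf = [C * math.sqrt(K / j) for j in range(1, K + 1)]
gammas = [gamma_from_boundary(c, n, sigma2, mu, nu2)
          for c, n in zip(obf, sample_sizes)]

for j, (c, gam) in enumerate(zip(obf, gammas), start=1):
    print(f"analysis {j}: c = {c:.3f}, gamma = {gam:.6f}")
```

Because the O’Brien-Fleming boundaries are very high at early analyses, the implied thresholds are close to 1 early on and decrease toward the final analysis.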
2.2 The Subjective Bayesian Perspective
From a subjective Bayesian point of view (see, e.g., [36, 62]), the prior $\pi (\theta )$ should be specified to reflect a subjective belief on θ before the trial, and the threshold values $\{{\gamma _{1}},\dots ,{\gamma _{K}}\}$ should be chosen to represent personal tolerance of risk. For example, a positive (or negative) prior mean for θ represents that the investigator’s prior belief on the treatment effect is optimistic (or pessimistic). Similarly, the prior variance for θ reflects the investigator’s uncertainty about the prior opinion. In practice, $\pi (\theta )$ could be elicited from preclinical data and historical clinical trials with a similar setting. On the other hand, the choice of the threshold values can be justified from a decision-theoretic perspective. See, e.g., [60] (Chapter 5.2). At analysis j, the possible decision is denoted by ${\varphi _{j}}$, where ${\varphi _{j}}=1$ (or 0) indicates rejecting ${H_{0}}$ and stopping the trial (or failing to reject ${H_{0}}$ and continuing enrollment if $j\lt K$). Assume the loss associated with decision ${\varphi _{j}}$ is
(2.3)
\[ {\ell _{j}}({\varphi _{j}},\theta )=\left\{\begin{array}{l@{\hskip10.0pt}l}{\xi _{1j}}\cdot \mathbf{1}(\theta \le 0),\hspace{1em}\hspace{1em}& \text{if}\hspace{2.5pt}{\varphi _{j}}=1\text{;}\\ {} {\xi _{0j}}\cdot \mathbf{1}(\theta \gt 0),\hspace{1em}\hspace{1em}& \text{if}\hspace{2.5pt}{\varphi _{j}}=0\text{.}\end{array}\right.\]
Then, the posterior expected loss of ${\varphi _{j}}$ is ${L_{j}}({\varphi _{j}},{\boldsymbol{y}_{j}})=\textstyle\int {\ell _{j}}({\varphi _{j}},\theta )p(\theta \mid {\boldsymbol{y}_{j}})\text{d}\theta $, and the decision that minimizes ${L_{j}}({\varphi _{j}},{\boldsymbol{y}_{j}})$ is
\[ {\tilde{\varphi }_{j}}({\boldsymbol{y}_{j}})=\left\{\begin{array}{l@{\hskip10.0pt}l}1,\hspace{1em}\hspace{1em}& \text{if}\hspace{2.5pt}\Pr (\theta \gt 0\mid {\boldsymbol{y}_{j}})\gt \frac{{\xi _{1j}}}{{\xi _{0j}}+{\xi _{1j}}}\text{;}\\ {} 0,\hspace{1em}\hspace{1em}& \text{otherwise.}\end{array}\right.\]
By setting ${\gamma _{j}}$ at ${\xi _{1j}}/({\xi _{0j}}+{\xi _{1j}})$, the stopping rule in Equation (2.2) minimizes the posterior expected loss. In practice, one could specify the loss function ${\ell _{j}}({\varphi _{j}},\theta )$ based on personal tolerance of risk and then derive the ${\gamma _{j}}$’s subsequently. For example, if one wants to be conservative about rejections early in the trial, one could consider increasing the loss of false rejections at early interim analyses [64]. Of course, the particular loss function in Equation (2.3) is a naive choice and ignores the cost of patient enrollment. A more stringent way of formulating the loss function should take into account the sequential nature of the trial. For example, a decision to continue the trial should be made based on balancing the cost of enrolling more patients and the gain of acquiring more information. More discussion on this point is deferred to Section 3.2.
We see that by taking this particular subjective Bayesian approach, one does not need to take frequentist properties into account. For example, suppose that ${\xi _{1j}}=19\cdot {\xi _{0j}}$ for all j, then one can reject ${H_{0}}$ and stop the trial at any analysis as long as $\Pr (\theta \gt 0\mid {\boldsymbol{y}_{j}})\gt 0.95$. As [21] stated, “it is entirely appropriate to collect data until a point has been proven or disproven, or until the data collector runs out of time, money, or patience.” This point has also been made by [37].
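The link between the loss function in Equation (2.3) and the threshold can be sketched in a few lines of Python. The function names are hypothetical, and the losses ${\xi _{0j}}=1$ and ${\xi _{1j}}=19$ are the illustrative values from the example above.

```python
def posterior_expected_loss(decision, prob_positive, xi0, xi1):
    """Posterior expected loss under Equation (2.3).

    prob_positive is Pr(theta > 0 | data); decision is 1 (reject) or 0.
    """
    if decision == 1:
        return xi1 * (1.0 - prob_positive)   # a loss is incurred only if theta <= 0
    return xi0 * prob_positive               # a loss is incurred only if theta > 0


def bayes_decision(prob_positive, xi0, xi1):
    """Decision minimizing the posterior expected loss (ties resolved as 0)."""
    reject_loss = posterior_expected_loss(1, prob_positive, xi0, xi1)
    continue_loss = posterior_expected_loss(0, prob_positive, xi0, xi1)
    return 1 if reject_loss < continue_loss else 0


xi0, xi1 = 1.0, 19.0
threshold = xi1 / (xi0 + xi1)                # implied gamma_j
decision_high = bayes_decision(0.96, xi0, xi1)
decision_low = bayes_decision(0.94, xi0, xi1)
print(threshold, decision_high, decision_low)
```

Comparing the two expected losses directly gives the same decision as comparing $\Pr (\theta \gt 0\mid {\boldsymbol{y}_{j}})$ with ${\xi _{1j}}/({\xi _{0j}}+{\xi _{1j}})$.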
Such a procedure is vulnerable to type I error rate inflation, which would bother many practitioners. However, it has been argued that the type I error rate is not the quantity that one should pay most attention to [38], because its calculation is conditioned on an assumption rather than something knowable. Subjective Bayesians argue that what matters is the probability of “regulator’s regret”, $\Pr (\theta \le 0\mid \text{data})$, conditioned on the available data. Also, the calculation of the type I error rate involves an average over unrealized events that may arise for hypothetical values of θ. However, based on the LP, unobserved events are irrelevant to the evidence about θ [9, 10]. We provide more discussion in Section 4.
A related critique of the subjective Bayesian approach is the issue of “sampling to a foregone conclusion” [17]. However, [9, 10] argued that this is not a threat, because the sequence of posterior probabilities, $\{\Pr (\theta \gt 0\mid {y_{1}},\dots ,{y_{n}}):n=1,2,\dots \}$, is a martingale. If the posterior probability of $\{\theta \gt 0\}$ is less than 0.95 given n observations, say 0.94, then after the next observation, it may increase or decrease with an expected value of 0.94. In other words, one cannot guarantee reaching $\Pr (\theta \gt 0\mid \text{data})\gt 0.95$ with more data. Specifically, when the sampling distribution of ${y_{i}}$’s is normal, the expected number of additional observations required to raise $\Pr (\theta \gt 0\mid \text{data})$ by any prescribed amount is infinite. This is analogous to the expected hitting time of a Brownian motion, which is infinite (see, e.g., Chapter 8.2 in [66]).
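The martingale property can be illustrated numerically in the conjugate normal setting. The sketch below starts from a hypothetical state in which $\Pr (\theta \gt 0\mid \text{data})=0.94$, draws the next observation from its posterior predictive distribution, and averages the updated posterior probabilities. The prior, sample size, number of draws, and function name are assumptions chosen for illustration.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def expected_next_posterior_prob(post_mean, post_var, sigma2,
                                 n_draws=100000, seed=3):
    """Average of Pr(theta > 0 | data, y_next) over the predictive of y_next."""
    rng = random.Random(seed)
    pred_sd = math.sqrt(post_var + sigma2)            # predictive sd of y_next
    new_precision = 1.0 / post_var + 1.0 / sigma2     # precision after the update
    new_sd = math.sqrt(1.0 / new_precision)
    total = 0.0
    for _ in range(n_draws):
        y_next = rng.gauss(post_mean, pred_sd)
        new_mean = (post_mean / post_var + y_next / sigma2) / new_precision
        total += STD_NORMAL.cdf(new_mean / new_sd)
    return total / n_draws


# Hypothetical current state: N(0, 1) prior, sigma^2 = 1, n = 50 observations,
# with the posterior mean chosen so that Pr(theta > 0 | data) = 0.94.
sigma2 = 1.0
post_var = 1.0 / (1.0 + 50 / sigma2)
post_mean = STD_NORMAL.inv_cdf(0.94) * math.sqrt(post_var)
current_prob = STD_NORMAL.cdf(post_mean / math.sqrt(post_var))
next_prob = expected_next_posterior_prob(post_mean, post_var, sigma2)
print(f"current: {current_prob:.4f}, expected after one more observation: {next_prob:.4f}")
```

Up to Monte Carlo error, the average updated probability matches the current one, even though individual updates move above or below it.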
2.3 The Calibrated Bayesian Perspective
Although Bayesian probabilities represent degrees of belief in some formal sense, for practitioners and regulatory agencies, it can be pertinent to examine the operating characteristics of Bayesian designs in repeated practices. One could calibrate the prior and threshold values in a Bayesian sequential design to achieve desirable operating characteristics under a range of plausible scenarios, and we refer to this as a calibrated Bayesian approach [68, 52]. We provide more background on the calibrated Bayesian perspective in Section S.2.1 of the Supplementary Material.
We distinguish between operating characteristics and frequentist properties: we use the former to refer to the long-run average behaviors of a statistical procedure in a series of (possibly different) trials, and use the latter to refer to those in (imaginary) repetitions of the same trial. In other words, operating characteristics represent averages over a joint data-parameter distribution, while frequentist properties represent averages over a data distribution given a fixed parameter. See, e.g., [68, 4]. Frequentist properties are a special class of operating characteristics.
What kinds of operating characteristics could be examined? Consider the single-arm trial example. Imagine an infinite series of such trials with true but unknown treatment effects $\{{\theta ^{(1)}},{\theta ^{(2)}},\dots \}$, which constitute some population distribution ${\pi _{0}}(\theta )$. For each trial, the patient outcomes follow ${\boldsymbol{y}_{K}}\sim {f_{0}}({\boldsymbol{y}_{K}}\mid \theta )$ and are observed sequentially, where ${\boldsymbol{y}_{K}}=({y_{1}},\dots ,{y_{{n_{K}}}})$. Suppose a Bayesian design with stopping rules given by Equation (2.1) is applied to every trial with a prior model $\pi (\theta )$, a sampling model $f({\boldsymbol{y}_{K}}\mid \theta )$, and threshold values $\{{\gamma _{1}},\dots ,{\gamma _{K}}\}$. Similar to the rationale of type I error rate control, we propose to control the FDR and FPR of the design in the infinite series of trials for a range of plausible ${f_{0}}({\boldsymbol{y}_{K}}\mid \theta ){\pi _{0}}(\theta )$. This is because false rejections of the null may result in continuation of a drug development program that will ultimately fail, increasing the cost associated with the failure. The FDR is the relative frequency of false rejections among all trials in which ${H_{0}}$ is rejected, and the FPR is the relative frequency of false rejections among all trials with a nonpositive treatment effect θ. Mathematically, let
(2.4)
\[ \Gamma =\big\{{\boldsymbol{y}_{K}}:\exists j\in \{1,\dots ,K\}\hspace{0.2778em}\text{s.t.}\hspace{2.5pt}\Pr (\theta \gt 0\mid {\boldsymbol{y}_{j}})\gt {\gamma _{j}}\hspace{0.2778em}\text{at analysis}\hspace{2.5pt}j\big\}\]
denote the rejection region of the design. That is, ${H_{0}}$ is rejected if ${\boldsymbol{y}_{K}}\in \Gamma $. Then,
\[\begin{aligned}{}\text{FDR}({\pi _{0}},{f_{0}},\Gamma )& =\frac{{\textstyle\int _{{\boldsymbol{y}_{K}}\in \Gamma }}{\textstyle\int _{\theta \le 0}}{f_{0}}({\boldsymbol{y}_{K}}\mid \theta ){\pi _{0}}(\theta )\text{d}\theta \text{d}{\boldsymbol{y}_{K}}}{{\textstyle\int _{{\boldsymbol{y}_{K}}\in \Gamma }}{f_{0}}({\boldsymbol{y}_{K}})\text{d}{\boldsymbol{y}_{K}}},\hspace{1em}\text{and}\\ {} \text{FPR}({\pi _{0}},{f_{0}},\Gamma )& =\frac{{\textstyle\int _{{\boldsymbol{y}_{K}}\in \Gamma }}{\textstyle\int _{\theta \le 0}}{f_{0}}({\boldsymbol{y}_{K}}\mid \theta ){\pi _{0}}(\theta )\text{d}\theta \text{d}{\boldsymbol{y}_{K}}}{{\textstyle\int _{\theta \le 0}}{\pi _{0}}(\theta )\text{d}\theta }.\end{aligned}\]
Our definitions of the FDR and FPR are slightly different from, but closely related to, their typical definitions in a frequentist sense (see, e.g., [79]).
The calibration of the design parameters is typically done through computer simulations. For each plausible ${f_{0}}({\boldsymbol{y}_{K}}\mid \theta ){\pi _{0}}(\theta )$, one could generate S hypothetical trials with treatment effects $\{{\theta ^{(1)}},{\theta ^{(2)}},\dots ,{\theta ^{(S)}}\}$ and outcomes $\{{\boldsymbol{y}_{K}^{(1)}},{\boldsymbol{y}_{K}^{(2)}},\dots ,{\boldsymbol{y}_{K}^{(S)}}\}$ (for some large S). Then, the FDR and FPR are respectively approximated by
(2.5)
\[ \begin{aligned}{}\widehat{\text{FDR}}& =\frac{{\textstyle\sum _{s=1}^{S}}\mathbf{1}\left({\boldsymbol{y}_{K}^{(s)}}\in \Gamma ,{\theta ^{(s)}}\le 0\right)}{{\textstyle\sum _{s=1}^{S}}\mathbf{1}\left({\boldsymbol{y}_{K}^{(s)}}\in \Gamma \right)},\hspace{1em}\text{and}\\ {} \widehat{\text{FPR}}& =\frac{{\textstyle\sum _{s=1}^{S}}\mathbf{1}\left({\boldsymbol{y}_{K}^{(s)}}\in \Gamma ,{\theta ^{(s)}}\le 0\right)}{{\textstyle\sum _{s=1}^{S}}\mathbf{1}\left({\theta ^{(s)}}\le 0\right)}.\end{aligned}\]
The prior and threshold values in the Bayesian design can be chosen such that $\widehat{\text{FDR}}$ and $\widehat{\text{FPR}}$ do not exceed some prespecified levels for every plausible ${f_{0}}({\boldsymbol{y}_{K}}\mid \theta ){\pi _{0}}(\theta )$. Note that the simulations here are different from those for frequentist-oriented approaches. For the latter, hypothetical repetitions of the same trial are simulated with an assumed true treatment effect.
In certain contexts, there are theoretical guarantees on the operating characteristics of Bayesian sequential designs. Specifically, the following proposition provides such an example.
Proposition 2.1.
Let Γ in Equation (2.4) represent the rejection region of a Bayesian design. Assume the joint model for $({\boldsymbol{y}_{K}},\theta )$ in the Bayesian design is the same as the actual joint distribution of $({\boldsymbol{y}_{K}},\theta )$ in a series of trials, i.e., $f({\boldsymbol{y}_{K}}\mid \theta )\pi (\theta )={f_{0}}({\boldsymbol{y}_{K}}\mid \theta ){\pi _{0}}(\theta )$. Then, the FDR and FPR of the Bayesian design are upper bounded regardless of the time (${n_{j}}$’s) and frequency (K) of interim analyses,
\[\begin{aligned}{}& \textit{FDR}({\pi _{0}},{f_{0}},\Gamma )\le 1-{\gamma _{\min }},\hspace{1em}\textit{and}\\ {} & \textit{FPR}({\pi _{0}},{f_{0}},\Gamma )\le \frac{(1-{\gamma _{\min }})\cdot {\textstyle\int _{\theta \gt 0}}\pi (\theta )\textit{d}\theta }{{\gamma _{\min }}\cdot {\textstyle\int _{\theta \le 0}}\pi (\theta )\textit{d}\theta },\end{aligned}\]
where ${\gamma _{\min }}=\min \{{\gamma _{1}},\dots ,{\gamma _{K}}\}$.
The proof is given in Sections S.2.2 and S.2.3 of the Supplementary Material. Therefore, from a calibrated Bayesian perspective, the prior on θ could be elicited to resemble the actual distribution of θ in repeated practices, and the threshold values reflect acceptable FDR and FPR levels.
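The Monte Carlo approximation in Equation (2.5) and the bounds in Proposition 2.1 can be illustrated together. The sketch below assumes that the design prior coincides with the population distribution of θ (both $\text{N}(0,{0.2^{2}})$ here), ${\sigma ^{2}}=1$, $K=5$ analyses with groups of 20 patients, and a constant threshold of 0.95. These settings, the function name `simulate_fdr_fpr`, and the seed are hypothetical choices for illustration, not the settings of the simulation studies in Section 5.

```python
import math
import random
from statistics import NormalDist


def simulate_fdr_fpr(mu=0.0, nu2=0.04, sigma2=1.0, K=5, g=20, gamma=0.95,
                     n_trials=20000, seed=7):
    """Estimate the FDR and FPR as in Equation (2.5), with pi_0 equal to the prior."""
    rng = random.Random(seed)
    q = NormalDist().inv_cdf(gamma)
    rejected = null_true = false_reject = 0
    for _ in range(n_trials):
        theta = rng.gauss(mu, math.sqrt(nu2))         # theta drawn from pi_0
        total, reject = 0.0, False
        for j in range(1, K + 1):
            # The sum of g outcomes with mean theta is N(g * theta, g * sigma2).
            total += rng.gauss(g * theta, math.sqrt(g * sigma2))
            precision = 1.0 / nu2 + (j * g) / sigma2
            post_mean = (mu / nu2 + total / sigma2) / precision
            # Pr(theta > 0 | y_j) > gamma iff post_mean / post_sd > q.
            if post_mean * math.sqrt(precision) > q:
                reject = True
                break
        rejected += reject
        null_true += theta <= 0
        false_reject += reject and theta <= 0
    return false_reject / max(rejected, 1), false_reject / max(null_true, 1)


gamma = 0.95
fdr, fpr = simulate_fdr_fpr(gamma=gamma)
prob_null = NormalDist(0.0, 0.2).cdf(0.0)             # prior Pr(theta <= 0)
fdr_bound = 1.0 - gamma
fpr_bound = (1.0 - gamma) * (1.0 - prob_null) / (gamma * prob_null)
print(f"FDR estimate {fdr:.3f} (bound {fdr_bound:.3f})")
print(f"FPR estimate {fpr:.3f} (bound {fpr_bound:.3f})")
```

Unlike a type I error simulation, each simulated trial here has its own treatment effect drawn from ${\pi _{0}}(\theta )$. Changing K or g in this sketch gives a way to examine the claim that the bounds do not depend on the time or frequency of interim analyses.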
In general, requiring a design to have good operating characteristics (under plausible scenarios) is more lenient than requiring it to have good frequentist properties (for all possible parameter values). For example, the type I error rate is essentially the FPR when ${\pi _{0}}(\theta )$ is a point mass. Stringent type I error rate control requires that the FPR be controlled for all possible ${\pi _{0}}(\theta )$, even when ${\pi _{0}}(\theta )$ is a point mass at 0, while the calibrated Bayesian approach only requires the FPR to be controlled for plausible ${\pi _{0}}(\theta )$. In this sense, the calibrated Bayesian approach can be thought of as a middle ground between the frequentist-oriented approach and the subjective Bayesian approach.
2.4 Our Comments on the Three Perspectives
We have reviewed three perspectives on Bayesian sequential designs, which are summarized in Table 1. Although the three perspectives seem contradictory, they are not mutually exclusive. For example, if the investigator is conservative about a new drug and is cautious about false rejections, then he/she may take a subjective Bayesian approach with a large loss for a false positive decision. This can lead to low FDR and FPR, or even a low type I error rate. In other words, subjective Bayesians may produce desirable operating characteristics for calibrated Bayesians, or desirable frequentist properties for frequentist-oriented Bayesians.
Table 1
Summary of the three perspectives on Bayesian sequential designs.
Perspective | Description | Suitable contexts |
Frequentist-oriented | Specifying design parameters to achieve desirable frequentist properties (e.g., type I error rate) | Large-scale confirmatory trials |
Subjective Bayesian | Specifying design parameters to reflect subjective beliefs and personal tolerance of risk | Trials for rare diseases; pediatric trials for small populations |
Calibrated Bayesian | Specifying design parameters to achieve desirable operating characteristics (e.g., FDR and FPR) under plausible scenarios | Animal studies for drug screening; early-phase trials (e.g., dose finding) |
In some contexts, a specific approach can be more applicable and acceptable compared to the others. For example, for large-scale confirmatory trials (e.g., COVID-19 vaccine trials), type I error rate control is enforced by regulators, and thus only the frequentist-oriented perspective is accepted. Indeed, there are some challenges with the subjective and calibrated Bayesian approaches in those settings. See, e.g., [14, 76]. With a large number of enrolled patients, a large population that could potentially benefit from the treatment, and multiple decision makers with distinctive prior opinions and tolerances for risk, the process of eliciting costs and benefits can be difficult for subjective Bayesians. As [76] noted, “when the decision is whether or not to discontinue the trial, coupled with whether or not to recommend one treatment in preference to the other, the consequences of any particular course of action are so uncertain that they make the meaningful specification of utilities rather speculative.” From a calibrated Bayesian perspective, one could elicit the prior for θ based on historical trials of similar drugs and/or conditions. However, there may be concerns that high or low rates of historical success (e.g., pembrolizumab for solid tumors with a high success rate) may bias the inference for a new trial and trigger incentives for investigators to concentrate clinical research toward attractive areas and selected conditions. On the other hand, the prior for θ could also be based on all historical trials regardless of drugs and conditions. However, the distribution of treatment effects can be highly variable over time, and different types of trials have vastly different endpoints, which are difficult to summarize into a common distribution. 
As a result, utilization of Bayesian designs for phase III trials requires a case-by-case discussion that involves extensive examination of prior elicitation, inference procedures, and simulation results, which has been highlighted by several guidances from the U.S. Food and Drug Administration [25, 26, 27].
The subjective Bayesian perspective can be useful in trials for rare diseases and pediatric trials for small populations. In those situations, simple loss functions may be elicited, and prior distributions can be derived by eliciting expert opinion [46]. The elicitation process usually involves interviewing multiple subject experts such as physicians and their team members, and summaries of the interviews can be reported in the form of statistics like medians, modes, and percentiles. A prior distribution can then be estimated by fitting a parametric distribution to match the summary statistics.
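As a minimal sketch of the last step, suppose the experts' opinions are summarized by a median and a 90th percentile for θ and that a normal prior is to be fitted. Both the choice of summary statistics and the numerical values (0.1 and 0.4) are hypothetical, as is the function name; other parametric families or more summary statistics would require a different fitting procedure.

```python
from statistics import NormalDist


def normal_prior_from_quantiles(median, upper, upper_level=0.9):
    """Match a N(mu, nu^2) prior to an elicited median and one upper percentile."""
    if upper <= median or not 0.5 < upper_level < 1.0:
        raise ValueError("need upper > median and 0.5 < upper_level < 1")
    mu = median                                   # a normal median equals its mean
    nu = (upper - median) / NormalDist().inv_cdf(upper_level)
    return mu, nu


# Hypothetical elicited summaries: median 0.1 and 90th percentile 0.4.
mu, nu = normal_prior_from_quantiles(median=0.1, upper=0.4)
prior = NormalDist(mu, nu)
print(f"fitted prior: N({mu:.3f}, {nu:.3f}^2), Pr(theta > 0) = {1 - prior.cdf(0.0):.3f}")
```

The fitted prior reproduces the two elicited summaries exactly; with more than two summaries, an exact match is generally not possible and a least-squares or similar fit would be needed.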
Lastly, the calibrated Bayesian perspective is suitable in exploratory settings, such as animal studies for drug screening and early-phase trials (e.g., dose finding). For those trials, stringent type I error rate control is optional and often at the discretion of the sponsors. Eliciting the prior for θ from previous studies and focusing on FDR/FPR control allow an efficient selection of promising drugs for further development.
Influenced by [68, 52, 62], our recommendation is to regard the subjective Bayesian paradigm as ideal in principle but often rely on frequentist-type metrics to better communicate Bayesian designs and understand the practical implications of different priors, loss functions, and threshold values. The LP is sometimes viewed as an argument against the consideration of frequentist-type metrics in hypothetical trials. However, we will demonstrate in Section 4 that the LP does not preclude one from utilizing frequentist-type metrics to assess a decision procedure. Still, we advocate the use of operating characteristics under plausible scenarios, in addition to standard frequentist properties, for evaluating trial designs in either exploratory or confirmatory settings. Metrics like the FDR and FPR have not been used for drug approval, but arguably, they reflect the reality better than frequentist properties. In real life, different clinical trials would have different treatment effects.
2.5 Bayesian Hypothesis Testing
Before moving on to other topics, we discuss some additional considerations in Bayesian sequential designs. First, we present a special class of Bayesian designs based on the posterior probability of the alternative hypothesis through formal Bayesian hypothesis testing. See, e.g., [44]. For the single-arm trial example, to test Equation (1.1), we need to specify the priors for θ under both the null and alternative hypotheses,
\[ \theta \mid {H_{0}}\sim {\pi ^{(0)}}(\theta ),\hspace{2em}\theta \mid {H_{1}}\sim {\pi ^{(1)}}(\theta ).\]
Importantly, ${\pi ^{(0)}}(\theta )$ and ${\pi ^{(1)}}(\theta )$ have supports on $(-\infty ,0]$ and $(0,\infty )$, respectively. Then, the prior probability for each hypothesis is also specified, $\Pr ({H_{0}})=1-\omega $ and $\Pr ({H_{1}})=\omega $. At analysis j, the posterior probability of ${H_{1}}$ is
(2.6)
\[ \begin{aligned}{}& \Pr ({H_{1}}\mid {\boldsymbol{y}_{j}})=\frac{\Pr ({H_{1}})f({\boldsymbol{y}_{j}}\mid {H_{1}})}{\Pr ({H_{1}})f({\boldsymbol{y}_{j}}\mid {H_{1}})+\Pr ({H_{0}})f({\boldsymbol{y}_{j}}\mid {H_{0}})}=\\ {} & \frac{\omega {\textstyle\int _{\theta \gt 0}}f({\boldsymbol{y}_{j}}\mid \theta ){\pi ^{(1)}}(\theta )\text{d}\theta }{\omega {\textstyle\int _{\theta \gt 0}}f({\boldsymbol{y}_{j}}\mid \theta ){\pi ^{(1)}}(\theta )\text{d}\theta +(1-\omega ){\textstyle\int _{\theta \le 0}}f({\boldsymbol{y}_{j}}\mid \theta ){\pi ^{(0)}}(\theta )\text{d}\theta },\end{aligned}\]
which can be used to decide whether to stop the trial early. For example, if $\Pr ({H_{1}}\mid {\boldsymbol{y}_{j}})\gt {\gamma _{j}}$, ${H_{0}}$ is rejected, and the trial is stopped. This approach is equivalent to specifying a mixture prior distribution for θ,
\[ \theta \sim \pi (\theta )=(1-\omega )\cdot {\pi ^{(0)}}(\theta )+\omega \cdot {\pi ^{(1)}}(\theta ),\]
and then stop the trial at analysis j if $\Pr (\theta \gt 0\mid {\boldsymbol{y}_{j}})\gt {\gamma _{j}}$. Note that under the mixture prior,
\[\begin{aligned}{}\Pr (\theta \gt 0\mid {\boldsymbol{y}_{j}})& ={\int _{\theta \gt 0}}p(\theta \mid {\boldsymbol{y}_{j}})\text{d}\theta \\ {} & =\frac{{\textstyle\int _{\theta \gt 0}}f({\boldsymbol{y}_{j}}\mid \theta )\pi (\theta )\text{d}\theta }{{\textstyle\int _{\theta }}f({\boldsymbol{y}_{j}}\mid \theta )\pi (\theta )\text{d}\theta }=\text{(2.6)}\end{aligned}\]
This relationship has been noted by [84]. Although these two approaches are equivalent, when the primary goal is hypothesis testing, the prior for θ is usually specified as a mixture of two truncated distributions; when the primary goal is parameter estimation, the prior for θ is usually specified as a single continuous distribution.
A special case is when ${H_{0}}$ is a point hypothesis, say when we test ${H_{0}}:\theta =0$ vs. ${H_{1}}:\theta \ne 0$. From a hypothesis testing perspective, the prior for θ should be a mixture of a point mass at $\theta =0$ (denoted by ${\delta _{0}}(\theta )$) and a continuous distribution, $\pi (\theta )=(1-\omega ){\delta _{0}}(\theta )+\omega {\pi ^{(1)}}(\theta )$. Such a prior distribution is rarely used when the primary goal is parameter estimation. Lastly, [44] and [45] recommended the use of non-local prior densities, which incorporate a minimally significant separation between the null and alternative hypotheses, for Bayesian hypothesis testing and applications in trial monitoring.
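The equivalence between the hypothesis-testing form in Equation (2.6) and the mixture-prior form can be checked numerically. The following is a minimal sketch under stated assumptions, not part of the original analysis: it assumes the single-arm normal model with known σ (so that the sample mean is sufficient), takes ${\pi ^{(0)}}$ and ${\pi ^{(1)}}$ to be the two halves of a $\text{N}(0,{\nu ^{2}})$ density, and sets $\omega =0.5$, in which case the mixture prior is exactly $\text{N}(0,{\nu ^{2}})$ and $\Pr (\theta \gt 0\mid {\boldsymbol{y}_{j}})$ has a conjugate closed form. All function names and numerical values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def marginal_likelihood(ybar, n, sigma, nu, lower, upper, steps=100_000):
    """Midpoint-rule approximation of the integral over (lower, upper) of
    f(ybar | theta) * pi(theta), where pi is a N(0, nu^2) density truncated
    to one half-line (hence the factor 2 that renormalizes it)."""
    likelihood = NormalDist(ybar, sigma / sqrt(n))  # symmetric in (ybar, theta)
    prior = NormalDist(0.0, nu)
    h = (upper - lower) / steps
    total = 0.0
    for k in range(steps):
        theta = lower + (k + 0.5) * h
        total += likelihood.pdf(theta) * 2.0 * prior.pdf(theta)
    return total * h


def prob_h1(ybar, n, sigma, nu, omega):
    """Posterior probability of H1 in the form of Equation (2.6)."""
    m1 = marginal_likelihood(ybar, n, sigma, nu, 0.0, 10.0 * nu)
    m0 = marginal_likelihood(ybar, n, sigma, nu, -10.0 * nu, 0.0)
    return omega * m1 / (omega * m1 + (1.0 - omega) * m0)


def prob_positive_conjugate(ybar, n, sigma, nu):
    """Pr(theta > 0 | y) under a single N(0, nu^2) prior (conjugate update)."""
    precision = nu ** -2 + n * sigma ** -2
    mean = ybar * n * sigma ** -2 / precision
    return NormalDist().cdf(mean * sqrt(precision))


# Hypothetical interim data: 200 patients with a z-statistic of 1.75.
n, sigma, nu = 200, 1.0, 1.0
ybar = 1.75 * sigma / sqrt(n)
p_testing = prob_h1(ybar, n, sigma, nu, omega=0.5)
p_mixture = prob_positive_conjugate(ybar, n, sigma, nu)
print(p_testing, p_mixture)
```

The two printed values should agree up to quadrature error, which is the sense in which the two specifications are equivalent.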
2.6 Analysis at the Conclusion of a Sequential Trial
From a Bayesian perspective, after a clinical trial has been completed, all the information about θ is contained in its posterior distribution. Let t denote the stopping time of a sequential trial. For example, based on the stopping rule in Equation (2.2),
\[ t=\left\{\begin{array}{l@{\hskip10.0pt}l}\min \{j:{z_{j}}\gt {c_{j}}\},\hspace{0.2778em}\hspace{1em}& \text{if}\hspace{2.5pt}\exists j\in \{1,\dots ,K\}\hspace{2.5pt}\text{s.t.}\hspace{2.5pt}{z_{j}}\gt {c_{j}}\text{;}\\ {} K,\hspace{0.2778em}\hspace{1em}& \text{if}\hspace{2.5pt}{z_{j}}\le {c_{j}}\hspace{2.5pt}\text{for all}\hspace{2.5pt}j\text{.}\end{array}\right.\]
Then, ${\boldsymbol{y}_{t}}=({y_{1}},\dots ,{y_{{n_{t}}}})$ is the vector of accumulating data up to the time of stopping. At the time of stopping, the posterior distribution of θ is given by
\[ p(\theta \mid {\boldsymbol{y}_{t}})=\frac{f({\boldsymbol{y}_{t}}\mid \theta )\pi (\theta )}{{\textstyle\int _{\theta }}f({\boldsymbol{y}_{t}}\mid \theta )\pi (\theta )\text{d}\theta }.\]
One may be worried that the stopping time t is not included in the conditional of $p(\theta \mid {\boldsymbol{y}_{t}})$. However, assuming that θ and t are independent conditional on ${\boldsymbol{y}_{t}}$, we have
\[ p(\theta \mid t,{\boldsymbol{y}_{t}})=p(\theta \mid {\boldsymbol{y}_{t}}),\]
because $f(t,{\boldsymbol{y}_{t}}\mid \theta )=f(t\mid {\boldsymbol{y}_{t}},\theta )f({\boldsymbol{y}_{t}}\mid \theta )=f(t\mid {\boldsymbol{y}_{t}})f({\boldsymbol{y}_{t}}\mid \theta )$. Most often (and in all the designs that we have reviewed), θ affects t only through the observations ${\boldsymbol{y}_{t}}$, in which case the conditional independence assumption is satisfied, the equation holds, and the stopping rule plays no role in the posterior distribution of θ. See, e.g., [40]. However, we note that in some situations, θ could affect t other than just via ${\boldsymbol{y}_{t}}$. For example, if an interim analysis happens because an external trial found a positive treatment effect, which is more likely if θ is positive and large, this would affect t via external data other than via the current data.
The posterior mean, $\text{E}(\theta \mid {\boldsymbol{y}_{t}})$, is a commonly used point estimator for θ. On the other hand, a $100(1-\alpha )\% $ credible interval for θ can be constructed as $({\theta ^{\text{L}}},{\theta ^{\text{U}}})$, where ${\theta ^{\text{L}}}$ and ${\theta ^{\text{U}}}$ are the lower and upper $(\alpha /2)$ quantiles of $p(\theta \mid {\boldsymbol{y}_{t}})$, respectively. This credible interval has its asserted coverage in repeated practices if the model specification is correct (see Section S.2.1 of the Supplementary Material), but the coverage may deteriorate in the presence of model misspecification. Lastly, the posterior probability of the alternative hypothesis, $\Pr (\theta \gt 0\mid {\boldsymbol{y}_{t}})$, is also reported.
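As a minimal sketch of these end-of-trial summaries, the code below assumes the single-arm normal model with known σ and a $\text{N}(\mu ,{\nu ^{2}})$ prior, so that the posterior at the stopping time is normal with the usual conjugate mean and variance. The function name, its arguments, and the example data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def posterior_summary(ybar_t, n_t, sigma, mu, nu, alpha=0.05):
    """Posterior mean, equal-tailed credible interval and Pr(theta > 0 | y_t)
    for a normal outcome with known sigma and a N(mu, nu^2) prior."""
    precision = nu ** -2 + n_t * sigma ** -2
    mean = (mu * nu ** -2 + ybar_t * n_t * sigma ** -2) / precision
    posterior = NormalDist(mean, sqrt(1.0 / precision))
    interval = (posterior.inv_cdf(alpha / 2), posterior.inv_cdf(1 - alpha / 2))
    return {
        "mean": mean,
        "interval": interval,
        "prob_positive": 1.0 - posterior.cdf(0.0),
    }


# Hypothetical trial that stopped after 400 patients with sample mean 0.12.
summary = posterior_summary(ybar_t=0.12, n_t=400, sigma=1.0, mu=0.0, nu=1.0)
print(summary)
```

The function takes only the accumulated data and the prior as inputs. The stopping rule does not appear among its arguments, which mirrors the conditional independence argument above.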
2.7 Randomized-controlled Trial and Minimum Clinically Important Difference
So far, we have been using a single-arm trial to illustrate the designs. In practice, multi-arm trials such as randomized-controlled trials are also very common. We briefly outline an extension of the designs for a randomized-controlled trial. For simplicity, assume the trial outcomes are normally distributed. At analysis j, observed data are ${y_{r1}},{y_{r2}},\dots ,{y_{r{n_{rj}}}}\sim \text{N}({\theta _{r}},{\sigma _{r}^{2}})$ for arm r, where $r=1$ and 0 represent the investigational drug and control arms, respectively. The goal may be to test
\[ {H_{0}}:{\theta _{1}}-{\theta _{0}}\le 0\hspace{1em}\text{vs.}\hspace{1em}{H_{1}}:{\theta _{1}}-{\theta _{0}}\gt 0.\]
Assume ${\sigma _{1}^{2}}$ and ${\sigma _{0}^{2}}$ are known. One can specify a prior distribution for $\theta ={\theta _{1}}-{\theta _{0}}$, say $\theta \sim \text{N}(\mu ,{\nu ^{2}})$. The posterior distribution of θ at analysis j is given by
\[ \theta \mid {\boldsymbol{y}_{1j}},{\boldsymbol{y}_{0j}}\sim \text{N}\Bigg[\frac{\mu {\nu ^{-2}}+({\bar{y}_{1j}}-{\bar{y}_{0j}}){({\sigma _{1}^{2}}/{n_{1j}}+{\sigma _{0}^{2}}/{n_{0j}})^{-1}}}{{\nu ^{-2}}+{({\sigma _{1}^{2}}/{n_{1j}}+{\sigma _{0}^{2}}/{n_{0j}})^{-1}}},\frac{1}{{\nu ^{-2}}+{({\sigma _{1}^{2}}/{n_{1j}}+{\sigma _{0}^{2}}/{n_{0j}})^{-1}}}\Bigg],\]
where ${\bar{y}_{rj}}=\frac{1}{{n_{rj}}}{\textstyle\sum _{i=1}^{{n_{rj}}}}{y_{ri}}$. Then, one can proceed similarly as before. An alternative approach is to specify independent priors separately for ${\theta _{1}}$ and ${\theta _{0}}$ and then use these to obtain a posterior distribution of θ. This will lead to slightly different designs. See [78]. When ${\sigma _{1}^{2}}$ and ${\sigma _{0}^{2}}$ are unknown, one needs to specify priors for these parameters as well and calculate the marginal posterior distribution of θ.
In some trials, such as proof-of-concept trials, it may be of interest to evaluate the evidence of the treatment effect being greater than a minimum clinically important difference, denoted by Δ [16, 24]. In this case, one may replace the stopping rule in Equation (2.1) by
(2.7)
\[ \Pr (\theta \gt \Delta \mid {\boldsymbol{y}_{j}})\gt {\gamma _{j}}.\]
Alternatively, the efficacy stopping rule can be based on both Equations (2.1) and (2.7). Here, Equation (2.1) speaks to “does the drug work at all”, while Equation (2.7) addresses “does the drug have a clinically relevant effect”. In proof-of-concept trials, Equation (2.7) may be a necessary criterion for a drug to be promoted into full development [24].
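The posterior for the treatment difference and the two posterior probabilities discussed above (effect greater than 0, effect greater than Δ) can be sketched as follows. This is an illustration under the stated assumptions only: known variances, a $\text{N}(\mu ,{\nu ^{2}})$ prior placed directly on $\theta ={\theta _{1}}-{\theta _{0}}$, and hypothetical function names and data.

```python
from math import sqrt
from statistics import NormalDist


def rct_posterior(ybar1, n1, sigma1, ybar0, n0, sigma0, mu, nu):
    """Posterior of theta = theta_1 - theta_0 with known arm variances and a
    N(mu, nu^2) prior on theta."""
    # Precision of the observed difference in arm means.
    data_precision = 1.0 / (sigma1 ** 2 / n1 + sigma0 ** 2 / n0)
    precision = nu ** -2 + data_precision
    mean = (mu * nu ** -2 + (ybar1 - ybar0) * data_precision) / precision
    return NormalDist(mean, sqrt(1.0 / precision))


def prob_exceeds(posterior, threshold):
    """Posterior probability that theta exceeds the given threshold."""
    return 1.0 - posterior.cdf(threshold)


# Hypothetical interim data: 100 patients per arm, arm means 0.5 and 0.2.
posterior = rct_posterior(0.5, 100, 1.0, 0.2, 100, 1.0, mu=0.0, nu=1.0)
p_any_effect = prob_exceeds(posterior, 0.0)   # "does the drug work at all"
p_relevant = prob_exceeds(posterior, 0.2)     # hypothetical Delta = 0.2
print(p_any_effect, p_relevant)
```

Because Δ is positive, the second probability can never exceed the first, so a rule based on Δ alone is the more demanding of the two at a common threshold.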
2.8 Comparison with Frequentist Sequential Designs
Compared to their frequentist counterparts, Bayesian designs involve additional complexities such as prior elicitation and computational challenges when the posterior distribution is not analytically tractable. Still, Bayesian designs have certain advantages (see, e.g., [29]). First, with a chosen probability model, the data affect posterior inference only through the likelihood function. In this way, Bayesian inference obeys the LP ([34], p. 7). This can be philosophically appealing. Frequentist inference, on the other hand, may be affected by unrealized events. We will elaborate on this point in Section 4. Second, the stopping rule of an experiment is irrelevant to the construction and interpretation of a Bayesian credible interval. In contrast, a frequentist interval estimate of treatment effect following a group sequential trial crucially depends on the stopping rule. As [29] pointed out, such an interval may be quite unintuitive. Depending on the choice of sample space ordering, the interval may not always include the sample mean and can include zero difference even for data that lead to a recommendation to stop the trial at the first interim analysis (see [65]). Third, stringent frequentist inference can be challenging or unsatisfactory if the prescribed stopping rule is not followed. For example, a trial may be stopped due to unforeseeable circumstances such as the outbreak of COVID-19; in some cases, it may be desirable to extend a trial beyond the planned sample size. Some have argued that the relevance of stopping rules makes it almost impossible to conduct any frequentist inference in a strict sense [5, 9, 7, 82]. Oftentimes, statisticians are presented with a dataset without knowing how the stopping of the study was decided and why the study was not stopped earlier. Both factors can affect the frequentist properties of a statistical procedure, while in practice it is infeasible to keep track of them.
Lastly, when reliable historical information is available, it can be formally incorporated into the design and analysis of the current trial via Bayesian methods. This may lead to improvements in trial efficiency in terms of higher power and saving in sample size (see [71]).
3 Other Types of Bayesian Sequential Designs
3.1 Designs Based on Posterior Predictive Probabilities
In the upcoming sections, we review some other types of Bayesian sequential designs whose early stopping rules are not directly based on $\Pr (\theta \gt 0\mid {\boldsymbol{y}_{j}})\gt {\gamma _{j}}$. Similar to the idea of stochastic curtailment [49], posterior predictive probabilities can be used to determine whether to stop a trial early. See, e.g., [20, 50, 70]. Suppose that at the final analysis, efficacy of the drug will be declared if $\Pr (\theta \gt 0\mid {\boldsymbol{y}_{K}})\gt 1-\eta $. At analysis $j\in \{1,\dots ,K-1\}$, the posterior predictive distribution of future observations ${\boldsymbol{y}_{j,K}^{\ast }}=({y_{{n_{j}}+1}^{\ast }},\dots ,{y_{{n_{K}}}^{\ast }})$ is
\[ p({\boldsymbol{y}_{j,K}^{\ast }}\mid {\boldsymbol{y}_{j}})={\int _{\theta }}f({\boldsymbol{y}_{j,K}^{\ast }}\mid \theta )p(\theta \mid {\boldsymbol{y}_{j}})\text{d}\theta ,\]
and the posterior predictive probability of success (PPOS) is
\[ {\text{PPOS}_{j}}={\int _{{\boldsymbol{y}_{j,K}^{\ast }}}}\mathbf{1}\left[\Pr \left(\theta \gt 0\mid {\boldsymbol{y}_{j}},{\boldsymbol{y}_{j,K}^{\ast }}\right)\gt 1-\eta \right]\cdot p({\boldsymbol{y}_{j,K}^{\ast }}\mid {\boldsymbol{y}_{j}})\text{d}{\boldsymbol{y}_{j,K}^{\ast }}.\]
One may stop the trial early if ${\text{PPOS}_{j}}\gt {\gamma _{j}}$ for some threshold ${\gamma _{j}}$. To specify the prior for θ and the threshold values $\{{\gamma _{1}},\dots ,{\gamma _{K-1}}\}$ and η, one may take one of the approaches in Sections 2.1–2.3.
For the single-arm trial example, we have
\[ {\bar{y}_{j,K}^{\ast }}\mid {\boldsymbol{y}_{j}}\sim \text{N}\Bigg(\frac{\mu {\nu ^{-2}}+{\bar{y}_{j}}{n_{j}}{\sigma ^{-2}}}{{\nu ^{-2}}+{n_{j}}{\sigma ^{-2}}},\frac{1}{{\nu ^{-2}}+{n_{j}}{\sigma ^{-2}}}+\frac{1}{({n_{K}}-{n_{j}}){\sigma ^{-2}}}\Bigg),\]
where ${\bar{y}_{j,K}^{\ast }}=\left({y_{{n_{j}}+1}^{\ast }}+\cdots +{y_{{n_{K}}}^{\ast }}\right)/({n_{K}}-{n_{j}})$. The criterion $\Pr \left(\theta \gt 0\mid {\boldsymbol{y}_{j}},{\boldsymbol{y}_{j,K}^{\ast }}\right)\gt 1-\eta $ is equivalent to
\[\begin{aligned}{}{\bar{y}_{K}^{\ast }}& =\frac{1}{{n_{K}}}\left[{n_{j}}{\bar{y}_{j}}+({n_{K}}-{n_{j}}){\bar{y}_{j,K}^{\ast }}\right]\\ {} & \gt {q_{\eta }}\cdot \frac{\sqrt{{\nu ^{-2}}+{n_{K}}{\sigma ^{-2}}}}{{n_{K}}{\sigma ^{-2}}}-\frac{\mu {\nu ^{-2}}}{{n_{K}}{\sigma ^{-2}}}.\end{aligned}\]
Finally, it can be derived that
\[ {\text{PPOS}_{j}}=1-\Phi \Bigg\{{\left[\frac{1}{{\nu ^{-2}}+{n_{j}}{\sigma ^{-2}}}+\frac{1}{({n_{K}}-{n_{j}}){\sigma ^{-2}}}\right]^{-1/2}}\cdot \Bigg[\frac{{n_{K}}}{{n_{K}}-{n_{j}}}\cdot \Bigg({q_{\eta }}\cdot \frac{\sqrt{{\nu ^{-2}}+{n_{K}}{\sigma ^{-2}}}}{{n_{K}}{\sigma ^{-2}}}-\frac{\mu {\nu ^{-2}}}{{n_{K}}{\sigma ^{-2}}}-\frac{{\bar{y}_{j}}{n_{j}}}{{n_{K}}}\Bigg)-\frac{\mu {\nu ^{-2}}+{\bar{y}_{j}}{n_{j}}{\sigma ^{-2}}}{{\nu ^{-2}}+{n_{j}}{\sigma ^{-2}}}\Bigg]\Bigg\}.\]
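The closed-form expression can be checked against a direct Monte Carlo evaluation of the PPOS definition. The sketch below is illustrative only: it assumes the single-arm normal model with known σ, takes ${q_{\eta }}$ to be the upper η quantile of the standard normal distribution, and uses hypothetical function names and interim data.

```python
import random
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def posterior_prob_positive(ybar, n, sigma, mu, nu):
    """Pr(theta > 0 | y) under the conjugate normal model."""
    precision = nu ** -2 + n * sigma ** -2
    mean = (mu * nu ** -2 + ybar * n * sigma ** -2) / precision
    return STD_NORMAL.cdf(mean * sqrt(precision))


def ppos(ybar_j, n_j, n_K, sigma, mu, nu, eta):
    """Closed-form posterior predictive probability of success."""
    prec_j = nu ** -2 + n_j * sigma ** -2
    prec_K = nu ** -2 + n_K * sigma ** -2
    post_mean = (mu * nu ** -2 + ybar_j * n_j * sigma ** -2) / prec_j
    pred_var = 1.0 / prec_j + 1.0 / ((n_K - n_j) * sigma ** -2)
    q_eta = STD_NORMAL.inv_cdf(1.0 - eta)
    # Smallest final sample mean that still declares efficacy.
    final_cut = (q_eta * sqrt(prec_K) - mu * nu ** -2) / (n_K * sigma ** -2)
    # Equivalent requirement on the mean of the future observations.
    future_cut = n_K / (n_K - n_j) * (final_cut - ybar_j * n_j / n_K)
    return 1.0 - STD_NORMAL.cdf((future_cut - post_mean) / sqrt(pred_var))


def ppos_monte_carlo(ybar_j, n_j, n_K, sigma, mu, nu, eta, n_sim=20_000, seed=1):
    """Simulate theta from the posterior, then the future mean, then check
    the final-analysis criterion Pr(theta > 0 | all data) > 1 - eta."""
    rng = random.Random(seed)
    prec_j = nu ** -2 + n_j * sigma ** -2
    post_mean = (mu * nu ** -2 + ybar_j * n_j * sigma ** -2) / prec_j
    successes = 0
    for _ in range(n_sim):
        theta = rng.gauss(post_mean, sqrt(1.0 / prec_j))
        ybar_future = rng.gauss(theta, sigma / sqrt(n_K - n_j))
        ybar_K = (n_j * ybar_j + (n_K - n_j) * ybar_future) / n_K
        if posterior_prob_positive(ybar_K, n_K, sigma, mu, nu) > 1.0 - eta:
            successes += 1
    return successes / n_sim


# Hypothetical interim look: 200 of 1000 patients, z-statistic 1.75.
setup = dict(sigma=1.0, mu=0.0, nu=1.0, eta=0.05)
ybar_j = 1.75 / sqrt(200)
ppos_exact = ppos(ybar_j, 200, 1000, **setup)
ppos_simulated = ppos_monte_carlo(ybar_j, 200, 1000, **setup)
print(ppos_exact, ppos_simulated)
```

Setting `n_K` to a very large value in `ppos` gives a quick way to see the predictive probability approach the posterior probability returned by `posterior_prob_positive`.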
The PPOS depends on η and ${n_{K}}$. In general, the stopping rules based on PPOS and PP are different, although for given η and ${n_{K}}$, one may select ${\gamma ^{\prime }_{j}}$ such that $\{{\text{PPOS}_{j}}\gt {\gamma ^{\prime }_{j}}\}$ and $\{{\text{PP}_{j}}\gt {\gamma _{j}}\}$ are equivalent. As a result, one may also impose type I error rate control on PPOS stopping rules based on the arguments in Section 2.1. As noted by [70], if at the jth interim analysis the amount of data remaining to be collected (${n_{K}}-{n_{j}}$) is infinite, then ${\text{PPOS}_{j}}={\text{PP}_{j}}$ regardless of η. Typically, the PPOS is close to the PP at the beginning of a trial and moves toward either 0 or 1 as the trial nears completion.
3.2 Decision-theoretic Designs
As described in Section 2.2, the decisions in a sequential clinical trial can be made by minimizing the expected loss under a decision-theoretic framework. This approach has been considered by [12, 51, 77, 81], among others. The idea is that, at each interim analysis, the decision to stop the trial early and reject ${H_{0}}$ is associated with some loss if the decision is wrong. On the other hand, continuing the trial results in more cost in terms of patient recruitment. But with more data, the chance of making a wrong decision may be decreased. By considering both factors, decision-theoretic designs combine the strengths of designs based on posterior and posterior predictive probabilities.
We illustrate the idea of decision-theoretic designs through the single-arm trial example. Let ${\varphi _{j}}$ denote a possible decision at analysis j. For $j=1,\dots ,K-1$, ${\varphi _{j}}=1$ (or 0) represents rejecting ${H_{0}}$ and stopping the trial early (or failing to reject and continuing enrollment). For $j=K$, ${\varphi _{K}}=1$ (or 0) represents rejecting (or failing to reject) ${H_{0}}$ at the final analysis, and the trial is stopped in either case. Let ${\ell _{j}}({\varphi _{j}},\theta ,{\boldsymbol{y}_{j}})$ denote the loss of making decision ${\varphi _{j}}$ at analysis j given parameter θ and data ${\boldsymbol{y}_{j}}$. The posterior expected loss is then ${L_{j}}({\varphi _{j}},{\boldsymbol{y}_{j}})={\textstyle\int _{\theta }}{\ell _{j}}({\varphi _{j}},\theta ,{\boldsymbol{y}_{j}})p(\theta \mid {\boldsymbol{y}_{j}})\text{d}\theta $. The optimal decision is ${\tilde{\varphi }_{j}}({\boldsymbol{y}_{j}})=\arg {\min _{{\varphi _{j}}}}{L_{j}}({\varphi _{j}},{\boldsymbol{y}_{j}})$ and the associated expected loss is ${\tilde{L}_{j}}({\boldsymbol{y}_{j}})={\min _{{\varphi _{j}}}}{L_{j}}({\varphi _{j}},{\boldsymbol{y}_{j}})$, i.e., the Bayes risk.
Suppose that the loss of making decision ${\varphi _{j}}=1$ at analysis j ($j=1,\dots ,K-1$) is
(3.1)
\[ {\ell _{j}}({\varphi _{j}}=1,\theta ,{\boldsymbol{y}_{j}})={\xi _{1j}}\cdot \mathbf{1}(\theta \le 0),\]
where ${\xi _{1j}}$ is the loss of mistakenly rejecting ${H_{0}}$ and stopping the trial if $\theta \le 0$. On the other hand, if ${\varphi _{j}}=0$, the trial continues, $({n_{j+1}}-{n_{j}})$ patients will be enrolled until the next analysis, and we assume a unit loss for recruiting each patient. We have
(3.2)
\[ {\ell _{j}}({\varphi _{j}}=0,\theta ,{\boldsymbol{y}_{j}})=\left({n_{j+1}}-{n_{j}}\right)+{\int _{{\boldsymbol{y}_{j,j+1}^{\ast }}}}{\tilde{L}_{j+1}}({\boldsymbol{y}_{j}},{\boldsymbol{y}_{j,j+1}^{\ast }})p({\boldsymbol{y}_{j,j+1}^{\ast }}\mid {\boldsymbol{y}_{j}})\text{d}{\boldsymbol{y}_{j,j+1}^{\ast }}.\]
Here, ${\textstyle\int _{{\boldsymbol{y}_{j,j+1}^{\ast }}}}{\tilde{L}_{j+1}}({\boldsymbol{y}_{j}},{\boldsymbol{y}_{j,j+1}^{\ast }})p({\boldsymbol{y}_{j,j+1}^{\ast }}\mid {\boldsymbol{y}_{j}})\text{d}{\boldsymbol{y}_{j,j+1}^{\ast }}$ is the Bayes risk at analysis $(j+1)$ marginalized over the posterior predictive distribution on ${\boldsymbol{y}_{j,j+1}^{\ast }}=({y_{{n_{j}}+1}^{\ast }},\dots ,{y_{{n_{j+1}}}^{\ast }})$, that is, the observations between analyses j and $j+1$.
We also assume the loss of making decision ${\varphi _{K}}$ at the final analysis is
\[ {\ell _{K}}({\varphi _{K}},\theta ,{\boldsymbol{y}_{K}})=\left\{\begin{array}{l@{\hskip10.0pt}l}{\xi _{1K}}\cdot \mathbf{1}(\theta \le 0),\hspace{1em}\hspace{1em}& \text{if}\hspace{2.5pt}{\varphi _{K}}=1\text{;}\\ {} {\xi _{0}}\cdot \mathbf{1}(\theta \gt 0),\hspace{1em}\hspace{1em}& \text{if}\hspace{2.5pt}{\varphi _{K}}=0\text{.}\end{array}\right.\]
Here, ${\xi _{1K}}$ is the loss of mistakenly rejecting ${H_{0}}$ at the final analysis if $\theta \le 0$ (a type I error), and ${\xi _{0}}$ is the loss of failing to reject ${H_{0}}$ if $\theta \gt 0$ (a type II error).
At analysis j, the optimal decision ${\tilde{\varphi }_{j}}({\boldsymbol{y}_{j}})$ can be solved by backward induction ([19], Chapter 12). First, we calculate ${\tilde{L}_{K}}({\boldsymbol{y}_{K}})$ for all possible data ${\boldsymbol{y}_{K}}$ that can arise at the final analysis. Next, using Equations (3.1) and (3.2), we can calculate ${\tilde{L}_{K-1}}({\boldsymbol{y}_{K-1}})$ for all possible data ${\boldsymbol{y}_{K-1}}$ that can arise at analysis $(K-1)$. Proceeding backward in this way gives ${\tilde{L}_{K-2}}({\boldsymbol{y}_{K-2}}),\dots ,{\tilde{L}_{j}}({\boldsymbol{y}_{j}})$. This procedure requires many minimizations and integrations which may not be analytically tractable. Simulation-based approaches have been proposed to mitigate these computational challenges [54].
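As a small worked step added for illustration, the first stage of the backward induction can be written out explicitly. Taking the posterior expectation of the final-analysis loss gives
\[ {L_{K}}({\varphi _{K}}=1,{\boldsymbol{y}_{K}})={\xi _{1K}}\cdot \Pr (\theta \le 0\mid {\boldsymbol{y}_{K}}),\hspace{2em}{L_{K}}({\varphi _{K}}=0,{\boldsymbol{y}_{K}})={\xi _{0}}\cdot \Pr (\theta \gt 0\mid {\boldsymbol{y}_{K}}),\]
so rejecting ${H_{0}}$ at the final analysis has the smaller posterior expected loss when
\[ {\xi _{1K}}\left[1-\Pr (\theta \gt 0\mid {\boldsymbol{y}_{K}})\right]\lt {\xi _{0}}\Pr (\theta \gt 0\mid {\boldsymbol{y}_{K}})\hspace{1em}\Longleftrightarrow \hspace{1em}\Pr (\theta \gt 0\mid {\boldsymbol{y}_{K}})\gt \frac{{\xi _{1K}}}{{\xi _{0}}+{\xi _{1K}}}.\]
In other words, under this loss function the final decision reduces to a posterior probability threshold. For instance, a hypothetical choice of ${\xi _{1K}}=19{\xi _{0}}$ corresponds to a threshold of 0.95.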
[51] demonstrated that by tuning the loss functions, decision-theoretic designs can achieve desirable type I error rate control. [81] considered constrained optimal designs with explicit frequentist requisites. Alternatively, the loss functions and prior can be chosen by taking the subjective or calibrated Bayesian approach.
We summarize in Table 2 the various methods and measures that give rise to different types of sequential designs, including frequentist designs reviewed in Section S.1 of the Supplementary Material.
Table 2
Summary of methods and measures that give rise to different types of sequential designs.
Method/measure | Stopping criteria for efficacy | Design parameters |
Bayesian designs: | ||
Posterior probability | Posterior probability (PP) of drug being efficacious exceeds a prespecified threshold | Prior for treatment effect; PP thresholds at interim and final analyses |
Posterior predictive probability | Posterior predictive probability of trial success (PPOS) exceeds a prespecified threshold | Prior for treatment effect; PP threshold at final analysis; PPOS thresholds at interim analyses |
Decision-theoretic | Efficacy stopping minimizes posterior expected loss for a prespecified loss function | Prior for treatment effect; loss functions associated with possible decisions |
Frequentist designs: | ||
Frequentist group sequential | Test statistic exceeds a prespecified stopping boundary | Stopping boundaries for test statistics that define a critical region |
Stochastic curtailment | Conditional power (CP) of trial success, given a hypothetical treatment effect, exceeds a prespecified threshold | Critical value for test statistic at final analysis; CP thresholds at interim analyses |
4 The Likelihood Principle
Statistical inference and decision making in sequential clinical trials are typically tied to the LP. We provide some discussions in this section.
Let Y denote a random variable with density ${f_{\theta }}(y)$. The likelihood function for θ, given the observed outcome y of the random variable Y, is ${L_{y}}(\theta )={f_{\theta }}(y)$, that is, the density evaluated at y and regarded as a function of θ. The (strong) LP, as in [15] and [7], can be summarized as follows:
The Likelihood Principle.
All the statistical evidence about θ arising from an experiment is contained in the likelihood function for θ given y. Two likelihood functions for θ (from the same or different experiments) contain the same statistical evidence about θ if they are proportional to one another.
[15] showed that the LP can be deduced from two widely accepted principles: the sufficiency principle and the conditionality principle. There have been debates regarding Birnbaum’s proof and the validity of the LP in general. A detailed treatment of the LP is outside the scope of this paper. We refer interested readers to [7, 61, 23, 53, 30, 57].
What would be the consequences if we accept the LP? Since the LP deals only with the observed y, data that did not obtain and experiments not carried out have no impact on the evidence about θ [10, 7]. Also, as in [7], the LP implies that the reason for stopping an experiment (the stopping rule) should be irrelevant to the evidence about θ. In a clinical trial, the implication is that early stopping would not affect the evidential meaning of the trial outcome.
As an illustration, consider the example given by [10]. Imagine that a single-arm trial as described in Section 1.2 has been conducted, and 200 outcomes have been recorded that result in a z-statistic of ${z_{1}}=1.75$. These results are being reported by two investigators A and B, who used the same probability model (including the prior model for θ, if they were to take a Bayesian approach) but had different plans about the next step. Investigator A planned a second stage for the trial to enroll 200 more patients should it happen that ${z_{1}}\le 1.88$ (the Pocock stopping boundary, see [58]), while investigator B did not plan to enroll any more patients. According to the LP, the evidence about θ provided by the 200 observations is not affected by the investigators’ plans.
Although the LP seems compelling, it has been a source of controversy. Under the Bayesian paradigm, for any specified prior distribution for θ, if the likelihood functions are proportional as functions of θ, the resulting posterior densities for θ are identical. In this sense, Bayesian inference conforms to the LP ([8], p. 249; [34], p. 7). On the other hand, the LP seems to be incompatible with many frequentist procedures. In the previous example, investigator A cannot claim statistical significance using the Pocock design after 200 observations (and may fail again after all 400 observations), while investigator B can using a fixed design with 200 patients (${z_{1}}\gt {q_{0.05}}=1.645$). In other words, these investigators can reach completely different conclusions about the effectiveness of the drug with the exact same data.
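The frequentist side of this example can be illustrated with a short simulation of the type I error rate at $\theta =0$. The sketch below is a Monte Carlo approximation under stated assumptions: two equally sized stages of 200 patients, known variance, and a pooled z-statistic equal to the sum of the two stage-wise z-statistics divided by $\sqrt{2}$. The function name, seed, and number of simulations are hypothetical choices.

```python
import random
from math import inf, sqrt


def type1_two_looks(c_first, c_second, n_sim=200_000, seed=2024):
    """Monte Carlo type I error at theta = 0 for a design that rejects H0 if
    the z-statistic exceeds c_first after stage 1 or c_second after stage 2
    (two stages of equal size)."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        z_stage1 = rng.gauss(0.0, 1.0)
        z_stage2 = rng.gauss(0.0, 1.0)            # z-statistic of stage 2 alone
        z_pooled = (z_stage1 + z_stage2) / sqrt(2.0)
        if z_stage1 > c_first or z_pooled > c_second:
            rejections += 1
    return rejections / n_sim


# Investigator B: fixed design, a single look with critical value 1.645.
alpha_fixed = type1_two_looks(1.645, inf)
# Investigator A: two looks with the Pocock boundary 1.88 at each.
alpha_pocock = type1_two_looks(1.88, 1.88)
# Two looks with the unadjusted critical value 1.645 at each.
alpha_unadjusted = type1_two_looks(1.645, 1.645)
print(alpha_fixed, alpha_pocock, alpha_unadjusted)
```

The unadjusted two-look rule rejects more often than either of the other two, which is why investigator A's plan calls for the larger critical value even though the first 200 observations are identical for both investigators.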
The conflict here does not mean we have to either reject the LP or reject frequentist procedures. As explained previously (e.g., [7, 32, 31]), the LP is not a decision procedure and gives little guidance in assessing the overall performance of a decision procedure. The LP implies that only the observed data are relevant to the evidence about θ, but the consequences for making a specific decision may depend on other aspects of an experiment. First, while the evidence about θ is trial-specific, a decision procedure is applied to many trials. For example, from a regulatory agency’s perspective, the action to approve a drug reflects not only the consequences of administering this drug to patients, but also the downstream consequences of that decision rule for other drugs in the future [31]. Therefore, frequentist measures such as the type I error rate can be factored into the decision procedure. Second, even for a single trial, it is not unreasonable to associate the consequences of a decision with unrealized data patterns. For example, in a Bayesian sequential design based on posterior predictive probabilities (Section 3.1), the calculation of the PPOS involves an average over the posterior predictive distribution of future data. Such averaging is also required in a Bayesian decision-theoretic design (Section 3.2) when calculating the posterior expected loss of a decision based on backward induction. Imagine an ongoing clinical trial with a maximum sample size of 400 patients and an outcome variance of ${\sigma ^{2}}=1$. Suppose the Bayesian decision-theoretic design in Section 3.2 is used. After 200 outcomes have been recorded, an interim analysis is being performed by two investigators C and D, who used the same probability model with a $\text{N}(0,{1^{2}})$ prior on θ but had different plans. Investigator C planned another interim analysis after 300 observations, while investigator D did not plan to conduct any additional interim analysis.
Suppose the z-statistic at the interim analysis is ${z_{1}}=1.75$. Then, using the design and loss functions described in Section 3.2 with ${\xi _{0}}=400$ and ${\xi _{1j}}\equiv 19{\xi _{0}}$ for all j, the optimal decisions for investigators C and D are continuing enrollment and stopping the trial, respectively. Specifically, Figure 1 shows the posterior expected losses for possible decisions that can be made by the two investigators. We can see that the existence of a planned future interim analysis has an impact on the posterior expected loss associated with continuing the trial. In summary, if a dichotomous decision must be made, the LP does not preclude one from utilizing other information in addition to the observed data. Therefore, our view is that the LP should not be used as an argument for or against Bayesian or frequentist sequential designs.
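The backward induction of Section 3.2 can be sketched for this two-investigator example. The code below is a rough numerical illustration, not the computation used to produce Figure 1: it assumes the normal model with known σ, summarizes the data by the sample mean, takes the z-statistic to be ${\bar{y}_{j}}\sqrt{{n_{j}}}/\sigma $, and replaces the integral over future data by a midpoint rule on a hypothetical grid of 201 nodes. Function and variable names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def expected_losses(j, ybar, ns, sigma, mu, nu, xi1, xi0, nodes=201):
    """Posterior expected losses at analysis j (0-based) given the sample mean.

    Returns (loss of rejecting H0, loss of the alternative action), where the
    alternative is "continue" at an interim analysis and "do not reject" at
    the final analysis. The Bayes risk is the smaller of the two."""
    n = ns[j]
    precision = nu ** -2 + n * sigma ** -2
    post_mean = (mu * nu ** -2 + ybar * n * sigma ** -2) / precision
    post_sd = sqrt(1.0 / precision)
    p_null = STD_NORMAL.cdf(-post_mean / post_sd)   # Pr(theta <= 0 | y_j)
    reject = xi1 * p_null
    if j == len(ns) - 1:
        return reject, xi0 * (1.0 - p_null)
    # Posterior predictive distribution of the mean of the next group.
    m = ns[j + 1] - n
    pred_sd = sqrt(post_sd ** 2 + sigma ** 2 / m)
    predictive = NormalDist(post_mean, pred_sd)
    width = 12.0 * pred_sd / nodes   # midpoint rule over +/- 6 predictive sd
    future_risk = 0.0
    for k in range(nodes):
        y_star = post_mean - 6.0 * pred_sd + (k + 0.5) * width
        ybar_next = (n * ybar + m * y_star) / ns[j + 1]
        risk_next = min(expected_losses(j + 1, ybar_next, ns, sigma, mu, nu,
                                        xi1, xi0, nodes))
        future_risk += predictive.pdf(y_star) * width * risk_next
    # Unit loss per newly recruited patient plus the expected future Bayes risk.
    return reject, m + future_risk


ybar_200 = 1.75 / sqrt(200)   # sample mean implied by z = 1.75 with sigma = 1
settings = dict(sigma=1.0, mu=0.0, nu=1.0, xi1=19 * 400.0, xi0=400.0)
stop_c, continue_c = expected_losses(0, ybar_200, [200, 300, 400], **settings)
stop_d, continue_d = expected_losses(0, ybar_200, [200, 400], **settings)
print("C:", "stop" if stop_c < continue_c else "continue", stop_c, continue_c)
print("D:", "stop" if stop_d < continue_d else "continue", stop_d, continue_d)
```

The loss of stopping is the same for both investigators, since it depends only on the current posterior. Only the loss of continuing changes with the planned analysis schedule, which is the point made in the text.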
Figure 1
Posterior expected losses, as functions of the z-statistic, for possible decisions that can be made by investigators C and D at an interim analysis after 200 observations. The trial has a maximum sample size of 400 patients. Investigator C planned another interim analysis after 300 observations, while investigator D did not plan to conduct any additional interim analysis. The solid vertical line represents an observed z-statistic of 1.75 at the interim analysis. The optimal decisions for investigators C and D are continuing enrollment and stopping the trial, respectively.
Still, the conflict does suggest that if we accept the LP, then frequentist measures such as type I/II error rates and p-values may not be used as measures of statistical evidence for or against a hypothesis in a clinical trial [7]. This point has been raised by many others as well. For example, [67] stated that “Neyman-Pearson statistical theory is aimed at finding good rules for choosing from a specified set of possible actions. It does not address the problem of representing and interpreting statistical evidence, and the decision rules derived from Neyman-Pearson theory are not appropriate tools for interpreting data as evidence.” In summary, in an ideal world, one may use frequentist measures to design a trial. However, when reporting statistical analysis results as evidence after trial completion, Bayesian measures that conform to the LP should be preferred.
It should also be noted that not all Bayesian procedures are in compliance with the LP. For example, eliciting the prior for θ based on the sampling plan, such as using the Jeffreys prior [41], results in violation of the LP ([7], p. 21). We have mentioned in Section 2.1 that one may control the type I error rate of a Bayesian sequential design by calibrating the prior or threshold values. To avoid violation of the LP, however, we recommend taking the latter approach and not selecting the prior based on trial planning. Intuitively, changing the threshold values only affects decision making, while changing the prior affects both the evidence about θ (e.g., point and interval estimates) and decision making.
5 Numerical Studies
5.1 Illustration of the Frequentist-oriented Approach
As an illustration of the frequentist-oriented approach, we calculate the stopping boundaries for the z-statistics given by some of the aforementioned Bayesian sequential designs with the type I error rate controlled at $\alpha =0.05$. That is, we compute the $\{{c_{1}},\dots ,{c_{K}}\}$ values for which we would stop the trial at analysis j if ${z_{j}}\gt {c_{j}}$. We consider the single-arm trial example described in Section 1.2. Suppose that a total of $K=5$ (interim and final) analyses are planned, the maximum sample size is ${n_{K}}=1000$, and patients are enrolled in groups of size 200 (${n_{j}}=200j$). The variance for the outcomes is set at ${\sigma ^{2}}=1$ and is assumed known. Specifically:
(i) For stopping boundaries based on posterior probabilities (Equation 2.2), we consider the following two versions. In the first version, we use ${\gamma _{j}}\equiv 0.95$ and find that a $\text{N}(0,{0.054^{2}})$ prior for θ leads to $\alpha =0.05$. In the second version, we place a $\text{N}(0,{1^{2}})$ prior on θ and find that setting ${\gamma _{j}}\equiv 0.983$ leads to $\alpha =0.05$.
(ii) For stopping boundaries based on posterior predictive probabilities (Section 3.1), we set ${\gamma _{j}}\equiv 0.8$, $\eta =0.05$, and find that a $\text{N}(0,{0.063^{2}})$ prior for θ leads to $\alpha =0.05$.
(iii) For the Bayesian decision-theoretic design (Section 3.2), we place a $\text{N}(0,{1^{2}})$ prior on θ, use ${\xi _{0}}=1000$, and find that setting ${\xi _{1j}}\equiv 34890$ leads to $\alpha =0.05$.
The stopping boundaries are summarized in Table 3. For comparison, we also include the stopping boundaries produced by the Pocock and O’Brien-Fleming procedures [58, 55] and the linear error spending function [47]. See Sections S.1.1 and S.1.2 of the Supplementary Material for more details. With ${\gamma _{j}}\equiv 0.95$ and a conservative prior $\text{N}(0,{0.054^{2}})$, the Bayesian design based on posterior probabilities leads to stopping boundaries that lie between Pocock’s and O’Brien-Fleming’s boundaries; with a $\text{N}(0,{1^{2}})$ prior and ${\gamma _{j}}\equiv 0.983$, it gives stopping boundaries that are similar to Pocock’s boundaries. The Bayesian design based on predictive probabilities with a conservative prior $\text{N}(0,{0.063^{2}})$ gives boundaries that lie between Pocock’s and O’Brien-Fleming’s boundaries. Lastly, by tuning the loss functions, the Bayesian decision-theoretic design leads to stopping boundaries similar to those given by the linear error spending function.
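For the posterior-probability designs in item (i), the z-scale boundaries have a closed form under the normal model with known variance. The sketch below is an illustration under stated assumptions, not the code used to produce Table 3: it assumes a $\text{N}(0,{\nu ^{2}})$ prior and the usual z-statistic ${z_{j}}=\sqrt{{n_{j}}}\,{\bar{y}_{j}}/\sigma $, in which case $\Pr (\theta \gt 0\mid {\boldsymbol{y}_{j}})\gt {\gamma _{j}}$ is equivalent to ${z_{j}}\gt {\Phi ^{-1}}({\gamma _{j}})\sqrt{1+{\sigma ^{2}}/({n_{j}}{\nu ^{2}})}$. The function name `posterior_prob_boundaries` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def posterior_prob_boundaries(gamma, prior_sd, sample_sizes, sigma=1.0):
    """z-scale efficacy boundaries for the rule Pr(theta > 0 | data) > gamma.

    Assumes a N(0, prior_sd^2) prior on theta and N(theta, sigma^2) outcomes
    with sigma known (assumptions of this sketch).
    """
    z_gamma = NormalDist().inv_cdf(gamma)  # standard normal quantile
    return [
        z_gamma * sqrt(1.0 + sigma**2 / (n * prior_sd**2))
        for n in sample_sizes
    ]


sample_sizes = [200 * j for j in range(1, 6)]  # n_j = 200 j, K = 5

# Version 1: gamma = 0.95 with the conservative N(0, 0.054^2) prior.
version1 = posterior_prob_boundaries(0.95, 0.054, sample_sizes)
# Version 2: gamma = 0.983 with the N(0, 1^2) prior.
version2 = posterior_prob_boundaries(0.983, 1.0, sample_sizes)

print([round(c, 2) for c in version1])  # [2.71, 2.24, 2.06, 1.97, 1.91]
print([round(c, 2) for c in version2])
```

Under these assumptions, the first list matches the "Post. prob. (ver. 1)" row of Table 3, and the second is nearly constant and close to the Pocock boundary, which is consistent with the comparison made above.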
Table 3
Stopping boundaries for the z-statistics given by several Bayesian and frequentist sequential designs. The single-arm trial in Section 1.2 is considered with $K=5$ analyses, a maximum sample size of ${n_{K}}=1000$, and equal group sizes (${n_{j}}=200j$). The design parameters are calibrated such that the type I error rate at $\theta =0$ is $\alpha =0.05$ for every design.
Analysis | 1 | 2 | 3 | 4 | 5 |
No. of patients | 200 | 400 | 600 | 800 | 1000 |
Bayesian designs: | |||||
Post. prob. (ver. 1) | 2.71 | 2.24 | 2.06 | 1.97 | 1.91 |
Post. prob. (ver. 2) | 2.13 | 2.12 | 2.12 | 2.12 | 2.12 |
Post. pred. prob. | 2.50 | 2.26 | 2.18 | 2.11 | 1.84 |
Decision-theoretic | 2.33 | 2.22 | 2.15 | 2.09 | 1.91 |
Frequentist designs: | |||||
Pocock | 2.12 | 2.12 | 2.12 | 2.12 | 2.12 |
O’Brien-Fleming | 3.92 | 2.77 | 2.26 | 1.96 | 1.75 |
Linear error spending | 2.33 | 2.22 | 2.12 | 2.03 | 1.96 |
Figure 2
Visualization of the stopping boundaries given by different sequential designs, and comparison of the frequentist properties (power and expected sample size) of the designs for hypothetical values of θ, the treatment effect.
Figure 2 shows a visualization of the stopping boundaries and a comparison of the frequentist properties of the sequential designs. Here, we consider the power and expected sample size over a range of hypothetical θ values. There appears to be a trade-off between power and expected sample size. For example, the O’Brien-Fleming procedure has the highest power for all θ values but also requires the largest expected sample size. This is due to its high stopping boundaries at early analyses and progressively lower stopping boundaries at later analyses. In contrast, the Pocock boundaries and the boundaries based on posterior probabilities (version 2) lead to the lowest expected sample size but also have the lowest power. For more discussion on the frequentist evaluation of sequential designs, refer to [43].
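The frequentist properties compared in Figure 2 can be approximated by Monte Carlo simulation. The following is a minimal sketch, not the code behind Figure 2: it assumes ${z_{j}}={\textstyle\sum _{i\le {n_{j}}}}{y_{i}}/(\sigma \sqrt{{n_{j}}})$, takes the Pocock and O’Brien-Fleming boundaries from Table 3, and uses a hypothetical effect size $\theta =0.08$ chosen only for illustration. The function name `simulate_design`, the number of simulated trials, and the seed are likewise illustrative choices.

```python
import random
from math import sqrt


def simulate_design(boundaries, theta, group_size=200, sigma=1.0,
                    n_trials=20000, seed=1):
    """Monte Carlo estimate of (rejection probability, expected sample size)."""
    rng = random.Random(seed)
    rejections = 0
    total_n = 0
    for _ in range(n_trials):
        cum_sum, n = 0.0, 0
        for c in boundaries:
            # The sum of one group of outcomes is
            # N(group_size * theta, group_size * sigma^2).
            cum_sum += rng.gauss(group_size * theta, sigma * sqrt(group_size))
            n += group_size
            z = cum_sum / (sigma * sqrt(n))
            if z > c:  # efficacy boundary crossed: stop and reject H0
                rejections += 1
                break
        total_n += n  # n is the sample size at which the trial ended
    return rejections / n_trials, total_n / n_trials


pocock = [2.12] * 5
obrien_fleming = [3.92, 2.77, 2.26, 1.96, 1.75]

for name, bounds in [("Pocock", pocock), ("O'Brien-Fleming", obrien_fleming)]:
    alpha, _ = simulate_design(bounds, theta=0.0)     # type I error at theta = 0
    power, ess = simulate_design(bounds, theta=0.08)  # hypothetical theta
    print(f"{name}: alpha={alpha:.3f}, power={power:.3f}, E[N]={ess:.0f}")
```

Simulating group sums rather than individual outcomes keeps the sketch fast, since the z-statistic depends on the data only through the cumulative sum.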
5.2 Illustration of the Calibrated Bayesian Approach
To demonstrate the calibrated Bayesian approach, we conduct simulation studies to explore the operating characteristics of a Bayesian design under a variety of plausible scenarios. Consider the single-arm trial example in Section 1.2 with a maximum sample size of ${n_{K}}=1000$ and the Bayesian design with stopping rules given by Equation (2.1). Suppose the actual effect size of the trial, θ, is a random draw from $\text{N}({\mu _{0}},{\nu _{0}^{2}})$. As the trial progresses, patient outcomes become available sequentially and follow a normal distribution, ${y_{1}},{y_{2}},\dots \sim \text{N}(\theta ,{\sigma ^{2}})$. The trial statistician, on the other hand, uses a $\text{N}(\mu ,{\nu ^{2}})$ prior to draw inference about θ, which may or may not be identical to the actual population distribution of θ. For simplicity, assume the sampling model used by the statistician, $f({\boldsymbol{y}_{K}}\mid \theta )$, is correctly specified. At prespecified time and frequency, the statistician conducts interim analyses of accumulating data. If the stopping rule is triggered, ${H_{0}}$ is rejected, the trial is stopped, and efficacy of the drug is declared.
We consider 72 simulation scenarios, one for each combination of ${\nu _{0}}\in \{0.1,0.5,1\}$, $\nu \in \{0.1,0.5,1,10\}$, and $K\in \{1,2,5,10,100,1000\}$. For simplicity, we fix the other parameters: ${\mu _{0}}=\mu =0$, and $\sigma =1$. Here, a larger (or smaller) value of ${\nu _{0}}$ indicates that the actual effect size is more likely to be larger (or smaller) in magnitude. We do not consider ${\nu _{0}}\gt 1$ because, in practice, a standardized effect size that is much larger than what could be drawn from a $\text{N}(0,{1^{2}})$ distribution is not common. A larger (or smaller) value of ν indicates that the assumed prior for θ is more diffuse (or more concentrated around zero). When ${\nu _{0}}=\nu $, the population distribution of θ over different trials is the same as the prior for θ used for analysis. Lastly, K is the total number of (interim and final) analyses. We assume that patients are enrolled in groups of equal size ${n_{K}}/K$.
For each scenario, we simulate $S=10,000$ hypothetical trials by first generating ${\theta ^{(1)}},\dots ,{\theta ^{(S)}}\sim \text{N}({\mu _{0}},{\nu _{0}^{2}})$. Next, for each ${\theta ^{(s)}}$, trial outcomes are sequentially generated from $\text{N}({\theta ^{(s)}},{\sigma ^{2}})$. Interim analyses are performed after every ${n_{K}}/K$ outcomes have been observed, and the trial is stopped if the stopping rule as in Equation (2.2) is satisfied with ${\gamma _{j}}\equiv \gamma =0.95$. We record the $\widehat{\text{FDR}}$ and $\widehat{\text{FPR}}$ as defined in Equation (2.5). In addition, we record the percentage of 95% credible intervals for θ, calculated as in Section 2.6, that cover the true values.
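The simulation procedure just described can be sketched as follows. This is a hedged illustration rather than the code used for Table 4. It assumes that $\widehat{\text{FDR}}$ is the fraction of rejections for which ${\theta ^{(s)}}\le 0$ and that $\widehat{\text{FPR}}$ is the fraction of trials with ${\theta ^{(s)}}\le 0$ that reject ${H_{0}}$; Equation (2.5) should be consulted for the exact definitions. Credible-interval coverage is omitted, and the function name `simulate_fdr_fpr` and the seed are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_fdr_fpr(nu0, nu, K, gamma=0.95, n_max=1000, sigma=1.0,
                     mu0=0.0, mu=0.0, n_trials=10000, seed=1):
    """Estimate (FDR, FPR) for the rule Pr(theta > 0 | data) > gamma.

    theta is drawn from N(mu0, nu0^2); the analysis prior is N(mu, nu^2).
    K is assumed to divide n_max.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    group = n_max // K
    n_reject = n_null = n_false_reject = 0
    for _trial in range(n_trials):
        theta = rng.gauss(mu0, nu0)
        is_null = theta <= 0.0
        n_null += is_null
        cum_sum, n = 0.0, 0
        for _analysis in range(K):
            # Group sum of outcomes: N(group * theta, group * sigma^2).
            cum_sum += rng.gauss(group * theta, sigma * sqrt(group))
            n += group
            # Conjugate normal update with known sigma.
            post_prec = 1.0 / nu**2 + n / sigma**2
            post_mean = (mu / nu**2 + cum_sum / sigma**2) / post_prec
            post_prob = std_normal.cdf(post_mean * sqrt(post_prec))
            if post_prob > gamma:  # stop and declare efficacy
                n_reject += 1
                n_false_reject += is_null
                break
    fdr = n_false_reject / n_reject if n_reject else 0.0
    fpr = n_false_reject / n_null if n_null else 0.0
    return fdr, fpr


# Matched prior (nu = nu0) versus a diffuse prior, both with K = 10 analyses.
fdr_matched, fpr_matched = simulate_fdr_fpr(nu0=0.1, nu=0.1, K=10)
fdr_diffuse, fpr_diffuse = simulate_fdr_fpr(nu0=0.1, nu=10.0, K=10)
print(f"matched: FDR={fdr_matched:.3f}, FPR={fpr_matched:.3f}")
print(f"diffuse: FDR={fdr_diffuse:.3f}, FPR={fpr_diffuse:.3f}")
```

Exact values depend on the seed and number of simulated trials, so they should be read as Monte Carlo estimates with the corresponding sampling error.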
Table 4
Operating characteristics of the Bayesian design with stopping rules given by Equation (2.1), a maximum sample size of ${n_{K}}=1000$, K planned analyses, and equal group sizes. Values are averages over 10,000 simulated trials. Each cell shows the corresponding metric ($\widehat{\text{FDR}}$, $\widehat{\text{FPR}}$, or Coverage) for a specific combination of ${\nu _{0}}$, ν, and K.
K | $\widehat{\text{FDR}}$ (%) | $\widehat{\text{FPR}}$ (%) | Coverage (%) | |||||||||
${\nu _{0}}=0.1$, different ν below | ||||||||||||
0.1 | 0.5 | 1 | 10 | 0.1 | 0.5 | 1 | 10 | 0.1 | 0.5 | 1 | 10 | |
1 | 0.8 | 0.6 | 0.8 | 0.9 | 0.5 | 0.4 | 0.5 | 0.6 | 95.0 | 95.2 | 95.3 | 94.7 |
2 | 1.1 | 1.5 | 1.5 | 1.4 | 0.7 | 1.0 | 1.0 | 0.9 | 94.9 | 95.4 | 94.8 | 94.9 |
5 | 1.8 | 2.8 | 3.6 | 3.1 | 1.2 | 2.0 | 2.4 | 2.1 | 94.9 | 94.7 | 94.1 | 94.5 |
10 | 2.7 | 4.8 | 4.8 | 5.2 | 1.9 | 3.6 | 3.5 | 3.9 | 95.0 | 94.1 | 93.9 | 93.9 |
100 | 4.2 | 11.3 | 11.7 | 12.1 | 2.9 | 9.7 | 10.3 | 10.7 | 95.1 | 93.1 | 91.8 | 91.5 |
1000 | 5.2 | 15.1 | 19.9 | 22.5 | 3.9 | 13.5 | 19.6 | 23.5 | 95.3 | 93.7 | 91.2 | 88.1 |
${\nu _{0}}=0.5$, different ν below | ||||||||||||
0.1 | 0.5 | 1 | 10 | 0.1 | 0.5 | 1 | 10 | 0.1 | 0.5 | 1 | 10 | |
1 | 0.1 | 0.1 | 0.1 | 0.2 | 0.1 | 0.1 | 0.1 | 0.2 | 73.0 | 95.2 | 94.7 | 94.8 |
2 | 0.2 | 0.3 | 0.4 | 0.1 | 0.2 | 0.3 | 0.4 | 0.1 | 67.4 | 94.9 | 94.5 | 95.3 |
5 | 0.3 | 0.7 | 0.4 | 0.3 | 0.3 | 0.7 | 0.3 | 0.3 | 60.5 | 94.7 | 95.2 | 95.3 |
10 | 0.6 | 0.8 | 0.8 | 0.8 | 0.5 | 0.7 | 0.7 | 0.7 | 58.3 | 95.2 | 95.0 | 95.2 |
100 | 0.9 | 2.3 | 2.7 | 3.2 | 0.8 | 2.2 | 2.6 | 3.2 | 56.8 | 95.2 | 94.8 | 94.0 |
1000 | 0.8 | 3.2 | 5.8 | 8.6 | 0.8 | 3.2 | 6.0 | 8.7 | 57.1 | 95.2 | 94.4 | 92.2 |
${\nu _{0}}=1$, different ν below | ||||||||||||
0.1 | 0.5 | 1 | 10 | 0.1 | 0.5 | 1 | 10 | 0.1 | 0.5 | 1 | 10 | |
1 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 0.1 | 46.8 | 94.8 | 95.1 | 94.9 |
2 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 40.8 | 94.7 | 94.8 | 95.3 |
5 | 0.1 | 0.2 | 0.2 | 0.2 | 0.1 | 0.2 | 0.2 | 0.2 | 36.6 | 94.4 | 95.1 | 95.0 |
10 | 0.1 | 0.5 | 0.4 | 0.4 | 0.1 | 0.5 | 0.4 | 0.4 | 34.9 | 94.5 | 94.8 | 94.8 |
100 | 0.3 | 1.5 | 1.3 | 1.2 | 0.3 | 1.4 | 1.3 | 1.2 | 34.2 | 90.7 | 95.1 | 94.8 |
1000 | 0.3 | 2.2 | 3.5 | 5.1 | 0.3 | 2.2 | 3.5 | 5.3 | 33.8 | 87.6 | 94.7 | 93.4 |
Table 4 summarizes the simulation results. Although the FDR and FPR increase with the number of analyses, according to Proposition 2.1, the FDR and FPR are upper bounded when the statistician’s model is correctly specified. These theoretical results are corroborated by the simulations: when ${\nu _{0}}=\nu $, the $\widehat{\text{FDR}}$ is roughly bounded by $1-\gamma =5\% $ (due to Monte Carlo errors and a finite number of simulations, the $\widehat{\text{FDR}}$ may sometimes exceed 5%), and the $\widehat{\text{FPR}}$ is always below $(1-\gamma )/\gamma =5.3\% $. In addition, when ${\nu _{0}}=\nu $, the coverage of the 95% credible intervals for θ is around 95% regardless of K.
In the presence of model misspecification, however, Bayesian statements may not attain their asserted coverage, and the discrepancy becomes larger with more frequent applications of data-dependent stopping rules. These results are consistent with the findings in [68] and [63]. When the assumed prior is more diffuse than the actual distribution of θ, the FDR and FPR are inflated, and the degree of FDR and FPR inflation becomes greater when K is larger. For example, when ${\nu _{0}}=0.1$, $\nu =10$, and $K=1000$, the $\widehat{\text{FDR}}$ and $\widehat{\text{FPR}}$ are 22.5% and 23.5%, respectively. For this reason, we caution against the use of diffuse priors for decision making if data-dependent stopping rules are in frequent use and the actual effect sizes are believed to be small. In addition, when ${\nu _{0}}\ne \nu $, the coverage of the 95% credible intervals for θ can fall below 95% and tends to decrease as K increases. Interestingly, an overly conservative prior (that is more concentrated around zero) results in low coverage of the credible intervals, while a diffuse prior has less impact on the coverage.
From a calibrated Bayesian point of view, simulation studies of this type can be used to guide the choice of $\pi (\theta )$ and $\{{\gamma _{1}},\dots ,{\gamma _{K}}\}$. Suppose the trial statistician decides to use a constant threshold value ${\gamma _{j}}\equiv \gamma =0.95$ and wants to select ν such that the FDR and FPR of the design are controlled below 5% for plausible ${\nu _{0}}$ and K scenarios (assume ${\mu _{0}}=\mu =0$). To achieve this goal for all ${\nu _{0}}$ and K considered here, ν should be set at $\le 0.1$. However, if one plans to conduct no more than $K=10$ analyses, then setting $\nu \le 1$ is sufficient.
We do not present additional numerical studies for the subjective Bayesian approach, under which the prior and threshold values may be chosen based on subjective beliefs rather than simulations.
6 Discussion
We have summarized three perspectives on Bayesian sequential designs, namely the frequentist-oriented perspective, the subjective Bayesian perspective, and the calibrated Bayesian perspective, and have discussed their implications. We have reviewed Bayesian sequential designs based on posterior probabilities, posterior predictive probabilities, and decision-theoretic frameworks. We have also commented on the role of the LP in sequential trial designs. While the LP implies that unrealized events are irrelevant to the statistical evidence about the treatment effect, it gives little guidance in assessing a decision procedure and thus does not preclude the use of additional information in decision-making.
So far, we have only considered early stopping for efficacy. In practice, it may be desirable to allow for early stopping when interim results suggest the investigational drug is unlikely to have a clinically meaningful treatment effect [75]. This is known as early stopping for futility. A sequential trial design can include a provision for either early efficacy stopping, early futility stopping, or both. Consider the single-arm trial example. One could stop the trial at analysis j in favor of the null hypothesis if $\Pr (\theta \gt 0\mid {\boldsymbol{y}_{j}})\lt {\tau _{j}}$ for some threshold ${\tau _{j}}$. Futility stopping rules do not inflate the type I error rate; actually, they decrease the type I error rate. However, futility stopping rules also decrease the power and increase the false negative rate (FNR) and false omission rate (FOR) of a design. The futility boundaries could be specified to either satisfy certain power and type I error rate requirements (similar to [56]), reflect subjective beliefs, or achieve desirable FNR, FOR, FDR, and FPR under plausible scenarios.
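The effect of adding a futility rule on the type I error rate can be checked with a small simulation at $\theta =0$. The sketch below rests on several assumptions that are not taken from the text: a $\text{N}(0,{1^{2}})$ prior, a constant futility threshold ${\tau _{j}}\equiv 0.2$ (a hypothetical value), the Pocock efficacy boundary from Table 3, and binding futility stopping. Under the $\text{N}(0,{\nu ^{2}})$ prior, $\Pr (\theta \gt 0\mid {\boldsymbol{y}_{j}})\lt {\tau _{j}}$ is equivalent to ${z_{j}}\lt {\Phi ^{-1}}({\tau _{j}})\sqrt{1+{\sigma ^{2}}/({n_{j}}{\nu ^{2}})}$. The function names are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist


def futility_boundaries(tau, prior_sd, sample_sizes, sigma=1.0):
    """z-scale futility boundaries for Pr(theta > 0 | data) < tau,
    assuming a N(0, prior_sd^2) prior (an assumption of this sketch)."""
    z_tau = NormalDist().inv_cdf(tau)
    return [
        z_tau * sqrt(1.0 + sigma**2 / (n * prior_sd**2))
        for n in sample_sizes
    ]


def type_one_error(eff_bounds, fut_bounds=None, group_size=200, sigma=1.0,
                   n_trials=20000, seed=1):
    """Monte Carlo type I error rate at theta = 0, with optional futility."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        # Draw all group sums up front so that designs with and without
        # futility stopping are evaluated on identical simulated data.
        sums = [rng.gauss(0.0, sigma * sqrt(group_size)) for _ in eff_bounds]
        cum_sum, n = 0.0, 0
        for j, c in enumerate(eff_bounds):
            cum_sum += sums[j]
            n += group_size
            z = cum_sum / (sigma * sqrt(n))
            if z > c:  # efficacy stop: reject H0
                rejections += 1
                break
            if fut_bounds is not None and z < fut_bounds[j]:
                break  # futility stop: H0 is not rejected
    return rejections / n_trials


sample_sizes = [200 * j for j in range(1, 6)]
efficacy = [2.12] * 5                                   # Pocock, from Table 3
futility = futility_boundaries(0.2, 1.0, sample_sizes)  # tau = 0.2 is hypothetical

alpha_without = type_one_error(efficacy)
alpha_with = type_one_error(efficacy, futility)
print(f"type I error without futility: {alpha_without:.4f}")
print(f"type I error with futility:    {alpha_with:.4f}")
```

Because both runs use the same simulated data, every trial that rejects under the design with futility stopping also rejects under the design without it, so the estimated type I error rate with futility stopping cannot exceed the one without.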
Two-sided tests and point null hypotheses are very common in clinical trials. For example, for the single-arm trial in Section 1.2, one may test
$${H_{0}}:\theta =0\hspace{1em}\text{vs.}\hspace{1em}{H_{1}}:\theta \ne 0.\hspace{2em}(6.1)$$
There have been several criticisms of testing a point null hypothesis [6], such as the plausibility of θ being equal to 0 exactly. As a result, we have focused on a one-sided test with a composite null hypothesis (Equation 1.1). Most of our discussions are still applicable to tests like Equation (6.1), although from a Bayesian hypothesis testing perspective, the prior for θ should include a discrete mass at the location indicated by the point hypothesis.
From a frequentist perspective, the issue of type I error rate inflation (or multiplicity) can arise from repeatedly testing a single hypothesis over time, or testing multiple hypotheses simultaneously [72]. From a subjective Bayesian perspective, however, repeated hypothesis testing is not necessarily a problem (see Section 2.2), and multiplicity adjustments are needed only when there are multiple tests. It is worth noting that frequentist and Bayesian philosophies on multiple testing are also quite different [13, 73].