Supplementary Material to “On Bayesian Sequential Clinical Trial Designs”.

Clinical trials usually involve sequential patient entry. When designing a clinical trial, it is often desirable to include a provision for interim analyses of accumulating data with the potential for stopping the trial early. We review Bayesian sequential clinical trial designs based on posterior probabilities, posterior predictive probabilities, and decision-theoretic frameworks. A pertinent question is whether Bayesian sequential designs need to be adjusted for the planning of interim analyses. We answer this question from three perspectives: a frequentist-oriented perspective, a calibrated Bayesian perspective, and a subjective Bayesian perspective. We also provide new insights into the likelihood principle, which is commonly tied to statistical inference and decision making in sequential clinical trials. Some theoretical results are derived, and numerical studies are conducted to illustrate and assess these designs.

In most clinical trials, patient enrollment is staggered, and patients’ data are collected sequentially. When designing a clinical trial, it is often desirable to include a provision for

It is well known that frequentist sequential designs need to be adjusted for the planning of interim analyses to maintain desirable frequentist properties [

In this article, we review different perspectives on Bayesian sequential designs and answer the question of whether Bayesian sequential designs need to be adjusted for interim analyses. Our review is not meant to be comprehensive with regard to methodological details including the type of trial (e.g., single-arm or randomized-controlled), type of outcome (e.g., binary, continuous, or time-to-event), or distributional assumption. Instead, we focus on the fundamentals of Bayesian sequential designs. A single-arm trial example (to be introduced in Section

There is a rich literature on sequential designs (e.g., [

Our contributions include the following.

(i) In Bayesian sequential designs, a pertinent question is whether adjustments for the planning of interim analyses are necessary. We attempt to answer this question from multiple perspectives. From a frequentist-oriented perspective, such adjustments are necessary for achieving desirable frequentist properties such as controlling the type I error rates; from a calibrated Bayesian perspective, such adjustments may be needed to achieve desirable operating characteristics under plausible scenarios (we will discuss the differences between achieving desirable operating characteristics and achieving desirable frequentist properties); lastly, from a subjective Bayesian perspective, such adjustments are unnecessary, and the design only needs to reflect subjective beliefs. We comment on the three perspectives and make our recommendation.

(ii) We put forward a proposal for a calibrated Bayesian approach to sequential designs. Specifically, we propose false discovery rate (FDR) and false positive rate (FPR) as potential metrics to evaluate sequential designs. We derive theoretical results regarding the FDR and FPR of a Bayesian sequential design and present simulation studies to demonstrate the practical usage of the calibrated Bayesian approach.

(iii) We summarize Bayesian sequential designs based on posterior probabilities, posterior predictive probabilities, and decision-theoretic frameworks. We discuss the connections between designs using posterior credible intervals and those using formal Bayesian hypothesis testing.

(iv) It is often believed that according to the likelihood principle (LP), decision making in a sequential trial should not depend on unrealized events. However, our investigation shows that the LP gives little guidance in assessing the overall performance of a decision procedure. In particular, the LP does not preclude one from utilizing additional information (including unrealized events) for decision making. Therefore, our view is that the LP should not be used as an argument for or against Bayesian or frequentist sequential designs. To illustrate our findings, we present an example of a Bayesian decision-theoretic design in which different decisions will be made based on the same observed data but different interim analysis plans.

To illustrate the discussion, consider a single-arm trial that aims to establish the therapeutic effect of an investigational drug. Suppose that a total of

Frequentist sequential designs are concerned with controlling the overall type I error rate of the sequential testing procedure. The type I error rate refers to the probability of falsely rejecting

Without accounting for the sequential nature of the hypothesis test, Bayesian designs can suffer the same problem of type I error inflation, which can be unsettling for statisticians who care about controlling the type I error rates. Therefore, in many Bayesian sequential trial designs, the stopping boundaries are also determined to control the type I error rate at a desirable level [

The remainder of the paper is structured as follows. In Section

Consider the single-arm trial in Section

Without accounting for multiple looks at the data, the stopping rule in Equation (
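The inflation caused by unadjusted repeated looks can be illustrated with a small Monte Carlo sketch. Everything specific here is an assumption made for illustration and is not taken from the trial example: outcomes are N(theta, 1), the null is theta <= 0, the prior is N(0, prior_var), looks are equally spaced, and the trial stops for efficacy the first time Pr(theta > 0 | data) exceeds a fixed threshold gamma. The function name and its arguments are hypothetical.

```python
import numpy as np
from statistics import NormalDist


def type1_error_posterior_rule(n_looks, n_per_look=20, gamma=0.95,
                               prior_var=100.0, n_sims=20000, seed=0):
    """Monte Carlo type I error of an unadjusted posterior-probability rule.

    Assumed model (illustrative only): y_i ~ N(theta, 1), theta ~ N(0, prior_var),
    H0: theta <= 0. Efficacy is declared at the first look where
    Pr(theta > 0 | data) > gamma, with the same gamma at every look.
    """
    rng = np.random.default_rng(seed)
    # Generate data at the null boundary theta = 0.
    y = rng.normal(0.0, 1.0, size=(n_sims, n_looks * n_per_look))
    n = n_per_look * np.arange(1, n_looks + 1)      # cumulative sample sizes
    sums = np.cumsum(y, axis=1)[:, n - 1]           # cumulative sums at each look
    post_var = 1.0 / (1.0 / prior_var + n)          # conjugate normal posterior
    post_mean = post_var * sums
    # Pr(theta > 0 | data) = Phi(post_mean / post_sd), so the rule is a z-type cutoff.
    z_crit = NormalDist().inv_cdf(gamma)
    reject = (post_mean / np.sqrt(post_var) > z_crit).any(axis=1)
    return float(reject.mean())


if __name__ == "__main__":
    for k in (1, 2, 5, 10):
        print(k, type1_error_posterior_rule(k))
```

Under these assumptions, a single look yields a rejection rate near 1 - gamma, and the rate grows as more looks are added with the same threshold.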

With an intended type I error rate, the parameters in a Bayesian sequential design can be chosen in multiple ways. For prespecified threshold values, type I error rate control can be achieved by using a conservative prior. [

Alternatively, for a given prior

For more complicated trials (e.g., randomized-controlled, binary outcome), tuning

From a subjective Bayesian point of view (see, e.g., [

We see that by taking this particular subjective Bayesian approach, one does not need to take frequentist properties into account. For example, suppose that

Such a procedure is vulnerable to type I error rate inflation, which would bother many practitioners. However, it has been argued that the type I error rate is not the quantity that one should pay most attention to [

A similar critique of the subjective Bayesian approach is the issue of “sampling to a foregone conclusion” [

Although Bayesian probabilities represent degrees of belief in some formal sense, for practitioners and regulatory agencies, it can be pertinent to examine the operating characteristics of Bayesian designs in repeated practice. One could calibrate the prior and threshold values in a Bayesian sequential design to achieve desirable operating characteristics under a range of plausible scenarios, and we refer to this as a calibrated Bayesian approach [

We distinguish between

What kinds of operating characteristics could be examined? Consider the single-arm trial example. Imagine an infinite series of such trials with true but unknown treatment effects

The calibration of the design parameters is typically done through computer simulations. For each plausible
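A simulation of this kind might be sketched as follows. The sketch rests on assumptions that are not taken from the text: outcomes are N(theta, 1) with null value 0, the analysis prior is normal, theta is drawn across trials from a normal "true" distribution, the FDR is estimated as the fraction of rejections in which theta <= 0, and the FPR as the fraction of trials with theta <= 0 that end in rejection. The function `simulate_fdr_fpr` and its arguments are hypothetical names.

```python
import numpy as np
from statistics import NormalDist


def simulate_fdr_fpr(n_looks=5, n_per_look=20, gamma=0.95,
                     prior_mean=0.0, prior_var=1.0,
                     true_mean=0.0, true_var=1.0,
                     n_trials=20000, seed=1):
    """Estimate FDR and FPR of a posterior-probability stopping rule.

    Assumptions (illustrative): y_i ~ N(theta, 1); analysis prior
    N(prior_mean, prior_var); across trials theta ~ N(true_mean, true_var);
    reject H0: theta <= 0 at the first look with Pr(theta > 0 | data) > gamma.
    """
    rng = np.random.default_rng(seed)
    theta = rng.normal(true_mean, np.sqrt(true_var), size=n_trials)
    y = rng.normal(theta[:, None], 1.0, size=(n_trials, n_looks * n_per_look))
    n = n_per_look * np.arange(1, n_looks + 1)
    sums = np.cumsum(y, axis=1)[:, n - 1]
    post_var = 1.0 / (1.0 / prior_var + n)
    post_mean = post_var * (prior_mean / prior_var + sums)
    z_crit = NormalDist().inv_cdf(gamma)
    reject = (post_mean / np.sqrt(post_var) > z_crit).any(axis=1)
    null = theta <= 0.0
    false_pos = np.sum(reject & null)
    fdr = false_pos / max(np.sum(reject), 1)   # share of rejections that are false
    fpr = false_pos / max(np.sum(null), 1)     # share of null trials that reject
    return float(fdr), float(fpr)


if __name__ == "__main__":
    # Analysis prior equal to the generating distribution of theta.
    print(simulate_fdr_fpr())
    # Analysis prior much more diffuse than the generating distribution.
    print(simulate_fdr_fpr(prior_var=100.0, true_var=0.01))
```

Calibration would then amount to rerunning such a function over a grid of priors, thresholds, and numbers of looks, and keeping the settings whose estimated FDR and FPR are acceptable in every plausible scenario. When the analysis prior coincides with the distribution that generates theta, the posterior probability of the null at the time of rejection is below 1 - gamma, so the estimated FDR should also fall below 1 - gamma up to Monte Carlo error.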

In certain contexts, there are theoretical guarantees on the operating characteristics of Bayesian sequential designs. Specifically, the following proposition provides such an example.

The proof is given in Sections S.2.2 and S.2.3 of the Supplementary Material. Therefore, from a calibrated Bayesian perspective, the prior on

In general, requiring a design to have good operating characteristics (under plausible scenarios) is more lenient than requiring it to have good frequentist properties (for all possible parameter values). For example, the type I error rate is essentially the FPR when

We have reviewed three perspectives on Bayesian sequential designs, which are summarized in Table

Summary of the three perspectives on Bayesian sequential designs.

| Perspective | Description | Suitable contexts |
| --- | --- | --- |
| Frequentist-oriented | Specifying design parameters to achieve desirable frequentist properties (e.g., type I error rate) | Large-scale confirmatory trials |
| Subjective Bayesian | Specifying design parameters to reflect subjective beliefs and personal tolerance of risk | Trials for rare diseases; pediatric trials for small populations |
| Calibrated Bayesian | Specifying design parameters to achieve desirable operating characteristics (e.g., FDR and FPR) under plausible scenarios | Animal studies for drug screening; early-phase trials (e.g., dose finding) |

In some contexts, a specific approach can be more applicable and acceptable compared to the others. For example, for large-scale confirmatory trials (e.g., COVID-19 vaccine trials), type I error rate control is enforced by regulators, and thus only the frequentist-oriented perspective is accepted. Indeed, there are some challenges with the subjective and calibrated Bayesian approaches in those settings. See, e.g., [

The subjective Bayesian perspective can be useful in trials for rare diseases and pediatric trials for small populations. In those situations, simple loss functions may be elicited, and prior distributions can be derived by eliciting expert opinion [

Lastly, the calibrated Bayesian perspective is suitable in exploratory settings, such as animal studies for drug screening and early-phase trials (e.g., dose finding). For those trials, stringent type I error rate control is optional and often at the discretion of the sponsors. Eliciting the prior for

Influenced by [

Before moving on to other topics, we discuss some additional considerations in Bayesian sequential designs. First, we present a special class of Bayesian designs based on the posterior probability of the alternative hypothesis through formal Bayesian hypothesis testing. See, e.g., [

A special case is when

From a Bayesian perspective, after a clinical trial has been completed, all the information about

because

The posterior mean,

So far, we have been using a single-arm trial to illustrate the designs. In practice, multi-arm trials such as randomized-controlled trials are also very common. We briefly outline an extension of the designs for a randomized-controlled trial. For simplicity, assume the trial outcomes are normally distributed. At analysis

In some trials, such as proof-of-concept trials, it may be of interest to evaluate the evidence of the treatment effect being greater than a minimum clinically important difference, denoted by Δ [
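As a rough illustration of evaluating evidence against a minimum clinically important difference, the following sketch computes Pr(theta > Delta | data) for a two-arm trial under a conjugate normal model. The model details are assumptions chosen for simplicity: a known common outcome standard deviation `sigma`, equal allocation with `n_per_arm` patients per arm, and a normal prior on the treatment difference. The function name and its arguments are hypothetical.

```python
from statistics import NormalDist


def posterior_prob_exceeds(mean_diff, n_per_arm, delta=0.0, sigma=1.0,
                           prior_mean=0.0, prior_sd=10.0):
    """Pr(theta > delta | data) for a treatment difference theta.

    Assumptions (illustrative): the observed difference in arm means is
    N(theta, 2 * sigma**2 / n_per_arm) and theta ~ N(prior_mean, prior_sd**2).
    """
    like_var = 2.0 * sigma ** 2 / n_per_arm
    post_prec = 1.0 / prior_sd ** 2 + 1.0 / like_var
    post_mean = (prior_mean / prior_sd ** 2 + mean_diff / like_var) / post_prec
    post_sd = post_prec ** -0.5
    return 1.0 - NormalDist(post_mean, post_sd).cdf(delta)


if __name__ == "__main__":
    # Same data, evaluated against delta = 0 and against a positive margin.
    print(posterior_prob_exceeds(0.5, 100, delta=0.0))
    print(posterior_prob_exceeds(0.5, 100, delta=0.3))
```

Raising Delta can only lower the posterior probability for the same data, so a stopping rule with a fixed threshold becomes harder to meet as the margin grows.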

Compared to their frequentist counterparts, Bayesian designs involve additional complexities such as prior elicitation and computational challenges when the posterior distribution is not analytically tractable. Still, Bayesian designs have certain advantages (see, e.g., [

In the upcoming sections, we review some other types of Bayesian sequential designs whose early stopping rules are not directly based on

For the single-arm trial example, we have
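A posterior predictive probability of success can be approximated by simulation, as in the hedged sketch below. It assumes, for illustration only, N(theta, 1) outcomes, a N(0, prior_var) prior, and a final-analysis success criterion of Pr(theta > 0 | all data) > gamma_final; none of these values come from the text, and the function `ppos` is a hypothetical name.

```python
import numpy as np
from statistics import NormalDist


def ppos(sum_obs, n_obs, n_max, gamma_final=0.95, prior_var=100.0,
         n_draws=100000, seed=0):
    """Monte Carlo posterior predictive probability of trial success.

    Assumptions (illustrative): y_i ~ N(theta, 1), theta ~ N(0, prior_var),
    success at the final analysis if Pr(theta > 0 | all n_max outcomes)
    exceeds gamma_final.
    """
    rng = np.random.default_rng(seed)
    # Current posterior of theta given the interim data.
    post_var = 1.0 / (1.0 / prior_var + n_obs)
    post_mean = post_var * sum_obs
    theta = rng.normal(post_mean, np.sqrt(post_var), size=n_draws)
    # Sum of the remaining outcomes, drawn from the posterior predictive.
    n_rem = n_max - n_obs
    future_sum = rng.normal(n_rem * theta, np.sqrt(n_rem))
    # Posterior at the final analysis for each simulated completion.
    final_var = 1.0 / (1.0 / prior_var + n_max)
    final_mean = final_var * (sum_obs + future_sum)
    z_crit = NormalDist().inv_cdf(gamma_final)
    return float(np.mean(final_mean / np.sqrt(final_var) > z_crit))


if __name__ == "__main__":
    print(ppos(sum_obs=30.0, n_obs=100, n_max=200))  # promising interim data
    print(ppos(sum_obs=0.0, n_obs=100, n_max=200))   # neutral interim data
```

Under this model the predictive probability also has a closed form; the Monte Carlo version is shown because it carries over to models without conjugacy.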

As described in Section

We illustrate the idea of decision-theoretic designs through the single-arm trial example. Let

Suppose that the loss of making decision

We also assume the loss of making decision

At analysis

[

We summarize in Table

Summary of methods and measures that give rise to different types of sequential designs.

| Method/measure | Stopping criteria for efficacy | Design parameters |
| --- | --- | --- |
| Posterior probability | Posterior probability (PP) of drug being efficacious exceeds a prespecified threshold | Prior for treatment effect; PP thresholds at interim and final analyses |
| Posterior predictive probability | Posterior predictive probability of trial success (PPOS) exceeds a prespecified threshold | Prior for treatment effect; PP threshold at final analysis; PPOS thresholds at interim analyses |
| Decision-theoretic | Efficacy stopping minimizes posterior expected loss for a prespecified loss function | Prior for treatment effect; loss functions associated with possible decisions |
| Frequentist group sequential | Test statistic exceeds a prespecified stopping boundary | Stopping boundaries for test statistics that define a critical region |
| Stochastic curtailment | Conditional power (CP) of trial success, given a hypothetical treatment effect, exceeds a prespecified threshold | Critical value for test statistic at final analysis; CP thresholds at interim analyses |

Statistical inference and decision making in sequential clinical trials are typically tied to the LP. We discuss the LP and its implications in this section.

Let

[

What would be the consequences if we accept the LP? Since the LP deals only with the observed

As an illustration, consider the example given by [

Although the LP seems compelling, it has been a source of controversy. Under the Bayesian paradigm, for any specified prior distribution for

The conflict here does not mean we have to either reject the LP or reject frequentist procedures. As explained previously (e.g., [

Posterior expected losses, as functions of the

Still, the conflict does suggest that if we accept the LP, then frequentist measures such as type I/II error rates and

It should also be noted that not all Bayesian procedures are in compliance with the LP. For example, eliciting the prior for

As an illustration of the frequentist-oriented approach, we calculate the stopping boundaries for the

For stopping boundaries based on posterior probabilities (Equation

For stopping boundaries based on posterior predictive probabilities (Section

For the Bayesian decision-theoretic design (Section

The stopping boundaries are summarized in Table

Stopping boundaries for the

| Analysis | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| No. of patients | 200 | 400 | 600 | 800 | 1000 |
| Post. prob. (ver. 1) | 2.71 | 2.24 | 2.06 | 1.97 | 1.91 |
| Post. prob. (ver. 2) | 2.13 | 2.12 | 2.12 | 2.12 | 2.12 |
| Post. pred. prob. | 2.50 | 2.26 | 2.18 | 2.11 | 1.84 |
| Decision-theoretic | 2.33 | 2.22 | 2.15 | 2.09 | 1.91 |
| Pocock | 2.12 | 2.12 | 2.12 | 2.12 | 2.12 |
| O’Brien-Fleming | 3.92 | 2.77 | 2.26 | 1.96 | 1.75 |
| Linear error spending | 2.33 | 2.22 | 2.12 | 2.03 | 1.96 |
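Boundaries of this kind can be checked by simulation. The sketch below reads the tabulated values for the three frequentist designs as one-sided critical values for a standardized z-statistic at five equally spaced analyses; that reading is an assumption, as is the comparison row that reuses the unadjusted one-sided 5% critical value 1.645 at every look. The function `overall_type1_error` is a hypothetical helper.

```python
import numpy as np


def overall_type1_error(boundaries, n_sims=200000, seed=0):
    """One-sided overall type I error of a group sequential z-test.

    Assumes equally spaced looks, so that under the null the z-statistic
    at look k is a sum of k independent N(0, 1) increments divided by sqrt(k).
    """
    rng = np.random.default_rng(seed)
    k = len(boundaries)
    increments = rng.normal(size=(n_sims, k))
    z = np.cumsum(increments, axis=1) / np.sqrt(np.arange(1, k + 1))
    # The null is rejected if the boundary is crossed at any look.
    return float((z >= np.asarray(boundaries)).any(axis=1).mean())


designs = {
    "Pocock": [2.12, 2.12, 2.12, 2.12, 2.12],
    "O'Brien-Fleming": [3.92, 2.77, 2.26, 1.96, 1.75],
    "Linear error spending": [2.33, 2.22, 2.12, 2.03, 1.96],
    "Unadjusted (assumed comparison)": [1.645] * 5,
}

if __name__ == "__main__":
    for name, bounds in designs.items():
        print(name, overall_type1_error(bounds))
```

Estimates close to 0.05 for the three adjusted designs would be consistent with this reading of the table, while the unadjusted row shows how much the error rate grows when the fixed-sample critical value is reused at each analysis. The O'Brien-Fleming row also follows the familiar pattern in which the boundary at look k equals the final boundary multiplied by the square root of 5/k.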

Visualization of the stopping boundaries given by different sequential designs, and comparison of the frequentist properties (power and expected sample size) of the designs for hypothetical values of

Figure

To demonstrate the calibrated Bayesian approach, we conduct simulation studies to explore the operating characteristics of a Bayesian design under a variety of plausible scenarios. Consider the single-arm trial example in Section

We consider 72 simulation scenarios, one for each combination of

For each scenario, we simulate

Operating characteristics of the Bayesian design with stopping rules given by Equation (

Coverage (%) | ||||||||||||

0.5 | 1 | 10 | 0.5 | 1 | 10 | 0.5 | 1 | 10 | ||||

1 | 0.8 | 0.6 | 0.8 | 0.9 | 0.5 | 0.4 | 0.5 | 0.6 | 95.0 | 95.2 | 95.3 | 94.7 |

2 | 1.1 | 1.5 | 1.5 | 1.4 | 0.7 | 1.0 | 1.0 | 0.9 | 94.9 | 95.4 | 94.8 | 94.9 |

5 | 1.8 | 2.8 | 3.6 | 3.1 | 1.2 | 2.0 | 2.4 | 2.1 | 94.9 | 94.7 | 94.1 | 94.5 |

10 | 2.7 | 4.8 | 4.8 | 5.2 | 1.9 | 3.6 | 3.5 | 3.9 | 95.0 | 94.1 | 93.9 | 93.9 |

100 | 4.2 | 11.3 | 11.7 | 12.1 | 2.9 | 9.7 | 10.3 | 10.7 | 95.1 | 93.1 | 91.8 | 91.5 |

1000 | 5.2 | 15.1 | 19.9 | 22.5 | 3.9 | 13.5 | 19.6 | 23.5 | 95.3 | 93.7 | 91.2 | 88.1 |

0.1 | 1 | 10 | 0.1 | 1 | 10 | 0.1 | 1 | 10 | ||||

1 | 0.1 | 0.1 | 0.1 | 0.2 | 0.1 | 0.1 | 0.1 | 0.2 | 73.0 | 95.2 | 94.7 | 94.8 |

2 | 0.2 | 0.3 | 0.4 | 0.1 | 0.2 | 0.3 | 0.4 | 0.1 | 67.4 | 94.9 | 94.5 | 95.3 |

5 | 0.3 | 0.7 | 0.4 | 0.3 | 0.3 | 0.7 | 0.3 | 0.3 | 60.5 | 94.7 | 95.2 | 95.3 |

10 | 0.6 | 0.8 | 0.8 | 0.8 | 0.5 | 0.7 | 0.7 | 0.7 | 58.3 | 95.2 | 95.0 | 95.2 |

100 | 0.9 | 2.3 | 2.7 | 3.2 | 0.8 | 2.2 | 2.6 | 3.2 | 56.8 | 95.2 | 94.8 | 94.0 |

1000 | 0.8 | 3.2 | 5.8 | 8.6 | 0.8 | 3.2 | 6.0 | 8.7 | 57.1 | 95.2 | 94.4 | 92.2 |

0.1 | 0.5 | 10 | 0.1 | 0.5 | 10 | 0.1 | 0.5 | 10 | ||||

1 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 0.1 | 46.8 | 94.8 | 95.1 | 94.9 |

2 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 40.8 | 94.7 | 94.8 | 95.3 |

5 | 0.1 | 0.2 | 0.2 | 0.2 | 0.1 | 0.2 | 0.2 | 0.2 | 36.6 | 94.4 | 95.1 | 95.0 |

10 | 0.1 | 0.5 | 0.4 | 0.4 | 0.1 | 0.5 | 0.4 | 0.4 | 34.9 | 94.5 | 94.8 | 94.8 |

100 | 0.3 | 1.5 | 1.3 | 1.2 | 0.3 | 1.4 | 1.3 | 1.2 | 34.2 | 90.7 | 95.1 | 94.8 |

1000 | 0.3 | 2.2 | 3.5 | 5.1 | 0.3 | 2.2 | 3.5 | 5.3 | 33.8 | 87.6 | 94.7 | 93.4 |

Table

In the presence of model misspecification, however, Bayesian statements may not attain their asserted coverage, and the discrepancy becomes larger with more frequent applications of data-dependent stopping rules. These results are consistent with the findings in [

From a calibrated Bayesian point of view, simulation studies of this type can be used to guide the choice of

We do not present additional numerical studies for the subjective Bayesian approach, under which the prior and threshold values may be chosen based on subjective beliefs rather than simulations.

We have summarized three perspectives on Bayesian sequential designs, namely the frequentist-oriented perspective, the subjective Bayesian perspective, and the calibrated Bayesian perspective, and have discussed their implications. We have reviewed Bayesian sequential designs based on posterior probabilities, posterior predictive probabilities, and decision-theoretic frameworks. We have also commented on the role of the LP in sequential trial designs. While the LP implies that unrealized events are irrelevant to the statistical evidence about the treatment effect, it gives little guidance in assessing a decision procedure and thus does not preclude the use of additional information in decision making.

So far, we have only considered early stopping for efficacy. In practice, it may be desirable to allow for early stopping when interim results suggest the investigational drug is unlikely to have a clinically meaningful treatment effect [

Two-sided tests and point null hypotheses are very common in clinical trials. For example, for the single-arm trial in Section

From a frequentist perspective, the issue of type I error rate inflation (or multiplicity) can arise from repeatedly testing a single hypothesis over time, or testing multiple hypotheses simultaneously [

Several R packages have been developed to facilitate the use of frequentist and Bayesian sequential designs in clinical trials. These include