Lessons Learned from the Bayesian Design and Analysis for the BNT162b2 COVID-19 Vaccine Phase 3 Trial

Ji, Yuan; Yuan, Shijie

doi:10.51387/26-NEJSDS93

Abstract

The phase III BNT162b2 mRNA COVID-19 vaccine trial is based on a Bayesian design and analysis, and the main evidence of vaccine efficacy is presented using Bayesian statistics. Confusion and mistakes arise in the presentation of the Bayesian results. Some key statistics, such as Bayesian credible intervals, are mislabeled and stated as confidence intervals. Posterior probabilities of the vaccine efficacy are not reported as the main results. We illustrate the main differences in the reporting of Bayesian analysis results for a clinical trial and provide four recommendations. We argue that statistical evidence from a Bayesian trial, when presented properly, is easier to interpret and directly addresses the main clinical questions, thereby better supporting regulatory decision making. We also recommend using the abbreviation “BI” for Bayesian credible interval, to distinguish it from “CI”, which stands for confidence interval.

1 Introduction

The phase III BNT162b2 COVID-19 vaccine trial uses a Bayesian design and analysis method for the primary efficacy endpoints [8]. Participants are randomized with a 1:1 ratio to receive the vaccine or the placebo. The primary efficacy endpoints are based on $\text{VE}=100\ast (1-IRR)$, in which $IRR$ is computed as the ratio of first confirmed COVID-19 illness rate in the vaccine group to the corresponding illness rate in the placebo group [7].
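
As a minimal sketch of this definition, the snippet below computes VE from case counts. The function name and the equal-exposure default are assumptions made for illustration. The counts 8 and 162 follow from the first-endpoint case split given in the Supplemental Material ($X=8$ of $N=170$ cases).

```python
def vaccine_efficacy(cases_vaccine, cases_placebo,
                     exposure_vaccine=1.0, exposure_placebo=1.0):
    """Return VE = 100 * (1 - IRR) in percent.

    IRR is the ratio of the illness rate in the vaccine group to the rate
    in the placebo group. The exposures default to equal values, which is
    an illustrative assumption (1:1 randomization, equal surveillance time).
    """
    rate_vaccine = cases_vaccine / exposure_vaccine
    rate_placebo = cases_placebo / exposure_placebo
    irr = rate_vaccine / rate_placebo
    return 100.0 * (1.0 - irr)


# 8 vaccine cases and 162 placebo cases, equal exposure assumed.
ve_observed = vaccine_efficacy(8, 162)
print(f"Observed VE under equal exposure = {ve_observed:.2f}%")
```

When surveillance time differs between groups, the exposures can be passed explicitly, and the estimate shifts accordingly.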

The BNT162b2 vaccine has received emergency use authorization (EUA) from the US FDA and from regulators in other countries and regions. The decision is based on the totality of the evidence, including the efficacy and safety of the vaccine, from a phase I/II/III trial reviewed by the US FDA and the Vaccines and Related Biological Products Advisory Committee (VRBPAC). On December 10, 2020, the VRBPAC held a public meeting and voted overwhelmingly to support the EUA.

The safety of the vaccine has been adequately reviewed and is not the main focus of our work. Instead, we discuss the trial evidence on vaccine efficacy. In particular, we show that the presentation of the trial data and primary efficacy results is not compatible with the Bayesian framework, and that mistakes are made in the interpretation of the results. While these missteps do not change the regulatory decision for the Pfizer/BioNTech BNT162b2 vaccine, thanks to its superior efficacy, it is nonetheless critical to discuss and correct the mistakes for future trials that use Bayesian designs and methods. For example, the distinction in the definition and interpretation of the Bayesian credible interval and the frequentist confidence interval must be clearly explained to properly assess the clinical evidence for decision making.

In [8], we identify two main missed opportunities in the use of Bayesian statistics. First, credible intervals are not accurately interpreted and are occasionally misrepresented as confidence intervals. Second, the presentation and elaboration of the posterior probabilities of the true vaccine efficacy are not emphasized. This is an important point since, unlike p-values, posterior probabilities directly answer the clinical question of the trial. A minor point is that the Bayesian models used in the trial are not clearly described. We reconstruct the model and reproduce the main efficacy results in [8]. We also propose an alternative Bayesian model that may be more compatible with the statistical sampling scheme of the design. Statistical code for these analyses is presented in the Supplemental Material.

2 Problems and Suggestions

2.1 Bayesian Credible Interval

The 95% credible intervals presented in Table 2 of [8] are correct. However, in the introductory text to Table 2, the article mistakenly refers to them as confidence intervals. Specifically, page 2609 of the published article (Polack et al., 2020) states:

“This case split corresponds to 95.0% vaccine efficacy (95% confidence interval [CI], 90.3 to 97.6; Table 2).”

In this instance, the article actually refers to a Bayesian credible interval, not a confidence interval.

The main difference between confidence and credible intervals lies in their interpretations and in the evidence the two intervals represent. In particular, a 95% credible interval $(90.3\% ,97.6\% )$ for VE means that, given the data, the probability that VE is greater than 90.3% but smaller than 97.6% is 0.95. In contrast, a 95% confidence interval $(90.3\% ,97.6\% )$ for VE is interpreted as follows: were the vaccine trial repeated numerous times, the fraction of the calculated confidence intervals (which would be different for each trial) that encompass the true VE would tend towards 95% [3]. Thus, the Bayesian credible interval directly evaluates the probability of the true vaccine efficacy based on the observed trial data, whereas the frequentist confidence interval provides only an indirect assessment that assumes the vaccine trial were repeated numerous times [6].
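
The contrast can be written compactly. In the sketch below, the notation $L$, $U$ and $D$ is ours and is not taken from [8]: $L$ and $U$ are the interval limits and $D$ denotes the trial data. For a 95% credible interval, the data are fixed and VE is the random quantity,

\[ \Pr (L\le \text{VE}\le U\mid D)=0.95,\]

whereas for a 95% confidence interval, VE is fixed and the limits are random functions of the data,

\[ \Pr \big(L(D)\le \text{VE}\le U(D)\mid \text{VE}\big)=0.95,\]

where the probability is taken over hypothetical repetitions of the trial.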

This mislabeling of Bayesian credible intervals as frequentist confidence intervals is pervasive in the literature. For example, in [4], the authors report results of a Bayesian meta-analysis but refer to the intervals as confidence intervals. Similarly, [2] initially define “CI” as confidence interval for a frequentist meta-analysis, yet continue to use the same term when presenting results from a Bayesian meta-analysis, which should be reported using credible intervals instead. Likewise, [5] apply a Bayesian classifier, but refer only to confidence intervals throughout, failing to distinguish the Bayesian framework from frequentist terminology.

Therefore, we recommend that authors explicitly include the interpretation of the Bayesian credible interval when it is first introduced in a publication to avoid confusion between the Bayesian credible interval and the confidence interval. Since part of the confusion is due to the shared abbreviation “CI” for both terms, which may mislead investigators into assuming they are interchangeable, we suggest using the abbreviation “BI” for Bayesian credible interval in publications, to clearly distinguish it from “CI”.

2.2 Posterior Probability

Interestingly, the posterior probabilities of the BNT162b2 VE are reported only in the DISCUSSION section of [8], not in the RESULTS section. The EFFICACY subsection of RESULTS focuses on the credible intervals, which are mistakenly labeled as confidence intervals. In addition, the credible intervals are not directly linked to a clinically meaningful effect, such as a VE greater than a clinical threshold like 30%.

An advantage of the Bayesian modeling for the BNT162b2 vaccine trial data is the ability to report the vaccine efficacy with probabilistic statements, a feature that is not available through p-values or confidence intervals. Clinically, a direct answer to the vaccine efficacy based on clinical trial data is a statement like the following: “Given the observed efficacy data in the BNT162b2 trial, the probability that the true vaccine efficacy exceeds x% is greater than y,” in which x is prespecified by investigators as a meaningful clinical threshold, and y is calculated based on Bayesian models.

3 Discussion

3.1 Recommended Statistical Reporting for a Bayesian Trial

Bayesian results provide direct answers to clinical questions. To see this, we list four recommended Bayesian reporting elements for clinicians and decision makers.

1) Report the posterior probability of clinical benefits or treatment effects. For example, $\Pr (\text{VE}\gt X\mid data)=Y$ states directly that, given the observed trial data, the true vaccine efficacy is greater than X with probability Y. In the case of the BNT162b2 trial, $\Pr (\text{VE}\gt 30\% \mid data)\gt 0.9999$. This means that with a probability larger than 0.9999 the vaccine efficacy is greater than 30%. In fact, it can be shown (Supplemental Material) that $\Pr (\text{VE}\gt 90\% \mid data)=0.98$, which means that with probability 0.98 the vaccine efficacy is greater than 90%. This statement is arguably more informative for decision making and reflects the superior efficacy of the vaccine. For example, it implies that there is only a 2% chance that the vaccine is less than 90% efficacious.
2) Report the Bayesian credible interval (BI) and interpret it using a probability statement. For example, in the case of the BNT162b2 trial, the 95% credible interval (90.3, 97.6) of VE means that with 95% probability, the true vaccine efficacy is greater than 90.3% and less than 97.6%, given the observed trial data. We recommend using the abbreviation “BI” for Bayesian credible interval to distinguish it from “CI”, which stands for confidence interval.
3) Report posterior distribution (probability) of treatment effects and overlay it with the regulatory thresholds, if possible. For example, for the BNT162b2 vaccine trial, Figure 1 shows the histogram of VE based on its posterior distribution. It is clear that the vaccine efficacy is much higher than the regulatory thresholds of 0.3 and 0.5 [10], with most probability mass pointing to values greater than 0.8. In the BNT162b2 trial, the posterior probability that VE is greater than 0.3 or 0.5 is greater than 0.99.
4) Report the complete Bayesian models, including the prior distributions and the likelihood functions. This provides transparency so that the assumptions of the models can be assessed and critiqued. In the Supplemental Material, we present two such models, one reproducing the results in [8] and the other with a more natural interpretation.
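
The first two reporting elements can be produced mechanically from posterior draws of VE. The following is a minimal sketch, not the code used for the trial: the function name `bayesian_report`, the default thresholds, and the evenly spaced placeholder draws are all assumptions made for illustration, and the placeholder draws are not trial results.

```python
def bayesian_report(ve_draws, thresholds=(0.30, 0.50), level=0.95):
    """Turn posterior draws of VE (on the 0-1 scale) into report lines."""
    draws = sorted(ve_draws)
    n = len(draws)
    tail = (1.0 - level) / 2.0
    # Equal-tailed interval from the empirical quantiles of the draws.
    lower = draws[round(tail * (n - 1))]
    upper = draws[round((1.0 - tail) * (n - 1))]
    lines = []
    for x in thresholds:
        # Posterior probability that VE exceeds the threshold x.
        prob = sum(d > x for d in draws) / n
        lines.append(f"Pr(VE > {100 * x:.0f}% | data) = {prob:.4f}")
    lines.append(
        f"{100 * level:.0f}% BI for VE: ({100 * lower:.1f}, {100 * upper:.1f})"
    )
    lines.append(f"Posterior mean of VE: {100 * sum(draws) / n:.1f}")
    return lines


# Hypothetical placeholder draws, evenly spaced on [0, 1]; NOT trial results.
toy_draws = [i / 1000 for i in range(1001)]
report = bayesian_report(toy_draws)
for line in report:
    print(line)
```

In practice, `ve_draws` would come from the posterior distribution of the fitted model, such as the models described in the Supplemental Material.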

Figure 1

The posterior distribution of the BNT162b2 VE based on the beta-binomial model in [8]. The blue curve is the posterior density of VE. The red lines mark the limits of the 95% credible interval. The two dotted lines represent the two VE thresholds, 0.3 and 0.5, mentioned in the FDA guidance for COVID-19 vaccine efficacy. Specifically, a vaccine must exhibit an observed VE of at least 0.5, and the lower bound of the 95% confidence interval must be greater than 0.3, in order to be considered for authorization.

3.2 Bayesian Models and Inference for the BNT162b2 Trial

The details of the Bayesian models and inference used in the BNT162b2 vaccine trial were not reported in either the trial protocol or the publication [8]. We reproduced the reported Bayesian results of the primary efficacy analysis in [8]; see the Supplemental Material for details of the model that reproduces them. This model would not be our first choice, however, since it is not compatible with the sampling scheme implied by the trial design. An alternative model that is more natural and compatible with the trial design is presented in the Supplemental Material; its BI is (90.9%, 97.9%), which is slightly narrower than the reported BI.

Supplemental Material

Reproducible Model

We report the following simple beta-binomial model that reproduces the BIs in [8]. This model assumes that the number of COVID cases in the vaccine group is sampled as a binomial random variable from the total number of COVID cases in both groups, vaccine and placebo. In mathematics, this means

\[ X\mid N,\theta \sim \text{Bin}(N,\theta ),\]

where X denotes the number of cases in the BNT162b2 group and N the total number of cases in both groups. Therefore, $(N-X)$ is the number of cases in the placebo group. Here, θ is interpreted as the probability that an observed COVID case is from the vaccine group and $(1-\theta )$ is the probability that it is from the placebo group, when a COVID case is observed. Note that the probability sampling space is restricted to only the COVID cases, not including any non-cases.

A beta prior Beta(0.700102,1) is proposed for θ in the BNT162b2 protocol [1]. Also, the trial protocol assumes $\theta =(1-\text{VE})/(2-\text{VE})$. This assumption means that $\text{VE}=1-\theta /(1-\theta )$, which resembles the definition of $\text{VE}=1-IRR$. However, it is important to note that θ is not the COVID rate in the vaccine group.
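
To make the algebra explicit, the following steps invert the protocol's relation and translate a VE threshold into a threshold on θ. The symbol $x$ for a generic threshold is ours. Starting from $\theta =(1-\text{VE})/(2-\text{VE})$,

\[ \theta (2-\text{VE})=1-\text{VE}\hspace{1em}\Longrightarrow \hspace{1em}\text{VE}(1-\theta )=1-2\theta \hspace{1em}\Longrightarrow \hspace{1em}\text{VE}=\frac{1-2\theta }{1-\theta }=1-\frac{\theta }{1-\theta }.\]

Comparing this with $\text{VE}=1-IRR$ shows that the odds $\theta /(1-\theta )$ play the role of $IRR$. Because VE is decreasing in θ on $(0,1)$,

\[ \text{VE}\gt x\hspace{1em}\Longleftrightarrow \hspace{1em}\theta \lt \frac{1-x}{2-x},\]

so that, for example, $\text{VE}\gt 30\% $ corresponds to $\theta \lt 0.7/1.7\approx 0.4118$ and $\text{VE}\gt 90\% $ corresponds to $\theta \lt 0.1/1.1\approx 0.0909$.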

Following this model, with fixed N and X, the posterior distribution of θ is also a beta distribution,

\[ \theta \mid N,X\sim \text{Beta}(0.700102+X,1+N-X).\tag{3.1}\]

For the BNT162b2 trial, $X=8$ and $N=170$ for ${\text{VE}_{1}}$ and $X=9$ and $N=178$ for ${\text{VE}_{2}}$, where ${\text{VE}_{1}}$ and ${\text{VE}_{2}}$ are the two primary efficacy endpoints of the trial. Therefore, the BIs of VE can be calculated by sampling θ from its beta posterior distribution (3.1) and are shown in Table S.1 below. They are identical to the reported BIs in [8]. In addition, using the posterior distribution of θ, we can easily calculate the posterior probabilities of VEs. For example, $\Pr ({\text{VE}_{1}}\gt 30\% \mid \text{data})\gt 0.9999$ and $\Pr ({\text{VE}_{2}}\gt 90\% \mid \text{data})=0.98$. See the attached computer program for details.
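
The sampling procedure just described can be sketched in a few lines. This is an illustrative sketch using only the Python standard library, not the attached program: the function names, the number of draws, and the seed are our assumptions, and the interval is taken to be equal-tailed. Only the prior, $X$, and $N$ come from the text.

```python
import random

PRIOR_A, PRIOR_B = 0.700102, 1.0  # Beta prior for theta stated in the text


def sample_ve(x, n, n_draws=100_000, seed=2020):
    """Draw VE = 1 - theta / (1 - theta), theta from the posterior (3.1)."""
    rng = random.Random(seed)
    a, b = PRIOR_A + x, PRIOR_B + n - x
    draws = []
    for _ in range(n_draws):
        theta = rng.betavariate(a, b)
        draws.append(1.0 - theta / (1.0 - theta))
    return sorted(draws)


def summarize(sorted_draws, level=0.95):
    """Return (mean, lower limit, upper limit) of an equal-tailed interval."""
    n = len(sorted_draws)
    tail = (1.0 - level) / 2.0
    lower = sorted_draws[round(tail * (n - 1))]
    upper = sorted_draws[round((1.0 - tail) * (n - 1))]
    return sum(sorted_draws) / n, lower, upper


# First primary endpoint: X = 8 vaccine cases among N = 170 total cases.
ve1 = sample_ve(x=8, n=170)
mean1, lower1, upper1 = summarize(ve1)
prob_gt_30 = sum(v > 0.30 for v in ve1) / len(ve1)
prob_gt_90 = sum(v > 0.90 for v in ve1) / len(ve1)
print(f"VE1 mean {100 * mean1:.1f}, "
      f"95% BI ({100 * lower1:.1f}, {100 * upper1:.1f})")
print(f"Pr(VE1 > 30% | data) = {prob_gt_30:.4f}")
print(f"Pr(VE1 > 90% | data) = {prob_gt_90:.3f}")
```

Up to Monte Carlo error, the interval should land close to the reported BI for ${\text{VE}_{1}}$. Calling `sample_ve(x=9, n=178)` gives the corresponding draws for ${\text{VE}_{2}}$.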

Table S.1

Vaccine Efficacy with the prior of θ.

${^{\ast }}$The reported values of VE are the observed $\text{VE}=1-IRR$, the same as in [8], where $IRR$ is based on the observed COVID cases and sample sizes for both groups. We would recommend reporting the posterior means as well, which are 94.6 and 94.3 for ${\text{VE}_{1}}$ and ${\text{VE}_{2}}$, respectively, using the reproducible model. This is in alignment with the FDA guidance on the use of Bayesian methods [9].
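
These posterior means can be checked in closed form under the reproducible model. If $\theta \sim \text{Beta}(a,b)$ with $b\gt 1$, then $E[\theta /(1-\theta )]=a/(b-1)$. Taking $a=0.700102+X$ and $b=1+N-X$ from (3.1) gives

\[ E(\text{VE}\mid N,X)=1-\frac{0.700102+X}{N-X}.\]

For ${\text{VE}_{1}}$ this is $1-8.700102/162\approx 0.9463$, and for ${\text{VE}_{2}}$ it is $1-9.700102/169\approx 0.9426$, which round to the values 94.6 and 94.3 quoted above.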

An Alternative Model

We also propose an alternative model that is more natural and compatible with the trial design. Recall that the design of the BNT162b2 trial first enrolls participants without a COVID diagnosis, and then follows them for a certain time period to observe disease occurrence. This means that for each of the two groups, vaccine and placebo, a binomial sampling is carried out, assuming a homogeneous disease rate within each group and ignoring differences in surveillance time across participants. In particular, let ${p_{1}}$ and ${p_{2}}$ denote the probabilities of COVID for the vaccine and placebo groups, respectively. Among the ${N_{1}}$ participants in the vaccine group and ${N_{2}}$ in the placebo group, let ${X_{1}}$ and ${X_{2}}$ represent the corresponding numbers of COVID cases.

Since participants in the vaccine and placebo groups are treated and followed independently, we assume the following independent binomial sampling distributions, i.e.,

\[ {X_{i}}\mid {N_{i}},{p_{i}}\sim \text{Bin}({N_{i}},{p_{i}}),\hspace{1em}i=1,2.\]

Note that by definition, $\text{VE}=1-{p_{1}}/{p_{2}}$. Therefore, a Bayesian model and inference is completed by a prior and posterior distribution of $({p_{1}},{p_{2}})$.

We assume that ${p_{1}}$ and ${p_{2}}$ follow independent improper prior distributions, $f({p_{i}})\propto {p_{i}^{-1}}{(1-{p_{i}})^{-1}}$, which is the limiting $\text{Beta}(0,0)$ prior. When $0\lt {X_{i}}\lt {N_{i}}$, this prior leads to proper independent posterior distributions, given by ${p_{i}}\mid {X_{i}},{N_{i}}\sim \text{Beta}({X_{i}},{N_{i}}-{X_{i}})$, $i=1,2$. Using the two beta posterior distributions, the estimated VE and its BIs can be calculated numerically by random sampling from the two beta distributions (see the attached computer program). For comparison with the results in [8], Table S.2 presents the BIs obtained under the alternative model. The first BI (90.9, 97.9) in the table is slightly narrower than the one (90.3, 97.6) in the paper. More importantly, this model yields the posterior distributions of ${p_{1}}$ and ${p_{2}}$, the two infection rates for the vaccine and placebo groups, as shown in Figure S.1.
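
A minimal sketch of this calculation is given below, using the posterior $\text{Beta}({X_{i}},{N_{i}}-{X_{i}})$ stated above. It is not the attached program. The group sizes `N1` and `N2` are hypothetical placeholders, since the text does not give them, and the function name, number of draws, and seed are likewise our assumptions. The case counts follow from $X=8$ and $N=170$ for the first endpoint.

```python
import random

# First-endpoint case split from the text: 8 vaccine cases among 170 total.
X1, X2 = 8, 170 - 8
# Hypothetical, equal group sizes for illustration; NOT given in the text.
N1, N2 = 18_000, 18_000


def sample_ve_alternative(x1, n1, x2, n2, n_draws=100_000, seed=2020):
    """Draw VE = 1 - p1 / p2 with p_i ~ Beta(x_i, n_i - x_i), independently."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        p1 = rng.betavariate(x1, n1 - x1)  # vaccine-group infection rate
        p2 = rng.betavariate(x2, n2 - x2)  # placebo-group infection rate
        draws.append(1.0 - p1 / p2)
    return sorted(draws)


ve_alt = sample_ve_alternative(X1, N1, X2, N2)
n_alt = len(ve_alt)
mean_alt = sum(ve_alt) / n_alt
# Equal-tailed 95% interval from the empirical quantiles.
lower_alt = ve_alt[round(0.025 * (n_alt - 1))]
upper_alt = ve_alt[round(0.975 * (n_alt - 1))]
print(f"VE1 mean {100 * mean_alt:.1f}, "
      f"95% BI ({100 * lower_alt:.1f}, {100 * upper_alt:.1f})")
```

Because the case counts are small relative to any realistic group size, the result depends only weakly on the placeholder values of `N1` and `N2`, as long as the two groups are of similar size.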

Table S.2

Vaccine Efficacy with the independent priors of ${p_{1}}$ and ${p_{2}}$.

${^{\ast }}$Under the alternative model, the posterior means of ${\text{VE}_{1}}$ and ${\text{VE}_{2}}$ are also 95.0 and 94.6, respectively.

Figure S.1

The posterior distributions of the infection rates for the BNT162b2 vaccine and placebo groups based on the alternative model. The blue curves show the vaccine group and the red curves show the placebo group; the separation between them indicates that the vaccine is highly efficacious relative to the placebo.

Sensitivity Analysis

The prior Beta(0.700102,1) is chosen to reflect a conservative assumption $\theta =0.4118$ (corresponding to $\text{VE}=30\% $) while allowing for substantial uncertainty [1]. For comparison, we include additional prior settings to evaluate performance across the two models, as summarized in Table S.3. Specifically, we let ${p_{1}}\sim \text{Beta}(0.35,0.65)$ and ${p_{2}}\sim \text{Beta}(0.5,0.5)$ in the alternative model, reflecting the same conservative assumption of VE = 30% and keeping the prior weakly informative.
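
The correspondence between these priors and $\text{VE}=30\% $ can be checked with a few lines of arithmetic. In the sketch below, the helper names are ours. VE is evaluated at the prior means of the parameters, which is not the same as the prior mean of VE itself.

```python
def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


def ve_from_theta(theta):
    """VE implied by theta via VE = 1 - theta / (1 - theta)."""
    return 1.0 - theta / (1.0 - theta)


# Reproducible model: prior Beta(0.700102, 1) on theta.
theta_prior_mean = beta_mean(0.700102, 1.0)
ve_at_theta_mean = ve_from_theta(theta_prior_mean)

# Alternative model: p1 ~ Beta(0.35, 0.65) and p2 ~ Beta(0.5, 0.5).
p1_prior_mean = beta_mean(0.35, 0.65)
p2_prior_mean = beta_mean(0.5, 0.5)
ve_at_p_means = 1.0 - p1_prior_mean / p2_prior_mean

print(f"theta prior mean = {theta_prior_mean:.4f}, "
      f"implied VE = {ve_at_theta_mean:.3f}")
print(f"p1, p2 prior means = {p1_prior_mean:.2f}, {p2_prior_mean:.2f}, "
      f"implied VE = {ve_at_p_means:.3f}")
```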

As shown in Table S.3, the posterior means and BIs under the improper prior are identical across the two models. Under the weakly informative priors—each encoding $\text{VE}=30\% $ as the prior mean—the posterior means from both models shift closer to 30%. Notably, the Beta(0.700102, 1) prior leads to a slightly smaller posterior mean due to its relatively larger impact compared with Beta(0.0700102, 0.1). However, the change in the posterior inference would not have altered the overall high efficacy of the vaccine.

Table S.3

Posterior means and 95% BIs of VE (%) under different prior specifications in two models.

Efficacy End Point	Reproducible model			Alternative model
	Improper Prior	Beta(0.0700102,0.1)	Beta(0.700102,1)	Improper Prior	Beta(0.35,0.65) & Beta(0.5,0.5)
${\text{VE}_{1}}$	95.0 (90.9, 97.9)	95.0 (90.8, 97.9)	94.6${^{\ast }}$ (90.3, 97.6)	95.0 (90.9, 97.9)	94.8 (90.6, 97.7)
${\text{VE}_{2}}$	94.6 (90.4, 97.6)	94.6 (90.4, 97.6)	94.3${^{\ast }}$ (89.9, 97.3)	94.6 (90.4, 97.6)	94.4 (90.1, 97.4)

${^{\ast }}$The reported VE values are the posterior means from the reproducible model, which differ from the empirically estimated values reported in [8].
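
The posterior means in the reproducible-model columns of Table S.3 can be cross-checked without simulation, using the identity $E[\theta /(1-\theta )]=a/(b-1)$ for $\theta \sim \text{Beta}(a,b)$ with $b\gt 1$. The sketch below is illustrative. Reading the “Improper Prior” column as a $\text{Beta}(0,0)$ prior on θ is our assumption, and the names are ours.

```python
# Prior parameters (a0, b0) for theta in the reproducible model.
# Reading "Improper Prior" as Beta(0, 0) is an assumption, not from the text.
PRIORS = {
    "Improper, assumed Beta(0, 0)": (0.0, 0.0),
    "Beta(0.0700102, 0.1)": (0.0700102, 0.1),
    "Beta(0.700102, 1)": (0.700102, 1.0),
}
ENDPOINTS = {"VE1": (8, 170), "VE2": (9, 178)}  # (X, N) from the text


def posterior_mean_ve(x, n, a0, b0):
    """E[VE | data] in percent for theta ~ Beta(a0 + x, b0 + n - x)."""
    a = a0 + x
    b = b0 + n - x
    if b <= 1.0:
        raise ValueError("E[theta / (1 - theta)] is finite only for b > 1")
    return 100.0 * (1.0 - a / (b - 1.0))


posterior_means = {
    (endpoint, prior): posterior_mean_ve(x, n, a0, b0)
    for endpoint, (x, n) in ENDPOINTS.items()
    for prior, (a0, b0) in PRIORS.items()
}
for (endpoint, prior), value in posterior_means.items():
    print(f"{endpoint}, {prior}: {value:.1f}")
```

The BIs in the table have no comparably simple closed form and are more naturally obtained by sampling or by numerical quantiles of the beta posterior.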
