We appreciate the comments by Cao and Pan [1], who highlighted the major points discussed in our paper, “Considerations for Single-Arm Trials to Support Accelerated Approval of Oncology Drugs” [2], and who also expressed concerns about potential biases and other related issues in single-arm trials (SATs). Specifically, Cao and Pan pointed out that, unlike randomized controlled trials (RCTs), SATs have inherent methodological limitations, such as the lack of a control arm, that can complicate the interpretation of treatment effects. They briefly discussed considerations for analytical methods (e.g., propensity score matching), additional endpoints (e.g., patient-reported outcomes [PROs] and quality of life [QoL]), subgroup analyses, and control of false positives. They also addressed the use of external controls, individual patient data (IPD), issues related to statistical power and small sample sizes, and various potential sources of bias (e.g., selection bias, information bias) in SATs. Ultimately, they argued that SATs, as a component of a broader evidentiary framework, should incorporate historical controls, real-world evidence (RWE), and confirmatory post-marketing studies [1].
While we agree with most of the points raised by Cao and Pan [1], we would like to emphasize that SATs may be appropriate only in specific clinical contexts, such as rare and/or life-threatening cancers with no efficacious treatment options, where RCTs are either (1) unethical or infeasible, as outlined under the “necessary conditions” for SATs in our paper [2], or (2) unnecessary because outcomes under control conditions are well understood (analogous to testing parachutes [3]), as with the cell therapy approved for synovial sarcoma [4]. Relevant regulatory guidance documents [5, 6, 7] also support the appropriate use of SATs in the context of accelerated approval (AA). The desirable conditions presented in Sections 3 and 4 of Lu et al. [2], together with the necessary conditions in Section 2, constitute a comprehensive framework of prerequisites for using SATs to support AA. To address the concerns and critical points raised by Cao and Pan [1], we briefly discuss additional considerations for using SATs in the regulatory approval of oncology drugs.
As discussed in several sections of our paper [2] and in the comments by Cao and Pan [1], a major concern with SATs is the lack of an internal comparator arm, which can introduce biases in comparative effect estimates. The reflection paper by the European Medicines Agency (EMA) [7] summarizes various sources of bias and corresponding mitigation strategies that can be applied during the design, conduct, analysis, and reporting of an SAT. It also acknowledges that these strategies may not fully eliminate bias and that demonstrating unbiased effect estimates may be impossible. To assess potential biases and their magnitude, one may conduct sensitivity analyses that quantify these biases [8, 9] and explore the robustness of study conclusions to various assumptions and sources of bias [10, 11].
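As one illustration of such a sensitivity analysis, a tipping-point analysis asks how high the true response rate under control would have to be before an SAT's conclusion against a fixed benchmark is overturned. The sketch below assumes a binary endpoint (e.g., objective response), a one-sided exact binomial test at alpha = 0.025, and hypothetical counts; it is not necessarily the approach taken in [8, 9, 10, 11], and the function names are illustrative only.

```python
from math import comb


def upper_tail_pvalue(x: int, n: int, p0: float) -> float:
    """One-sided exact binomial p-value P(X >= x) for X ~ Binomial(n, p0)."""
    return sum(comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(x, n + 1))


def tipping_point(x: int, n: int, alpha: float = 0.025, tol: float = 1e-6) -> float:
    """Largest null response rate p0 at which x responders out of n still give
    a one-sided exact p-value below alpha.

    The p-value increases with p0, so bisection on [0, x/n] suffices; at
    p0 = x/n the observed rate equals the null mean and cannot be significant.
    """
    if not 1 <= x <= n:
        raise ValueError("need 1 <= x <= n")
    lo, hi = 0.0, x / n
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if upper_tail_pvalue(x, n, mid) < alpha:
            lo = mid  # still significant: the tipping point lies higher
        else:
            hi = mid  # no longer significant: the tipping point lies lower
    return lo


# Hypothetical SAT: 12 responders among 40 treated patients.
tp = tipping_point(12, 40)
print(f"Conclusion holds only if the true control response rate is below {tp:.3f}")
```

If a plausible control response rate (e.g., one from a healthier or better-managed historical population) could exceed the printed tipping point, the apparent benefit should be regarded as fragile.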
An SAT generally relies on an implicit or explicit external control (EC) to estimate the therapeutic effects of an anticancer drug. The FDA draft guidance on externally controlled trials (ECTs) [12] states that “if the natural history of a disease is well-defined and the disease is known not to improve in the absence of an intervention or with available therapies, historical information can potentially serve as the control group.” An implicit EC is a study-level summary (e.g., an aggregate response rate derived from previous trials or RWE studies) or a literature-reported threshold value (e.g., a quantity derived from a meta-analysis). In contrast, an explicit EC consists of pre-defined IPD from other trials or from real-world data (RWD) sources such as disease registries and electronic medical records [13, 14]. From a design perspective, regulatory agencies [6, 12] recommend pre-specifying the following key elements in an SAT protocol when an EC is used: suitable data sources, baseline eligibility criteria, appropriate exposure definitions and observation windows, clinically meaningful endpoints, analytic methods, and strategies to minimize the effects of missing data and other biases. As part of efforts to reduce potential biases, the estimand framework outlined in ICH E9(R1) [15] should be followed to define precisely the estimand that aligns with the clinical question. When defining the estimand for an ECT (including an SAT), particular attention should be given to potential differences in the frequency and pattern of intercurrent events (ICEs) between the treatment and EC arms; see also Chen et al. [16] for general considerations on estimands in RWE studies. Strategies for handling ICEs should be pre-specified in the protocol or statistical analysis plan so that the resulting estimate addresses the clinical question defined in the protocol.
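To make the notion of an implicit EC concrete, the sketch below derives a benchmark response rate from study-level summaries by fixed-effect, inverse-variance pooling on the logit scale. The historical counts are hypothetical, the 0.5 continuity correction and the fixed-effect model are simplifying assumptions (a random-effects model would be more appropriate when the historical studies are heterogeneous), and the function name is illustrative.

```python
import math


def pooled_response_rate(events, totals, z=1.96):
    """Fixed-effect inverse-variance pooling of response proportions on the
    logit scale; returns (pooled rate, lower, upper) for an approximate 95% CI."""
    weights, logits = [], []
    for x, n in zip(events, totals):
        # Add 0.5 to responders and non-responders to avoid log(0) for 0% or 100% studies.
        a, b = x + 0.5, n - x + 0.5
        logits.append(math.log(a / b))
        # Weight = 1 / approximate variance of the logit of a proportion.
        weights.append(1 / (1 / a + 1 / b))
    w_sum = sum(weights)
    est = sum(w * lg for w, lg in zip(weights, logits)) / w_sum
    se = math.sqrt(1 / w_sum)

    def expit(v):
        return 1 / (1 + math.exp(-v))

    return expit(est), expit(est - z * se), expit(est + z * se)


# Hypothetical historical studies of the control therapy.
rate, lower, upper = pooled_response_rate(events=[8, 12, 5], totals=[50, 60, 40])
print(f"Implicit EC response rate: {rate:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```

One conservative option, under these assumptions, would be to use the upper confidence limit rather than the point estimate as the threshold the SAT must exceed.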
For a causal estimand to be identifiable in ECTs, a fit-for-purpose assessment of the RWD should evaluate the validity of the underlying causal assumptions (consistency, positivity, and exchangeability) in addition to standard data quality metrics such as relevance, reliability, and fitness for research use [16, 17]. From an analysis perspective, the FDA draft guidance [12] emphasizes that the analytic method should be capable of “identifying and managing sources of confounding and bias, including a strategy to account for differences in baseline factors and confounding variables between trial arms.” Because such methods may not fully remove bias, quantitative bias analysis is also recommended to assess the sensitivity of study conclusions to various sources of bias when an EC is used in SATs [9, 18].
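One widely used form of quantitative bias analysis for unmeasured confounding is the E-value of VanderWeele and Ding (2017): the minimum strength of association, on the risk-ratio scale, that an unmeasured confounder would need with both treatment and outcome to fully explain away an observed risk ratio. The sketch below is offered only as an example of this kind of analysis, not as the specific methods of [9, 18]; the risk ratio and confidence limits are hypothetical.

```python
import math


def e_value(rr: float) -> float:
    """E-value for a point estimate on the risk-ratio scale."""
    if rr <= 0:
        raise ValueError("risk ratio must be positive")
    rr = rr if rr >= 1 else 1 / rr  # protective effects are inverted first
    return rr + math.sqrt(rr * (rr - 1))


def e_value_ci(lower: float, upper: float) -> float:
    """E-value for the confidence limit closest to the null (1 if the CI includes 1)."""
    if lower <= 1 <= upper:
        return 1.0
    return e_value(lower if lower > 1 else upper)


# Hypothetical response-rate ratio: SAT arm vs. an IPD external control.
print(f"E-value (point estimate): {e_value(2.0):.2f}")  # 2 + sqrt(2), about 3.41
print(f"E-value (lower 95% limit): {e_value_ci(1.3, 3.1):.2f}")
```

An E-value close to 1 signals that modest residual confounding between the SAT population and the EC could explain the result. The E-value addresses only unmeasured confounding, not other biases such as differences in how outcomes were assessed.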
Regarding endpoint selection, most SATs supporting AA use surrogate endpoints that measure immediate or intermediate anticancer activity rather than longer-term, clinically meaningful time-to-event endpoints (e.g., overall survival), because the latter may not be adequately characterized in SATs [7, 19, 20]. PROs and QoL measures may also be incorporated in SATs to capture patient-centric effects, particularly treatment benefits beyond tumor response (e.g., symptom relief relative to baseline), which can enhance real-world relevance and support regulatory approval [21]. However, caution is warranted, because PRO and QoL outcomes may be over- or underestimated owing to missing data in SATs with long-term follow-up [22, 23].
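Why missing data can distort PRO and QoL results is easiest to see in a small simulation. The sketch below assumes a hypothetical dropout mechanism in which patients whose QoL worsens are more likely to leave the study (missing not at random); under that assumption, the complete-case mean change from baseline overstates the benefit. The effect size, standard deviation, and dropout probabilities are illustrative and are not taken from [22, 23].

```python
import random
import statistics


def simulate_pro_dropout(n=2000, true_mean_change=2.0, sd=10.0,
                         p_drop_worse=0.5, p_drop_better=0.1, seed=1):
    """Return (true mean change, complete-case mean change, fraction observed)."""
    rng = random.Random(seed)
    # Change from baseline in a PRO score; higher values mean better QoL.
    changes = [rng.gauss(true_mean_change, sd) for _ in range(n)]
    observed = []
    for c in changes:
        # Hypothetical missing-not-at-random mechanism: worsening raises dropout risk.
        p_drop = p_drop_worse if c < 0 else p_drop_better
        if rng.random() >= p_drop:
            observed.append(c)
    return statistics.mean(changes), statistics.mean(observed), len(observed) / n


full, complete_case, frac = simulate_pro_dropout()
print(f"All patients: {full:.2f}; completers only: {complete_case:.2f} "
      f"({frac:.0%} observed)")
```

Reversing the mechanism (higher dropout among patients who improve, e.g., `p_drop_worse=0.1, p_drop_better=0.5`) biases the estimate in the opposite direction, which is why the direction of bias cannot be assumed in advance.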
Confirmatory subgroup analysis (SA) in SATs is often approached with caution because of inherent limitations, such as the lack of an internal comparator arm, small sample sizes, and susceptibility to bias, which can render SA results unreliable. In general, SA in SATs is exploratory and conducted without pre-specified statistical power, to provide supporting evidence on the consistency of efficacy across subgroups. In some cases, a pre-specified SA based on molecularly defined biomarkers is conducted in SATs, with appropriate multiplicity adjustment, to evaluate whether the biomarker modifies the treatment effect [7, 24, 25].
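When several biomarker-defined subgroups are tested in a pre-specified SA, an adjustment that controls the family-wise error rate is needed. The sketch below implements the Holm step-down procedure, one standard choice; other procedures (e.g., hierarchical testing) may be pre-specified instead, and the subgroup p-values are hypothetical.

```python
def holm_adjust(pvalues):
    """Holm step-down adjusted p-values controlling the family-wise error rate."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Multiply the k-th smallest p-value by (m - k + 1), cap at 1,
        # and enforce monotonicity so adjusted values never decrease.
        running_max = max(running_max, min(1.0, (m - rank) * pvalues[i]))
        adjusted[i] = running_max
    return adjusted


# Hypothetical one-sided p-values for three biomarker-defined subgroups.
subgroup_p = {"biomarker A": 0.004, "biomarker B": 0.030, "biomarker C": 0.020}
adjusted = holm_adjust(list(subgroup_p.values()))
for name, p_adj in zip(subgroup_p, adjusted):
    print(f"{name}: Holm-adjusted p = {p_adj:.3f}")
```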
As for statistical power and sample size, regulatory guidance documents on RWE studies (including SATs with IPD ECs) recommend that a statistical analysis plan (SAP) be developed in advance and submitted to the relevant regulatory agency before study initiation [7, 12, 26]. The SAP should include, at a minimum, clearly defined analyses for the primary and secondary estimands, the statistical power, the sample size, and methods for controlling the probability of erroneous conclusions. In particular, the sample size of an SAT should be large enough to provide a reliable answer to the clinical question, taking into account the planned analysis and the criteria for trial success [12, 26]. For SATs using a fixed threshold control, classical statistical methods for binary outcomes and duration of response can be used to determine the sample size required to detect a clinically meaningful and statistically significant treatment effect relative to the fixed threshold value [27]. For SATs using IPD as ECs, in addition to conventional sample size determination methods for detecting meaningful differences with the desired power [28, 29, 30, 31, 32], additional statistical considerations include the following:
(1) Identification and evaluation of fit-for-purpose RWD sources. Research-oriented RWD (e.g., disease registries, prospective cohorts) is generally preferred over transactional RWD (e.g., claims data). The choice of RWD source affects the effective sample size, that is, the number of patients eligible to serve as ECs.
(2) Type I error inflation. The type I error rate may be inflated if the response rate in the EC drifts from its hypothesized value toward the extremes (0 or 1).
(3) Analytical methods for estimating treatment effects. Different methods (e.g., Bayesian dynamic borrowing, propensity score matching, regression) rely on different assumptions and may yield different results, affecting power and the required sample size [33, 34].
(4) Separation of treatment effects from bias. The sample size must be large enough to distinguish true treatment effects from potential bias in the analysis.
(5) Simulation studies. Simulations are useful for exploring the interactions among key design factors such as power, sample size, type I error, analytical methods, bias, and the causal gap.
Sample size re-estimation during the course of the study may be implemented upon agreement with regulators [35]; see also Chen et al. [36] for further discussion of trial design and analysis using external data.
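For the fixed-threshold case, a classical single-stage exact binomial design can be found by direct search, and the same tail probability shows how the type I error rate grows when the true control response rate drifts above the value assumed at the design stage. The sketch assumes a binary response endpoint, a one-sided test, hypothetical design parameters (p0 = 0.20, p1 = 0.40, one-sided alpha = 0.025, power = 0.80), and illustrative function names; it does not cover SATs with IPD ECs, for which simulation studies would be needed.

```python
from math import comb


def upper_tail(r: int, n: int, p: float) -> float:
    """P(X >= r) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(r, n + 1))


def single_stage_design(p0, p1, alpha=0.025, power=0.80, n_max=200):
    """Smallest n, with critical count r, for a one-sided exact binomial test of
    H0: p <= p0 that rejects when at least r of n patients respond and has the
    requested power at p = p1."""
    for n in range(1, n_max + 1):
        # Smallest r whose exact type I error under p0 does not exceed alpha.
        r = next((c for c in range(n + 1) if upper_tail(c, n, p0) <= alpha), None)
        if r is not None and upper_tail(r, n, p1) >= power:
            return n, r
    raise ValueError("no design found within n_max")


def type_i_error_under_drift(n, r, true_control_rate):
    """Chance of declaring success when the drug only matches the true control rate."""
    return upper_tail(r, n, true_control_rate)


n, r = single_stage_design(p0=0.20, p1=0.40)
print(f"Enroll n = {n}; declare success if at least {r} patients respond")
for drifted in (0.20, 0.25, 0.30):
    print(f"true control rate {drifted:.2f}: "
          f"type I error {type_i_error_under_drift(n, r, drifted):.3f}")
```

Because exact binomial power is not monotone in n, the smallest n meeting the criteria may be followed by somewhat larger n that do not; some designs therefore require the criteria to hold for all larger n as well.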
Finally, we would like to express our gratitude to Cao and Pan for their thoughtful comments and to the Journal Editor for the opportunity to further elaborate on additional considerations in using SATs to support the AA of oncology drugs.