


The New England Journal of Statistics in Data Science


A Comparison of Methods for Estimating the Average Treatment Effect on the Treated for Externally Controlled Trials
Huan Wang¹, Fei Wu¹, Yeh-Fong Chen

https://doi.org/10.51387/25-NEJSDS77
Pub. online: 13 March 2025 · Type: Methodology Article · Open Access
Area: Statistical Methodology

1 Contributed equally.

Accepted: 24 January 2025
Published: 13 March 2025

Abstract

While randomized trials may be the gold standard for evaluating the effectiveness of a treatment intervention, single-arm clinical trials that use an external control may be considered in some special circumstances. The causal treatment effect of interest for single-arm trials is usually the average treatment effect on the treated (ATT) rather than the average treatment effect (ATE). Although methods have been developed to estimate the ATT, selecting and using them requires a thorough comparison and an in-depth understanding of their advantages and disadvantages. In this study, we conducted simulations under different identifiability assumptions to compare performance metrics (e.g., bias, standard deviation (SD), mean squared error (MSE), type I error rate) for a variety of methods, including the regression model, propensity score matching (PSM), Mahalanobis distance matching (MDM), coarsened exact matching, inverse probability weighting, augmented inverse probability weighting (AIPW), AIPW with SuperLearner, and targeted maximum likelihood estimator (TMLE) with SuperLearner.
Our simulation results demonstrate that doubly robust methods generally have smaller biases than the other methods. In terms of SD, nonmatching methods generally have smaller SDs than matching-based methods. MSE reflects a trade-off between bias and SD, and no method consistently performs better in terms of MSE. The identifiability assumptions are critical to the models’ performance: violation of the positivity assumption can lead to a significant inflation of type I errors in some methods; violation of the unconfoundedness assumption can lead to a large bias for all methods.
According to the simulation results, under most scenarios we examined, the PSM and MDM methods perform best overall in terms of type I error control. However, they generally estimate the ATT less accurately than doubly robust methods, provided that the identifiability assumptions are not severely violated.
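To make the compared quantities concrete, the following is a minimal Python sketch of a simulation that estimates the ATT with two of the method families named above (an outcome regression model and inverse probability weighting) and summarizes bias, SD, and MSE over repeated datasets. It is not the authors' simulation design: the data-generating process, the coefficient values, the sample sizes, and all function names are illustrative assumptions, and the treatment effect is held constant so that the true ATT is known. The weighting estimator uses odds weights e(x)/(1 - e(x)) for external controls, a common choice when the ATT is the target.

```python
import numpy as np


def simulate(n, rng, true_att=1.0):
    """Hypothetical data: one confounder x, constant treatment effect."""
    x = rng.normal(size=n)
    p = 1.0 / (1.0 + np.exp(-(-0.5 + 0.8 * x)))  # true propensity score
    a = rng.binomial(1, p)  # 1 = trial arm, 0 = external control
    y = 2.0 + 1.5 * x + true_att * a + rng.normal(size=n)
    return x, a, y


def fit_logistic(X, a, n_iter=25):
    """Logistic regression by Newton-Raphson; X must include an intercept."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (a - p)
        hess = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def att_naive(x, a, y):
    """Unadjusted difference in means (ignores confounding)."""
    return y[a == 1].mean() - y[a == 0].mean()


def att_regression(x, a, y):
    """Fit a linear outcome model on controls, predict for the treated."""
    X0 = np.column_stack([np.ones((a == 0).sum()), x[a == 0]])
    coef, *_ = np.linalg.lstsq(X0, y[a == 0], rcond=None)
    X1 = np.column_stack([np.ones((a == 1).sum()), x[a == 1]])
    return np.mean(y[a == 1] - X1 @ coef)


def att_ipw(x, a, y):
    """Weight controls by the estimated odds e/(1 - e); treated get weight 1."""
    X = np.column_stack([np.ones_like(x), x])
    e = 1.0 / (1.0 + np.exp(-X @ fit_logistic(X, a)))
    w = e / (1.0 - e)
    control_mean = np.sum(w[a == 0] * y[a == 0]) / np.sum(w[a == 0])
    return y[a == 1].mean() - control_mean


def run_simulation(n_rep=500, n=500, true_att=1.0, seed=0):
    """Return bias, SD, and MSE of each estimator over n_rep datasets."""
    rng = np.random.default_rng(seed)
    estimators = {"naive": att_naive, "regression": att_regression, "ipw": att_ipw}
    draws = {name: [] for name in estimators}
    for _ in range(n_rep):
        x, a, y = simulate(n, rng, true_att)
        for name, estimator in estimators.items():
            draws[name].append(estimator(x, a, y))
    summary = {}
    for name, values in draws.items():
        values = np.asarray(values)
        summary[name] = {
            "bias": values.mean() - true_att,
            "sd": values.std(ddof=1),
            "mse": np.mean((values - true_att) ** 2),
        }
    return summary


if __name__ == "__main__":
    for name, stats in run_simulation().items():
        print(name, {k: round(float(v), 3) for k, v in stats.items()})
```

In this toy setting both models are correctly specified and overlap is good, so the two adjusted estimators should show little bias while the unadjusted difference does not. Probing the scenarios the abstract describes would require changing the assumed data-generating process, for example by making the propensity scores more extreme (positivity) or by omitting a confounder from the fitted models (unconfoundedness).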

Supplementary material

Supplementary materials, including Figures S1–S10, are available online with this paper at the New England Journal of Statistics in Data Science website.



Copyright
© 2025 New England Statistical Society
Open access article under the CC BY license.

Keywords
Single-arm trials; External control; Real-world data; Causal inference; Average treatment effect on the treated

Funding
This work was supported by the ORISE Research Program of the U.S. Food and Drug Administration.


ISSN: 2693-7166