Biomarker Panel Development Using Logic Regression in the Presence of Missing Data
Volume 2, Issue 1 (2024), pp. 3–14
Pub. online: 31 January 2024
Type: Biomedical Research
Open Access
Accepted
20 December 2023
Published
31 January 2024
Abstract
We consider the problem of developing flexible and parsimonious biomarker combinations for early cancer detection in the presence of variables missing at random. Motivated by the need to develop biomarker panels in a cross-institute pancreatic cyst biomarker validation study, we propose logic regression-based methods for feature selection and construction of logic rules under a multiple imputation framework. We generate ensemble trees for classification decisions and then select a single decision tree for simplicity and interpretability. We demonstrate superior performance of the proposed methods compared with alternative methods based on complete-case data or single imputation. The methods are applied to the pancreatic cyst data to estimate biomarker panels for pancreatic cyst subtype classification and prediction of malignant potential.
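The workflow summarized in the abstract (impute several times, fit a logic rule on each imputed data set, combine the rules into an ensemble, then keep one rule for interpretability) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The random-draw imputation, the exhaustive search over one- and two-marker rules (a stand-in for a full logic regression search), the majority-vote ensemble, the "most frequently selected" criterion for the single rule, and all function names and toy data are assumptions made purely for illustration.

```python
import random
from collections import Counter
from itertools import combinations

def impute_once(rows, rng):
    """Fill each missing (None) marker with a random draw from that
    marker's observed values (a simplified stand-in for a real imputation model)."""
    n_markers = len(rows[0])
    observed = [[r[j] for r in rows if r[j] is not None] for j in range(n_markers)]
    return [[r[j] if r[j] is not None else rng.choice(observed[j])
             for j in range(n_markers)] for r in rows]

def candidate_rules(n_markers):
    """Single markers plus pairwise AND/OR combinations, encoded as tuples."""
    rules = [("X", j) for j in range(n_markers)]
    for j, k in combinations(range(n_markers), 2):
        rules.append(("AND", j, k))
        rules.append(("OR", j, k))
    return rules

def apply_rule(rule, row):
    """Evaluate a Boolean rule on one fully observed row of 0/1 markers."""
    if rule[0] == "X":
        return row[rule[1]]
    a, b = row[rule[1]], row[rule[2]]
    return int(a and b) if rule[0] == "AND" else int(a or b)

def best_rule(rows, labels):
    """Exhaustive search for the rule with the fewest misclassifications."""
    def errors(rule):
        return sum(apply_rule(rule, r) != y for r, y in zip(rows, labels))
    return min(candidate_rules(len(rows[0])), key=errors)

def fit_panel(rows, labels, n_imputations=10, seed=0):
    """Fit one rule per imputed data set; return the ensemble and the
    most frequently selected rule as the single interpretable panel."""
    rng = random.Random(seed)
    ensemble = [best_rule(impute_once(rows, rng), labels)
                for _ in range(n_imputations)]
    single = Counter(ensemble).most_common(1)[0][0]
    return ensemble, single

def predict_ensemble(ensemble, row):
    """Majority vote over the rules in the ensemble."""
    votes = sum(apply_rule(rule, row) for rule in ensemble)
    return int(2 * votes > len(ensemble))

# Hypothetical toy data: the true label is (marker 0 AND marker 1); marker 2 is noise.
complete = [[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)] * 5
partial = [[1, None, 0], [1, None, 1], [0, None, 0], [1, 1, None]]
partial_labels = [1, 0, 0, 1]
rows = complete + partial
labels = [a & b for a, b, _ in complete] + partial_labels

ensemble, single = fit_panel(rows, labels)
print("single rule:", single)
print("ensemble prediction for [1, 1, 0]:", predict_ensemble(ensemble, [1, 1, 0]))
```

In this toy setting the ensemble and the single rule agree because the signal is strong relative to the amount of missing data. With weaker signal or heavier missingness the rules selected across imputations would be expected to differ, which is the situation in which the choice between an ensemble and a single tree becomes relevant.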