The New England Journal of Statistics in Data Science


Biomarker Panel Development Using Logic Regression in the Presence of Missing Data
Volume 2, Issue 1 (2024), pp. 3–14
Ying Huang, Sayan Dasgupta

https://doi.org/10.51387/24-NEJSDS59
Pub. online: 31 January 2024      Type: Methodology Article      Open Access
Area: Biomedical Research

Accepted: 20 December 2023
Published: 31 January 2024

Abstract

We consider the problem of developing flexible and parsimonious biomarker combinations for early cancer detection in the presence of variable missingness at random. Motivated by the need to develop biomarker panels in a cross-institute pancreatic cyst biomarker validation study, we propose logic regression-based methods for feature selection and construction of logic rules under a multiple imputation framework. We generate ensemble trees for classification decisions and further select a single decision tree for simplicity and interpretability. We demonstrate superior performance of the proposed methods compared to alternative methods based on complete-case data or single imputation. The methods are applied to the pancreatic cyst data to estimate biomarker panels for pancreatic cyst subtype classification and malignant potential prediction.



Copyright
© 2024 New England Statistical Society
Open access article under the CC BY license.

Keywords
62P10; Biomarker; Logic regression; Missing data


  • ISSN: 2693-7166