
The New England Journal of Statistics in Data Science


Practical Considerations for Variable Screening in the Super Learner
Brian D. Williamson, Drew King, Ying Huang

https://doi.org/10.51387/25-NEJSDS82
Pub. online: 7 May 2025
Type: Case Study, Application, and/or Practice Article
Open Access
Area: Machine Learning and Data Mining

Accepted: 20 March 2025
Published: 7 May 2025

Abstract

Estimating a prediction function is a fundamental component of many data analyses. The super learner ensemble, a particular implementation of stacking, has desirable theoretical properties and has been used successfully in many applications. Dimension reduction can be accomplished by using variable screening algorithms (screeners), including the lasso, within the ensemble prior to fitting other prediction algorithms. However, the performance of a super learner using the lasso for dimension reduction has not been fully explored in cases where the lasso is known to perform poorly. We provide empirical results that suggest that a diverse set of candidate screeners should be used to protect against poor performance of any one screener, similar to the guidance for choosing a library of prediction algorithms for the super learner. These results are further illustrated through the analysis of HIV-1 antibody data.
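
The screen-then-fit idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation (their code is in the repository listed under Supplementary material). It is a minimal pure-Python illustration under several assumptions: the data are simulated, a marginal-correlation screener stands in for the lasso, a hypothetical poor screener (`screen_last_two`) is added on purpose, and a discrete super learner (the candidate with the smallest cross-validated risk) replaces the full weighted ensemble. All function and candidate names are illustrative.

```python
import random

def screen_all(X, y):
    """Keep every column (no screening)."""
    return list(range(len(X[0])))

def screen_top_corr(X, y, k=2):
    """Keep the k columns ranked highest by absolute marginal association with y."""
    n = len(y)
    y_bar = sum(y) / n
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        c_bar = sum(col) / n
        cov = sum((c - c_bar) * (v - y_bar) for c, v in zip(col, y))
        var = sum((c - c_bar) ** 2 for c in col)
        # Proportional to |correlation|: the spread of y is the same for every column.
        scores.append(abs(cov) / (var ** 0.5 + 1e-12))
    return sorted(sorted(range(len(scores)), key=lambda j: -scores[j])[:k])

def screen_last_two(X, y):
    """Hypothetical poor screener: always keeps the last two columns."""
    p = len(X[0])
    return [p - 2, p - 1]

def fit_mean(X, y):
    """Baseline learner that predicts the training mean."""
    y_bar = sum(y) / len(y)
    return lambda x: y_bar

def fit_knn(X, y, k=5):
    """k-nearest-neighbour regression with squared Euclidean distance."""
    def predict(x):
        order = sorted(range(len(X)),
                       key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], x)))
        return sum(y[i] for i in order[:k]) / k
    return predict

def fit_candidate(screener, learner, X, y):
    """Screen on the training data only, then fit the learner on the kept columns."""
    keep = screener(X, y)
    model = learner([[row[j] for j in keep] for row in X], y)
    return lambda x: model([x[j] for j in keep])

def cv_risks(candidates, X, y, n_folds=5):
    """Cross-validated mean squared error for each (screener, learner) pair."""
    n = len(y)
    risks = {name: 0.0 for name in candidates}
    for fold in range(n_folds):
        train = [i for i in range(n) if i % n_folds != fold]
        test = [i for i in range(n) if i % n_folds == fold]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        for name, (screener, learner) in candidates.items():
            predict = fit_candidate(screener, learner, X_tr, y_tr)
            risks[name] += sum((y[i] - predict(X[i])) ** 2 for i in test) / n
    return risks

# Simulated data (an assumption): only the first two of ten columns carry signal.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(10)] for _ in range(200)]
y = [2 * row[0] - 2 * row[1] + rng.gauss(0, 0.5) for row in X]

candidates = {
    "all+mean": (screen_all, fit_mean),
    "all+knn": (screen_all, fit_knn),
    "corr+knn": (screen_top_corr, fit_knn),
    "noise+knn": (screen_last_two, fit_knn),
}
risks = cv_risks(candidates, X, y)
best = min(risks, key=risks.get)  # discrete super learner: smallest CV risk
for name in sorted(risks, key=risks.get):
    print(f"{name}: CV MSE = {risks[name]:.3f}")
print("selected:", best)
```

Because each screener is applied inside the cross-validation folds and paired with a learner as its own candidate, a screener that discards the informative columns simply earns a large cross-validated risk and is not selected. That is the sense in which a diverse set of screeners can guard against the failure of any single one.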

Supplementary material

Additional numerical results are available in the Supporting Information. Code to reproduce all numerical experiments and the data analysis is available on GitHub at https://github.com/bdwilliamson/sl_screening_supplementary.



Copyright
© 2025 New England Statistical Society
Open access article under the CC BY license.

Keywords
Super learner; Ensemble machine learning; Variable screening; Prediction

Funding
This work was supported by the National Institutes of Health (NIH) grants R01CA277133, R37AI054165, R01GM106177, U24CA086368 and S10OD028685. The opinions expressed in this article are those of the authors and do not necessarily represent the official views of the NIH.


  • ISSN: 2693-7166