Practical Considerations for Variable Screening in the Super Learner
Pub. online: 7 May 2025
Type: Case Study, Application, And/or Practice Article
Open Access
Area: Machine Learning and Data Mining
Accepted: 20 March 2025
Published: 7 May 2025
Abstract
Estimating a prediction function is a fundamental component of many data analyses. The super learner ensemble, a particular implementation of stacking, has desirable theoretical properties and has been used successfully in many applications. Dimension reduction can be accomplished by using variable screening algorithms (screeners), including the lasso, within the ensemble prior to fitting other prediction algorithms. However, the performance of a super learner using the lasso for dimension reduction has not been fully explored in cases where the lasso is known to perform poorly. We provide empirical results that suggest that a diverse set of candidate screeners should be used to protect against poor performance of any one screener, similar to the guidance for choosing a library of prediction algorithms for the super learner. These results are further illustrated through the analysis of HIV-1 antibody data.
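As a rough illustration of how screeners sit inside the ensemble, the following minimal Python sketch pairs each candidate screener with each candidate learner and scores every pair by cross-validated squared error. It rests on several assumptions that are not from the article: the function names are hypothetical, a simple correlation ranking stands in for the lasso screener, a nearest-neighbor average stands in for the prediction algorithms, the data are simulated, and only the discrete super learner (the single pair with the lowest cross-validated risk) is selected rather than a weighted combination of candidates.

```python
import random
from statistics import mean


def screen_all(X, y):
    """Screener that keeps every column (no dimension reduction)."""
    return list(range(len(X[0])))


def screen_top_corr(X, y, k=2):
    """Screener that keeps the k columns most correlated with y.
    This is a stand-in for a lasso screener, not the lasso itself."""
    def abs_corr(j):
        col = [row[j] for row in X]
        mx, my = mean(col), mean(y)
        sxy = sum((a - mx) * (b - my) for a, b in zip(col, y))
        sxx = sum((a - mx) ** 2 for a in col)
        syy = sum((b - my) ** 2 for b in y)
        return abs(sxy) / ((sxx * syy) ** 0.5) if sxx > 0 and syy > 0 else 0.0
    ranked = sorted(range(len(X[0])), key=abs_corr, reverse=True)
    return sorted(ranked[:k])


def fit_mean(X, y):
    """Learner that ignores the covariates and predicts the training mean."""
    m = mean(y)
    return lambda row: m


def fit_knn(X, y, k=5):
    """Learner that averages the outcomes of the k nearest training rows."""
    def predict(row):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(row, xi)), yi)
            for xi, yi in zip(X, y)
        )
        return mean(yi for _, yi in dists[:k])
    return predict


def cv_risk(X, y, screener, learner, n_folds=5):
    """Cross-validated mean squared error of one screener-learner pair.
    Screening is redone inside each training fold so that held-out rows
    never influence which variables are kept."""
    n = len(y)
    sq_errors = []
    for fold in range(n_folds):
        train = [i for i in range(n) if i % n_folds != fold]
        test = [i for i in range(n) if i % n_folds == fold]
        X_tr = [X[i] for i in train]
        y_tr = [y[i] for i in train]
        keep = screener(X_tr, y_tr)
        predict = learner([[row[j] for j in keep] for row in X_tr], y_tr)
        for i in test:
            pred = predict([X[i][j] for j in keep])
            sq_errors.append((pred - y[i]) ** 2)
    return mean(sq_errors)


def discrete_super_learner(X, y, screeners, learners, n_folds=5):
    """Return the screener-learner pair with the lowest CV risk, plus all risks."""
    risks = {
        (s_name, l_name): cv_risk(X, y, s, l, n_folds)
        for s_name, s in screeners.items()
        for l_name, l in learners.items()
    }
    best = min(risks, key=risks.get)
    return best, risks


# Simulated data (an assumption for illustration): only column 0 matters.
rng = random.Random(0)
n, p = 100, 10
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [2.0 * row[0] + rng.gauss(0, 0.5) for row in X]

screeners = {"all": screen_all, "top_corr": screen_top_corr}
learners = {"mean": fit_mean, "knn": fit_knn}
best, risks = discrete_super_learner(X, y, screeners, learners)
for pair, risk in sorted(risks.items(), key=lambda kv: kv[1]):
    print(pair, round(risk, 3))
```

Because every screener is crossed with every learner, a screener that does poorly in a given setting simply yields candidates with high cross-validated risk, and the selector can fall back on pairs built from another screener. This is the sense in which a diverse set of screeners offers protection.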
Supplementary material
Additional numerical results are available in the Supporting Information. Code to reproduce all numerical experiments and the data analysis is available on GitHub at https://github.com/bdwilliamson/sl_screening_supplementary.