1 Introduction
Estimating a prediction function is a fundamental component of statistical data analysis. Based on a measured outcome Y and covariates X, the goal is to estimate the conditional expectation $E(Y\mid X)$. There are many approaches to estimating this regression function, ranging from simple, fully parametric models [e.g., generalized linear models; 20] to flexible machine learning approaches, including random forests [3], gradient boosted trees [13], the lasso [27], and neural networks [2]. While a single estimator (also referred to as a learner) may be chosen, it can be advantageous to instead consider an ensemble of multiple candidate learners; a large ensemble of flexible learners increases the chance that one learner can approximate the underlying conditional expectation well.
The super learner (SL) [29, 24] is one such ensemble, and is related to stacking [32]. Asymptotically, the super learner has been shown to achieve the same expected loss for predicting the outcome as the oracle estimator [29]. When both simple and complex algorithms are included in the library of candidate learners, the cross-validation that the super learner uses to select the loss-minimizing combination of candidate learners also helps protect against overfitting [1]. The super learner has been used successfully in many applications [see, e.g., 28, 23, 21, 18, 6] and is implemented in several software packages for the R programming language [25, 10].
In some settings, it may be of interest to perform variable selection as part of certain candidate learners within the super learner. This includes high-dimensional settings, where prediction performance may be improved by reducing the dimension prior to prediction, and settings where a parsimonious set of variables is itself a goal of the analysis. While recent work has developed general guidelines for specifying a super learner [22], the choice of screening algorithms (often referred to as screeners) has been relatively unexplored. In particular, there are cases where theory suggests that the lasso does not consistently select the most relevant variables [17]. In this article, we explore the use of the lasso as a screener within a super learner ensemble, with the goal of determining whether the performance of the ensemble is sensitive to poor performance of the lasso screener.
2 Overview of Variable Screening in the Super Learner
Phillips et al. [22] provide a thorough overview of the super learner algorithm, which we briefly summarize here. The super learner takes as input the following: the dataset ${\{({X_{i}},{Y_{i}})\}_{i=1}^{n}}$; a library of candidate learners (e.g., random forests, the lasso, neural networks), possibly including combinations with variable screeners (e.g., the lasso) that reduce the dimension of the covariates prior to prediction; a fixed number of cross-validation folds; and a loss function to minimize using cross-validation. The ensemble super learner (hereafter eSL) uses a meta-learner to combine the predictions from the candidate learners [22]. Below, we refer to a special case of the eSL, which we call the cSL: the convex combination of the candidate learners that minimizes the cross-validated loss. Because the combination is convex, the weights are non-negative and sum to one. The discrete super learner (dSL) instead selects the single candidate learner that minimizes the cross-validated loss.
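As a concrete illustration, the following minimal sketch fits a small super learner using the SuperLearner R package and reads off both the cSL weights and the dSL selection; the outcome vector y, the covariate data frame x, and the particular library shown are hypothetical examples rather than the specification used later in this article.

```r
library(SuperLearner)

# y: outcome vector; x: data.frame of covariates (hypothetical objects)
set.seed(1)
fit <- SuperLearner(
  Y = y, X = x, family = gaussian(),
  SL.library = c("SL.glm", "SL.ranger", "SL.glmnet"),
  method = "method.NNLS",       # non-negative least squares meta-learner
  cvControl = list(V = 10)      # number of cross-validation folds
)

fit$coef                        # cSL: convex-combination weights for each candidate learner
fit$cvRisk                      # cross-validated risk of each candidate learner
names(which.min(fit$cvRisk))    # dSL: the single candidate with the smallest cross-validated risk
```

The method argument determines the meta-learner; non-negative least squares yields the non-negative, normalized weights described above.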
Including variable screeners in the SL library is motivated by the fact that reducing the number of covariates can improve prediction performance in some cases, for example, in high-dimensional settings [see, e.g., 27]. Screeners can be broadly categorized as outcome-blind (e.g., removing one variable from a pair of highly correlated covariates) or based on the outcome-covariate relationship. Examples of the latter category include removing covariates with a univariate outcome-correlation-test p-value larger than a threshold; removing covariates with a random forest variable importance [3] rank larger than a threshold; or removing covariates with an estimated lasso coefficient of zero.
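For example, a lasso screener can be implemented by fitting a cross-validated lasso and retaining only the covariates with nonzero estimated coefficients; a minimal sketch using the glmnet package follows, where x and y are hypothetical objects and lambda.min is one reasonable choice of tuning parameter.

```r
library(glmnet)

# x: numeric covariate matrix; y: outcome vector (hypothetical objects)
set.seed(1)
cv_fit <- cv.glmnet(x = as.matrix(x), y = y, alpha = 1, nfolds = 10)

# coefficients at the cross-validation-selected tuning parameter; the first entry is the intercept
coefs <- as.numeric(coef(cv_fit, s = "lambda.min"))
keep <- coefs[-1] != 0                  # TRUE = nonzero lasso coefficient, covariate retained
x_screened <- x[, keep, drop = FALSE]   # reduced covariate set passed to downstream learners
```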
Strategies based on the outcome-covariate relationship, if pursued, should be combined with other algorithms in the SL library and should be evaluated using cross-validation [22]. In practice, specifying a screener-learner combination results in a new learner, where first the screener is applied and then the learner is applied on the reduced set of covariates. This becomes one of the learners in the SL library, and like any other learner, can either be chosen as part of the optimal combination or assigned zero weight. For example, suppose that q screeners and ℓ learners are considered. Then the candidate library could consist of all $q\times \ell $ screener-learner combinations, or a subset of these combinations chosen by the analyst. Below, we will consider all $q\times \ell $ screener-learner pairs. The ensembling step of the super learner assigns non-negative coefficients to each of the screener-learner combinations to create the ensemble learner.
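In the SuperLearner R package, each screener-learner pair is specified as a character vector whose first element is the learner and whose remaining element is the screener; the full set of $q\times \ell $ pairs can then be built programmatically, as in the following sketch (the particular learner and screener wrappers listed are illustrative).

```r
learners  <- c("SL.glm", "SL.ranger", "SL.xgboost", "SL.glmnet")              # ell learners
screeners <- c("All", "screen.glmnet", "screen.corP", "screen.randomForest")  # q screeners; "All" keeps every covariate

# each pair c(learner, screener) is one candidate learner in the SL library
grid <- expand.grid(learner = learners, screener = screeners, stringsAsFactors = FALSE)
sl_library <- mapply(c, grid$learner, grid$screener, SIMPLIFY = FALSE)

length(sl_library)  # q x ell screener-learner pairs
```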
3 Numerical Experiments
3.1 Data-Generating Mechanisms
To demonstrate the performance of the SL procedure using different screeners, we consider several data-generating scenarios. In each scenario, our simulated dataset consists of independent replicates of $(X,Y)$, where $X=({X_{1}},\dots ,{X_{p}})$ is a covariate vector and Y is the outcome of interest.
We consider a continuous outcome with $Y\mid (X=x)=f(x)+\epsilon $, where $\epsilon \sim N(0,1)$ is independent of X; and a binary outcome with $Pr(Y=1\mid X=x)=\Phi \{f(x)\}$, where Φ denotes the cumulative distribution function of the standard normal distribution (so Y follows a probit model). The outcome regression function f is either linear, with $f(x)=x\beta $, or nonlinear, with
\[\begin{aligned}{}f(x)& ={\beta _{1}}{f_{1}}\big\{{c_{1}}({x_{1}})\big\}+{\beta _{2}}{f_{2}}\big\{{c_{2}}({x_{2}}),{c_{3}}({x_{3}})\big\}\\ {} & \hspace{1em}+{\beta _{3}}{f_{3}}\big\{{c_{3}}({x_{3}})\big\}+{\beta _{4}}{f_{4}}\big\{{c_{4}}({x_{4}})\big\}\\ {} & \hspace{1em}+{\beta _{5}}{f_{2}}\big\{{c_{5}}({x_{5}}),{c_{1}}({x_{1}})\big\}+{\beta _{6}}{f_{3}}\big\{{c_{6}}({x_{6}})\big\},\\ {} {f_{1}}(x)& =\sin \bigg(\frac{\pi }{4}x\bigg),\hspace{1em}{f_{2}}(x,y)=xy,\\ {} {f_{3}}(x)& =x,\hspace{1em}{f_{4}}(x)=\cos \bigg(\frac{\pi }{4}x\bigg).\end{aligned}\]
The functions ${c_{1}},\dots ,{c_{6}}$ scale each variable to have mean zero and standard deviation one. The vector β determines the strength of the relationship between the outcome and the covariates. We define a weak outcome-covariate relationship by setting $\beta =(0,1,0,0,0,1,{\mathbf{0}_{p-6}})$, where $p-6$ variables do not affect the outcome, and a stronger relationship by setting $\beta =(-3,-1,1,-1.5,-0.5,0.5,{\mathbf{0}_{p-6}})$. The covariates follow a multivariate normal distribution with mean zero and covariance matrix Σ. In the uncorrelated case, Σ is the identity matrix. In the correlated case, the variables in the active set (a subset of the first six variables) have pairwise correlation 0.9 (under the strong outcome-covariate relationship) or 0.95 (under the weak relationship), while the remaining variables have pairwise correlation 0.3. Depending on the strength of the outcome-feature relationship, whether it is linear or nonlinear, and whether the features are correlated, the outcome rate in the binary case ranges from approximately 13% to 80%.
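For concreteness, a minimal sketch of simulating one dataset under the nonlinear, uncorrelated, weak-relationship scenario is given below; the sample size and dimension shown are illustrative, and sample standardization stands in for the scaling functions ${c_{1}},\dots ,{c_{6}}$.

```r
set.seed(42)
n <- 500; p <- 10
beta <- c(0, 1, 0, 0, 0, 1, rep(0, p - 6))    # weak outcome-covariate relationship

x <- matrix(rnorm(n * p), n, p)               # uncorrelated covariates (Sigma = identity)
z <- scale(x)                                 # c_j: center and scale each covariate

f1 <- function(u) sin(pi / 4 * u)
f2 <- function(u, v) u * v
f3 <- function(u) u
f4 <- function(u) cos(pi / 4 * u)

f <- beta[1] * f1(z[, 1]) + beta[2] * f2(z[, 2], z[, 3]) +
  beta[3] * f3(z[, 3]) + beta[4] * f4(z[, 4]) +
  beta[5] * f2(z[, 5], z[, 1]) + beta[6] * f3(z[, 6])

y_continuous <- f + rnorm(n)                  # continuous outcome: f(x) plus N(0, 1) error
y_binary <- rbinom(n, 1, pnorm(f))            # binary outcome: probit model Pr(Y = 1 | X = x) = Phi{f(x)}
```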
3.2 Prediction Algorithms
We compared several main prediction algorithms: the lasso; the cSL without the lasso in its library of candidate learners [referred to as cSL (-lasso)]; the cSL including the lasso (referred to as cSL); and the dSL with and without the lasso in its library of candidate learners [referred to as dSL and dSL (-lasso), respectively]. For the super learner approaches, we further considered four possible sets of screeners that were fit prior to any learners: no screeners; a lasso screener only; rank correlation, univariate correlation, random forest, and lasso screeners (referred to as “All” screeners); and all possible screeners except the lasso [referred to as “All (-lasso)”]. Tuning parameters for the screeners depended on the total number of features, except for the lasso screener, which always removed variables with a zero estimated regression coefficient at the tuning parameter selected by 10-fold cross-validation. For $p=10$, we considered a screener that selected all variables and a univariate correlation screener that removed variables with an outcome-correlation-test p-value greater than 0.2. For $p\gt 10$, the rank correlation screeners removed variables whose correlation-test p-values ranked outside the top 10, 25, or 50; the univariate correlation screener removed variables with a p-value greater than 0.2 or 0.4; and the random forest screener removed variables outside the top 10 or 25 most-important variables, ranked by the random forest variable importance measure [3].
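In the SuperLearner R package, a screener is a function that returns a logical vector indicating which covariates to retain, so screener variants with different tuning parameters can be written as small wrappers. A minimal sketch of a univariate correlation screener with a 0.2 threshold follows; the function name is hypothetical, and built-in screeners (e.g., screen.corP, screen.glmnet, screen.randomForest) provide analogous functionality.

```r
# custom screener following the SuperLearner screener convention: the function takes
# (Y, X, family, obsWeights, id, ...) and returns a logical vector, TRUE = retain covariate
screen.corP.2 <- function(Y, X, family, obsWeights, id, ...) {
  p_values <- apply(X, 2, function(x_j) stats::cor.test(x_j, Y)$p.value)
  p_values < 0.2   # keep covariates with correlation-test p-value below 0.2
}
```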
We finalized our cSL specification following the guidelines of Phillips et al. [22]. First, because we were interested in estimating the true continuous prediction function for both continuous and binary outcomes, we estimated the V-fold cross-validated least squares loss (for continuous outcomes) or log-likelihood loss (for binary outcomes); we then used the non-negative least squares (NNLS) or non-negative log-likelihood meta-learner, respectively, to obtain the optimal convex combination of the candidate learners. We used stratified cross-validation [16] in the binary-outcome case, and nested cross-validation in all cases to estimate the performance of the cSL and of each individual screener-learner pair. Second, we computed the effective sample size ${n_{\text{eff}}}$ and based our choice of V on the flowchart in Figure 1 of Phillips et al. [22]; the values of V are provided below. Finally, the library of screener-learner pairs specified above was designed to be computationally feasible and to adapt to high dimensions and different underlying true regression functions.
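A minimal sketch of this specification for a binary outcome, using CV.SuperLearner from the SuperLearner R package for the nested cross-validation, is given below; the objects y, x, and sl_library, and the choice $V=10$, are illustrative.

```r
library(SuperLearner)

# y: binary outcome; x: covariates; sl_library: screener-learner pairs (hypothetical objects)
set.seed(1)
cv_sl <- CV.SuperLearner(
  Y = y, X = x, family = binomial(),
  SL.library = sl_library,
  method = "method.NNloglik",                              # non-negative log-likelihood meta-learner
  cvControl = list(V = 10, stratifyCV = TRUE),             # outer folds: evaluate the cSL and each pair
  innerCvControl = list(list(V = 10, stratifyCV = TRUE))   # inner folds: fit the super learner within each outer fold
)
summary(cv_sl)  # cross-validated risk for the cSL, the dSL, and every screener-learner pair
```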
Table 1
All possible candidate learners for the super learners used in the simulations, along with their R implementation, tuning parameter values, and a description of the tuning parameters. All tuning parameters besides those listed here are set to their default values. In particular, the random forests are grown with a minimum node size of 5 for continuous outcomes and 1 for binary outcomes and a subsampling fraction of 1; the boosted trees are grown with a shrinkage rate of 0.1 and a minimum of 10 observations per node.
Candidate learner | R implementation | Tuning parameter and possible values | Tuning parameter description |
Generalized linear models | base | – | – |
Random forests | ranger [33] | $\texttt{num.trees}=1000$ | Number of trees |
 | | $\texttt{min.node.size}\in \{5,20,50,100,250\}$ | Minimum node size |
Gradient boosted trees | xgboost [7] | $\texttt{max.depth}=4$ | Maximum tree depth |
 | | $\texttt{ntree}\in \{100,500,1000\}$ | Number of iterations |
 | | $\texttt{shrinkage}\in \{0.01,0.1\}$ | Shrinkage |
Multivariate adaptive regression splines | earth [19] | $\texttt{nk}=\min \{\max \{21,2p+1\},1000\}^{\dagger }$ | Maximum number of model terms before pruning |
Lasso | glmnet [12] | λ, chosen via 10-fold cross-validation | ${\ell _{1}}$ regularization parameter |
${^{\dagger }}$: p denotes the total number of predictors.
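One way to expand the tuning-parameter grids in Table 1 into separate candidate learners is the create.Learner helper in the SuperLearner R package; a minimal sketch for the random forest grid follows (the returned wrapper names are package conventions, and the exact arguments may vary across package versions).

```r
library(SuperLearner)

# one SL.ranger wrapper per value of min.node.size, each with 1000 trees
rf_learners <- create.Learner(
  "SL.ranger",
  params = list(num.trees = 1000),
  tune = list(min.node.size = c(5, 20, 50, 100, 250)),
  detailed_names = TRUE
)
rf_learners$names  # wrapper names to include in the super learner library
```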
3.3 Experimental Overview
For each $n\in \{200,500,1000,2000,3000\}$, $p\in \{10,500\}$, and simulation scenario described above, we generated 1000 random datasets according to the corresponding data-generating mechanism. For continuous outcomes, ${n_{\text{eff}}}=n$; thus, we set $V=20$ for $n\le 500$ and $V=10$ otherwise. For binary outcomes, ${n_{\text{eff}}}$ ranged from 10 (the 5% incidence outcome at $n=200$) to 1367 (a 54% incidence outcome at $n=3000$). We set $V={n_{\text{eff}}}$ in three cases, and $V=20$ or $V=10$ otherwise, depending on the value of ${n_{\text{eff}}}$. The exact values of ${n_{\text{eff}}}$ and V used are provided in the Supporting Information. In each replication, we additionally generated a test dataset of sample size 1 million to estimate the true prediction performance of each prediction function developed using V-fold cross-validation. We measured prediction performance for each algorithm described above using R-squared for continuous outcomes, and both the area under the receiver operating characteristic curve (AUC) and the negative log-likelihood for binary outcomes. For continuous outcomes, R-squared is equivalent to the cross-validated metric being optimized, the mean squared error, since R-squared equals one minus the mean squared error divided by the outcome variance. For binary outcomes, AUC is often of interest when assessing prediction performance; AUC is not equivalent to the negative log-likelihood, but developing a super learner using the AUC loss directly can be unstable in some settings.
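A minimal sketch of computing these test-set metrics for a single fitted prediction function is given below; test_y and test_pred are hypothetical vectors of test-set outcomes and predictions.

```r
# test_y: test-set outcomes; test_pred: test-set predictions (hypothetical objects)

# continuous outcome: R-squared = 1 - MSE / Var(Y) on the large test set
mse <- mean((test_y - test_pred)^2)
r_squared <- 1 - mse / var(test_y)

# binary outcome: AUC via the rank (Mann-Whitney) formulation
ranks <- rank(test_pred)
n1 <- sum(test_y == 1); n0 <- sum(test_y == 0)
auc <- (sum(ranks[test_y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
```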
3.4 Results
We display the results under a strong outcome-feature relationship in Figures 1 and 2. Focusing first on the continuous outcome, when the outcome-feature relationship is linear (Figure 1, left column), the prediction performance of all estimators converges quickly to the best-possible performance as the sample size increases. In small samples with a linear relationship, removing the lasso from the SL library decreases performance. When the outcome-feature relationship is nonlinear (Figure 1, right column), the results depend on the variable screeners and the algorithm used. The lasso performs poorly regardless of sample size, particularly with correlated features; this is consistent with theory [17]. Moreover, particularly for large numbers of features (e.g., $p=500$), using the lasso screener alone within a super learner degrades performance, while using a large library of candidate screeners can improve performance over a super learner with no screeners and protects against poor lasso performance. Results are similar for the binary outcome.
Figure 1
Prediction performance versus sample size n, measured using cross-validated R-squared, for predicting a continuous outcome. There is a strong relationship between outcome and features. The top row shows results for correlated features, while the bottom row shows results for uncorrelated features. The left-hand column shows results for a linear outcome-feature relationship, while the right-hand column shows results for a nonlinear outcome-feature relationship. The dashed line denotes the best-possible prediction performance in each setting. Color denotes the variable screeners, while shape denotes the estimator (lasso, convex ensemble super learner [cSL], and discrete super learner [dSL]). Note that the y-axis limits differ between panels.
The results under a weak outcome-feature relationship follow similar patterns (Figures 3 and 4). In this case, the best-possible prediction performance is lower than in the strong-relationship case, as expected; and a larger sample size is required to achieve prediction performance close to this optimal level.
Figure 2
Prediction performance versus sample size n, measured using cross-validated AUC, for predicting a binary outcome. There is a strong relationship between outcome and features. The top row shows results for correlated features, while the bottom row shows results for uncorrelated features. The left-hand column shows results for a linear outcome-feature relationship, while the right-hand column shows results for a nonlinear outcome-feature relationship. The dashed line denotes the best-possible prediction performance in each setting. Color denotes the variable screeners, while shape denotes the estimator (lasso, convex ensemble super learner [cSL], and discrete super learner [dSL]). Note that the y-axis limits differ between panels.
Figure 3
Prediction performance versus sample size n, measured using cross-validated R-squared, for predicting a continuous outcome. There is a weak relationship between outcome and features. The top row shows results for correlated features, while the bottom row shows results for uncorrelated features. The left-hand column shows results for a linear outcome-feature relationship, while the right-hand column shows results for a nonlinear outcome-feature relationship. The dashed line denotes the best-possible prediction performance in each setting. Color denotes the variable screeners, while shape denotes the estimator (lasso, convex ensemble super learner [cSL], and discrete super learner [dSL]). Note that the y-axis limits differ between panels.
Figure 4
Prediction performance versus sample size n, measured using cross-validated AUC, for predicting a binary outcome. There is a weak relationship between outcome and features. The top row shows results for correlated features, while the bottom row shows results for uncorrelated features. The left-hand column shows results for a linear outcome-feature relationship, while the right-hand column shows results for a nonlinear outcome-feature relationship. The dashed line denotes the best-possible prediction performance in each setting. Color denotes the variable screeners, while shape denotes the estimator (lasso, convex ensemble super learner [cSL], and discrete super learner [dSL]). Note that the y-axis limits differ between panels.
Table 2
Estimates of cross-validated R-squared for the continuous ${\text{IC}_{50}}$ outcome, for the convex ensemble super learner (cSL), the discrete super learner (dSL), and the lasso, under each combination of learners and screeners. For screeners, ‘None’ denotes no screeners; ‘Lasso’ denotes only a lasso screener; ‘All (-lasso)’ denotes random forest, rank-correlation, and correlation-test p-value screening; ‘All’ denotes these three screener types plus the lasso; and ‘All (+none)’ denotes all screeners plus the ‘none’ screener.
Learners | Screeners | Algorithm | Min | Max | Point estimate [95% CI] |
All | None | cSL | 0.208 | 0.501 | 0.373 [0.353, 0.393] |
All | None | dSL | 0.058 | 0.491 | 0.366 [0.347, 0.385] |
All | None | lasso | 0.331 | 0.331 | 0.331 [0.305, 0.358] |
All | Lasso | cSL | 0.175 | 0.527 | 0.388 [0.364, 0.414] |
All | Lasso | dSL | 0.173 | 0.516 | 0.387 [0.366, 0.409] |
All | All (-lasso) | cSL | 0.182 | 0.535 | 0.390 [0.370, 0.411] |
All | All (-lasso) | dSL | 0.192 | 0.519 | 0.391 [0.372, 0.411] |
All | All | cSL | 0.180 | 0.545 | 0.394 [0.371, 0.417] |
All | All | dSL | 0.173 | 0.516 | 0.387 [0.365, 0.409] |
All | All (+none) | cSL | 0.203 | 0.533 | 0.378 [0.354, 0.403] |
All | All (+none) | dSL | 0.173 | 0.516 | 0.387 [0.365, 0.409] |
Table 3
Estimates of cross-validated AUC for the binary sensitivity outcome, for the convex ensemble super learner (cSL), the discrete super learner (dSL), and the lasso, under each combination of learners and screeners. For screeners, ‘None’ denotes no screeners; ‘Lasso’ denotes only a lasso screener; ‘All (-lasso)’ denotes random forest, rank-correlation, and correlation-test p-value screening; ‘All’ denotes these three screener types plus the lasso; and ‘All (+none)’ denotes all screeners plus the ‘none’ screener.
Learners | Screeners | Algorithm | Min | Max | Point estimate [95% CI] |
All | None | cSL | 0.755 | 0.874 | 0.823 [0.719, 0.928] |
All | None | dSL | 0.763 | 0.895 | 0.837 [0.737, 0.936] |
All | None | lasso | 0.647 | 0.813 | 0.757 [0.633, 0.882] |
All | Lasso | cSL | 0.727 | 0.865 | 0.806 [0.696, 0.915] |
All | Lasso | dSL | 0.730 | 0.897 | 0.811 [0.703, 0.919] |
All | All (-lasso) | cSL | 0.752 | 0.906 | 0.826 [0.723, 0.929] |
All | All (-lasso) | dSL | 0.772 | 0.907 | 0.827 [0.724, 0.929] |
All | All | cSL | 0.750 | 0.873 | 0.823 [0.719, 0.928] |
All | All | dSL | 0.772 | 0.897 | 0.826 [0.723, 0.929] |
All | All (+none) | cSL | 0.746 | 0.879 | 0.825 [0.720, 0.929] |
All | All (+none) | dSL | 0.772 | 0.897 | 0.829 [0.727, 0.931] |
In the Supporting Information, we provide additional results. Results for the binary outcome with respect to the negative log-likelihood follow patterns similar to those observed here using AUC. We also considered additional feature dimensions p with a fixed number of cross-validation folds V, and found results similar to the primary results presented above. Finally, we present results for $n=500$ and $p=2000$, as well as for the individual candidate learners within the super learner. In this high-dimensional setting, performance follows the same trends across outcomes and estimators as in the other $(n,p)$ combinations.
4 Predicting HIV-1 Neutralization Susceptibility
HIV-1 is a genetically diverse pathogen. Broadly neutralizing antibodies (bnAbs) against HIV-1 neutralize a wide array of HIV-1 genetic variants. One such bnAb, VRC01, was recently evaluated in two placebo-controlled randomized trials [9]. Predicting whether or not a given HIV-1 virus is susceptible to neutralization by a bnAb, including VRC01, is an important component of prevention research; several prediction models have been developed recently [15, 5, 14, 4, 26, 8, 34, 18, 30, 11, 31].
We analyze HIV-1 envelope (Env) amino acid (AA) sequence data from 611 publicly available HIV-1 Env pseudoviruses made from blood samples of HIV-1-infected individuals [18]. In addition to binary indicators of specific AA residues at each position in the Env sequence, the data also include information on the geographic region of origin of the virus, the subtype of the virus, and viral geometry; there are over 800 features in total. We considered two outcomes of interest: the ${\log _{10}}$-transformed 50% inhibitory concentration, ${\text{IC}_{50}}$, defined as the concentration ($\mu \text{g}/\mathrm{mL}$) of VRC01 necessary to neutralize 50% of viruses in vitro, with large values of ${\text{IC}_{50}}$ indicating resistance to neutralization; and susceptibility to neutralization, defined as the binary indicator that ${\text{IC}_{50}}\lt 1\hspace{2.5pt}\mu \text{g}/\mathrm{mL}$. For each outcome, we considered the same prediction algorithms and cSL specification as in Section 3. Following Phillips et al. [22], we set $V=10$ for both the continuous and binary outcomes.
The results are presented in Tables 2 and 3. For both outcomes, some screening tended to be beneficial. Among the analyses that used screeners, using the lasso screener alone resulted in the worst performance for the binary outcome and nearly the worst for the continuous outcome. Again, having a large set of screeners protected against poor lasso performance, and the lasso alone performed worse than the cSL or dSL for both outcomes. The lasso had a cross-validated (CV) R-squared for the continuous outcome of 0.331 with a 95% confidence interval (CI) of [0.305, 0.358], and a CV AUC for the binary outcome of 0.757 [0.633, 0.882]. For the continuous outcome, the largest point estimate of CV R-squared, 0.394 [0.371, 0.417], was achieved by the cSL with all screeners, including the lasso. The best-performing dSL used all screeners except the lasso, with CV R-squared 0.391 [0.372, 0.411]. For the binary outcome, the largest CV AUC for the cSL was 0.826 [0.723, 0.929], achieved with all screeners except the lasso; for the dSL, the largest CV AUC was 0.837 [0.737, 0.936], achieved with no screeners. In the Supporting Information, we present cross-validated performance for the candidate learners in each cSL; cross-validated negative log-likelihood loss for the binary susceptibility outcome; and the cSL coefficients and dSL selections for each cross-validation fold.
5 Discussion
In this manuscript, we explored the effect of using different combinations of variable screeners within the super learner. We found that both the lasso and an ensemble super learner (cSL) using only a lasso screener had poor prediction performance when the outcome-feature relationship was nonlinear, that is, when the lasso is misspecified. However, when a sufficiently rich set of candidate screeners was included, adding the lasso as a candidate screener did not degrade performance. These results held for both continuous and binary outcomes, and for both strong and weak relationships between the outcome and features. The same patterns held for the discrete super learner (dSL). In an analysis of 611 HIV-1 envelope protein pseudoviruses with over 800 features, we found results similar to the simulations; there, the dSL tended to perform similarly to the cSL.
Taken together, the results suggest that some caution must be used when specifying screeners within a super learner, but that a sufficiently large set of candidate screeners can protect against misspecification of a given screener. This guidance is similar to the guidance to specify a diverse set of learners in a super learner [22], and can be viewed as complementary, since an algorithm-screener pair defines a new candidate learner.