<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">NEJSDS</journal-id>
<journal-title-group><journal-title>The New England Journal of Statistics in Data Science</journal-title></journal-title-group>
<issn pub-type="ppub">2693-7166</issn><issn-l>2693-7166</issn-l>
<publisher>
<publisher-name>New England Statistical Society</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">NEJSDS82</article-id>
<article-id pub-id-type="doi">10.51387/25-NEJSDS82</article-id>
<article-id pub-id-type="arxiv">2311.03313</article-id>
<article-categories><subj-group subj-group-type="area">
<subject>Machine Learning and Data Mining</subject></subj-group><subj-group subj-group-type="heading">
<subject>Case Study, Application, and/or Practice Article</subject></subj-group></article-categories>
<title-group>
<article-title>Practical Considerations for Variable Screening in the Super Learner</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-7024-548X</contrib-id>
<name><surname>Williamson</surname><given-names>Brian D.</given-names></name><email xlink:href="mailto:brian.d.williamson@kp.org">brian.d.williamson@kp.org</email><xref ref-type="aff" rid="j_nejsds82_aff_001"/><xref ref-type="corresp" rid="cor1">∗</xref>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0009-0007-9594-1602</contrib-id>
<name><surname>King</surname><given-names>Drew</given-names></name><xref ref-type="aff" rid="j_nejsds82_aff_002"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Huang</surname><given-names>Ying</given-names></name><xref ref-type="aff" rid="j_nejsds82_aff_003"/>
</contrib>
<aff id="j_nejsds82_aff_001"><institution>Kaiser Permanente Washington Health Research Institute, Fred Hutchinson Cancer Center, and University of Washington</institution>, <country>USA</country>. E-mail address: <email xlink:href="mailto:brian.d.williamson@kp.org">brian.d.williamson@kp.org</email></aff>
<aff id="j_nejsds82_aff_002"><institution>Seattle Central College</institution>, <country>USA</country>.</aff>
<aff id="j_nejsds82_aff_003"><institution>Fred Hutchinson Cancer Center and University of Washington</institution>, <country>USA</country>.</aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>∗</label>Corresponding author.</corresp>
</author-notes>
<pub-date pub-type="ppub"><year>2025</year></pub-date><pub-date pub-type="epub"><day>7</day><month>5</month><year>2025</year></pub-date><volume>3</volume><issue>2</issue><fpage>167</fpage><lpage>175</lpage><supplementary-material id="S1" content-type="document" xlink:href="nejsds82_s001.pdf" mimetype="application" mime-subtype="pdf">
<caption>
<title>Supplementary Material</title>
<p>Additional numerical results are available in the Supporting Information. Code to reproduce all numerical experiments and the data analysis is available on GitHub at <uri>https://github.com/bdwilliamson/sl_screening_supplementary</uri>.</p>
</caption>
</supplementary-material><history><date date-type="accepted"><day>20</day><month>3</month><year>2025</year></date></history>
<permissions><copyright-statement>© 2025 New England Statistical Society</copyright-statement><copyright-year>2025</copyright-year>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>Open access article under the <ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">CC BY</ext-link> license.</license-p></license></permissions>
<abstract>
<p>Estimating a prediction function is a fundamental component of many data analyses. The super learner ensemble, a particular implementation of stacking, has desirable theoretical properties and has been used successfully in many applications. Dimension reduction can be accomplished by using variable screening algorithms (screeners), including the lasso, within the ensemble prior to fitting other prediction algorithms. However, the performance of a super learner using the lasso for dimension reduction has not been fully explored in cases where the lasso is known to perform poorly. We provide empirical results that suggest that a diverse set of candidate screeners should be used to protect against poor performance of any one screener, similar to the guidance for choosing a library of prediction algorithms for the super learner. These results are further illustrated through the analysis of HIV-1 antibody data.</p>
</abstract>
<kwd-group>
<label>Keywords and phrases</label>
<kwd>Super learner</kwd>
<kwd>Ensemble machine learning</kwd>
<kwd>Variable screening</kwd>
<kwd>Prediction</kwd>
</kwd-group>
<funding-group><funding-statement>This work was supported by the National Institutes of Health (NIH) grants R01CA277133, R37AI054165, R01GM106177, U24CA086368 and S10OD028685. The opinions expressed in this article are those of the authors and do not necessarily represent the official views of the NIH.</funding-statement></funding-group>
</article-meta>
</front>
<body>
<sec id="j_nejsds82_s_001">
<label>1</label>
<title>Introduction</title>
<p>Estimating a prediction function is a fundamental component of statistical data analysis. Based on measured outcome <italic>Y</italic> and covariates <italic>X</italic>, the goal is to estimate the conditional expectation <inline-formula id="j_nejsds82_ineq_001"><alternatives><mml:math>
<mml:mi mathvariant="italic">E</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">Y</mml:mi>
<mml:mo stretchy="false">∣</mml:mo>
<mml:mi mathvariant="italic">X</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$E(Y\mid X)$]]></tex-math></alternatives></inline-formula>. There are many approaches to estimating this regression function, ranging from simple and fully parametric [e.g., generalized linear models; <xref ref-type="bibr" rid="j_nejsds82_ref_020">20</xref>] to flexible machine learning approaches, including random forests [<xref ref-type="bibr" rid="j_nejsds82_ref_003">3</xref>], gradient boosted trees [<xref ref-type="bibr" rid="j_nejsds82_ref_013">13</xref>], the lasso [<xref ref-type="bibr" rid="j_nejsds82_ref_027">27</xref>], and neural networks [<xref ref-type="bibr" rid="j_nejsds82_ref_002">2</xref>]. While a single estimator (also referred to as a learner) may be chosen, it can be advantageous to instead consider an ensemble of multiple candidate learners; a large ensemble of flexible learners increases the chance that one learner can approximate the underlying conditional expectation well.</p>
<p>The super learner (SL) [<xref ref-type="bibr" rid="j_nejsds82_ref_029">29</xref>, <xref ref-type="bibr" rid="j_nejsds82_ref_024">24</xref>] is one such ensemble, and is related to stacking [<xref ref-type="bibr" rid="j_nejsds82_ref_032">32</xref>]. Asymptotically, the super learner has been shown to achieve the same expected loss for predicting the outcome as the oracle estimator, i.e., the best choice from the candidate library given knowledge of the true data-generating distribution [<xref ref-type="bibr" rid="j_nejsds82_ref_029">29</xref>]. Because the super learner uses cross-validation to select the combination of candidate learners minimizing a cross-validated loss function, including both simple and complex algorithms in the library of candidate learners can minimize the risk of overfitting [<xref ref-type="bibr" rid="j_nejsds82_ref_001">1</xref>]. The super learner has been used successfully in many applications [see, e.g., <xref ref-type="bibr" rid="j_nejsds82_ref_028">28</xref>, <xref ref-type="bibr" rid="j_nejsds82_ref_023">23</xref>, <xref ref-type="bibr" rid="j_nejsds82_ref_021">21</xref>, <xref ref-type="bibr" rid="j_nejsds82_ref_018">18</xref>, <xref ref-type="bibr" rid="j_nejsds82_ref_006">6</xref>] and is implemented in several software packages for the R programming language [<xref ref-type="bibr" rid="j_nejsds82_ref_025">25</xref>, <xref ref-type="bibr" rid="j_nejsds82_ref_010">10</xref>].</p>
<p>In some settings, it may be of interest to perform variable selection as part of certain candidate learners within the super learner. This includes high-dimensional settings, where prediction performance may be improved by reducing the dimension prior to prediction, and settings where a parsimonious set of variables is itself a goal of the analysis. While recent work has developed general guidelines for specifying a super learner [<xref ref-type="bibr" rid="j_nejsds82_ref_022">22</xref>], the choice of <italic>screening algorithms</italic> (often referred to as <italic>screeners</italic>) has been relatively unexplored. In particular, there are cases where theory suggests that the lasso does not consistently select the most relevant variables [<xref ref-type="bibr" rid="j_nejsds82_ref_017">17</xref>]. In this article, we explore the use of the lasso as a screener within a super learner ensemble, with the goal of determining whether there are cases where the performance of the ensemble is sensitive to poor performance of the lasso screener.</p>
</sec>
<sec id="j_nejsds82_s_002">
<label>2</label>
<title>Overview of Variable Screening in the Super Learner</title>
<p>Phillips et al. [<xref ref-type="bibr" rid="j_nejsds82_ref_022">22</xref>] provide a thorough overview of the super learner algorithm, which we briefly summarize here. The super learner takes as input the following: the dataset <inline-formula id="j_nejsds82_ineq_002"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">Y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo fence="true" stretchy="false">}</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
</mml:msubsup></mml:math><tex-math><![CDATA[${\{({X_{i}},{Y_{i}})\}_{i=1}^{n}}$]]></tex-math></alternatives></inline-formula>; a <italic>library</italic> of candidate learners (e.g., random forests, the lasso, neural networks), possibly including combinations with variable screeners (e.g., the lasso) that reduce the dimension of the covariates prior to prediction; a fixed number of cross-validation folds; and a loss function to minimize using cross-validation. The <italic>ensemble super learner</italic> (hereafter eSL) uses a meta-learner to combine the predictions from the candidate learners [<xref ref-type="bibr" rid="j_nejsds82_ref_022">22</xref>]. Below, we will refer to a special case of the eSL, which we call the <italic>cSL</italic>: the convex combination of the candidate learners that minimizes the cross-validated loss; by definition, the combination weights are non-negative and sum to one. The discrete super learner (dSL) instead selects the single candidate learner that minimizes the cross-validated loss.</p>
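To make the dSL/cSL distinction concrete, the following self-contained Python sketch computes cross-validated predictions for two toy candidate learners (intercept-only and ordinary least squares), selects the dSL as the single loss-minimizing learner, and finds convex cSL weights by grid search. This is an illustration only, not the SuperLearner R package's implementation; all names and the grid-search meta-learner are simplifications.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 4
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

# Two toy candidate learners; each returns a prediction function.
def fit_mean(Xtr, ytr):
    m = ytr.mean()
    return lambda Xte: np.full(len(Xte), m)

def fit_ols(Xtr, ytr):
    A = np.column_stack([np.ones(len(Xtr)), Xtr])
    coef, *_ = np.linalg.lstsq(A, ytr, rcond=None)
    return lambda Xte: np.column_stack([np.ones(len(Xte)), Xte]) @ coef

learners = [fit_mean, fit_ols]

# Cross-validated predictions Z (one column per candidate learner).
K = 5
folds = np.array_split(rng.permutation(n), K)
Z = np.zeros((n, len(learners)))
for test_idx in folds:
    train_idx = np.setdiff1d(np.arange(n), test_idx)
    for j, fit in enumerate(learners):
        Z[test_idx, j] = fit(X[train_idx], y[train_idx])(X[test_idx])

cv_mse = ((Z - y[:, None]) ** 2).mean(axis=0)
dsl_index = int(np.argmin(cv_mse))  # dSL: the best single candidate

# cSL: convex weights minimizing the CV loss; with two candidates,
# a one-dimensional grid over the weight on the second learner suffices.
ws = np.linspace(0, 1, 101)
ens_mse = [(((1 - w) * Z[:, 0] + w * Z[:, 1] - y) ** 2).mean() for w in ws]
w_opt = ws[int(np.argmin(ens_mse))]
print(dsl_index, w_opt)
```

With more than two candidates, the grid search would be replaced by a non-negative, sum-to-one constrained optimization of the cross-validated loss.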
<p>Including variable screeners in the SL library is motivated by the fact that reducing the number of covariates can improve prediction performance in some cases, for example, in high-dimensional settings [see, e.g., <xref ref-type="bibr" rid="j_nejsds82_ref_027">27</xref>]. Screeners can be broadly categorized as outcome-blind (such as removing one variable from each pair of highly correlated covariates) or outcome-based, i.e., making use of the outcome-covariate relationship. Examples of the latter category include removing covariates whose univariate outcome-correlation-test p-value exceeds a threshold; removing covariates whose random forest variable importance measure [<xref ref-type="bibr" rid="j_nejsds82_ref_003">3</xref>] rank exceeds a threshold; and removing covariates with an estimated lasso coefficient of zero.</p>
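The two screener categories can be sketched as follows. This hypothetical Python code implements an outcome-blind pairwise-correlation screener and an outcome-based univariate-correlation screener; the function names and thresholds are illustrative, and the sketch thresholds the correlation magnitude rather than a correlation-test p-value to stay dependency-free.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 300, 10
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=n)  # column 1 nearly duplicates column 0
y = 2.0 * X[:, 0] + rng.normal(size=n)

def screen_correlated_pairs(X, threshold=0.9):
    """Outcome-blind: drop the later variable of each highly correlated pair."""
    corr = np.corrcoef(X, rowvar=False)
    keep = np.ones(X.shape[1], dtype=bool)
    for i in range(X.shape[1]):
        for j in range(i + 1, X.shape[1]):
            if keep[j] and abs(corr[i, j]) > threshold:
                keep[j] = False
    return keep

def screen_univariate(X, y, threshold=0.2):
    """Outcome-based: keep covariates with |cor(X_j, Y)| above a threshold."""
    r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return np.abs(r) > threshold

keep_blind = screen_correlated_pairs(X)
keep_cor = screen_univariate(X, y)
print(keep_blind, keep_cor)
```

On these simulated data the outcome-blind screener drops the duplicated column, while the outcome-based screener retains only the columns related to the outcome.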
<p>Strategies based on the outcome-covariate relationship, if pursued, should be combined with other algorithms in the SL library and should be evaluated using cross-validation [<xref ref-type="bibr" rid="j_nejsds82_ref_022">22</xref>]. In practice, specifying a screener-learner combination results in a new learner, where first the screener is applied and then the learner is applied on the reduced set of covariates. This becomes one of the learners in the SL library, and like any other learner, can either be chosen as part of the optimal combination or assigned zero weight. For example, suppose that <italic>q</italic> screeners and <italic>ℓ</italic> learners are considered. Then the candidate library could consist of all <inline-formula id="j_nejsds82_ineq_003"><alternatives><mml:math>
<mml:mi mathvariant="italic">q</mml:mi>
<mml:mo>×</mml:mo>
<mml:mi>ℓ</mml:mi></mml:math><tex-math><![CDATA[$q\times \ell $]]></tex-math></alternatives></inline-formula> screener-learner combinations, or a subset of these combinations chosen by the analyst. Below, we will consider all <inline-formula id="j_nejsds82_ineq_004"><alternatives><mml:math>
<mml:mi mathvariant="italic">q</mml:mi>
<mml:mo>×</mml:mo>
<mml:mi>ℓ</mml:mi></mml:math><tex-math><![CDATA[$q\times \ell $]]></tex-math></alternatives></inline-formula> screener-learner pairs. The ensembling step of the super learner assigns non-negative coefficients to each of the screener-learner combinations to create the ensemble learner.</p>
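The construction of the q × ℓ library can be sketched in Python as follows: each screener-learner pair is composed into a new candidate learner that first screens and then fits on the reduced covariate set. The screeners and learners here are toy stand-ins (q = ℓ = 2), not the algorithms used in the experiments.

```python
import numpy as np
from itertools import product

def screen_all(X, y):
    """Trivial screener: keep every covariate."""
    return np.ones(X.shape[1], dtype=bool)

def screen_top2(X, y):
    """Keep the two covariates most correlated with the outcome."""
    r = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    keep = np.zeros(X.shape[1], dtype=bool)
    keep[np.argsort(r)[-2:]] = True
    return keep

def fit_mean(X, y):
    m = y.mean()
    return lambda Xnew: np.full(len(Xnew), m)

def fit_ols(X, y):
    A = np.column_stack([np.ones(len(X)), X])
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Xnew: np.column_stack([np.ones(len(Xnew)), Xnew]) @ b

def compose(screener, learner):
    """A screener-learner pair is itself a learner: screen, then fit."""
    def fit(X, y):
        keep = screener(X, y)
        model = learner(X[:, keep], y)
        return lambda Xnew: model(Xnew[:, keep])
    return fit

screeners = [screen_all, screen_top2]  # q = 2
learners = [fit_mean, fit_ols]         # ell = 2
library = [compose(s, l) for s, l in product(screeners, learners)]

# Each composed candidate fits and predicts like any other learner.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 5))
y = X[:, 0] + rng.normal(size=50)
preds = [fit(X, y)(X) for fit in library]
print(len(library))
```

The ensembling step would then assign a non-negative weight to each of the four composed candidates, exactly as for any other learner in the library.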
</sec>
<sec id="j_nejsds82_s_003">
<label>3</label>
<title>Numerical Experiments</title>
<sec id="j_nejsds82_s_004">
<label>3.1</label>
<title>Data-Generating Mechanisms</title>
<p>To demonstrate the performance of the SL procedure using different screeners, we consider several data-generating scenarios. In each scenario, our simulated dataset consists of independent replicates of <inline-formula id="j_nejsds82_ineq_005"><alternatives><mml:math>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">X</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">Y</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$(X,Y)$]]></tex-math></alternatives></inline-formula>, where <inline-formula id="j_nejsds82_ineq_006"><alternatives><mml:math>
<mml:mi mathvariant="italic">X</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$X=({X_{1}},\dots ,{X_{p}})$]]></tex-math></alternatives></inline-formula> is a covariate vector and <italic>Y</italic> is the outcome of interest.</p>
<p>We consider a continuous outcome with <inline-formula id="j_nejsds82_ineq_007"><alternatives><mml:math>
<mml:mi mathvariant="italic">Y</mml:mi>
<mml:mo stretchy="false">∣</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">X</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">ϵ</mml:mi></mml:math><tex-math><![CDATA[$Y\mid (X=x)=f(x)+\epsilon $]]></tex-math></alternatives></inline-formula>, where <inline-formula id="j_nejsds82_ineq_008"><alternatives><mml:math>
<mml:mi mathvariant="italic">ϵ</mml:mi>
<mml:mo stretchy="false">∼</mml:mo>
<mml:mi mathvariant="italic">N</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\epsilon \sim N(0,1)$]]></tex-math></alternatives></inline-formula> independent of <italic>X</italic>; and a binary outcome with <inline-formula id="j_nejsds82_ineq_009"><alternatives><mml:math>
<mml:mi mathvariant="italic">P</mml:mi>
<mml:mi mathvariant="italic">r</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">Y</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">∣</mml:mo>
<mml:mi mathvariant="italic">X</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="normal">Φ</mml:mi>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$Pr(Y=1\mid X=x)=\Phi \{f(x)\}$]]></tex-math></alternatives></inline-formula>, where Φ denotes the cumulative distribution function of the standard normal distribution (so <italic>Y</italic> follows a probit model). The outcome regression function <italic>f</italic> is either linear, with <inline-formula id="j_nejsds82_ineq_010"><alternatives><mml:math>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mi mathvariant="italic">β</mml:mi></mml:math><tex-math><![CDATA[$f(x)=x\beta $]]></tex-math></alternatives></inline-formula>, or nonlinear, with 
<disp-formula id="j_nejsds82_eq_001">
<alternatives><mml:math display="block">
<mml:mtable displaystyle="true" columnalign="right left" columnspacing="0pt">
<mml:mtr>
<mml:mtd class="align-odd">
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mtd>
<mml:mtd class="align-even">
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo fence="true" maxsize="1.19em" minsize="1.19em">{</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo fence="true" maxsize="1.19em" minsize="1.19em">}</mml:mo>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo fence="true" maxsize="1.19em" minsize="1.19em">{</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo fence="true" maxsize="1.19em" minsize="1.19em">}</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="align-odd"/>
<mml:mtd class="align-even">
<mml:mspace width="1em"/>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo fence="true" maxsize="1.19em" minsize="1.19em">{</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo fence="true" maxsize="1.19em" minsize="1.19em">}</mml:mo>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo fence="true" maxsize="1.19em" minsize="1.19em">{</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo fence="true" maxsize="1.19em" minsize="1.19em">}</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="align-odd"/>
<mml:mtd class="align-even">
<mml:mspace width="1em"/>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo fence="true" maxsize="1.19em" minsize="1.19em">{</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo fence="true" maxsize="1.19em" minsize="1.19em">}</mml:mo>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>6</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo fence="true" maxsize="1.19em" minsize="1.19em">{</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>6</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>6</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo fence="true" maxsize="1.19em" minsize="1.19em">}</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="align-odd">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mtd>
<mml:mtd class="align-even">
<mml:mo>=</mml:mo>
<mml:mo movablelimits="false">sin</mml:mo>
<mml:mo mathvariant="normal" fence="true" maxsize="2.03em" minsize="2.03em">(</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">π</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo mathvariant="normal" fence="true" maxsize="2.03em" minsize="2.03em">)</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mspace width="1em"/>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">y</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mi mathvariant="italic">y</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="align-odd">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mtd>
<mml:mtd class="align-even">
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mspace width="1em"/>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mo movablelimits="false">cos</mml:mo>
<mml:mo mathvariant="normal" fence="true" maxsize="2.03em" minsize="2.03em">(</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">π</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mi mathvariant="italic">x</mml:mi>
<mml:mo mathvariant="normal" fence="true" maxsize="2.03em" minsize="2.03em">)</mml:mo>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[\begin{aligned}{}f(x)& ={\beta _{1}}{f_{1}}\big\{{c_{1}}({x_{1}})\big\}+{\beta _{2}}{f_{2}}\big\{{c_{2}}({x_{2}}),{c_{3}}({x_{3}})\big\}\\ {} & \hspace{1em}+{\beta _{3}}{f_{3}}\big\{{c_{3}}({x_{3}})\big\}+{\beta _{4}}{f_{4}}\big\{{c_{4}}({x_{4}})\big\}\\ {} & \hspace{1em}+{\beta _{5}}{f_{2}}\big\{{c_{5}}({x_{5}}),{c_{1}}({x_{1}})\big\}+{\beta _{6}}{f_{3}}\big\{{c_{6}}({x_{6}})\big\},\\ {} {f_{1}}(x)& =\sin \bigg(\frac{\pi }{4}x\bigg),\hspace{1em}{f_{2}}(x,y)=xy,\\ {} {f_{3}}(x)& =x,\hspace{1em}{f_{4}}(x)=\cos \bigg(\frac{\pi }{4}x\bigg).\end{aligned}\]]]></tex-math></alternatives>
</disp-formula> 
The functions <inline-formula id="j_nejsds82_ineq_011"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>6</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${c_{1}},\dots ,{c_{6}}$]]></tex-math></alternatives></inline-formula> scale each variable to have mean zero and standard deviation one. The vector <italic>β</italic> determines the strength of the relationship between outcome and covariates. We define a weak relationship between the outcome and covariates by setting <inline-formula id="j_nejsds82_ineq_012"><alternatives><mml:math>
<mml:mi mathvariant="italic">β</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mn mathvariant="bold">0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>6</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\beta =(0,1,0,0,0,1,{\mathbf{0}_{p-6}})$]]></tex-math></alternatives></inline-formula>, where <inline-formula id="j_nejsds82_ineq_013"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>6</mml:mn></mml:math><tex-math><![CDATA[$p-6$]]></tex-math></alternatives></inline-formula> variables do not affect the outcome, and a stronger relationship between the outcome and covariates by setting <inline-formula id="j_nejsds82_ineq_014"><alternatives><mml:math>
<mml:mi mathvariant="italic">β</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mo>−</mml:mo>
<mml:mn>3</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>−</mml:mo>
<mml:mn>1.5</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>−</mml:mo>
<mml:mn>0.5</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>0.5</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mn mathvariant="bold">0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>6</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\beta =(-3,-1,1,-1.5,-0.5,0.5,{\mathbf{0}_{p-6}})$]]></tex-math></alternatives></inline-formula>. The covariates follow a multivariate normal distribution with mean zero and covariance matrix Σ. In the uncorrelated case, Σ is the identity matrix. In the correlated case, the variables in the active set (a subset of the first six variables) have correlation 0.9 (in the case of the strong outcome-covariate relationship) or 0.95 (in the case of the weak relationship), while the remaining variables have correlation 0.3. Depending on the strength of the outcome-feature relationship, whether that relationship is linear or nonlinear, and whether the features are correlated, the outcome rate in the binary case ranges from approximately 13% to 80%.</p>
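As a concrete illustration of this data-generating mechanism, the following minimal Python sketch draws correlated Gaussian covariates and a binary outcome from a linear logistic model. It is not the simulation code used in the paper (which was written in R and also covers continuous-outcome and nonlinear scenarios); the function name and the block structure of the covariance matrix are our assumptions about the text's description.

```python
import numpy as np

def simulate_binary(n, p, beta, rho_active, rho_noise, active, seed=0):
    # Illustrative sketch only: the block covariance below is our reading
    # of the text; the paper's R simulation code may differ in detail.
    rng = np.random.default_rng(seed)
    sigma = np.full((p, p), rho_noise)   # baseline correlation (0.3)
    for i in active:                     # active-set block (0.9 or 0.95)
        for j in active:
            sigma[i, j] = rho_active
    np.fill_diagonal(sigma, 1.0)
    x = rng.multivariate_normal(np.zeros(p), sigma, size=n)
    prob = 1.0 / (1.0 + np.exp(-x @ beta))   # linear logistic link
    return x, rng.binomial(1, prob)

# Weak linear relationship with p = 10: beta = (0, 1, 0, 0, 0, 1, 0, ..., 0)
beta = np.array([0, 1, 0, 0, 0, 1, 0, 0, 0, 0], dtype=float)
x, y = simulate_binary(n=200, p=10, beta=beta,
                       rho_active=0.95, rho_noise=0.3, active=[1, 5])
```

Here the active set {1, 5} (zero-indexed) matches the two nonzero coefficients of the weak-relationship β.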
</sec>
<sec id="j_nejsds82_s_005">
<label>3.2</label>
<title>Prediction Algorithms</title>
<p>We compared several prediction algorithms: the lasso; the cSL without the lasso in its library of candidate learners [referred to as cSL (-lasso)]; the cSL including the lasso (referred to as cSL); and the dSL with and without the lasso in its library of candidate learners (referred to as dSL and dSL (-lasso), respectively). For the super learner approaches, we further considered four possible sets of screeners, each fit prior to any learners: no screeners; a lasso screener only; rank correlation, univariate correlation, random forest, and lasso screeners (referred to as “All” screeners); and all screeners except the lasso [referred to as “All (-lasso)”]. Tuning parameters for the screeners depended on the total number of features, except for the lasso screener, which always removed variables whose estimated regression coefficient was zero at the tuning parameter value selected by 10-fold cross-validation. For <inline-formula id="j_nejsds82_ineq_015"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>10</mml:mn></mml:math><tex-math><![CDATA[$p=10$]]></tex-math></alternatives></inline-formula>, we considered a screener that selected all variables and a univariate correlation screener that removed variables with outcome-correlation-test p-value greater than 0.2. For <inline-formula id="j_nejsds82_ineq_016"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal">&gt;</mml:mo>
<mml:mn>10</mml:mn></mml:math><tex-math><![CDATA[$p\gt 10$]]></tex-math></alternatives></inline-formula>, the rank correlation screeners removed variables outside the top 10, 25, or 50 when ranked by correlation-test p-value; the univariate correlation screener removed variables with p-value greater than 0.2 or 0.4; and the random forest screener removed variables outside the top 10 or 25 most important, ranked by the random forest variable importance measure [<xref ref-type="bibr" rid="j_nejsds82_ref_003">3</xref>].</p>
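To make the screening rules concrete, here is a minimal Python sketch of a correlation-test p-value screener and a rank-correlation screener. We assume the usual convention (as in R's SuperLearner screening wrappers) that variables with small correlation-test p-values are retained; the function names and the normal approximation to the t reference distribution are ours, not the paper's.

```python
import math
import numpy as np

def corr_test_pvalue(xj, y):
    # Two-sided p-value for the Pearson correlation test, via a normal
    # approximation to the t statistic (illustration only; R's cor.test
    # uses the exact t distribution with n - 2 degrees of freedom).
    n = len(y)
    r = np.corrcoef(xj, y)[0, 1]
    t = r * math.sqrt((n - 2) / max(1e-12, 1.0 - r * r))
    return math.erfc(abs(t) / math.sqrt(2))

def corr_pvalue_screen(x, y, max_pvalue=0.2):
    # Keep variables whose p-value falls below the threshold; variables
    # with larger p-values (weak evidence of association) are removed.
    return [j for j in range(x.shape[1])
            if corr_test_pvalue(x[:, j], y) < max_pvalue]

def rank_corr_screen(x, y, top_k=10):
    # Keep the top_k variables ranked by smallest correlation-test p-value.
    pvals = [corr_test_pvalue(x[:, j], y) for j in range(x.shape[1])]
    return list(np.argsort(pvals)[:top_k])

rng = np.random.default_rng(1)
x_demo = rng.normal(size=(200, 5))
y_demo = 2 * x_demo[:, 0] + 0.1 * rng.normal(size=200)
kept = corr_pvalue_screen(x_demo, y_demo)   # variable 0 should survive
```

In the settings above, `top_k` would range over {10, 25, 50} and `max_pvalue` over {0.2, 0.4} when p exceeds 10.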
<p>We finalized our cSL specification following the guidelines of Phillips et al. [<xref ref-type="bibr" rid="j_nejsds82_ref_022">22</xref>]. First, because we were interested in estimating the true continuous prediction function for both continuous and binary outcomes, we estimated the <italic>V</italic>-fold cross-validated least squares loss (for continuous outcomes) or log-likelihood loss (for binary outcomes); we then used the non-negative least squares (NNLS) or non-negative log-likelihood metalearner, respectively, to obtain the optimal convex combination of the candidate learners. We used stratified cross-validation [<xref ref-type="bibr" rid="j_nejsds82_ref_016">16</xref>] in the binary-outcome case, and nested cross-validation in all cases to estimate the performance of the cSL and of each individual screener-learner pair. Second, we computed the effective sample size <inline-formula id="j_nejsds82_ineq_017"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>eff</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${n_{\text{eff}}}$]]></tex-math></alternatives></inline-formula>, and based our choice of <italic>V</italic> on the flowchart in Figure 1 of Phillips et al. [<xref ref-type="bibr" rid="j_nejsds82_ref_022">22</xref>]; the values of <italic>V</italic> are provided below. Finally, the library of screener-learner pairs specified above was designed to be computationally feasible and to adapt both to high dimensions and to different underlying true regression functions.</p>
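The effective-sample-size calculation and the choice of <italic>V</italic> can be sketched as follows. The binary-outcome rule here (taking n_eff as the size of the rarer outcome class) is our inference from the values reported in Section 3.3, not a restatement of the full flowchart in Phillips et al.; the continuous-outcome rule for <italic>V</italic> matches the one stated in Section 3.3.

```python
def effective_sample_size(n, n_events=None):
    # n_eff = n for a continuous outcome. For a binary outcome we take the
    # size of the rarer class -- our reading, which reproduces the values
    # reported in the text (e.g., n_eff = 10 at n = 200 with 5% incidence).
    if n_events is None:
        return n
    return min(n_events, n - n_events)

def choose_v_continuous(n_eff):
    # Continuous-outcome rule stated in the text: V = 20 for n <= 500,
    # V = 10 otherwise (following the flowchart of Phillips et al.).
    return 20 if n_eff <= 500 else 10
```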
<table-wrap id="j_nejsds82_tab_001">
<label>Table 1</label>
<caption>
<p>All possible candidate learners for super learners used in the simulations, along with their R implementation, tuning parameter values, and description of the tuning parameters. All tuning parameters besides those listed here are set to their default values. In particular, the random forests are grown with a minimum node size of 5 for continuous outcomes and 1 for binary outcomes and a subsampling fraction of 1; the boosted trees are grown with shrinkage rate of 0.1, and a minimum of 10 observations per node.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: double; border-bottom: solid thin">Candidate learner</td>
<td style="vertical-align: top; text-align: left; border-top: double; border-bottom: solid thin">R implementation</td>
<td style="vertical-align: top; text-align: left; border-top: double; border-bottom: solid thin">Tuning parameter and possible values</td>
<td style="vertical-align: top; text-align: left; border-top: double; border-bottom: solid thin">Tuning parameter description</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">Generalized linear models</td>
<td style="vertical-align: top; text-align: left"><monospace>base</monospace></td>
<td style="vertical-align: top; text-align: left">–</td>
<td style="vertical-align: top; text-align: left">–</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Random forests</td>
<td style="vertical-align: top; text-align: left"><monospace>ranger</monospace> [<xref ref-type="bibr" rid="j_nejsds82_ref_033">33</xref>]</td>
<td style="vertical-align: top; text-align: left"><monospace>num.trees</monospace> = 1000</td>
<td style="vertical-align: top; text-align: left">Number of trees</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"/>
<td style="vertical-align: top; text-align: left"/>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_nejsds82_ineq_018"><alternatives><mml:math>
<mml:mtext mathvariant="monospace">min.node.size</mml:mtext>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mn>5</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>20</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>50</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>100</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>250</mml:mn>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$\texttt{min.node.size}\in \{5,20,50,100,250\}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">Minimum node size</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Gradient boosted trees</td>
<td style="vertical-align: top; text-align: left"><monospace>xgboost</monospace> [<xref ref-type="bibr" rid="j_nejsds82_ref_007">7</xref>]</td>
<td style="vertical-align: top; text-align: left"><monospace>max.depth</monospace> <inline-formula id="j_nejsds82_ineq_019"><alternatives><mml:math>
<mml:mo>=</mml:mo>
<mml:mn>4</mml:mn></mml:math><tex-math><![CDATA[$=4$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">Maximum tree depth</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"/>
<td style="vertical-align: top; text-align: left"/>
<td style="vertical-align: top; text-align: left"><monospace>ntree</monospace> <inline-formula id="j_nejsds82_ineq_020"><alternatives><mml:math>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mn>100</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>500</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>1000</mml:mn>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$\in \{100,500,1000\}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">Number of iterations</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"/>
<td style="vertical-align: top; text-align: left"/>
<td style="vertical-align: top; text-align: left"><monospace>shrinkage</monospace> <inline-formula id="j_nejsds82_ineq_021"><alternatives><mml:math>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mn>0.01</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>0.1</mml:mn>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$\in \{0.01,0.1\}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">Shrinkage</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Multivariate adaptive regression splines</td>
<td style="vertical-align: top; text-align: left"><monospace>earth</monospace> [<xref ref-type="bibr" rid="j_nejsds82_ref_019">19</xref>]</td>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_nejsds82_ineq_022"><alternatives><mml:math>
<mml:mtext mathvariant="monospace">nk</mml:mtext>
<mml:mo>=</mml:mo>
<mml:mo movablelimits="false">min</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mo movablelimits="false">max</mml:mo>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mn>21</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>2</mml:mn>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo fence="true" stretchy="false">}</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>1000</mml:mn>
<mml:mo fence="true" stretchy="false">}</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="normal">†</mml:mi>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[$\texttt{nk}=\min {\{\max \{21,2p+1\},1000\}^{\dagger }}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">Maximum number of model terms before pruning</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Lasso</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><monospace>glmnet</monospace> [<xref ref-type="bibr" rid="j_nejsds82_ref_012">12</xref>]</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><italic>λ</italic>, chosen via 10-fold cross-validation</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><inline-formula id="j_nejsds82_ineq_023"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi>ℓ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\ell _{1}}$]]></tex-math></alternatives></inline-formula> regularization parameter</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><inline-formula id="j_nejsds82_ineq_024"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mi mathvariant="normal">†</mml:mi>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{\dagger }}$]]></tex-math></alternatives></inline-formula>: <italic>p</italic> denotes the total number of predictors.</p>
</table-wrap-foot>
</table-wrap>
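The <monospace>nk</monospace> formula in Table 1 adapts the MARS basis-function budget to the dimension of the problem. As a quick check of the formula (the helper name is ours, not part of the <monospace>earth</monospace> package):

```python
def earth_nk(p):
    # Maximum number of MARS model terms before pruning, per Table 1:
    # nk = min{max{21, 2p + 1}, 1000}, with p the number of predictors.
    return min(max(21, 2 * p + 1), 1000)

nk_low = earth_nk(10)    # 21: the floor of 21 terms applies for small p
nk_high = earth_nk(500)  # 1000: the cap binds in the high-dimensional setting
```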
</sec>
<sec id="j_nejsds82_s_006">
<label>3.3</label>
<title>Experimental Overview</title>
<p>For each <inline-formula id="j_nejsds82_ineq_025"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mn>200</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>500</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>1000</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>2000</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>3000</mml:mn>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$n\in \{200,500,1000,2000,3000\}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds82_ineq_026"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mn>10</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>500</mml:mn>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$p\in \{10,500\}$]]></tex-math></alternatives></inline-formula>, and simulation scenario described above, we generated 1000 random datasets according to this data generating mechanism. For continuous outcomes, <inline-formula id="j_nejsds82_ineq_027"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>eff</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">n</mml:mi></mml:math><tex-math><![CDATA[${n_{\text{eff}}}=n$]]></tex-math></alternatives></inline-formula>; thus, we set <inline-formula id="j_nejsds82_ineq_028"><alternatives><mml:math>
<mml:mi mathvariant="italic">V</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>20</mml:mn></mml:math><tex-math><![CDATA[$V=20$]]></tex-math></alternatives></inline-formula> for <inline-formula id="j_nejsds82_ineq_029"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo stretchy="false">≤</mml:mo>
<mml:mn>500</mml:mn></mml:math><tex-math><![CDATA[$n\le 500$]]></tex-math></alternatives></inline-formula> and set <inline-formula id="j_nejsds82_ineq_030"><alternatives><mml:math>
<mml:mi mathvariant="italic">V</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>10</mml:mn></mml:math><tex-math><![CDATA[$V=10$]]></tex-math></alternatives></inline-formula> otherwise. For binary outcomes, <inline-formula id="j_nejsds82_ineq_031"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>eff</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${n_{\text{eff}}}$]]></tex-math></alternatives></inline-formula> ranged from 10 (the 5% incidence outcome at <inline-formula id="j_nejsds82_ineq_032"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>200</mml:mn></mml:math><tex-math><![CDATA[$n=200$]]></tex-math></alternatives></inline-formula>) to 1367 (a 54% incidence outcome at <inline-formula id="j_nejsds82_ineq_033"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>3000</mml:mn></mml:math><tex-math><![CDATA[$n=3000$]]></tex-math></alternatives></inline-formula>). We set <inline-formula id="j_nejsds82_ineq_034"><alternatives><mml:math>
<mml:mi mathvariant="italic">V</mml:mi>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>eff</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[$V={n_{\text{eff}}}$]]></tex-math></alternatives></inline-formula> in three cases, and <inline-formula id="j_nejsds82_ineq_035"><alternatives><mml:math>
<mml:mi mathvariant="italic">V</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>20</mml:mn></mml:math><tex-math><![CDATA[$V=20$]]></tex-math></alternatives></inline-formula> or <inline-formula id="j_nejsds82_ineq_036"><alternatives><mml:math>
<mml:mi mathvariant="italic">V</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>10</mml:mn></mml:math><tex-math><![CDATA[$V=10$]]></tex-math></alternatives></inline-formula> otherwise, depending on the value of <inline-formula id="j_nejsds82_ineq_037"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>eff</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${n_{\text{eff}}}$]]></tex-math></alternatives></inline-formula>. The exact values of <inline-formula id="j_nejsds82_ineq_038"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>eff</mml:mtext>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${n_{\text{eff}}}$]]></tex-math></alternatives></inline-formula> and <italic>V</italic> used are provided in the Supporting Information. We additionally generated a test dataset of 1 million observations in each replication to estimate the true prediction performance of each prediction function estimated using <italic>V</italic>-fold cross-validation. We measured prediction performance for each algorithm described above using R-squared for continuous outcomes, and both the area under the receiver operating characteristic curve (AUC) and the negative log-likelihood for binary outcomes. For the continuous outcome, R-squared is equivalent to the cross-validated metric being optimized, the mean squared error: R-squared equals one minus the mean squared error divided by the outcome variance, so minimizing one is the same as maximizing the other. For the binary outcome, AUC is often of interest when assessing prediction performance. AUC is not equivalent to the negative log-likelihood; however, developing a super learner using the AUC loss directly can be unstable in some settings.</p>
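The relationship between the two continuous-outcome metrics can be made explicit: cross-validated R-squared is an affine, decreasing function of cross-validated MSE, so the two produce the same ranking of learners. A minimal sketch:

```python
import numpy as np

def cv_r_squared(y, y_hat):
    # R^2 = 1 - MSE / Var(Y): an affine, decreasing function of MSE, so
    # minimizing cross-validated MSE is the same as maximizing R^2.
    mse = np.mean((y - y_hat) ** 2)
    return 1.0 - mse / np.var(y)

y = np.array([1.0, 2.0, 3.0, 4.0])
r2_perfect = cv_r_squared(y, y)                       # 1.0: zero MSE
r2_mean = cv_r_squared(y, np.full(4, y.mean()))       # 0.0: mean-only predictor
```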
</sec>
<sec id="j_nejsds82_s_007">
<label>3.4</label>
<title>Results</title>
<p>We display the results under a strong outcome-feature relationship in Figures <xref rid="j_nejsds82_fig_001">1</xref> and <xref rid="j_nejsds82_fig_002">2</xref>. Focusing first on a continuous outcome, when the outcome-feature relationship is linear (Figure <xref rid="j_nejsds82_fig_001">1</xref> left column), all estimators have prediction performance converging quickly to the best-possible prediction performance as the sample size increases. In small samples with a linear relationship, removing the lasso from the SL library results in decreased performance. When the outcome-feature relationship is nonlinear (Figure <xref rid="j_nejsds82_fig_001">1</xref> right column), the results depend on the variable screeners and algorithm used. The lasso has poor performance regardless of sample size, particularly in the case with correlated features; this is consistent with theory [<xref ref-type="bibr" rid="j_nejsds82_ref_017">17</xref>]. Also, particularly for large numbers of features (e.g., when <inline-formula id="j_nejsds82_ineq_039"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>500</mml:mn></mml:math><tex-math><![CDATA[$p=500$]]></tex-math></alternatives></inline-formula>), using the lasso screener alone within a super learner degrades performance, while using a large library of candidate screeners can improve performance over a super learner with no screeners and can protect against poor lasso performance. Results are similar for the binary outcome.</p>
<fig id="j_nejsds82_fig_001">
<label>Figure 1</label>
<caption>
<p>Prediction performance versus sample size <italic>n</italic>, measured using cross-validated R-squared, for predicting a continuous outcome. There is a strong relationship between outcome and features. The top row shows results for correlated features, while the bottom row shows results for uncorrelated features. The left-hand column shows results for a linear outcome-feature relationship, while the right-hand column shows results for a nonlinear outcome-feature relationship. The dashed line denotes the best-possible prediction performance in each setting. Color denotes the variable screeners, while shape denotes the estimator (lasso, convex ensemble super learner [cSL], and discrete super learner [dSL]). Note that the y-axis limits differ between panels.</p>
</caption>
<graphic xlink:href="nejsds82_g001.jpg"/>
</fig>
<p>The results under a weak outcome-feature relationship follow similar patterns (Figures <xref rid="j_nejsds82_fig_003">3</xref> and <xref rid="j_nejsds82_fig_004">4</xref>). In this case, the best-possible prediction performance is lower than in the strong-relationship case, as expected, and a larger sample size is required to achieve prediction performance close to this optimal level.</p>
<fig id="j_nejsds82_fig_002">
<label>Figure 2</label>
<caption>
<p>Prediction performance versus sample size <italic>n</italic>, measured using cross-validated AUC, for predicting a binary outcome. There is a strong relationship between outcome and features. The top row shows results for correlated features, while the bottom row shows results for uncorrelated features. The left-hand column shows results for a linear outcome-feature relationship, while the right-hand column shows results for a nonlinear outcome-feature relationship. The dashed line denotes the best-possible prediction performance in each setting. Color denotes the variable screeners, while shape denotes the estimator (lasso, convex ensemble super learner [cSL], and discrete super learner [dSL]). Note that the y-axis limits differ between panels.</p>
</caption>
<graphic xlink:href="nejsds82_g002.jpg"/>
</fig>
<fig id="j_nejsds82_fig_003">
<label>Figure 3</label>
<caption>
<p>Prediction performance versus sample size <italic>n</italic>, measured using cross-validated R-squared, for predicting a continuous outcome. There is a weak relationship between outcome and features. The top row shows results for correlated features, while the bottom row shows results for uncorrelated features. The left-hand column shows results for a linear outcome-feature relationship, while the right-hand column shows results for a nonlinear outcome-feature relationship. The dashed line denotes the best-possible prediction performance in each setting. Color denotes the variable screeners, while shape denotes the estimator (lasso, convex ensemble super learner [cSL], and discrete super learner [dSL]). Note that the y-axis limits differ between panels.</p>
</caption>
<graphic xlink:href="nejsds82_g003.jpg"/>
</fig>
<fig id="j_nejsds82_fig_004">
<label>Figure 4</label>
<caption>
<p>Prediction performance versus sample size <italic>n</italic>, measured using cross-validated AUC, for predicting a binary outcome. There is a weak relationship between outcome and features. The top row shows results for correlated features, while the bottom row shows results for uncorrelated features. The left-hand column shows results for a linear outcome-feature relationship, while the right-hand column shows results for a nonlinear outcome-feature relationship. The dashed line denotes the best-possible prediction performance in each setting. Color denotes the variable screeners, while shape denotes the estimator (lasso, convex ensemble super learner [cSL], and discrete super learner [dSL]). Note that the y-axis limits differ between panels.</p>
</caption>
<graphic xlink:href="nejsds82_g004.jpg"/>
</fig>
<table-wrap id="j_nejsds82_tab_002">
<label>Table 2</label>
<caption>
<p>Estimates of cross-validated R-squared for the continuous <inline-formula id="j_nejsds82_ineq_040"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext>IC</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mn>50</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\text{IC}_{50}}$]]></tex-math></alternatives></inline-formula> outcome, for the convex ensemble super learner (cSL), the discrete super learner (dSL), and the lasso, under each combination of learners and screeners. For screeners, ‘None’ denotes no screeners; ‘Lasso’ denotes only a lasso screener; ‘All (-lasso)’ denotes random forest, rank-correlation, and correlation-test p-value screening; ‘All’ denotes these three screener types plus the lasso; and ‘All (+none)’ denotes all screeners plus the ‘none’ screener.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: double; border-bottom: solid thin">Learners</td>
<td style="vertical-align: top; text-align: left; border-top: double; border-bottom: solid thin">Screeners</td>
<td style="vertical-align: top; text-align: left; border-top: double; border-bottom: solid thin">Algorithm</td>
<td style="vertical-align: top; text-align: left; border-top: double; border-bottom: solid thin">Min</td>
<td style="vertical-align: top; text-align: left; border-top: double; border-bottom: solid thin">Max</td>
<td style="vertical-align: top; text-align: left; border-top: double; border-bottom: solid thin">Point estimate [95% CI]</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">All</td>
<td style="vertical-align: top; text-align: left">None</td>
<td style="vertical-align: top; text-align: left">cSL</td>
<td style="vertical-align: top; text-align: left">0.208</td>
<td style="vertical-align: top; text-align: left">0.501</td>
<td style="vertical-align: top; text-align: left">0.373 [0.353, 0.393]</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">All</td>
<td style="vertical-align: top; text-align: left">None</td>
<td style="vertical-align: top; text-align: left">dSL</td>
<td style="vertical-align: top; text-align: left">0.058</td>
<td style="vertical-align: top; text-align: left">0.491</td>
<td style="vertical-align: top; text-align: left">0.366 [0.347, 0.385]</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">All</td>
<td style="vertical-align: top; text-align: left">None</td>
<td style="vertical-align: top; text-align: left">lasso</td>
<td style="vertical-align: top; text-align: left">0.331</td>
<td style="vertical-align: top; text-align: left">0.331</td>
<td style="vertical-align: top; text-align: left">0.331 [0.305, 0.358]</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">All</td>
<td style="vertical-align: top; text-align: left">Lasso</td>
<td style="vertical-align: top; text-align: left">cSL</td>
<td style="vertical-align: top; text-align: left">0.175</td>
<td style="vertical-align: top; text-align: left">0.527</td>
<td style="vertical-align: top; text-align: left">0.388 [0.364, 0.414]</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">All</td>
<td style="vertical-align: top; text-align: left">Lasso</td>
<td style="vertical-align: top; text-align: left">dSL</td>
<td style="vertical-align: top; text-align: left">0.173</td>
<td style="vertical-align: top; text-align: left">0.516</td>
<td style="vertical-align: top; text-align: left">0.387 [0.366, 0.409]</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">All</td>
<td style="vertical-align: top; text-align: left">All (-lasso)</td>
<td style="vertical-align: top; text-align: left">cSL</td>
<td style="vertical-align: top; text-align: left">0.182</td>
<td style="vertical-align: top; text-align: left">0.535</td>
<td style="vertical-align: top; text-align: left">0.390 [0.370, 0.411]</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">All</td>
<td style="vertical-align: top; text-align: left">All (-lasso)</td>
<td style="vertical-align: top; text-align: left">dSL</td>
<td style="vertical-align: top; text-align: left">0.192</td>
<td style="vertical-align: top; text-align: left">0.519</td>
<td style="vertical-align: top; text-align: left">0.391 [0.372, 0.411]</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">All</td>
<td style="vertical-align: top; text-align: left">All</td>
<td style="vertical-align: top; text-align: left">cSL</td>
<td style="vertical-align: top; text-align: left">0.180</td>
<td style="vertical-align: top; text-align: left">0.545</td>
<td style="vertical-align: top; text-align: left">0.394 [0.371, 0.417]</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">All</td>
<td style="vertical-align: top; text-align: left">All</td>
<td style="vertical-align: top; text-align: left">dSL</td>
<td style="vertical-align: top; text-align: left">0.173</td>
<td style="vertical-align: top; text-align: left">0.516</td>
<td style="vertical-align: top; text-align: left">0.387 [0.365, 0.409]</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">All</td>
<td style="vertical-align: top; text-align: left">All (+none)</td>
<td style="vertical-align: top; text-align: left">cSL</td>
<td style="vertical-align: top; text-align: left">0.203</td>
<td style="vertical-align: top; text-align: left">0.533</td>
<td style="vertical-align: top; text-align: left">0.378 [0.354, 0.403]</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">All</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">All (+none)</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">dSL</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.173</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.516</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.387 [0.365, 0.409]</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="j_nejsds82_tab_003">
<label>Table 3</label>
<caption>
<p>Estimates of cross-validated AUC for the binary sensitivity outcome, for the convex ensemble super learner (cSL), the discrete super learner (dSL), and the lasso, under each combination of learners and screeners. For screeners, ‘None’ denotes no screeners; ‘Lasso’ denotes only a lasso screener; ‘All (-lasso)’ denotes random forest, rank-correlation, and correlation-test p-value screening; ‘All’ denotes these three screener types plus the lasso; and ‘All (+none)’ denotes all screeners plus the ‘none’ screener.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: double; border-bottom: solid thin">Learners</td>
<td style="vertical-align: top; text-align: left; border-top: double; border-bottom: solid thin">Screeners</td>
<td style="vertical-align: top; text-align: left; border-top: double; border-bottom: solid thin">Algorithm</td>
<td style="vertical-align: top; text-align: left; border-top: double; border-bottom: solid thin">Min</td>
<td style="vertical-align: top; text-align: left; border-top: double; border-bottom: solid thin">Max</td>
<td style="vertical-align: top; text-align: left; border-top: double; border-bottom: solid thin">Point estimate [95% CI]</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">All</td>
<td style="vertical-align: top; text-align: left">None</td>
<td style="vertical-align: top; text-align: left">cSL</td>
<td style="vertical-align: top; text-align: left">0.755</td>
<td style="vertical-align: top; text-align: left">0.874</td>
<td style="vertical-align: top; text-align: left">0.823 [0.719, 0.928]</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">All</td>
<td style="vertical-align: top; text-align: left">None</td>
<td style="vertical-align: top; text-align: left">dSL</td>
<td style="vertical-align: top; text-align: left">0.763</td>
<td style="vertical-align: top; text-align: left">0.895</td>
<td style="vertical-align: top; text-align: left">0.837 [0.737, 0.936]</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">All</td>
<td style="vertical-align: top; text-align: left">None</td>
<td style="vertical-align: top; text-align: left">lasso</td>
<td style="vertical-align: top; text-align: left">0.647</td>
<td style="vertical-align: top; text-align: left">0.813</td>
<td style="vertical-align: top; text-align: left">0.757 [0.633, 0.882]</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">All</td>
<td style="vertical-align: top; text-align: left">Lasso</td>
<td style="vertical-align: top; text-align: left">cSL</td>
<td style="vertical-align: top; text-align: left">0.727</td>
<td style="vertical-align: top; text-align: left">0.865</td>
<td style="vertical-align: top; text-align: left">0.806 [0.696, 0.915]</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">All</td>
<td style="vertical-align: top; text-align: left">Lasso</td>
<td style="vertical-align: top; text-align: left">dSL</td>
<td style="vertical-align: top; text-align: left">0.730</td>
<td style="vertical-align: top; text-align: left">0.897</td>
<td style="vertical-align: top; text-align: left">0.811 [0.703, 0.919]</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">All</td>
<td style="vertical-align: top; text-align: left">All (-lasso)</td>
<td style="vertical-align: top; text-align: left">cSL</td>
<td style="vertical-align: top; text-align: left">0.752</td>
<td style="vertical-align: top; text-align: left">0.906</td>
<td style="vertical-align: top; text-align: left">0.826 [0.723, 0.929]</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">All</td>
<td style="vertical-align: top; text-align: left">All (-lasso)</td>
<td style="vertical-align: top; text-align: left">dSL</td>
<td style="vertical-align: top; text-align: left">0.772</td>
<td style="vertical-align: top; text-align: left">0.907</td>
<td style="vertical-align: top; text-align: left">0.827 [0.724, 0.929]</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">All</td>
<td style="vertical-align: top; text-align: left">All</td>
<td style="vertical-align: top; text-align: left">cSL</td>
<td style="vertical-align: top; text-align: left">0.750</td>
<td style="vertical-align: top; text-align: left">0.873</td>
<td style="vertical-align: top; text-align: left">0.823 [0.719, 0.928]</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">All</td>
<td style="vertical-align: top; text-align: left">All</td>
<td style="vertical-align: top; text-align: left">dSL</td>
<td style="vertical-align: top; text-align: left">0.772</td>
<td style="vertical-align: top; text-align: left">0.897</td>
<td style="vertical-align: top; text-align: left">0.826 [0.723, 0.929]</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">All</td>
<td style="vertical-align: top; text-align: left">All (+none)</td>
<td style="vertical-align: top; text-align: left">cSL</td>
<td style="vertical-align: top; text-align: left">0.746</td>
<td style="vertical-align: top; text-align: left">0.879</td>
<td style="vertical-align: top; text-align: left">0.825 [0.720, 0.929]</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">All</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">All (+none)</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">dSL</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.772</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.897</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">0.829 [0.727, 0.931]</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>In the Supporting Information, we provide additional results. Results for the binary outcome with respect to negative log-likelihood follow patterns similar to those observed here using AUC. We considered further feature dimensions <italic>p</italic> with a fixed number of cross-validation folds <italic>V</italic>, and found results similar to the primary results presented above. Finally, we present results for <inline-formula id="j_nejsds82_ineq_041"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>500</mml:mn></mml:math><tex-math><![CDATA[$n=500$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_nejsds82_ineq_042"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>2000</mml:mn></mml:math><tex-math><![CDATA[$p=2000$]]></tex-math></alternatives></inline-formula> and for candidate learners within the super learner. In the high-dimensional setting, performance follows the same trends across outcomes and estimators as the other <inline-formula id="j_nejsds82_ineq_043"><alternatives><mml:math>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$(n,p)$]]></tex-math></alternatives></inline-formula> combinations.</p>
</sec>
</sec>
<sec id="j_nejsds82_s_008">
<label>4</label>
<title>Predicting HIV-1 Neutralization Susceptibility</title>
<p>HIV-1 is a genetically diverse pathogen. Broadly neutralizing antibodies (bnAbs) against HIV-1 neutralize a wide array of HIV-1 genetic variants. One such bnAb, VRC01, was recently evaluated in two placebo-controlled randomized trials [<xref ref-type="bibr" rid="j_nejsds82_ref_009">9</xref>]. Predicting whether or not a given HIV-1 virus is susceptible to neutralization by a bnAb, including VRC01, is an important component of prevention research; several prediction models have been developed recently [<xref ref-type="bibr" rid="j_nejsds82_ref_015">15</xref>, <xref ref-type="bibr" rid="j_nejsds82_ref_005">5</xref>, <xref ref-type="bibr" rid="j_nejsds82_ref_014">14</xref>, <xref ref-type="bibr" rid="j_nejsds82_ref_004">4</xref>, <xref ref-type="bibr" rid="j_nejsds82_ref_026">26</xref>, <xref ref-type="bibr" rid="j_nejsds82_ref_008">8</xref>, <xref ref-type="bibr" rid="j_nejsds82_ref_034">34</xref>, <xref ref-type="bibr" rid="j_nejsds82_ref_018">18</xref>, <xref ref-type="bibr" rid="j_nejsds82_ref_030">30</xref>, <xref ref-type="bibr" rid="j_nejsds82_ref_011">11</xref>, <xref ref-type="bibr" rid="j_nejsds82_ref_031">31</xref>].</p>
<p>We analyze HIV-1 envelope (Env) amino acid (AA) sequence data from 611 publicly available HIV-1 Env pseudoviruses made from blood samples of HIV-1-infected individuals [<xref ref-type="bibr" rid="j_nejsds82_ref_018">18</xref>]. In addition to binary indicators of specific AA residues at each position in the Env sequence, the data include information on the geographic region of origin of the virus, the subtype of the virus, and viral geometry; there are over 800 features in total. We considered two outcomes of interest: the <inline-formula id="j_nejsds82_ineq_044"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mo movablelimits="false">log</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\log _{10}}$]]></tex-math></alternatives></inline-formula>-transformed 50% inhibitory concentration, <inline-formula id="j_nejsds82_ineq_045"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext>IC</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mn>50</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\text{IC}_{50}}$]]></tex-math></alternatives></inline-formula>, defined as the concentration (<inline-formula id="j_nejsds82_ineq_046"><alternatives><mml:math>
<mml:mtext>μg</mml:mtext></mml:math><tex-math><![CDATA[$\mu \text{g}$]]></tex-math></alternatives></inline-formula>/<inline-formula id="j_nejsds82_ineq_047"><alternatives><mml:math>
<mml:mi mathvariant="normal">mL</mml:mi></mml:math><tex-math><![CDATA[$\mathrm{mL}$]]></tex-math></alternatives></inline-formula>) of VRC01 necessary to neutralize 50% of viruses in vitro, with large values of <inline-formula id="j_nejsds82_ineq_048"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext>IC</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mn>50</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\text{IC}_{50}}$]]></tex-math></alternatives></inline-formula> indicating resistance to neutralization; and susceptibility to neutralization, defined as the binary indicator that <inline-formula id="j_nejsds82_ineq_049"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mtext>IC</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mn>50</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">&lt;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mspace width="2.5pt"/>
<mml:mtext>μg</mml:mtext></mml:math><tex-math><![CDATA[${\text{IC}_{50}}\lt 1\hspace{2.5pt}\mu \text{g}$]]></tex-math></alternatives></inline-formula>/<inline-formula id="j_nejsds82_ineq_050"><alternatives><mml:math>
<mml:mi mathvariant="normal">mL</mml:mi></mml:math><tex-math><![CDATA[$\mathrm{mL}$]]></tex-math></alternatives></inline-formula>. For each outcome, we considered the same prediction algorithms and eSL specification as in Section <xref rid="j_nejsds82_s_003">3</xref>. Following Phillips et al. [<xref ref-type="bibr" rid="j_nejsds82_ref_022">22</xref>], we set <inline-formula id="j_nejsds82_ineq_051"><alternatives><mml:math>
<mml:mi mathvariant="italic">V</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>10</mml:mn></mml:math><tex-math><![CDATA[$V=10$]]></tex-math></alternatives></inline-formula> for both the continuous and binary outcome.</p>
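<p>As a concrete illustration of the two outcome definitions above, the following sketch computes both outcomes from a handful of hypothetical IC<sub>50</sub> readouts (the values and variable names are illustrative only; the analyses in this paper were conducted in R):</p>

```python
import numpy as np

# Hypothetical IC50 readouts (micrograms/mL) for five pseudoviruses.
ic50 = np.array([0.05, 0.8, 1.0, 12.5, 45.0])

# Continuous outcome: log10-transformed IC50; larger values indicate
# resistance to neutralization.
y_continuous = np.log10(ic50)

# Binary outcome: susceptible to neutralization if IC50 < 1 microgram/mL.
y_binary = (ic50 < 1.0).astype(int)
```

<p>Note that under the strict inequality, a virus with IC<sub>50</sub> exactly equal to 1 is classified as resistant.</p>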
<p>The results are presented in Tables <xref rid="j_nejsds82_tab_002">2</xref> and <xref rid="j_nejsds82_tab_003">3</xref>. For both outcomes, some screening tended to be beneficial. Among the analyses that used screeners, the lasso screener alone resulted in the worst performance for the binary outcome and nearly the worst for the continuous outcome. As in the simulations, a large set of screeners protected against poor lasso performance for both outcomes; the lasso itself performed worse than either the cSL or the dSL. The lasso had a cross-validated (CV) R-squared for the continuous outcome of 0.331 with a 95% confidence interval (CI) of [0.305, 0.358], and a CV AUC for the binary outcome of 0.757 [0.633, 0.882]. For the continuous outcome, the largest point estimate of CV R-squared, 0.394 [0.371, 0.417], was achieved by the cSL with all screeners, including the lasso. The best-performing dSL used all screeners except the lasso, with CV R-squared 0.391 [0.372, 0.411]. For the binary outcome, the largest CV AUC for the cSL was 0.826 [0.723, 0.929], achieved with all screeners except the lasso; for the dSL, the largest CV AUC was 0.837 [0.737, 0.936], achieved with no screeners. In the Supporting Information, we present cross-validated performance for the candidate learners in each cSL; cross-validated negative log-likelihood loss for the binary susceptibility outcome; and the cSL coefficients and dSLs for each cross-validation fold.</p>
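<p>The cross-validated AUC estimates reported above can be sketched as follows. This is a deliberately simplified illustration in Python: a single simulated feature stands in for a fitted learner's predictions, and a crude between-fold variance estimate replaces the influence-function-based confidence intervals used in our analyses.</p>

```python
import numpy as np

rng = np.random.default_rng(1)

def empirical_auc(y, score):
    # P(score for a random case > score for a random control); ties count 1/2.
    pos, neg = score[y == 1], score[y == 0]
    return (pos[:, None] > neg[None, :]).mean() + 0.5 * (pos[:, None] == neg[None, :]).mean()

# Toy data: one informative feature, used directly as the prediction score
# in place of a learner fit on the training folds.
n, V = 300, 10
x = rng.normal(size=n)
y = rng.binomial(1, 1 / (1 + np.exp(-x)))

# V-fold cross-validated AUC: average the AUC over held-out folds.
folds = np.array_split(rng.permutation(n), V)
aucs = [empirical_auc(y[idx], x[idx]) for idx in folds]
cv_auc = float(np.mean(aucs))

# Crude Wald-style 95% CI from between-fold variability.
se = np.std(aucs, ddof=1) / np.sqrt(V)
ci = (cv_auc - 1.96 * se, cv_auc + 1.96 * se)
```

<p>In a full analysis, the score for each held-out fold would come from a learner fit on the remaining folds, so that the AUC estimate reflects out-of-sample performance.</p>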
</sec>
<sec id="j_nejsds82_s_009">
<label>5</label>
<title>Discussion</title>
<p>In this manuscript, we explored the effect of using different combinations of variable screeners within the super learner. We found that both the lasso and the convex ensemble super learner (cSL) using only a lasso screener had poor prediction performance when the outcome-feature relationship was nonlinear, i.e., when the lasso was misspecified. However, when a sufficiently rich set of candidate screeners was included, adding the lasso as a candidate screener did not degrade performance. These results held for both continuous and binary outcomes, and for both strong and weak relationships between the outcome and features. The same patterns held for the discrete super learner (dSL). In an analysis of 611 HIV-1 envelope protein pseudoviruses with over 800 features, we found results similar to the simulations; there, the dSL tended to perform similarly to the cSL.</p>
<p>Taken together, the results suggest that some caution is warranted when specifying screeners within a super learner, but that a sufficiently large set of candidate screeners can protect against misspecification of any given screener. This echoes the recommendation to specify a diverse set of candidate learners in a super learner [<xref ref-type="bibr" rid="j_nejsds82_ref_022">22</xref>], and can be viewed as complementary to it, since each algorithm-screener pair defines a new candidate learner.</p>
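<p>The point that each algorithm-screener pair defines a new candidate learner can be sketched as follows (an illustrative Python sketch with stand-in screeners and an ordinary-least-squares learner; not the implementation used in our analyses):</p>

```python
from itertools import product

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)  # only 2 informative features

# Screeners map (X, y) to a boolean mask of retained features.
def screen_none(X, y):
    return np.ones(X.shape[1], dtype=bool)

def screen_corr_top5(X, y):
    # Retain the 5 features with the largest absolute correlation with y.
    corr = np.abs(np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])]))
    mask = np.zeros(X.shape[1], dtype=bool)
    mask[np.argsort(corr)[-5:]] = True
    return mask

# A stand-in learner fit on the screened columns: ordinary least squares.
def fit_ols(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

screeners = {"none": screen_none, "corr_top5": screen_corr_top5}
learners = {"ols": fit_ols}

# Each algorithm-screener pair defines one candidate learner in the library.
library = {}
for (s_name, screen), (l_name, learn) in product(screeners.items(), learners.items()):
    mask = screen(X, y)
    library[f"{s_name}_{l_name}"] = (mask, learn(X[:, mask], y))
```

<p>In a super learner, each such pair would be evaluated as a single candidate, with screening re-run within each training fold so that feature selection does not leak information from the held-out fold.</p>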
</sec>
</body>
<back>
<ref-list id="j_nejsds82_reflist_001">
<title>References</title>
<ref id="j_nejsds82_ref_001">
<label>[1]</label><mixed-citation publication-type="journal"><string-name><surname>Balzer</surname>, <given-names>L. B.</given-names></string-name> and <string-name><surname>Westling</surname>, <given-names>T.</given-names></string-name> (<year>2021</year>). <article-title>Demystifying statistical inference when using machine learning in causal research</article-title>. <source>American Journal of Epidemiology</source> <volume>200</volume>.</mixed-citation>
</ref>
<ref id="j_nejsds82_ref_002">
<label>[2]</label><mixed-citation publication-type="chapter"><string-name><surname>Barron</surname>, <given-names>A.</given-names></string-name> (<year>1989</year>). <chapter-title>Statistical properties of artificial neural networks</chapter-title>. In <source>Proceedings of the 28th IEEE Conference on Decision and Control</source> <fpage>280</fpage>–<lpage>285</lpage>. <publisher-name>IEEE</publisher-name>.</mixed-citation>
</ref>
<ref id="j_nejsds82_ref_003">
<label>[3]</label><mixed-citation publication-type="journal"><string-name><surname>Breiman</surname>, <given-names>L.</given-names></string-name> (<year>2001</year>). <article-title>Random forests</article-title>. <source>Machine Learning</source> <volume>45</volume>(<issue>1</issue>) <fpage>5</fpage>–<lpage>32</lpage>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=3874153">MR3874153</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds82_ref_004">
<label>[4]</label><mixed-citation publication-type="journal"><string-name><surname>Bricault</surname>, <given-names>C. A.</given-names></string-name>, <string-name><surname>Yusim</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Seaman</surname>, <given-names>M. S.</given-names></string-name>, <string-name><surname>Yoon</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Theiler</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Giorgi</surname>, <given-names>E. E.</given-names></string-name>, <string-name><surname>Wagh</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Theiler</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Hraber</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Macke</surname>, <given-names>J. P.</given-names></string-name>, <string-name><surname>Kreider</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Learn</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Hahn</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Scheid</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Kovacs</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Shields</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Lavine</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Ghantous</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Rist</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Bayne</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Neubauer</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>McMahan</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Peng</surname>, 
<given-names>H.</given-names></string-name>, <string-name><surname>Cheneau</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Jones</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Zeng</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Oschsenbauer</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Nkolola</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Stephenson</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Chen</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Gnanakaran</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Bonsignori</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Williams</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Haynes</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Doria-Rose</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Mascola</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Montefiori</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Barouch</surname>, <given-names>D.</given-names></string-name> and <string-name><surname>Korber</surname>, <given-names>B.</given-names></string-name> (<year>2019</year>). <article-title>HIV-1 neutralizing antibody signatures and application to epitope-targeted vaccine design</article-title>. <source>Cell Host &amp; Microbe</source> <volume>25</volume>(<issue>1</issue>) <fpage>59</fpage>–<lpage>72</lpage>.</mixed-citation>
</ref>
<ref id="j_nejsds82_ref_005">
<label>[5]</label><mixed-citation publication-type="journal"><string-name><surname>Buiu</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Putz</surname>, <given-names>M. V.</given-names></string-name> and <string-name><surname>Avram</surname>, <given-names>S.</given-names></string-name> (<year>2016</year>). <article-title>Learning the relationship between the primary structure of HIV envelope glycoproteins and neutralization activity of particular antibodies by using artificial neural networks</article-title>. <source>International Journal of Molecular Sciences</source> <volume>17</volume>(<issue>10</issue>) <fpage>1710</fpage>.</mixed-citation>
</ref>
<ref id="j_nejsds82_ref_006">
<label>[6]</label><mixed-citation publication-type="journal"><string-name><surname>Carrell</surname>, <given-names>D. S.</given-names></string-name>, <string-name><surname>Gruber</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Floyd</surname>, <given-names>J. S.</given-names></string-name>, <string-name><surname>Bann</surname>, <given-names>M. A.</given-names></string-name>, <string-name><surname>Cushing-Haugen</surname>, <given-names>K. L.</given-names></string-name>, <string-name><surname>Johnson</surname>, <given-names>R. L.</given-names></string-name>, <string-name><surname>Graham</surname>, <given-names>V.</given-names></string-name>, <string-name><surname>Cronkite</surname>, <given-names>D. J.</given-names></string-name>, <string-name><surname>Hazlehurst</surname>, <given-names>B. L.</given-names></string-name>, <string-name><surname>Felcher</surname>, <given-names>A. H.</given-names></string-name>, <string-name><surname>Bejan</surname>, <given-names>C. A.</given-names></string-name>, <string-name><surname>Kennedy</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Shinde</surname>, <given-names>M. U.</given-names></string-name>, <string-name><surname>Karami</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Ma</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Stojanovic</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Zhao</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Ball</surname>, <given-names>R.</given-names></string-name> and <string-name><surname>Nelson</surname>, <given-names>J. C.</given-names></string-name> (<year>2023</year>). <article-title>Improving methods of identifying anaphylaxis for medical product safety surveillance using natural language processing and machine learning</article-title>. 
<source>American Journal of Epidemiology</source> <volume>192</volume>(<issue>2</issue>) <fpage>283</fpage>–<lpage>295</lpage>.</mixed-citation>
</ref>
<ref id="j_nejsds82_ref_007">
<label>[7]</label><mixed-citation publication-type="other"><string-name><surname>Chen</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>He</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Benesty</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Khotilovich</surname>, <given-names>V.</given-names></string-name>, <string-name><surname>Tang</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Cho</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Chen</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Mitchell</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Cano</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Zhou</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Li</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Xie</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Lin</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Geng</surname>, <given-names>Y.</given-names></string-name> and <string-name><surname>Li</surname>, <given-names>Y.</given-names></string-name> (2019). xgboost: Extreme Gradient Boosting. R package version 0.82.1. <uri>https://CRAN.R-project.org/package=xgboost</uri>.</mixed-citation>
</ref>
<ref id="j_nejsds82_ref_008">
<label>[8]</label><mixed-citation publication-type="journal"><string-name><surname>Conti</surname>, <given-names>S.</given-names></string-name> and <string-name><surname>Karplus</surname>, <given-names>M.</given-names></string-name> (<year>2019</year>). <article-title>Estimation of the breadth of CD4bs targeting HIV antibodies by molecular modeling and machine learning</article-title>. <source>PLoS Computational Biology</source> <volume>15</volume>(<issue>4</issue>) <fpage>1006954</fpage>.</mixed-citation>
</ref>
<ref id="j_nejsds82_ref_009">
<label>[9]</label><mixed-citation publication-type="journal"><string-name><surname>Corey</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Gilbert</surname>, <given-names>P. B.</given-names></string-name>, <string-name><surname>Juraska</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Montefiori</surname>, <given-names>D. C.</given-names></string-name>, <string-name><surname>Morris</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Karuna</surname>, <given-names>S. T.</given-names></string-name>, <string-name><surname>Edupuganti</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Mgodi</surname>, <given-names>N. M.</given-names></string-name>, <string-name><surname>DeCamp</surname>, <given-names>A. C.</given-names></string-name>, <string-name><surname>Rudnicki</surname>, <given-names>E.</given-names></string-name> <etal>et al.</etal> (<year>2021</year>). <article-title>Two randomized trials of neutralizing antibodies to prevent HIV-1 acquisition</article-title>. <source>New England Journal of Medicine</source> <volume>384</volume>(<issue>11</issue>) <fpage>1003</fpage>–<lpage>1014</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1056/NEJMoa2031738" xlink:type="simple">https://doi.org/10.1056/NEJMoa2031738</ext-link>.</mixed-citation>
</ref>
<ref id="j_nejsds82_ref_010">
<label>[10]</label><mixed-citation publication-type="other"><string-name><surname>Coyle</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Hejazi</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Malencia</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Phillips</surname>, <given-names>R.</given-names></string-name> and <string-name><surname>Sofrygin</surname>, <given-names>O.</given-names></string-name> (2023). sl3: Pipelines for machine learning and Super Learning. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.5281/zenodo.1342293" xlink:type="simple">https://doi.org/10.5281/zenodo.1342293</ext-link>. <uri>https://github.com/tlverse/sl3</uri>.</mixed-citation>
</ref>
<ref id="j_nejsds82_ref_011">
<label>[11]</label><mixed-citation publication-type="journal"><string-name><surname>Dănăilă</surname>, <given-names>V.-R.</given-names></string-name> and <string-name><surname>Buiu</surname>, <given-names>C.</given-names></string-name> (<year>2022</year>). <article-title>Prediction of HIV sensitivity to monoclonal antibodies using aminoacid sequences and deep learning</article-title>. <source>Bioinformatics</source> <volume>38</volume>(<issue>18</issue>) <fpage>4278</fpage>–<lpage>4285</lpage>.</mixed-citation>
</ref>
<ref id="j_nejsds82_ref_012">
<label>[12]</label><mixed-citation publication-type="journal"><string-name><surname>Friedman</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Hastie</surname>, <given-names>T.</given-names></string-name> and <string-name><surname>Tibshirani</surname>, <given-names>R.</given-names></string-name> (<year>2010</year>). <article-title>Regularization paths for generalized linear models via coordinate descent</article-title>. <source>Journal of Statistical Software</source> <volume>33</volume>(<issue>1</issue>) <fpage>1</fpage>–<lpage>22</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.18637/jss.v033.i01" xlink:type="simple">https://doi.org/10.18637/jss.v033.i01</ext-link>.</mixed-citation>
</ref>
<ref id="j_nejsds82_ref_013">
<label>[13]</label><mixed-citation publication-type="journal"><string-name><surname>Friedman</surname>, <given-names>J.</given-names></string-name> (<year>2001</year>). <article-title>Greedy function approximation: a gradient boosting machine</article-title>. <source>Annals of Statistics</source> <volume>29</volume>(<issue>5</issue>) <fpage>1189</fpage>–<lpage>1232</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1214/aos/1013203451" xlink:type="simple">https://doi.org/10.1214/aos/1013203451</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=1873328">MR1873328</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds82_ref_014">
<label>[14]</label><mixed-citation publication-type="journal"><string-name><surname>Hake</surname>, <given-names>A.</given-names></string-name> and <string-name><surname>Pfeifer</surname>, <given-names>N.</given-names></string-name> (<year>2017</year>). <article-title>Prediction of HIV-1 sensitivity to broadly neutralizing antibodies shows a trend towards resistance over time</article-title>. <source>PLoS Computational Biology</source> <volume>13</volume>(<issue>10</issue>) <fpage>1005789</fpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1371/journal.pcbi.1005789" xlink:type="simple">https://doi.org/10.1371/journal.pcbi.1005789</ext-link>.</mixed-citation>
</ref>
<ref id="j_nejsds82_ref_015">
<label>[15]</label><mixed-citation publication-type="journal"><string-name><surname>Hepler</surname>, <given-names>N. L.</given-names></string-name>, <string-name><surname>Scheffler</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Weaver</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Murrell</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Richman</surname>, <given-names>D. D.</given-names></string-name>, <string-name><surname>Burton</surname>, <given-names>D. R.</given-names></string-name>, <string-name><surname>Poignard</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Smith</surname>, <given-names>D. M.</given-names></string-name> and <string-name><surname>Kosakovsky Pond</surname>, <given-names>S. L.</given-names></string-name> (<year>2014</year>). <article-title>IDEPI: rapid prediction of HIV-1 antibody epitopes and other phenotypic features from sequence data using a flexible machine learning platform</article-title>. <source>PLoS Computational Biology</source> <volume>10</volume>(<issue>9</issue>) <fpage>1003842</fpage>.</mixed-citation>
</ref>
<ref id="j_nejsds82_ref_016">
<label>[16]</label><mixed-citation publication-type="book"><string-name><surname>Kohavi</surname>, <given-names>R.</given-names></string-name> (<year>1996</year>) <source>Wrappers for Performance Enhancement and Oblivious Decision Graphs</source>. <publisher-name>Stanford University ProQuest Dissertations Publishing</publisher-name>.</mixed-citation>
</ref>
<ref id="j_nejsds82_ref_017">
<label>[17]</label><mixed-citation publication-type="journal"><string-name><surname>Leng</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Lin</surname>, <given-names>Y.</given-names></string-name> and <string-name><surname>Wahba</surname>, <given-names>G.</given-names></string-name> (<year>2006</year>). <article-title>A note on the lasso and related procedures in model selection</article-title>. <source>Statistica Sinica</source> <volume>16</volume> <fpage>1273</fpage>–<lpage>1284</lpage>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=2327490">MR2327490</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds82_ref_018">
<label>[18]</label><mixed-citation publication-type="journal"><string-name><surname>Magaret</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Benkeser</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Williamson</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Borate</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Carpp</surname>, <given-names>L.</given-names></string-name> <etal>et al.</etal> (<year>2019</year>). <article-title>Prediction of VRC01 neutralization sensitivity by HIV-1 gp160 sequence features</article-title>. <source>PLoS Computational Biology</source> <volume>15</volume>(<issue>4</issue>) <fpage>1006952</fpage>.</mixed-citation>
</ref>
<ref id="j_nejsds82_ref_019">
<label>[19]</label><mixed-citation publication-type="other"><string-name><surname>Milborrow</surname>, <given-names>S.</given-names></string-name> (2021). earth: Multivariate Adaptive Regression Splines. R package version 5.3.1. <uri>https://CRAN.R-project.org/package=earth</uri>.</mixed-citation>
</ref>
<ref id="j_nejsds82_ref_020">
<label>[20]</label><mixed-citation publication-type="journal"><string-name><surname>Nelder</surname>, <given-names>J.</given-names></string-name> and <string-name><surname>Wedderburn</surname>, <given-names>R.</given-names></string-name> (<year>1972</year>). <article-title>Generalized linear models</article-title>. <source>Journal of the Royal Statistical Society, Series A</source> <volume>135</volume>(<issue>3</issue>) <fpage>370</fpage>–<lpage>384</lpage>.</mixed-citation>
</ref>
<ref id="j_nejsds82_ref_021">
<label>[21]</label><mixed-citation publication-type="journal"><string-name><surname>Petersen</surname>, <given-names>M. L.</given-names></string-name>, <string-name><surname>LeDell</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Schwab</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Sarovar</surname>, <given-names>V.</given-names></string-name>, <string-name><surname>Gross</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Reynolds</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Haberer</surname>, <given-names>J. E.</given-names></string-name>, <string-name><surname>Goggin</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Golin</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Arnsten</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Rosen</surname>, <given-names>M. I.</given-names></string-name>, <string-name><surname>Remien</surname>, <given-names>R. H.</given-names></string-name>, <string-name><surname>Etoori</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Wilson</surname>, <given-names>I. B.</given-names></string-name>, <string-name><surname>Simoni</surname>, <given-names>J. M.</given-names></string-name>, <string-name><surname>Erlen</surname>, <given-names>J. A.</given-names></string-name>, <string-name><surname>van der Laan</surname>, <given-names>M. J.</given-names></string-name>, <string-name><surname>Liu</surname>, <given-names>H.</given-names></string-name> and <string-name><surname>Bangsberg</surname>, <given-names>D. R.</given-names></string-name> (<year>2015</year>). <article-title>Super learner analysis of electronic adherence data improves viral prediction and may provide strategies for selective HIV RNA monitoring</article-title>. 
<source>JAIDS Journal of Acquired Immune Deficiency Syndromes</source> <volume>69</volume>(<issue>1</issue>) <fpage>109</fpage>–<lpage>118</lpage>.</mixed-citation>
</ref>
<ref id="j_nejsds82_ref_022">
<label>[22]</label><mixed-citation publication-type="journal"><string-name><surname>Phillips</surname>, <given-names>R. V.</given-names></string-name>, <string-name><surname>van der Laan</surname>, <given-names>M. J.</given-names></string-name>, <string-name><surname>Lee</surname>, <given-names>H.</given-names></string-name> and <string-name><surname>Gruber</surname>, <given-names>S.</given-names></string-name> (<year>2023</year>). <article-title>Practical considerations for specifying a super learner</article-title>. <source>International Journal of Epidemiology</source> <volume>52</volume>(<issue>4</issue>) <fpage>1276</fpage>–<lpage>1285</lpage>.</mixed-citation>
</ref>
<ref id="j_nejsds82_ref_023">
<label>[23]</label><mixed-citation publication-type="journal"><string-name><surname>Pirracchio</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Petersen</surname>, <given-names>M. L.</given-names></string-name>, <string-name><surname>Carone</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Rigon</surname>, <given-names>M. R.</given-names></string-name>, <string-name><surname>Chevret</surname>, <given-names>S.</given-names></string-name> and <string-name><surname>van der Laan</surname>, <given-names>M. J.</given-names></string-name> (<year>2015</year>). <article-title>Mortality prediction in intensive care units with the Super ICU Learner Algorithm (SICULA): a population-based study</article-title>. <source>The Lancet Respiratory Medicine</source> <volume>3</volume>(<issue>1</issue>) <fpage>42</fpage>–<lpage>52</lpage>.</mixed-citation>
</ref>
<ref id="j_nejsds82_ref_024">
<label>[24]</label><mixed-citation publication-type="other"><string-name><surname>Polley</surname>, <given-names>E. C.</given-names></string-name> and <string-name><surname>van der Laan</surname>, <given-names>M. J.</given-names></string-name> (<year>2010</year>). <source>Super Learner in Prediction</source>.</mixed-citation>
</ref>
<ref id="j_nejsds82_ref_025">
<label>[25]</label><mixed-citation publication-type="other"><string-name><surname>Polley</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>LeDell</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Kennedy</surname>, <given-names>C.</given-names></string-name> and <string-name><surname>van der Laan</surname>, <given-names>M.</given-names></string-name> (<year>2021</year>). <source>SuperLearner: Super Learner Prediction</source>. R package version 2.0-28. <uri>https://CRAN.R-project.org/package=SuperLearner</uri>.</mixed-citation>
</ref>
<ref id="j_nejsds82_ref_026">
<label>[26]</label><mixed-citation publication-type="journal"><string-name><surname>Rawi</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Mall</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Shen</surname>, <given-names>C.-H.</given-names></string-name>, <string-name><surname>Farney</surname>, <given-names>S. K.</given-names></string-name>, <string-name><surname>Shiakolas</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Zhou</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Bensmail</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Chun</surname>, <given-names>T.-W.</given-names></string-name>, <string-name><surname>Doria-Rose</surname>, <given-names>N. A.</given-names></string-name>, <string-name><surname>Lynch</surname>, <given-names>R. M.</given-names></string-name>, <string-name><surname>Mascola</surname>, <given-names>J. R.</given-names></string-name>, <string-name><surname>Kwong</surname>, <given-names>P. D.</given-names></string-name> and <string-name><surname>Chuang</surname>, <given-names>G.-Y.</given-names></string-name> (<year>2019</year>). <article-title>Accurate prediction for antibody resistance of clinical HIV-1 isolates</article-title>. <source>Scientific Reports</source> <volume>9</volume>(<issue>1</issue>) <fpage>14696</fpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1038/s41598-019-50635-w" xlink:type="simple">https://doi.org/10.1038/s41598-019-50635-w</ext-link>.</mixed-citation>
</ref>
<ref id="j_nejsds82_ref_027">
<label>[27]</label><mixed-citation publication-type="journal"><string-name><surname>Tibshirani</surname>, <given-names>R.</given-names></string-name> (<year>1996</year>). <article-title>Regression shrinkage and selection via the lasso</article-title>. <source>Journal of the Royal Statistical Society: Series B (Methodological)</source> <volume>58</volume>(<issue>1</issue>) <fpage>267</fpage>–<lpage>288</lpage>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=1379242">MR1379242</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds82_ref_028">
<label>[28]</label><mixed-citation publication-type="book"><string-name><surname>van der Laan</surname>, <given-names>M.</given-names></string-name> and <string-name><surname>Rose</surname>, <given-names>S.</given-names></string-name> (<year>2011</year>) <source>Targeted Learning: Causal Inference for Observational and Experimental Data</source>. <publisher-name>Springer Science &amp; Business Media</publisher-name>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1007/978-1-4419-9782-1" xlink:type="simple">https://doi.org/10.1007/978-1-4419-9782-1</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=2867111">MR2867111</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds82_ref_029">
<label>[29]</label><mixed-citation publication-type="journal"><string-name><surname>van der Laan</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Polley</surname>, <given-names>E.</given-names></string-name> and <string-name><surname>Hubbard</surname>, <given-names>A.</given-names></string-name> (<year>2007</year>). <article-title>Super learner</article-title>. <source>Statistical Applications in Genetics and Molecular Biology</source> <volume>6</volume>(<issue>1</issue>) <fpage>25</fpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.2202/1544-6115.1309" xlink:type="simple">https://doi.org/10.2202/1544-6115.1309</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=2349918">MR2349918</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds82_ref_030">
<label>[30]</label><mixed-citation publication-type="journal"><string-name><surname>Williamson</surname>, <given-names>B. D.</given-names></string-name>, <string-name><surname>Magaret</surname>, <given-names>C. A.</given-names></string-name>, <string-name><surname>Gilbert</surname>, <given-names>P. B.</given-names></string-name>, <string-name><surname>Nizam</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Simmons</surname>, <given-names>C.</given-names></string-name> and <string-name><surname>Benkeser</surname>, <given-names>D.</given-names></string-name> (<year>2021</year>). <article-title>Super LeArner Prediction of NAb Panels (SLAPNAP): a containerized tool for predicting combination monoclonal broadly neutralizing antibody sensitivity</article-title>. <source>Bioinformatics</source> <volume>37</volume>(<issue>22</issue>) <fpage>4187</fpage>–<lpage>4192</lpage>.</mixed-citation>
</ref>
<ref id="j_nejsds82_ref_031">
<label>[31]</label><mixed-citation publication-type="journal"><string-name><surname>Williamson</surname>, <given-names>B. D.</given-names></string-name>, <string-name><surname>Magaret</surname>, <given-names>C. A.</given-names></string-name>, <string-name><surname>Karuna</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Carpp</surname>, <given-names>L. N.</given-names></string-name>, <string-name><surname>Gelderblom</surname>, <given-names>H. C.</given-names></string-name>, <string-name><surname>Huang</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Benkeser</surname>, <given-names>D.</given-names></string-name> and <string-name><surname>Gilbert</surname>, <given-names>P. B.</given-names></string-name> (<year>2023</year>). <article-title>Application of the SLAPNAP statistical learning tool to broadly neutralizing antibody HIV prevention research</article-title>. <source>iScience</source> <volume>26</volume>(<issue>9</issue>).</mixed-citation>
</ref>
<ref id="j_nejsds82_ref_032">
<label>[32]</label><mixed-citation publication-type="journal"><string-name><surname>Wolpert</surname>, <given-names>D.</given-names></string-name> (<year>1992</year>). <article-title>Stacked generalization</article-title>. <source>Neural Networks</source> <volume>5</volume>(<issue>2</issue>) <fpage>241</fpage>–<lpage>259</lpage>.</mixed-citation>
</ref>
<ref id="j_nejsds82_ref_033">
<label>[33]</label><mixed-citation publication-type="journal"><string-name><surname>Wright</surname>, <given-names>M. N.</given-names></string-name> and <string-name><surname>Ziegler</surname>, <given-names>A.</given-names></string-name> (<year>2017</year>). <article-title>ranger: a fast implementation of random forests for high dimensional data in C++ and R</article-title>. <source>Journal of Statistical Software</source> <volume>77</volume>(<issue>1</issue>) <fpage>1</fpage>–<lpage>17</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.18637/jss.v077.i01" xlink:type="simple">https://doi.org/10.18637/jss.v077.i01</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=4583337">MR4583337</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds82_ref_034">
<label>[34]</label><mixed-citation publication-type="journal"><string-name><surname>Yu</surname>, <given-names>W.-H.</given-names></string-name>, <string-name><surname>Su</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Torabi</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Fennessey</surname>, <given-names>C. M.</given-names></string-name>, <string-name><surname>Shiakolas</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Lynch</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Chun</surname>, <given-names>T.-W.</given-names></string-name>, <string-name><surname>Doria-Rose</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Alter</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Seaman</surname>, <given-names>M. S.</given-names></string-name> et al. (<year>2019</year>). <article-title>Predicting the broadly neutralizing antibody susceptibility of the HIV reservoir</article-title>. <source>JCI Insight</source> <volume>4</volume>(<issue>17</issue>).</mixed-citation>
</ref>
</ref-list>
</back>
</article>
