<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">NEJSDS</journal-id>
<journal-title-group><journal-title>The New England Journal of Statistics in Data Science</journal-title></journal-title-group>
<issn pub-type="ppub">2693-7166</issn><issn-l>2693-7166</issn-l>
<publisher>
<publisher-name>New England Statistical Society</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">NEJSDS36</article-id>
<article-id pub-id-type="doi">10.51387/23-NEJSDS36</article-id>
<article-categories>
<subj-group subj-group-type="heading"><subject>Methodology Article</subject></subj-group>
<subj-group subj-group-type="area"><subject>Statistical Methodology</subject></subj-group>
</article-categories>
<title-group>
<article-title>Subdata Selection With a Large Number of Variables</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Singh</surname><given-names>Rakhi</given-names></name><email xlink:href="mailto:rsingh@binghamton.edu">rsingh@binghamton.edu</email><xref ref-type="aff" rid="j_nejsds36_aff_001"/><xref ref-type="corresp" rid="cor1">∗</xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Stufken</surname><given-names>John</given-names></name><email xlink:href="mailto:jstufken@gmu.edu">jstufken@gmu.edu</email><xref ref-type="aff" rid="j_nejsds36_aff_002"/>
</contrib>
<aff id="j_nejsds36_aff_001">Department of Mathematics and Statistics, <institution>Binghamton University</institution>, <country>USA</country>. E-mail address: <email xlink:href="mailto:rsingh@binghamton.edu">rsingh@binghamton.edu</email></aff>
<aff id="j_nejsds36_aff_002">Department of Statistics, <institution>George Mason University</institution>, <country>USA</country>. E-mail address: <email xlink:href="mailto:jstufken@gmu.edu">jstufken@gmu.edu</email></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>∗</label>Corresponding author.</corresp>
</author-notes>
<pub-date pub-type="ppub"><year>2023</year></pub-date><pub-date pub-type="epub"><day>15</day><month>6</month><year>2023</year></pub-date><volume>1</volume><issue>3</issue><fpage>426</fpage><lpage>438</lpage><supplementary-material id="S1" content-type="document" xlink:href="nejsds36_s001.pdf" mimetype="application" mime-subtype="pdf">
<caption>
<title>Supplementary Material</title>
<p>The Supplementary Material is available online and contains more performance results corresponding to the cases in Table <xref rid="j_nejsds36_tab_001">1</xref>.</p>
</caption>
</supplementary-material><history><date date-type="accepted"><day>18</day><month>5</month><year>2023</year></date></history>
<permissions><copyright-statement>© 2023 New England Statistical Society</copyright-statement><copyright-year>2023</copyright-year>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>Open access article under the <ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">CC BY</ext-link> license.</license-p></license></permissions>
<abstract>
<p>Subdata selection from big data is an active area of research that facilitates inferences based on big data with limited computational expense. For linear regression models, the optimal design-inspired Information-Based Optimal Subdata Selection (IBOSS) method is a computationally efficient method for selecting subdata that has excellent statistical properties. However, the method can only be used if the subdata size, <italic>k</italic>, is at least twice the number of regression variables, <italic>p</italic>. In addition, even when <inline-formula id="j_nejsds36_ineq_001"><alternatives><mml:math>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mo stretchy="false">≥</mml:mo>
<mml:mn>2</mml:mn>
<mml:mi mathvariant="italic">p</mml:mi></mml:math><tex-math><![CDATA[$k\ge 2p$]]></tex-math></alternatives></inline-formula>, under the assumption of effect sparsity, one can expect to obtain subdata with better statistical properties by trying to focus on active variables. Inspired by recent efforts to extend the IBOSS method to situations with a large number of variables <italic>p</italic>, we introduce a method called Combining Lasso And Subdata Selection (CLASS) that, as shown, improves on other proposed methods in terms of variable selection and building a predictive model based on subdata when the full data size <italic>n</italic> is very large and the number of variables <italic>p</italic> is large. In terms of computational expense, CLASS is more expensive than recent competitors for moderately large values of <italic>n</italic>, but the roles reverse under effect sparsity for extremely large values of <italic>n</italic>.</p>
</abstract>
<kwd-group>
<label>Keywords and phrases</label>
<kwd>Effect Sparsity</kwd>
<kwd>Optimal Design</kwd>
<kwd>Prediction</kwd>
<kwd>Subsampling</kwd>
<kwd>Variable Selection</kwd>
</kwd-group>
<funding-group><award-group><funding-source xlink:href="https://doi.org/10.13039/100000001">NSF</funding-source><award-id>DMS-1935729</award-id><award-id>DMS-2304767</award-id></award-group><funding-statement>JS gratefully acknowledges support through NSF grants DMS-1935729 and DMS-2304767. </funding-statement></funding-group>
</article-meta>
</front>
<body>
<sec id="j_nejsds36_s_001">
<label>1</label>
<title>Introduction</title>
<p>Unprecedented advancements in modern information technologies have resulted in an exponential growth of data and massive datasets. Data sizes are now measured in terabytes (TB) or petabytes (PB) and not in mere megabytes (MB) or gigabytes (GB). Big data facilitates and incentivizes data-driven decisions in almost every area of science, industry, and government. Given the challenges that big data presents due to its volume, variety, and complexity, extracting high-quality information from big data is a prerequisite for understanding the data meaningfully [<xref ref-type="bibr" rid="j_nejsds36_ref_005">5</xref>].</p>
<p>Some statistical methods for analyzing big data include the bag of little bootstraps [<xref ref-type="bibr" rid="j_nejsds36_ref_021">21</xref>], divide-and-conquer [<xref ref-type="bibr" rid="j_nejsds36_ref_022">22</xref>, <xref ref-type="bibr" rid="j_nejsds36_ref_006">6</xref>, <xref ref-type="bibr" rid="j_nejsds36_ref_034">34</xref>, <xref ref-type="bibr" rid="j_nejsds36_ref_031">31</xref>, for example] and sequential updating for streaming data [<xref ref-type="bibr" rid="j_nejsds36_ref_032">32</xref>, <xref ref-type="bibr" rid="j_nejsds36_ref_045">45</xref>, for example]. In divide-and-conquer approaches, statistical analyses are performed on multiple parts of the data, and then these results are combined to form overall conclusions. In sequential updating, because the data arrive in streams or large chunks, analysis methods are developed that update estimates sequentially without storing the full data. Interested readers are directed to [<xref ref-type="bibr" rid="j_nejsds36_ref_038">38</xref>] for a comprehensive review of these approaches. A subsampling-based alternative is to work with a carefully selected small representative sample (called <italic>subdata</italic>) of size <italic>k</italic> from the big data of size <italic>n</italic> (called the <italic>full data</italic>). The sample size <italic>k</italic> should be chosen so that appropriate statistical tools and methods can be applied to the subdata with sufficiently reduced computational complexity. Methods to identify such subdata are called <italic>subdata selection methods</italic>.</p>
<p>The current literature on subdata selection is rapidly growing. Much of the relevant literature focuses on identifying subdata that yields precise estimates of parameters in a given statistical model, for example, for linear regression [see, <xref ref-type="bibr" rid="j_nejsds36_ref_010">10</xref>, <xref ref-type="bibr" rid="j_nejsds36_ref_023">23</xref>, <xref ref-type="bibr" rid="j_nejsds36_ref_008">8</xref>, <xref ref-type="bibr" rid="j_nejsds36_ref_036">36</xref>, <xref ref-type="bibr" rid="j_nejsds36_ref_041">41</xref>], logistic regression [<xref ref-type="bibr" rid="j_nejsds36_ref_042">42</xref>, <xref ref-type="bibr" rid="j_nejsds36_ref_039">39</xref>, <xref ref-type="bibr" rid="j_nejsds36_ref_007">7</xref>], multinomial logistic regression [<xref ref-type="bibr" rid="j_nejsds36_ref_046">46</xref>], generalized linear models [<xref ref-type="bibr" rid="j_nejsds36_ref_013">13</xref>, <xref ref-type="bibr" rid="j_nejsds36_ref_036">36</xref>, <xref ref-type="bibr" rid="j_nejsds36_ref_001">1</xref>, <xref ref-type="bibr" rid="j_nejsds36_ref_050">50</xref>, <xref ref-type="bibr" rid="j_nejsds36_ref_053">53</xref>, <xref ref-type="bibr" rid="j_nejsds36_ref_016">16</xref>], quantile regression [<xref ref-type="bibr" rid="j_nejsds36_ref_040">40</xref>, <xref ref-type="bibr" rid="j_nejsds36_ref_002">2</xref>, <xref ref-type="bibr" rid="j_nejsds36_ref_012">12</xref>, <xref ref-type="bibr" rid="j_nejsds36_ref_033">33</xref>], and quasi-likelihood [<xref ref-type="bibr" rid="j_nejsds36_ref_050">50</xref>]. All of these methods assume a true underlying model. Methods that allow for the misspecification of a linear model [<xref ref-type="bibr" rid="j_nejsds36_ref_029">29</xref>], a non-parametric regression model [<xref ref-type="bibr" rid="j_nejsds36_ref_030">30</xref>], a distributed computing environment [<xref ref-type="bibr" rid="j_nejsds36_ref_052">52</xref>], and model selection [<xref ref-type="bibr" rid="j_nejsds36_ref_049">49</xref>] also exist. 
Typically, model-based methods aid in estimating model parameters and, as a byproduct, in prediction for new test data.</p>
<p>In addition, model-free subdata selection methods also exist. For example, one could mirror the population distribution in the subdata [<xref ref-type="bibr" rid="j_nejsds36_ref_026">26</xref>, <xref ref-type="bibr" rid="j_nejsds36_ref_019">19</xref>, <xref ref-type="bibr" rid="j_nejsds36_ref_037">37</xref>], or compress the full data in a small set for prediction [<xref ref-type="bibr" rid="j_nejsds36_ref_018">18</xref>]. Selective reviews of subdata selection methods are also provided by [<xref ref-type="bibr" rid="j_nejsds36_ref_048">48</xref>] and [<xref ref-type="bibr" rid="j_nejsds36_ref_047">47</xref>].</p>
<p>While subdata selection methods focus on data reduction by drastically reducing the number of observations <italic>n</italic>, they tend to become computationally intensive or statistically inefficient when the number of variables <italic>p</italic> is moderate to large. In this paper, we consider the situation when <inline-formula id="j_nejsds36_ineq_002"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo stretchy="false">≫</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi></mml:math><tex-math><![CDATA[$n\gg p$]]></tex-math></alternatives></inline-formula>, but <italic>p</italic> is moderate to large (in the thousands). We assume that the response can be modeled using a linear model and use the Information-Based Optimal Subdata Selection (IBOSS) method [<xref ref-type="bibr" rid="j_nejsds36_ref_041">41</xref>] in conjunction with LASSO [<xref ref-type="bibr" rid="j_nejsds36_ref_035">35</xref>] to combine variable selection and subdata selection.</p>
<p>In Section <xref rid="j_nejsds36_s_002">2</xref>, we provide a brief background for the current analysis methods and subdata selection methods. Section <xref rid="j_nejsds36_s_009">3</xref> describes our method, explores its analytical properties, and discusses its advantages. Section <xref rid="j_nejsds36_s_014">4</xref> compares the proposed method with competing methods on simulated and real data. Finally, we provide some concluding remarks in Section <xref rid="j_nejsds36_s_016">5</xref>.</p>
</sec>
<sec id="j_nejsds36_s_002">
<label>2</label>
<title>Background</title>
<sec id="j_nejsds36_s_003">
<label>2.1</label>
<title>The Model and Analysis Methods</title>
<sec id="j_nejsds36_s_004">
<label>2.1.1</label>
<title>The Model</title>
<p>Let <inline-formula id="j_nejsds36_ineq_003"><alternatives><mml:math>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="bold">X</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="bold">Y</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$(\mathbf{X},\mathbf{Y})$]]></tex-math></alternatives></inline-formula> be the full data, where <bold>X</bold> is a <inline-formula id="j_nejsds36_ineq_004"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo>×</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi></mml:math><tex-math><![CDATA[$n\times p$]]></tex-math></alternatives></inline-formula> matrix with <italic>n</italic> observations and <italic>p</italic> (independent) variables or features and <bold>Y</bold> is the corresponding <inline-formula id="j_nejsds36_ineq_005"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo>×</mml:mo>
<mml:mn>1</mml:mn></mml:math><tex-math><![CDATA[$n\times 1$]]></tex-math></alternatives></inline-formula> response vector. The linear regression model is 
<disp-formula id="j_nejsds36_eq_001">
<label>(2.1)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold-italic">β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">ϵ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">∑</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">ϵ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ {y_{i}}={\beta _{0}}+{\mathbf{x}_{i}^{T}}{\boldsymbol{\beta }_{1}}+{\epsilon _{i}}={\beta _{0}}+{\sum \limits_{j=1}^{p}}{\beta _{j}}{x_{ij}}+{\epsilon _{i}},i=1,\dots ,n,\]]]></tex-math></alternatives>
</disp-formula> 
where <inline-formula id="j_nejsds36_ineq_006"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\beta _{0}}$]]></tex-math></alternatives></inline-formula> is the intercept parameter, <inline-formula id="j_nejsds36_ineq_007"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold-italic">β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${\boldsymbol{\beta }_{1}}={({\beta _{1}},\dots ,{\beta _{p}})^{T}}$]]></tex-math></alternatives></inline-formula> is the vector of slope parameters, and <inline-formula id="j_nejsds36_ineq_008"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${\mathbf{x}_{i}}={({x_{i1}},\dots ,{x_{ip}})^{T}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_009"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${y_{i}}$]]></tex-math></alternatives></inline-formula>, and <inline-formula id="j_nejsds36_ineq_010"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">ϵ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\epsilon _{i}}$]]></tex-math></alternatives></inline-formula> are the vector of variable values, response, and error for the <italic>i</italic>th observation, respectively. Further, we write <inline-formula id="j_nejsds36_ineq_011"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${\mathbf{z}_{i}}={(1,{\mathbf{x}_{i}^{T}})^{T}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_012"><alternatives><mml:math>
<mml:mi mathvariant="bold-italic">β</mml:mi>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold-italic">β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[$\boldsymbol{\beta }={({\beta _{0}},{\boldsymbol{\beta }_{1}^{T}})^{T}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_013"><alternatives><mml:math>
<mml:mi mathvariant="bold">X</mml:mi>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[$\mathbf{X}={({\mathbf{x}_{1}},\dots ,{\mathbf{x}_{n}})^{T}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_014"><alternatives><mml:math>
<mml:mi mathvariant="bold">Z</mml:mi>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[$\mathbf{Z}={({\mathbf{z}_{1}},\dots ,{\mathbf{z}_{n}})^{T}}$]]></tex-math></alternatives></inline-formula>, and <inline-formula id="j_nejsds36_ineq_015"><alternatives><mml:math>
<mml:mi mathvariant="bold">Y</mml:mi>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[$\mathbf{Y}={({y_{1}},\dots ,{y_{n}})^{T}}$]]></tex-math></alternatives></inline-formula>. We assume that the <inline-formula id="j_nejsds36_ineq_016"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">ϵ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\epsilon _{i}}$]]></tex-math></alternatives></inline-formula>’s are independent and identically distributed with mean 0 and variance <inline-formula id="j_nejsds36_ineq_017"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">σ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${\sigma ^{2}}$]]></tex-math></alternatives></inline-formula>.</p>
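The model above can be sketched numerically. The snippet below is a minimal illustration, not part of the paper: it generates data from model (2.1) under effect sparsity, with illustrative choices of <italic>n</italic>, <italic>p</italic>, the number of active variables, and the coefficient values.

```python
import numpy as np

# Illustrative sketch of model (2.1) under effect sparsity: only a few
# of the p slope parameters are nonzero. All numeric choices are ours.
rng = np.random.default_rng(0)

n, p = 10_000, 50          # full-data size n and number of variables p
n_active = 5               # number of "active" variables

beta0 = 1.0                # intercept beta_0
beta1 = np.zeros(p)        # slope vector beta_1
beta1[:n_active] = 2.0     # active slopes (illustrative values)

X = rng.standard_normal((n, p))        # n x p variable matrix X
eps = rng.normal(0.0, 1.0, size=n)     # i.i.d. errors, mean 0, variance sigma^2 = 1
Y = beta0 + X @ beta1 + eps            # response vector Y

# Z prepends a column of ones, so rows are z_i = (1, x_i^T)^T
Z = np.column_stack([np.ones(n), X])
```

Each row of <monospace>Z</monospace> corresponds to a vector <inline-formula><tex-math><![CDATA[${\mathbf{z}_{i}}$]]></tex-math></inline-formula> as defined above.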
</sec>
<sec id="j_nejsds36_s_005">
<label>2.1.2</label>
<title>The Ordinary Least Squares (OLS) Estimator</title>
<p>When using the full data and model (<xref rid="j_nejsds36_eq_001">2.1</xref>), the least-squares estimator of <inline-formula id="j_nejsds36_ineq_018"><alternatives><mml:math>
<mml:mi mathvariant="bold-italic">β</mml:mi></mml:math><tex-math><![CDATA[$\boldsymbol{\beta }$]]></tex-math></alternatives></inline-formula>, which is also its best linear unbiased estimator, is <inline-formula id="j_nejsds36_ineq_019"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold-italic">β</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">ˆ</mml:mo></mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mi mathvariant="bold">Z</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mi mathvariant="bold">Y</mml:mi></mml:math><tex-math><![CDATA[${\hat{\boldsymbol{\beta }}_{f}}={({\mathbf{Z}^{T}}\mathbf{Z})^{-1}}{\mathbf{Z}^{T}}\mathbf{Y}$]]></tex-math></alternatives></inline-formula>. The covariance matrix of this unbiased estimator is equal to the inverse of the Fisher information matrix <inline-formula id="j_nejsds36_ineq_020"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">I</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\mathbf{I}_{f}}$]]></tex-math></alternatives></inline-formula> for <inline-formula id="j_nejsds36_ineq_021"><alternatives><mml:math>
<mml:mi mathvariant="bold-italic">β</mml:mi></mml:math><tex-math><![CDATA[$\boldsymbol{\beta }$]]></tex-math></alternatives></inline-formula> from the full data 
<disp-formula id="j_nejsds36_eq_002">
<label>(2.2)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">I</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">σ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mi mathvariant="bold">Z</mml:mi>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ {\mathbf{I}_{f}}=\frac{1}{{\sigma ^{2}}}{\mathbf{Z}^{T}}\mathbf{Z}.\]]]></tex-math></alternatives>
</disp-formula>
</p>
<p>Subdata of size <italic>k</italic> can be represented by a <inline-formula id="j_nejsds36_ineq_022"><alternatives><mml:math>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mo>×</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$k\times (p+1)$]]></tex-math></alternatives></inline-formula> matrix <inline-formula id="j_nejsds36_ineq_023"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\mathbf{Z}_{s}}$]]></tex-math></alternatives></inline-formula>, which consists of <italic>k</italic> rows of the full data matrix <bold>Z</bold>, and the corresponding vector of responses <inline-formula id="j_nejsds36_ineq_024"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">Y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\mathbf{Y}_{s}}$]]></tex-math></alternatives></inline-formula>. The OLS estimator based on the subdata is <inline-formula id="j_nejsds36_ineq_025"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold-italic">β</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">ˆ</mml:mo></mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">Y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\hat{\boldsymbol{\beta }}_{s}}={({\mathbf{Z}_{s}^{T}}{\mathbf{Z}_{s}})^{-1}}{\mathbf{Z}_{s}^{T}}{\mathbf{Y}_{s}}$]]></tex-math></alternatives></inline-formula>.</p>
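As a sketch of the two estimators, the following illustration computes the full-data OLS estimator and the subdata OLS estimator on simulated data. The uniform-random row selection here is only a placeholder; it is not the subdata selection method studied in this paper.

```python
import numpy as np

# Sketch: full-data OLS beta_hat_f = (Z^T Z)^{-1} Z^T Y versus
# subdata OLS beta_hat_s on k rows. Numeric choices are illustrative.
rng = np.random.default_rng(1)

n, p, k = 5_000, 10, 100
beta = np.concatenate([[1.0], rng.normal(size=p)])   # (beta_0, beta_1^T)^T

X = rng.standard_normal((n, p))
Z = np.column_stack([np.ones(n), X])
Y = Z @ beta + rng.normal(size=n)

# Full-data OLS; lstsq is numerically preferable to forming (Z^T Z)^{-1}
beta_f, *_ = np.linalg.lstsq(Z, Y, rcond=None)

# Subdata OLS: k rows of Z (giving Z_s) and the corresponding responses Y_s
idx = rng.choice(n, size=k, replace=False)
Z_s, Y_s = Z[idx], Y[idx]
beta_s, *_ = np.linalg.lstsq(Z_s, Y_s, rcond=None)
```

Both estimators have <inline-formula><tex-math><![CDATA[$p+1$]]></tex-math></inline-formula> components; the subdata estimator trades statistical precision for a computation on <italic>k</italic> rather than <italic>n</italic> rows.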
</sec>
<sec id="j_nejsds36_s_006">
<label>2.1.3</label>
<title>The LASSO Estimator</title>
<p>For high-dimensional data (i.e., large <italic>p</italic>), it is common to assume that only a few variables affect the response (<italic>sparsity</italic>). We will refer to these variables as “active” variables. Penalized regression methods are used to analyze data in such situations. One well-known method is the LASSO, which uses <inline-formula id="j_nejsds36_ineq_026"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi>ℓ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\ell _{1}}$]]></tex-math></alternatives></inline-formula>-norm regularization. The LASSO estimator <inline-formula id="j_nejsds36_ineq_027"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold-italic">β</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">ˆ</mml:mo></mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mi mathvariant="italic">A</mml:mi>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mi mathvariant="italic">O</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\hat{\boldsymbol{\beta }}_{LASSO}}$]]></tex-math></alternatives></inline-formula> is a solution to the following optimization problem [<xref ref-type="bibr" rid="j_nejsds36_ref_035">35</xref>] 
<disp-formula id="j_nejsds36_eq_003">
<label>(2.3)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mrow>
<mml:mtext>argmin</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="bold-italic">β</mml:mi>
<mml:mo stretchy="false">∈</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="script">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:msub><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">∑</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msup>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>−</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mi mathvariant="bold-italic">β</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">λ</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="bold-italic">β</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mo stretchy="false">|</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ {\text{argmin}_{\boldsymbol{\beta }\in {\mathcal{R}^{p+1}}}}\frac{1}{n}{\sum \limits_{i=1}^{n}}{({y_{i}}-{\mathbf{z}_{i}^{T}}\boldsymbol{\beta })^{2}}+\lambda ||\boldsymbol{\beta }|{|_{1}},\]]]></tex-math></alternatives>
</disp-formula> 
where <inline-formula id="j_nejsds36_ineq_028"><alternatives><mml:math>
<mml:mo stretchy="false">|</mml:mo>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="bold-italic">β</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mo stretchy="false">|</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mo stretchy="false">|</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mo>+</mml:mo>
<mml:mo stretchy="false">|</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mo>+</mml:mo>
<mml:mo stretchy="false">⋯</mml:mo>
<mml:mo>+</mml:mo>
<mml:mo stretchy="false">|</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">|</mml:mo></mml:math><tex-math><![CDATA[$||\boldsymbol{\beta }|{|_{1}}=|{\beta _{0}}|+|{\beta _{1}}|+\cdots +|{\beta _{p}}|$]]></tex-math></alternatives></inline-formula> is the <inline-formula id="j_nejsds36_ineq_029"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi>ℓ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\ell _{1}}$]]></tex-math></alternatives></inline-formula>-norm of <inline-formula id="j_nejsds36_ineq_030"><alternatives><mml:math>
<mml:mi mathvariant="bold-italic">β</mml:mi></mml:math><tex-math><![CDATA[$\boldsymbol{\beta }$]]></tex-math></alternatives></inline-formula>, and <italic>λ</italic> is the regularization parameter. If the tuning parameter <italic>λ</italic> goes to 0 slower than <inline-formula id="j_nejsds36_ineq_031"><alternatives><mml:math>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:msqrt>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
</mml:msqrt></mml:math><tex-math><![CDATA[$1/\sqrt{n}$]]></tex-math></alternatives></inline-formula>, then, provided that <inline-formula id="j_nejsds36_ineq_032"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">I</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\mathbf{I}_{f}}$]]></tex-math></alternatives></inline-formula> is non-singular, <inline-formula id="j_nejsds36_ineq_033"><alternatives><mml:math>
<mml:msqrt>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
</mml:msqrt>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold-italic">β</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">ˆ</mml:mo></mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mi mathvariant="italic">A</mml:mi>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mi mathvariant="italic">O</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>−</mml:mo>
<mml:mi mathvariant="bold-italic">β</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\sqrt{n}({\hat{\boldsymbol{\beta }}_{LASSO}}-\boldsymbol{\beta })$]]></tex-math></alternatives></inline-formula> converges in distribution [<xref ref-type="bibr" rid="j_nejsds36_ref_015">15</xref>, <xref ref-type="bibr" rid="j_nejsds36_ref_044">44</xref>] to <inline-formula id="j_nejsds36_ineq_034"><alternatives><mml:math>
<mml:mi mathvariant="italic">N</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">σ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">I</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$N(0,{\sigma ^{2}}{\mathbf{I}_{f}^{-1}})$]]></tex-math></alternatives></inline-formula>, where <inline-formula id="j_nejsds36_ineq_035"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">I</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\mathbf{I}_{f}}$]]></tex-math></alternatives></inline-formula> is as in (<xref rid="j_nejsds36_eq_002">2.2</xref>). In practice, cross-validation is typically used to tune <italic>λ</italic>. Two solutions are widely used: one with the minimum cross-validation error (corresponding to <monospace>lambda.min</monospace> in the R package <monospace>glmnet</monospace> [<xref ref-type="bibr" rid="j_nejsds36_ref_014">14</xref>]) and one that is the most regularized solution within one standard error of the minimum cross-validation error (corresponding to <monospace>lambda.1se</monospace> in the R package <monospace>glmnet</monospace>).</p>
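Since <monospace>glmnet</monospace> is an R implementation, the optimization in (2.3) can also be sketched directly in a few lines of numpy via cyclic coordinate descent with soft-thresholding. This is a minimal illustration, not the <monospace>glmnet</monospace> algorithm; the function name <monospace>lasso_cd</monospace> and its defaults are ours.

```python
import numpy as np

def lasso_cd(Z, y, lam, n_iter=200):
    """Minimize (1/n) * sum_i (y_i - z_i^T b)^2 + lam * ||b||_1
    by cyclic coordinate descent with soft-thresholding."""
    n, d = Z.shape
    b = np.zeros(d)
    col_sq = (Z ** 2).sum(axis=0) / n                 # per-coordinate curvature
    for _ in range(n_iter):
        for j in range(d):
            r_j = y - Z @ b + Z[:, j] * b[j]          # partial residual excluding j
            rho = Z[:, j] @ r_j / n
            # soft-threshold at lam/2 (from the subgradient of the l1 term)
            b[j] = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / col_sq[j]
    return b
```

Small values of <monospace>lam</monospace> give near-OLS fits, while larger values set coordinates exactly to zero, which is what makes the estimator useful for identifying active variables.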
</sec>
</sec>
<sec id="j_nejsds36_s_007">
<label>2.2</label>
<title>Subdata Selection Methods for OLS</title>
<p>For a linear regression model, current subdata selection methods can be broadly classified into two categories:</p>
<list>
<list-item id="j_nejsds36_li_001">
<label>•</label>
<p>Probabilistic methods. The <italic>k</italic> subdata observations are randomly sampled with replacement from the full data, each time using a selection probability <inline-formula id="j_nejsds36_ineq_036"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">π</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\pi _{i}}$]]></tex-math></alternatives></inline-formula> for the <italic>i</italic>th observation in the full data, <inline-formula id="j_nejsds36_ineq_037"><alternatives><mml:math>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">n</mml:mi></mml:math><tex-math><![CDATA[$i=1,\dots ,n$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_038"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mo largeop="false" movablelimits="false">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">π</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn></mml:math><tex-math><![CDATA[${\textstyle\sum _{i=1}^{n}}{\pi _{i}}=1$]]></tex-math></alternatives></inline-formula>. This is often combined with the weighted least squares estimator [cf. <xref ref-type="bibr" rid="j_nejsds36_ref_041">41</xref>] 
<disp-formula id="j_nejsds36_eq_004">
<label>(2.4)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:msup>
<mml:mrow>
<mml:mo maxsize="2.03em" minsize="2.03em" fence="true" mathvariant="normal">(</mml:mo>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">∑</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">η</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo maxsize="2.03em" minsize="2.03em" fence="true" mathvariant="normal">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">∑</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">η</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ {\bigg({\sum \limits_{i=1}^{n}}{w_{i}}{\eta _{i}}{\mathbf{z}_{i}}{\mathbf{z}_{i}^{T}}\bigg)^{-1}}{\sum \limits_{i=1}^{n}}{w_{i}}{\eta _{i}}{\mathbf{z}_{i}}{y_{i}},\]]]></tex-math></alternatives>
</disp-formula> 
where <inline-formula id="j_nejsds36_ineq_039"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">η</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\eta _{i}}$]]></tex-math></alternatives></inline-formula> denotes the number of times that the <italic>i</italic>th data point is included in the subsample, and the weight <inline-formula id="j_nejsds36_ineq_040"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${w_{i}}$]]></tex-math></alternatives></inline-formula> is often taken to be proportional to <inline-formula id="j_nejsds36_ineq_041"><alternatives><mml:math>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">π</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[$1/{\pi _{i}}$]]></tex-math></alternatives></inline-formula>. Using <inline-formula id="j_nejsds36_ineq_042"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">π</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:mi mathvariant="italic">n</mml:mi></mml:math><tex-math><![CDATA[${\pi _{i}}=1/n$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_nejsds36_ineq_043"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn></mml:math><tex-math><![CDATA[${w_{i}}=1$]]></tex-math></alternatives></inline-formula> for <inline-formula id="j_nejsds36_ineq_044"><alternatives><mml:math>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">n</mml:mi></mml:math><tex-math><![CDATA[$i=1,\dots ,n$]]></tex-math></alternatives></inline-formula> constitutes simple random sampling with replacement, which we refer to as uniform sampling (UNI), whereas leverage sampling [<xref ref-type="bibr" rid="j_nejsds36_ref_010">10</xref>, <xref ref-type="bibr" rid="j_nejsds36_ref_025">25</xref>, <xref ref-type="bibr" rid="j_nejsds36_ref_023">23</xref>, <xref ref-type="bibr" rid="j_nejsds36_ref_024">24</xref>, <xref ref-type="bibr" rid="j_nejsds36_ref_009">9</xref>] uses <inline-formula id="j_nejsds36_ineq_045"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">π</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${\pi _{i}}={h_{ii}}/(p+1)$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_nejsds36_ineq_046"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">π</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${w_{i}}=1/{\pi _{i}}$]]></tex-math></alternatives></inline-formula> where <inline-formula id="j_nejsds36_ineq_047"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${h_{ii}}$]]></tex-math></alternatives></inline-formula> is the leverage value of the <italic>i</italic>th observation obtained as <inline-formula id="j_nejsds36_ineq_048"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msup>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mi mathvariant="bold">Z</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\mathbf{z}_{i}^{T}}{({\mathbf{Z}^{T}}\mathbf{Z})^{-1}}{\mathbf{z}_{i}}$]]></tex-math></alternatives></inline-formula>. Another probabilistic approach uses influence functions to find the subsampling probabilities [<xref ref-type="bibr" rid="j_nejsds36_ref_036">36</xref>]. One major limitation of probabilistic methods [see <xref ref-type="bibr" rid="j_nejsds36_ref_041">41</xref>] is that the variances of the estimators in (<xref rid="j_nejsds36_eq_004">2.4</xref>), under some mild conditions on <bold>X</bold>, are bounded below by quantities of order <inline-formula id="j_nejsds36_ineq_049"><alternatives><mml:math>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:mi mathvariant="italic">k</mml:mi></mml:math><tex-math><![CDATA[$1/k$]]></tex-math></alternatives></inline-formula>, and do not approach 0 as <inline-formula id="j_nejsds36_ineq_050"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo stretchy="false">→</mml:mo>
<mml:mi>∞</mml:mi></mml:math><tex-math><![CDATA[$n\to \infty $]]></tex-math></alternatives></inline-formula>. Except for UNI, these methods can still be computationally expensive for the values of <italic>n</italic> and <italic>p</italic> considered in this paper.</p>
</list-item>
<list-item id="j_nejsds36_li_002">
<label>•</label>
<p>Deterministic methods. These methods, some of which draw inspiration from the optimal design literature, aim to select subdata of size <italic>k</italic> that optimizes an objective function. Under model (<xref rid="j_nejsds36_eq_001">2.1</xref>), the information matrix for <inline-formula id="j_nejsds36_ineq_051"><alternatives><mml:math>
<mml:mi mathvariant="bold-italic">β</mml:mi></mml:math><tex-math><![CDATA[$\boldsymbol{\beta }$]]></tex-math></alternatives></inline-formula> when using subdata of size <italic>k</italic> is 
<disp-formula id="j_nejsds36_eq_005">
<label>(2.5)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="script">I</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="bold-italic">δ</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">σ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mi mathvariant="bold">Δ</mml:mi>
<mml:mi mathvariant="bold">Z</mml:mi>
<mml:mo>=</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">σ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ \mathcal{I}(\boldsymbol{\delta })=\frac{1}{{\sigma ^{2}}}{\mathbf{Z}^{T}}\boldsymbol{\Delta }\mathbf{Z}=\frac{1}{{\sigma ^{2}}}{\mathbf{Z}_{s}^{T}}{\mathbf{Z}_{s}},\]]]></tex-math></alternatives>
</disp-formula> 
where <inline-formula id="j_nejsds36_ineq_052"><alternatives><mml:math>
<mml:mi mathvariant="bold">Δ</mml:mi>
<mml:mo>=</mml:mo>
<mml:mtext>diag</mml:mtext>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="bold-italic">δ</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\boldsymbol{\Delta }=\text{diag}(\boldsymbol{\delta })$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_053"><alternatives><mml:math>
<mml:mi mathvariant="bold-italic">δ</mml:mi>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">δ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">δ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[$\boldsymbol{\delta }={({\delta _{1}},\dots ,{\delta _{n}})^{T}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_054"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">δ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\delta _{i}}$]]></tex-math></alternatives></inline-formula> is the indicator variable indicating whether the <italic>i</italic>th data point is in the subdata or not, and <inline-formula id="j_nejsds36_ineq_055"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mo largeop="false" movablelimits="false">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">δ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">k</mml:mi></mml:math><tex-math><![CDATA[${\textstyle\sum _{i=1}^{n}}{\delta _{i}}=k$]]></tex-math></alternatives></inline-formula>. Based on the structure of <italic>D</italic>-optimal designs [<xref ref-type="bibr" rid="j_nejsds36_ref_020">20</xref>], [<xref ref-type="bibr" rid="j_nejsds36_ref_041">41</xref>] devised a deterministic subdata selection method for finding subdata that, approximately, maximizes the determinant of <inline-formula id="j_nejsds36_ineq_056"><alternatives><mml:math>
<mml:mi mathvariant="script">I</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="bold-italic">δ</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\mathcal{I}(\boldsymbol{\delta })$]]></tex-math></alternatives></inline-formula> for a given value of <italic>k</italic>. The method is called <italic>D</italic>-optimal Information-Based Optimal Subdata Selection (IBOSS). The <italic>D</italic>-optimality criterion minimizes the expected volume of the joint confidence ellipsoid for <inline-formula id="j_nejsds36_ineq_057"><alternatives><mml:math>
<mml:mi mathvariant="bold-italic">β</mml:mi></mml:math><tex-math><![CDATA[$\boldsymbol{\beta }$]]></tex-math></alternatives></inline-formula>. When <inline-formula id="j_nejsds36_ineq_058"><alternatives><mml:math>
<mml:mi mathvariant="italic">r</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>2</mml:mn>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$r=k/(2p)$]]></tex-math></alternatives></inline-formula> is an integer, the simple algorithm proposed in [<xref ref-type="bibr" rid="j_nejsds36_ref_041">41</xref>], which has computational complexity <inline-formula id="j_nejsds36_ineq_059"><alternatives><mml:math>
<mml:mi mathvariant="italic">O</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$O(np)$]]></tex-math></alternatives></inline-formula>, selects IBOSS subdata by sequentially considering each column of <bold>X</bold> and selecting the observations with the <italic>r</italic> smallest and <italic>r</italic> largest values for each of the <italic>p</italic> variables. Orthogonal subsampling, proposed by [<xref ref-type="bibr" rid="j_nejsds36_ref_043">43</xref>], is another deterministic approach inspired by experimental design. In addition to having extreme observations for each variable, a better approximate solution to maximizing the determinant can be obtained if the subdata has a structure that mirrors that of an orthogonal array of strength 2 [<xref ref-type="bibr" rid="j_nejsds36_ref_017">17</xref>]. This can yield subdata with a better spread in <italic>p</italic>-dimensional space. [<xref ref-type="bibr" rid="j_nejsds36_ref_041">41</xref>] established desirable theoretical properties of the estimator from the IBOSS sample (discussed in Section <xref rid="j_nejsds36_s_012">3.3</xref>).</p>
</list-item>
</list>
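As a concrete illustration of the probabilistic route, the following numpy sketch draws a with-replacement leverage subsample and evaluates the weighted least squares estimator in (2.4) with weights proportional to the inverse sampling probabilities. The function and variable names are ours, not from the cited papers.

```python
import numpy as np

def leverage_subsample_wls(Z, y, k, rng):
    """Draw k rows with replacement, pi_i proportional to the leverage h_ii,
    then compute the weighted LS estimator in (2.4) with w_i = 1/pi_i."""
    n, _ = Z.shape
    Q, _ = np.linalg.qr(Z)                 # h_ii = z_i^T (Z^T Z)^{-1} z_i via QR
    h = (Q ** 2).sum(axis=1)
    pi = h / h.sum()                       # equals h_ii/(p+1) since sum(h_ii) = p+1
    idx = rng.choice(n, size=k, replace=True, p=pi)
    eta = np.bincount(idx, minlength=n)    # eta_i: times row i is included
    W = eta / pi                           # combined weight w_i * eta_i, w_i = 1/pi_i
    A = (Z * W[:, None]).T @ Z
    return np.linalg.solve(A, (Z * W[:, None]).T @ y)
```

Computing the leverages through a QR decomposition avoids forming and inverting Z<sup>T</sup>Z explicitly, which is numerically more stable.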
<p>In what follows, we use IBOSS as a subdata selection method owing to its computational and statistical superiority.</p>
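The IBOSS selection rule adopted here can be sketched as follows, assuming <italic>r</italic> = <italic>k</italic>/(2<italic>p</italic>) is an integer. <monospace>np.argpartition</monospace> stands in for the partition-based selection that gives the <italic>O</italic>(<italic>np</italic>) cost; the function name is illustrative.

```python
import numpy as np

def iboss_select(X, k):
    """D-optimal IBOSS sketch: scan the p columns in turn and, among rows not
    yet selected, keep the r smallest and r largest values, r = k/(2p)."""
    n, p = X.shape
    r = k // (2 * p)
    assert 2 * r * p == k, "k must be a multiple of 2p"
    selected = np.zeros(n, dtype=bool)
    for j in range(p):
        avail = np.flatnonzero(~selected)
        vals = X[avail, j]
        selected[avail[np.argpartition(vals, r - 1)[:r]]] = True           # r smallest
        selected[avail[np.argpartition(vals, len(avail) - r)[-r:]]] = True  # r largest
    return np.flatnonzero(selected)
```

Because rows already selected for an earlier column are excluded from later columns, the procedure returns exactly <italic>k</italic> row indices.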
</sec>
<sec id="j_nejsds36_s_008">
<label>2.3</label>
<title>Challenges With Using IBOSS for Large <italic>p</italic></title>
<p>Since IBOSS attempts to select at least 2 data points for each variable, it can only be applied if <inline-formula id="j_nejsds36_ineq_060"><alternatives><mml:math>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mo stretchy="false">≥</mml:mo>
<mml:mn>2</mml:mn>
<mml:mi mathvariant="italic">p</mml:mi></mml:math><tex-math><![CDATA[$k\ge 2p$]]></tex-math></alternatives></inline-formula>. But even when this condition is satisfied, the subdata obtained by applying IBOSS may perform poorly for large <italic>p</italic>, since many variables are likely to be inactive. For <inline-formula id="j_nejsds36_ineq_061"><alternatives><mml:math>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mo mathvariant="normal">&lt;</mml:mo>
<mml:mn>2</mml:mn>
<mml:mi mathvariant="italic">p</mml:mi></mml:math><tex-math><![CDATA[$k\lt 2p$]]></tex-math></alternatives></inline-formula>, [<xref ref-type="bibr" rid="j_nejsds36_ref_044">44</xref>] recently proposed first selecting <inline-formula id="j_nejsds36_ineq_062"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>+</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${p^{+}}$]]></tex-math></alternatives></inline-formula> variables with the largest absolute correlation with the response by the sure independence screening (SIS) method [<xref ref-type="bibr" rid="j_nejsds36_ref_011">11</xref>] and then only using these <inline-formula id="j_nejsds36_ineq_063"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>+</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${p^{+}}$]]></tex-math></alternatives></inline-formula> variables for selecting the IBOSS subdata. They call this procedure SIS-IBOSS. They then analyzed the subdata with LASSO, using all <italic>p</italic> variables. They also used this analysis for <inline-formula id="j_nejsds36_ineq_064"><alternatives><mml:math>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mo stretchy="false">≥</mml:mo>
<mml:mn>2</mml:mn>
<mml:mi mathvariant="italic">p</mml:mi></mml:math><tex-math><![CDATA[$k\ge 2p$]]></tex-math></alternatives></inline-formula>, in which case they selected the subdata simply by using IBOSS. If the tuning parameter <italic>λ</italic> of LASSO goes to 0 slower than <inline-formula id="j_nejsds36_ineq_065"><alternatives><mml:math>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:msqrt>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
</mml:msqrt></mml:math><tex-math><![CDATA[$1/\sqrt{n}$]]></tex-math></alternatives></inline-formula>, then [<xref ref-type="bibr" rid="j_nejsds36_ref_044">44</xref>] showed that the asymptotic behavior of the <inline-formula id="j_nejsds36_ineq_066"><alternatives><mml:math><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold-italic">β</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">ˆ</mml:mo></mml:mover></mml:math><tex-math><![CDATA[$\hat{\boldsymbol{\beta }}$]]></tex-math></alternatives></inline-formula> in (<xref rid="j_nejsds36_eq_003">2.3</xref>) is the same as that of an OLS estimator. Therefore, both IBOSS and SIS-IBOSS work well with LASSO. Note, however, that SIS only considers the marginal relation of each variable with the response and works best when the variables are independent [<xref ref-type="bibr" rid="j_nejsds36_ref_011">11</xref>]. SIS-IBOSS also suffers from this problem (see Section <xref rid="j_nejsds36_s_014">4</xref>) and does not work well when the variables are correlated. Another challenge with SIS-IBOSS is that a good value of <inline-formula id="j_nejsds36_ineq_067"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>+</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${p^{+}}$]]></tex-math></alternatives></inline-formula> is generally unknown and is hard to guess.</p>
<p>We therefore develop a subdata selection method that mitigates these challenges for high-dimensional data.</p>
</sec>
</sec>
<sec id="j_nejsds36_s_009" sec-type="methods">
<label>3</label>
<title>Methodology</title>
<sec id="j_nejsds36_s_010">
<label>3.1</label>
<title>Overview of Our Method</title>
<p>With the ultimate goal of prediction, our method first screens variables to identify the active variables, and then performs the subdata selection using only the identified variables. Finally, a linear regression model with only the variables identified as active is fitted using the subdata and OLS estimation. Algorithm <xref rid="j_nejsds36_fig_001">1</xref> provides more details for our method, which we call <italic>CLASS</italic> for Combining Lasso and Subdata Selection. Steps 1 to 8 of Algorithm <xref rid="j_nejsds36_fig_001">1</xref> focus on variable selection. The novel variable selection method runs LASSO multiple times on small randomly selected subsets of the full data. Variables that are consistently selected in different LASSO runs are declared active. Unlike [<xref ref-type="bibr" rid="j_nejsds36_ref_028">28</xref>] and [<xref ref-type="bibr" rid="j_nejsds36_ref_003">3</xref>], we use a kmeans-based data-driven approach for deciding which variables are consistently selected. Step 9 of Algorithm <xref rid="j_nejsds36_fig_001">1</xref> focuses on the subdata selection using IBOSS only on the selected variables. Finally, in step 10 of Algorithm <xref rid="j_nejsds36_fig_001">1</xref>, we fit a linear regression model to the subdata and selected variables using OLS estimation. This model can then be used to obtain predictions on test data.</p>
<fig id="j_nejsds36_fig_001">
<label>Algorithm 1:</label>
<caption>
<p>CLASS.</p>
</caption>
<graphic xlink:href="nejsds36_g001.jpg"/>
</fig>
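The variable selection loop in steps 1–6 of Algorithm 1 can be sketched as follows: run LASSO repeatedly on small random subsets and tally how often each variable receives a nonzero slope. This is a minimal sketch, not the paper's implementation; the plain coordinate-descent solver, the penalty scale <monospace>lam</monospace>, and the function names are our illustrative choices.

```python
import numpy as np

def lasso_cd(X, y, lam, sweeps=50):
    """Plain coordinate-descent LASSO for 0.5*||y - X b||^2 + lam*||b||_1."""
    b = np.zeros(X.shape[1])
    ss = (X ** 2).sum(axis=0)            # per-column sums of squares
    for _ in range(sweeps):
        for j in range(X.shape[1]):
            # partial residual correlation for coordinate j
            rho = X[:, j] @ (y - X @ b) + ss[j] * b[j]
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / ss[j]
    return b

def selection_counts(X, y, nsample=1000, ntimes=100, lam=None, seed=0):
    """Steps 1-6 of Algorithm 1 (sketch): LASSO on `ntimes` random subsets
    of `nsample` rows; count nonzero slopes per variable."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    lam = 0.1 * nsample if lam is None else lam   # heuristic scale, ours
    counts = np.zeros(p, dtype=int)
    for _ in range(ntimes):
        idx = rng.choice(n, size=min(nsample, n), replace=False)
        counts += lasso_cd(X[idx], y[idx], lam) != 0
    return counts
```

Active variables then show counts near <monospace>ntimes</monospace>, while inactive ones are selected only sporadically, which is what the thresholding in steps 7–8 exploits.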
<p>In the next two subsections, we provide evidence for the superior performance of CLASS.</p>
</sec>
<sec id="j_nejsds36_s_011">
<label>3.2</label>
<title>Variable Selection With Large <italic>n</italic> and Large <italic>p</italic></title>
<p>We perform repeated applications of LASSO in step 3 of Algorithm <xref rid="j_nejsds36_fig_001">1</xref>, each time on a randomly selected subset of size <inline-formula id="j_nejsds36_ineq_068"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mi mathvariant="italic">a</mml:mi>
<mml:mi mathvariant="italic">m</mml:mi>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mi mathvariant="italic">l</mml:mi>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mo>×</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi></mml:math><tex-math><![CDATA[$nsample\times p$]]></tex-math></alternatives></inline-formula>. Therefore, for CLASS to perform well, we need to identify conditions under which LASSO performs well. Regular consistency and the model selection consistency of <inline-formula id="j_nejsds36_ineq_069"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold-italic">β</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">ˆ</mml:mo></mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mi mathvariant="italic">A</mml:mi>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mi mathvariant="italic">O</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\hat{\boldsymbol{\beta }}_{LASSO}}$]]></tex-math></alternatives></inline-formula> are well-studied in the literature. Writing the tuning parameter as <inline-formula id="j_nejsds36_ineq_070"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">λ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\lambda _{n}}$]]></tex-math></alternatives></inline-formula>, if <inline-formula id="j_nejsds36_ineq_071"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">λ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\lambda _{n}}$]]></tex-math></alternatives></inline-formula> tends to zero slower than <inline-formula id="j_nejsds36_ineq_072"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${n^{-1/2}}$]]></tex-math></alternatives></inline-formula>, then <inline-formula id="j_nejsds36_ineq_073"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold-italic">β</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">ˆ</mml:mo></mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mi mathvariant="italic">A</mml:mi>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mi mathvariant="italic">O</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\hat{\boldsymbol{\beta }}_{LASSO}}$]]></tex-math></alternatives></inline-formula> converges to <inline-formula id="j_nejsds36_ineq_074"><alternatives><mml:math>
<mml:mi mathvariant="bold-italic">β</mml:mi></mml:math><tex-math><![CDATA[$\boldsymbol{\beta }$]]></tex-math></alternatives></inline-formula>, that is, the solution is consistent [<xref ref-type="bibr" rid="j_nejsds36_ref_015">15</xref>]. With the same condition on the tuning parameter, <inline-formula id="j_nejsds36_ineq_075"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold-italic">β</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">ˆ</mml:mo></mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mi mathvariant="italic">A</mml:mi>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mi mathvariant="italic">O</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\hat{\boldsymbol{\beta }}_{LASSO}}$]]></tex-math></alternatives></inline-formula> is strongly sign consistent as <inline-formula id="j_nejsds36_ineq_076"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo stretchy="false">→</mml:mo>
<mml:mi>∞</mml:mi></mml:math><tex-math><![CDATA[$n\to \infty $]]></tex-math></alternatives></inline-formula> if and only if the following condition is satisfied [<xref ref-type="bibr" rid="j_nejsds36_ref_054">54</xref>, <xref ref-type="bibr" rid="j_nejsds36_ref_051">51</xref>]: 
<disp-formula id="j_nejsds36_eq_006">
<label>(3.1)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mo stretchy="false">|</mml:mo>
<mml:mo stretchy="false">|</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">J</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">J</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msup>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">J</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">J</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">g</mml:mi>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold-italic">β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">J</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo stretchy="false">|</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mo stretchy="false">|</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>∞</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">≤</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ ||{\mathbf{Z}_{{J^{C}}}^{T}}{\mathbf{Z}_{J}}{({\mathbf{Z}_{J}^{T}}{\mathbf{Z}_{J}})^{-1}}sign({\boldsymbol{\beta }_{J}})|{|_{\infty }}\le 1,\]]]></tex-math></alternatives>
</disp-formula> 
where <italic>J</italic> is the index set for active variables, <inline-formula id="j_nejsds36_ineq_077"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">J</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${J^{C}}$]]></tex-math></alternatives></inline-formula> is the complement of <italic>J</italic>, and <inline-formula id="j_nejsds36_ineq_078"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold-italic">β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">J</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\boldsymbol{\beta }_{J}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_nejsds36_ineq_079"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">J</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\mathbf{Z}_{J}}$]]></tex-math></alternatives></inline-formula> are the subvector of <inline-formula id="j_nejsds36_ineq_080"><alternatives><mml:math>
<mml:mi mathvariant="bold-italic">β</mml:mi></mml:math><tex-math><![CDATA[$\boldsymbol{\beta }$]]></tex-math></alternatives></inline-formula> and submatrix of <bold>Z</bold> with elements and columns, respectively, corresponding to <italic>J</italic>. Since the condition in (<xref rid="j_nejsds36_eq_006">3.1</xref>) is not trivial to verify, it is difficult to guarantee strong sign consistency of LASSO for any <bold>Z</bold>. For the special case that the tuning parameter <inline-formula id="j_nejsds36_ineq_081"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">λ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">λ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${\lambda _{n}}={\lambda _{0}}{n^{-1/2}}$]]></tex-math></alternatives></inline-formula>, [<xref ref-type="bibr" rid="j_nejsds36_ref_003">3</xref>] showed that under regularity conditions, (a) the sign of <inline-formula id="j_nejsds36_ineq_082"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold-italic">β</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">ˆ</mml:mo></mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mi mathvariant="italic">A</mml:mi>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mi mathvariant="italic">O</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\hat{\boldsymbol{\beta }}_{LASSO}}$]]></tex-math></alternatives></inline-formula> matches with that of the true <inline-formula id="j_nejsds36_ineq_083"><alternatives><mml:math>
<mml:mi mathvariant="bold-italic">β</mml:mi></mml:math><tex-math><![CDATA[$\boldsymbol{\beta }$]]></tex-math></alternatives></inline-formula> for indices in <italic>J</italic> with probability tending to 1 exponentially fast, and (b) all variables with indices in <inline-formula id="j_nejsds36_ineq_084"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">J</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">C</mml:mi>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${J^{C}}$]]></tex-math></alternatives></inline-formula> are selected with a probability strictly between <inline-formula id="j_nejsds36_ineq_085"><alternatives><mml:math>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$(0,1)$]]></tex-math></alternatives></inline-formula>. Therefore, [<xref ref-type="bibr" rid="j_nejsds36_ref_003">3</xref>] proposed <italic>Bolasso</italic>, in which LASSO is applied multiple times, each time to a different bootstrapped sample of the original data. The variables that appear either in all or in at least <inline-formula id="j_nejsds36_ineq_086"><alternatives><mml:math>
<mml:mn>90</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$90\% $]]></tex-math></alternatives></inline-formula> of the LASSO models are declared active. The latter version is called the soft-threshold version of the <italic>Bolasso</italic>. Except that we use samples of much smaller size than the original data and do not use a fixed cutoff for deciding which variables are active in a LASSO run, our approach mirrors the <italic>Bolasso</italic> idea and shares the theoretical properties outlined in [<xref ref-type="bibr" rid="j_nejsds36_ref_003">3</xref>].</p>
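For a given <bold>Z</bold>, active index set <italic>J</italic>, and sign pattern, the condition in (3.1) can be checked numerically. The helper below is a sketch with our own naming; it computes the left-hand side of (3.1) directly.

```python
import numpy as np

def irrepresentable_lhs(Z, J, sign_beta_J):
    """Left-hand side of (3.1):
    || Z_{J^c}^T Z_J (Z_J^T Z_J)^{-1} sign(beta_J) ||_inf."""
    Jc = np.setdiff1d(np.arange(Z.shape[1]), J)
    ZJ, ZJc = Z[:, J], Z[:, Jc]
    v = ZJc.T @ ZJ @ np.linalg.solve(ZJ.T @ ZJ, np.asarray(sign_beta_J, float))
    return float(np.abs(v).max()) if v.size else 0.0
```

For independent Gaussian columns the left-hand side is typically far below 1, so (3.1) holds; an inactive column strongly correlated with the active ones (e.g., close to their sum) can push it above 1, which is exactly when sign consistency of LASSO fails.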
<p>The variable selection component of Algorithm <xref rid="j_nejsds36_fig_001">1</xref> is also closely related to the stability selection method developed in [<xref ref-type="bibr" rid="j_nejsds36_ref_028">28</xref>]. They proposed using <inline-formula id="j_nejsds36_ineq_087"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mi mathvariant="italic">a</mml:mi>
<mml:mi mathvariant="italic">m</mml:mi>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mi mathvariant="italic">l</mml:mi>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:mn>2</mml:mn></mml:math><tex-math><![CDATA[$nsample=n/2$]]></tex-math></alternatives></inline-formula> and obtained good results by declaring variables active if counts exceed a threshold of 20%–80% of <inline-formula id="j_nejsds36_ineq_088"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">m</mml:mi>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mi mathvariant="italic">s</mml:mi></mml:math><tex-math><![CDATA[$ntimes$]]></tex-math></alternatives></inline-formula>. Their simulations suggest that <inline-formula id="j_nejsds36_ineq_089"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">m</mml:mi>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>100</mml:mn></mml:math><tex-math><![CDATA[$ntimes=100$]]></tex-math></alternatives></inline-formula> is a reasonable choice. With Steps 7–8 of Algorithm <xref rid="j_nejsds36_fig_001">1</xref>, by using kmeans with two clusters, we provide a data-driven way to soft threshold the value above which a variable is declared active. In Section <xref rid="j_nejsds36_s_014">4</xref>, we will demonstrate that the choices <inline-formula id="j_nejsds36_ineq_090"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mi mathvariant="italic">a</mml:mi>
<mml:mi mathvariant="italic">m</mml:mi>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mi mathvariant="italic">l</mml:mi>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1000</mml:mn></mml:math><tex-math><![CDATA[$nsample=1000$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_nejsds36_ineq_091"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">m</mml:mi>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>100</mml:mn></mml:math><tex-math><![CDATA[$ntimes=100$]]></tex-math></alternatives></inline-formula> for Algorithm <xref rid="j_nejsds36_fig_001">1</xref> and using kmeans with two clusters gives good results. Other choices for <inline-formula id="j_nejsds36_ineq_092"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mi mathvariant="italic">a</mml:mi>
<mml:mi mathvariant="italic">m</mml:mi>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mi mathvariant="italic">l</mml:mi>
<mml:mi mathvariant="italic">e</mml:mi></mml:math><tex-math><![CDATA[$nsample$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_nejsds36_ineq_093"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">m</mml:mi>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mi mathvariant="italic">s</mml:mi></mml:math><tex-math><![CDATA[$ntimes$]]></tex-math></alternatives></inline-formula> are explored in the Supplementary Material. Simulations in Section <xref rid="j_nejsds36_s_014">4</xref> confirm that CLASS tends to select all active variables and a much smaller number of inactive variables than other competing methods.</p>
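The data-driven thresholding of steps 7–8 can be sketched as a one-dimensional kmeans with two clusters on the selection counts, declaring the higher-mean cluster active. The initialization at the extremes and the function name below are our choices, not prescriptions from Algorithm 1.

```python
import numpy as np

def kmeans2_active(counts, iters=100):
    """Steps 7-8 of Algorithm 1 (sketch): split the selection counts into
    two clusters with 1-D kmeans; variables in the higher-mean cluster are
    declared active."""
    counts = np.asarray(counts, dtype=float)
    centers = np.array([counts.min(), counts.max()])  # init at the extremes
    for _ in range(iters):
        # assign each count to its nearest center, then recompute centers
        labels = np.abs(counts[:, None] - centers[None, :]).argmin(axis=1)
        for c in (0, 1):
            if np.any(labels == c):
                centers[c] = counts[labels == c].mean()
    return np.flatnonzero(labels == np.argmax(centers))
```

For counts such as <monospace>[100, 98, 97, 12, 9, 0, 3]</monospace>, the higher cluster, and hence the active set, is the first three variables; no fixed cutoff needs to be specified in advance.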
</sec>
<sec id="j_nejsds36_s_012">
<label>3.3</label>
<title>Subdata Selection on Selected Variables</title>
<p>In the previous section, we argued that the selected variables contain (a) the true active variables with very large probability and (b) some inactive variables with positive probability. In Steps 9–10 of Algorithm <xref rid="j_nejsds36_fig_001">1</xref>, we first use IBOSS to select subdata of size <italic>k</italic> using only the variables selected in step 8. We then fit a linear regression model with an intercept and these variables using the subdata and OLS estimation. This section discusses the asymptotic properties of the OLS estimator of <inline-formula id="j_nejsds36_ineq_094"><alternatives><mml:math>
<mml:mi mathvariant="bold-italic">β</mml:mi></mml:math><tex-math><![CDATA[$\boldsymbol{\beta }$]]></tex-math></alternatives></inline-formula> obtained as a result of Algorithm <xref rid="j_nejsds36_fig_001">1</xref>. If we select all active variables in Step 8 of Algorithm <xref rid="j_nejsds36_fig_001">1</xref>, then the OLS estimators for the slopes of the active variables obtained in Step 10 of Algorithm <xref rid="j_nejsds36_fig_001">1</xref> are unbiased.</p>
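Steps 9–10 can be sketched as follows, with a simplified sequential IBOSS pass over the selected columns; the exact tie-handling and bookkeeping of the original IBOSS algorithm of [41] may differ.

```python
import numpy as np

def iboss(X, k):
    """Simplified IBOSS pass: for each column in turn, move the r rows with
    the smallest and the r rows with the largest values (among rows not yet
    taken) into the subdata, where r is about k/(2p)."""
    n, p = X.shape
    r = max(1, k // (2 * p))
    taken = np.zeros(n, dtype=bool)
    for j in range(p):
        avail = np.flatnonzero(~taken)
        order = avail[np.argsort(X[avail, j])]
        taken[order[:r]] = True    # r smallest values of column j
        taken[order[-r:]] = True   # r largest values of column j
    return np.flatnonzero(taken)

def fit_subdata(X, y, active, k):
    """Steps 9-10 (sketch): IBOSS on the selected columns only, then OLS
    with an intercept on the resulting subdata."""
    sub = iboss(X[:, active], k)
    Z = np.column_stack([np.ones(sub.size), X[np.ix_(sub, active)]])
    coef = np.linalg.lstsq(Z, y[sub], rcond=None)[0]
    return sub, coef
```

Because the subdata concentrate on extreme values of the selected variables, the slope estimates are far more precise than those from a uniform subsample of the same size.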
<p>IBOSS subdata attempts to maximize the determinant of the information matrix of the model based on the selected variables. Since we apply IBOSS using only the selected variables, IBOSS subdata for CLASS gives a larger determinant of the information matrix corresponding to these selected variables than a method that uses IBOSS on all variables. If the selected variables are precisely the active variables, again indexed by <italic>J</italic>, then <italic>D</italic>-optimal IBOSS subdata obtained by using only those variables minimizes the determinant of the variance-covariance matrix of the estimator <inline-formula id="j_nejsds36_ineq_095"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold-italic">β</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">ˆ</mml:mo></mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">J</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">D</mml:mi>
</mml:mrow>
</mml:msubsup></mml:math><tex-math><![CDATA[${\hat{\boldsymbol{\beta }}_{J}^{D}}$]]></tex-math></alternatives></inline-formula> of the parameter vector in the linear regression model that only uses the active variables. Provided that the column means for the full data are available, the estimate of the intercept parameter is adjusted in [<xref ref-type="bibr" rid="j_nejsds36_ref_041">41</xref>] to maintain the predictive power of the model, that is, 
<disp-formula id="j_nejsds36_eq_007">
<alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:msubsup>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">β</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">ˆ</mml:mo></mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">J</mml:mi>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">D</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>=</mml:mo><mml:mstyle mathvariant="bold"><mml:mover accent="true">
<mml:mrow>
<mml:mi>Y</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">¯</mml:mo></mml:mover></mml:mstyle>
<mml:mo>−</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mstyle mathvariant="bold"><mml:mover accent="true">
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">¯</mml:mo></mml:mover></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">J</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msubsup>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold-italic">β</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">ˆ</mml:mo></mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mi mathvariant="italic">J</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">D</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ {\hat{\beta }_{J0}^{D}}=\mathbf{\bar{Y}}-{\mathbf{\bar{X}}_{J}^{T}}{\hat{\boldsymbol{\beta }}_{1J}^{D}},\]]]></tex-math></alternatives>
</disp-formula> 
where <inline-formula id="j_nejsds36_ineq_096"><alternatives><mml:math><mml:mstyle mathvariant="bold"><mml:mover accent="true">
<mml:mrow>
<mml:mi>Y</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">¯</mml:mo></mml:mover></mml:mstyle></mml:math><tex-math><![CDATA[$\mathbf{\bar{Y}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_nejsds36_ineq_097"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mstyle mathvariant="bold"><mml:mover accent="true">
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">¯</mml:mo></mml:mover></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">J</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\mathbf{\bar{X}}_{J}}$]]></tex-math></alternatives></inline-formula> are means based on the full data and <inline-formula id="j_nejsds36_ineq_098"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold-italic">β</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">ˆ</mml:mo></mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mi mathvariant="italic">J</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">D</mml:mi>
</mml:mrow>
</mml:msubsup></mml:math><tex-math><![CDATA[${\hat{\boldsymbol{\beta }}_{1J}^{D}}$]]></tex-math></alternatives></inline-formula> are the slope parameter estimates from <inline-formula id="j_nejsds36_ineq_099"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold-italic">β</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">ˆ</mml:mo></mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">J</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">D</mml:mi>
</mml:mrow>
</mml:msubsup></mml:math><tex-math><![CDATA[${\hat{\boldsymbol{\beta }}_{J}^{D}}$]]></tex-math></alternatives></inline-formula>. We will use this adjustment for estimating the intercept parameter.</p>
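The intercept adjustment is a one-line computation once the full-data means are available. The sketch below applies it after fitting slopes on deliberately unrepresentative subdata; the variable names are ours.

```python
import numpy as np

def adjusted_intercept(Y_bar, X_bar_J, slope_hat):
    """Intercept adjustment of [41]: beta0_hat = Ybar - Xbar_J^T beta1_hat,
    where Ybar and Xbar_J are means over the FULL data, not the subdata."""
    return float(Y_bar - np.asarray(X_bar_J) @ np.asarray(slope_hat))
```

Even when the subdata are selected from one tail of a covariate, so that the naive subdata intercept is biased, plugging the subdata slopes into the full-data means recovers the intercept and preserves predictive power.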
<p>Theorem 5 in [<xref ref-type="bibr" rid="j_nejsds36_ref_041">41</xref>] provides a general result for the variance of individual parameter estimators for IBOSS subdata. This result also applies to the individual parameter estimators of the selected variables obtained using IBOSS subdata in Algorithm <xref rid="j_nejsds36_fig_001">1</xref> if all active variables are among those that were selected. As argued in Section <xref rid="j_nejsds36_s_011">3.2</xref>, and as will be demonstrated through simulation, the chance of this happening is high. Theorem 6 in [<xref ref-type="bibr" rid="j_nejsds36_ref_041">41</xref>] establishes the asymptotic behavior of these variances when the joint variable distribution is either multivariate normal or lognormal. This result also applies to the subdata of CLASS provided that all active variables are among those that were selected.</p>
<p>For a multivariate normal or lognormal distribution of the variables, this would then imply that for the overall mean, <inline-formula id="j_nejsds36_ineq_100"><alternatives><mml:math>
<mml:mi mathvariant="italic">V</mml:mi>
<mml:mi mathvariant="italic">a</mml:mi>
<mml:mi mathvariant="italic">r</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">β</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">ˆ</mml:mo></mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">D</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="italic">X</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$Var({\hat{\beta }_{s0}^{D}}|X)$]]></tex-math></alternatives></inline-formula>, is proportional to <inline-formula id="j_nejsds36_ineq_101"><alternatives><mml:math>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:mi mathvariant="italic">k</mml:mi></mml:math><tex-math><![CDATA[$1/k$]]></tex-math></alternatives></inline-formula> and never converges to 0 with a fixed <italic>k</italic>. However, the variance of the estimators of the slope parameters for the selected variables would converge to 0 as the full data size <inline-formula id="j_nejsds36_ineq_102"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo stretchy="false">→</mml:mo>
<mml:mi>∞</mml:mi></mml:math><tex-math><![CDATA[$n\to \infty $]]></tex-math></alternatives></inline-formula>. As shown in [<xref ref-type="bibr" rid="j_nejsds36_ref_041">41</xref>], this nice property of IBOSS subdata is in contrast with other subsampling-based estimators, such as leverage sampling, where the variances of the slope parameter estimators do not converge to 0 because, under mild assumptions, they are bounded from below by terms that are proportional to <inline-formula id="j_nejsds36_ineq_103"><alternatives><mml:math>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:mi mathvariant="italic">k</mml:mi></mml:math><tex-math><![CDATA[$1/k$]]></tex-math></alternatives></inline-formula>.</p>
<p>The discussion up to this point has focused on variable selection and parameter estimation rather than prediction, even though good prediction is our ultimate goal. However, the strong variable selection properties of our method (see Section <xref rid="j_nejsds36_s_011">3.2</xref>), combined with subdata that optimizes estimation of the coefficients of the selected variables, should result in good prediction provided that the model assumptions hold.</p>
</sec>
<sec id="j_nejsds36_s_013">
<label>3.4</label>
<title>Desirable Features of CLASS</title>
<p>First, no matter how large the full data are, variable selection remains computationally feasible because CLASS runs LASSO only on small subsets of the full data. Second, CLASS is better than LASSO on the full data, and than other competing subdata selection methods, both at correctly selecting active variables and at not declaring inactive variables active. This superior performance holds irrespective of whether the variables are correlated; these claims are validated via simulations in the next section. Third, since CLASS employs IBOSS only on the selected variables to obtain the subdata, and since the active variables are almost always among the selected variables, CLASS subdata gives a larger determinant for the information matrix of the active variables than subdata obtained from competing methods. As a result, CLASS yields better parameter estimation for the selected variables and better prediction.</p>
<p>The computational complexity of LASSO for data of size <inline-formula id="j_nejsds36_ineq_104"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo>×</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi></mml:math><tex-math><![CDATA[$n\times p$]]></tex-math></alternatives></inline-formula> is <inline-formula id="j_nejsds36_ineq_105"><alternatives><mml:math>
<mml:mi mathvariant="italic">O</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$O({p^{3}}+{p^{2}}n)$]]></tex-math></alternatives></inline-formula>. Therefore, the computational complexity of CLASS for steps 1–8 of Algorithm <xref rid="j_nejsds36_fig_001">1</xref> is <inline-formula id="j_nejsds36_ineq_106"><alternatives><mml:math>
<mml:mi mathvariant="italic">O</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">m</mml:mi>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mi mathvariant="italic">a</mml:mi>
<mml:mi mathvariant="italic">m</mml:mi>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mi mathvariant="italic">l</mml:mi>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$O(ntimes({p^{3}}+{p^{2}}nsample))$]]></tex-math></alternatives></inline-formula>. Step 9’s computational complexity is <inline-formula id="j_nejsds36_ineq_107"><alternatives><mml:math>
<mml:mi mathvariant="italic">O</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$O(n{p^{\ast }})$]]></tex-math></alternatives></inline-formula>, where <inline-formula id="j_nejsds36_ineq_108"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${p^{\ast }}$]]></tex-math></alternatives></inline-formula> is the number of selected variables in step 8. Finally, for step 10, it is <inline-formula id="j_nejsds36_ineq_109"><alternatives><mml:math>
<mml:mi mathvariant="italic">O</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$O(k{p^{\ast 2}})$]]></tex-math></alternatives></inline-formula>. Assuming that <inline-formula id="j_nejsds36_ineq_110"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo stretchy="false">≪</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi></mml:math><tex-math><![CDATA[${p^{\ast }}\ll p$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_nejsds36_ineq_111"><alternatives><mml:math>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mo mathvariant="normal">&lt;</mml:mo>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mi mathvariant="italic">a</mml:mi>
<mml:mi mathvariant="italic">m</mml:mi>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mi mathvariant="italic">l</mml:mi>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mo>×</mml:mo>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">m</mml:mi>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mi mathvariant="italic">s</mml:mi></mml:math><tex-math><![CDATA[$k\lt nsample\times ntimes$]]></tex-math></alternatives></inline-formula>, the overall computational complexity of CLASS is <inline-formula id="j_nejsds36_ineq_112"><alternatives><mml:math>
<mml:mi mathvariant="italic">O</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">m</mml:mi>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mi mathvariant="italic">a</mml:mi>
<mml:mi mathvariant="italic">m</mml:mi>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mi mathvariant="italic">l</mml:mi>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$O(ntimes({p^{3}}+{p^{2}}nsample)+n{p^{\ast }})$]]></tex-math></alternatives></inline-formula>. With <inline-formula id="j_nejsds36_ineq_113"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">m</mml:mi>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>100</mml:mn></mml:math><tex-math><![CDATA[$ntimes=100$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_nejsds36_ineq_114"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mi mathvariant="italic">a</mml:mi>
<mml:mi mathvariant="italic">m</mml:mi>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mi mathvariant="italic">l</mml:mi>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mo mathvariant="normal">&lt;</mml:mo>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:mn>100</mml:mn></mml:math><tex-math><![CDATA[$nsample\lt n/100$]]></tex-math></alternatives></inline-formula>, CLASS is much faster than LASSO on the full data. CLASS is computationally more expensive than the method of [<xref ref-type="bibr" rid="j_nejsds36_ref_044">44</xref>] for moderately large <italic>n</italic>, but with appropriate values of <inline-formula id="j_nejsds36_ineq_115"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mi mathvariant="italic">a</mml:mi>
<mml:mi mathvariant="italic">m</mml:mi>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mi mathvariant="italic">l</mml:mi>
<mml:mi mathvariant="italic">e</mml:mi></mml:math><tex-math><![CDATA[$nsample$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_nejsds36_ineq_116"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">m</mml:mi>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mi mathvariant="italic">s</mml:mi></mml:math><tex-math><![CDATA[$ntimes$]]></tex-math></alternatives></inline-formula>, it becomes less expensive for very large <italic>n</italic> and large <italic>p</italic>.</p>
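<p>To make this comparison concrete, the following Python sketch (ours, for illustration only; all numeric values are hypothetical examples, and the function names are not from the paper) evaluates the two dominant cost expressions: <monospace>ntimes</monospace>(<italic>p</italic><sup>3</sup> + <italic>p</italic><sup>2</sup> <monospace>nsample</monospace>) + <italic>n</italic><italic>p</italic><sup>∗</sup> for CLASS and <italic>p</italic><sup>3</sup> + <italic>p</italic><sup>2</sup><italic>n</italic> for LASSO on the full data.</p>

```python
# Illustration (not the authors' code): compare the dominant operation counts
# from the complexity analysis above, using hypothetical example values.

def class_cost(n, p, ntimes, nsample, p_star):
    # Steps 1-8: O(ntimes * (p^3 + p^2 * nsample)); step 9: O(n * p_star).
    return ntimes * (p**3 + p**2 * nsample) + n * p_star

def full_lasso_cost(n, p):
    # LASSO on the full n x p data: O(p^3 + p^2 * n).
    return p**3 + p**2 * n

n, p = 10**8, 500            # very large n, moderate p
ntimes, nsample = 100, 500   # nsample < n/100, as in the text
p_star = 50                  # variables selected in step 8 (p_star << p)

assert class_cost(n, p, ntimes, nsample, p_star) < full_lasso_cost(n, p)
```

<p>With these example values, the CLASS cost is roughly three orders of magnitude smaller, consistent with the claim that CLASS is much faster than full-data LASSO when <monospace>nsample</monospace> &lt; <italic>n</italic>/100.</p>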
</sec>
</sec>
<sec id="j_nejsds36_s_014">
<label>4</label>
<title>Numerical Experiments</title>
<p>In this section, we compare the performance of CLASS with that of competing methods through simulation studies. The comparison focuses on variable selection and on prediction accuracy for test data.</p>
<p>Data are generated from the linear model (<xref rid="j_nejsds36_eq_001">2.1</xref>) for <inline-formula id="j_nejsds36_ineq_117"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>500</mml:mn></mml:math><tex-math><![CDATA[$p=500$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_nejsds36_ineq_118"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>5000</mml:mn></mml:math><tex-math><![CDATA[$p=5000$]]></tex-math></alternatives></inline-formula>. Let <inline-formula id="j_nejsds36_ineq_119"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${p_{1}}$]]></tex-math></alternatives></inline-formula> be the number of true active variables; without loss of generality, we take the first <inline-formula id="j_nejsds36_ineq_120"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${p_{1}}$]]></tex-math></alternatives></inline-formula> variables to be active. The coefficients of the active variables and independent error terms are generated from <inline-formula id="j_nejsds36_ineq_121"><alternatives><mml:math>
<mml:mi mathvariant="italic">N</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>5</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$N(5,1)$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_nejsds36_ineq_122"><alternatives><mml:math>
<mml:mi mathvariant="italic">N</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$N(0,1)$]]></tex-math></alternatives></inline-formula>, respectively, for both <inline-formula id="j_nejsds36_ineq_123"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>500</mml:mn></mml:math><tex-math><![CDATA[$p=500$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_nejsds36_ineq_124"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>5000</mml:mn></mml:math><tex-math><![CDATA[$p=5000$]]></tex-math></alternatives></inline-formula>. As in [<xref ref-type="bibr" rid="j_nejsds36_ref_041">41</xref>], we also generate error terms from <inline-formula id="j_nejsds36_ineq_125"><alternatives><mml:math>
<mml:mi mathvariant="italic">N</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>9</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$N(0,9)$]]></tex-math></alternatives></inline-formula>, with the coefficients of the active variables equal to 1, for <inline-formula id="j_nejsds36_ineq_126"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>500</mml:mn></mml:math><tex-math><![CDATA[$p=500$]]></tex-math></alternatives></inline-formula>. Let <inline-formula id="j_nejsds36_ineq_127"><alternatives><mml:math>
<mml:mi mathvariant="bold">Σ</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">Σ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\boldsymbol{\Sigma }=({\boldsymbol{\Sigma }_{ij}})$]]></tex-math></alternatives></inline-formula> be a <inline-formula id="j_nejsds36_ineq_128"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>×</mml:mo>
<mml:mi mathvariant="italic">p</mml:mi></mml:math><tex-math><![CDATA[$p\times p$]]></tex-math></alternatives></inline-formula> correlation matrix for the <italic>p</italic> variables. For <inline-formula id="j_nejsds36_ineq_129"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>500</mml:mn></mml:math><tex-math><![CDATA[$p=500$]]></tex-math></alternatives></inline-formula>, we consider uncorrelated variables (<inline-formula id="j_nejsds36_ineq_130"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">Σ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn></mml:math><tex-math><![CDATA[${\boldsymbol{\Sigma }_{ij}}=0$]]></tex-math></alternatives></inline-formula>), a constant correlation of 0.5 (<inline-formula id="j_nejsds36_ineq_131"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">Σ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>0.5</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">I</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo stretchy="false">≠</mml:mo>
<mml:mi mathvariant="italic">j</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${\boldsymbol{\Sigma }_{ij}}={0.5^{I(i\ne j)}}$]]></tex-math></alternatives></inline-formula>), and a random correlation matrix generated with the R package <monospace>randcorr</monospace>. For <inline-formula id="j_nejsds36_ineq_132"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>5000</mml:mn></mml:math><tex-math><![CDATA[$p=5000$]]></tex-math></alternatives></inline-formula>, we do not consider a random correlation matrix because, for <inline-formula id="j_nejsds36_ineq_133"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>500</mml:mn></mml:math><tex-math><![CDATA[$p=500$]]></tex-math></alternatives></inline-formula>, the results under the random correlation matrix were very similar to those under the other two correlation structures. Variables <inline-formula id="j_nejsds36_ineq_134"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\mathbf{x}_{i}}$]]></tex-math></alternatives></inline-formula>’s are generated according to the following scenarios:</p>
<list>
<list-item id="j_nejsds36_li_003">
<label>•</label>
<p><bold>Normal:</bold> <inline-formula id="j_nejsds36_ineq_135"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\mathbf{x}_{i}}$]]></tex-math></alternatives></inline-formula>’s have the multivariate normal distribution <inline-formula id="j_nejsds36_ineq_136"><alternatives><mml:math>
<mml:mi mathvariant="italic">N</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="bold">Σ</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$N(0,\boldsymbol{\Sigma })$]]></tex-math></alternatives></inline-formula></p>
</list-item>
<list-item id="j_nejsds36_li_004">
<label>•</label>
<p><bold>LogNormal:</bold> <inline-formula id="j_nejsds36_ineq_137"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\mathbf{x}_{i}}$]]></tex-math></alternatives></inline-formula>’s have the multivariate lognormal distribution <inline-formula id="j_nejsds36_ineq_138"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mi mathvariant="italic">N</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="bold">Σ</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$LN(0,\boldsymbol{\Sigma })$]]></tex-math></alternatives></inline-formula></p>
</list-item>
<list-item id="j_nejsds36_li_005">
<label>•</label>
<p><inline-formula id="j_nejsds36_ineq_139"><alternatives><mml:math><mml:mstyle mathvariant="bold">
<mml:msub>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:mstyle></mml:math><tex-math><![CDATA[$\mathbf{{t_{2}}}$]]></tex-math></alternatives></inline-formula><bold>:</bold> <inline-formula id="j_nejsds36_ineq_140"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\mathbf{x}_{i}}$]]></tex-math></alternatives></inline-formula>’s have the multivariate <italic>t</italic> distribution <inline-formula id="j_nejsds36_ineq_141"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="bold">Σ</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${t_{2}}(0,\boldsymbol{\Sigma })$]]></tex-math></alternatives></inline-formula></p>
</list-item>
<list-item id="j_nejsds36_li_006">
<label>•</label>
<p><bold>Mixture:</bold> <inline-formula id="j_nejsds36_ineq_142"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\mathbf{x}_{i}}$]]></tex-math></alternatives></inline-formula>’s follow a mixture of four distributions, <inline-formula id="j_nejsds36_ineq_143"><alternatives><mml:math>
<mml:mi mathvariant="italic">N</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="bold">Σ</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$N(0,\boldsymbol{\Sigma })$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_144"><alternatives><mml:math>
<mml:mi mathvariant="italic">L</mml:mi>
<mml:mi mathvariant="italic">N</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="bold">Σ</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$LN(0,\boldsymbol{\Sigma })$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_145"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="bold">Σ</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${t_{2}}(0,\boldsymbol{\Sigma })$]]></tex-math></alternatives></inline-formula>, and <inline-formula id="j_nejsds36_ineq_146"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="bold">Σ</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${t_{3}}(0,\boldsymbol{\Sigma })$]]></tex-math></alternatives></inline-formula>, with equal proportions.</p>
</list-item>
</list>
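<p>As an illustration of one highlighted scenario (ours, not the authors' code), the Python sketch below generates data for the <italic>p</italic> = 500 Normal case with constant correlation <bold>Σ</bold><sub><italic>ij</italic></sub> = 0.5 for <italic>i</italic> ≠ <italic>j</italic>, <italic>p</italic><sub>1</sub> = 50 active variables, coefficients drawn from <italic>N</italic>(5, 1), and error variance 1; the sample size <italic>n</italic> is an arbitrary example value.</p>

```python
# Sketch of one highlighted scenario (illustrative values; n is arbitrary):
# p = 500, p1 = 50 active variables, beta_act ~ N(5, 1), errors ~ N(0, 1),
# x_i ~ N(0, Sigma) with Sigma_ij = 0.5 for i != j and 1 on the diagonal.
import numpy as np

rng = np.random.default_rng(0)
n, p, p1 = 1000, 500, 50

Sigma = np.full((p, p), 0.5)   # constant off-diagonal correlation
np.fill_diagonal(Sigma, 1.0)

X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

beta = np.zeros(p)
beta[:p1] = rng.normal(5.0, 1.0, size=p1)    # first p1 variables are active

y = X @ beta + rng.normal(0.0, 1.0, size=n)  # sigma^2 = 1
```

<p>The LogNormal, <italic>t</italic><sub>2</sub>, and Mixture scenarios are obtained analogously by transforming or replacing the multivariate normal draws.</p>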
<table-wrap id="j_nejsds36_tab_001">
<label>Table 1</label>
<caption>
<p>Simulation scenarios, with <bold>highlighted</bold> scenarios presented in the main paper. Results for the remaining cases are relegated to the Supplementary Material.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin"><italic>p</italic></td>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin"><inline-formula id="j_nejsds36_ineq_147"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${p_{1}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin"><inline-formula id="j_nejsds36_ineq_148"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold-italic">β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">a</mml:mi>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\boldsymbol{\beta }_{act}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_nejsds36_ineq_149"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">σ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${\sigma ^{2}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin"><inline-formula id="j_nejsds36_ineq_150"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">∼</mml:mo></mml:math><tex-math><![CDATA[${\mathbf{x}_{i}}\sim $]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin"><inline-formula id="j_nejsds36_ineq_151"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">Σ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\boldsymbol{\Sigma }_{ij}}$]]></tex-math></alternatives></inline-formula></td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: center">500</td>
<td style="vertical-align: top; text-align: center">10, 25,</td>
<td style="vertical-align: top; text-align: center"><inline-formula id="j_nejsds36_ineq_152"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold-italic">β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">a</mml:mi>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">∼</mml:mo>
<mml:mi mathvariant="italic">N</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>5</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${\boldsymbol{\beta }_{act}}\sim N(5,1)$]]></tex-math></alternatives></inline-formula>,</td>
<td style="vertical-align: top; text-align: center"><bold>Normal</bold>,</td>
<td style="vertical-align: top; text-align: center"><inline-formula id="j_nejsds36_ineq_153"><alternatives><mml:math>
<mml:mn mathvariant="bold">0</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn mathvariant="bold">0.5</mml:mn></mml:math><tex-math><![CDATA[$\mathbf{0},\mathbf{0.5}$]]></tex-math></alternatives></inline-formula>,</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center"><bold>50</bold></td>
<td style="vertical-align: top; text-align: center"><inline-formula id="j_nejsds36_ineq_154"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">σ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn></mml:math><tex-math><![CDATA[${\sigma ^{2}}=1$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center"><bold>LogNormal</bold>,</td>
<td style="vertical-align: top; text-align: center"><inline-formula id="j_nejsds36_ineq_155"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${r_{ij}}$]]></tex-math></alternatives></inline-formula></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center"><inline-formula id="j_nejsds36_ineq_156"><alternatives><mml:math><mml:mstyle mathvariant="bold">
<mml:msub>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:mstyle></mml:math><tex-math><![CDATA[$\mathbf{{t_{2}}}$]]></tex-math></alternatives></inline-formula>, <bold>Mixture</bold></td>
<td style="vertical-align: top; text-align: center"/>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">500</td>
<td style="vertical-align: top; text-align: center">10, 25,</td>
<td style="vertical-align: top; text-align: center"><inline-formula id="j_nejsds36_ineq_157"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold-italic">β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">a</mml:mi>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn></mml:math><tex-math><![CDATA[${\boldsymbol{\beta }_{act}}=1$]]></tex-math></alternatives></inline-formula>,</td>
<td style="vertical-align: top; text-align: center">Normal,</td>
<td style="vertical-align: top; text-align: center"><inline-formula id="j_nejsds36_ineq_158"><alternatives><mml:math>
<mml:mn>0</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>0.5</mml:mn></mml:math><tex-math><![CDATA[$0,0.5$]]></tex-math></alternatives></inline-formula>,</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center">50</td>
<td style="vertical-align: top; text-align: center"><inline-formula id="j_nejsds36_ineq_159"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">σ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>=</mml:mo>
<mml:mn>9</mml:mn></mml:math><tex-math><![CDATA[${\sigma ^{2}}=9$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">LogNormal</td>
<td style="vertical-align: top; text-align: center"><inline-formula id="j_nejsds36_ineq_160"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${r_{ij}}$]]></tex-math></alternatives></inline-formula></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center"><inline-formula id="j_nejsds36_ineq_161"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${t_{2}}$]]></tex-math></alternatives></inline-formula>, Mixture</td>
<td style="vertical-align: top; text-align: center"/>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">5000</td>
<td style="vertical-align: top; text-align: center"><bold>25</bold>, 50,</td>
<td style="vertical-align: top; text-align: center"><inline-formula id="j_nejsds36_ineq_162"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold-italic">β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">a</mml:mi>
<mml:mi mathvariant="italic">c</mml:mi>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">∼</mml:mo>
<mml:mi mathvariant="italic">N</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>5</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[${\boldsymbol{\beta }_{act}}\sim N(5,1)$]]></tex-math></alternatives></inline-formula>,</td>
<td style="vertical-align: top; text-align: center">Normal,</td>
<td style="vertical-align: top; text-align: center"><inline-formula id="j_nejsds36_ineq_163"><alternatives><mml:math>
<mml:mn>0</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn mathvariant="bold">0.5</mml:mn></mml:math><tex-math><![CDATA[$0,\mathbf{0.5}$]]></tex-math></alternatives></inline-formula></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center">75</td>
<td style="vertical-align: top; text-align: center"><inline-formula id="j_nejsds36_ineq_164"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">σ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn></mml:math><tex-math><![CDATA[${\sigma ^{2}}=1$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">LogNormal</td>
<td style="vertical-align: top; text-align: center"/>
</tr>
<tr>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin"/>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin"/>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin"/>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin"><inline-formula id="j_nejsds36_ineq_165"><alternatives><mml:math><mml:mstyle mathvariant="bold">
<mml:msub>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:mstyle></mml:math><tex-math><![CDATA[$\mathbf{{t_{2}}}$]]></tex-math></alternatives></inline-formula>, Mixture</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin"/>
</tr>
</tbody>
</table>
</table-wrap>
<p>The simulation scenarios that we consider are summarized in Table <xref rid="j_nejsds36_tab_001">1</xref>. We relegate most results to the Supplementary Material and present in the main paper only the three cases highlighted in Table <xref rid="j_nejsds36_tab_001">1</xref>. Methods are evaluated on two characteristics:</p>
<list>
<list-item id="j_nejsds36_li_007">
<label>•</label>
<p>Variable selection: For variable selection, we consider average power and average error. Power is the proportion of active variables correctly identified as active, whereas error is the proportion of inactive variables incorrectly declared active; a method with higher power and lower error is therefore preferred.</p>
</list-item>
<list-item id="j_nejsds36_li_008">
<label>•</label>
<p>Prediction accuracy: We use the mean squared error (MSE) on test data, 
<disp-formula id="j_nejsds36_eq_008">
<label>(4.1)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">M</mml:mi>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mi mathvariant="italic">E</mml:mi>
<mml:mo>=</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">∑</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:munderover>
<mml:msup>
<mml:mrow>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true" mathvariant="normal">(</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mspace width="0.2778em"/>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mi mathvariant="bold-italic">β</mml:mi>
<mml:mo>−</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mspace width="0.2778em"/>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">T</mml:mi>
</mml:mrow>
</mml:msubsup><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold-italic">β</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">ˆ</mml:mo></mml:mover>
<mml:mo maxsize="1.19em" minsize="1.19em" fence="true" mathvariant="normal">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ MSE=\frac{1}{{n_{test}}}{\sum \limits_{i=1}^{{n_{test}}}}{\big({\mathbf{z}_{i,\hspace{0.2778em}test}^{T}}\boldsymbol{\beta }-{\mathbf{z}_{i,\hspace{0.2778em}test}^{T}}\hat{\boldsymbol{\beta }}\big)^{2}},\]]]></tex-math></alternatives>
</disp-formula> 
where <inline-formula id="j_nejsds36_ineq_166"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${n_{test}}$]]></tex-math></alternatives></inline-formula> is the number of observations in the test data, <inline-formula id="j_nejsds36_ineq_167"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mspace width="0.2778em"/>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\mathbf{z}_{i,\hspace{0.2778em}test}}$]]></tex-math></alternatives></inline-formula> corresponds to the <italic>i</italic>th data point in the test data, and <inline-formula id="j_nejsds36_ineq_168"><alternatives><mml:math><mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold-italic">β</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">ˆ</mml:mo></mml:mover></mml:math><tex-math><![CDATA[$\hat{\boldsymbol{\beta }}$]]></tex-math></alternatives></inline-formula> consists of OLS estimates from a reduced model with zeros added for parameters of variables that were not selected for the reduced model. We set <inline-formula id="j_nejsds36_ineq_169"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>1000</mml:mn></mml:math><tex-math><![CDATA[${n_{test}}=1000$]]></tex-math></alternatives></inline-formula> and use the same joint variable distribution for the full data and test data.</p>
</list-item>
</list>
<fig id="j_nejsds36_fig_002">
<label>Figure 1</label>
<caption>
<p>Variable selection performance for <inline-formula id="j_nejsds36_ineq_170"><alternatives><mml:math>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1000</mml:mn></mml:math><tex-math><![CDATA[$k=1000$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_171"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>500</mml:mn></mml:math><tex-math><![CDATA[$p=500$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_172"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>50</mml:mn></mml:math><tex-math><![CDATA[${p_{1}}=50$]]></tex-math></alternatives></inline-formula>, and <inline-formula id="j_nejsds36_ineq_173"><alternatives><mml:math>
<mml:mi mathvariant="bold">Σ</mml:mi>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">I</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[$\boldsymbol{\Sigma }={I_{p}}$]]></tex-math></alternatives></inline-formula>. Average power and error are shown by solid lines and dashed lines, respectively.</p>
</caption>
<graphic xlink:href="nejsds36_g002.jpg"/>
</fig>
<p>We compare CLASS to four other approaches: 
<list>
<list-item id="j_nejsds36_li_009">
<label>1.</label>
<p>Fitting the linear regression model with all <italic>p</italic> variables using LASSO and the full data (FULL)</p>
</list-item>
<list-item id="j_nejsds36_li_010">
<label>2.</label>
<p>Fitting the linear regression model with all <italic>p</italic> variables using LASSO and subdata from a uniform sample of size <italic>k</italic> (UNI)</p>
</list-item>
<list-item id="j_nejsds36_li_011">
<label>3.</label>
<p>Fitting the linear regression model with all <italic>p</italic> variables using LASSO and <italic>D</italic>-optimal IBOSS subdata obtained by using all <italic>p</italic> variables (IBOSS) [<xref ref-type="bibr" rid="j_nejsds36_ref_044">44</xref>]</p>
</list-item>
<list-item id="j_nejsds36_li_012">
<label>4.</label>
<p>Fitting the linear regression model with all <italic>p</italic> variables using LASSO and <italic>D</italic>-optimal IBOSS subdata obtained by using <inline-formula id="j_nejsds36_ineq_174"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>+</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${p^{+}}$]]></tex-math></alternatives></inline-formula> variables selected by applying SIS (SIS(<inline-formula id="j_nejsds36_ineq_175"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>+</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${p^{+}}$]]></tex-math></alternatives></inline-formula>)-IBOSS) [<xref ref-type="bibr" rid="j_nejsds36_ref_044">44</xref>]</p>
</list-item>
</list> 
For all four of the above methods, we obtain improved MSE values by using OLS estimators in a linear regression model based only on the variables selected by LASSO. We compare the performance of CLASS with these improved MSEs. Adjusting estimators for a model selected by LASSO is not without precedent; for example, Relaxed LASSO [<xref ref-type="bibr" rid="j_nejsds36_ref_027">27</xref>] adjusts the LASSO estimators by using OLS estimators with a shrinkage parameter. The <monospace>glmnet</monospace> function in R is used to apply LASSO. Ten-fold cross-validation is used to find the value of the tuning parameter, and the LASSO solution corresponding to <monospace>lambda.1se</monospace> is selected. For CLASS, we base variable selection and MSE computation on the fitted model from step 10 of Algorithm <xref rid="j_nejsds36_fig_001">1</xref>. The authors of [<xref ref-type="bibr" rid="j_nejsds36_ref_044">44</xref>] showed that their IBOSS and SIS(<inline-formula id="j_nejsds36_ineq_176"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>+</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${p^{+}}$]]></tex-math></alternatives></inline-formula>)-IBOSS methods are better than the leverage-based sampling methods, which are therefore omitted from our comparisons.</p>
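<p>The LASSO-then-OLS adjustment described above can be sketched as follows. We use scikit-learn's <monospace>LassoCV</monospace> in Python as a stand-in for <monospace>glmnet</monospace>; note that <monospace>LassoCV</monospace> returns the CV-minimizing penalty rather than <monospace>lambda.1se</monospace>, so this is an approximation of the procedure, not a reimplementation:</p>

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

def lasso_then_ols(X, y, cv=10):
    """Variable selection by cross-validated LASSO, followed by an OLS
    refit restricted to the selected variables; coefficients of the
    unselected variables are set to zero.  (Uses the CV-minimising
    penalty, not glmnet's lambda.1se.)"""
    lasso = LassoCV(cv=cv).fit(X, y)
    selected = np.flatnonzero(lasso.coef_)
    beta_hat = np.zeros(X.shape[1])
    if selected.size:
        ols = LinearRegression().fit(X[:, selected], y)
        beta_hat[selected] = ols.coef_
    return beta_hat, selected
```

The same refit applies unchanged whether <monospace>X</monospace> is the full data or any of the subdata sets (UNI, IBOSS, SIS-IBOSS).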
<p>For each scenario in Table <xref rid="j_nejsds36_tab_001">1</xref>, we generate the full data and test data 100 times. For each method considered, the average power, error, and MSE over the 100 replications are reported in the subsequent figures.</p>
<fig id="j_nejsds36_fig_003">
<label>Figure 2</label>
<caption>
<p>MSE for <inline-formula id="j_nejsds36_ineq_177"><alternatives><mml:math>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1000</mml:mn></mml:math><tex-math><![CDATA[$k=1000$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_178"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>500</mml:mn></mml:math><tex-math><![CDATA[$p=500$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_179"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>50</mml:mn></mml:math><tex-math><![CDATA[${p_{1}}=50$]]></tex-math></alternatives></inline-formula>, and <inline-formula id="j_nejsds36_ineq_180"><alternatives><mml:math>
<mml:mi mathvariant="bold">Σ</mml:mi>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">I</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[$\boldsymbol{\Sigma }={I_{p}}$]]></tex-math></alternatives></inline-formula>.</p>
</caption>
<graphic xlink:href="nejsds36_g003.jpg"/>
</fig>
<p>Figures <xref rid="j_nejsds36_fig_002">1</xref> (variable selection) and <xref rid="j_nejsds36_fig_003">2</xref> (prediction) compare the five methods when <inline-formula id="j_nejsds36_ineq_181"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>500</mml:mn></mml:math><tex-math><![CDATA[$p=500$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_182"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>50</mml:mn></mml:math><tex-math><![CDATA[${p_{1}}=50$]]></tex-math></alternatives></inline-formula>, and <inline-formula id="j_nejsds36_ineq_183"><alternatives><mml:math>
<mml:mi mathvariant="bold">Σ</mml:mi>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">I</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[$\boldsymbol{\Sigma }={I_{p}}$]]></tex-math></alternatives></inline-formula>, the identity matrix of order <italic>p</italic>. The regression coefficients for the active variables were generated from the <inline-formula id="j_nejsds36_ineq_184"><alternatives><mml:math>
<mml:mi mathvariant="italic">N</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>5</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$N(5,1)$]]></tex-math></alternatives></inline-formula> distribution and the error variance was <inline-formula id="j_nejsds36_ineq_185"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">σ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn></mml:math><tex-math><![CDATA[${\sigma ^{2}}=1$]]></tex-math></alternatives></inline-formula>. The panels of the figures correspond to the four different joint variable distributions: Normal, logNormal, <inline-formula id="j_nejsds36_ineq_186"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${t_{2}}$]]></tex-math></alternatives></inline-formula>, and Mixture. The full data sizes <italic>n</italic> are <inline-formula id="j_nejsds36_ineq_187"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${10^{4}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_188"><alternatives><mml:math>
<mml:mn>2</mml:mn>
<mml:mo>∗</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[$2\ast {10^{4}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_189"><alternatives><mml:math>
<mml:mn>4</mml:mn>
<mml:mo>∗</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[$4\ast {10^{4}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_190"><alternatives><mml:math>
<mml:mn>8</mml:mn>
<mml:mo>∗</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[$8\ast {10^{4}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_191"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${10^{5}}$]]></tex-math></alternatives></inline-formula>, and the subdata size is fixed at <inline-formula id="j_nejsds36_ineq_192"><alternatives><mml:math>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1000</mml:mn></mml:math><tex-math><![CDATA[$k=1000$]]></tex-math></alternatives></inline-formula>. For SIS(<inline-formula id="j_nejsds36_ineq_193"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>+</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${p^{+}}$]]></tex-math></alternatives></inline-formula>)-IBOSS, we use <inline-formula id="j_nejsds36_ineq_194"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>+</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo>=</mml:mo>
<mml:mn>100</mml:mn></mml:math><tex-math><![CDATA[${p^{+}}=100$]]></tex-math></alternatives></inline-formula> and for Algorithm <xref rid="j_nejsds36_fig_001">1</xref>, we use <inline-formula id="j_nejsds36_ineq_195"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mi mathvariant="italic">a</mml:mi>
<mml:mi mathvariant="italic">m</mml:mi>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mi mathvariant="italic">l</mml:mi>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1000</mml:mn></mml:math><tex-math><![CDATA[$nsample=1000$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_nejsds36_ineq_196"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">m</mml:mi>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>100</mml:mn></mml:math><tex-math><![CDATA[$ntimes=100$]]></tex-math></alternatives></inline-formula>. All methods have a similar power (close to 100%) for all distributions, but, especially for heavier-tailed distributions such as <inline-formula id="j_nejsds36_ineq_197"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${t_{2}}$]]></tex-math></alternatives></inline-formula> and Mixture, CLASS has a smaller error. This is not surprising because all other methods select variables based on a single LASSO run, thereby consistently declaring a large number of inactive variables as active. CLASS also has the smallest MSE among all the subdata selection methods, coming close to the full-data MSE for heavier-tailed distributions.</p>
<p>For Figures <xref rid="j_nejsds36_fig_004">3</xref> and <xref rid="j_nejsds36_fig_005">4</xref>, the only change is that now <inline-formula id="j_nejsds36_ineq_198"><alternatives><mml:math>
<mml:mi mathvariant="bold">Σ</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>.</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">I</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo stretchy="false">≠</mml:mo>
<mml:mi mathvariant="italic">j</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\boldsymbol{\Sigma }=(0.{5^{I(i\ne j)}})$]]></tex-math></alternatives></inline-formula>. The MSE performance in Figure <xref rid="j_nejsds36_fig_005">4</xref> is similar to that in Figure <xref rid="j_nejsds36_fig_003">2</xref>, except that CLASS outperforms prediction using LASSO-OLS on the full data for heavy-tailed distributions. For the variable selection performance, CLASS stands out as the big winner when variables are correlated. In Figure <xref rid="j_nejsds36_fig_004">3</xref>, all methods except CLASS select too many variables (high error) as a result of using LASSO for variable selection.</p>
<fig id="j_nejsds36_fig_004">
<label>Figure 3</label>
<caption>
<p>Variable selection performance for <inline-formula id="j_nejsds36_ineq_199"><alternatives><mml:math>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1000</mml:mn></mml:math><tex-math><![CDATA[$k=1000$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_200"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>500</mml:mn></mml:math><tex-math><![CDATA[$p=500$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_201"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>50</mml:mn></mml:math><tex-math><![CDATA[${p_{1}}=50$]]></tex-math></alternatives></inline-formula>, and <inline-formula id="j_nejsds36_ineq_202"><alternatives><mml:math>
<mml:mi mathvariant="bold">Σ</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>.</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">I</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo stretchy="false">≠</mml:mo>
<mml:mi mathvariant="italic">j</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\boldsymbol{\Sigma }=(0.{5^{I(i\ne j)}})$]]></tex-math></alternatives></inline-formula>. The solid lines represent the power whereas the dashed lines represent the error.</p>
</caption>
<graphic xlink:href="nejsds36_g004.jpg"/>
</fig>
<fig id="j_nejsds36_fig_005">
<label>Figure 4</label>
<caption>
<p>MSE for <inline-formula id="j_nejsds36_ineq_203"><alternatives><mml:math>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1000</mml:mn></mml:math><tex-math><![CDATA[$k=1000$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_204"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>500</mml:mn></mml:math><tex-math><![CDATA[$p=500$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_205"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>50</mml:mn></mml:math><tex-math><![CDATA[${p_{1}}=50$]]></tex-math></alternatives></inline-formula>, and <inline-formula id="j_nejsds36_ineq_206"><alternatives><mml:math>
<mml:mi mathvariant="bold">Σ</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>.</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">I</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo stretchy="false">≠</mml:mo>
<mml:mi mathvariant="italic">j</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\boldsymbol{\Sigma }=(0.{5^{I(i\ne j)}})$]]></tex-math></alternatives></inline-formula>.</p>
</caption>
<graphic xlink:href="nejsds36_g005.jpg"/>
</fig>
<p>In Figure <xref rid="j_nejsds36_fig_006">5</xref>, we keep <italic>n</italic> fixed at <inline-formula id="j_nejsds36_ineq_207"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${10^{5}}$]]></tex-math></alternatives></inline-formula> and vary the subdata size <italic>k</italic> from 1000 to 5000. We set <inline-formula id="j_nejsds36_ineq_208"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>500</mml:mn></mml:math><tex-math><![CDATA[$p=500$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_209"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>50</mml:mn></mml:math><tex-math><![CDATA[${p_{1}}=50$]]></tex-math></alternatives></inline-formula>, the joint variable distribution is <inline-formula id="j_nejsds36_ineq_210"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${t_{2}}$]]></tex-math></alternatives></inline-formula>, and <inline-formula id="j_nejsds36_ineq_211"><alternatives><mml:math>
<mml:mi mathvariant="bold">Σ</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>.</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">I</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo stretchy="false">≠</mml:mo>
<mml:mi mathvariant="italic">j</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\boldsymbol{\Sigma }=(0.{5^{I(i\ne j)}})$]]></tex-math></alternatives></inline-formula>. As expected, all methods perform better on variable selection and MSE as the sample size <italic>k</italic> increases. CLASS continues to perform better than LASSO-OLS on the full data because the latter declares too many inactive variables as active.</p>
<fig id="j_nejsds36_fig_006">
<label>Figure 5</label>
<caption>
<p>Variable selection and MSE for <inline-formula id="j_nejsds36_ineq_212"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[$n={10^{5}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_213"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>500</mml:mn></mml:math><tex-math><![CDATA[$p=500$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_214"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${t_{2}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_215"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>50</mml:mn></mml:math><tex-math><![CDATA[${p_{1}}=50$]]></tex-math></alternatives></inline-formula>, and <inline-formula id="j_nejsds36_ineq_216"><alternatives><mml:math>
<mml:mi mathvariant="bold">Σ</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>.</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">I</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo stretchy="false">≠</mml:mo>
<mml:mi mathvariant="italic">j</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\boldsymbol{\Sigma }=(0.{5^{I(i\ne j)}})$]]></tex-math></alternatives></inline-formula>.</p>
</caption>
<graphic xlink:href="nejsds36_g006.jpg"/>
</fig>
<p>Finally, Figure <xref rid="j_nejsds36_fig_007">6</xref> presents the results when <inline-formula id="j_nejsds36_ineq_217"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>5000</mml:mn></mml:math><tex-math><![CDATA[$p=5000$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_218"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>50</mml:mn></mml:math><tex-math><![CDATA[${p_{1}}=50$]]></tex-math></alternatives></inline-formula> and the joint variable distribution is <inline-formula id="j_nejsds36_ineq_219"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${t_{2}}$]]></tex-math></alternatives></inline-formula>. We keep <italic>k</italic> fixed at 1000, use <inline-formula id="j_nejsds36_ineq_220"><alternatives><mml:math>
<mml:mi mathvariant="bold">Σ</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>.</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">I</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo stretchy="false">≠</mml:mo>
<mml:mi mathvariant="italic">j</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\boldsymbol{\Sigma }=(0.{5^{I(i\ne j)}})$]]></tex-math></alternatives></inline-formula>, and change the full data size <italic>n</italic> from <inline-formula id="j_nejsds36_ineq_221"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${10^{4}}$]]></tex-math></alternatives></inline-formula> to <inline-formula id="j_nejsds36_ineq_222"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${10^{5}}$]]></tex-math></alternatives></inline-formula>. Since applying IBOSS to <inline-formula id="j_nejsds36_ineq_223"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>5000</mml:mn></mml:math><tex-math><![CDATA[$p=5000$]]></tex-math></alternatives></inline-formula> variables is not possible when <italic>k</italic> = 1000 (<italic>D</italic>-optimal IBOSS requires a subdata size of at least 2<italic>p</italic>), we use <inline-formula id="j_nejsds36_ineq_224"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>+</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo>=</mml:mo>
<mml:mn>100</mml:mn></mml:math><tex-math><![CDATA[${p^{+}}=100$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_nejsds36_ineq_225"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>+</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo>=</mml:mo>
<mml:mn>250</mml:mn></mml:math><tex-math><![CDATA[${p^{+}}=250$]]></tex-math></alternatives></inline-formula> for SIS(<inline-formula id="j_nejsds36_ineq_226"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>+</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${p^{+}}$]]></tex-math></alternatives></inline-formula>)-IBOSS. Similar to Figures <xref rid="j_nejsds36_fig_004">3</xref> and <xref rid="j_nejsds36_fig_005">4</xref>, CLASS clearly outperforms other methods on both variable selection and prediction accuracy.</p>
<fig id="j_nejsds36_fig_007">
<label>Figure 6</label>
<caption>
<p>Variable selection and MSE for <inline-formula id="j_nejsds36_ineq_227"><alternatives><mml:math>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1000</mml:mn></mml:math><tex-math><![CDATA[$k=1000$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_228"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>5000</mml:mn></mml:math><tex-math><![CDATA[$p=5000$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_229"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${t_{2}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_230"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>50</mml:mn></mml:math><tex-math><![CDATA[${p_{1}}=50$]]></tex-math></alternatives></inline-formula>, and <inline-formula id="j_nejsds36_ineq_231"><alternatives><mml:math>
<mml:mi mathvariant="bold">Σ</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>.</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">I</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo stretchy="false">≠</mml:mo>
<mml:mi mathvariant="italic">j</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\boldsymbol{\Sigma }=(0.{5^{I(i\ne j)}})$]]></tex-math></alternatives></inline-formula>.</p>
</caption>
<graphic xlink:href="nejsds36_g007.jpg"/>
</fig>
<p>For <inline-formula id="j_nejsds36_ineq_232"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>50</mml:mn></mml:math><tex-math><![CDATA[${p_{1}}=50$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_233"><alternatives><mml:math>
<mml:mi mathvariant="bold">Σ</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>.</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">I</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo stretchy="false">≠</mml:mo>
<mml:mi mathvariant="italic">j</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\boldsymbol{\Sigma }=(0.{5^{I(i\ne j)}})$]]></tex-math></alternatives></inline-formula>, joint variable distribution <inline-formula id="j_nejsds36_ineq_234"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${t_{2}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_nejsds36_ineq_235"><alternatives><mml:math>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1000</mml:mn></mml:math><tex-math><![CDATA[$k=1000$]]></tex-math></alternatives></inline-formula>, Tables <xref rid="j_nejsds36_tab_002">2</xref> and <xref rid="j_nejsds36_tab_003">3</xref> report computing times for different values of <italic>n</italic> and <italic>p</italic>, averaged over 50 iterations on a desktop with an AMD Ryzen Threadripper PRO 5955WX @ 4.00 GHz and 64 GB RAM. UNI, IBOSS, and SIS-IBOSS are fastest, with CLASS slower due to the repeated application of LASSO. However, for larger <italic>n</italic>, CLASS becomes faster than IBOSS and SIS-IBOSS. For example, Table <xref rid="j_nejsds36_tab_004">4</xref> shows that CLASS is faster for <inline-formula id="j_nejsds36_ineq_236"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>7</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[$n={10^{7}}$]]></tex-math></alternatives></inline-formula> with <inline-formula id="j_nejsds36_ineq_237"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>100</mml:mn></mml:math><tex-math><![CDATA[$p=100$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_238"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>50</mml:mn></mml:math><tex-math><![CDATA[${p_{1}}=50$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_239"><alternatives><mml:math>
<mml:mi mathvariant="bold">Σ</mml:mi>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">I</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[$\boldsymbol{\Sigma }={I_{p}}$]]></tex-math></alternatives></inline-formula>, joint variable distribution <inline-formula id="j_nejsds36_ineq_240"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${t_{2}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_nejsds36_ineq_241"><alternatives><mml:math>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1000</mml:mn></mml:math><tex-math><![CDATA[$k=1000$]]></tex-math></alternatives></inline-formula>. CLASS is also much faster than LASSO-OLS on the full data. Despite its slightly higher run-time for datasets with smaller <italic>n</italic>, CLASS is, as seen in Figures <xref rid="j_nejsds36_fig_002">1</xref>–<xref rid="j_nejsds36_fig_007">6</xref>, the clear winner in terms of statistical performance. In addition, in contrast to LASSO-OLS on the full data, CLASS can be applied when <italic>n</italic> is ultra-large.</p>
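<p>To illustrate why the cost of CLASS is driven by the repeated application of LASSO on small subsamples rather than by <italic>n</italic>, the following is a minimal sketch of repeated LASSO with frequency-based variable selection. It is an illustrative assumption, not the authors' implementation: the coordinate-descent solver, the majority-vote rule, and the parameter values (lam, ntimes, nsample) are chosen only for this toy example.</p>

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=50):
    # Minimal LASSO via cyclic coordinate descent, minimizing
    # (1/(2n)) * ||y - X b||^2 + lam * ||b||_1  (illustrative solver).
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y.astype(float).copy()          # residual for b = 0
    for _ in range(n_sweeps):
        for j in range(p):
            rho = X[:, j] @ r / n + col_sq[j] * b[j]
            b_new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r += X[:, j] * (b[j] - b_new)
            b[j] = b_new
    return b

rng = np.random.default_rng(1)
n, p, p1 = 5000, 50, 5                  # toy full data; first p1 variables active
X = rng.standard_normal((n, p))
beta = np.concatenate([np.ones(p1), np.zeros(p - p1)])
y = X @ beta + rng.standard_normal(n)

ntimes, nsample = 20, 500               # repeated LASSO on small random subsamples
freq = np.zeros(p)
for _ in range(ntimes):
    idx = rng.choice(n, nsample, replace=False)
    freq += lasso_cd(X[idx], y[idx], lam=0.1) != 0

selected = np.flatnonzero(freq >= ntimes / 2)   # keep frequently selected variables
print(selected)
```

<p>Because each LASSO fit uses only nsample rows, the total work scales with ntimes and nsample rather than with <italic>n</italic>, consistent with the roughly constant CLASS times in Tables <xref rid="j_nejsds36_tab_002">2</xref>–<xref rid="j_nejsds36_tab_004">4</xref>.</p>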
<table-wrap id="j_nejsds36_tab_002">
<label>Table 2</label>
<caption>
<p>CPU times (seconds) for different <italic>n</italic> with <inline-formula id="j_nejsds36_ineq_242"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>500</mml:mn></mml:math><tex-math><![CDATA[$p=500$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_243"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>50</mml:mn></mml:math><tex-math><![CDATA[${p_{1}}=50$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_244"><alternatives><mml:math>
<mml:mi mathvariant="bold">Σ</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>.</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">I</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo stretchy="false">≠</mml:mo>
<mml:mi mathvariant="italic">j</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\boldsymbol{\Sigma }=(0.{5^{I(i\ne j)}})$]]></tex-math></alternatives></inline-formula>, joint variable distribution <inline-formula id="j_nejsds36_ineq_245"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${t_{2}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_nejsds36_ineq_246"><alternatives><mml:math>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1000</mml:mn></mml:math><tex-math><![CDATA[$k=1000$]]></tex-math></alternatives></inline-formula>.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: right; border-top: double; border-bottom: solid thin"><italic>n</italic></td>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin">Full</td>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin">UNI</td>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin">IBOSS</td>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin">SIS(100)-IBOSS</td>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin">CLASS</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: right"><inline-formula id="j_nejsds36_ineq_247"><alternatives><mml:math>
<mml:mn>5</mml:mn>
<mml:mo>×</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[$5\times {10^{3}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">2.35</td>
<td style="vertical-align: top; text-align: center">0.54</td>
<td style="vertical-align: top; text-align: center">0.98</td>
<td style="vertical-align: top; text-align: center">0.58</td>
<td style="vertical-align: top; text-align: center">51.82</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: right"><inline-formula id="j_nejsds36_ineq_248"><alternatives><mml:math>
<mml:mn>5</mml:mn>
<mml:mo>×</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[$5\times {10^{4}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">33.96</td>
<td style="vertical-align: top; text-align: center">0.53</td>
<td style="vertical-align: top; text-align: center">1.81</td>
<td style="vertical-align: top; text-align: center">0.91</td>
<td style="vertical-align: top; text-align: center">50.39</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: right; border-bottom: solid thin"><inline-formula id="j_nejsds36_ineq_249"><alternatives><mml:math>
<mml:mn>5</mml:mn>
<mml:mo>×</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[$5\times {10^{5}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">361.66</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">0.54</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">9.21</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">3.88</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">51.36</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="j_nejsds36_tab_003">
<label>Table 3</label>
<caption>
<p>CPU times (seconds) for different <italic>p</italic> with <inline-formula id="j_nejsds36_ineq_250"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>5</mml:mn>
<mml:mo>×</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[$n=5\times {10^{5}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_251"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>50</mml:mn></mml:math><tex-math><![CDATA[${p_{1}}=50$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_252"><alternatives><mml:math>
<mml:mi mathvariant="bold">Σ</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>.</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">I</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo stretchy="false">≠</mml:mo>
<mml:mi mathvariant="italic">j</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\boldsymbol{\Sigma }=(0.{5^{I(i\ne j)}})$]]></tex-math></alternatives></inline-formula>, joint variable distribution <inline-formula id="j_nejsds36_ineq_253"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${t_{2}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_nejsds36_ineq_254"><alternatives><mml:math>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1000</mml:mn></mml:math><tex-math><![CDATA[$k=1000$]]></tex-math></alternatives></inline-formula>.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: right; border-top: double; border-bottom: solid thin"><italic>p</italic></td>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin">Full</td>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin">UNI</td>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin">IBOSS</td>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin">SIS(100)-IBOSS</td>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin">CLASS</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: right">100</td>
<td style="vertical-align: top; text-align: center">16.78</td>
<td style="vertical-align: top; text-align: center">0.19</td>
<td style="vertical-align: top; text-align: center">1.95</td>
<td style="vertical-align: top; text-align: center">2.31</td>
<td style="vertical-align: top; text-align: center">18.31</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: right">250</td>
<td style="vertical-align: top; text-align: center">55.03</td>
<td style="vertical-align: top; text-align: center">0.42</td>
<td style="vertical-align: top; text-align: center">4.79</td>
<td style="vertical-align: top; text-align: center">2.95</td>
<td style="vertical-align: top; text-align: center">43.11</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: right; border-bottom: solid thin">500</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">361.66</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">0.54</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">9.21</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">3.88</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">51.36</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="j_nejsds36_tab_004">
<label>Table 4</label>
<caption>
<p>CPU times (seconds) for <inline-formula id="j_nejsds36_ineq_255"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>7</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[$n={10^{7}}$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_256"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>100</mml:mn></mml:math><tex-math><![CDATA[$p=100$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_257"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>50</mml:mn></mml:math><tex-math><![CDATA[${p_{1}}=50$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_258"><alternatives><mml:math>
<mml:mi mathvariant="bold">Σ</mml:mi>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">I</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[$\boldsymbol{\Sigma }={I_{p}}$]]></tex-math></alternatives></inline-formula>, joint variable distribution <inline-formula id="j_nejsds36_ineq_259"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${t_{2}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_nejsds36_ineq_260"><alternatives><mml:math>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1000</mml:mn></mml:math><tex-math><![CDATA[$k=1000$]]></tex-math></alternatives></inline-formula>.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: right; border-top: double; border-bottom: solid thin"><italic>n</italic></td>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin">UNI</td>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin">IBOSS</td>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin">SIS(80)-IBOSS</td>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin">CLASS</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: right; border-bottom: solid thin"><inline-formula id="j_nejsds36_ineq_261"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>7</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${10^{7}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">0.10</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">32.22</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">32.83</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">25.86</td>
</tr>
</tbody>
</table>
</table-wrap>
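<p>The growth with <italic>n</italic> of the IBOSS and SIS-IBOSS times in Tables <xref rid="j_nejsds36_tab_002">2</xref>–<xref rid="j_nejsds36_tab_004">4</xref> reflects that these methods must scan every row of the data for each considered variable. The following simplified IBOSS-style selection is an illustrative sketch only, not the reference implementation (in particular, the actual algorithm processes variables sequentially over the not-yet-selected rows):</p>

```python
import numpy as np

def iboss_sketch(X, k):
    # Simplified IBOSS-style subdata selection: for each of the p columns,
    # keep the r = k/(2p) rows with the smallest and the largest values.
    # Each column requires a pass over all n rows (here via argpartition),
    # so the cost grows with n even though the subdata size k is fixed.
    n, p = X.shape
    r = k // (2 * p)
    keep = set()
    for j in range(p):
        low = np.argpartition(X[:, j], r)[:r]        # r smallest in column j
        high = np.argpartition(X[:, j], n - r)[n - r:]  # r largest in column j
        keep.update(low.tolist())
        keep.update(high.tolist())
    return np.array(sorted(keep))       # duplicates across columns removed

rng = np.random.default_rng(2)
X = rng.standard_normal((10000, 5))
idx = iboss_sketch(X, k=100)            # at most 100 rows, each extreme in some column
print(len(idx))
```

<p>The per-column partial sort is O(<italic>n</italic>) per variable, so the run-time of this style of selection grows linearly in <italic>n</italic>, matching the pattern in the IBOSS column of Table <xref rid="j_nejsds36_tab_002">2</xref>.</p>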
<p>As suggested by one of the reviewers, since CLASS is computationally slower than the other subdata selection methods in Tables <xref rid="j_nejsds36_tab_002">2</xref> and <xref rid="j_nejsds36_tab_003">3</xref>, statistical performance should be compared after allowing each method the same amount of CPU time. This allows the other methods to use a larger subdata size than CLASS and thus, presumably, to achieve better statistical performance than on subdata of the same size as that used for CLASS. Based on this suggestion, we added Tables <xref rid="j_nejsds36_tab_005">5</xref> and <xref rid="j_nejsds36_tab_006">6</xref>, which compare the different subdata selection methods while keeping the CPU time approximately constant. Tables <xref rid="j_nejsds36_tab_005">5</xref> and <xref rid="j_nejsds36_tab_006">6</xref>, which report averages over 100 iterations for each method and each combination of <italic>n</italic> and <italic>k</italic>, list the performances when <inline-formula id="j_nejsds36_ineq_262"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>500</mml:mn></mml:math><tex-math><![CDATA[$p=500$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_263"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>50</mml:mn></mml:math><tex-math><![CDATA[${p_{1}}=50$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_264"><alternatives><mml:math>
<mml:mi mathvariant="bold">Σ</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>.</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">I</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo stretchy="false">≠</mml:mo>
<mml:mi mathvariant="italic">j</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\boldsymbol{\Sigma }=(0.{5^{I(i\ne j)}})$]]></tex-math></alternatives></inline-formula>, and the joint variable distribution is <inline-formula id="j_nejsds36_ineq_265"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${t_{2}}$]]></tex-math></alternatives></inline-formula> and Normal, respectively. We consider <italic>n</italic> to be either <inline-formula id="j_nejsds36_ineq_266"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${10^{5}}$]]></tex-math></alternatives></inline-formula> or <inline-formula id="j_nejsds36_ineq_267"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>6</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${10^{6}}$]]></tex-math></alternatives></inline-formula> and use <inline-formula id="j_nejsds36_ineq_268"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mi mathvariant="italic">a</mml:mi>
<mml:mi mathvariant="italic">m</mml:mi>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mi mathvariant="italic">l</mml:mi>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1000</mml:mn></mml:math><tex-math><![CDATA[$nsample=1000$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_nejsds36_ineq_269"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mi mathvariant="italic">m</mml:mi>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>100</mml:mn></mml:math><tex-math><![CDATA[$ntimes=100$]]></tex-math></alternatives></inline-formula> for CLASS. In Table <xref rid="j_nejsds36_tab_005">5</xref>, we see that CLASS performs better on all criteria despite utilizing a far smaller subdata size of <inline-formula id="j_nejsds36_ineq_270"><alternatives><mml:math>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1000</mml:mn></mml:math><tex-math><![CDATA[$k=1000$]]></tex-math></alternatives></inline-formula> compared to the other subdata selection methods. The differences in MSE are large, which we attribute to CLASS being more successful at identifying virtually all of the active variables and at collecting subdata that facilitates precise estimation of the corresponding parameters. The results in Table <xref rid="j_nejsds36_tab_005">5</xref> are not entirely surprising. When CLASS performs better than FULL in terms of the MSE for prediction, it is expected to perform better than the other subdata methods, since they are not expected to beat FULL. These other subdata methods perform the same analysis as FULL, but on fewer data points. Also, irrespective of the subdata size, the error rates for these methods tend to be no better than those for FULL, and can thus be much worse than those for CLASS. Similar results are seen for the mixture distribution (see the Supplementary Material). In addition, as demonstrated in Table <xref rid="j_nejsds36_tab_004">4</xref>, CLASS becomes faster than IBOSS and SIS-IBOSS for very large <italic>n</italic> while continuing to perform better statistically. Table <xref rid="j_nejsds36_tab_006">6</xref>, for the Normal distribution, shows a different picture. With this variable distribution, the difference between CLASS and the other subdata selection methods was already smaller when all methods used subdata of the same size, and now that the other methods can use more subdata (in some cases almost all of the data), they begin to outperform CLASS in terms of MSE. However, in contrast to Table <xref rid="j_nejsds36_tab_005">5</xref>, the differences in MSE in Table <xref rid="j_nejsds36_tab_006">6</xref> are very small. Also, CLASS is not outperformed in terms of variable selection. Similar observations apply for the logNormal distribution (see the Supplementary Material).
In conclusion, for the same amount of CPU time, CLASS is highly competitive in all cases studied here, and in two of the four cases it is the clear winner in terms of both MSE and variable selection.</p>
<table-wrap id="j_nejsds36_tab_005">
<label>Table 5</label>
<caption>
<p>MSE and variable selection performance for different subdata methods with approximately equal CPU times, different <italic>n</italic> and <italic>k</italic>, <inline-formula id="j_nejsds36_ineq_271"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>500</mml:mn></mml:math><tex-math><![CDATA[$p=500$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_272"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>50</mml:mn></mml:math><tex-math><![CDATA[${p_{1}}=50$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_273"><alternatives><mml:math>
<mml:mi mathvariant="bold">Σ</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>.</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">I</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo stretchy="false">≠</mml:mo>
<mml:mi mathvariant="italic">j</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\boldsymbol{\Sigma }=(0.{5^{I(i\ne j)}})$]]></tex-math></alternatives></inline-formula> and joint variable distribution <inline-formula id="j_nejsds36_ineq_274"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${t_{2}}$]]></tex-math></alternatives></inline-formula>.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: right; border-top: double; border-bottom: solid thin">Method</td>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin"><italic>k</italic></td>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin">Time (s)</td>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin">MSE</td>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin">Power</td>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin">Error</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: right"/>
<td colspan="5" style="vertical-align: top; text-align: center">For <inline-formula id="j_nejsds36_ineq_275"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[$n={10^{5}}$]]></tex-math></alternatives></inline-formula></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: right">UNI</td>
<td style="vertical-align: top; text-align: center">90000</td>
<td style="vertical-align: top; text-align: center">63.13</td>
<td style="vertical-align: top; text-align: center">93.00642</td>
<td style="vertical-align: top; text-align: center">0.9892</td>
<td style="vertical-align: top; text-align: center">0.1957</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: right">IBOSS</td>
<td style="vertical-align: top; text-align: center">60000</td>
<td style="vertical-align: top; text-align: center">54.53</td>
<td style="vertical-align: top; text-align: center">53.00148</td>
<td style="vertical-align: top; text-align: center">0.9932</td>
<td style="vertical-align: top; text-align: center">0.2010</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: right">SIS(100)-IBOSS</td>
<td style="vertical-align: top; text-align: center">75000</td>
<td style="vertical-align: top; text-align: center">53.73</td>
<td style="vertical-align: top; text-align: center">49.58365</td>
<td style="vertical-align: top; text-align: center">0.9936</td>
<td style="vertical-align: top; text-align: center">0.2005</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: right; border-bottom: solid thin">CLASS</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">1000</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">50.13</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">0.084747</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">0.9998</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">0.0000</td>
</tr>
</tbody><tbody>
<tr>
<td style="vertical-align: top; text-align: right"/>
<td colspan="5" style="vertical-align: top; text-align: center">For <inline-formula id="j_nejsds36_ineq_276"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>6</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[$n={10^{6}}$]]></tex-math></alternatives></inline-formula></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: right">UNI</td>
<td style="vertical-align: top; text-align: center">80000</td>
<td style="vertical-align: top; text-align: center">55.58</td>
<td style="vertical-align: top; text-align: center">110.0874</td>
<td style="vertical-align: top; text-align: center">0.9906</td>
<td style="vertical-align: top; text-align: center">0.1932</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: right">IBOSS</td>
<td style="vertical-align: top; text-align: center">50000</td>
<td style="vertical-align: top; text-align: center">59.72</td>
<td style="vertical-align: top; text-align: center">103.6516</td>
<td style="vertical-align: top; text-align: center">0.9800</td>
<td style="vertical-align: top; text-align: center">0.1970</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: right">SIS(100)-IBOSS</td>
<td style="vertical-align: top; text-align: center">70000</td>
<td style="vertical-align: top; text-align: center">56.03</td>
<td style="vertical-align: top; text-align: center">96.07541</td>
<td style="vertical-align: top; text-align: center">0.9808</td>
<td style="vertical-align: top; text-align: center">0.1963</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: right; border-bottom: solid thin">CLASS</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">1000</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">52.47</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">0.000104</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">1</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">0.0000</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="j_nejsds36_tab_006">
<label>Table 6</label>
<caption>
<p>MSE and variable selection performance for different subdata methods with approximately equal CPU times, different <italic>n</italic> and <italic>k</italic>, <inline-formula id="j_nejsds36_ineq_277"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>500</mml:mn></mml:math><tex-math><![CDATA[$p=500$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_278"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>50</mml:mn></mml:math><tex-math><![CDATA[${p_{1}}=50$]]></tex-math></alternatives></inline-formula>, <inline-formula id="j_nejsds36_ineq_279"><alternatives><mml:math>
<mml:mi mathvariant="bold">Σ</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>.</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">I</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo stretchy="false">≠</mml:mo>
<mml:mi mathvariant="italic">j</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$\boldsymbol{\Sigma }=(0.{5^{I(i\ne j)}})$]]></tex-math></alternatives></inline-formula> and a normal joint distribution of the variables.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: right; border-top: double; border-bottom: solid thin">Method</td>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin"><italic>k</italic></td>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin">Time (s)</td>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin">MSE</td>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin">Power</td>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin">Error</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: right"/>
<td colspan="5" style="vertical-align: top; text-align: center">For <inline-formula id="j_nejsds36_ineq_280"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[$n={10^{5}}$]]></tex-math></alternatives></inline-formula></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: right">UNI</td>
<td style="vertical-align: top; text-align: center">90000</td>
<td style="vertical-align: top; text-align: center">53.74</td>
<td style="vertical-align: top; text-align: center">0.001053</td>
<td style="vertical-align: top; text-align: center">1</td>
<td style="vertical-align: top; text-align: center">0.0000</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: right">IBOSS</td>
<td style="vertical-align: top; text-align: center">60000</td>
<td style="vertical-align: top; text-align: center">51.60</td>
<td style="vertical-align: top; text-align: center">0.000843</td>
<td style="vertical-align: top; text-align: center">1</td>
<td style="vertical-align: top; text-align: center">0.0000</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: right">SIS(100)-IBOSS</td>
<td style="vertical-align: top; text-align: center">75000</td>
<td style="vertical-align: top; text-align: center">46.89</td>
<td style="vertical-align: top; text-align: center">0.000663</td>
<td style="vertical-align: top; text-align: center">1</td>
<td style="vertical-align: top; text-align: center">0.0000</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: right; border-bottom: solid thin">CLASS</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">1000</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">43.56</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">0.044859</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">1</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">0.0000</td>
</tr>
</tbody><tbody>
<tr>
<td style="vertical-align: top; text-align: right"/>
<td colspan="5" style="vertical-align: top; text-align: center">For <inline-formula id="j_nejsds36_ineq_281"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>6</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[$n={10^{6}}$]]></tex-math></alternatives></inline-formula></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: right">UNI</td>
<td style="vertical-align: top; text-align: center">80000</td>
<td style="vertical-align: top; text-align: center">48.97</td>
<td style="vertical-align: top; text-align: center">0.000687</td>
<td style="vertical-align: top; text-align: center">1</td>
<td style="vertical-align: top; text-align: center">0.0000</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: right">IBOSS</td>
<td style="vertical-align: top; text-align: center">50000</td>
<td style="vertical-align: top; text-align: center">69.27</td>
<td style="vertical-align: top; text-align: center">0.001016</td>
<td style="vertical-align: top; text-align: center">1</td>
<td style="vertical-align: top; text-align: center">0.0000</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: right">SIS(100)-IBOSS</td>
<td style="vertical-align: top; text-align: center">70000</td>
<td style="vertical-align: top; text-align: center">64.64</td>
<td style="vertical-align: top; text-align: center">0.000679</td>
<td style="vertical-align: top; text-align: center">1</td>
<td style="vertical-align: top; text-align: center">0.0002</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: right; border-bottom: solid thin">CLASS</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">1000</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">44.90</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">0.045218</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">1</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">0.0000</td>
</tr>
</tbody>
</table>
</table-wrap>
<sec id="j_nejsds36_s_015">
<label>4.1</label>
<title>Real Data</title>
<table-wrap id="j_nejsds36_tab_007">
<label>Table 7</label>
<caption>
<p>MSPE over 100 random training-test splits on the Blog Feedback data.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin">Full</td>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin">IBOSS</td>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin">CLASS</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">902.04</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">1043.65</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">1003.25</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>As in [<xref ref-type="bibr" rid="j_nejsds36_ref_044">44</xref>], we consider the Blog Feedback data as a real case study. The data is obtained from blog posts, with the ultimate goal of predicting the number of comments on a future blog post. The data is available at the UCI repository at <uri>https://archive.ics.uci.edu/ml/datasets/BlogFeedback</uri> and is described in [<xref ref-type="bibr" rid="j_nejsds36_ref_004">4</xref>]. The raw HTML documents of the blog posts were crawled and processed to retain only posts published no earlier than 72 hours before the selected base date/time. Counting the training and test datasets together, the data has <inline-formula id="j_nejsds36_ineq_282"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>52</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>397</mml:mn></mml:math><tex-math><![CDATA[$n=52,397$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_nejsds36_ineq_283"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>280</mml:mn></mml:math><tex-math><![CDATA[$p=280$]]></tex-math></alternatives></inline-formula>. The 280 variables in the data include average, standard deviation, min, max, and median of the length of time between the publication of the blog post and “current” time, the length of the blog post, the number of comments in the last 24 hours before the base time and so on. We use a random 80–20% split of this dataset as training and test data and with <inline-formula id="j_nejsds36_ineq_284"><alternatives><mml:math>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>5600</mml:mn></mml:math><tex-math><![CDATA[$k=5600$]]></tex-math></alternatives></inline-formula>, we compute the mean squared prediction error of the test set, where 
<disp-formula id="j_nejsds36_eq_009">
<alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">M</mml:mi>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mi mathvariant="italic">P</mml:mi>
<mml:mi mathvariant="italic">E</mml:mi>
<mml:mo>=</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:mo largeop="true" movablelimits="false">∑</mml:mo></mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
<mml:mi mathvariant="italic">e</mml:mi>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:munderover>
<mml:msup>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>−</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">y</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">ˆ</mml:mo></mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ MSPE=\frac{1}{{n_{test}}}{\sum \limits_{i=1}^{{n_{test}}}}{({y_{i}}-{\hat{y}_{i}})^{2}}\]]]></tex-math></alternatives>
</disp-formula> 
and <inline-formula id="j_nejsds36_ineq_285"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="italic">y</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">ˆ</mml:mo></mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\hat{y}_{i}}$]]></tex-math></alternatives></inline-formula> is the predicted value. For the full data (Full) and IBOSS subdata on <inline-formula id="j_nejsds36_ineq_286"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>280</mml:mn></mml:math><tex-math><![CDATA[$p=280$]]></tex-math></alternatives></inline-formula> variables (IBOSS), predicted values are obtained by first fitting a LASSO model followed by OLS. For CLASS, predicted values come from the model obtained by Algorithm <xref rid="j_nejsds36_fig_001">1</xref>. The choice of <inline-formula id="j_nejsds36_ineq_287"><alternatives><mml:math>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>5600</mml:mn></mml:math><tex-math><![CDATA[$k=5600$]]></tex-math></alternatives></inline-formula> is inspired by previous studies such as [<xref ref-type="bibr" rid="j_nejsds36_ref_044">44</xref>], so that IBOSS can select 20 observations per variable. In Table <xref rid="j_nejsds36_tab_007">7</xref>, we present the mean squared prediction errors for these three methods, averaged over 100 random splits of the full data into training and test sets. CLASS produces a smaller MSPE than IBOSS, and the full data does better than the two subsampling methods.</p>
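The MSPE above is a direct average of squared prediction errors over the test set. As a minimal illustration (the function and array names below are ours, not from the paper), it can be computed as:

```python
import numpy as np

def mspe(y_test, y_pred):
    """Mean squared prediction error: (1/n_test) * sum_i (y_i - yhat_i)^2."""
    y_test = np.asarray(y_test, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_test - y_pred) ** 2))
```

For example, `mspe([0.0, 2.0], [1.0, 2.0])` returns 0.5. In the setting of Table 7, the predictions would come from the LASSO-followed-by-OLS fit (Full, IBOSS) or from the CLASS model.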
</sec>
</sec>
<sec id="j_nejsds36_s_016">
<label>5</label>
<title>Concluding Remarks</title>
<p>With a very large number of observations <italic>n</italic> and a large number of variables <italic>p</italic>, it can be computationally challenging to identify the active variables and build a good model for prediction. Subdata selection methods are one way to deal with this problem. If a linear regression model is reasonable for the data, then IBOSS is known to be an excellent method for subdata selection. However, as discussed in previous sections, IBOSS subdata can only be found when the subdata size <inline-formula id="j_nejsds36_ineq_288"><alternatives><mml:math>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mo mathvariant="normal">&gt;</mml:mo>
<mml:mn>2</mml:mn>
<mml:mi mathvariant="italic">p</mml:mi></mml:math><tex-math><![CDATA[$k\gt 2p$]]></tex-math></alternatives></inline-formula>. Moreover, even when <inline-formula id="j_nejsds36_ineq_289"><alternatives><mml:math>
<mml:mi mathvariant="italic">k</mml:mi>
<mml:mo mathvariant="normal">&gt;</mml:mo>
<mml:mn>2</mml:mn>
<mml:mi mathvariant="italic">p</mml:mi></mml:math><tex-math><![CDATA[$k\gt 2p$]]></tex-math></alternatives></inline-formula>, better subdata tends to be obtained by selecting the IBOSS subdata using only the active variables.</p>
<p>In this work, under the assumption of effect sparsity, we propose a method, CLASS, that attempts to do just that. We first devise a variable selection method that uses small uniform random samples of the full data to conduct multiple LASSO runs. As demonstrated, our variable selection approach is better than applying LASSO to IBOSS subdata, SIS(<inline-formula id="j_nejsds36_ineq_290"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>+</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${p^{+}}$]]></tex-math></alternatives></inline-formula>)-IBOSS subdata, or the full data. We then obtain IBOSS subdata using only these selected variables, and fit a linear model to the subdata using OLS estimation. CLASS results in much smaller mean squared errors than IBOSS and SIS-IBOSS, even after adding OLS estimation at the end of these methods. For heavy-tailed joint distributions of the variables, CLASS can also improve on using the full data.</p>
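The pipeline just described, multiple LASSO runs on small uniform subsamples to select variables, followed by IBOSS restricted to those variables and a final OLS fit, can be sketched as follows. This is a simplified sketch under our own assumptions, not the authors' Algorithm 1: the plain coordinate-descent LASSO, the unanimity voting rule, and the tuning constants `n_runs`, `m`, and `lam` are illustrative stand-ins.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=100):
    """Plain coordinate-descent LASSO (no intercept; columns comparably scaled)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual with variable j removed, then soft-threshold
            r = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ss[j]
    return beta

def class_subdata(X, y, k, n_runs=5, m=80, lam=5.0, seed=0):
    """Sketch of CLASS: vote over LASSO runs on uniform subsamples,
    then IBOSS-style extreme-row selection on the chosen variables,
    then OLS on the resulting subdata."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    votes = np.zeros(p)
    for _ in range(n_runs):
        idx = rng.choice(n, size=m, replace=False)
        votes += lasso_cd(X[idx], y[idx], lam) != 0
    active = np.flatnonzero(votes == n_runs)  # variables kept in every run
    if active.size == 0:
        raise ValueError("no variable survived all LASSO runs")
    # IBOSS on the active variables: r smallest and r largest rows per variable
    r = max(1, k // (2 * active.size))
    chosen = set()
    for j in active:
        order = np.argsort(X[:, j])
        chosen.update(order[:r].tolist())
        chosen.update(order[-r:].tolist())
    rows = np.fromiter(chosen, dtype=int)
    # final OLS fit (with intercept) on the subdata, active variables only
    Z = np.column_stack([np.ones(rows.size), X[np.ix_(rows, active)]])
    beta_hat, *_ = np.linalg.lstsq(Z, y[rows], rcond=None)
    return active, rows, beta_hat
```

Here `r = k/(2 p_selected)` rows are taken from each extreme of every selected variable, mirroring the IBOSS rule; rows selected by more than one variable are kept once, so the subdata can be slightly smaller than <italic>k</italic>.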
<p>Due to the repeated applications of LASSO, CLASS requires more computing time than the competing subdata selection methods. However, when <italic>n</italic> is very large, CLASS becomes computationally less expensive than IBOSS and, for values of <inline-formula id="j_nejsds36_ineq_291"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>+</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${p^{+}}$]]></tex-math></alternatives></inline-formula> that are not too small, than SIS-IBOSS. The superior statistical performance of CLASS, both in terms of variable selection and prediction accuracy, makes a strong case for its use over the competing methods. CLASS is faster than analyzing the full data and is applicable in situations where full data analysis may not be possible.</p>
</sec>
</body>
<back>
<ref-list id="j_nejsds36_reflist_001">
<title>References</title>
<ref id="j_nejsds36_ref_001">
<label>[1]</label><mixed-citation publication-type="journal"> <string-name><surname>Ai</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Yu</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>H.</given-names></string-name> and <string-name><surname>Wang</surname>, <given-names>H.</given-names></string-name> (<year>2019</year>). <article-title>Optimal subsampling algorithms for big data regressions</article-title>. <source>Statistica Sinica</source>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.5705/ss.202018.0439" xlink:type="simple">https://doi.org/10.5705/ss.202018.0439</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_002">
<label>[2]</label><mixed-citation publication-type="journal"> <string-name><surname>Ai</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Yu</surname>, <given-names>J.</given-names></string-name> and <string-name><surname>Zhang</surname>, <given-names>H.</given-names></string-name> (<year>2021</year>). <article-title>Optimal subsampling for large-scale quantile regression</article-title>. <source>Journal of Complexity</source> <volume>62</volume> <fpage>101512</fpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.jco.2020.101512" xlink:type="simple">https://doi.org/10.1016/j.jco.2020.101512</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=4174536">MR4174536</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_003">
<label>[3]</label><mixed-citation publication-type="chapter"> <string-name><surname>Bach</surname>, <given-names>F. R.</given-names></string-name> (<year>2008</year>). <chapter-title>Bolasso: model consistent lasso estimation through the bootstrap</chapter-title>. In <source>Proceedings of the 25th International Conference on Machine Learning</source> <fpage>33</fpage>–<lpage>40</lpage>.</mixed-citation>
</ref>
<ref id="j_nejsds36_ref_004">
<label>[4]</label><mixed-citation publication-type="chapter"> <string-name><surname>Buza</surname>, <given-names>K.</given-names></string-name> (<year>2014</year>). <chapter-title>Feedback prediction for blogs</chapter-title>. In <source>Data analysis, machine learning and knowledge discovery</source> <fpage>145</fpage>–<lpage>152</lpage> <publisher-name>Springer</publisher-name>.</mixed-citation>
</ref>
<ref id="j_nejsds36_ref_005">
<label>[5]</label><mixed-citation publication-type="journal"> <string-name><surname>Cai</surname>, <given-names>L.</given-names></string-name> and <string-name><surname>Zhu</surname>, <given-names>Y.</given-names></string-name> (<year>2015</year>). <article-title>The challenges of data quality and data quality assessment in the big data era</article-title>. <source>Data Science Journal</source> <volume>14</volume>.</mixed-citation>
</ref>
<ref id="j_nejsds36_ref_006">
<label>[6]</label><mixed-citation publication-type="journal"> <string-name><surname>Chen</surname>, <given-names>X.</given-names></string-name> and <string-name><surname>Xie</surname>, <given-names>M. -G.</given-names></string-name> (<year>2014</year>). <article-title>A split-and-conquer approach for analysis of extraordinarily large data</article-title>. <source>Statistica Sinica</source> <volume>24</volume>(<issue>4</issue>) <fpage>1655</fpage>–<lpage>1684</lpage>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=3308656">MR3308656</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_007">
<label>[7]</label><mixed-citation publication-type="journal"> <string-name><surname>Cheng</surname>, <given-names>Q.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>H.</given-names></string-name> and <string-name><surname>Yang</surname>, <given-names>M.</given-names></string-name> (<year>2020</year>). <article-title>Information-based optimal subdata selection for big data logistic regression</article-title>. <source>Journal of Statistical Planning and Inference</source> <volume>209</volume> <fpage>112</fpage>–<lpage>122</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.jspi.2020.03.004" xlink:type="simple">https://doi.org/10.1016/j.jspi.2020.03.004</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=4096258">MR4096258</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_008">
<label>[8]</label><mixed-citation publication-type="journal"> <string-name><surname>Derezinski</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Warmuth</surname>, <given-names>M. K.</given-names></string-name> and <string-name><surname>Hsu</surname>, <given-names>D. J.</given-names></string-name> (<year>2018</year>). <article-title>Leveraged volume sampling for linear regression</article-title>. <source>Advances in Neural Information Processing Systems</source> <volume>31</volume>.</mixed-citation>
</ref>
<ref id="j_nejsds36_ref_009">
<label>[9]</label><mixed-citation publication-type="journal"> <string-name><surname>Drineas</surname>, <given-names>P.</given-names></string-name> and <string-name><surname>Mahoney</surname>, <given-names>M. W.</given-names></string-name> (<year>2016</year>). <article-title>RandNLA: randomized numerical linear algebra</article-title>. <source>Communications of the ACM</source> <volume>59</volume>(<issue>6</issue>) <fpage>80</fpage>–<lpage>90</lpage>.</mixed-citation>
</ref>
<ref id="j_nejsds36_ref_010">
<label>[10]</label><mixed-citation publication-type="chapter"> <string-name><surname>Drineas</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Mahoney</surname>, <given-names>M. W.</given-names></string-name> and <string-name><surname>Muthukrishnan</surname>, <given-names>S.</given-names></string-name> (<year>2006</year>). <chapter-title>Sampling algorithms for l2 regression and applications</chapter-title>. In <source>Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms</source> <fpage>1127</fpage>–<lpage>1136</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1145/1109557.1109682" xlink:type="simple">https://doi.org/10.1145/1109557.1109682</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=2373840">MR2373840</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_011">
<label>[11]</label><mixed-citation publication-type="journal"> <string-name><surname>Fan</surname>, <given-names>J.</given-names></string-name> and <string-name><surname>Lv</surname>, <given-names>J.</given-names></string-name> (<year>2008</year>). <article-title>Sure independence screening for ultrahigh dimensional feature space</article-title>. <source>Journal of the Royal Statistical Society: Series B (Statistical Methodology)</source> <volume>70</volume>(<issue>5</issue>) <fpage>849</fpage>–<lpage>911</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1111/j.1467-9868.2008.00674.x" xlink:type="simple">https://doi.org/10.1111/j.1467-9868.2008.00674.x</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=2530322">MR2530322</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_012">
<label>[12]</label><mixed-citation publication-type="journal"> <string-name><surname>Fan</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Liu</surname>, <given-names>Y.</given-names></string-name> and <string-name><surname>Zhu</surname>, <given-names>L.</given-names></string-name> (<year>2021</year>). <article-title>Optimal subsampling for linear quantile regression models</article-title>. <source>Canadian Journal of Statistics</source> <volume>49</volume>(<issue>4</issue>) <fpage>1039</fpage>–<lpage>1057</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1002/cjs.11590" xlink:type="simple">https://doi.org/10.1002/cjs.11590</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=4349634">MR4349634</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_013">
<label>[13]</label><mixed-citation publication-type="journal"> <string-name><surname>Fithian</surname>, <given-names>W.</given-names></string-name> and <string-name><surname>Hastie</surname>, <given-names>T.</given-names></string-name> (<year>2014</year>). <article-title>Local case-control sampling: Efficient subsampling in imbalanced data sets</article-title>. <source>Annals of Statistics</source> <volume>42</volume>(<issue>5</issue>) <fpage>1693</fpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1214/14-AOS1220" xlink:type="simple">https://doi.org/10.1214/14-AOS1220</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=3257627">MR3257627</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_014">
<label>[14]</label><mixed-citation publication-type="journal"> <string-name><surname>Friedman</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Hastie</surname>, <given-names>T.</given-names></string-name> and <string-name><surname>Tibshirani</surname>, <given-names>R.</given-names></string-name> (<year>2010</year>). <article-title>Regularization Paths for Generalized Linear Models via Coordinate Descent</article-title>. <source>Journal of Statistical Software</source> <volume>33</volume>(<issue>1</issue>) <fpage>1</fpage>–<lpage>22</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.18637/jss.v033.i01" xlink:type="simple">https://doi.org/10.18637/jss.v033.i01</ext-link>.</mixed-citation>
</ref>
<ref id="j_nejsds36_ref_015">
<label>[15]</label><mixed-citation publication-type="journal"> <string-name><surname>Fu</surname>, <given-names>W.</given-names></string-name> and <string-name><surname>Knight</surname>, <given-names>K.</given-names></string-name> (<year>2000</year>). <article-title>Asymptotics for lasso-type estimators</article-title>. <source>The Annals of Statistics</source> <volume>28</volume>(<issue>5</issue>) <fpage>1356</fpage>–<lpage>1378</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1214/aos/1015957397" xlink:type="simple">https://doi.org/10.1214/aos/1015957397</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=1805787">MR1805787</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_016">
<label>[16]</label><mixed-citation publication-type="journal"> <string-name><surname>Han</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Tan</surname>, <given-names>K. M.</given-names></string-name>, <string-name><surname>Yang</surname>, <given-names>T.</given-names></string-name> and <string-name><surname>Zhang</surname>, <given-names>T.</given-names></string-name> (<year>2020</year>). <article-title>Local uncertainty sampling for large-scale multiclass logistic regression</article-title>. <source>The Annals of Statistics</source> <volume>48</volume>(<issue>3</issue>) <fpage>1770</fpage>–<lpage>1788</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1214/19-AOS1867" xlink:type="simple">https://doi.org/10.1214/19-AOS1867</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=4124343">MR4124343</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_017">
<label>[17]</label><mixed-citation publication-type="book"> <string-name><surname>Hedayat</surname>, <given-names>A. S.</given-names></string-name>, <string-name><surname>Sloane</surname>, <given-names>N. J. A.</given-names></string-name> and <string-name><surname>Stufken</surname>, <given-names>J.</given-names></string-name> (<year>1999</year>) <source>Orthogonal arrays: theory and applications</source>. <publisher-name>Springer Science &amp; Business Media</publisher-name>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1007/978-1-4612-1478-6" xlink:type="simple">https://doi.org/10.1007/978-1-4612-1478-6</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=1693498">MR1693498</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_018">
<label>[18]</label><mixed-citation publication-type="journal"> <string-name><surname>Joseph</surname>, <given-names>V. R.</given-names></string-name> and <string-name><surname>Mak</surname>, <given-names>S.</given-names></string-name> (<year>2021</year>). <article-title>Supervised compression of big data</article-title>. <source>Statistical Analysis and Data Mining: The ASA Data Science Journal</source> <volume>14</volume>(<issue>3</issue>) <fpage>217</fpage>–<lpage>229</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1002/sam.11508" xlink:type="simple">https://doi.org/10.1002/sam.11508</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=4303067">MR4303067</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_019">
<label>[19]</label><mixed-citation publication-type="journal"> <string-name><surname>Joseph</surname>, <given-names>V. R.</given-names></string-name> and <string-name><surname>Vakayil</surname>, <given-names>A.</given-names></string-name> (<year>2022</year>). <article-title>SPlit: an optimal method for data splitting</article-title>. <source>Technometrics</source> <volume>64</volume>(<issue>2</issue>) <fpage>166</fpage>–<lpage>176</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1080/00401706.2021.1921037" xlink:type="simple">https://doi.org/10.1080/00401706.2021.1921037</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=4410911">MR4410911</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_020">
<label>[20]</label><mixed-citation publication-type="journal"> <string-name><surname>Kiefer</surname>, <given-names>J.</given-names></string-name> and <string-name><surname>Wolfowitz</surname>, <given-names>J.</given-names></string-name> (<year>1960</year>). <article-title>The equivalence of two extremum problems</article-title>. <source>Canadian Journal of Mathematics</source> <volume>12</volume> <fpage>363</fpage>–<lpage>366</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.4153/CJM-1960-030-4" xlink:type="simple">https://doi.org/10.4153/CJM-1960-030-4</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=0117842">MR0117842</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_021">
<label>[21]</label><mixed-citation publication-type="journal"> <string-name><surname>Kleiner</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Talwalkar</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Sarkar</surname>, <given-names>P.</given-names></string-name> and <string-name><surname>Jordan</surname>, <given-names>M. I.</given-names></string-name> (<year>2014</year>). <article-title>A scalable bootstrap for massive data</article-title>. <source>Journal of the Royal Statistical Society: Series B: Statistical Methodology</source> <volume>76</volume>(<issue>4</issue>) <fpage>795</fpage>–<lpage>816</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1111/rssb.12050" xlink:type="simple">https://doi.org/10.1111/rssb.12050</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=3248677">MR3248677</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_022">
<label>[22]</label><mixed-citation publication-type="journal"> <string-name><surname>Lin</surname>, <given-names>N.</given-names></string-name> and <string-name><surname>Xi</surname>, <given-names>R.</given-names></string-name> (<year>2011</year>). <article-title>Aggregated estimating equation estimation</article-title>. <source>Statistics and Its Interface</source> <volume>4</volume>(<issue>1</issue>) <fpage>73</fpage>–<lpage>83</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.4310/SII.2011.v4.n1.a8" xlink:type="simple">https://doi.org/10.4310/SII.2011.v4.n1.a8</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=2775250">MR2775250</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_023">
<label>[23]</label><mixed-citation publication-type="journal"> <string-name><surname>Ma</surname>, <given-names>P.</given-names></string-name> and <string-name><surname>Sun</surname>, <given-names>X.</given-names></string-name> (<year>2015</year>). <article-title>Leveraging for big data regression</article-title>. <source>Wiley Interdisciplinary Reviews: Computational Statistics</source> <volume>7</volume>(<issue>1</issue>) <fpage>70</fpage>–<lpage>76</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1002/wics.1324" xlink:type="simple">https://doi.org/10.1002/wics.1324</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=3348722">MR3348722</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_024">
<label>[24]</label><mixed-citation publication-type="journal"> <string-name><surname>Ma</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Mahoney</surname>, <given-names>M. W.</given-names></string-name> and <string-name><surname>Yu</surname>, <given-names>B.</given-names></string-name> (<year>2015</year>). <article-title>A statistical perspective on algorithmic leveraging</article-title>. <source>The Journal of Machine Learning Research</source> <volume>16</volume>(<issue>1</issue>) <fpage>861</fpage>–<lpage>911</lpage>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=3361306">MR3361306</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_025">
<label>[25]</label><mixed-citation publication-type="other"> <string-name><surname>Mahoney</surname>, <given-names>M. W.</given-names></string-name> (<year>2011</year>). <article-title>Randomized algorithms for matrices and data</article-title>. <italic>arXiv preprint</italic> <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1104.5557"><italic>arXiv:1104.5557</italic></ext-link>.</mixed-citation>
</ref>
<ref id="j_nejsds36_ref_026">
<label>[26]</label><mixed-citation publication-type="journal"> <string-name><surname>Mak</surname>, <given-names>S.</given-names></string-name> and <string-name><surname>Joseph</surname>, <given-names>V. R.</given-names></string-name> (<year>2018</year>). <article-title>Support points</article-title>. <source>The Annals of Statistics</source> <volume>46</volume>(<issue>6A</issue>) <fpage>2562</fpage>–<lpage>2592</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1214/17-AOS1629" xlink:type="simple">https://doi.org/10.1214/17-AOS1629</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=3851748">MR3851748</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_027">
<label>[27]</label><mixed-citation publication-type="journal"> <string-name><surname>Meinshausen</surname>, <given-names>N.</given-names></string-name> (<year>2007</year>). <article-title>Relaxed lasso</article-title>. <source>Computational Statistics &amp; Data Analysis</source> <volume>52</volume>(<issue>1</issue>) <fpage>374</fpage>–<lpage>393</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.csda.2006.12.019" xlink:type="simple">https://doi.org/10.1016/j.csda.2006.12.019</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=2409990">MR2409990</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_028">
<label>[28]</label><mixed-citation publication-type="journal"> <string-name><surname>Meinshausen</surname>, <given-names>N.</given-names></string-name> and <string-name><surname>Bühlmann</surname>, <given-names>P.</given-names></string-name> (<year>2010</year>). <article-title>Stability selection</article-title>. <source>Journal of the Royal Statistical Society: Series B (Statistical Methodology)</source> <volume>72</volume>(<issue>4</issue>) <fpage>417</fpage>–<lpage>473</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1111/j.1467-9868.2010.00740.x" xlink:type="simple">https://doi.org/10.1111/j.1467-9868.2010.00740.x</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=2758523">MR2758523</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_029">
<label>[29]</label><mixed-citation publication-type="journal"> <string-name><surname>Meng</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Xie</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Mandal</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Zhong</surname>, <given-names>W.</given-names></string-name> and <string-name><surname>Ma</surname>, <given-names>P.</given-names></string-name> (<year>2020</year>). <article-title>LowCon: A design-based subsampling approach in a misspecified linear model</article-title>. <source>Journal of Computational and Graphical Statistics</source> <fpage>1</fpage>–<lpage>32</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1080/10618600.2020.1844215" xlink:type="simple">https://doi.org/10.1080/10618600.2020.1844215</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=4313470">MR4313470</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_030">
<label>[30]</label><mixed-citation publication-type="journal"> <string-name><surname>Meng</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Zhong</surname>, <given-names>W.</given-names></string-name> and <string-name><surname>Ma</surname>, <given-names>P.</given-names></string-name> (<year>2020</year>). <article-title>More efficient approximation of smoothing splines via space-filling basis selection</article-title>. <source>Biometrika</source> <volume>107</volume>(<issue>3</issue>) <fpage>723</fpage>–<lpage>735</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1093/biomet/asaa019" xlink:type="simple">https://doi.org/10.1093/biomet/asaa019</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=4138986">MR4138986</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_031">
<label>[31]</label><mixed-citation publication-type="chapter"> <string-name><surname>Mücke</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Reiss</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Rungenhagen</surname>, <given-names>J.</given-names></string-name> and <string-name><surname>Klein</surname>, <given-names>M.</given-names></string-name> (<year>2022</year>). <chapter-title>Data-splitting improves statistical performance in overparameterized regimes</chapter-title>. In <source>Proceedings of The 25th International Conference on Artificial Intelligence and Statistics</source> (<string-name><given-names>G.</given-names> <surname>Camps-Valls</surname></string-name>, <string-name><given-names>F. J. R.</given-names> <surname>Ruiz</surname></string-name> and <string-name><given-names>I.</given-names> <surname>Valera</surname></string-name>, eds.). <series>Proceedings of Machine Learning Research</series> <volume>151</volume> <fpage>10322</fpage>–<lpage>10350</lpage>. <publisher-name>PMLR</publisher-name>. <uri>https://proceedings.mlr.press/v151/muecke22a.html</uri>.</mixed-citation>
</ref>
<ref id="j_nejsds36_ref_032">
<label>[32]</label><mixed-citation publication-type="journal"> <string-name><surname>Schifano</surname>, <given-names>E. D.</given-names></string-name>, <string-name><surname>Wu</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Yan</surname>, <given-names>J.</given-names></string-name> and <string-name><surname>Chen</surname>, <given-names>M. -H.</given-names></string-name> (<year>2016</year>). <article-title>Online updating of statistical inference in the big data setting</article-title>. <source>Technometrics</source> <volume>58</volume>(<issue>3</issue>) <fpage>393</fpage>–<lpage>403</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1080/00401706.2016.1142900" xlink:type="simple">https://doi.org/10.1080/00401706.2016.1142900</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=3520668">MR3520668</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_033">
<label>[33]</label><mixed-citation publication-type="journal"> <string-name><surname>Shao</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Song</surname>, <given-names>S.</given-names></string-name> and <string-name><surname>Zhou</surname>, <given-names>Y.</given-names></string-name> (<year>2022</year>). <article-title>Optimal subsampling for large-sample quantile regression with massive data</article-title>. <source>Canadian Journal of Statistics</source>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=4595236">MR4595236</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_034">
<label>[34]</label><mixed-citation publication-type="journal"> <string-name><surname>Song</surname>, <given-names>Q.</given-names></string-name> and <string-name><surname>Liang</surname>, <given-names>F.</given-names></string-name> (<year>2015</year>). <article-title>A split-and-merge Bayesian variable selection approach for ultrahigh dimensional regression</article-title>. <source>Journal of the Royal Statistical Society: Series B: Statistical Methodology</source> <volume>77</volume> <fpage>947</fpage>–<lpage>972</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1111/rssb.12095" xlink:type="simple">https://doi.org/10.1111/rssb.12095</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=3414135">MR3414135</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_035">
<label>[35]</label><mixed-citation publication-type="journal"> <string-name><surname>Tibshirani</surname>, <given-names>R.</given-names></string-name> (<year>1996</year>). <article-title>Regression shrinkage and selection via the lasso</article-title>. <source>Journal of the Royal Statistical Society: Series B (Methodological)</source> <volume>58</volume>(<issue>1</issue>) <fpage>267</fpage>–<lpage>288</lpage>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=1379242">MR1379242</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_036">
<label>[36]</label><mixed-citation publication-type="chapter"> <string-name><surname>Ting</surname>, <given-names>D.</given-names></string-name> and <string-name><surname>Brochu</surname>, <given-names>E.</given-names></string-name> (<year>2018</year>). <chapter-title>Optimal subsampling with influence functions</chapter-title>. In <source>Advances in neural information processing systems</source> <fpage>3650</fpage>–<lpage>3659</lpage>.</mixed-citation>
</ref>
<ref id="j_nejsds36_ref_037">
<label>[37]</label><mixed-citation publication-type="other"> <string-name><surname>Vakayil</surname>, <given-names>A.</given-names></string-name> and <string-name><surname>Joseph</surname>, <given-names>V. R.</given-names></string-name> (<year>2021</year>). <article-title>Data Twinning</article-title>. <italic>arXiv preprint</italic> <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2110.02927"><italic>arXiv:2110.02927</italic></ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=4501911">MR4501911</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_038">
<label>[38]</label><mixed-citation publication-type="journal"> <string-name><surname>Wang</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Chen</surname>, <given-names>M. -H.</given-names></string-name>, <string-name><surname>Schifano</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Wu</surname>, <given-names>J.</given-names></string-name> and <string-name><surname>Yan</surname>, <given-names>J.</given-names></string-name> (<year>2016</year>). <article-title>Statistical methods and computing for big data</article-title>. <source>Statistics and Its Interface</source> <volume>9</volume>(<issue>4</issue>) <fpage>399</fpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.4310/SII.2016.v9.n4.a1" xlink:type="simple">https://doi.org/10.4310/SII.2016.v9.n4.a1</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=3553369">MR3553369</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_039">
<label>[39]</label><mixed-citation publication-type="journal"> <string-name><surname>Wang</surname>, <given-names>H.</given-names></string-name> (<year>2019</year>). <article-title>More efficient estimation for logistic regression with optimal subsamples</article-title>. <source>Journal of Machine Learning Research</source> <volume>20</volume>(<issue>132</issue>) <fpage>1</fpage>–<lpage>59</lpage>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=4002886">MR4002886</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_040">
<label>[40]</label><mixed-citation publication-type="journal"> <string-name><surname>Wang</surname>, <given-names>H.</given-names></string-name> and <string-name><surname>Ma</surname>, <given-names>Y.</given-names></string-name> (<year>2020</year>). <article-title>Optimal subsampling for quantile regression in big data</article-title>. <source>Biometrika</source>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1093/biomet/asaa043" xlink:type="simple">https://doi.org/10.1093/biomet/asaa043</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=4226192">MR4226192</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_041">
<label>[41]</label><mixed-citation publication-type="journal"> <string-name><surname>Wang</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Yang</surname>, <given-names>M.</given-names></string-name> and <string-name><surname>Stufken</surname>, <given-names>J.</given-names></string-name> (<year>2019</year>). <article-title>Information-based optimal subdata selection for big data linear regression</article-title>. <source>Journal of the American Statistical Association</source> <volume>114</volume>(<issue>525</issue>) <fpage>393</fpage>–<lpage>405</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1080/01621459.2017.1408468" xlink:type="simple">https://doi.org/10.1080/01621459.2017.1408468</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=3941263">MR3941263</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_042">
<label>[42]</label><mixed-citation publication-type="journal"> <string-name><surname>Wang</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Zhu</surname>, <given-names>R.</given-names></string-name> and <string-name><surname>Ma</surname>, <given-names>P.</given-names></string-name> (<year>2018</year>). <article-title>Optimal subsampling for large sample logistic regression</article-title>. <source>Journal of the American Statistical Association</source> <volume>113</volume>(<issue>522</issue>) <fpage>829</fpage>–<lpage>844</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1080/01621459.2017.1292914" xlink:type="simple">https://doi.org/10.1080/01621459.2017.1292914</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=3832230">MR3832230</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_043">
<label>[43]</label><mixed-citation publication-type="journal"> <string-name><surname>Wang</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Elmstedt</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Wong</surname>, <given-names>W. K.</given-names></string-name> and <string-name><surname>Xu</surname>, <given-names>H.</given-names></string-name> (<year>2021</year>). <article-title>Orthogonal subsampling for big data linear regression</article-title>. <source>The Annals of Applied Statistics</source> <volume>15</volume>(<issue>3</issue>) <fpage>1273</fpage>–<lpage>1290</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1214/21-aoas1462" xlink:type="simple">https://doi.org/10.1214/21-aoas1462</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=4316648">MR4316648</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_044">
<label>[44]</label><mixed-citation publication-type="other"> <string-name><surname>Wang</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Yang</surname>, <given-names>M.</given-names></string-name> and <string-name><surname>Li</surname>, <given-names>W.</given-names></string-name> (2022). Efficient Data Reduction Strategies for Big Data and High-Dimensional LASSO Regressions. <italic>in preparation</italic>.</mixed-citation>
</ref>
<ref id="j_nejsds36_ref_045">
<label>[45]</label><mixed-citation publication-type="journal"> <string-name><surname>Xue</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Yan</surname>, <given-names>J.</given-names></string-name> and <string-name><surname>Schifano</surname>, <given-names>E. D.</given-names></string-name> (<year>2020</year>). <article-title>An online updating approach for testing the proportional hazards assumption with streams of survival data</article-title>. <source>Biometrics</source> <volume>76</volume>(<issue>1</issue>) <fpage>171</fpage>–<lpage>182</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1111/biom.13137" xlink:type="simple">https://doi.org/10.1111/biom.13137</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=4098553">MR4098553</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_046">
<label>[46]</label><mixed-citation publication-type="journal"> <string-name><surname>Yao</surname>, <given-names>Y.</given-names></string-name> and <string-name><surname>Wang</surname>, <given-names>H.</given-names></string-name> (<year>2019</year>). <article-title>Optimal subsampling for softmax regression</article-title>. <source>Statistical Papers</source> <volume>60</volume>(<issue>2</issue>) <fpage>235</fpage>–<lpage>249</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1007/s00362-018-01068-6" xlink:type="simple">https://doi.org/10.1007/s00362-018-01068-6</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=3969047">MR3969047</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_047">
<label>[47]</label><mixed-citation publication-type="journal"> <string-name><surname>Yao</surname>, <given-names>Y.</given-names></string-name> and <string-name><surname>Wang</surname>, <given-names>H.</given-names></string-name> (<year>2021</year>). <article-title>A review on optimal subsampling methods for massive datasets</article-title>. <source>Journal of Data Science</source> <volume>19</volume>(<issue>1</issue>) <fpage>151</fpage>–<lpage>172</lpage>.</mixed-citation>
</ref>
<ref id="j_nejsds36_ref_048">
<label>[48]</label><mixed-citation publication-type="journal"> <string-name><surname>Yao</surname>, <given-names>Y.</given-names></string-name> and <string-name><surname>Wang</surname>, <given-names>H.</given-names></string-name> (<year>2021</year>). <article-title>A Selective Review on Statistical Techniques for Big Data</article-title>. <source>Modern Statistical Methods for Health Research</source> <fpage>223</fpage>–<lpage>245</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1007/978-3-030-72437-5_11" xlink:type="simple">https://doi.org/10.1007/978-3-030-72437-5_11</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=4367515">MR4367515</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_049">
<label>[49]</label><mixed-citation publication-type="journal"> <string-name><surname>Yu</surname>, <given-names>J.</given-names></string-name> and <string-name><surname>Wang</surname>, <given-names>H.</given-names></string-name> (<year>2022</year>). <article-title>Subdata selection algorithm for linear model discrimination</article-title>. <source>Statistical Papers</source> <fpage>1</fpage>–<lpage>24</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1007/s00362-022-01299-8" xlink:type="simple">https://doi.org/10.1007/s00362-022-01299-8</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=4512216">MR4512216</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_050">
<label>[50]</label><mixed-citation publication-type="journal"> <string-name><surname>Yu</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Ai</surname>, <given-names>M.</given-names></string-name> and <string-name><surname>Zhang</surname>, <given-names>H.</given-names></string-name> (<year>2020</year>). <article-title>Optimal distributed subsampling for maximum quasi-likelihood estimators with massive data</article-title>. <source>Journal of the American Statistical Association</source> <fpage>1</fpage>–<lpage>12</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1080/01621459.2020.1773832" xlink:type="simple">https://doi.org/10.1080/01621459.2020.1773832</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=4399084">MR4399084</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_051">
<label>[51]</label><mixed-citation publication-type="journal"> <string-name><surname>Yuan</surname>, <given-names>M.</given-names></string-name> and <string-name><surname>Lin</surname>, <given-names>Y.</given-names></string-name> (<year>2007</year>). <article-title>On the non-negative garrotte estimator</article-title>. <source>Journal of the Royal Statistical Society: Series B (Statistical Methodology)</source> <volume>69</volume>(<issue>2</issue>) <fpage>143</fpage>–<lpage>161</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1111/j.1467-9868.2007.00581.x" xlink:type="simple">https://doi.org/10.1111/j.1467-9868.2007.00581.x</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=2325269">MR2325269</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_052">
<label>[52]</label><mixed-citation publication-type="journal"> <string-name><surname>Zhang</surname>, <given-names>H.</given-names></string-name> and <string-name><surname>Wang</surname>, <given-names>H.</given-names></string-name> (<year>2021</year>). <article-title>Distributed subdata selection for big data via sampling-based approach</article-title>. <source>Computational Statistics &amp; Data Analysis</source> <volume>153</volume>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.csda.2020.107072" xlink:type="simple">https://doi.org/10.1016/j.csda.2020.107072</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=4144200">MR4144200</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_053">
<label>[53]</label><mixed-citation publication-type="journal"> <string-name><surname>Zhang</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Ning</surname>, <given-names>Y.</given-names></string-name> and <string-name><surname>Ruppert</surname>, <given-names>D.</given-names></string-name> (<year>2021</year>). <article-title>Optimal sampling for generalized linear models under measurement constraints</article-title>. <source>Journal of Computational and Graphical Statistics</source> <volume>30</volume>(<issue>1</issue>) <fpage>106</fpage>–<lpage>114</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1080/10618600.2020.1778483" xlink:type="simple">https://doi.org/10.1080/10618600.2020.1778483</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=4235968">MR4235968</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds36_ref_054">
<label>[54]</label><mixed-citation publication-type="journal"> <string-name><surname>Zhao</surname>, <given-names>P.</given-names></string-name> and <string-name><surname>Yu</surname>, <given-names>B.</given-names></string-name> (<year>2006</year>). <article-title>On model selection consistency of Lasso</article-title>. <source>The Journal of Machine Learning Research</source> <volume>7</volume> <fpage>2541</fpage>–<lpage>2563</lpage>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=2274449">MR2274449</ext-link></mixed-citation>
</ref>
</ref-list>
</back>
</article>
