<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">NEJSDS</journal-id>
<journal-title-group><journal-title>The New England Journal of Statistics in Data Science</journal-title></journal-title-group>
<issn pub-type="ppub">2693-7166</issn><issn-l>2693-7166</issn-l>
<publisher>
<publisher-name>New England Statistical Society</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">NEJSDS67</article-id>
<article-id pub-id-type="doi">10.51387/24-NEJSDS67</article-id>
<article-categories><subj-group subj-group-type="area">
<subject>Software</subject></subj-group><subj-group subj-group-type="heading">
<subject>Software Tutorial and/or Review</subject></subj-group></article-categories>
<title-group>
<article-title><bold>SurrogateRsq</bold>: An R Package for Categorical Data Goodness-of-Fit Analysis Using the Surrogate <inline-formula id="j_nejsds67_ineq_001"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula></article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Zhu</surname><given-names>Xiaorui</given-names></name><email xlink:href="mailto:xzhu@towson.edu">xzhu@towson.edu</email><xref ref-type="aff" rid="j_nejsds67_aff_001"/><xref ref-type="fn" rid="j_nejsds67_fn_001">1</xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Lin</surname><given-names>Zewei</given-names></name><email xlink:href="mailto:linzw@mail.uc.edu">linzw@mail.uc.edu</email><xref ref-type="aff" rid="j_nejsds67_aff_002"/><xref ref-type="fn" rid="j_nejsds67_fn_001">1</xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Liu</surname><given-names>Dungang</given-names></name><email xlink:href="mailto:liudg@ucmail.uc.edu">liudg@ucmail.uc.edu</email><xref ref-type="aff" rid="j_nejsds67_aff_003"/><xref ref-type="corresp" rid="cor1">∗</xref><xref ref-type="fn" rid="j_nejsds67_fn_001">1</xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Greenwell</surname><given-names>Brandon</given-names></name><email xlink:href="mailto:greenwell.brandon@gmail.com">greenwell.brandon@gmail.com</email><xref ref-type="aff" rid="j_nejsds67_aff_004"/><xref ref-type="fn" rid="j_nejsds67_fn_001">1</xref>
</contrib>
<aff id="j_nejsds67_aff_001">Department of Business Analytics &amp; Technology Management, College of Business &amp; Economics, <institution>Towson University</institution>, <country>USA</country>. E-mail address: <email xlink:href="mailto:xzhu@towson.edu">xzhu@towson.edu</email></aff>
<aff id="j_nejsds67_aff_002">Department of Operation, Business Analytics, and Information Systems, Lindner College of Business, <institution>University of Cincinnati</institution>, <country>USA</country>. E-mail address: <email xlink:href="mailto:linzw@mail.uc.edu">linzw@mail.uc.edu</email></aff>
<aff id="j_nejsds67_aff_003">Department of Operation, Business Analytics, and Information Systems, Lindner College of Business, <institution>University of Cincinnati</institution>, <country>USA</country>. E-mail address: <email xlink:href="mailto:liudg@ucmail.uc.edu">liudg@ucmail.uc.edu</email></aff>
<aff id="j_nejsds67_aff_004">84.51<sup>∘</sup> and <institution>University of Cincinnati</institution>, <country>USA</country>. E-mail address: <email xlink:href="mailto:greenwell.brandon@gmail.com">greenwell.brandon@gmail.com</email></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>∗</label>Corresponding author.</corresp><fn id="j_nejsds67_fn_001"><label>1</label>
<p>The authors thank the associate editor and two referees for their invaluable feedback during the review process. Their expertise and insights improved the quality of the work.</p></fn>
</author-notes>
<pub-date pub-type="ppub"><year>2025</year></pub-date><pub-date pub-type="epub"><day>17</day><month>6</month><year>2024</year></pub-date><volume>3</volume><issue>1</issue><fpage>94</fpage><lpage>105</lpage><history><date date-type="accepted"><day>9</day><month>5</month><year>2024</year></date></history>
<permissions><copyright-statement>© 2025 New England Statistical Society</copyright-statement><copyright-year>2025</copyright-year>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>Open access article under the <ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">CC BY</ext-link> license.</license-p></license></permissions>
<abstract>
<p>Categorical data are prevalent in almost all research fields and business applications. Their statistical analysis and inference often rely on probit/logistic regression models. For these common models, however, there is no universally adopted measure for performing goodness-of-fit analysis. To this end, [<xref ref-type="bibr" rid="j_nejsds67_ref_026">26</xref>] proposed a so-called surrogate <inline-formula id="j_nejsds67_ineq_002"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> that resembles the ordinary least squares (OLS) <inline-formula id="j_nejsds67_ineq_003"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> for linear regression models. The surrogate <inline-formula id="j_nejsds67_ineq_004"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> uses the notion of surrogacy, namely, generating a continuous response <italic>S</italic> and using it as a surrogate for the original categorical response <italic>Y</italic> [<xref ref-type="bibr" rid="j_nejsds67_ref_024">24</xref>, <xref ref-type="bibr" rid="j_nejsds67_ref_025">25</xref>, <xref ref-type="bibr" rid="j_nejsds67_ref_008">8</xref>]. In this paper, we develop an R package <inline-formula id="j_nejsds67_ineq_005"><alternatives><mml:math>
<mml:mi mathvariant="bold">SurrogateRsq</mml:mi></mml:math><tex-math><![CDATA[$\mathbf{SurrogateRsq}$]]></tex-math></alternatives></inline-formula> to implement the surrogate <inline-formula id="j_nejsds67_ineq_006"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> method [<xref ref-type="bibr" rid="j_nejsds67_ref_043">43</xref>]. The package is compatible with existing model fitting functions (e.g., <monospace>glm()</monospace>, <monospace>polr()</monospace>, <monospace>clm()</monospace>, and <monospace>vglm()</monospace>), and its features are exhibited in a wine rating analysis. Our package can be used jointly with other R packages developed for variable selection and model diagnostics so as to form a complete model development process. This process is summarized and demonstrated in a categorical-data-modeling workflow that practitioners can follow. To exemplify an extended utility of the surrogate-<inline-formula id="j_nejsds67_ineq_007"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula>-based goodness-of-fit analysis, we also use this package to illustrate how to compare different empirical models trained from different samples in the wine rating analysis. The result suggests that the package allows us to evaluate comparability across multiple samples/models/studies that address the same or similar scientific or business questions.</p>
</abstract>
<kwd-group>
<label>Keywords and phrases</label>
<kwd>Categorical data analysis</kwd>
<kwd>Goodness-of-fit measure</kwd>
<kwd>Logistic regression</kwd>
<kwd>Model comparison</kwd>
<kwd>Probit model</kwd>
<kwd>Surrogate method</kwd>
<kwd>Surrogate residual</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="j_nejsds67_s_001">
<label>1</label>
<title>Introduction</title>
<p>Categorical data are prevalent in all areas, including economics, marketing, finance, psychology, and clinical studies. To analyze categorical data, probit or logit models are often used to make inferences. To perform model assessment and comparison, researchers often rely on goodness-of-fit measures, such as <inline-formula id="j_nejsds67_ineq_008"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> (also known as the coefficient of determination). For example, the ordinary least squares (OLS) <inline-formula id="j_nejsds67_ineq_009"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> is one of the most extensively used goodness-of-fit measures for linear models in continuous data analysis. For categorical data analysis, however, there is no such universally adopted <inline-formula id="j_nejsds67_ineq_010"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> measure [<xref ref-type="bibr" rid="j_nejsds67_ref_019">19</xref>, <xref ref-type="bibr" rid="j_nejsds67_ref_039">39</xref>]. Continuous efforts have been made to develop sensible <inline-formula id="j_nejsds67_ineq_011"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> measures for probit/logistic models, and more generally, generalized linear models [<xref ref-type="bibr" rid="j_nejsds67_ref_029">29</xref>, <xref ref-type="bibr" rid="j_nejsds67_ref_030">30</xref>, <xref ref-type="bibr" rid="j_nejsds67_ref_014">14</xref>, <xref ref-type="bibr" rid="j_nejsds67_ref_011">11</xref>, <xref ref-type="bibr" rid="j_nejsds67_ref_022">22</xref>, <xref ref-type="bibr" rid="j_nejsds67_ref_042">42</xref>, <xref ref-type="bibr" rid="j_nejsds67_ref_027">27</xref>, <xref ref-type="bibr" rid="j_nejsds67_ref_021">21</xref>, <xref ref-type="bibr" rid="j_nejsds67_ref_026">26</xref>]. Among the existing <inline-formula id="j_nejsds67_ineq_012"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> measures, McKelvey-Zavoina’s <inline-formula id="j_nejsds67_ineq_013"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
<mml:mi mathvariant="italic">Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup></mml:math><tex-math><![CDATA[${R_{MZ}^{2}}$]]></tex-math></alternatives></inline-formula> [<xref ref-type="bibr" rid="j_nejsds67_ref_030">30</xref>] and McFadden’s <inline-formula id="j_nejsds67_ineq_014"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> [<xref ref-type="bibr" rid="j_nejsds67_ref_029">29</xref>] are probably the best known and most widely used in domain research [<xref ref-type="bibr" rid="j_nejsds67_ref_019">19</xref>, <xref ref-type="bibr" rid="j_nejsds67_ref_039">39</xref>]. However, as demonstrated in [<xref ref-type="bibr" rid="j_nejsds67_ref_026">26</xref>], McKelvey-Zavoina’s <inline-formula id="j_nejsds67_ineq_015"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
<mml:mi mathvariant="italic">Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup></mml:math><tex-math><![CDATA[${R_{MZ}^{2}}$]]></tex-math></alternatives></inline-formula> does not preserve monotonicity; that is, a larger model may have a smaller <inline-formula id="j_nejsds67_ineq_016"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
<mml:mi mathvariant="italic">Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup></mml:math><tex-math><![CDATA[${R_{MZ}^{2}}$]]></tex-math></alternatives></inline-formula>. This serious defect of <inline-formula id="j_nejsds67_ineq_017"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
<mml:mi mathvariant="italic">Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup></mml:math><tex-math><![CDATA[${R_{MZ}^{2}}$]]></tex-math></alternatives></inline-formula> may be misleading in practice and misguide the model-building process. On the other hand, McFadden’s <inline-formula id="j_nejsds67_ineq_018"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> relies on the ratio of log-likelihoods, and it does not preserve the interpretation of explained variance. Neither of these two <inline-formula id="j_nejsds67_ineq_019"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> measures meets all three criteria considered in [<xref ref-type="bibr" rid="j_nejsds67_ref_026">26</xref>]:</p>
<list>
<list-item id="j_nejsds67_li_001">
<label>(C1)</label>
<p>It can approximate the OLS <inline-formula id="j_nejsds67_ineq_020"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> based on the latent continuous outcome.</p>
</list-item>
<list-item id="j_nejsds67_li_002">
<label>(C2)</label>
<p>It has the interpretation of the explained proportion of variance.</p>
</list-item>
<list-item id="j_nejsds67_li_003">
<label>(C3)</label>
<p>It maintains the monotonicity property between nested models, which means that a larger model should have a larger <inline-formula id="j_nejsds67_ineq_021"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> value.</p>
</list-item>
</list>
<p>[<xref ref-type="bibr" rid="j_nejsds67_ref_026">26</xref>] proposed a so-called surrogate <inline-formula id="j_nejsds67_ineq_022"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> that satisfies all three criteria for probit models. This surrogate <inline-formula id="j_nejsds67_ineq_023"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> uses the notion of surrogacy, namely, generating a continuous response <italic>S</italic> and using it as a surrogate for the original categorical response <italic>Y</italic> [<xref ref-type="bibr" rid="j_nejsds67_ref_024">24</xref>, <xref ref-type="bibr" rid="j_nejsds67_ref_025">25</xref>, <xref ref-type="bibr" rid="j_nejsds67_ref_008">8</xref>, <xref ref-type="bibr" rid="j_nejsds67_ref_018">18</xref>, <xref ref-type="bibr" rid="j_nejsds67_ref_023">23</xref>]. In the context of probit analysis, [<xref ref-type="bibr" rid="j_nejsds67_ref_026">26</xref>] used the truncated distributions induced by the latent variable structure to generate a surrogate response <italic>S</italic>. This surrogate response <italic>S</italic> is then regressed on explanatory variables through a linear model. The OLS <inline-formula id="j_nejsds67_ineq_024"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> of this linear model is used as a surrogate <inline-formula id="j_nejsds67_ineq_025"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> for the original probit model. This surrogate <inline-formula id="j_nejsds67_ineq_026"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> meets all three criteria (C1)–(C3).</p>
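<p>The surrogate idea described above can be sketched in a few lines of base R. The following is only an illustrative sketch for a binary probit model fitted to simulated data, not the implementation used in any package: the simulated coefficients (0.5 and 1.0), the sample size, and all object names are assumptions made for this example, and only a single surrogate draw is generated.</p>
<preformat preformat-type="code"><![CDATA[
# Minimal sketch: surrogate R-squared for a binary probit model (base R only).
set.seed(2024)

# Simulate data from a latent-variable probit model (illustrative values).
n <- 2000
x <- rnorm(n)
z <- 0.5 + 1.0 * x + rnorm(n)   # latent continuous outcome Z
y <- as.integer(z > 0)          # observed binary response Y

# Fit the probit model to the observed categorical response.
fit <- glm(y ~ x, family = binomial(link = "probit"))
eta <- predict(fit, type = "link")   # fitted linear predictor

# Draw S from the distribution of Z = eta + eps, truncated to the side of 0
# that agrees with the observed Y (inverse-CDF sampling of eps).
p0 <- pnorm(-eta)                    # P(eps <= -eta), i.e., P(Z <= 0)
u  <- runif(n)
v  <- ifelse(y == 1, p0 + u * (1 - p0), u * p0)
s  <- eta + qnorm(v)                 # surrogate response S

# Regress the surrogate on the explanatory variable; the OLS R-squared of
# this linear model serves as the surrogate R-squared.
surrogate_r2 <- summary(lm(s ~ x))$r.squared
latent_r2    <- summary(lm(z ~ x))$r.squared  # available only in a simulation
print(c(surrogate = surrogate_r2, latent = latent_r2))
]]></preformat>
<p>Because <italic>S</italic> is drawn at random, the value changes from draw to draw; averaging over several draws is one way to stabilize it. In a simulation such as this one, the result can be compared with the OLS <italic>R</italic><sup>2</sup> of the latent outcome, which is unobservable in real data.</p>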
<p>The goals of this paper are to (i) develop an R package that implements the method of [<xref ref-type="bibr" rid="j_nejsds67_ref_026">26</xref>]; (ii) demonstrate how this new package can be used jointly with existing R packages for variable selection and model diagnostics in the model-building process; and (iii) illustrate how this package can be used to compare empirical models trained on different samples (i.e., comparability) in real data analysis.</p>
<p>Specifically, we first develop an R package to implement the surrogate <inline-formula id="j_nejsds67_ineq_027"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> method for probit/logistic regression models. This package contains the R functions for generating the point and interval estimates of the surrogate <inline-formula id="j_nejsds67_ineq_028"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> measure. The point and interval estimates allow researchers and practitioners to evaluate the model’s overall goodness of fit and understand its uncertainty. In addition, we develop an R function that calculates the percentage contribution of each variable to the overall surrogate <inline-formula id="j_nejsds67_ineq_029"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula>. This percentage reflects each variable’s contribution to the model’s overall explanatory power. Based on the contribution’s relative size, our R function provides an “importance” ranking of all the explanatory variables.</p>
<p>Second, to provide practical guidance for categorical data modeling, we use the developed R package to demonstrate how it can be used jointly with other R packages developed for variable screening/selection and model diagnostics (<inline-formula id="j_nejsds67_ineq_030"><alternatives><mml:math>
<mml:mi mathvariant="bold">leaps</mml:mi></mml:math><tex-math><![CDATA[$\mathbf{leaps}$]]></tex-math></alternatives></inline-formula> [<xref ref-type="bibr" rid="j_nejsds67_ref_028">28</xref>], the <monospace>step()</monospace> function from the R core, <bold>glmnet</bold> [<xref ref-type="bibr" rid="j_nejsds67_ref_017">17</xref>], <bold>ordinalNet</bold> [<xref ref-type="bibr" rid="j_nejsds67_ref_040">40</xref>], <bold>ncvreg</bold> [<xref ref-type="bibr" rid="j_nejsds67_ref_005">5</xref>], <bold>grpreg</bold> [<xref ref-type="bibr" rid="j_nejsds67_ref_006">6</xref>], <bold>SIS</bold> [<xref ref-type="bibr" rid="j_nejsds67_ref_034">34</xref>], <bold>sure</bold> [<xref ref-type="bibr" rid="j_nejsds67_ref_018">18</xref>], <inline-formula id="j_nejsds67_ineq_031"><alternatives><mml:math>
<mml:mi mathvariant="bold">PAsso</mml:mi></mml:math><tex-math><![CDATA[$\mathbf{PAsso}$]]></tex-math></alternatives></inline-formula> [<xref ref-type="bibr" rid="j_nejsds67_ref_044">44</xref>]). In particular, we recommend a workflow that consists of three steps: variable screening/selection, model diagnostics, and goodness-of-fit analysis. The workflow is illustrated in the analysis of wine-tasting preference datasets.</p>
<p>Third, the comparability of the surrogate <inline-formula id="j_nejsds67_ineq_032"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> across different samples and/or models allows us to compare goodness-of-fit analysis from similar studies. The comparison can lead to additional scientific and business insights which may be useful for decision making. To illustrate this, we conduct goodness-of-fit analysis separately for the red wine and white wine samples to demonstrate the comparability of the surrogate <inline-formula id="j_nejsds67_ineq_033"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula>. Our analysis reveals that (i) the same set of explanatory variables has different explanatory power for red wine and white wine (43.8% versus 31.0%), and (ii) the importance ranking of the explanatory variables (in terms of their contribution to the surrogate <inline-formula id="j_nejsds67_ineq_034"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula>) is different between red wine and white wine.</p>
<p>Our <inline-formula id="j_nejsds67_ineq_035"><alternatives><mml:math>
<mml:mi mathvariant="bold">SurrogateRsq</mml:mi></mml:math><tex-math><![CDATA[$\mathbf{SurrogateRsq}$]]></tex-math></alternatives></inline-formula> package has broad applicability. It is compatible with the following R functions that can fit probit/logistic regression models for a binary or ordinal response: <monospace>glm()</monospace> in the <bold>R</bold> core, <monospace>polr()</monospace> in the <inline-formula id="j_nejsds67_ineq_036"><alternatives><mml:math>
<mml:mi mathvariant="bold">MASS</mml:mi></mml:math><tex-math><![CDATA[$\mathbf{MASS}$]]></tex-math></alternatives></inline-formula> package [<xref ref-type="bibr" rid="j_nejsds67_ref_033">33</xref>], <monospace>clm()</monospace> in the <inline-formula id="j_nejsds67_ineq_037"><alternatives><mml:math>
<mml:mi mathvariant="bold">ordinal</mml:mi></mml:math><tex-math><![CDATA[$\mathbf{ordinal}$]]></tex-math></alternatives></inline-formula> package [<xref ref-type="bibr" rid="j_nejsds67_ref_009">9</xref>], and <monospace>vglm()</monospace> in the <inline-formula id="j_nejsds67_ineq_038"><alternatives><mml:math>
<mml:mi mathvariant="bold">VGAM</mml:mi></mml:math><tex-math><![CDATA[$\mathbf{VGAM}$]]></tex-math></alternatives></inline-formula> package [<xref ref-type="bibr" rid="j_nejsds67_ref_041">41</xref>], although we only demonstrate the functions of the <inline-formula id="j_nejsds67_ineq_039"><alternatives><mml:math>
<mml:mi mathvariant="bold">SurrogateRsq</mml:mi></mml:math><tex-math><![CDATA[$\mathbf{SurrogateRsq}$]]></tex-math></alternatives></inline-formula> package through ordinal probit regression models fitted with <monospace>polr()</monospace> in this paper. More examples and details can be found on the website: <uri>https://xiaorui.site/SurrogateRsq/</uri>.</p>
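<p>As a minimal sketch of the kinds of fitted objects referred to above, the following simulated example fits a binary probit model with <monospace>glm()</monospace> and an ordinal probit model with <monospace>polr()</monospace> from <bold>MASS</bold>. The data, the cutpoints, and the object names are assumptions made for illustration, and no call to the package itself is shown.</p>
<preformat preformat-type="code"><![CDATA[
library(MASS)  # provides polr()
set.seed(1)

n <- 1000
x <- rnorm(n)
z <- x + rnorm(n)   # latent continuous outcome (simulation only)

# Ordinal response with four ordered categories (cutpoints are illustrative).
y_ord <- cut(z, breaks = c(-Inf, -1, 0, 1, Inf),
             labels = 1:4, ordered_result = TRUE)
# Binary response obtained by dichotomizing the same latent outcome.
y_bin <- as.integer(z > 0)

# Ordinal probit model and binary probit model.
fit_polr <- polr(y_ord ~ x, method = "probit")
fit_glm  <- glm(y_bin ~ x, family = binomial(link = "probit"))

print(coef(fit_polr))   # slope estimate
print(fit_polr$zeta)    # estimated cutpoints
print(coef(fit_glm))    # intercept and slope
]]></preformat>
<p>Both fitted objects store the link function and the estimated coefficients, which is the information a surrogate-based goodness-of-fit analysis needs from the fitted model.</p>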
</sec>
<sec id="j_nejsds67_s_002">
<label>2</label>
<title>Review of the Surrogate <inline-formula id="j_nejsds67_ineq_040"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula></title>
<p>We briefly review the surrogate <inline-formula id="j_nejsds67_ineq_041"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> measure in the study of [<xref ref-type="bibr" rid="j_nejsds67_ref_026">26</xref>]. For the model setting, we consider a probit/logit model with a set of explanatory variables. The categorical response is either a binary or ordinal variable <italic>Y</italic> that has <italic>J</italic> categories <inline-formula id="j_nejsds67_ineq_042"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">J</mml:mi>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$\{1,2,\dots ,J\}$]]></tex-math></alternatives></inline-formula>, with the order <inline-formula id="j_nejsds67_ineq_043"><alternatives><mml:math>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">&lt;</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo mathvariant="normal">&lt;</mml:mo>
<mml:mo stretchy="false">⋯</mml:mo>
<mml:mo mathvariant="normal">&lt;</mml:mo>
<mml:mi mathvariant="italic">J</mml:mi></mml:math><tex-math><![CDATA[$1\lt 2\lt \cdots \lt J$]]></tex-math></alternatives></inline-formula>, 
<disp-formula id="j_nejsds67_eq_001">
<label>(2.1)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mo movablelimits="false">Pr</mml:mo>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mi mathvariant="italic">Y</mml:mi>
<mml:mo stretchy="false">≤</mml:mo>
<mml:mi mathvariant="italic">j</mml:mi>
<mml:mo fence="true" stretchy="false">}</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">G</mml:mi>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">α</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>−</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:mo stretchy="false">⋯</mml:mo>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo fence="true" stretchy="false">}</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mspace width="1em"/>
<mml:mi mathvariant="italic">j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo movablelimits="false">…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mi mathvariant="italic">J</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ \Pr \{Y\le j\}=G\{{\alpha _{j}}-({\beta _{1}}{X_{1}}+\cdots +{\beta _{p}}{X_{p}})\},\hspace{1em}j=1,\dots ,J,\]]]></tex-math></alternatives>
</disp-formula> 
where <inline-formula id="j_nejsds67_ineq_044"><alternatives><mml:math>
<mml:mo>−</mml:mo>
<mml:mi>∞</mml:mi>
<mml:mo mathvariant="normal">&lt;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">α</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">&lt;</mml:mo>
<mml:mo stretchy="false">⋯</mml:mo>
<mml:mo mathvariant="normal">&lt;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">α</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">J</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">&lt;</mml:mo>
<mml:mo>+</mml:mo>
<mml:mi>∞</mml:mi></mml:math><tex-math><![CDATA[$-\infty \lt {\alpha _{1}}\lt \cdots \lt {\alpha _{J}}\lt +\infty $]]></tex-math></alternatives></inline-formula>. The link function <inline-formula id="j_nejsds67_ineq_045"><alternatives><mml:math>
<mml:mi mathvariant="italic">G</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mo>·</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$G(\cdot )$]]></tex-math></alternatives></inline-formula> can be a probit (<inline-formula id="j_nejsds67_ineq_046"><alternatives><mml:math>
<mml:mi mathvariant="italic">G</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mo>·</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="normal">Φ</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mo>·</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$G(\cdot )=\Phi (\cdot )$]]></tex-math></alternatives></inline-formula>) link or a logit (<inline-formula id="j_nejsds67_ineq_047"><alternatives><mml:math>
<mml:mi mathvariant="italic">G</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">η</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" stretchy="false">/</mml:mo>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>+</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">e</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:mi mathvariant="italic">η</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$G(\eta )=1/(1+{e^{-\eta }})$]]></tex-math></alternatives></inline-formula>). Each generic symbol of {<inline-formula id="j_nejsds67_ineq_048"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${X_{1}},\dots ,{X_{p}}$]]></tex-math></alternatives></inline-formula>} in Model (<xref rid="j_nejsds67_eq_001">2.1</xref>) can represent a single variable of interest, a higher-order term (e.g., <inline-formula id="j_nejsds67_ineq_049"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${X^{2}}$]]></tex-math></alternatives></inline-formula>), or an interaction term between <italic>X</italic> and another variable. It is well known that Model (<xref rid="j_nejsds67_eq_001">2.1</xref>) can be expressed equivalently through a latent variable. For example, if the link is probit, the latent variable has the following form with a standard normal <italic>ϵ</italic>: 
<disp-formula id="j_nejsds67_eq_002">
<alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">Z</mml:mi>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">α</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:mo stretchy="false">⋯</mml:mo>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">ϵ</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mspace width="1em"/>
<mml:mi mathvariant="italic">ϵ</mml:mi>
<mml:mo stretchy="false">∼</mml:mo>
<mml:mi mathvariant="italic">N</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ Z={\alpha _{1}}+{\beta _{1}}{X_{1}}+\cdots +{\beta _{p}}{X_{p}}+\epsilon ,\hspace{1em}\epsilon \sim N(0,1).\]]]></tex-math></alternatives>
</disp-formula>
</p>
<p>The categorical response <italic>Y</italic> can be viewed as generated by censoring the continuous latent variable <italic>Z</italic> in the following way: 
<disp-formula id="j_nejsds67_eq_003">
<alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">Y</mml:mi>
<mml:mo>=</mml:mo>
<mml:mfenced separators="" open="{" close="">
<mml:mrow>
<mml:mtable columnspacing="10.0pt" equalrows="false" columnlines="none" equalcolumns="false" columnalign="left left">
<mml:mtr>
<mml:mtd class="array">
<mml:mn>1</mml:mn>
</mml:mtd>
<mml:mtd class="array">
<mml:mspace width="1em"/>
<mml:mtext>if</mml:mtext>
<mml:mspace width="2.5pt"/>
<mml:mo>−</mml:mo>
<mml:mi>∞</mml:mi>
<mml:mo mathvariant="normal">&lt;</mml:mo>
<mml:mi mathvariant="italic">Z</mml:mi>
<mml:mo stretchy="false">≤</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">α</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">α</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="array">
<mml:mn>2</mml:mn>
</mml:mtd>
<mml:mtd class="array">
<mml:mspace width="1em"/>
<mml:mtext>if</mml:mtext>
<mml:mspace width="2.5pt"/>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">α</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">α</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">&lt;</mml:mo>
<mml:mi mathvariant="italic">Z</mml:mi>
<mml:mo stretchy="false">≤</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">α</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">α</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="array">
<mml:mo stretchy="false">⋯</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="array">
<mml:mi mathvariant="italic">J</mml:mi>
</mml:mtd>
<mml:mtd class="array">
<mml:mspace width="1em"/>
<mml:mtext>if</mml:mtext>
<mml:mspace width="2.5pt"/>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">α</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">J</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">α</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">&lt;</mml:mo>
<mml:mi mathvariant="italic">Z</mml:mi>
<mml:mo mathvariant="normal">&lt;</mml:mo>
<mml:mo>+</mml:mo>
<mml:mi>∞</mml:mi>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mfenced>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ Y=\left\{\begin{array}{l@{\hskip10.0pt}l}1& \hspace{1em}\text{if}\hspace{2.5pt}-\infty \lt Z\le {\alpha _{1}}+{\alpha _{1}},\\ {} 2& \hspace{1em}\text{if}\hspace{2.5pt}{\alpha _{1}}+{\alpha _{1}}\lt Z\le {\alpha _{2}}+{\alpha _{1}},\\ {} \cdots \\ {} J& \hspace{1em}\text{if}\hspace{2.5pt}{\alpha _{J-1}}+{\alpha _{1}}\lt Z\lt +\infty .\end{array}\right.\]]]></tex-math></alternatives>
</disp-formula>
</p>
<p>To construct a goodness-of-fit <inline-formula id="j_nejsds67_ineq_050"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula>, [<xref ref-type="bibr" rid="j_nejsds67_ref_026">26</xref>] adopted the surrogate approach proposed by [<xref ref-type="bibr" rid="j_nejsds67_ref_024">24</xref>]. The idea of the surrogate approach is to simulate a continuous variable and use it as a surrogate for the original categorical variable in the analysis [<xref ref-type="bibr" rid="j_nejsds67_ref_024">24</xref>, <xref ref-type="bibr" rid="j_nejsds67_ref_025">25</xref>, <xref ref-type="bibr" rid="j_nejsds67_ref_008">8</xref>]. In the context of probit models, [<xref ref-type="bibr" rid="j_nejsds67_ref_026">26</xref>] proposed generating a surrogate response variable <italic>S</italic> from the following truncated conditional distribution: 
<disp-formula id="j_nejsds67_eq_004">
<alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mo stretchy="false">∼</mml:mo>
<mml:mfenced separators="" open="{" close="">
<mml:mrow>
<mml:mtable columnspacing="10.0pt" equalrows="false" columnlines="none" equalcolumns="false" columnalign="left left">
<mml:mtr>
<mml:mtd class="array">
<mml:mi mathvariant="italic">Z</mml:mi>
<mml:mo stretchy="false">∣</mml:mo>
<mml:mo>−</mml:mo>
<mml:mi>∞</mml:mi>
<mml:mo mathvariant="normal">&lt;</mml:mo>
<mml:mi mathvariant="italic">Z</mml:mi>
<mml:mo stretchy="false">≤</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">α</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">α</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mtd>
<mml:mtd class="array">
<mml:mspace width="1em"/>
<mml:mtext>if</mml:mtext>
<mml:mspace width="2.5pt"/>
<mml:mi mathvariant="italic">Y</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="array">
<mml:mi mathvariant="italic">Z</mml:mi>
<mml:mo stretchy="false">∣</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">α</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">α</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">&lt;</mml:mo>
<mml:mi mathvariant="italic">Z</mml:mi>
<mml:mo stretchy="false">≤</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">α</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">α</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mtd>
<mml:mtd class="array">
<mml:mspace width="1em"/>
<mml:mtext>if</mml:mtext>
<mml:mspace width="2.5pt"/>
<mml:mi mathvariant="italic">Y</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="array">
<mml:mo stretchy="false">⋯</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="array">
<mml:mi mathvariant="italic">Z</mml:mi>
<mml:mo stretchy="false">∣</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">α</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">J</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">α</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">&lt;</mml:mo>
<mml:mi mathvariant="italic">Z</mml:mi>
<mml:mo mathvariant="normal">&lt;</mml:mo>
<mml:mo>+</mml:mo>
<mml:mi>∞</mml:mi>
</mml:mtd>
<mml:mtd class="array">
<mml:mspace width="1em"/>
<mml:mtext>if</mml:mtext>
<mml:mspace width="2.5pt"/>
<mml:mi mathvariant="italic">Y</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">J</mml:mi>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mfenced>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ S\sim \left\{\begin{array}{l@{\hskip10.0pt}l}Z\mid -\infty \lt Z\le {\alpha _{1}}+{\alpha _{1}}& \hspace{1em}\text{if}\hspace{2.5pt}Y=1,\\ {} Z\mid {\alpha _{1}}+{\alpha _{1}}\lt Z\le {\alpha _{2}}+{\alpha _{1}}& \hspace{1em}\text{if}\hspace{2.5pt}Y=2,\\ {} \cdots \\ {} Z\mid {\alpha _{J-1}}+{\alpha _{1}}\lt Z\lt +\infty & \hspace{1em}\text{if}\hspace{2.5pt}Y=J.\end{array}\right.\]]]></tex-math></alternatives>
</disp-formula>
</p>
<p>[<xref ref-type="bibr" rid="j_nejsds67_ref_026">26</xref>] proposed to regress the surrogate response <italic>S</italic> on {<inline-formula id="j_nejsds67_ineq_051"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${X_{1}},\dots ,{X_{p}}$]]></tex-math></alternatives></inline-formula>} using the linear model below: 
<disp-formula id="j_nejsds67_eq_005">
<label>(2.2)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">α</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:mo stretchy="false">⋯</mml:mo>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">ϵ</mml:mi>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mspace width="1em"/>
<mml:mi mathvariant="italic">ϵ</mml:mi>
<mml:mo stretchy="false">∼</mml:mo>
<mml:mi mathvariant="italic">N</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ S={\alpha _{1}}+{\beta _{1}}{X_{1}}+\cdots +{\beta _{p}}{X_{p}}+\epsilon ,\hspace{1em}\epsilon \sim N(0,1).\]]]></tex-math></alternatives>
</disp-formula> 
Their approach used the OLS <inline-formula id="j_nejsds67_ineq_052"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> measure of this linear model as a surrogate <inline-formula id="j_nejsds67_ineq_053"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> for Model (<xref rid="j_nejsds67_eq_001">2.1</xref>): 
<disp-formula id="j_nejsds67_eq_006">
<alternatives><mml:math display="block">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">p</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo fence="true" stretchy="false">}</mml:mo>
<mml:mo>=</mml:mo>
<mml:mtext>the OLS</mml:mtext>
<mml:mspace width="2.5pt"/>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mspace width="2.5pt"/>
<mml:mtext>of the linear model</mml:mtext>
<mml:mspace width="2.5pt"/>
<mml:mtext>(2.2)</mml:mtext>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[ {R_{(S)}^{2}}\{{X_{1}},\dots ,{X_{p}}\}=\text{the OLS}\hspace{2.5pt}{R^{2}}\hspace{2.5pt}\text{of the linear model}\hspace{2.5pt}\text{(2.2)}.\]]]></tex-math></alternatives>
</disp-formula>
</p>
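<p>The construction above can be sketched in a few lines of base R. The following is a minimal illustration under stated assumptions, not the implementation used in the package: the sample size, the cutpoints, the slope, and all object names are hypothetical, and the true parameter values stand in for the estimates that a fitted probit model would supply. The surrogate <italic>S</italic> is drawn from the truncated normal distribution by the inverse-CDF method, and its OLS <italic>R</italic><sup>2</sup> is compared with the one based on the latent <italic>Z</italic>, which is observable here only because the data are simulated.</p>
<preformat><![CDATA[
```r
# Illustrative simulation; every name and value here is an assumption.
set.seed(1)
n     <- 2000
x     <- rnorm(n)
beta  <- 1                      # slope of the latent model
alpha <- c(-1, 0, 1)            # alpha_1 < alpha_2 < alpha_3, so J = 4
cuts  <- alpha + alpha[1]       # shifted cutpoints alpha_j + alpha_1

# Latent variable Z and the ordinal response Y obtained by censoring Z
mu <- alpha[1] + beta * x
z  <- mu + rnorm(n)
y  <- findInterval(z, cuts, left.open = TRUE) + 1   # values in 1, ..., 4

# Surrogate S: Z restricted to the interval implied by the observed Y,
# drawn by the inverse-CDF method for a truncated normal distribution
lower <- c(-Inf, cuts)[y]
upper <- c(cuts, Inf)[y]
u <- runif(n, pnorm(lower - mu), pnorm(upper - mu))
s <- mu + qnorm(u)

# OLS R-squared of the surrogate and of the latent response
r2_surrogate <- summary(lm(s ~ x))$r.squared
r2_latent    <- summary(lm(z ~ x))$r.squared
c(surrogate = r2_surrogate, latent = r2_latent)
```
]]></preformat>
<p>In this simulated design the covariate and the error both have unit variance and the slope is one, so the population value of the latent-scale <italic>R</italic><sup>2</sup> is 1/(1 + 1) = 0.5, and both estimates should fall near that value.</p>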
<p>[<xref ref-type="bibr" rid="j_nejsds67_ref_026">26</xref>] showed that the surrogate <inline-formula id="j_nejsds67_ineq_054"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup></mml:math><tex-math><![CDATA[${R_{(S)}^{2}}$]]></tex-math></alternatives></inline-formula> measure has three desirable properties. First, it approximates the OLS <inline-formula id="j_nejsds67_ineq_055"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> calculated using the latent continuous outcome <italic>Z</italic>. This property enables us to compare surrogate <inline-formula id="j_nejsds67_ineq_056"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula>’s and OLS <inline-formula id="j_nejsds67_ineq_057"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula>’s across different models and samples that address the same scientific question. Second, as it is the OLS <inline-formula id="j_nejsds67_ineq_058"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> calculated using the continuous surrogate response <italic>S</italic>, the surrogate <inline-formula id="j_nejsds67_ineq_059"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup></mml:math><tex-math><![CDATA[${R_{(S)}^{2}}$]]></tex-math></alternatives></inline-formula> can be interpreted as an explained proportion of variance: it measures the proportion of the variance of the surrogate response <italic>S</italic> that is explained by the linear model. This proportion reflects the explanatory power of all the features in the fitted model. Third, the surrogate <inline-formula id="j_nejsds67_ineq_060"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup></mml:math><tex-math><![CDATA[${R_{(S)}^{2}}$]]></tex-math></alternatives></inline-formula> maintains monotonicity between nested models, which makes it suitable for comparing the relative explanatory power of different models. In contrast, the well-known McFadden’s <inline-formula id="j_nejsds67_ineq_061"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> does not preserve the first two properties of the surrogate <inline-formula id="j_nejsds67_ineq_062"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup></mml:math><tex-math><![CDATA[${R_{(S)}^{2}}$]]></tex-math></alternatives></inline-formula>. McFadden’s <inline-formula id="j_nejsds67_ineq_063"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> relies on the ratio of likelihoods, so it neither approximates the OLS <inline-formula id="j_nejsds67_ineq_064"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> nor preserves the interpretation of explained variance. On the other hand, [<xref ref-type="bibr" rid="j_nejsds67_ref_026">26</xref>] showed that McKelvey-Zavoina’s <inline-formula id="j_nejsds67_ineq_065"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
<mml:mi mathvariant="italic">Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup></mml:math><tex-math><![CDATA[${R_{MZ}^{2}}$]]></tex-math></alternatives></inline-formula> does not necessarily maintain monotonicity between nested models. This serious issue may make McKelvey-Zavoina’s <inline-formula id="j_nejsds67_ineq_066"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
<mml:mi mathvariant="italic">Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup></mml:math><tex-math><![CDATA[${R_{MZ}^{2}}$]]></tex-math></alternatives></inline-formula> an unsuitable tool for measuring the goodness of fit.</p>
<p>To make inferences for the surrogate <inline-formula id="j_nejsds67_ineq_067"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup></mml:math><tex-math><![CDATA[${R_{(S)}^{2}}$]]></tex-math></alternatives></inline-formula>, [<xref ref-type="bibr" rid="j_nejsds67_ref_026">26</xref>] provided procedures to produce point and interval estimates. Since the surrogate response <italic>S</italic> is obtained through simulation, [<xref ref-type="bibr" rid="j_nejsds67_ref_026">26</xref>] used a multiple-sampling scheme to “stabilize” the point estimate. They also provided an implementation to produce an interval estimate with a <inline-formula id="j_nejsds67_ineq_068"><alternatives><mml:math>
<mml:mn>95</mml:mn>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$95\% $]]></tex-math></alternatives></inline-formula> confidence level. This confidence interval is constructed through a bootstrap-based pseudo algorithm. When the sample size is large (e.g., <inline-formula id="j_nejsds67_ineq_069"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>2000</mml:mn></mml:math><tex-math><![CDATA[$n=2000$]]></tex-math></alternatives></inline-formula>), the numerical studies of [<xref ref-type="bibr" rid="j_nejsds67_ref_026">26</xref>] show that the interval estimate of the surrogate <inline-formula id="j_nejsds67_ineq_070"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup></mml:math><tex-math><![CDATA[${R_{(S)}^{2}}$]]></tex-math></alternatives></inline-formula> achieves coverage close to the nominal level.</p>
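<p>The multiple-sampling and bootstrap ideas can be illustrated as follows. This is a hedged sketch for a binary probit model fitted with <monospace>glm</monospace>, not the algorithm implemented in the package: the helper <monospace>draw_r2</monospace>, the simulated data, the number of bootstrap replicates, and the use of a percentile interval are all assumptions made for illustration. The <monospace>glm</monospace> parametrization places the binary threshold at zero, which shifts the latent scale by a constant relative to the display above and leaves the OLS <italic>R</italic><sup>2</sup> unchanged.</p>
<preformat><![CDATA[
```r
# Illustrative sketch; the helper name, data, and tuning values are assumptions.
set.seed(2)
n   <- 500
x   <- rnorm(n)
dat <- data.frame(x = x, y = as.integer(0.3 + 0.8 * x + rnorm(n) > 0))

# Average of `reps` surrogate R-squared values for a binary probit fit on `d`
draw_r2 <- function(d, reps) {
  fit <- glm(y ~ x, family = binomial(link = "probit"), data = d)
  mu  <- predict(fit, type = "link")   # fitted linear predictor
  p0  <- pnorm(-mu)                    # P(latent value <= 0 | x)
  mean(replicate(reps, {
    # y = 1 means the latent value exceeds 0; y = 0 means it does not
    u <- ifelse(d$y == 1, runif(nrow(d), p0, 1), runif(nrow(d), 0, p0))
    s <- mu + qnorm(u)
    summary(lm(s ~ x, data = d))$r.squared
  }))
}

# Point estimate stabilized by averaging over 30 surrogate draws
r2_point <- draw_r2(dat, reps = 30)

# Percentile bootstrap interval: resample rows, refit, redraw the surrogate
boot_r2 <- replicate(200, draw_r2(dat[sample.int(n, replace = TRUE), ], reps = 1))
r2_ci   <- quantile(boot_r2, c(0.025, 0.975))
```
]]></preformat>
<p>Averaging over several surrogate draws reduces the simulation noise in the point estimate, while the bootstrap replicates reflect both the sampling variability of the data and the randomness of the surrogate draw.</p>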
<p>It is also worth noting that [<xref ref-type="bibr" rid="j_nejsds67_ref_026">26</xref>]’s method requires a full model. This paper will illustrate how to use existing tools, such as variable selection and model diagnostics, to initiate a full model. The full model is used to generate a common surrogate response <italic>S</italic>, which is then used to calculate the surrogate <inline-formula id="j_nejsds67_ineq_071"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup></mml:math><tex-math><![CDATA[${R_{(S)}^{2}}$]]></tex-math></alternatives></inline-formula> values of any reduced models. We demonstrate this procedure in the real data analysis presented in Section <xref rid="j_nejsds67_s_005">5</xref>.</p>
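<p>The role of the common surrogate response can be sketched as follows. This minimal example assumes a binary probit full model with two simulated covariates; all names and values are hypothetical, and the code does not call the package. The surrogate is generated once from the full model and then reused for a reduced model. Because both OLS fits share the same response and the models are nested, the reduced-model value cannot exceed the full-model value.</p>
<preformat><![CDATA[
```r
# Illustrative sketch; data, coefficients, and object names are assumptions.
set.seed(3)
n   <- 1000
x1  <- rnorm(n)
x2  <- rnorm(n)
y   <- as.integer(-0.2 + 0.7 * x1 + 0.5 * x2 + rnorm(n) > 0)
dat <- data.frame(y = y, x1 = x1, x2 = x2)

# Full model: the only model used to generate the surrogate response
full <- glm(y ~ x1 + x2, family = binomial(link = "probit"), data = dat)
mu   <- predict(full, type = "link")
p0   <- pnorm(-mu)                      # P(latent value <= 0 | x1, x2)
u    <- ifelse(dat$y == 1, runif(n, p0, 1), runif(n, 0, p0))
dat$s <- mu + qnorm(u)                  # common surrogate response S

# The same S is regressed on the full and on the reduced set of covariates
r2_full    <- summary(lm(s ~ x1 + x2, data = dat))$r.squared
r2_reduced <- summary(lm(s ~ x1, data = dat))$r.squared
c(full = r2_full, reduced = r2_reduced)
```
]]></preformat>
<p>If each reduced model generated its own surrogate, the two regressions would have different responses and this OLS argument for monotonicity would no longer apply.</p>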
</sec>
<sec id="j_nejsds67_s_003">
<label>3</label>
<title>Main Functions of the <inline-formula id="j_nejsds67_ineq_072"><alternatives><mml:math>
<mml:mtext mathvariant="bold">SurrogateRsq</mml:mtext></mml:math><tex-math><![CDATA[$\textbf{SurrogateRsq}$]]></tex-math></alternatives></inline-formula> Package</title>
<p>We develop an R package <inline-formula id="j_nejsds67_ineq_073"><alternatives><mml:math>
<mml:mi mathvariant="bold">SurrogateRsq</mml:mi></mml:math><tex-math><![CDATA[$\mathbf{SurrogateRsq}$]]></tex-math></alternatives></inline-formula> for goodness-of-fit analysis of probit models [<xref ref-type="bibr" rid="j_nejsds67_ref_043">43</xref>]. This package contains functions to provide (i) a point estimate of the surrogate <inline-formula id="j_nejsds67_ineq_074"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula>; (ii) an interval estimate of the surrogate <inline-formula id="j_nejsds67_ineq_075"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula>; (iii) an importance ranking of explanatory variables based on their contributions to the total surrogate <inline-formula id="j_nejsds67_ineq_076"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> of the full model; and (iv) other existing <inline-formula id="j_nejsds67_ineq_077"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> measures in the literature. In this section, we explicitly explain the inputs and outputs of these functions. In the next two sections, we will demonstrate the use of these functions through a recommended workflow and real data examples.</p>
<list>
<list-item id="j_nejsds67_li_004">
<label>1.</label>
<p><monospace>surr_rsq</monospace>: a function for producing a point estimate of the surrogate <inline-formula id="j_nejsds67_ineq_078"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup></mml:math><tex-math><![CDATA[${R_{(S)}^{2}}$]]></tex-math></alternatives></inline-formula> for a user-specified model. It requires a reduced model and a full model, where the full model object carries the dataset used to fit both models, and it accepts two optional arguments. This function returns an S3 object of the class “<monospace>surr_rsq</monospace>”, which other functions in this package can take directly as input. The details of the inputs are as follows:</p>
<list>
<list-item id="j_nejsds67_li_005">
<label>•</label>
<p><monospace>model</monospace>: a model to be evaluated for the goodness of fit. Our implementation supports a few popular classes of objects: the <monospace>probit</monospace> model fitted by the <monospace>glm</monospace> function in the R core <monospace>stats</monospace> package, the ordered probit model fitted by the <monospace>polr</monospace> function in the <monospace>MASS</monospace> package, and models fitted by <monospace>clm()</monospace> in the <monospace>ordinal</monospace> package or <monospace>vglm()</monospace> in the <monospace>VGAM</monospace> package.</p>
</list-item>
<list-item id="j_nejsds67_li_006">
<label>•</label>
<p><monospace>full_model</monospace>: a full model initiated by the investigator. [<xref ref-type="bibr" rid="j_nejsds67_ref_026">26</xref>]’s method requires a full model. In Sections <xref rid="j_nejsds67_s_004">4</xref> and <xref rid="j_nejsds67_s_005">5</xref>, we discuss in detail how to initiate a full model. In addition, this model object should contain the dataset for fitting the full model and the reduced model.</p>
</list-item>
<list-item id="j_nejsds67_li_007">
<label>•</label>
<p><monospace>avg.num</monospace>: an optional input that specifies the number of simulations used in multiple sampling. The default value is 30. The surrogate <inline-formula id="j_nejsds67_ineq_079"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup></mml:math><tex-math><![CDATA[${R_{(S)}^{2}}$]]></tex-math></alternatives></inline-formula> is calculated using the simulated surrogate response <italic>S</italic>. A multiple-sampling scheme can be used to “stabilize” the point estimate of <inline-formula id="j_nejsds67_ineq_080"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup></mml:math><tex-math><![CDATA[${R_{(S)}^{2}}$]]></tex-math></alternatives></inline-formula> by using the average of multiple <inline-formula id="j_nejsds67_ineq_081"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup></mml:math><tex-math><![CDATA[${R_{(S)}^{2}}$]]></tex-math></alternatives></inline-formula> values.</p>
</list-item>
<list-item id="j_nejsds67_li_008">
<label>•</label>
<p><monospace>asym</monospace>: an optional logical argument that specifies whether to use the asymptotic version of the surrogate R-squared. The default value is FALSE. If TRUE, we calculate the surrogate <inline-formula id="j_nejsds67_ineq_082"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> using the asymptotic formula on page 208 of the paper by [<xref ref-type="bibr" rid="j_nejsds67_ref_026">26</xref>]. More details are provided in that paper. This approach avoids calculating the average of multiple <inline-formula id="j_nejsds67_ineq_083"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup></mml:math><tex-math><![CDATA[${R_{(S)}^{2}}$]]></tex-math></alternatives></inline-formula> in the above argument.</p>
</list-item>
</list>
</list-item>
</list>
<graphic xlink:href="nejsds67_g001.jpg"/>
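<p>To make the role of <monospace>avg.num</monospace> concrete, the following is a minimal from-scratch sketch of the multiple-sampling idea. It is not the package's internal code. The simulated data, the helper names <monospace>draw_surrogate()</monospace> and <monospace>one_rsq()</monospace>, and the restriction to the special case in which the evaluated model coincides with the full model are all assumptions made for illustration. Each pass draws a surrogate response from a normal distribution truncated to the cut-point interval of the observed category, regresses it on the covariates, and records the resulting <italic>R</italic><sup>2</sup>; the point estimate is the average over the passes.</p>
<preformat preformat-type="code"><![CDATA[
library(MASS)  # polr() for the ordered probit fit

set.seed(2024)

# Simulated ordinal data (hypothetical; not the wine data)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
z  <- 1.0 * x1 - 0.5 * x2 + rnorm(n)   # latent continuous variable
y  <- cut(z, breaks = c(-Inf, -1, 0, 1, Inf), ordered_result = TRUE)
dat <- data.frame(y = y, x1 = x1, x2 = x2)

fit <- polr(y ~ x1 + x2, data = dat, method = "probit")

# Draw one surrogate response: N(eta, 1) truncated to the cut-point
# interval of the observed category, via inverse-CDF sampling.
draw_surrogate <- function(fit, data) {
  eta  <- drop(as.matrix(data[, names(coef(fit))]) %*% coef(fit))
  cuts <- c(-Inf, fit$zeta, Inf)
  j    <- as.integer(data$y)
  u    <- runif(length(eta), pnorm(cuts[j] - eta), pnorm(cuts[j + 1] - eta))
  eta + qnorm(u)
}

# R-squared of the linear regression of one surrogate draw on the covariates
one_rsq <- function() {
  dat$S <- draw_surrogate(fit, dat)   # modifies a local copy of dat only
  summary(lm(S ~ x1 + x2, data = dat))$r.squared
}

avg_num   <- 30                        # plays the role of avg.num
rsq_draws <- replicate(avg_num, one_rsq())
rsq_hat   <- mean(rsq_draws)           # averaged ("stabilized") point estimate
print(c(mean = rsq_hat, sd_across_draws = sd(rsq_draws)))
]]></preformat>
<p>The standard deviation across draws gives a rough sense of how much a single-draw estimate would fluctuate, which is the variability that averaging is meant to reduce.</p>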
<list>
<list-item id="j_nejsds67_li_009">
<label>2.</label>
<p><monospace>surr_rsq_ci</monospace>: a function for generating an interval measure of the surrogate <inline-formula id="j_nejsds67_ineq_084"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> with the designated confidence level. This interval accounts for and reflects the uncertainty in the <inline-formula id="j_nejsds67_ineq_085"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> statistic. This function takes the following inputs:</p>
<list>
<list-item id="j_nejsds67_li_010">
<label>•</label>
<p><monospace>object</monospace>: an object generated from the previous <monospace>surr_rsq</monospace> function.</p>
</list-item>
<list-item id="j_nejsds67_li_011">
<label>•</label>
<p><monospace>alpha</monospace>: the value of <monospace>alpha</monospace> determines the confidence level of the interval, namely, <inline-formula id="j_nejsds67_ineq_086"><alternatives><mml:math>
<mml:mn>100</mml:mn>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>−</mml:mo>
<mml:mi mathvariant="italic">α</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
<mml:mi mathvariant="normal">%</mml:mi></mml:math><tex-math><![CDATA[$100(1-\alpha )\% $]]></tex-math></alternatives></inline-formula>. The default value of <monospace>alpha</monospace> is 0.05.</p>
</list-item>
<list-item id="j_nejsds67_li_012">
<label>•</label>
<p><monospace>B</monospace>: the number of bootstrap replications. The default value of <monospace>B</monospace> is 2000. The confidence interval is derived from a bootstrap distribution for <inline-formula id="j_nejsds67_ineq_087"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup></mml:math><tex-math><![CDATA[${R_{(S)}^{2}}$]]></tex-math></alternatives></inline-formula>. See the section “Inference by Multiple Sampling” in [<xref ref-type="bibr" rid="j_nejsds67_ref_026">26</xref>] for details.</p>
</list-item>
<list-item id="j_nejsds67_li_013">
<label>•</label>
<p><monospace>asym</monospace>: an optional logical argument that specifies whether to use the asymptotic version of the surrogate R-squared. The default value is FALSE.</p>
</list-item>
<list-item id="j_nejsds67_li_014">
<label>•</label>
<p><monospace>parallel</monospace>: an optional logical argument that enables parallel computing through the <monospace>foreach</monospace> package [<xref ref-type="bibr" rid="j_nejsds67_ref_001">1</xref>]. The default value is FALSE. If TRUE, a parallel backend must be registered beforehand through <monospace>registerDoParallel()</monospace> or <monospace>registerDoSNOW()</monospace>.</p>
</list-item>
</list>
</list-item>
</list>
<graphic xlink:href="nejsds67_g002.jpg"/>
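<p>A minimal usage sketch of <monospace>surr_rsq()</monospace> followed by <monospace>surr_rsq_ci()</monospace> is given below. The simulated data and the object names are hypothetical, the number of bootstrap replications is reduced from the default only to keep the run short, and the name of the returned element (<monospace>surr_rsq</monospace>) is an assumption about the structure of the returned object that should be checked against the package documentation.</p>
<preformat preformat-type="code"><![CDATA[
library(MASS)          # polr()
library(SurrogateRsq)  # surr_rsq(), surr_rsq_ci()

set.seed(2024)

# Simulated ordinal data (hypothetical); x3 is pure noise
n   <- 500
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
z   <- 1.0 * dat$x1 - 0.5 * dat$x2 + rnorm(n)
dat$y <- cut(z, breaks = c(-Inf, -1, 0, 1, Inf), ordered_result = TRUE)

# Full model and one reduced model, both ordered probit
full_fit    <- polr(y ~ x1 + x2 + x3, data = dat, method = "probit")
reduced_fit <- polr(y ~ x1, data = dat, method = "probit")

# Point estimates: the reduced model is evaluated against the full model
rsq_full    <- surr_rsq(model = full_fit, full_model = full_fit, avg.num = 30)
rsq_reduced <- surr_rsq(model = reduced_fit, full_model = full_fit, avg.num = 30)
print(rsq_full$surr_rsq)      # element name assumed
print(rsq_reduced$surr_rsq)

# 95% bootstrap interval; B lowered from the default 2000 for speed
ci_full <- surr_rsq_ci(object = rsq_full, alpha = 0.05, B = 200)
print(ci_full)

# With parallel = TRUE, a backend would have to be registered first, e.g.
# doParallel::registerDoParallel(cores = 2)
]]></preformat>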
<fig id="j_nejsds67_fig_001">
<label>Figure 1</label>
<caption>
<p>An illustration of a workflow for modeling categorical data. Grey boxes show statistical analysis steps that should be carried out before our goodness-of-fit analysis. Light blue boxes contain the main functions in the SurrogateRsq package. Orange boxes highlight inference outcomes produced by our SurrogateRsq package.</p>
</caption>
<graphic xlink:href="nejsds67_g003.jpg"/>
</fig>
<list>
<list-item id="j_nejsds67_li_015">
<label>3.</label>
<p><monospace>surr_rsq_rank</monospace>: a function to give ranks of explanatory variables based on their contributions to the overall surrogate <inline-formula id="j_nejsds67_ineq_088"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula>. The rank is based on the variance contribution of each variable. Specifically, it calculates the reduction of the surrogate <inline-formula id="j_nejsds67_ineq_089"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">S</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup></mml:math><tex-math><![CDATA[${R_{(S)}^{2}}$]]></tex-math></alternatives></inline-formula> when each variable is removed from the model, one at a time. The rank is then determined by the size of this reduction, which indicates the importance of each variable relative to the others. In addition to the ranks, the output table includes the <inline-formula id="j_nejsds67_ineq_090"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> reduction and its percentage in reference to the total surrogate <inline-formula id="j_nejsds67_ineq_091"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> of the full model. The function requires only the <monospace>object</monospace> input, which is an object generated by the <monospace>surr_rsq</monospace> function. The optional <monospace>avg.num</monospace> argument is the same as the one in the <monospace>surr_rsq</monospace> function, and the optional <monospace>var.set</monospace> argument is explained below.</p>
<list>
<list-item id="j_nejsds67_li_016">
<label>•</label>
<p><monospace>object</monospace>: an object generated from the previous <monospace>surr_rsq</monospace> function.</p>
</list-item>
<list-item id="j_nejsds67_li_017">
<label>•</label>
<p><monospace>var.set</monospace>: an optional argument that allows users to examine the contribution of a set of variables, as a whole, to the total surrogate <inline-formula id="j_nejsds67_ineq_092"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula>. If not specified, the function calculates the goodness-of-fit contributions to the overall surrogate <inline-formula id="j_nejsds67_ineq_093"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> for individual variables.</p>
</list-item>
</list>
</list-item>
</list>
<graphic xlink:href="nejsds67_g004.jpg"/>
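<p>The sketch below illustrates the two ways of calling <monospace>surr_rsq_rank()</monospace> described above: ranking individual variables, and ranking sets of variables through <monospace>var.set</monospace>. The simulated data are hypothetical, and the format used for <monospace>var.set</monospace> (a list of character vectors of term names) is our assumption about what the function expects.</p>
<preformat preformat-type="code"><![CDATA[
library(MASS)          # polr()
library(SurrogateRsq)  # surr_rsq(), surr_rsq_rank()

set.seed(2024)

# Simulated ordinal data (hypothetical); x3 is pure noise
n   <- 500
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
z   <- 1.0 * dat$x1 - 0.5 * dat$x2 + rnorm(n)
dat$y <- cut(z, breaks = c(-Inf, -1, 0, 1, Inf), ordered_result = TRUE)

full_fit <- polr(y ~ x1 + x2 + x3, data = dat, method = "probit")
rsq_full <- surr_rsq(model = full_fit, full_model = full_fit, avg.num = 30)

# Rank each explanatory variable by its contribution to the surrogate R-squared
rank_each <- surr_rsq_rank(object = rsq_full, avg.num = 30)
print(rank_each)

# Rank groups of variables as a whole (var.set format assumed)
rank_sets <- surr_rsq_rank(object = rsq_full, avg.num = 30,
                           var.set = list(c("x1", "x2"), c("x3")))
print(rank_sets)
]]></preformat>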
</sec>
<sec id="j_nejsds67_s_004">
<label>4</label>
<title>Using R Packages for Categorical Data Modeling: A Workflow</title>
<p>In empirical studies, goodness-of-fit analysis should be used jointly with other statistical tools, such as variable screening/selection and model diagnostics, in the model-building and refining process. In this section, we discuss how to follow the workflow in Figure <xref rid="j_nejsds67_fig_001">1</xref> to carry out statistical modeling for categorical data. We also discuss how to use the <inline-formula id="j_nejsds67_ineq_094"><alternatives><mml:math>
<mml:mi mathvariant="bold">SurrogateRsq</mml:mi></mml:math><tex-math><![CDATA[$\mathbf{SurrogateRsq}$]]></tex-math></alternatives></inline-formula> package with other existing R packages to implement this workflow. As [<xref ref-type="bibr" rid="j_nejsds67_ref_026">26</xref>]’s method requires a full model, researchers and practitioners can also follow the process in Figure <xref rid="j_nejsds67_fig_001">1</xref> to initiate a full model to facilitate goodness-of-fit analysis. 
<list>
<list-item id="j_nejsds67_li_018">
<label>1.</label>
<p>In <monospace>Step-0</monospace>, we can use the AIC, BIC, LASSO, or any other variable selection method deemed appropriate to trim or prune the set of explanatory variables to a “manageable” size (e.g., fewer than 20). The goal is to eliminate irrelevant variables so that researchers can better investigate the model structure and assess the model. Variable selection techniques have been studied extensively in the literature. Specifically, one can implement (i) the best subset selection using the function <monospace>regsubsets()</monospace> in the <monospace>leaps</monospace> package; (ii) the forward/backward/stepwise selection using the function <monospace>step()</monospace> in the R core <monospace>stats</monospace> package; (iii) the shrinkage methods including the (adaptive) LASSO in the <bold>glmnet</bold> package; (iv) the regularized ordinal regression model with an elastic net penalty in the <bold>ordinalNet</bold> package; and (v) the penalized regression models with minimax concave penalty (MCP) or smoothly clipped absolute deviation (SCAD) penalty in the <bold>ncvreg</bold> package [<xref ref-type="bibr" rid="j_nejsds67_ref_036">36</xref>, <xref ref-type="bibr" rid="j_nejsds67_ref_046">46</xref>, <xref ref-type="bibr" rid="j_nejsds67_ref_045">45</xref>, <xref ref-type="bibr" rid="j_nejsds67_ref_035">35</xref>, <xref ref-type="bibr" rid="j_nejsds67_ref_040">40</xref>]. When the dimension is ultrahigh, the sure independence screening method can be applied through the <bold>SIS</bold> package [<xref ref-type="bibr" rid="j_nejsds67_ref_016">16</xref>]. When the variables are grouped, one can apply the group selection methods including the group lasso, group MCP, and group SCAD through the <bold>grpreg</bold> package [<xref ref-type="bibr" rid="j_nejsds67_ref_006">6</xref>]. In some cases, <monospace>Step-0</monospace> may be skipped if the experiment only involves a (small) set of controlled variables.
In these cases, the controlled variables should be modeled regardless of statistical significance or predictive power. We limit our discussion here because our focus is on goodness-of-fit analysis.</p>
</list-item>
<list-item id="j_nejsds67_li_019">
<label>2.</label>
<p>In <monospace>Step-1</monospace>, we can use diagnostic tools to inspect the model passed from <monospace>Step-0</monospace>, adjust its functional form, and add further elements if needed (e.g., higher-order or interaction terms). For categorical data, we can use the function <monospace>autoplot.resid()</monospace> in the <bold>sure</bold> package [<xref ref-type="bibr" rid="j_nejsds67_ref_024">24</xref>, <xref ref-type="bibr" rid="j_nejsds67_ref_018">18</xref>] to generate three types of diagnostic plots: residual Q-Q plots, residual-vs-covariate plots, and residual-vs-fitted plots. These plots can be used to visualize the discrepancy between the working model and the “true” model. Similar plots can be produced using the function <monospace>diagnostic.plot()</monospace> in the <bold>PAsso</bold> package [<xref ref-type="bibr" rid="j_nejsds67_ref_044">44</xref>]. These diagnostic plots give practitioners insight into how to refine the model, for example by transforming the regression form or adding higher-order terms. At the end of this diagnosis and refinement process, we expect to have a <bold>full model</bold> (<inline-formula id="j_nejsds67_ineq_095"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mi mathvariant="italic">u</mml:mi>
<mml:mi mathvariant="italic">l</mml:mi>
<mml:mi mathvariant="italic">l</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\mathcal{M}_{full}}$]]></tex-math></alternatives></inline-formula>) for subsequent inferences including goodness-of-fit analysis.</p>
</list-item>
<list-item id="j_nejsds67_li_020">
<label>3.</label>
<p>In <monospace>Step-2</monospace>, we can use the functions developed in our <bold>SurrogateRsq</bold> package to examine the goodness of fit of the full model <inline-formula id="j_nejsds67_ineq_096"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mi mathvariant="italic">u</mml:mi>
<mml:mi mathvariant="italic">l</mml:mi>
<mml:mi mathvariant="italic">l</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\mathcal{M}_{full}}$]]></tex-math></alternatives></inline-formula> and various reduced models of interest. Specifically, we can produce the point and interval estimates of the surrogate <inline-formula id="j_nejsds67_ineq_097"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> by using the functions <monospace>surr_rsq()</monospace> and <monospace>surr_rsq_ci()</monospace>. In addition, we can quantify the contribution of each individual variable to the overall surrogate <inline-formula id="j_nejsds67_ineq_098"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> by using the function <monospace>surr_rsq_rank()</monospace>. Based on the percentage contribution, the function <monospace>surr_rsq_rank()</monospace> also provides ranks of the explanatory variables to show their relative importance. In the following section, we will show in a case study how our package can help us understand the relative importance of explanatory variables and compare the results across different samples. The “comparability” across different samples and/or models is an appealing feature of the surrogate <inline-formula id="j_nejsds67_ineq_099"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula>, which will be discussed in detail along with the R implementation.</p>
</list-item>
</list>
</p>
</sec>
<sec id="j_nejsds67_s_005">
<label>5</label>
<title>Analysis of the Wine Rating Data: A Demonstration</title>
<p>In this section, we demonstrate how to use our <monospace>SurrogateRsq</monospace> package, coupled with R packages for model selection and diagnostics, to carry out statistical analysis of the wine rating data. A critical problem in wine analysis is to understand how the physicochemical properties of wines may influence human tasting preferences [<xref ref-type="bibr" rid="j_nejsds67_ref_010">10</xref>]. For this purpose, [<xref ref-type="bibr" rid="j_nejsds67_ref_010">10</xref>] collected a dataset that contains wine ratings for 1599 red wine samples and 4898 white wine samples. The response variable, the wine rating, is measured on an ordinal scale ranging from 0 (very bad) to 10 (excellent). The explanatory variables are 11 physicochemical features, including alcohol, sulphates, acidity, sulfur dioxide, pH, and others.</p>
<p>Our analysis of the wine rating data follows the workflow discussed in Section <xref rid="j_nejsds67_s_004">4</xref>. Specifically, in Section <xref rid="j_nejsds67_s_006">5.1</xref>, we initiate a full model using several R packages for variable selection and model diagnostics. In Section <xref rid="j_nejsds67_s_009">5.2</xref>, we use our <monospace>SurrogateRsq</monospace> package to evaluate (i) the goodness-of-fit of the full model and several reduced models; (ii) the contribution of each individual variable to the overall <inline-formula id="j_nejsds67_ineq_100"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula>; and (iii) the difference between the red wine and white wine in terms of how physicochemical features may influence human tasting differently.</p>
<sec id="j_nejsds67_s_006">
<label>5.1</label>
<title>Initiating a Full Model Using Variable Selection and Model Diagnostics</title>
<p>To start, we use the function <monospace>polr()</monospace> to fit an ordered probit model to the red wine sample using all 11 explanatory variables. In this “naive” model, three explanatory variables are statistically insignificant: <monospace>fixed.acidity</monospace>, <monospace>citric.acid</monospace>, and <monospace>residual.sugar</monospace>.</p><graphic xlink:href="nejsds67_g005.jpg"/><graphic xlink:href="nejsds67_g006.jpg"/>
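<p>A sketch of how such a naive fit could be set up is given below. It assumes that the red wine data are available as the data set <monospace>RedWine</monospace> in the <monospace>SurrogateRsq</monospace> package and that the data frame contains only the response <monospace>quality</monospace> and the 11 features; both points are assumptions to verify. The p-values use a normal approximation to the reported t values.</p>
<preformat preformat-type="code"><![CDATA[
library(MASS)          # polr()
library(SurrogateRsq)  # assumed to ship the RedWine data set

data("RedWine", package = "SurrogateRsq")

# polr() needs an ordered factor response; this is harmless if it already is one
RedWine$quality <- factor(RedWine$quality, ordered = TRUE)

# Ordered probit model with all explanatory variables
naive_model <- polr(quality ~ ., data = RedWine, method = "probit", Hess = TRUE)

# Coefficient table with approximate two-sided p-values
coef_table <- coef(summary(naive_model))
p_values   <- 2 * pnorm(abs(coef_table[, "t value"]), lower.tail = FALSE)
print(round(cbind(coef_table, p_value = p_values), 3))
]]></preformat>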
<sec id="j_nejsds67_s_007">
<label>5.1.1</label>
<title>Variable Selection</title>
<p>As the number of explanatory variables is small, we use the exhaustive search method to select variables.</p><graphic xlink:href="nejsds67_g007.jpg"/>
<fig id="j_nejsds67_fig_002">
<label>Figure 2</label>
<caption>
<p>The selection results of the exhaustive search method for the red wine analysis.</p>
</caption>
<graphic xlink:href="nejsds67_g008.jpg"/>
</fig>
<p>Figure <xref rid="j_nejsds67_fig_002">2</xref> plots the exhaustive search selection results based on the BIC. Each row in the plot represents a model fitted with the variables highlighted in black. The top row is the selected model, which has the smallest BIC value. This model excludes <monospace>fixed.acidity</monospace>, <monospace>citric.acid</monospace>, <monospace>residual.sugar</monospace>, and <monospace>density</monospace>. Note that the first three are the variables found insignificant in the naive model. We perform diagnostics on this model in the next subsection.</p><graphic xlink:href="nejsds67_g009.jpg"/>
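<p>An exhaustive search of this kind could be sketched as follows with <monospace>regsubsets()</monospace>. The sketch treats the rating as a numeric response for the purpose of the search, which is a simplification on our part, and it again assumes that the <monospace>RedWine</monospace> data set is available from the <monospace>SurrogateRsq</monospace> package. The object names are hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
library(leaps)         # regsubsets()
library(SurrogateRsq)  # assumed to ship the RedWine data set

data("RedWine", package = "SurrogateRsq")

# regsubsets() fits linear models, so use a numeric copy of the rating
RedWine_num <- transform(RedWine, quality = as.numeric(as.character(quality)))

# Exhaustive search over subsets of up to 11 explanatory variables
search <- regsubsets(quality ~ ., data = RedWine_num,
                     nvmax = 11, method = "exhaustive")

# One row per subset size, ordered by BIC; shaded cells mark included variables
plot(search, scale = "bic")

# Variables in the subset with the smallest BIC
search_summary <- summary(search)
best_size <- which.min(search_summary$bic)
selected  <- names(which(search_summary$which[best_size, -1]))  # drop intercept
print(selected)
]]></preformat>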
<p>We remark that if the number of explanatory variables is (moderately) large, we can instead use stepwise selection or regularization methods (e.g., with an L1, elastic net, minimax concave, or SCAD penalty). Example code is provided in the supplementary materials.</p>
</sec>
<sec id="j_nejsds67_s_008">
<label>5.1.2</label>
<title>Model Diagnostics</title>
<p>We conduct diagnostics on the model with the variables selected in the previous step. For this purpose, we use surrogate residuals [<xref ref-type="bibr" rid="j_nejsds67_ref_024">24</xref>], which can be computed and plotted with the function <monospace>autoplot.resid()</monospace> in the package <monospace>sure</monospace> [<xref ref-type="bibr" rid="j_nejsds67_ref_018">18</xref>] or the function <monospace>diagnostic.plot()</monospace> in the package <monospace>PAsso</monospace> [<xref ref-type="bibr" rid="j_nejsds67_ref_044">44</xref>]. The code below produces residual-vs-covariate plots for the object <monospace>select_model</monospace> by specifying <monospace>output = "covariate"</monospace>.</p><graphic xlink:href="nejsds67_g010.jpg"/>
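<p>As an alternative illustration, a residual-vs-covariate plot can be drawn by hand with base graphics. The sketch below assumes that <monospace>resids()</monospace> in the <monospace>sure</monospace> package returns one surrogate residual per observation for a <monospace>polr</monospace> fit, that the <monospace>RedWine</monospace> data set is available from the <monospace>SurrogateRsq</monospace> package, and that the seven variables retained by the BIC search carry the names used in the formula; the name <monospace>total.sulfur.dioxide</monospace> in particular is an assumption.</p>
<preformat preformat-type="code"><![CDATA[
library(MASS)          # polr()
library(sure)          # resids(): surrogate residuals
library(SurrogateRsq)  # assumed to ship the RedWine data set

data("RedWine", package = "SurrogateRsq")
RedWine$quality <- factor(RedWine$quality, ordered = TRUE)

# Model with the variables kept by the BIC search (variable names assumed)
select_model <- polr(quality ~ volatile.acidity + chlorides +
                       free.sulfur.dioxide + total.sulfur.dioxide +
                       pH + sulphates + alcohol,
                     data = RedWine, method = "probit")

set.seed(2024)                      # surrogate residuals are simulated
res <- as.numeric(resids(select_model))

# Residual-vs-covariate plot with a smooth trend; a flat curve near zero
# suggests the covariate enters the model in an adequate functional form
plot(RedWine$sulphates, res,
     xlab = "sulphates", ylab = "Surrogate residual")
lines(lowess(RedWine$sulphates, res), col = "red", lwd = 2)
abline(h = 0, lty = 2)
]]></preformat>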
<fig id="j_nejsds67_fig_003">
<label>Figure 3</label>
<caption>
<p>Plots of surrogate residual versus <monospace>sulphates</monospace> for (a) the model with a linear term of <monospace>sulphates</monospace>; (b) the model with an additional quadratic term of <monospace>sulphates</monospace>; and (c) the model with an additional cubic term of <inline-formula id="j_nejsds67_ineq_101"><alternatives><mml:math>
<mml:mo mathvariant="monospace" movablelimits="false">sulphates</mml:mo></mml:math><tex-math><![CDATA[$sulphates$]]></tex-math></alternatives></inline-formula>. The solid red curves are LOESS curves.</p>
</caption>
<graphic xlink:href="nejsds67_g011.jpg"/>
</fig>
<p>Among all the residual-vs-covariate plots, we find that the residual-vs-<monospace>sulphates</monospace> plot in Figure <xref rid="j_nejsds67_fig_003">3</xref>(a) shows an inverted U-shaped pattern, which suggests a missing quadratic term of <monospace>sulphates</monospace>. We update the model by adding a squared term <inline-formula id="j_nejsds67_ineq_102"><alternatives><mml:math>
<mml:mi mathvariant="italic">I</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="monospace">sulphates</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$I({\mathtt{sulphates}^{2}})$]]></tex-math></alternatives></inline-formula> to the object <monospace>select_model</monospace> and run model diagnostics again using the code below. Figure <xref rid="j_nejsds67_fig_003">3</xref>(b) shows that the plot for <monospace>sulphates</monospace> still exhibits a nonlinear pattern. We therefore add a cubic term <inline-formula id="j_nejsds67_ineq_103"><alternatives><mml:math>
<mml:mi mathvariant="italic">I</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="monospace">sulphates</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo></mml:math><tex-math><![CDATA[$I({\mathtt{sulphates}^{3}})$]]></tex-math></alternatives></inline-formula> to the model. The LOESS curve in the updated plot in Figure <xref rid="j_nejsds67_fig_003">3</xref>(c) turns out to be flat. We use this model as our <bold>full model</bold> (<inline-formula id="j_nejsds67_ineq_104"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mi mathvariant="italic">u</mml:mi>
<mml:mi mathvariant="italic">l</mml:mi>
<mml:mi mathvariant="italic">l</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\mathcal{M}_{full}}$]]></tex-math></alternatives></inline-formula>).</p><graphic xlink:href="nejsds67_g012.jpg"/>
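<p>The two refinement steps can be sketched with <monospace>update()</monospace>, as below. The object names <monospace>mod_sq</monospace> and <monospace>mod_full</monospace> are hypothetical, and the sketch again assumes the <monospace>RedWine</monospace> data set from the <monospace>SurrogateRsq</monospace> package and the variable names used in the formula. The AIC comparison at the end is only a convenience check added here, not part of the diagnostic procedure described above.</p>
<preformat preformat-type="code"><![CDATA[
library(MASS)          # polr()
library(SurrogateRsq)  # assumed to ship the RedWine data set

data("RedWine", package = "SurrogateRsq")
RedWine$quality <- factor(RedWine$quality, ordered = TRUE)

# Model with the variables kept by the BIC search (variable names assumed)
select_model <- polr(quality ~ volatile.acidity + chlorides +
                       free.sulfur.dioxide + total.sulfur.dioxide +
                       pH + sulphates + alcohol,
                     data = RedWine, method = "probit")

# Add the quadratic term, then the cubic term, of sulphates
mod_sq   <- update(select_model, . ~ . + I(sulphates^2))
mod_full <- update(mod_sq, . ~ . + I(sulphates^3))

# Side-by-side comparison of the three candidate models
print(AIC(select_model, mod_sq, mod_full))
]]></preformat>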
<table-wrap id="j_nejsds67_tab_001">
<label>Table 1</label>
<caption>
<p>Model development for the red wine by variable selection and model diagnostics.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: double"/>
<td colspan="4" style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin"><italic>Dependent variable: quality</italic></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Model</td>
<td style="vertical-align: top; text-align: center">Naive</td>
<td style="vertical-align: top; text-align: center">Selected</td>
<td style="vertical-align: top; text-align: center">+ sulphates<inline-formula id="j_nejsds67_ineq_105"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{2}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">+ sulphates<inline-formula id="j_nejsds67_ineq_106"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{3}}$]]></tex-math></alternatives></inline-formula></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"/>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin"/>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin"/>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin"/>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin"><bold>full model</bold> <inline-formula id="j_nejsds67_ineq_107"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mi mathvariant="italic">u</mml:mi>
<mml:mi mathvariant="italic">l</mml:mi>
<mml:mi mathvariant="italic">l</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\mathcal{M}_{full}}$]]></tex-math></alternatives></inline-formula></td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">fixed.acidity</td>
<td style="vertical-align: top; text-align: center">0.026</td>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center"/>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"/>
<td style="vertical-align: top; text-align: center">(0.028)</td>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center"/>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">volatile.acidity</td>
<td style="vertical-align: top; text-align: center">−1.868<inline-formula id="j_nejsds67_ineq_108"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{\ast \ast \ast }}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">−1.722<inline-formula id="j_nejsds67_ineq_109"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{\ast \ast \ast }}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">−1.534<inline-formula id="j_nejsds67_ineq_110"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{\ast \ast \ast }}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">−1.491<inline-formula id="j_nejsds67_ineq_111"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{\ast \ast \ast }}$]]></tex-math></alternatives></inline-formula></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"/>
<td style="vertical-align: top; text-align: center">(0.213)</td>
<td style="vertical-align: top; text-align: center">(0.180)</td>
<td style="vertical-align: top; text-align: center">(0.183)</td>
<td style="vertical-align: top; text-align: center">(0.183)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">citric.acid</td>
<td style="vertical-align: top; text-align: center">−0.337</td>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center"/>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"/>
<td style="vertical-align: top; text-align: center">(0.256)</td>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center"/>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">residual.sugar</td>
<td style="vertical-align: top; text-align: center">0.011</td>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center"/>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"/>
<td style="vertical-align: top; text-align: center">(0.021)</td>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center"/>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">chlorides</td>
<td style="vertical-align: top; text-align: center">−3.234<inline-formula id="j_nejsds67_ineq_112"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{\ast \ast \ast }}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">−3.488<inline-formula id="j_nejsds67_ineq_113"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{\ast \ast \ast }}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">−2.965<inline-formula id="j_nejsds67_ineq_114"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{\ast \ast \ast }}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">−2.604<inline-formula id="j_nejsds67_ineq_115"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{\ast \ast \ast }}$]]></tex-math></alternatives></inline-formula></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"/>
<td style="vertical-align: top; text-align: center">(0.733)</td>
<td style="vertical-align: top; text-align: center">(0.699)</td>
<td style="vertical-align: top; text-align: center">(0.707)</td>
<td style="vertical-align: top; text-align: center">(0.715)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">free.sulfur.dioxide</td>
<td style="vertical-align: top; text-align: center">0.010<inline-formula id="j_nejsds67_ineq_116"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{\ast \ast \ast }}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">0.011<inline-formula id="j_nejsds67_ineq_117"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{\ast \ast \ast }}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">0.010<inline-formula id="j_nejsds67_ineq_118"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{\ast \ast }}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">0.010<inline-formula id="j_nejsds67_ineq_119"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{\ast \ast }}$]]></tex-math></alternatives></inline-formula></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"/>
<td style="vertical-align: top; text-align: center">(0.004)</td>
<td style="vertical-align: top; text-align: center">(0.004)</td>
<td style="vertical-align: top; text-align: center">(0.004)</td>
<td style="vertical-align: top; text-align: center">(0.004)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">total.sulfur.dioxide</td>
<td style="vertical-align: top; text-align: center">−0.007<inline-formula id="j_nejsds67_ineq_120"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{\ast \ast \ast }}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">−0.008<inline-formula id="j_nejsds67_ineq_121"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{\ast \ast \ast }}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">−0.007<inline-formula id="j_nejsds67_ineq_122"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{\ast \ast \ast }}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">−0.007<inline-formula id="j_nejsds67_ineq_123"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{\ast \ast \ast }}$]]></tex-math></alternatives></inline-formula></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"/>
<td style="vertical-align: top; text-align: center">(0.001)</td>
<td style="vertical-align: top; text-align: center">(0.001)</td>
<td style="vertical-align: top; text-align: center">(0.001)</td>
<td style="vertical-align: top; text-align: center">(0.001)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">density</td>
<td style="vertical-align: top; text-align: center">−6.679<inline-formula id="j_nejsds67_ineq_124"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{\ast \ast \ast }}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center"/>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"/>
<td style="vertical-align: top; text-align: center">(0.538)</td>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center"/>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">pH</td>
<td style="vertical-align: top; text-align: center">−0.754<inline-formula id="j_nejsds67_ineq_125"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{\ast \ast \ast }}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">−0.780<inline-formula id="j_nejsds67_ineq_126"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{\ast \ast \ast }}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">−0.969<inline-formula id="j_nejsds67_ineq_127"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{\ast \ast \ast }}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">−1.028<inline-formula id="j_nejsds67_ineq_128"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{\ast \ast \ast }}$]]></tex-math></alternatives></inline-formula></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"/>
<td style="vertical-align: top; text-align: center">(0.277)</td>
<td style="vertical-align: top; text-align: center">(0.205)</td>
<td style="vertical-align: top; text-align: center">(0.208)</td>
<td style="vertical-align: top; text-align: center">(0.209)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">sulphates</td>
<td style="vertical-align: top; text-align: center">1.589<inline-formula id="j_nejsds67_ineq_129"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{\ast \ast \ast }}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">1.570<inline-formula id="j_nejsds67_ineq_130"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{\ast \ast \ast }}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">5.937<inline-formula id="j_nejsds67_ineq_131"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{\ast \ast \ast }}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">15.147<inline-formula id="j_nejsds67_ineq_132"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{\ast \ast \ast }}$]]></tex-math></alternatives></inline-formula></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"/>
<td style="vertical-align: top; text-align: center">(0.195)</td>
<td style="vertical-align: top; text-align: center">(0.193)</td>
<td style="vertical-align: top; text-align: center">(0.678)</td>
<td style="vertical-align: top; text-align: center">(2.591)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">sulphates<inline-formula id="j_nejsds67_ineq_133"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{2}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center">−2.515<inline-formula id="j_nejsds67_ineq_134"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{\ast \ast \ast }}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">−12.397<inline-formula id="j_nejsds67_ineq_135"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{\ast \ast \ast }}$]]></tex-math></alternatives></inline-formula></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"/>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center">(0.374)</td>
<td style="vertical-align: top; text-align: center">(2.707)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">sulphates<inline-formula id="j_nejsds67_ineq_136"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{3}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center">3.092<inline-formula id="j_nejsds67_ineq_137"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{\ast \ast \ast }}$]]></tex-math></alternatives></inline-formula></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"/>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center">(0.839)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">alcohol</td>
<td style="vertical-align: top; text-align: center">0.481<inline-formula id="j_nejsds67_ineq_138"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{\ast \ast \ast }}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">0.479<inline-formula id="j_nejsds67_ineq_139"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{\ast \ast \ast }}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">0.475<inline-formula id="j_nejsds67_ineq_140"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{\ast \ast \ast }}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center">0.472<inline-formula id="j_nejsds67_ineq_141"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{\ast \ast \ast }}$]]></tex-math></alternatives></inline-formula></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: double"/>
<td style="vertical-align: top; text-align: center; border-bottom: double">(0.032)</td>
<td style="vertical-align: top; text-align: center; border-bottom: double">(0.031)</td>
<td style="vertical-align: top; text-align: center; border-bottom: double">(0.031)</td>
<td style="vertical-align: top; text-align: center; border-bottom: double">(0.031)</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>Note:</italic> <inline-formula id="j_nejsds67_ineq_142"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{\ast }}$]]></tex-math></alternatives></inline-formula>p&lt;0.1<inline-formula id="j_nejsds67_ineq_143"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mo>;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${;^{\ast \ast }}$]]></tex-math></alternatives></inline-formula>p&lt;0.05<inline-formula id="j_nejsds67_ineq_144"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mo>;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
<mml:mo>∗</mml:mo>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${;^{\ast \ast \ast }}$]]></tex-math></alternatives></inline-formula>p&lt;0.01</p>
</table-wrap-foot>
</table-wrap>
<p>Table <xref rid="j_nejsds67_tab_001">1</xref> summarizes the fitting results for the naive model and for the models obtained progressively through variable selection and model diagnostics. Compared to the naive model, the model in the “Selected” column drops <monospace>density</monospace> and three non-significant variables, which results in a lower BIC value. The last two columns of Table <xref rid="j_nejsds67_tab_001">1</xref> confirm the statistical significance of both the squared and cubic terms of <monospace>sulphates</monospace>, which were identified and added during model diagnostics. The model presented in the last column will be used as the <bold>full model</bold> <inline-formula id="j_nejsds67_ineq_145"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mi mathvariant="italic">u</mml:mi>
<mml:mi mathvariant="italic">l</mml:mi>
<mml:mi mathvariant="italic">l</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\mathcal{M}_{full}}$]]></tex-math></alternatives></inline-formula> in our goodness-of-fit assessment in the next subsection.</p>
</sec>
</sec>
<sec id="j_nejsds67_s_009">
<label>5.2</label>
<title>Goodness-of-Fit Analysis and Its Extended Utility</title>
<p>In this subsection, we use our developed <inline-formula id="j_nejsds67_ineq_146"><alternatives><mml:math>
<mml:mi mathvariant="bold">SurrogateRsq</mml:mi></mml:math><tex-math><![CDATA[$\mathbf{SurrogateRsq}$]]></tex-math></alternatives></inline-formula> package to illustrate how to use the surrogate <inline-formula id="j_nejsds67_ineq_147"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> to (i) assess the goodness-of-fit of the full model and reduced models; (ii) rank explanatory variables based on their contributions to <inline-formula id="j_nejsds67_ineq_148"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula>; and (iii) compare the goodness of fit across multiple samples and/or models.</p>
<sec id="j_nejsds67_s_010">
<label>5.2.1</label>
<title>Surrogate <inline-formula id="j_nejsds67_ineq_149"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> for the Full Model</title>
<p>First, we use the function <monospace>surr_rsq</monospace> to calculate the surrogate <inline-formula id="j_nejsds67_ineq_150"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> of the full model <inline-formula id="j_nejsds67_ineq_151"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mi mathvariant="italic">u</mml:mi>
<mml:mi mathvariant="italic">l</mml:mi>
<mml:mi mathvariant="italic">l</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\mathcal{M}_{full}}$]]></tex-math></alternatives></inline-formula> identified in the previous subsection. To do so, in the code below we set the arguments <monospace>model</monospace> and <monospace>full_model</monospace> to be the same as <inline-formula id="j_nejsds67_ineq_152"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mi mathvariant="italic">u</mml:mi>
<mml:mi mathvariant="italic">l</mml:mi>
<mml:mi mathvariant="italic">l</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\mathcal{M}_{full}}$]]></tex-math></alternatives></inline-formula>. We use 30 as the number of simulations for multiple sampling. The purpose of performing multiple sampling is to “stabilize” the point estimate of <inline-formula id="j_nejsds67_ineq_153"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> [<xref ref-type="bibr" rid="j_nejsds67_ref_026">26</xref>].</p><graphic xlink:href="nejsds67_g013.jpg"/>
<p>This function provides a point estimate of the surrogate <inline-formula id="j_nejsds67_ineq_154"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> of the full model. The value 0.439 implies 43.9% of the variance of the surrogate response <italic>S</italic> can be explained by the seven explanatory variables and two nonlinear terms of <monospace>sulphates</monospace>.</p>
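<p>The role of multiple sampling can be illustrated with a small, self-contained sketch in base R. This is not the implementation used inside the package: it assumes a toy ordered probit model fitted with <monospace>MASS::polr</monospace>, draws the surrogate response from a normal distribution truncated to the interval implied by each observed category, and takes the <italic>R</italic><sup>2</sup> of an ordinary least squares regression of that surrogate on the covariates. The data set <monospace>toy</monospace> and the helpers <monospace>draw_surrogate</monospace> and <monospace>one_rsq</monospace> are hypothetical names introduced only for this illustration.</p>
<preformat preformat-type="code"><![CDATA[
library(MASS)  # ships with R; provides polr()

set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
z  <- 1.0 * x1 + 0.5 * x2 + rnorm(n)   # latent variable with N(0, 1) noise
y  <- cut(z, breaks = c(-Inf, -1, 0, 1, Inf), ordered_result = TRUE)
toy <- data.frame(y, x1, x2)

# Toy ordered probit model (assumption: stands in for the full model)
fit <- polr(y ~ x1 + x2, data = toy, method = "probit")

# One surrogate draw: sample from N(eta, 1) truncated to the interval
# between the two cut points of the observed category (inverse-CDF method).
draw_surrogate <- function(fit, y) {
  eta  <- fit$lp                      # fitted linear predictor
  cuts <- c(-Inf, fit$zeta, Inf)      # estimated cut points, padded
  j    <- as.integer(y)               # observed category index
  lo   <- pnorm(cuts[j] - eta)
  hi   <- pnorm(cuts[j + 1] - eta)
  eta + qnorm(runif(length(eta), lo, hi))
}

# R-squared of the OLS regression of one surrogate draw on the covariates
one_rsq <- function() {
  s <- draw_surrogate(fit, toy$y)
  summary(lm(s ~ x1 + x2, data = toy))$r.squared
}

rsq_draws <- replicate(30, one_rsq())  # 30 draws, as in the text
rsq_avg   <- mean(rsq_draws)           # averaged ("stabilized") estimate
print(range(rsq_draws))
print(rsq_avg)
]]></preformat>
<p>Individual draws differ slightly because the surrogate response is random; averaging 30 of them gives a steadier point estimate, which is the motivation for multiple sampling described above.</p>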
</sec>
<sec id="j_nejsds67_s_011">
<label>5.2.2</label>
<title>Surrogate <inline-formula id="j_nejsds67_ineq_155"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> for a Reduced Model</title>
<p>We can also use the same function <monospace>surr_rsq</monospace> to calculate the surrogate <inline-formula id="j_nejsds67_ineq_156"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> of a reduced model. For example, to evaluate the goodness of fit of the model without the higher-order terms of <monospace>sulphates</monospace>, we simply change the <monospace>model</monospace> argument to the reduced model <monospace>select_model</monospace>, as shown in the code below. The full model must still be specified in the code, and it should be common to all the reduced models being compared. This eliminates the non-monotonicity issue seen in McKelvey-Zavoina’s <inline-formula id="j_nejsds67_ineq_157"><alternatives><mml:math>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">M</mml:mi>
<mml:mi mathvariant="italic">Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup></mml:math><tex-math><![CDATA[${R_{MZ}^{2}}$]]></tex-math></alternatives></inline-formula> [<xref ref-type="bibr" rid="j_nejsds67_ref_026">26</xref>].</p><graphic xlink:href="nejsds67_g014.jpg"/>
<p>The result shows that the surrogate <inline-formula id="j_nejsds67_ineq_158"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> drops to 0.41 when the squared and cubic terms of <monospace>sulphates</monospace> are removed from the model. This means that the higher-order terms of <monospace>sulphates</monospace> account for 6.60% of the total surrogate <inline-formula id="j_nejsds67_ineq_159"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula>.</p>
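<p>The percentage quoted above follows from simple arithmetic on the two reported point estimates. A minimal sketch in base R, using the rounded values 0.439 and 0.41 from the text (the variable names are ours and are not part of the package):</p>
<preformat preformat-type="code"><![CDATA[
rsq_full    <- 0.439  # surrogate R-squared of the full model (Section 5.2.1)
rsq_reduced <- 0.41   # surrogate R-squared without the higher-order terms

reduction <- rsq_full - rsq_reduced       # absolute drop in R-squared
pct       <- 100 * reduction / rsq_full   # share of the total R-squared
print(round(pct, 1))
]]></preformat>
<p>With rounded inputs the result is about 6.6%, in line with the 6.60% reported above, which was presumably computed from unrounded estimates.</p>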
</sec>
<sec id="j_nejsds67_s_012">
<label>5.2.3</label>
<title>Confidence Interval for the Surrogate <inline-formula id="j_nejsds67_ineq_160"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula></title>
<p>The package <inline-formula id="j_nejsds67_ineq_161"><alternatives><mml:math>
<mml:mi mathvariant="bold">SurrogateRsq</mml:mi></mml:math><tex-math><![CDATA[$\mathbf{SurrogateRsq}$]]></tex-math></alternatives></inline-formula> allows us to produce a confidence interval for the surrogate <inline-formula id="j_nejsds67_ineq_162"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> using the function <monospace>surr_rsq_ci</monospace>. This function can directly use the object <monospace>surr_obj_mod_full</monospace> created earlier as the input of the <monospace>object</monospace> argument. In the code below, we set the significance level <monospace>alpha = 0.05</monospace> to produce a 95% confidence interval and the number of bootstrap repetitions to be 2000. The output is a table with the lower and upper bounds of the confidence interval. For the full model <inline-formula id="j_nejsds67_ineq_163"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
<mml:mi mathvariant="italic">u</mml:mi>
<mml:mi mathvariant="italic">l</mml:mi>
<mml:mi mathvariant="italic">l</mml:mi>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\mathcal{M}_{full}}$]]></tex-math></alternatives></inline-formula>, the 95% confidence interval of the surrogate <inline-formula id="j_nejsds67_ineq_164"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> is <inline-formula id="j_nejsds67_ineq_165"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:mn>0.402</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>0.485</mml:mn>
<mml:mo fence="true" stretchy="false">]</mml:mo></mml:math><tex-math><![CDATA[$[0.402,0.485]$]]></tex-math></alternatives></inline-formula>. The tightness of this interval implies that the uncertainty of the <inline-formula id="j_nejsds67_ineq_166"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> inference is low. <graphic xlink:href="nejsds67_g015.jpg"/></p>
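<p>To make the roles of the significance level and the number of bootstrap repetitions concrete, the following base R sketch builds a generic percentile bootstrap interval for an <italic>R</italic><sup>2</sup>. It is an assumption-laden illustration on simulated data and may differ from what <monospace>surr_rsq_ci</monospace> does internally; the data set <monospace>toy</monospace>, the helper <monospace>rsq</monospace>, and the symbol <monospace>B</monospace> are hypothetical, and 500 repetitions are used instead of 2000 only to keep the example fast.</p>
<preformat preformat-type="code"><![CDATA[
set.seed(2)
n   <- 200
toy <- data.frame(x = rnorm(n))
toy$s <- 0.8 * toy$x + rnorm(n)   # stand-in for a surrogate response

# R-squared of an OLS fit on a given data set
rsq <- function(d) summary(lm(s ~ x, data = d))$r.squared

alpha <- 0.05   # significance level, giving a 95% interval
B     <- 500    # number of bootstrap repetitions (2000 in the text)

# Resample rows with replacement and recompute R-squared each time
boot_rsq <- replicate(B, rsq(toy[sample(n, replace = TRUE), ]))

# Percentile interval: lower and upper alpha/2 quantiles
ci <- quantile(boot_rsq, probs = c(alpha / 2, 1 - alpha / 2))
print(ci)
]]></preformat>
<p>The two quantiles play the same role as the lower and upper bounds reported in the output table; a narrower interval indicates less sampling uncertainty in the estimate.</p>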
</sec>
<sec id="j_nejsds67_s_013">
<label>5.2.4</label>
<title>Importance Ranking of Explanatory Variables</title>
<p>We apply the function <monospace>surr_rsq_rank()</monospace> to examine the contribution of each individual variable to the overall surrogate <inline-formula id="j_nejsds67_ineq_167"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula>, which in turn produces a table of importance ranking. In the code below, we set the <monospace>object</monospace> argument as the object <monospace>surr_obj_mod_full</monospace> created earlier to examine the relative contribution of the variables in the full model. The output table shows (i) the surrogate <inline-formula id="j_nejsds67_ineq_168"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> for the model that removes an explanatory variable one at a time; (ii) the reduction of the <inline-formula id="j_nejsds67_ineq_169"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> after removing such a variable; (iii) the percentage contribution of this variable to the total surrogate <inline-formula id="j_nejsds67_ineq_170"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula>; and (iv) the rank of the variable by its percentage contribution. In the table below, we observe that the variable <monospace>alcohol</monospace> is ranked at the top as it explains 25.80% of the total surrogate <inline-formula id="j_nejsds67_ineq_171"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula>. It is followed by <monospace>volatile.acidity</monospace> (7.12%), <monospace>total.sulfur.dioxide</monospace> (3.52%), and <monospace>sulphates</monospace> (3.13%). The rest of the explanatory variables contribute less than 3% to the total surrogate <inline-formula id="j_nejsds67_ineq_172"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula>.</p><graphic xlink:href="nejsds67_g016.jpg"/>
<p>In the ranking table above, the contributions of <monospace>sulphates</monospace> and its higher order terms <inline-formula id="j_nejsds67_ineq_173"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="monospace">sulphates</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${\mathtt{sulphates}^{2}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_nejsds67_ineq_174"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="monospace">sulphates</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${\mathtt{sulphates}^{3}}$]]></tex-math></alternatives></inline-formula> to the surrogate <inline-formula id="j_nejsds67_ineq_175"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> are evaluated separately. This is the default behavior of the function <monospace>surr_rsq_rank()</monospace> when the optional argument <monospace>var_set</monospace> is not specified. If it is of interest to evaluate the factor <monospace>sulphates</monospace> as a whole, the function <monospace>surr_rsq_rank()</monospace> allows us to group <monospace>sulphates</monospace>, <inline-formula id="j_nejsds67_ineq_176"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="monospace">sulphates</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${\mathtt{sulphates}^{2}}$]]></tex-math></alternatives></inline-formula>, and <inline-formula id="j_nejsds67_ineq_177"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="monospace">sulphates</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${\mathtt{sulphates}^{3}}$]]></tex-math></alternatives></inline-formula> by using the optional argument <monospace>var_set</monospace>. For example, in the code below we create a list of two groups: one group contains all terms of <monospace>sulphates</monospace> and the second group only contains higher order terms of <monospace>sulphates</monospace>.</p><graphic xlink:href="nejsds67_g017.jpg"/>
<p>The output table above shows that the factor <monospace>sulphates</monospace> in fact contributes 13.82% to the total surrogate <inline-formula id="j_nejsds67_ineq_178"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> if its linear, squared, and cubic terms are considered together. This percentage contribution is much higher than the one obtained when only the linear term of <monospace>sulphates</monospace> was evaluated (3.13%). With this result, <monospace>sulphates</monospace> moves up to second place in terms of its relative contribution to the total surrogate <inline-formula id="j_nejsds67_ineq_179"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula>. The output table also shows that if we only consider the higher-order terms of <monospace>sulphates</monospace>, the percentage contribution is 6.19%, which is higher than that of every other individual variable except <monospace>alcohol</monospace> (25.80%) and <monospace>volatile.acidity</monospace> (7.12%). This is another piece of evidence supporting the inclusion of the squared and cubic terms of <monospace>sulphates</monospace> in the full model.</p>
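<p>The comparisons in this subsection can be checked by sorting the percentage contributions quoted in the text. The base R sketch below uses only those reported figures; the vector <monospace>contrib</monospace> and its labels for the grouped terms are hypothetical names chosen for this illustration, and variables whose contribution is not quoted individually are omitted.</p>
<preformat preformat-type="code"><![CDATA[
# Percentage contributions to the total surrogate R-squared, as quoted above
contrib <- c(
  alcohol              = 25.80,
  volatile.acidity     = 7.12,
  total.sulfur.dioxide = 3.52,
  sulphates_linear     = 3.13,   # linear term only
  sulphates_all_terms  = 13.82,  # linear, squared, and cubic terms grouped
  sulphates_high_order = 6.19    # squared and cubic terms grouped
)

ranked <- sort(contrib, decreasing = TRUE)
print(ranked)
]]></preformat>
<p>Sorted this way, the grouped <monospace>sulphates</monospace> terms come second after <monospace>alcohol</monospace>, and the higher-order terms alone come fourth, behind <monospace>volatile.acidity</monospace>.</p>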
</sec>
<sec id="j_nejsds67_s_014">
<label>5.2.5</label>
<title>Comparability of the Surrogate <inline-formula id="j_nejsds67_ineq_180"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> Across Different Samples and Models</title>
<p>One of the motives of [<xref ref-type="bibr" rid="j_nejsds67_ref_026">26</xref>] is to find an <inline-formula id="j_nejsds67_ineq_181"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> measure so that we can compare the goodness of fit across different models (e.g., linear, binary, or ordinal regression models) and/or samples that address the same or similar scientific/business question. We use the wine data in [<xref ref-type="bibr" rid="j_nejsds67_ref_010">10</xref>] to demonstrate that the surrogate <inline-formula id="j_nejsds67_ineq_182"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> enables this comparability, which may lead to new insights into decision-making. The data of [<xref ref-type="bibr" rid="j_nejsds67_ref_010">10</xref>] include 1599 red wine samples and 4898 white wine samples. Although the same rating scale (i.e., from 0 to 10) was offered to wine experts, only 6 rating categories (3 to 8) were observed in the red wine sample, whereas 7 rating categories (3 to 9) were observed in the white wine sample. As a result, the ordered probit models fitted to red and white wine samples have a different number of intercept parameters. In addition, after repeating the same analysis for the white wine sample (using code similar to that presented before), we find that the set of selected variables is not the same. The 7 selected variables are <monospace>alcohol</monospace>, <monospace>volatile.acidity</monospace>, <monospace>residual.sugar</monospace>, <monospace>free.sulfur.dioxide</monospace>, <monospace>sulphates</monospace>, <monospace>fixed.acidity</monospace>, and <monospace>pH</monospace>. As a result, the ordered probit models fitted to red and white wine samples have a different number of slope parameters as well. Despite the differences between the samples and models, the surrogate <inline-formula id="j_nejsds67_ineq_183"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> nevertheless enables us to compare goodness of fit across the board. Table <xref rid="j_nejsds67_tab_002">2</xref> summarizes the results obtained using our package <inline-formula id="j_nejsds67_ineq_184"><alternatives><mml:math>
<mml:mi mathvariant="bold">SurrogateRsq</mml:mi></mml:math><tex-math><![CDATA[$\mathbf{SurrogateRsq}$]]></tex-math></alternatives></inline-formula>.</p>
<table-wrap id="j_nejsds67_tab_002">
<label>Table 2</label>
<caption>
<p>Percentage contributions and ranks of the physicochemical variables in the analysis of the red wine and white wine samples.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: double"/>
<td colspan="2" style="vertical-align: top; text-align: center; border-top: double"><italic>Red wine data</italic></td>
<td style="vertical-align: top; text-align: center; border-top: double"/>
<td colspan="2" style="vertical-align: top; text-align: center; border-top: double"><italic>White wine data</italic></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"/>
<td colspan="2" style="vertical-align: top; text-align: center; border-bottom: solid thin">Surrogate <inline-formula id="j_nejsds67_ineq_185"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>=</mml:mo>
<mml:mn>0.439</mml:mn></mml:math><tex-math><![CDATA[${R^{2}}=0.439$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center"/>
<td colspan="2" style="vertical-align: top; text-align: center; border-bottom: solid thin">Surrogate <inline-formula id="j_nejsds67_ineq_186"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>=</mml:mo>
<mml:mn>0.307</mml:mn></mml:math><tex-math><![CDATA[${R^{2}}=0.307$]]></tex-math></alternatives></inline-formula></td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Variable</td>
<td style="vertical-align: top; text-align: center">Contribution</td>
<td style="vertical-align: top; text-align: center">Ranking</td>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center">Contribution</td>
<td style="vertical-align: top; text-align: center">Ranking</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"/>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">alcohol</td>
<td style="vertical-align: top; text-align: center">25.80%</td>
<td style="vertical-align: top; text-align: center">1</td>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center">77.16%</td>
<td style="vertical-align: top; text-align: center">1</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">sulphates (&amp; higher-order terms)</td>
<td style="vertical-align: top; text-align: center">13.82%</td>
<td style="vertical-align: top; text-align: center">2</td>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center">0.51%</td>
<td style="vertical-align: top; text-align: center">5</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">volatile.acidity</td>
<td style="vertical-align: top; text-align: center">7.12%</td>
<td style="vertical-align: top; text-align: center">3</td>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center">20.39%</td>
<td style="vertical-align: top; text-align: center">2</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">total.sulfur.dioxide</td>
<td style="vertical-align: top; text-align: center">3.52%</td>
<td style="vertical-align: top; text-align: center">4</td>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center"/>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">pH</td>
<td style="vertical-align: top; text-align: center">2.78%</td>
<td style="vertical-align: top; text-align: center">5</td>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center">0.06%</td>
<td style="vertical-align: top; text-align: center">7</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">chlorides</td>
<td style="vertical-align: top; text-align: center">1.21%</td>
<td style="vertical-align: top; text-align: center">6</td>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center"/>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">free.sulfur.dioxide</td>
<td style="vertical-align: top; text-align: center">0.96%</td>
<td style="vertical-align: top; text-align: center">7</td>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center">1.42%</td>
<td style="vertical-align: top; text-align: center">4</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">residual.sugar</td>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center"/>
<td style="vertical-align: top; text-align: center">5.34%</td>
<td style="vertical-align: top; text-align: center">3</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">fixed.acidity</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin"/>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin"/>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin"/>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">0.32%</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">6</td>
</tr>
</tbody><tbody>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: double">sulphates<inline-formula id="j_nejsds67_ineq_187"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{2}}$]]></tex-math></alternatives></inline-formula> &amp; sulphates<inline-formula id="j_nejsds67_ineq_188"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{3}}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: center; border-bottom: double">6.19%</td>
<td style="vertical-align: top; text-align: center; border-bottom: double"/>
<td style="vertical-align: top; text-align: center; border-bottom: double"/>
<td style="vertical-align: top; text-align: center; border-bottom: double"/>
<td style="vertical-align: top; text-align: center; border-bottom: double"/>
</tr>
</tbody>
</table>
</table-wrap>
<p>By comparing the results in the two panels (red versus white wine) of Table <xref rid="j_nejsds67_tab_002">2</xref>, we can draw the following conclusions: (i) the same set of measured physicochemical features in the experiment of [<xref ref-type="bibr" rid="j_nejsds67_ref_010">10</xref>] has greater explanatory power for red wine (43.9% versus 30.7%); (ii) the ranking of explanatory variables differs between the two types of wine, with <monospace>alcohol</monospace> as the only exception (ranked first for both); and (iii) the percentage contributions of the variables differ substantially in magnitude for red versus white wine (e.g., <monospace>alcohol</monospace>, 25.80% versus 77.16%; <monospace>sulphates</monospace>, 13.82% versus 0.51%; <monospace>volatile.acidity</monospace>, 7.12% versus 20.39%). These insights drawn from our goodness-of-fit analysis may help us understand how physicochemical features influence wine ratings and how that influence depends on the type of wine. The percentage contributions and ranking of physicochemical features may be used to guide or even devise the wine-making process.</p>
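<p>To put the comparison in (iii) on an absolute scale, the percentage contributions in Table <xref rid="j_nejsds67_tab_002">2</xref> can be converted into reductions in the surrogate <italic>R</italic><sup>2</sup>. The following R sketch is illustrative only: it assumes that a percentage contribution is the drop in the surrogate <italic>R</italic><sup>2</sup> after removing a variable, expressed relative to the full-model value, and the object names (<monospace>rsq_full</monospace>, <monospace>contrib_pct</monospace>, <monospace>abs_reduction</monospace>) are hypothetical rather than part of the package.</p>
<preformat preformat-type="code"><![CDATA[
# Values copied from Table 2: full-model surrogate R-squared and the
# percentage contributions of three variables shared by both models.
rsq_full <- c(red = 0.439, white = 0.307)
contrib_pct <- rbind(
  alcohol          = c(red = 25.80, white = 77.16),
  sulphates        = c(red = 13.82, white = 0.51),
  volatile.acidity = c(red = 7.12,  white = 20.39)
)

# Assumption: contribution (%) = 100 * reduction / full-model surrogate R-squared,
# hence reduction = contribution / 100 * full-model surrogate R-squared.
# sweep() multiplies each column by the matching entry of rsq_full.
abs_reduction <- sweep(contrib_pct / 100, 2, rsq_full, "*")
print(round(abs_reduction, 3))
]]></preformat>
<p>Under this assumption, <monospace>alcohol</monospace> corresponds to a reduction of about 0.113 for red wine and about 0.237 for white wine, so its larger role in the white wine model holds on the absolute scale as well.</p>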
</sec>
</sec>
</sec>
<sec id="j_nejsds67_s_015">
<label>6</label>
<title>Discussion</title>
<p>In this paper, we have developed the R package <inline-formula id="j_nejsds67_ineq_189"><alternatives><mml:math>
<mml:mi mathvariant="bold">SurrogateRsq</mml:mi></mml:math><tex-math><![CDATA[$\mathbf{SurrogateRsq}$]]></tex-math></alternatives></inline-formula> for categorical data goodness-of-fit analysis using the surrogate <inline-formula id="j_nejsds67_ineq_190"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula>. The package applies to probit/logistic regression models, and it is compatible with commonly used R packages for binary and ordinal data analysis. With <inline-formula id="j_nejsds67_ineq_191"><alternatives><mml:math>
<mml:mi mathvariant="bold">SurrogateRsq</mml:mi></mml:math><tex-math><![CDATA[$\mathbf{SurrogateRsq}$]]></tex-math></alternatives></inline-formula>, we are able to obtain the point estimate and the interval estimates of the surrogate <inline-formula id="j_nejsds67_ineq_192"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula>. An importance ranking table for all explanatory variables can be produced as well. These new features can be used in conjunction with other R packages developed for variable selection and model diagnostics. This “whole-analysis” is summarized in a workflow diagram, which can be followed in practice for categorical data analysis. To illustrate the utility of this package in real data analysis, we have used a wine rating dataset as an example and provided sample code. In addition, we have used the package <inline-formula id="j_nejsds67_ineq_193"><alternatives><mml:math>
<mml:mi mathvariant="bold">SurrogateRsq</mml:mi></mml:math><tex-math><![CDATA[$\mathbf{SurrogateRsq}$]]></tex-math></alternatives></inline-formula> to demonstrate that the surrogate <inline-formula id="j_nejsds67_ineq_194"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> allows us to compare different models trained from the red wine sample and white wine sample. The comparison has led to new findings and insights that deepen our understanding of how physicochemical features influence wine quality. The result suggests that our package can be used in a similar way to analyze multiple studies (and/or models) that address the same or similar scientific or business question.</p>
<p>We use the red wine data to examine the computational time of the functions in our package. Table <xref rid="j_nejsds67_tab_003">3</xref> presents the comparison, where the column n = 1597 corresponds to the real data, and the other columns (n = 3000, 6000, 12000) are based on pseudo-real data sets generated by randomly sampling additional rows from the real data. The numbers are average running times in seconds over 10 repetitions on an Apple MacBook Pro with the M1 Max chip. The upper panel of Table <xref rid="j_nejsds67_tab_003">3</xref> shows that if only a point estimate of the surrogate <inline-formula id="j_nejsds67_ineq_195"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> is needed, our <monospace>surr_rsq()</monospace> function takes almost no time to provide the result (e.g., merely 0.213 seconds when <inline-formula id="j_nejsds67_ineq_196"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>12000</mml:mn></mml:math><tex-math><![CDATA[$n=12000$]]></tex-math></alternatives></inline-formula>). However, the bottom panel of Table <xref rid="j_nejsds67_tab_003">3</xref> shows that if confidence intervals are wanted, our <monospace>surr_rsq_ci()</monospace> function takes much longer on a single CPU core (e.g., 1421 seconds, or about 23.7 minutes, when <inline-formula id="j_nejsds67_ineq_197"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>12000</mml:mn></mml:math><tex-math><![CDATA[$n=12000$]]></tex-math></alternatives></inline-formula>). Given that CPUs with 6 to 10 cores are quite common nowadays, we recommend using the parallel computing option introduced in Section <xref rid="j_nejsds67_s_003">3</xref>. It reduces the computing time to 211.93 seconds, or about 3.5 minutes, when <inline-formula id="j_nejsds67_ineq_198"><alternatives><mml:math>
<mml:mi mathvariant="italic">n</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>12000</mml:mn></mml:math><tex-math><![CDATA[$n=12000$]]></tex-math></alternatives></inline-formula>.</p>
<table-wrap id="j_nejsds67_tab_003">
<label>Table 3</label>
<caption>
<p>Computational time estimates for the functions in the <bold>SurrogateRsq</bold> package. The numbers in the table are average running times in seconds over 10 repetitions of each scenario on an Apple M1 Max chip with 10 cores and a clock rate of 2.06 ∼ 3.22 GHz. Note: The <monospace>asym</monospace> and <monospace>parallel</monospace> arguments control the asymptotic version of the surrogate <inline-formula id="j_nejsds67_ineq_199"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> and the parallel computing introduced in Section <xref rid="j_nejsds67_s_003">3</xref>.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: double; border-bottom: solid thin">Function: <monospace>surr_rsq()</monospace></td>
<td style="vertical-align: top; text-align: right; border-top: double; border-bottom: solid thin">n=1,597</td>
<td style="vertical-align: top; text-align: right; border-top: double; border-bottom: solid thin">n=3,000</td>
<td style="vertical-align: top; text-align: right; border-top: double; border-bottom: solid thin">n=6,000</td>
<td style="vertical-align: top; text-align: right; border-top: double; border-bottom: solid thin">n=12,000</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left"><monospace>avg.num = 30</monospace></td>
<td style="vertical-align: top; text-align: right">0.048</td>
<td style="vertical-align: top; text-align: right">0.067</td>
<td style="vertical-align: top; text-align: right">0.123</td>
<td style="vertical-align: top; text-align: right">0.213</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><monospace>asym   = TRUE</monospace></td>
<td style="vertical-align: top; text-align: right; border-bottom: solid thin">0.001</td>
<td style="vertical-align: top; text-align: right; border-bottom: solid thin">0.002</td>
<td style="vertical-align: top; text-align: right; border-bottom: solid thin">0.004</td>
<td style="vertical-align: top; text-align: right; border-bottom: solid thin">0.007</td>
</tr>
</tbody><tbody>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Function: <monospace>surr_rsq_ci(B = 2000)</monospace></td>
<td style="vertical-align: top; text-align: right"/>
<td style="vertical-align: top; text-align: right"/>
<td style="vertical-align: top; text-align: right"/>
<td style="vertical-align: top; text-align: right"/>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><monospace>avg.num = 30,  parallel = FALSE</monospace></td>
<td style="vertical-align: top; text-align: right">241.95</td>
<td style="vertical-align: top; text-align: right">403.73</td>
<td style="vertical-align: top; text-align: right">777.29</td>
<td style="vertical-align: top; text-align: right">1421.41</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><monospace>asym   = TRUE, parallel = FALSE</monospace></td>
<td style="vertical-align: top; text-align: right">149.40</td>
<td style="vertical-align: top; text-align: right">258.97</td>
<td style="vertical-align: top; text-align: right">528.27</td>
<td style="vertical-align: top; text-align: right">977.23</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><monospace>avg.num = 30,  parallel = TRUE</monospace></td>
<td style="vertical-align: top; text-align: right">35.68</td>
<td style="vertical-align: top; text-align: right">60.17</td>
<td style="vertical-align: top; text-align: right">116.40</td>
<td style="vertical-align: top; text-align: right">211.93</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: double"><monospace>asym   = TRUE, parallel = TRUE</monospace></td>
<td style="vertical-align: top; text-align: right; border-bottom: double">21.54</td>
<td style="vertical-align: top; text-align: right; border-bottom: double">35.85</td>
<td style="vertical-align: top; text-align: right; border-bottom: double">118.11</td>
<td style="vertical-align: top; text-align: right; border-bottom: double">144.95</td>
</tr>
</tbody>
</table>
</table-wrap>
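<p>The gain from parallel computing can be read off Table <xref rid="j_nejsds67_tab_003">3</xref> directly. The short R sketch below copies the two <monospace>avg.num = 30</monospace> rows for <monospace>surr_rsq_ci(B = 2000)</monospace>, converts them to minutes, and computes the serial-to-parallel ratio; the object names (<monospace>serial_sec</monospace>, <monospace>parallel_sec</monospace>, <monospace>timing</monospace>) are hypothetical and used for illustration only.</p>
<preformat preformat-type="code"><![CDATA[
# Average running times in seconds for surr_rsq_ci(B = 2000), copied from Table 3.
n            <- c(1597, 3000, 6000, 12000)
serial_sec   <- c(241.95, 403.73, 777.29, 1421.41)  # avg.num = 30, parallel = FALSE
parallel_sec <- c(35.68, 60.17, 116.40, 211.93)     # avg.num = 30, parallel = TRUE

timing <- data.frame(
  n            = n,
  serial_min   = round(serial_sec / 60, 1),           # seconds to minutes
  parallel_min = round(parallel_sec / 60, 1),
  speedup      = round(serial_sec / parallel_sec, 2)  # serial time / parallel time
)
print(timing)
]]></preformat>
<p>The ratios all lie between 6.6 and 6.8, so for these settings the speedup from parallel computing is roughly constant across the sample sizes considered.</p>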
<p>Software developers who want to build on or modify this package for their own scientific inquiries can change any or all of its three components. First, the essential input for the functions in our <inline-formula id="j_nejsds67_ineq_200"><alternatives><mml:math>
<mml:mi mathvariant="bold">SurrogateRsq</mml:mi></mml:math><tex-math><![CDATA[$\mathbf{SurrogateRsq}$]]></tex-math></alternatives></inline-formula> package is a fitted model object returned by a model-fitting function (e.g., <monospace>glm()</monospace>, <monospace>polr()</monospace>). Software developers can replace this object with a model of their interest. For example, this surrogate <inline-formula id="j_nejsds67_ineq_201"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="italic">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${R^{2}}$]]></tex-math></alternatives></inline-formula> approach may still work for the discrete choice models studied by [<xref ref-type="bibr" rid="j_nejsds67_ref_008">8</xref>]. Second, depending on the form of the model, software developers can choose or modify what distribution to use for simulating the surrogate response. Third, once the surrogate responses are available, one can follow the inference procedures discussed in our paper and tailor them to meet specific needs.</p>
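<p>The three components can be illustrated with a minimal, self-contained R sketch. It is not the implementation used in the package: it assumes simulated data, an ordered probit model fitted with <monospace>polr()</monospace> from <bold>MASS</bold>, a truncated normal surrogate distribution, and a single surrogate draw (whereas Table <xref rid="j_nejsds67_tab_003">3</xref> refers to <monospace>avg.num = 30</monospace>). All object names are hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
library(MASS)  # polr(); MASS is a recommended package shipped with R

set.seed(2024)
n <- 1000
x <- rnorm(n)
latent <- x + rnorm(n)  # latent variable with unit slope and N(0, 1) error
y <- cut(latent, breaks = c(-Inf, -0.5, 0.5, Inf),
         labels = c("low", "mid", "high"), ordered_result = TRUE)

# Component 1: the fitted model (here an ordered probit model).
fit <- polr(y ~ x, method = "probit")

# Component 2: the surrogate distribution. For a probit link the surrogate is
# N(eta, 1) truncated to the interval between the two cutpoints that bracket
# the observed category; it is sampled here by inverting the normal CDF.
eta  <- fit$lp                   # fitted linear predictor
cuts <- c(-Inf, fit$zeta, Inf)   # estimated cutpoints, padded with -Inf and Inf
j    <- as.integer(y)            # observed category index
u    <- runif(n, pnorm(cuts[j] - eta), pnorm(cuts[j + 1] - eta))
s    <- eta + qnorm(u)

# Component 3: inference on the surrogate. A single-draw R-squared is taken
# from the least-squares fit of the surrogate response on the predictor.
rsq_one_draw <- summary(lm(s ~ x))$r.squared
print(rsq_one_draw)
]]></preformat>
<p>In this simulation the latent variable has a population <italic>R</italic><sup>2</sup> of 0.5, which gives a reference value for checking the printed result. Replacing the model in the first step or the sampling distribution in the second step is where a developer would adapt the sketch to another model class.</p>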
</sec>
</body>
<back>
<app-group>
<app id="j_nejsds67_app_001"><label>Appendix A</label>
<p>In this section, we provide sample code for variable selection using the step-wise selection method and the regularization method with an elastic net penalty.</p>
<p>The step-wise selection method starts with a null model (<monospace>null_model</monospace>) with an intercept only. The largest model we specify is the “naive model” with all explanatory variables. The result below shows that this method selects the same variables as the exhaustive search method.</p><graphic xlink:href="nejsds67_g018.jpg"/>
<p>We also use the function <monospace>ordinalNet()</monospace> in the R package <inline-formula id="j_nejsds67_ineq_202"><alternatives><mml:math>
<mml:mi mathvariant="bold">ordinalNet</mml:mi></mml:math><tex-math><![CDATA[$\mathbf{ordinalNet}$]]></tex-math></alternatives></inline-formula> to fit a cumulative probit model with an elastic net penalty. The result below shows that it excludes only a single variable, <monospace>density</monospace>.</p><graphic xlink:href="nejsds67_g019.jpg"/>
<p>Figure <xref rid="j_nejsds67_fig_004">4</xref> contains diagnostic plots for the full model developed in Section <xref rid="j_nejsds67_s_006">5.1</xref> after performing variable selection and model diagnostics.</p>
<fig id="j_nejsds67_fig_004">
<label>Figure 4</label>
<caption>
<p>Plots of surrogate residuals versus each of the explanatory variables for the full model after adding the squared and cubic terms of sulphates.</p>
</caption>
<graphic xlink:href="nejsds67_g020.jpg"/>
</fig>
</app></app-group>
<ref-list id="j_nejsds67_reflist_001">
<title>References</title>
<ref id="j_nejsds67_ref_001">
<label>[1]</label><mixed-citation publication-type="journal"><string-name><surname>Analytics</surname>, <given-names>R.</given-names></string-name> and <string-name><surname>Weston</surname>, <given-names>S.</given-names></string-name> (<year>2015</year>). <article-title>foreach: Provides foreach looping construct for R</article-title>. <source>R package version 1.4.3</source>. <ext-link ext-link-type="doi" xlink:href="https://CRAN.R-project.org/package=foreach" xlink:type="simple">https://CRAN.R-project.org/package=foreach</ext-link>.</mixed-citation>
</ref>
<ref id="j_nejsds67_ref_002">
<label>[2]</label><mixed-citation publication-type="other"><string-name><surname>Anderson</surname>, <given-names>D.</given-names></string-name> and <string-name><surname>Kurtz</surname>, <given-names>T.</given-names></string-name> Continuous time Markov chain models for chemical reaction networks. <uri>http://www.math.wisc.edu/~kurtz/papers/AndKurJuly10.pdf</uri>. Accessed 27 July 2010.</mixed-citation>
</ref>
<ref id="j_nejsds67_ref_003">
<label>[3]</label><mixed-citation publication-type="chapter"><string-name><surname>Blanchet</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Leder</surname>, <given-names>K.</given-names></string-name> and <string-name><surname>Glynn</surname>, <given-names>P.</given-names></string-name> (<year>2009</year>). <chapter-title>Efficient Simulation of Light-Tailed Sums: an Old-Folk Song Sung to a Faster New Tune...</chapter-title> In <source>Monte Carlo and Quasi-Monte Carlo Methods</source> (<string-name><given-names>P.</given-names> <surname>L’Ecuyer</surname></string-name> and <string-name><given-names>A. B.</given-names> <surname>Owen</surname></string-name>, eds.) <publisher-name>Springer</publisher-name>, <publisher-loc>Berlin</publisher-loc>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1007/978-3-642-04107-5_13" xlink:type="simple">https://doi.org/10.1007/978-3-642-04107-5_13</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=2743897">MR2743897</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds67_ref_004">
<label>[4]</label><mixed-citation publication-type="journal"><string-name><surname>Blanchet</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Leder</surname>, <given-names>K.</given-names></string-name> and <string-name><surname>Shi</surname>, <given-names>Y.</given-names></string-name> (<year>2011</year>). <article-title>Analysis of a splitting estimator for rare event probabilities in Jackson networks</article-title>. <source>Stochastic Systems</source> <volume>1</volume> <fpage>306</fpage>–<lpage>339</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1214/11-SSY026" xlink:type="simple">https://doi.org/10.1214/11-SSY026</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=2949543">MR2949543</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds67_ref_005">
<label>[5]</label><mixed-citation publication-type="journal"><string-name><surname>Breheny</surname>, <given-names>P.</given-names></string-name> (<year>2013</year>). <article-title>ncvreg: Regularization paths for SCAD- and MCP-penalized regression models</article-title>. <source>R package version 2.6-0</source>. <ext-link ext-link-type="doi" xlink:href="https://pbreheny.github.io/ncvreg/" xlink:type="simple">https://pbreheny.github.io/ncvreg/</ext-link>.</mixed-citation>
</ref>
<ref id="j_nejsds67_ref_006">
<label>[6]</label><mixed-citation publication-type="other"><string-name><surname>Breheny</surname>, <given-names>P.</given-names></string-name> and <string-name><surname>Breheny</surname>, <given-names>M. P.</given-names></string-name> (2014). <italic>Package ‘grpreg’</italic>. <ext-link ext-link-type="doi" xlink:href="https://pbreheny.github.io/grpreg/" xlink:type="simple">https://pbreheny.github.io/grpreg/</ext-link>.</mixed-citation>
</ref>
<ref id="j_nejsds67_ref_007">
<label>[7]</label><mixed-citation publication-type="book"><string-name><surname>Chao</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Miyazawa</surname>, <given-names>M.</given-names></string-name> and <string-name><surname>Pinedo</surname>, <given-names>M.</given-names></string-name> (<year>1999</year>) <source>Queueing Networks: Customers, Signals and Product Form Solutions</source>. <publisher-name>Wiley</publisher-name>, <publisher-loc>New York</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_nejsds67_ref_008">
<label>[8]</label><mixed-citation publication-type="journal"><string-name><surname>Cheng</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>R.</given-names></string-name> and <string-name><surname>Zhang</surname>, <given-names>H.</given-names></string-name> (<year>2021</year>). <article-title>Surrogate Residuals for Discrete Choice Models</article-title>. <source>Journal of Computational and Graphical Statistics</source> <volume>30</volume>(<issue>1</issue>) <fpage>67</fpage>–<lpage>77</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1080/10618600.2020.1775618" xlink:type="simple">https://doi.org/10.1080/10618600.2020.1775618</ext-link>.</mixed-citation>
</ref>
<ref id="j_nejsds67_ref_009">
<label>[9]</label><mixed-citation publication-type="other"><string-name><surname>Christensen</surname>, <given-names>R. H. B.</given-names></string-name> (2019). <italic>ordinal—Regression Models for Ordinal Data</italic>. R package version 2019.12-10. <uri>https://CRAN.R-project.org/package=ordinal</uri>. <ext-link ext-link-type="doi" xlink:href="http://www2.uaem.mx/r-mirror/web/packages/ordinal/" xlink:type="simple">http://www2.uaem.mx/r-mirror/web/packages/ordinal/</ext-link>.</mixed-citation>
</ref>
<ref id="j_nejsds67_ref_010">
<label>[10]</label><mixed-citation publication-type="journal"><string-name><surname>Cortez</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Cerdeira</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Almeida</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Matos</surname>, <given-names>T.</given-names></string-name> and <string-name><surname>Reis</surname>, <given-names>J.</given-names></string-name> (<year>2009</year>). <article-title>Modeling Wine Preferences by Data Mining from Physicochemical Properties</article-title>. <source>Decision Support Systems</source> <volume>47</volume>(<issue>4</issue>) <fpage>547</fpage>–<lpage>553</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.dss.2009.05.016" xlink:type="simple">https://doi.org/10.1016/j.dss.2009.05.016</ext-link>.</mixed-citation>
</ref>
<ref id="j_nejsds67_ref_011">
<label>[11]</label><mixed-citation publication-type="journal"><string-name><surname>Cox</surname>, <given-names>D. R.</given-names></string-name> and <string-name><surname>Wermuth</surname>, <given-names>N.</given-names></string-name> (<year>1992</year>). <article-title>A Comment on the Coefficient of Determination for Binary Responses</article-title>. <source>The American Statistician</source> <volume>46</volume>(<issue>1</issue>) <fpage>1</fpage>–<lpage>4</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1080/00031305.1992.10475836" xlink:type="simple">https://doi.org/10.1080/00031305.1992.10475836</ext-link>.</mixed-citation>
</ref>
<ref id="j_nejsds67_ref_012">
<label>[12]</label><mixed-citation publication-type="book"><string-name><surname>Cox</surname>, <given-names>D.</given-names></string-name> and <string-name><surname>Snell</surname>, <given-names>E.</given-names></string-name> (<year>1989</year>) <source>Analysis of Binary Data</source> <volume>32</volume>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1201/9781315137391" xlink:type="simple">https://doi.org/10.1201/9781315137391</ext-link>.</mixed-citation>
</ref>
<ref id="j_nejsds67_ref_013">
<label>[13]</label><mixed-citation publication-type="journal"><string-name><surname>Cragg</surname>, <given-names>J. G.</given-names></string-name> and <string-name><surname>Uhler</surname>, <given-names>R. S.</given-names></string-name> (<year>1970</year>). <article-title>The Demand for Automobiles</article-title>. <source>The Canadian Journal of Economics</source> <volume>3</volume>(<issue>3</issue>) <fpage>386</fpage>–<lpage>406</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.2307/133656" xlink:type="simple">https://doi.org/10.2307/133656</ext-link>.</mixed-citation>
</ref>
<ref id="j_nejsds67_ref_014">
<label>[14]</label><mixed-citation publication-type="journal"><string-name><surname>Efron</surname>, <given-names>B.</given-names></string-name> (<year>1978</year>). <article-title>Regression and ANOVA with Zero-one Data: Measures of Residual Variation</article-title>. <source>Journal of the American Statistical Association</source> <volume>73</volume>(<issue>361</issue>) <fpage>113</fpage>–<lpage>121</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1080/01621459.1978.10480013" xlink:type="simple">https://doi.org/10.1080/01621459.1978.10480013</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=0501624">MR0501624</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds67_ref_015">
<label>[15]</label><mixed-citation publication-type="book"><string-name><surname>Efron</surname>, <given-names>B.</given-names></string-name> and <string-name><surname>Tibshirani</surname>, <given-names>R. J.</given-names></string-name> <source>An Introduction to the Bootstrap</source>. <publisher-name>Springer US</publisher-name>. <ext-link ext-link-type="doi" xlink:href="http://link.springer.com/10.1007/978-1-4899-4541-9" xlink:type="simple">http://link.springer.com/10.1007/978-1-4899-4541-9</ext-link>.</mixed-citation>
</ref>
<ref id="j_nejsds67_ref_016">
<label>[16]</label><mixed-citation publication-type="journal"><string-name><surname>Fan</surname>, <given-names>J.</given-names></string-name> and <string-name><surname>Lv</surname>, <given-names>J.</given-names></string-name> (<year>2008</year>). <article-title>Sure Independence Screening for Ultrahigh Dimensional Feature Space</article-title>. <source>Journal of the Royal Statistical Society: Series B (Statistical Methodology)</source> <volume>70</volume>(<issue>5</issue>) <fpage>849</fpage>–<lpage>911</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1111/j.1467-9868.2008.00674.x" xlink:type="simple">https://doi.org/10.1111/j.1467-9868.2008.00674.x</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=2530322">MR2530322</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds67_ref_017">
<label>[17]</label><mixed-citation publication-type="journal"><string-name><surname>Friedman</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Hastie</surname>, <given-names>T.</given-names></string-name> and <string-name><surname>Tibshirani</surname>, <given-names>R.</given-names></string-name> (<year>2010</year>). <article-title>Regularization paths for generalized linear models via coordinate descent</article-title>. <source>Journal of Statistical Software</source> <volume>33</volume>(<issue>1</issue>) <fpage>1</fpage>. <ext-link ext-link-type="doi" xlink:href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2929880/" xlink:type="simple">https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2929880/</ext-link>.</mixed-citation>
</ref>
<ref id="j_nejsds67_ref_018">
<label>[18]</label><mixed-citation publication-type="journal"><string-name><surname>Greenwell</surname>, <given-names>B. M.</given-names></string-name>, <string-name><surname>McCarthy</surname>, <given-names>A. J.</given-names></string-name>, <string-name><surname>Boehmke</surname>, <given-names>B. C.</given-names></string-name> and <string-name><surname>Liu</surname>, <given-names>D.</given-names></string-name> (<year>2018</year>). <article-title>Residuals and Diagnostics for Binary and Ordinal Regression Models: An Introduction to the sure Package</article-title>. <source>The R Journal</source> <volume>10</volume>(<issue>1</issue>) <fpage>381</fpage>–<lpage>394</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.32614/RJ-2018-004" xlink:type="simple">https://doi.org/10.32614/RJ-2018-004</ext-link>.</mixed-citation>
</ref>
<ref id="j_nejsds67_ref_019">
<label>[19]</label><mixed-citation publication-type="journal"><string-name><surname>Hagle</surname>, <given-names>T. M.</given-names></string-name> and <string-name><surname>Mitchell</surname>, <given-names>G. E.</given-names></string-name> (<year>1992</year>). <article-title>Goodness-of-Fit Measures for Probit and Logit</article-title>. <source>American Journal of Political Science</source> <volume>36</volume>(<issue>3</issue>) <fpage>762</fpage>–<lpage>784</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.2307/2111590" xlink:type="simple">https://doi.org/10.2307/2111590</ext-link>.</mixed-citation>
</ref>
<ref id="j_nejsds67_ref_020">
<label>[20]</label><mixed-citation publication-type="other"><string-name><surname>Harrell Jr</surname>, <given-names>F. E.</given-names></string-name> (2019). rms: Regression Modeling Strategies. <italic>R package version 5.1-4</italic>. <ext-link ext-link-type="doi" xlink:href="https://CRAN.R-project.org/package=rms" xlink:type="simple">https://CRAN.R-project.org/package=rms</ext-link>.</mixed-citation>
</ref>
<ref id="j_nejsds67_ref_021">
<label>[21]</label><mixed-citation publication-type="journal"><string-name><surname>Hu</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Shao</surname>, <given-names>J.</given-names></string-name> and <string-name><surname>Palta</surname>, <given-names>M.</given-names></string-name> (<year>2006</year>). <article-title>Pseudo-R<inline-formula id="j_nejsds67_ineq_203"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{2}}$]]></tex-math></alternatives></inline-formula> in Logistic Regression Model</article-title>. <source>Statistica Sinica</source> <volume>16</volume>(<issue>3</issue>) <fpage>847</fpage>–<lpage>860</lpage>. <ext-link ext-link-type="doi" xlink:href="https://www.jstor.org/stable/24307577" xlink:type="simple">https://www.jstor.org/stable/24307577</ext-link>.</mixed-citation>
</ref>
<ref id="j_nejsds67_ref_022">
<label>[22]</label><mixed-citation publication-type="journal"><string-name><surname>Laitila</surname>, <given-names>T.</given-names></string-name> (<year>1993</year>). <article-title>A Pseudo-R<inline-formula id="j_nejsds67_ineq_204"><alternatives><mml:math>
<mml:msup>
<mml:mrow/>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${^{2}}$]]></tex-math></alternatives></inline-formula> Measure for Limited and Qualitative Dependent Variable Models</article-title>. <source>Journal of Econometrics</source> <volume>56</volume>(<issue>3</issue>) <fpage>341</fpage>–<lpage>356</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/0304-4076(93)90125-O" xlink:type="simple">https://doi.org/10.1016/0304-4076(93)90125-O</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=1219168">MR1219168</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds67_ref_023">
<label>[23]</label><mixed-citation publication-type="journal"><string-name><surname>Li</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Zhu</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Chen</surname>, <given-names>Y.</given-names></string-name> and <string-name><surname>Liu</surname>, <given-names>D.</given-names></string-name> (<year>2021</year>). <article-title>PAsso: an R Package for Assessing Partial Association between Ordinal Variables</article-title>. <source>The R Journal</source> <volume>13</volume>(<issue>2</issue>) <fpage>135</fpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.32614/RJ-2021-088" xlink:type="simple">https://doi.org/10.32614/RJ-2021-088</ext-link>.</mixed-citation>
</ref>
<ref id="j_nejsds67_ref_024">
<label>[24]</label><mixed-citation publication-type="journal"><string-name><surname>Liu</surname>, <given-names>D.</given-names></string-name> and <string-name><surname>Zhang</surname>, <given-names>H.</given-names></string-name> (<year>2018</year>). <article-title>Residuals and Diagnostics for Ordinal Regression Models: A Surrogate Approach</article-title>. <source>Journal of the American Statistical Association</source> <volume>113</volume>(<issue>522</issue>) <fpage>845</fpage>–<lpage>854</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1080/01621459.2017.1292915" xlink:type="simple">https://doi.org/10.1080/01621459.2017.1292915</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=3832231">MR3832231</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds67_ref_025">
<label>[25]</label><mixed-citation publication-type="journal"><string-name><surname>Liu</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Li</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Yu</surname>, <given-names>Y.</given-names></string-name> and <string-name><surname>Moustaki</surname>, <given-names>I.</given-names></string-name> (<year>2021</year>). <article-title>Assessing Partial Association Between Ordinal Variables: Quantification, Visualization, and Hypothesis Testing</article-title>. <source>Journal of the American Statistical Association</source> <volume>116</volume>(<issue>534</issue>) <fpage>955</fpage>–<lpage>968</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1080/01621459.2020.1796394" xlink:type="simple">https://doi.org/10.1080/01621459.2020.1796394</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=4270036">MR4270036</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds67_ref_026">
<label>[26]</label><mixed-citation publication-type="journal"><string-name><surname>Liu</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Zhu</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Greenwell</surname>, <given-names>B.</given-names></string-name> and <string-name><surname>Lin</surname>, <given-names>Z.</given-names></string-name> (<year>2023</year>). <article-title>A new goodness-of-fit measure for probit models: Surrogate R<sup>2</sup></article-title>. <source>British Journal of Mathematical and Statistical Psychology</source> <volume>76</volume>(<issue>1</issue>) <fpage>192</fpage>–<lpage>210</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1111/bmsp.12289" xlink:type="simple">https://doi.org/10.1111/bmsp.12289</ext-link>.</mixed-citation>
</ref>
<ref id="j_nejsds67_ref_027">
<label>[27]</label><mixed-citation publication-type="journal"><string-name><surname>Liu</surname>, <given-names>I.</given-names></string-name> and <string-name><surname>Agresti</surname>, <given-names>A.</given-names></string-name> (<year>2005</year>). <article-title>The Analysis of Ordered Categorical Data: An Overview and a Survey of Recent Developments (with discussion)</article-title>. <source>Test</source> <volume>14</volume>(<issue>1</issue>) <fpage>1</fpage>–<lpage>73</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1007/BF02595397" xlink:type="simple">https://doi.org/10.1007/BF02595397</ext-link>.</mixed-citation>
</ref>
<ref id="j_nejsds67_ref_028">
<label>[28]</label><mixed-citation publication-type="journal"><string-name><surname>Lumley</surname>, <given-names>T.</given-names></string-name> and <string-name><surname>Lumley</surname>, <given-names>M. T.</given-names></string-name> (<year>2013</year>). <article-title>Package ‘leaps’</article-title>. <source>Regression subset selection. Thomas Lumley Based on Fortran Code by Alan Miller. Available online: <uri>http://CRAN.R-project.org/package=leaps</uri> (Accessed on 18 March 2018)</source>. <ext-link ext-link-type="doi" xlink:href="https://cran.r-project.org/web/packages/leaps/index.html" xlink:type="simple">https://cran.r-project.org/web/packages/leaps/index.html</ext-link>.</mixed-citation>
</ref>
<ref id="j_nejsds67_ref_029">
<label>[29]</label><mixed-citation publication-type="chapter"><string-name><surname>McFadden</surname>, <given-names>D.</given-names></string-name> (<year>1973</year>). <chapter-title>Conditional Logit Analysis of Qualitative Choice Behavior</chapter-title>. In <source>Frontiers in Econometrics</source> (<string-name><given-names>P.</given-names> <surname>Zarembka</surname></string-name>, ed.) <fpage>105</fpage>–<lpage>142</lpage>. <ext-link ext-link-type="doi" xlink:href="https://eml.berkeley.edu/reprints/mcfadden/zarembka.pdf" xlink:type="simple">https://eml.berkeley.edu/reprints/mcfadden/zarembka.pdf</ext-link>.</mixed-citation>
</ref>
<ref id="j_nejsds67_ref_030">
<label>[30]</label><mixed-citation publication-type="journal"><string-name><surname>McKelvey</surname>, <given-names>R. D.</given-names></string-name> and <string-name><surname>Zavoina</surname>, <given-names>W.</given-names></string-name> (<year>1975</year>). <article-title>A Statistical Model for the Analysis of Ordinal Level Dependent Variables</article-title>. <source>Journal of Mathematical Sociology</source> <volume>4</volume>(<issue>1</issue>) <fpage>103</fpage>–<lpage>120</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1080/0022250X.1975.9989847" xlink:type="simple">https://doi.org/10.1080/0022250X.1975.9989847</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=0400610">MR0400610</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds67_ref_031">
<label>[31]</label><mixed-citation publication-type="journal"><string-name><surname>Nagelkerke</surname>, <given-names>N. J.</given-names></string-name> (<year>1991</year>). <article-title>A Note on a General Definition of the Coefficient of Determination</article-title>. <source>Biometrika</source> <volume>78</volume>(<issue>3</issue>) <fpage>691</fpage>–<lpage>692</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1093/biomet/78.3.691" xlink:type="simple">https://doi.org/10.1093/biomet/78.3.691</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=1130937">MR1130937</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds67_ref_032">
<label>[32]</label><mixed-citation publication-type="chapter"><string-name><surname>Pant</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Blaauw</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Zolotov</surname>, <given-names>V.</given-names></string-name>, <string-name><surname>Sundareswaran</surname>, <given-names>S.</given-names></string-name> and <string-name><surname>Panda</surname>, <given-names>R.</given-names></string-name> (<year>2004</year>). <chapter-title>A stochastic approach to power grid analysis</chapter-title>. In <source>Proceedings of the 41st annual Design Automation Conference</source>. <series>DAC ’04</series> <fpage>171</fpage>–<lpage>176</lpage>. <publisher-name>ACM</publisher-name>, <publisher-loc>New York</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_nejsds67_ref_033">
<label>[33]</label><mixed-citation publication-type="journal"><string-name><surname>Ripley</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Venables</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Bates</surname>, <given-names>D. M.</given-names></string-name>, <string-name><surname>Hornik</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Gebhardt</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Firth</surname>, <given-names>D.</given-names></string-name> and <string-name><surname>Ripley</surname>, <given-names>M. B.</given-names></string-name> (<year>2013</year>). <article-title>Package ‘mass’</article-title>. <source>CRAN R</source> <volume>538</volume> <fpage>113</fpage>–<lpage>120</lpage>. <ext-link ext-link-type="doi" xlink:href="http://www.stats.ox.ac.uk/pub/MASS4/" xlink:type="simple">http://www.stats.ox.ac.uk/pub/MASS4/</ext-link>.</mixed-citation>
</ref>
<ref id="j_nejsds67_ref_034">
<label>[34]</label><mixed-citation publication-type="journal"><string-name><surname>Saldana</surname>, <given-names>D. F.</given-names></string-name> and <string-name><surname>Feng</surname>, <given-names>Y.</given-names></string-name> (<year>2018</year>). <article-title>SIS: An R package for sure independence screening in ultrahigh-dimensional statistical models</article-title>. <source>Journal of Statistical Software</source> <volume>83</volume> <fpage>1</fpage>–<lpage>25</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.18637/jss.v083.i02" xlink:type="simple">https://doi.org/10.18637/jss.v083.i02</ext-link>.</mixed-citation>
</ref>
<ref id="j_nejsds67_ref_035">
<label>[35]</label><mixed-citation publication-type="journal"><string-name><surname>Simon</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Friedman</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Hastie</surname>, <given-names>T.</given-names></string-name> and <string-name><surname>Tibshirani</surname>, <given-names>R.</given-names></string-name> (<year>2011</year>). <article-title>Regularization paths for Cox’s proportional hazards model via coordinate descent</article-title>. <source>Journal of Statistical Software</source> <volume>39</volume>(<issue>5</issue>) <fpage>1</fpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.18637/jss.v039.i05" xlink:type="simple">https://doi.org/10.18637/jss.v039.i05</ext-link>.</mixed-citation>
</ref>
<ref id="j_nejsds67_ref_036">
<label>[36]</label><mixed-citation publication-type="journal"><string-name><surname>Tibshirani</surname>, <given-names>R.</given-names></string-name> (<year>1996</year>). <article-title>Regression shrinkage and selection via the lasso</article-title>. <source>Journal of the Royal Statistical Society: Series B (Methodological)</source> <volume>58</volume>(<issue>1</issue>) <fpage>267</fpage>–<lpage>288</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1111/j.2517-6161.1996.tb02080.x" xlink:type="simple">https://doi.org/10.1111/j.2517-6161.1996.tb02080.x</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=1379242">MR1379242</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds67_ref_037">
<label>[37]</label><mixed-citation publication-type="journal"><string-name><surname>Tibshirani</surname>, <given-names>R.</given-names></string-name> (<year>1996</year>). <article-title>Regression Shrinkage and Selection via the Lasso</article-title>. <source>Journal of the Royal Statistical Society. Series B (Methodological)</source> <volume>58</volume>(<issue>1</issue>) <fpage>267</fpage>–<lpage>288</lpage>. <ext-link ext-link-type="doi" xlink:href="https://www.jstor.org/stable/2346178" xlink:type="simple">https://www.jstor.org/stable/2346178</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=1379242">MR1379242</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds67_ref_038">
<label>[38]</label><mixed-citation publication-type="journal"><string-name><surname>Tjur</surname>, <given-names>T.</given-names></string-name> (<year>2009</year>). <article-title>Coefficients of Determination in Logistic Regression Models—A New Proposal: The Coefficient of Discrimination</article-title>. <source>The American Statistician</source> <volume>63</volume>(<issue>4</issue>) <fpage>366</fpage>–<lpage>372</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1198/tast.2009.08210" xlink:type="simple">https://doi.org/10.1198/tast.2009.08210</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=2751755">MR2751755</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds67_ref_039">
<label>[39]</label><mixed-citation publication-type="journal"><string-name><surname>Veall</surname>, <given-names>M. R.</given-names></string-name> and <string-name><surname>Zimmermann</surname>, <given-names>K. F.</given-names></string-name> (<year>1996</year>). <article-title>Pseudo-R<sup>2</sup> Measures for Some Common Limited Dependent Variable Models</article-title>. <source>Journal of Economic Surveys</source> <volume>10</volume>(<issue>3</issue>) <fpage>241</fpage>–<lpage>259</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1111/j.1467-6419.1996.tb00013.x" xlink:type="simple">https://doi.org/10.1111/j.1467-6419.1996.tb00013.x</ext-link>.</mixed-citation>
</ref>
<ref id="j_nejsds67_ref_040">
<label>[40]</label><mixed-citation publication-type="journal"><string-name><surname>Wurm</surname>, <given-names>M. J.</given-names></string-name>, <string-name><surname>Rathouz</surname>, <given-names>P. J.</given-names></string-name> and <string-name><surname>Hanlon</surname>, <given-names>B. M.</given-names></string-name> (<year>2021</year>). <article-title>Regularized Ordinal Regression and the ordinalNet R Package</article-title>. <source>Journal of Statistical Software</source> <volume>99</volume> <fpage>1</fpage>–<lpage>42</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.18637/jss.v099.i06" xlink:type="simple">https://doi.org/10.18637/jss.v099.i06</ext-link>.</mixed-citation>
</ref>
<ref id="j_nejsds67_ref_041">
<label>[41]</label><mixed-citation publication-type="journal"><string-name><surname>Yee</surname>, <given-names>T. W.</given-names></string-name> <etal>et al.</etal> (<year>2010</year>). <article-title>The VGAM Package for Categorical Data Analysis</article-title>. <source>Journal of Statistical Software</source> <volume>32</volume>(<issue>10</issue>) <fpage>1</fpage>–<lpage>34</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.18637/jss.v032.i10" xlink:type="simple">https://doi.org/10.18637/jss.v032.i10</ext-link>.</mixed-citation>
</ref>
<ref id="j_nejsds67_ref_042">
<label>[42]</label><mixed-citation publication-type="journal"><string-name><surname>Zheng</surname>, <given-names>B.</given-names></string-name> and <string-name><surname>Agresti</surname>, <given-names>A.</given-names></string-name> (<year>2000</year>). <article-title>Summarizing the Predictive Power of a Generalized Linear Model</article-title>. <source>Statistics in Medicine</source> <volume>19</volume>(<issue>13</issue>) <fpage>1771</fpage>–<lpage>1781</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1002/1097-0258(20000715)19:13&lt;1771::AID-SIM485&gt;3.0.CO;2-P" xlink:type="simple">https://doi.org/10.1002/1097-0258(20000715)19:13&lt;1771::AID-SIM485&gt;3.0.CO;2-P</ext-link>.</mixed-citation>
</ref>
<ref id="j_nejsds67_ref_043">
<label>[43]</label><mixed-citation publication-type="other"><string-name><surname>Zhu</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Lin</surname>, <given-names>Z.</given-names></string-name> and <string-name><surname>Liu</surname>, <given-names>D.</given-names></string-name> (2024). SurrogateRsq: Goodness-of-Fit Analysis for Categorical Data using the Surrogate R-Squared. <italic>R package version 0.2.1.9000</italic>. <uri>https://xiaorui.site/SurrogateRsq/</uri>.</mixed-citation>
</ref>
<ref id="j_nejsds67_ref_044">
<label>[44]</label><mixed-citation publication-type="journal"><string-name><surname>Zhu</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Li</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Chen</surname>, <given-names>Y.</given-names></string-name> and <string-name><surname>Liu</surname>, <given-names>D.</given-names></string-name> (<year>2020</year>). <article-title>PAsso: an R Package for Assessing Partial Association between Ordinal Variables</article-title>. <source>R package Version 0.1.9</source>. <ext-link ext-link-type="doi" xlink:href="https://xiaorui.site/PAsso/" xlink:type="simple">https://xiaorui.site/PAsso/</ext-link>.</mixed-citation>
</ref>
<ref id="j_nejsds67_ref_045">
<label>[45]</label><mixed-citation publication-type="journal"><string-name><surname>Zou</surname>, <given-names>H.</given-names></string-name> (<year>2006</year>). <article-title>The Adaptive Lasso and Its Oracle Properties</article-title>. <source>Journal of the American Statistical Association</source> <volume>101</volume>(<issue>476</issue>) <fpage>1418</fpage>–<lpage>1429</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1198/016214506000000735" xlink:type="simple">https://doi.org/10.1198/016214506000000735</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=2279469">MR2279469</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds67_ref_046">
<label>[46]</label><mixed-citation publication-type="journal"><string-name><surname>Zou</surname>, <given-names>H.</given-names></string-name> and <string-name><surname>Hastie</surname>, <given-names>T.</given-names></string-name> (<year>2005</year>). <article-title>Regularization and Variable Selection via the Elastic Net</article-title>. <source>Journal of the Royal Statistical Society: Series B (Statistical Methodology)</source> <volume>67</volume>(<issue>2</issue>) <fpage>301</fpage>–<lpage>320</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1111/j.1467-9868.2005.00503.x" xlink:type="simple">https://doi.org/10.1111/j.1467-9868.2005.00503.x</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=2137327">MR2137327</ext-link></mixed-citation>
</ref>
</ref-list>
</back>
</article>
