<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">NEJSDS</journal-id>
<journal-title-group><journal-title>The New England Journal of Statistics in Data Science</journal-title></journal-title-group>
<issn pub-type="ppub">2693-7166</issn><issn-l>2693-7166</issn-l>
<publisher>
<publisher-name>New England Statistical Society</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">NEJSDS81</article-id>
<article-id pub-id-type="doi">10.51387/25-NEJSDS81</article-id>
<article-categories><subj-group subj-group-type="area">
<subject>NextGen</subject></subj-group><subj-group subj-group-type="heading">
<subject>Case Study, Application, and/or Practice Article</subject></subj-group></article-categories>
<title-group>
<article-title>A Study on Reproducibility and the Reliability of the Hosmer-Lemeshow Test in Published Research</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Yang</surname><given-names>Audrey</given-names></name><email xlink:href="mailto:ay2658@columbia.edu">ay2658@columbia.edu</email><xref ref-type="aff" rid="j_nejsds81_aff_001"/><xref ref-type="corresp" rid="cor1">∗</xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Yang</surname><given-names>Karen</given-names></name><email xlink:href="mailto:kareny2357@gmail.com">kareny2357@gmail.com</email><xref ref-type="aff" rid="j_nejsds81_aff_002"/>
</contrib>
<aff id="j_nejsds81_aff_001">Department of Statistics, <institution>Columbia University</institution>, <country>United States</country>. E-mail address: <email xlink:href="mailto:ay2658@columbia.edu">ay2658@columbia.edu</email></aff>
<aff id="j_nejsds81_aff_002"><institution>Wayzata High School</institution>, Plymouth, Minnesota, <country>United States</country>. E-mail address: <email xlink:href="mailto:kareny2357@gmail.com">kareny2357@gmail.com</email></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>∗</label>Corresponding author.</corresp>
</author-notes>
<pub-date pub-type="ppub"><year>2025</year></pub-date><pub-date pub-type="epub"><day>28</day><month>3</month><year>2025</year></pub-date><volume>3</volume><issue>1</issue><fpage>73</fpage><lpage>81</lpage><supplementary-material id="S1" content-type="archive" xlink:href="nejsds81_s001.zip" mimetype="application" mime-subtype="x-zip-compressed">
<caption>
<title>Supplementary Material</title>
<p>We include our meta-data dataset, described in Section <xref rid="j_nejsds81_s_004">2.2</xref> of the paper. We also include the R code used to run the regressions and tests.</p>
</caption>
</supplementary-material><history><date date-type="accepted"><day>11</day><month>2</month><year>2025</year></date></history>
<permissions><copyright-statement>© 2025 New England Statistical Society</copyright-statement><copyright-year>2025</copyright-year>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>Open access article under the <ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">CC BY</ext-link> license.</license-p></license></permissions>
<abstract>
<p>This paper discusses two elements of reproducibility in published research. First, it examines whether published results are reproducible with author-supplied data: specifically, whether the authors publish their data, whether authors respond to requests for data when data are <italic>claimed</italic> to be available upon reasonable request, and whether the data provided are usable to reproduce the authors’ results. Second, we seek to substantiate the currently mostly theoretical concerns about the Hosmer-Lemeshow goodness-of-fit test’s lack of power by investigating its usage in practice: in published research, by authors aiming to validate their models. By using the authors’ data to build larger alternative models and doing hypothesis testing to show that the smaller models—validated by Hosmer-Lemeshow—do not adequately capture information that is available in the data, we demonstrate that the Hosmer-Lemeshow goodness-of-fit test is often incapable of detecting inadequacies in models.</p>
</abstract>
<kwd-group>
<label>Keywords and phrases</label>
<kwd>Hosmer-Lemeshow test</kwd>
<kwd>Reverse p-hacking</kwd>
<kwd>Goodness-of-fit</kwd>
<kwd>Logistic regression</kwd>
<kwd>Reproducibility</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="j_nejsds81_s_001">
<label>1</label>
<title>Introduction</title>
<p>Concerns about reproducibility and replicability continue to grow across many academic disciplines, with contributing factors including “p-hacking,” improper statistical analysis, and poor practices in data management and data sharing. In fact, the problem of p-hacking has become so widespread that, in 2016, the American Statistical Association felt it necessary to release a statement on p-hacking, the proper usage of p-values, and the impact of misusing p-values on reproducibility [<xref ref-type="bibr" rid="j_nejsds81_ref_019">19</xref>], and one psychology journal went so far as to ban the use of p-values entirely [<xref ref-type="bibr" rid="j_nejsds81_ref_020">20</xref>]. Problems with data sharing and data availability have also generated concern; for example, starting in 2014, the journal <italic>PLOS ONE</italic> implemented policies to promote data sharing and to encourage researchers to make the data used in its publications publicly available [<xref ref-type="bibr" rid="j_nejsds81_ref_006">6</xref>]. Whether these policies are feasible and effective, especially given confidentiality concerns, remains a subject of discussion. Nonetheless, some positive changes have been reported. For example, a 2023 study [<xref ref-type="bibr" rid="j_nejsds81_ref_007">7</xref>] finds that, for the journal <italic>Management Science</italic>, the “Data and Code Disclosure” policy implemented in 2019 was followed by a substantial increase in the proportion of articles that could be reproduced.</p>
<p>In this paper, we conduct analyses on both of these factors—p-hacking and lack of responsible data sharing—that contribute to the “reproducibility crisis”.</p>
<p>In consideration of the former problem, p-hacking, one goal of this paper is to examine a currently under-examined facet of this issue, which we will refer to as “<italic>reverse</italic> p-hacking.” The term “reverse p-hacking” has been used previously to describe “[ensuring] that tests produce a nonsignificant result” [<xref ref-type="bibr" rid="j_nejsds81_ref_004">4</xref>], which is similar in spirit to our intended meaning. We will use the same vocabulary, but our specific aim is to investigate the use of “reverse p-hacking” in the process of model validation. We seek to examine the practice, intentional or otherwise, of using tests with low power to validate models, thus producing nonsignificant p-values that support the models on which authors base their conclusions.</p>
<p>Here, we define our specific area of focus: we analyze published papers that develop a binary logistic regression model, then use the Hosmer-Lemeshow goodness-of-fit test to validate it. This topic is of interest due to the widespread use of the Hosmer-Lemeshow test. Binary logistic regression remains one of the most popular statistical models in application, and goodness-of-fit tests are typically performed to validate the final logistic regression models. When the data are grouped, the standard chi-square and deviance tests are asymptotically valid [<xref ref-type="bibr" rid="j_nejsds81_ref_005">5</xref>]; however, when one or more predictors are continuous, these tests cannot be used. The Hosmer-Lemeshow test [<xref ref-type="bibr" rid="j_nejsds81_ref_009">9</xref>] was developed to address this issue. It became and remains a very widely used [<xref ref-type="bibr" rid="j_nejsds81_ref_012">12</xref>] goodness-of-fit test for logistic regression on ungrouped data and is widely cited (see our analysis of citation trends, with data downloaded from the online article via Taylor &amp; Francis), and it is also taught in popular textbooks (one of which was written by Hosmer &amp; Lemeshow themselves [<xref ref-type="bibr" rid="j_nejsds81_ref_010">10</xref>] along with Rodney X. Sturdivant).</p>
<p>Recently, concerns have been raised regarding the use of the Hosmer-Lemeshow test for logistic regression (e.g. [<xref ref-type="bibr" rid="j_nejsds81_ref_013">13</xref>], [<xref ref-type="bibr" rid="j_nejsds81_ref_021">21</xref>]). One major concern is the test’s lack of power in detecting inadequacy of the model being assessed, which can be especially serious when the model has missed important variables available in the data that should have been included (i.e. they significantly improve the model when included). If these theoretical concerns are practically relevant, the use of the Hosmer-Lemeshow test may lead to passive, if not intentional, “reverse p-hacking” in justification of a model by the data analyst.</p>
<p>One goal of this paper is therefore to investigate the ability, or lack thereof, of the Hosmer-Lemeshow test to detect a poorly fitting model. We conduct a practically oriented examination of the test’s applications—specifically, its actual usage in published research.</p>
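For concreteness, the grouped statistic behind the Hosmer-Lemeshow test can be sketched as follows. This is a minimal Python illustration of the usual deciles-of-risk construction, not the code used in our analysis; the function name is ours, and for simplicity the tail probability is computed in closed form, which assumes an even number of degrees of freedom (as with the common default of 10 bins).

```python
import math
import random

def hosmer_lemeshow(y, p, g=10):
    """Hosmer-Lemeshow goodness-of-fit statistic (deciles-of-risk version).

    y: 0/1 outcomes; p: fitted probabilities; g: number of bins.
    Returns (chi-square statistic, degrees of freedom, p-value).
    Assumes g - 2 is even so the chi-square upper tail has a closed form.
    """
    pairs = sorted(zip(p, y))                       # sort observations by fitted risk
    n_total = len(pairs)
    bins = [pairs[i * n_total // g:(i + 1) * n_total // g] for i in range(g)]
    stat = 0.0
    for b in bins:
        n = len(b)
        obs = sum(yi for _, yi in b)                # observed events in the bin
        exp = sum(pi for pi, _ in b)                # expected events in the bin
        # Combines the event and non-event terms: (O-E)^2 * n / (E * (n - E))
        stat += (obs - exp) ** 2 * n / (exp * (n - exp))
    df = g - 2
    k = df // 2                                     # closed-form chi-square tail, even df
    pval = math.exp(-stat / 2) * sum((stat / 2) ** i / math.factorial(i) for i in range(k))
    return stat, df, pval

# Toy check on data where the model is correctly specified:
random.seed(0)
p = [random.random() for _ in range(500)]
y = [1 if random.random() < pi else 0 for pi in p]
stat, df, pval = hosmer_lemeshow(y, p)
```

With a correctly specified model, this p-value should be nonsignificant most of the time; the concern examined in this paper is that it often stays nonsignificant even for inadequate models.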
<p>In addition to the analysis on the Hosmer-Lemeshow goodness-of-fit test, we also investigate the second factor of the reproducibility problem highlighted above: the lack of proper data sharing practices, either by not sharing data at all or by supplying data that cannot be used to reproduce the results of the paper. Collecting our own data is outside the scope of this project, so we rely on raw data being made available by authors to reproduce the results described in their papers. Since reproducing those results is a necessary first step in our analysis of the Hosmer-Lemeshow test, we also examine potential problems with data availability. We investigate this problem from two angles. First, we examine whether data are made available, particularly when the authors <italic>claim</italic> that data already are or can be made available. Second, for the data provided—whether already included in a “Supplemental Materials” section of the paper or supplied by request—we check whether they are actually usable to reproduce the results described in the paper.</p>
<p>When we say “reproducing results”, we specify two objectives: the first is to replicate the results of the regression itself, preferably exactly, by getting identical or very nearly identical odds ratios as those reported in the given paper; the second is to reproduce the conclusion of the Hosmer-Lemeshow test—simply obtaining a non-significant p-value at the threshold indicated in the paper, and therefore reaching the same conclusion as the paper’s, is sufficient for our purposes. It is after the successful replication of these two components of the paper’s results that we begin our analysis of the Hosmer-Lemeshow test itself.</p>
</sec>
<sec id="j_nejsds81_s_002">
<label>2</label>
<title>Study Design</title>
<sec id="j_nejsds81_s_003">
<label>2.1</label>
<title>Search Term Engineering</title>
<p>To examine publications in a systematic way, we designed a search term set on Google Scholar, then reviewed the papers in the order that Google Scholar displayed them. To obtain the highest proportion of relevant and usable papers in our analysis, we used a specific set of keywords as the search terms in Google Scholar.</p>
<p>Naturally, to obtain papers that use the Hosmer-Lemeshow test, we include the term “<bold>hosmer-lemeshow</bold>.” During the process of search-term engineering, we found that raw data is rarely easily accessible, but the few papers that did have data available usually included a <italic>data availability statement</italic>. We thus extracted some of the most commonly used phrases in such data availability statements, such as “<bold>data availability</bold>,” “<bold>availability of data</bold>,” and “<bold>relevant data are available</bold>,” and included them in our search term set using the “OR” keyword. Using these search terms yielded a high proportion of desired results, but a large number of them included a calibration for the model. To avoid this complexity, we added an additional term <bold>-calibration NEAR hosmer-lemeshow</bold>.</p>
<p>Thus, the final search term set used to find papers on Google Scholar was: <bold>“hosmer-lemeshow” AND “data availability” OR “availability of data” OR “relevant data are available” -calibration NEAR hosmer-lemeshow</bold>.</p>
</sec>
<sec id="j_nejsds81_s_004">
<label>2.2</label>
<title>Dataset Organization</title>
<p>We created our own dataset to keep a record of all papers we analyzed. We organize the dataset as follows: the first column is the title of the paper, the next is the year of publication, then the link Google Scholar provided, then the “availability statement status” of the paper—whether or not the paper included a data availability statement. There are 6 categories for the “statement status”: 
<list>
<list-item id="j_nejsds81_li_001">
<label>1.</label>
<p>“<bold>Not Relevant</bold>”: These papers are not usable for our analysis, e.g. Hosmer and Lemeshow’s own paper, or some kind of meta-analysis done on previous results.</p>
</list-item>
<list-item id="j_nejsds81_li_002">
<label>2.</label>
<p>“<bold>No Statement</bold>”: These papers had no data availability statement. At most, these papers reference the availability of data obtained from other sources that they needed to produce their <italic>own</italic> results, not whether they will provide their raw data to their readers.</p>
</list-item>
<list-item id="j_nejsds81_li_003">
<label>3.</label>
<p>“<bold>Contacted</bold>”: In these cases, the statement directs the reader to email the corresponding author(s). It is stated that data are available upon reasonable request.</p>
</list-item>
<list-item id="j_nejsds81_li_004">
<label>4.</label>
<p>“<bold>Claimed Available</bold>”: In these cases, the data availability statement reads, approximately, “relevant data are included in the paper or the Supporting Materials section,” but the data is actually not accessible, or the supplemental materials provided are not raw data.</p>
</list-item>
<list-item id="j_nejsds81_li_005">
<label>5.</label>
<p>“<bold>Claimed Not Available</bold>”: This means that the statement is present, but it directly informs the reader that data cannot be made available, usually due to confidentiality concerns regarding sensitive data.</p>
</list-item>
<list-item id="j_nejsds81_li_006">
<label>6.</label>
<p>“<bold>Data Available</bold>”: This means that raw data is included, usually in the “Supporting Information” section of the paper.</p>
</list-item>
</list> 
The last column indicates the paper’s replication status: whether we were ultimately able to replicate the published logistic regression (matching the odds ratios reported in the paper), or whether there were roadblocks, such as missing variables.</p>
</sec>
<sec id="j_nejsds81_s_005">
<label>2.3</label>
<title>Timeline of Research</title>
<p>Systematic review of the papers using the search term set described above began in August of 2022.</p>
<p>As of 23 August, 2023, all papers in the first 8 pages of results in Google Scholar are included in our dataset of analyzed articles, in the order in which they appeared. However, because the order of results in Google Scholar shifts over time, there are articles included at the bottom of our dataset (specifically, the last 8) that now occur on later pages of a Google Scholar search. These papers were examined at earlier dates and are no longer displayed within the first 8 pages of Google Scholar. We include them in our dataset for the sake of completeness. This gives us 88 examined papers in total.</p>
<p>We briefly note that, due to the inclusion of the three search terms about data availability, these 88 papers are for the most part recent publications, with 74% published in 2020 or later and 87.5% published in 2016 or later. The Google Scholar search results with these search terms omitted are far older.</p>
</sec>
</sec>
<sec id="j_nejsds81_s_006">
<label>3</label>
<title>Results</title>
<sec id="j_nejsds81_s_007">
<label>3.1</label>
<title>Data Availability of Examined Publications</title>
<p>Out of 88 papers, we deemed 81 relevant to our research, so we will examine the data availability claims and the final outcomes (whether data was actually obtainable and ultimately usable) of these 81 papers.</p>
<p>Even when a data availability statement was present, some statements were merely suggestions to contact the corresponding author(s), and even when data were claimed to be available, what was provided was not always raw or usable data.</p>
<p>We therefore present a detailed breakdown of the 81 papers in question and their data availability: in Figure <xref rid="j_nejsds81_fig_001">1</xref>, we visualize the set of papers in a tree, where each set of branches divides its node into subsets using some data availability status criterion. For example, the first set of branches divides the root node into two categories: one for papers that have data availability statements, and one for papers that do not. For each child node, the percentage of its parent node its category encompasses is labeled.</p>
<fig id="j_nejsds81_fig_001">
<label>Figure 1</label>
<caption>
<p>Tree of replicability status of resulting papers.</p>
</caption>
<graphic xlink:href="nejsds81_g001.jpg"/>
</fig>
<p>After omitting irrelevant papers (e.g. Hosmer and Lemeshow’s own paper) and papers with no data availability statement, we discard 24 papers and retain 64. We describe the breakdown of the data availability statement categories, i.e. claiming “data is available” or “data is not available,” or including a comment to “contact the author(s),” and the ultimate <italic>true</italic> availability of the data (i.e. whether raw, <italic>usable</italic> data can actually be obtained).</p>
<p>Out of the 64 papers with data availability statements, 43 claimed data would be made available upon reasonable request, 15 claimed data to be available in the paper or its “Supporting Information” section, and 6 claimed confidentiality concerns. <graphic xlink:href="nejsds81_g002.jpg"/></p>
<p>We further examine the former two categories: out of the 15 papers that claimed data are available, 4 papers included supplemental materials that were not raw data, 2 papers had links to data that are not functional, 4 provided data that are not usable due to missing predictors or the outcome, and 5 provided usable raw data. <graphic xlink:href="nejsds81_g003.jpg"/></p>
<p>Out of the 43 papers for which we contacted corresponding author(s), 33 yielded no response (the most recent email sent was on 23 August, 2023), 7 responses cited confidentiality concerns, some requesting certification with institutions in their home country (which we did not pursue), and some requesting specific publication details (which we were not able to provide). We encountered 1 message send error, in which the email was not delivered, possibly due to an invalid or outdated email address. In 2 cases, usable raw data was provided by email. <graphic xlink:href="nejsds81_g004.jpg"/></p>
<p>Thus, in total, we find several concerning trends in published research regarding data availability: 
<list>
<list-item id="j_nejsds81_li_007">
<label>1.</label>
<p>Lack of data availability statement: Unless the presence of a data availability statement is explicitly included in our set of search terms, published papers displayed in a Google Scholar search often do not include one.</p>
</list-item>
<list-item id="j_nejsds81_li_008">
<label>2.</label>
<p>Providing non-data: Data are claimed to be available, but the supporting materials are actually summary statistics or other materials, not the raw data used to build the regression model.</p>
</list-item>
<list-item id="j_nejsds81_li_009">
<label>3.</label>
<p>Providing incomplete data: The data provided have variables omitted without a disclaimer or other clear indication of the reason for omission. Processes were sometimes described for the derivation of variables ultimately used in the regression, but not in enough detail to allow reproduction of results.</p>
</list-item>
<list-item id="j_nejsds81_li_010">
<label>4.</label>
<p>Ignoring email requests: Although the data availability statement welcomes requests sent by email to the corresponding author(s), more than 70% of emails received no response.</p>
</list-item>
</list> 
Even though a majority of papers claim that data already are or can be made available, for very few papers were we actually able to obtain usable data. This finding offers some support for earlier research concluding that mandating these data availability statements does not make data sharing substantially more effective [<xref ref-type="bibr" rid="j_nejsds81_ref_016">16</xref>].</p>
</sec>
<sec id="j_nejsds81_s_008">
<label>3.2</label>
<title>Data Description for Papers Included in Further Analysis</title>
<p>There are 7 papers included in the ultimate analysis on the reliability of the usage of the Hosmer-Lemeshow test, and because two papers developed 2 separate logistic regression models, there are a total of 9 models analyzed.</p>
<p>The first step is to replicate the authors’ selected models as closely as possible, as a baseline for comparison. To do this, we analyze the available datasets.</p>
<sec id="j_nejsds81_s_009">
<label>3.2.1</label>
<title>Dataset Summary</title>
<p>We summarize the basic metadata of each dataset we used to replicate the models.</p>
<p>We note in Table <xref rid="j_nejsds81_tab_001">1</xref> the number of observations (rows) in the dataset, excluding null rows, the number of total variables (columns) included in the dataset by the authors, and the number of variables ultimately included in the final multivariate binary logistic regression model, including the outcome variable.</p>
<table-wrap id="j_nejsds81_tab_001">
<label>Table 1</label>
<caption>
<p>Paper Datasets Basic Metadata.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin">Paper Authors</td>
<td style="vertical-align: top; text-align: right; border-top: double; border-bottom: solid thin">Number of Observations</td>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin">Total Variables</td>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin">Variables in Model</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: center">Mithra et al. [<xref ref-type="bibr" rid="j_nejsds81_ref_014">14</xref>]</td>
<td style="vertical-align: top; text-align: right">450</td>
<td style="vertical-align: top; text-align: right">222</td>
<td style="vertical-align: top; text-align: right">7</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">Gebeyehu et al. [<xref ref-type="bibr" rid="j_nejsds81_ref_008">8</xref>]</td>
<td style="vertical-align: top; text-align: right">421</td>
<td style="vertical-align: top; text-align: right">53</td>
<td style="vertical-align: top; text-align: right">2</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">Campos et al. [<xref ref-type="bibr" rid="j_nejsds81_ref_003">3</xref>]</td>
<td style="vertical-align: top; text-align: right">198</td>
<td style="vertical-align: top; text-align: right">37</td>
<td style="vertical-align: top; text-align: right">3</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">Peterer et al. (Model 1) [<xref ref-type="bibr" rid="j_nejsds81_ref_015">15</xref>]</td>
<td style="vertical-align: top; text-align: right">311</td>
<td style="vertical-align: top; text-align: right">66</td>
<td style="vertical-align: top; text-align: right">6</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">Peterer et al. (Model 2) [<xref ref-type="bibr" rid="j_nejsds81_ref_015">15</xref>]</td>
<td style="vertical-align: top; text-align: right">311</td>
<td style="vertical-align: top; text-align: right">66</td>
<td style="vertical-align: top; text-align: right">6</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">Zhu et al. (Model 1) [<xref ref-type="bibr" rid="j_nejsds81_ref_022">22</xref>]</td>
<td style="vertical-align: top; text-align: right">76,359</td>
<td style="vertical-align: top; text-align: right">23</td>
<td style="vertical-align: top; text-align: right">3</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">Zhu et al. (Model 2) [<xref ref-type="bibr" rid="j_nejsds81_ref_022">22</xref>]</td>
<td style="vertical-align: top; text-align: right">77,018</td>
<td style="vertical-align: top; text-align: right">26</td>
<td style="vertical-align: top; text-align: right">7</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">Kibi et al. [<xref ref-type="bibr" rid="j_nejsds81_ref_011">11</xref>]</td>
<td style="vertical-align: top; text-align: right">5,313</td>
<td style="vertical-align: top; text-align: right">37</td>
<td style="vertical-align: top; text-align: right">7</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">Wang et al. [<xref ref-type="bibr" rid="j_nejsds81_ref_018">18</xref>]</td>
<td style="vertical-align: top; text-align: right; border-bottom: solid thin">115</td>
<td style="vertical-align: top; text-align: right; border-bottom: solid thin">6</td>
<td style="vertical-align: top; text-align: right; border-bottom: solid thin">5</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="j_nejsds81_s_010">
<label>3.2.2</label>
<title>Datasets Enabling Exact Replication of Expected Results</title>
<p>Five of the analyzed models, with two from the same paper, had data exactly as described. In most cases, variables not included in the final multiple regression model were also present, either because they were included in univariate analyses, because the authors deemed them otherwise important to keep in the dataset, or because they were retained for the sake of completeness. To replicate the authors’ models exactly, we first removed these extraneous variables, keeping only those specified to be in the final multivariate model.</p>
<p>For categorical variables, it was necessary to map them to binary or otherwise numeric values (one-hot encoding). When the binary logistic regression was run, the odds ratios were identical or nearly identical to the results described in the paper (when the coefficients produced by the regression were the opposites of those reported, it was sometimes necessary to negate the output coefficient—equivalently, to take the reciprocal of the odds ratio—to match the given odds ratios).</p>
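As an illustration of the sign-flip issue, the following sketch uses a hypothetical toy dataset (the category names and counts are invented, not from any of the analyzed papers). For a single one-hot-encoded binary predictor, the fitted odds ratio equals the 2×2-table odds ratio, and recoding the outcome negates the coefficient, i.e. inverts the odds ratio.

```python
import math

# Hypothetical toy records of (category, binary outcome); not real study data.
records = [("smoker", 1)] * 30 + [("smoker", 0)] * 20 + \
          [("nonsmoker", 1)] * 15 + [("nonsmoker", 0)] * 35

# One-hot encode the categorical predictor (reference level: "nonsmoker").
x = [1 if cat == "smoker" else 0 for cat, _ in records]
y = [out for _, out in records]

def odds_ratio(x, y):
    """Odds ratio from the 2x2 table; for one binary predictor this
    equals exp(beta) from the corresponding logistic regression."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    return (a * d) / (b * c)

or_orig = odds_ratio(x, y)
# Recoding the outcome (1 <-> 0) negates the fitted coefficient,
# so the odds ratio becomes its reciprocal: exp(-beta) = 1 / exp(beta).
or_flip = odds_ratio(x, [1 - yi for yi in y])
assert math.isclose(or_flip, 1 / or_orig)
```

This is why an odds ratio that is the exact reciprocal of a reported one usually signals opposite outcome (or reference-category) coding rather than a genuine discrepancy.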
</sec>
<sec id="j_nejsds81_s_011">
<label>3.2.3</label>
<title>Datasets with Anomalies in the Replication Process</title>
<p>For three of the models, there were anomalies in the data, omissions in the description of the data, or other roadblocks in replicating the papers’ reported results exactly. We describe these three datasets here.</p>
<p>For the paper by Gebeyehu et al., only a subset of the data were made publicly available. The rest of the data were not released due to a data sharing policy (as relayed to us via an email conversation). We thus proceeded with the publicly available subset of the data, which includes 421 rows, and attempted to replicate the results in the paper. The odds ratios we obtained are quite similar to those reported in the paper, and the Hosmer-Lemeshow test yields a nonsignificant p-value of nearly 1. The paper does not report an exact p-value, stating only that the significance level is 0.05 and that the p-value obtained was not significant.</p>
<p>For the paper by Kibi et al., the process of dealing with third-category responses to binary questions (such as “I don’t know whether I received a flu vaccine”) was not described. Experimenting with different methods of handling these third-category responses, we were still able to obtain extremely similar odds ratios/coefficients to those reported in the paper for all but one variable. However, we encountered another unexpected problem: when using the Hosmer-Lemeshow test, we obtain a significant p-value (testing numbers of bins ranging from 2 to 15) at the alpha = 0.05 level. The paper’s authors reported using a 0.05 significance level but did not report their exact p-value, and we did not receive a response to our email inquiring about the significant results.</p>
<p>For the paper by Peterer et al., two models are built, each with the same set of predictors but a different outcome. One of the variables used in the two final multivariate logistic regressions, “Gender”, was not available due to confidentiality. However, the authors did not find this variable to be significant in either model. We thus attempted to run logistic regressions using the other variables, omitting the missing “Gender” variable. We find that our coefficients are quite close to those reported in the paper, despite the Gender variable being excluded. Using the Hosmer-Lemeshow test, we also find a nonsignificant p-value. We thus still choose to include this paper in the next stage of our analysis.</p>
</sec>
</sec>
<sec id="j_nejsds81_s_012">
<label>3.3</label>
<title>Reproducing Binary Logistic Regressions</title>
<p>For all 9 models, we obtained odds ratio values that are reasonably close to those reported by the authors. Nearly all were comfortably within the confidence interval reported in the results (far closer to the exact value reported than the boundaries). The one exception was a single predictor in the paper by Kibi et al., which was outside the reported confidence interval.</p>
<p>The p-values we obtained from conducting the Hosmer-Lemeshow test do differ from the ones reported in the papers. This is likely due to varying implementations of the Hosmer-Lemeshow test. Additionally, the number of bins used was not specified in any of the papers, so we used a default of 10 in the following reported values. An examination of the differing p-values resulting from varying the bin count is included in later analysis in this paper. Despite the differences in p-values, the conclusions drawn, using a threshold <inline-formula id="j_nejsds81_ineq_001"><alternatives><mml:math>
<mml:mi mathvariant="italic">α</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0.05</mml:mn></mml:math><tex-math><![CDATA[$\alpha =0.05$]]></tex-math></alternatives></inline-formula>, are consistent with those of the authors (with the exception described above for Kibi et al.).</p>
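A brief sketch of this bin-count sensitivity follows. This is a simplified stand-in for our R analysis, written here in Python; for compactness it restricts to even bin counts (so the chi-square tail with df = g − 2 has a closed form) and uses simulated data rather than any paper's dataset.

```python
import math
import random

def hl_pvalue(y, p, g):
    """Hosmer-Lemeshow p-value with g bins (even g >= 4 only, so the
    chi-square upper tail with df = g - 2 has a closed form)."""
    pairs = sorted(zip(p, y))                       # sort by fitted risk
    n_total = len(pairs)
    stat = 0.0
    for i in range(g):
        b = pairs[i * n_total // g:(i + 1) * n_total // g]
        n, obs = len(b), sum(yi for _, yi in b)
        exp = sum(pi for pi, _ in b)
        stat += (obs - exp) ** 2 * n / (exp * (n - exp))
    k = (g - 2) // 2
    return math.exp(-stat / 2) * sum((stat / 2) ** i / math.factorial(i) for i in range(k))

# Simulated data where the model is correctly specified.
random.seed(1)
p = [random.random() for _ in range(1000)]
y = [1 if random.random() < pi else 0 for pi in p]

# The p-value moves as the bin count changes, even on well-specified data,
# which is why an unreported bin count hampers exact reproduction.
pvals = {g: hl_pvalue(y, p, g) for g in range(4, 16, 2)}
```

The spread of values in `pvals` illustrates why comparing a reported Hosmer-Lemeshow p-value against a reproduction requires knowing the bin count, not just the significance threshold.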
<p>We report our obtained p-values using the default bin number of 10, compared to those given in the original papers, in Table <xref rid="j_nejsds81_tab_002">2</xref>.</p>
<table-wrap id="j_nejsds81_tab_002">
<label>Table 2</label>
<caption>
<p>Comparison of p-Values.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin">Paper Title</td>
<td style="vertical-align: top; text-align: right; border-top: double; border-bottom: solid thin">Given p-Value</td>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin">Obtained p-Value</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: center">Mithra et al.</td>
<td style="vertical-align: top; text-align: right">Not Reported (&gt; 0.05)</td>
<td style="vertical-align: top; text-align: right">0.299</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">Gebeyehu et al.</td>
<td style="vertical-align: top; text-align: right">Not Reported (&gt; 0.05)</td>
<td style="vertical-align: top; text-align: right">1.0</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">Campos et al.</td>
<td style="vertical-align: top; text-align: right">0.684</td>
<td style="vertical-align: top; text-align: right">0.442</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">Peterer et al.</td>
<td style="vertical-align: top; text-align: right">0.11</td>
<td style="vertical-align: top; text-align: right">0.600</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">Peterer et al.</td>
<td style="vertical-align: top; text-align: right">0.88</td>
<td style="vertical-align: top; text-align: right">0.640</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">Zhu et al.</td>
<td style="vertical-align: top; text-align: right">0.130</td>
<td style="vertical-align: top; text-align: right">0.970</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">Zhu et al.</td>
<td style="vertical-align: top; text-align: right">0.638</td>
<td style="vertical-align: top; text-align: right">0.843</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">Kibi et al.</td>
<td style="vertical-align: top; text-align: right">Not Reported</td>
<td style="vertical-align: top; text-align: right">7.20e-10</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">Wang et al.</td>
<td style="vertical-align: top; text-align: right; border-bottom: solid thin">0.962</td>
<td style="vertical-align: top; text-align: right; border-bottom: solid thin">0.462</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="j_nejsds81_s_013">
<label>3.4</label>
<title>Hosmer-Lemeshow Sensitivity to Bin Count</title>
<p>To analyze the sensitivity of the Hosmer-Lemeshow test to the number of bins used in the implementation, we vary the bin count when conducting the test with the models we obtained. Existing concerns about the Hosmer-Lemeshow test include bin sensitivity: In a blog post by Paul Allison in <italic>Statistical Horizons</italic> [<xref ref-type="bibr" rid="j_nejsds81_ref_002">2</xref>], it was found that changing the bins from 8 to 9 changes the result from significant (<inline-formula id="j_nejsds81_ineq_002"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0.0499</mml:mn></mml:math><tex-math><![CDATA[$p=0.0499$]]></tex-math></alternatives></inline-formula>) to not significant (<inline-formula id="j_nejsds81_ineq_003"><alternatives><mml:math>
<mml:mi mathvariant="italic">p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0.11</mml:mn></mml:math><tex-math><![CDATA[$p=0.11$]]></tex-math></alternatives></inline-formula>), and further increasing the bin count to 11 increases the p-value to 0.64. To examine these concerns on our collection of datasets, we vary the bin count from 5 to 16 and examine the resultant p-values.</p>
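The sweep reduces to recomputing the test for each bin count g from 5 to 16 and recording the p-value. A self-contained sketch with simulated stand-in data (the sample size, probabilities, and names below are our own assumptions, not values from any of the studied papers):

```python
import numpy as np
from scipy.stats import chi2

def hl_pvalue(y, p, g):
    """p-value of a decile-style Hosmer-Lemeshow test with g bins."""
    order = np.argsort(p)
    stat = 0.0
    for idx in np.array_split(order, g):
        e = p[idx].sum()      # expected events in the bin
        o = y[idx].sum()      # observed events in the bin
        n = len(idx)
        stat += (o - e) ** 2 / e + ((n - o) - (n - e)) ** 2 / (n - e)
    return chi2.sf(stat, df=g - 2)

# stand-in fitted probabilities and outcomes for illustration
rng = np.random.default_rng(1)
p = rng.uniform(0.05, 0.95, 400)
y = rng.binomial(1, p)

# the sweep over bin counts 5..16 used in our sensitivity analysis
pvals = {g: hl_pvalue(y, p, g) for g in range(5, 17)}
```

The spread of the resulting twelve p-values is what we summarize above for each model.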
<p>We find that varying the bin count does not change the conclusion for any of the models. In 8 out of 9 models, the p-value is above the reported cutoff threshold of 0.05 for all 12 bin counts. For the 1 model in which we obtained a significant p-value with the default 10 bins, the p-values are significant for all 12 bin counts.</p>
<p>We do notice, however, that the p-values span a rather large range depending on the bin count. Several models have a range as wide as 0.6 (excluding the model with a significant p-value, whose p-values varied across several orders of magnitude), and for Mithra et al., the p-value dropped to 0.07, near the 0.05 threshold, with 8 bins.</p>
</sec>
<sec id="j_nejsds81_s_014">
<label>3.5</label>
<title>Assessing the Efficacy of Hosmer-Lemeshow</title>
<p>We now study the efficacy of the Hosmer-Lemeshow test. We are interested in analyzing whether the Hosmer-Lemeshow goodness-of-fit test is able to detect a poorly-fit or non-ideal model. A comparison to the full model with all available predictors, the saturated model, is a requirement of a goodness-of-fit test for binary logistic regressions [<xref ref-type="bibr" rid="j_nejsds81_ref_010">10</xref>], so if adding predictors from the dataset significantly improves the model, this suggests that the Hosmer-Lemeshow test was not able to detect that the original model is not a proper fit, i.e., that the original model was missing information that could substantially improve model performance.</p>
<p>To test this, we consider not only those predictors ultimately chosen in the authors’ final multivariate logistic regression model, but also those predictors not selected (but still present in the data, of course). We work to produce an alternative model to the authors’ reported final model, and specifically, we require that our model use a superset of the original predictors. The following describes the process of producing this alternative model: 
<list>
<list-item id="j_nejsds81_li_011">
<label>1.</label>
<p>We begin with all available data. We first omit variables with too many null values. We also omit variables where the values are too complex (for example, if the value is description based, not numerical or categorical). We then omit variables that were either used in any way to construct the outcome, or are too similar to the outcome, such as a continuous version of the binary outcome variable.</p>
</list-item>
<list-item id="j_nejsds81_li_012">
<label>2.</label>
<p>To consider predictors that were not used in the authors’ final multivariate logistic regression, it was necessary to omit rows in which there were null values in the originally unused columns. This was done to ensure consistency in the regression comparison, and particularly in the comparison of the degrees of freedom. In certain cases, this caused fluctuations in how well the original model was able to be reproduced, but changes in odds ratios were not significant.</p>
</list-item>
<list-item id="j_nejsds81_li_013">
<label>3.</label>
<p>Using this dataset containing all acceptable predictors, we first re-run the original model proposed by the authors as a baseline. Next, using all predictors, we implement a backward stepwise regression to select the “ideal” subset of them (as defined by the backward stepwise regression process).</p>
</list-item>
<list-item id="j_nejsds81_li_014">
<label>4.</label>
<p>We then compare this model ultimately chosen in the backward stepwise regression with the original model chosen by the authors. We take note of the differences in predictors selected, then run a regression with the following set of predictors: 
<list>
<list-item id="j_nejsds81_li_015">
<label>(a)</label>
<p>all variables originally chosen by the authors, and</p>
</list-item>
<list-item id="j_nejsds81_li_016">
<label>(b)</label>
<p>the variables chosen in the backward stepwise regression that were not in the original model.</p>
</list-item>
</list> 
We thus produce an alternative model, one with a superset of the original predictors.</p>
</list-item>
</list> 
Prior to discussing the comparison between the original and alternative models, we discuss the execution of the above process for each paper.</p>
<p>First, we find that for two papers, Wang et al. and Mithra et al., the remaining predictors included in the data are difficult to extract due to missingness, encoding difficulties, or other factors (for example, convergence issues when there are too many binary variables included in the model). We thus do not include them in this section of analysis.</p>
<p>The paper by Gebeyehu et al. uses the Hosmer-Lemeshow test for variable selection rather than for validating the final multivariate regression, so we analyze it separately in Section 3.5.1 and do not include it in this section of analysis.</p>
<p>We will thus focus on examining the remaining four papers using the methodology described above. For Campos et al. and for the male-patient model in Zhu et al., we identified 3 and 4 additional variables, respectively, that improve the model. For the female-patient model in Zhu et al., the backward stepwise regression produced a subset of the original variables chosen by the authors, so we do not investigate this model further.</p>
<p>In the other cases, an alternative model was produced, but with deviations from the process described above, so a more detailed description of the methodology of constructing each model is required:</p>
<list>
<list-item id="j_nejsds81_li_017">
<label>1.</label>
<p>In the paper by Kibi et al., there is a column containing the age category of the participants, in which there are 5 categories. In the final regression model described in the paper, there is one binary age variable used, where the age categories are combined to form two categories: under 18 and over 18. We find that using the original “age category” variable instead of the binary “adult” variable yields a significantly better model.</p>
<p>Another set of variables identified by the backward stepwise regression is the set of “education” variables, with 7 categories in total.</p>
<p>We therefore examine three different alternative models: one with the binary age variable changed to 5 categories, one keeping the original binary age variable and adding the “education” variables, and one doing both.</p>
</list-item>
<list-item id="j_nejsds81_li_018">
<label>2.</label>
<p>In the paper by Peterer et al., the backward stepwise regression did not yield a significantly better model. However, we consider the encoding of a variable the authors did use: the Injury Severity Score (ISS), a score ranging from 0–75. The column of raw ISS values is mapped to ISS groups, following the standard cutoff at 16: scores 9–15 are considered moderate, and scores of 16 and above are considered severe. Some literature supports an additional category, separating the severe group into 16–24 and 25 and above [<xref ref-type="bibr" rid="j_nejsds81_ref_017">17</xref>]. We experiment with using the raw ISS score as a continuous variable, and with using the ISS score with three categories instead of two. In both cases, we see an improvement in the model. We also note that when changing the binary “ISS group” variable into the continuous version, there is no loss in degrees of freedom, but when we use the alternative three-category “ISS group” variable, we lose 1 degree of freedom due to the addition of the third category. This consideration is naturally only relevant if dividing the score into a binary variable is not a rigid requirement.</p>
</list-item>
</list>
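The ISS recodings discussed above can be sketched with pandas. The raw scores below are hypothetical; the bin edges follow the cutoffs described in the text (9–15 moderate, 16 and above severe, with the severe group optionally split at 25), and the continuous alternative simply uses the raw score directly.

```python
import pandas as pd

# hypothetical raw ISS values (the dataset's actual values are not shown here)
iss = pd.Series([9, 12, 15, 16, 24, 25, 40])

# two-group encoding used by the authors: 9-15 moderate, 16+ severe
iss_binary = pd.cut(iss, bins=[8, 15, 75], labels=["moderate", "severe"])

# three-group alternative supported by some literature: 9-15, 16-24, 25+
iss_ternary = pd.cut(iss, bins=[8, 15, 24, 75], labels=["moderate", "16-24", "25+"])
```

The three-category version carries one more parameter than the binary version, which is the one-degree-of-freedom cost noted above.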
<p>We may use the Akaike information criterion (AIC) [<xref ref-type="bibr" rid="j_nejsds81_ref_001">1</xref>], a widely used model selection tool, to conduct a first comparison of the original and alternative models. As shown in Table <xref rid="j_nejsds81_tab_003">3</xref>, the AIC scores of our alternative models are uniformly lower than those of the original models.</p>
<p>Next, we conduct a more rigorous comparison of the original and alternative models by examining the residual deviance. For two nested models, where one model’s predictors are a subset of the other’s, the larger model naturally has fewer residual degrees of freedom. Under the assumption that the smaller model is correctly specified, the difference in residual deviance follows a chi-square distribution with <inline-formula id="j_nejsds81_ineq_004"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>−</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${n_{2}}-{n_{1}}$]]></tex-math></alternatives></inline-formula> degrees of freedom, where <inline-formula id="j_nejsds81_ineq_005"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${n_{2}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_nejsds81_ineq_006"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${n_{1}}$]]></tex-math></alternatives></inline-formula> are the residual degrees of freedom for the smaller and larger models, respectively.</p>
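Concretely, this comparison is a likelihood-ratio test on the residual deviances of the nested models. A minimal sketch, plugging in the Campos et al. values reported in Table 4:

```python
from scipy.stats import chi2

# residual deviances and degrees of freedom from Table 4 (Campos et al.)
dev_small, df_small = 219.80, 188   # original (smaller) model
dev_large, df_large = 202.66, 185   # alternative (larger) model

stat = dev_small - dev_large        # difference in residual deviance: 17.14
df = df_small - df_large            # difference in degrees of freedom: 3
p_value = chi2.sf(stat, df)         # ~6.6e-4, matching Table 4
```

The same computation applies to each nested pair in Table 4; it does not apply to the continuous-ISS Peterer et al. comparison, where the two models have equal degrees of freedom and are not nested.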
<table-wrap id="j_nejsds81_tab_003">
<label>Table 3</label>
<caption>
<p>AIC Scores.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin">Paper Title</td>
<td style="vertical-align: top; text-align: right; border-top: double; border-bottom: solid thin">Original Model</td>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin">Alternative Model</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: center">Campos et al.</td>
<td style="vertical-align: top; text-align: right">239.8</td>
<td style="vertical-align: top; text-align: right">228.66</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">Kibi et al. (with age)</td>
<td style="vertical-align: top; text-align: right">5520.3</td>
<td style="vertical-align: top; text-align: right">5344.6</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">Kibi et al. (with education)</td>
<td style="vertical-align: top; text-align: right">5520.3</td>
<td style="vertical-align: top; text-align: right">5385.1</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">Kibi et al. (with both)</td>
<td style="vertical-align: top; text-align: right">5520.3</td>
<td style="vertical-align: top; text-align: right">5312.7</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">Zhu et al.</td>
<td style="vertical-align: top; text-align: right">1471.1</td>
<td style="vertical-align: top; text-align: right">1461.5</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">Peterer et al. (continuous)</td>
<td style="vertical-align: top; text-align: right">266.98</td>
<td style="vertical-align: top; text-align: right">211.45</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">Peterer et al. (ternary)</td>
<td style="vertical-align: top; text-align: right; border-bottom: solid thin">266.98</td>
<td style="vertical-align: top; text-align: right; border-bottom: solid thin">253.72</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>We thus may take the difference in the residual deviance, and use a chi-square test with the appropriate degrees of freedom to determine whether the discrepancy is significant.</p>
<p>We present our results in Table <xref rid="j_nejsds81_tab_004">4</xref>, where the sub-rows give the residual deviance and degrees of freedom for the original and alternative models, as well as the differences between them, which serve as the test statistic and its degrees of freedom. The last column is the p-value obtained from the chi-square distribution.</p>
<table-wrap id="j_nejsds81_tab_004">
<label>Table 4</label>
<caption>
<p>Comparison of Original Model with Alternative.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin">Paper Title</td>
<td style="vertical-align: top; text-align: right; border-top: double; border-bottom: solid thin">Original Model</td>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin">Alternative Model</td>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin">Test Statistic</td>
<td style="vertical-align: top; text-align: center; border-top: double; border-bottom: solid thin">p-Value</td>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="2" style="vertical-align: middle; text-align: center">Campos et al.</td>
<td style="vertical-align: top; text-align: right">219.80</td>
<td style="vertical-align: top; text-align: right">202.66</td>
<td style="vertical-align: top; text-align: right">17.14</td>
<td style="vertical-align: top; text-align: right">6.614e-4</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: right">df = 188</td>
<td style="vertical-align: top; text-align: right">df = 185</td>
<td style="vertical-align: top; text-align: right">df = 3</td>
<td style="vertical-align: top; text-align: right"/>
</tr>
<tr>
<td rowspan="2" style="vertical-align: middle; text-align: center">Kibi et al. (age)</td>
<td style="vertical-align: top; text-align: right">5498.3</td>
<td style="vertical-align: top; text-align: right">5316.6</td>
<td style="vertical-align: top; text-align: right">181.7</td>
<td style="vertical-align: top; text-align: right">3.787e-39</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: right">df = 5255</td>
<td style="vertical-align: top; text-align: right">df = 5252</td>
<td style="vertical-align: top; text-align: right">df = 3</td>
<td style="vertical-align: top; text-align: right"/>
</tr>
<tr>
<td rowspan="2" style="vertical-align: middle; text-align: center">Kibi et al. (edu)</td>
<td style="vertical-align: top; text-align: right">5498.3</td>
<td style="vertical-align: top; text-align: right">5353.1</td>
<td style="vertical-align: top; text-align: right">145.2</td>
<td style="vertical-align: top; text-align: right">1.403e-29</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: right">df = 5255</td>
<td style="vertical-align: top; text-align: right">df = 5250</td>
<td style="vertical-align: top; text-align: right">df = 5</td>
<td style="vertical-align: top; text-align: right"/>
</tr>
<tr>
<td rowspan="2" style="vertical-align: middle; text-align: center">Kibi et al. (both)</td>
<td style="vertical-align: top; text-align: right">5498.3</td>
<td style="vertical-align: top; text-align: right">5274.7</td>
<td style="vertical-align: top; text-align: right">223.6</td>
<td style="vertical-align: top; text-align: right">6.680e-44</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: right">df = 5255</td>
<td style="vertical-align: top; text-align: right">df = 5247</td>
<td style="vertical-align: top; text-align: right">df = 8</td>
<td style="vertical-align: top; text-align: right"/>
</tr>
<tr>
<td rowspan="2" style="vertical-align: middle; text-align: center">Zhu et al.</td>
<td style="vertical-align: top; text-align: right">1463.1</td>
<td style="vertical-align: top; text-align: right">1445.5</td>
<td style="vertical-align: top; text-align: right">17.6</td>
<td style="vertical-align: top; text-align: right">1.477e-3</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: right">df = 76355</td>
<td style="vertical-align: top; text-align: right">df = 76351</td>
<td style="vertical-align: top; text-align: right">df = 4</td>
<td style="vertical-align: top; text-align: right"/>
</tr>
<tr>
<td rowspan="2" style="vertical-align: middle; text-align: center">Peterer et al. (cont.)</td>
<td style="vertical-align: top; text-align: right">256.98</td>
<td style="vertical-align: top; text-align: right">201.45</td>
<td style="vertical-align: top; text-align: right">55.53</td>
<td style="vertical-align: top; text-align: right">n/a</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: right">df = 205</td>
<td style="vertical-align: top; text-align: right">df = 205</td>
<td style="vertical-align: top; text-align: right">n/a</td>
<td style="vertical-align: top; text-align: right"/>
</tr>
<tr>
<td rowspan="2" style="vertical-align: middle; text-align: center; border-bottom: solid thin">Peterer et al. (ternary)</td>
<td style="vertical-align: top; text-align: right">256.98</td>
<td style="vertical-align: top; text-align: right">241.72</td>
<td style="vertical-align: top; text-align: right">15.26</td>
<td style="vertical-align: top; text-align: right">9.368e-05</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: right; border-bottom: solid thin">df = 205</td>
<td style="vertical-align: top; text-align: right; border-bottom: solid thin">df = 204</td>
<td style="vertical-align: top; text-align: right; border-bottom: solid thin">df = 1</td>
<td style="vertical-align: top; text-align: right; border-bottom: solid thin"/>
</tr>
</tbody>
</table>
</table-wrap>
<p>We thus find that the discrepancy in model performance is highly significant in every nested comparison.</p>
<sec id="j_nejsds81_s_015">
<label>3.5.1</label>
<title>A Misuse of the Hosmer-Lemeshow Test</title>
<p>During our analysis of the usage of the Hosmer-Lemeshow test, we found that it was also used incorrectly for a task unrelated to the nature of the test: namely, the use of the test for variable selection, as described in Gebeyehu et al.</p>
<p>The authors did not use the Hosmer-Lemeshow test to assess the goodness-of-fit of the final model, but rather applied it to each of the univariate regressions to test whether, for that predictor, the difference between expected and observed proportions is significant. The predictors that passed this test were then included in the subsequent multivariate analysis.</p>
<p>We express skepticism about the validity of this methodology, and conduct the following two experiments: 
<list>
<list-item id="j_nejsds81_li_019">
<label>1.</label>
<p>In this experiment, we generate <italic>y</italic> values randomly, and test whether the Hosmer-Lemeshow test allows predictors to pass the test despite having no predictive ability (the outcome being random). 
<list>
<list-item id="j_nejsds81_li_020">
<label>(a)</label>
<p>Using a Bernoulli random value generator, we generate random outcome values with probability of success being 0.9, matching the proportion of positive outcomes in the actual data.</p>
</list-item>
<list-item id="j_nejsds81_li_021">
<label>(b)</label>
<p>We then use the Hosmer-Lemeshow test on a univariate logistic regression with each independent variable, and observe the p-values.</p>
</list-item>
</list> 
Ultimately, in around 98% of runs, all predictors have non-significant p-values; in fact, nearly all are almost exactly 1. As this experiment shows, the Hosmer-Lemeshow test is not capable of identifying that a variable has no predictive value: all predictors tested passed the test, despite the outcome values being randomly generated.</p>
</list-item>
<list-item id="j_nejsds81_li_022">
<label>2.</label>
<p>We conduct another experiment in the other direction: we consider whether the Hosmer-Lemeshow test would reject a significant predictor when conducting univariate analysis.</p>
<p>In this experiment, we first randomly generate data in the following manner: 
<list>
<list-item id="j_nejsds81_li_023">
<label>(a)</label>
<p>Let variable <inline-formula id="j_nejsds81_ineq_007"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${x_{1}}$]]></tex-math></alternatives></inline-formula> be a value randomly generated from a uniform distribution with range [−5, 5].</p>
</list-item>
<list-item id="j_nejsds81_li_024">
<label>(b)</label>
<p>Let variable <inline-formula id="j_nejsds81_ineq_008"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${x_{2}}$]]></tex-math></alternatives></inline-formula> be equal to <inline-formula id="j_nejsds81_ineq_009"><alternatives><mml:math>
<mml:msup>
<mml:mrow>
<mml:mo mathvariant="normal" fence="true" stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">ϵ</mml:mi>
<mml:mo mathvariant="normal" fence="true" stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup></mml:math><tex-math><![CDATA[${({x_{1}}+\epsilon )^{2}}$]]></tex-math></alternatives></inline-formula>, where <italic>ϵ</italic> is noise randomly generated from a normal distribution with mean 0 and variance 2. Using this value for variance produces <inline-formula id="j_nejsds81_ineq_010"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${x_{1}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_nejsds81_ineq_011"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${x_{2}}$]]></tex-math></alternatives></inline-formula> vectors with a Pearson correlation coefficient of approximately 0.5.</p>
</list-item>
<list-item id="j_nejsds81_li_025">
<label>(c)</label>
<p>Let coefficients <inline-formula id="j_nejsds81_ineq_012"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\beta _{1}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_nejsds81_ineq_013"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${\beta _{2}}$]]></tex-math></alternatives></inline-formula> both be 0.5, and let the intercept be <inline-formula id="j_nejsds81_ineq_014"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">β</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn></mml:math><tex-math><![CDATA[${\beta _{0}}=0$]]></tex-math></alternatives></inline-formula>. We use these chosen values to calculate probabilities of being in the positive class for each row of data.</p>
</list-item>
<list-item id="j_nejsds81_li_026">
<label>(d)</label>
<p>Using Bernoulli random variables with the calculated probabilities for each row, we then generate values for each <italic>y</italic>.</p>
</list-item>
<list-item id="j_nejsds81_li_027">
<label>(e)</label>
<p>We now conduct two regressions: the first is a bivariate regression with both <inline-formula id="j_nejsds81_ineq_015"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${x_{1}}$]]></tex-math></alternatives></inline-formula> and <inline-formula id="j_nejsds81_ineq_016"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${x_{2}}$]]></tex-math></alternatives></inline-formula>, and the second is a univariate regression with only <inline-formula id="j_nejsds81_ineq_017"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${x_{1}}$]]></tex-math></alternatives></inline-formula>. For the bivariate regression, we check whether the coefficient for <inline-formula id="j_nejsds81_ineq_018"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${x_{1}}$]]></tex-math></alternatives></inline-formula> is significant, using a threshold of <inline-formula id="j_nejsds81_ineq_019"><alternatives><mml:math>
<mml:mi mathvariant="italic">α</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0.05</mml:mn></mml:math><tex-math><![CDATA[$\alpha =0.05$]]></tex-math></alternatives></inline-formula>.</p>
</list-item>
<list-item id="j_nejsds81_li_028">
<label>(f)</label>
<p>We then use the Hosmer-Lemeshow test on the univariate regression with only <inline-formula id="j_nejsds81_ineq_020"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${x_{1}}$]]></tex-math></alternatives></inline-formula> and check whether the test rejects <inline-formula id="j_nejsds81_ineq_021"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${x_{1}}$]]></tex-math></alternatives></inline-formula> as a fit predictor for the outcome, mimicking the univariate analysis done in Gebeyehu et al. We find that, in approximately 95% of runs, although <inline-formula id="j_nejsds81_ineq_022"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${x_{1}}$]]></tex-math></alternatives></inline-formula> is deemed to be significant in the bivariate analysis, the Hosmer-Lemeshow test actually rejects the predictor.</p>
</list-item>
</list> 
We see that in this case, the Hosmer-Lemeshow test erroneously rejects an important predictor.</p>
</list-item>
</list> 
These two experiments illustrate the fact that the Hosmer-Lemeshow test, by design, is not a proper tool for variable selection, and its usage in Gebeyehu et al. is therefore not suitable.</p>
</sec>
</sec>
</sec>
<sec id="j_nejsds81_s_016">
<label>4</label>
<title>Conclusion</title>
<p>The irreproducibility crisis is now widely recognized in the scientific community. There are a number of contributing factors, and two are addressed in this work; namely, the ability to obtain data and reproduce the same quantitative results by the same data analysis procedure, and the reliability of a statistical tool that is free from p-hacking or reverse p-hacking.</p>
<p>Our research has these two specific goals: to investigate the reproducibility of the results from published papers whose objective is to develop a logistic regression and validate their results using the Hosmer-Lemeshow test, and to investigate the adequacies of the widely-used Hosmer-Lemeshow test itself.</p>
<p>First, on the reproducibility of results in published research using the authors’ data, we found serious discrepancies between data that are claimed to be available and data that are usable to reproduce results. Out of the 88 papers initially included in our study, 64 were relevant to our goals, and of these 64 papers claiming that data are available or can be made available upon reasonable request, we were able to reproduce the results of only 7, an astonishingly small number. For most papers whose data availability statement claims that the raw data are available upon reasonable request, contacting the authors yielded no response. When data are claimed to be available in a supplemental materials section of the paper, the materials available are often not data. When raw data are indeed provided, they are often unusable for reproducing the authors’ results, whether due to missing variables or corrupted files.</p>
<p>Second, on the inadequacies of the Hosmer-Lemeshow goodness-of-fit test, we substantiated the theoretical concerns about its lack of power by demonstrating that the Hosmer-Lemeshow test failed to detect the improper models proposed by the authors in published research. For all 4 models that were ultimately tested, we were able to build a significantly better model, verified by chi-square tests on the difference in residual deviance between our proposed model and the authors’ original model. Given the prevalence of the Hosmer-Lemeshow goodness-of-fit test across many fields of research, its continued usage is indeed a substantial problem, improperly certifying misspecified models and the conclusions drawn from them.</p>
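The model-comparison step described above, a chi-square test on the difference in residual deviance between nested logistic models, can be sketched as follows. The data, coefficients, and variable names are simulated assumptions for illustration, not taken from any of the reviewed papers.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

rng = np.random.default_rng(0)
n = 2000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Simulated truth: the outcome depends on both predictors
eta_true = 0.5 + 1.0 * x1 + 1.0 * x2
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta_true)))

def max_loglik(X, y):
    """Maximised log-likelihood of a logistic regression with intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    nll = lambda b: np.logaddexp(0.0, X @ b).sum() - y @ (X @ b)
    return -minimize(nll, np.zeros(X.shape[1]), method="BFGS").fun

ll_reduced = max_loglik(x1[:, None], y)             # omits x2
ll_full = max_loglik(np.column_stack([x1, x2]), y)  # uses both predictors

# The drop in residual deviance between nested models is
# 2 * (ll_full - ll_reduced), asymptotically chi-square with degrees of
# freedom equal to the number of added parameters (here 1).
deviance_drop = 2.0 * (ll_full - ll_reduced)
p_value = chi2.sf(deviance_drop, df=1)
print(deviance_drop, p_value)
```

With a strong omitted predictor, the deviance drop is large and the test rejects the reduced model decisively, which is exactly the kind of misspecification the Hosmer-Lemeshow test failed to flag.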
</sec>
</body>
<back>
<ref-list id="j_nejsds81_reflist_001">
<title>References</title>
<ref id="j_nejsds81_ref_001">
<label>[1]</label><mixed-citation publication-type="chapter"><string-name><surname>Akaike</surname>, <given-names>H.</given-names></string-name> (<year>1973</year>). <chapter-title>Information theory as an extension of the maximum likelihood principle</chapter-title>. In <string-name><surname>Petrov</surname> <given-names>BN</given-names></string-name> and <string-name><surname>Csaki</surname> <given-names>F</given-names></string-name>. <source>Second International Symposium on Information Theory</source>. <publisher-name>Akademiai Kiado</publisher-name>, <publisher-loc>Budapest</publisher-loc>, pp. <fpage>276</fpage>–<lpage>281</lpage>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=0483125">MR0483125</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds81_ref_002">
<label>[2]</label><mixed-citation publication-type="other"><string-name><surname>Allison</surname>, <given-names>P.</given-names></string-name> (<year>2013</year>). <italic>Why I Don’t Trust the Hosmer-Lemeshow Test for Logistic Regression</italic>.</mixed-citation>
</ref>
<ref id="j_nejsds81_ref_003">
<label>[3]</label><mixed-citation publication-type="journal"><string-name><surname>Campos</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Rocha</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Willers</surname>, <given-names>D.</given-names></string-name> and <string-name><surname>Silva</surname>, <given-names>D.</given-names></string-name> (<year>2016</year>). <article-title>Characteristics of Patients with Smear-Negative Pulmonary Tuberculosis (TB) in a Region with High TB and HIV Prevalence</article-title>. <source>PLoS ONE</source> <volume>11</volume>(<issue>1</issue>).</mixed-citation>
</ref>
<ref id="j_nejsds81_ref_004">
<label>[4]</label><mixed-citation publication-type="journal"><string-name><surname>Chuard</surname>, <given-names>P. J. C.</given-names></string-name>, <string-name><surname>Vrtílek</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Head</surname>, <given-names>M. L.</given-names></string-name> and <string-name><surname>Jennions</surname>, <given-names>M. D.</given-names></string-name> (<year>2019</year>). <article-title>Evidence that nonsignificant results are sometimes preferred: Reverse P-hacking or selective reporting?</article-title> <source>PLoS Biol</source> <volume>17</volume>(<issue>1</issue>).</mixed-citation>
</ref>
<ref id="j_nejsds81_ref_005">
<label>[5]</label><mixed-citation publication-type="book"><string-name><surname>Faraway</surname>, <given-names>J. J.</given-names></string-name> (<year>2004</year>). <source>Extending the Linear Model with R</source>. <publisher-name>Chapman and Hall/CRC</publisher-name>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=2192856">MR2192856</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds81_ref_006">
<label>[6]</label><mixed-citation publication-type="journal"><string-name><surname>Federer</surname>, <given-names>L. M.</given-names></string-name>, <string-name><surname>Belter</surname>, <given-names>C. W.</given-names></string-name>, <string-name><surname>Joubert</surname>, <given-names>D. J.</given-names></string-name>, <string-name><surname>Livinski</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Lu</surname>, <given-names>Y.-L.</given-names></string-name>, <string-name><surname>Snyders</surname>, <given-names>L. N.</given-names></string-name>, <etal>et al.</etal> (<year>2018</year>). <article-title>Data sharing in PLOS ONE: An analysis of Data Availability Statements</article-title>. <source>PLoS ONE</source> <volume>13</volume>(<issue>5</issue>).</mixed-citation>
</ref>
<ref id="j_nejsds81_ref_007">
<label>[7]</label><mixed-citation publication-type="journal"><string-name><surname>Fišar</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Greiner</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Huber</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Katok</surname>, <given-names>E.</given-names></string-name> and <string-name><surname>Ozkes</surname>, <given-names>A. I.</given-names></string-name> (<year>2023</year>). <article-title>Reproducibility in Management Science</article-title>. <source>Management Science</source> <volume>70</volume>(<issue>3</issue>) <fpage>1343</fpage>–<lpage>1356</lpage>.</mixed-citation>
</ref>
<ref id="j_nejsds81_ref_008">
<label>[8]</label><mixed-citation publication-type="journal"><string-name><surname>Gebeyehu</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Nigatu</surname>, <given-names>D.</given-names></string-name> and <string-name><surname>Engidawork</surname>, <given-names>E.</given-names></string-name> (<year>2019</year>). <article-title>Helicobacter pylori eradication rate of standard triple therapy and factors affecting eradication rate at Bahir Dar city administration, Northwest Ethiopia: A prospective follow up study</article-title>. <source>PLoS ONE</source> <volume>14</volume>(<issue>6</issue>). <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1371/journal.pone.0217645" xlink:type="simple">https://doi.org/10.1371/journal.pone.0217645</ext-link>.</mixed-citation>
</ref>
<ref id="j_nejsds81_ref_009">
<label>[9]</label><mixed-citation publication-type="journal"><string-name><surname>Hosmer</surname>, <given-names>D. W.</given-names></string-name> and <string-name><surname>Lemesbow</surname>, <given-names>S.</given-names></string-name> (<year>1980</year>). <article-title>Goodness of fit tests for the multiple logistic regression model</article-title>. <source>Communications in Statistics – Theory and Methods</source> <volume>9</volume>(<issue>10</issue>) <fpage>1043</fpage>–<lpage>1069</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1080/03610928008827941" xlink:type="simple">https://doi.org/10.1080/03610928008827941</ext-link>. <uri>https://www.tandfonline.com/doi/pdf/10.1080/03610928008827941</uri>.</mixed-citation>
</ref>
<ref id="j_nejsds81_ref_010">
<label>[10]</label><mixed-citation publication-type="book"><string-name><surname>Hosmer</surname>, <given-names>D. W.</given-names></string-name>, <string-name><surname>Lemeshow</surname>, <given-names>S.</given-names></string-name> and <string-name><surname>Sturdivant</surname>, <given-names>R. X.</given-names></string-name> (<year>2013</year>). <source>Applied Logistic Regression</source>. <publisher-name>John Wiley &amp; Sons, Inc.</publisher-name> <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1002/9781118596333.ch21" xlink:type="simple">https://doi.org/10.1002/9781118596333.ch21</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=3287463">MR3287463</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds81_ref_011">
<label>[11]</label><mixed-citation publication-type="journal"><string-name><surname>Kibi</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Shaholli</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Barletta</surname>, <given-names>V. I.</given-names></string-name>, <string-name><surname>Vezza</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Gelardini</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Ardizzone</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Grassucci</surname>, <given-names>D.</given-names></string-name> and <string-name><surname>La Torre</surname>, <given-names>G.</given-names></string-name> (<year>2023</year>). <article-title>Knowledge, Attitude, and Behavior toward COVID-19 Vaccination in Young Italians</article-title>. <source>Vaccines</source> <volume>11</volume>(<issue>1</issue>). <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.3390/vaccines11010183" xlink:type="simple">https://doi.org/10.3390/vaccines11010183</ext-link>.</mixed-citation>
</ref>
<ref id="j_nejsds81_ref_012">
<label>[12]</label><mixed-citation publication-type="journal"><string-name><surname>Lai</surname>, <given-names>X.</given-names></string-name> and <string-name><surname>Liu</surname>, <given-names>L.</given-names></string-name> (<year>2018</year>). <article-title>A simple test procedure in standardizing the power of Hosmer–Lemeshow test in large data sets</article-title>. <source>Journal of Statistical Computation and Simulation</source> <volume>88</volume>(<issue>13</issue>) <fpage>2463</fpage>–<lpage>2472</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1080/00949655.2018.1467912" xlink:type="simple">https://doi.org/10.1080/00949655.2018.1467912</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=3818450">MR3818450</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds81_ref_013">
<label>[13]</label><mixed-citation publication-type="journal"><string-name><surname>Lu</surname>, <given-names>C.</given-names></string-name> and <string-name><surname>Yang</surname>, <given-names>Y.</given-names></string-name> (<year>2018</year>). <article-title>On assessing binary regression models based on ungrouped data</article-title>. <source>Biometrics</source> <volume>75</volume>(<issue>1</issue>) <fpage>5</fpage>–<lpage>12</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1111/biom.12969" xlink:type="simple">https://doi.org/10.1111/biom.12969</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=3953702">MR3953702</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds81_ref_014">
<label>[14]</label><mixed-citation publication-type="journal"><string-name><surname>Mithra</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Unnikrishnan</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>T</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Kumar</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Holla</surname>, <given-names>R.</given-names></string-name> and <string-name><surname>Rathi</surname>, <given-names>P.</given-names></string-name> (<year>2021</year>). <article-title>Paternal Involvement in and Sociodemographic Correlates of Infant and Young Child Feeding in a District in Coastal South India: A Cross-Sectional Study</article-title>. <source>Frontiers in Public Health</source> <volume>9</volume>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.3389/fpubh.2021.661058" xlink:type="simple">https://doi.org/10.3389/fpubh.2021.661058</ext-link>.</mixed-citation>
</ref>
<ref id="j_nejsds81_ref_015">
<label>[15]</label><mixed-citation publication-type="journal"><string-name><surname>Peterer</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Ossendorf</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Jensen</surname>, <given-names>K. O.</given-names></string-name>, <etal>et al.</etal> (<year>2019</year>). <article-title>Implementation of new standard operating procedures for geriatric trauma patients with multiple injuries: a single level I trauma centre study</article-title>. <source>BMC Geriatr</source> <volume>19</volume>(<issue>359</issue>).</mixed-citation>
</ref>
<ref id="j_nejsds81_ref_016">
<label>[16]</label><mixed-citation publication-type="journal"><string-name><surname>Tedersoo</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Küngas</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Oras</surname>, <given-names>E.</given-names></string-name>, <etal>et al.</etal> (<year>2021</year>). <article-title>Data sharing practices and data availability upon request differ across scientific disciplines</article-title>. <source>Sci Data</source> <volume>8</volume>(<issue>192</issue>).</mixed-citation>
</ref>
<ref id="j_nejsds81_ref_017">
<label>[17]</label><mixed-citation publication-type="chapter"><string-name><surname>VanDerHeyden</surname>, <given-names>N.</given-names></string-name> and <string-name><surname>Cox</surname>, <given-names>T. B.</given-names></string-name> (<year>2008</year>). <chapter-title>Chapter 6 – Trauma Scoring</chapter-title>. In <string-name><given-names>J. A.</given-names> <surname>Asensio</surname></string-name> and <string-name><given-names>D. D.</given-names> <surname>Trunkey</surname></string-name>, eds. <source>Current Therapy of Trauma and Surgical Critical Care</source> <fpage>26</fpage>–<lpage>32</lpage> <publisher-name>Mosby</publisher-name>, <publisher-loc>Philadelphia</publisher-loc>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/B978-0-323-04418-9.50010-2" xlink:type="simple">https://doi.org/10.1016/B978-0-323-04418-9.50010-2</ext-link>. <uri>https://www.sciencedirect.com/science/article/pii/B9780323044189500102</uri>.</mixed-citation>
</ref>
<ref id="j_nejsds81_ref_018">
<label>[18]</label><mixed-citation publication-type="journal"><string-name><surname>Wang</surname>, <given-names>J.-L.</given-names></string-name>, <string-name><surname>Han</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Yang</surname>, <given-names>F.-L.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>M.-S.</given-names></string-name> and <string-name><surname>He</surname>, <given-names>Y.</given-names></string-name> (<year>2021</year>). <article-title>Normal cerebrospinal fluid protein and associated clinical characteristics in children with tuberculous meningitis</article-title>. <source>Annals of Medicine</source> <volume>53</volume>(<issue>1</issue>) <fpage>885</fpage>–<lpage>889</lpage>. <comment>PMID: 34124971</comment>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1080/07853890.2021.1937692" xlink:type="simple">https://doi.org/10.1080/07853890.2021.1937692</ext-link>.</mixed-citation>
</ref>
<ref id="j_nejsds81_ref_019">
<label>[19]</label><mixed-citation publication-type="journal"><string-name><surname>Wasserstein</surname>, <given-names>R. L.</given-names></string-name> and <string-name><surname>Lazar</surname>, <given-names>N. A.</given-names></string-name> (<year>2016</year>). <article-title>The ASA Statement on p-Values: Context, Process, and Purpose</article-title>. <source>The American Statistician</source> <volume>70</volume>(<issue>2</issue>) <fpage>129</fpage>–<lpage>133</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1080/00031305.2016.1154108" xlink:type="simple">https://doi.org/10.1080/00031305.2016.1154108</ext-link>.</mixed-citation>
</ref>
<ref id="j_nejsds81_ref_020">
<label>[20]</label><mixed-citation publication-type="journal"><string-name><surname>Woolston</surname>, <given-names>C.</given-names></string-name> (<year>2015</year>). <article-title>Psychology journal bans P values</article-title>. <source>Nature</source> <volume>519</volume>(<issue>9</issue>). <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1038/519009f" xlink:type="simple">https://doi.org/10.1038/519009f</ext-link>.</mixed-citation>
</ref>
<ref id="j_nejsds81_ref_021">
<label>[21]</label><mixed-citation publication-type="journal"><string-name><surname>Zhang</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Ding</surname>, <given-names>J.</given-names></string-name> and <string-name><surname>Yang</surname>, <given-names>Y.</given-names></string-name> (<year>2021</year>). <article-title>Is a Classification Procedure Good Enough?—A Goodness-of-Fit Assessment Tool for Classification Learning</article-title>. <source>Journal of the American Statistical Association</source> <volume>118</volume>(<issue>542</issue>) <fpage>1115</fpage>–<lpage>1125</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1080/01621459.2021.1979010" xlink:type="simple">https://doi.org/10.1080/01621459.2021.1979010</ext-link>. <ext-link ext-link-type="uri" xlink:href="https://mathscinet.ams.org/mathscinet-getitem?mr=4595481">MR4595481</ext-link></mixed-citation>
</ref>
<ref id="j_nejsds81_ref_022">
<label>[22]</label><mixed-citation publication-type="journal"><string-name><surname>Zhu</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Liu</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Chen</surname>, <given-names>W.</given-names></string-name>, <string-name><surname>Liu</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Lv</surname>, <given-names>H.</given-names></string-name>, <etal>et al.</etal> (<year>2019</year>). <article-title>Epidemiology of low-energy lower extremity fracture in Chinese populations aged 50 years and above</article-title>. <source>PLoS ONE</source> <volume>14</volume>(<issue>1</issue>).</mixed-citation>
</ref>
</ref-list>
</back>
</article>
