
The authors thank the associate editor and two referees for their invaluable feedback during the review process. Their expertise and insights improved the quality of the work.

Categorical data are prevalent in almost all research fields and business applications. Their statistical analysis and inference often rely on probit/logistic regression models. For these common models, however, there is no universally adopted measure for performing goodness-of-fit analysis. To this end, [

Categorical data are prevalent in all areas, including economics, marketing, finance, psychology, and clinical studies. To analyze categorical data, the probit or logit models are often used to make inferences. To perform model assessment and comparison, researchers often rely on goodness-of-fit measures, such as

It can approximate the OLS

It has the interpretation of the explained proportion of variance.

It maintains the monotonicity property between nested models, which means that a larger model should have a larger

[

The goals of this paper are (i) developing an R package to implement [

Specifically, we first develop an R package to implement the surrogate

Second, to provide practical guidance for categorical data modeling, we use the developed R package to demonstrate how it can be used jointly with other R packages developed for variable screening/selection and model diagnostics (

Third, the comparability of the surrogate

Our

We briefly review the surrogate

The categorical response

To construct a goodness-of-fit

[

[
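A from-scratch sketch may help fix the idea behind a surrogate-based goodness-of-fit measure. It assumes an ordered probit model fitted with `MASS::polr` on simulated data: a continuous surrogate response is drawn from the fitted latent distribution, truncated to the interval of the observed category, and then regressed on the explanatory variables by OLS. The helper names `draw_surrogate` and `surrogate_r2`, the simulated variables, and the choice of 30 draws are illustrative assumptions; this is not the implementation used in the package.

```r
library(MASS)  # polr(); MASS ships with standard R installations

set.seed(2023)

# Simulated ordinal data (hypothetical; not the wine data analyzed in the paper)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
z  <- 1.0 * x1 - 0.5 * x2 + rnorm(n)   # latent continuous response
y  <- cut(z, breaks = c(-Inf, -1, 0, 1, Inf),
          labels = c("1", "2", "3", "4"), ordered_result = TRUE)
dat <- data.frame(y = y, x1 = x1, x2 = x2)

# Ordered probit fit; polr uses P(Y <= j) = pnorm(zeta_j - eta)
fit <- polr(y ~ x1 + x2, data = dat, method = "probit")

# Draw one surrogate response: normal with mean eta and sd 1,
# truncated to the cut-point interval of the observed category
draw_surrogate <- function(fit, y) {
  eta  <- fit$lp
  cuts <- c(-Inf, unname(fit$zeta), Inf)
  k    <- as.integer(y)
  p_lo <- pnorm(cuts[k] - eta)
  p_hi <- pnorm(cuts[k + 1] - eta)
  u    <- runif(length(eta), min = p_lo, max = p_hi)
  u    <- pmin(pmax(u, 1e-12), 1 - 1e-12)  # guard against qnorm(0) and qnorm(1)
  eta + qnorm(u)
}

# OLS R-squared of the surrogate on the predictors, averaged over repeated draws
surrogate_r2 <- function(fit, dat, n_draws = 30) {
  mean(replicate(n_draws, {
    s <- draw_surrogate(fit, dat$y)
    summary(lm(s ~ x1 + x2, data = dat))$r.squared
  }))
}

r2_s <- surrogate_r2(fit, dat)
print(round(r2_s, 3))
```

In this simulation the latent-scale proportion of explained variance is 1.25 / 2.25, about 0.56, so the printed value can be compared against that benchmark. Averaging over several draws reduces the Monte Carlo noise introduced by sampling the surrogate.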

To make inferences for the surrogate
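As a rough illustration of interval inference, the sketch below computes a percentile bootstrap interval for a surrogate-based R-squared. The percentile bootstrap, the number of resamples `B`, the simulated data, and the helper `surrogate_r2_once` are all assumptions made for this example; the interval procedure of the cited work is not reproduced here.

```r
library(MASS)  # polr()

set.seed(42)

# Simulated ordinal data (hypothetical)
n   <- 400
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
z   <- dat$x1 - 0.5 * dat$x2 + rnorm(n)
dat$y <- cut(z, breaks = c(-Inf, -1, 0, 1, Inf),
             labels = c("1", "2", "3", "4"), ordered_result = TRUE)

# Surrogate R-squared for one data set:
# fit the ordered probit, draw a truncated-normal surrogate, regress it on X
surrogate_r2_once <- function(d) {
  fit  <- polr(y ~ x1 + x2, data = d, method = "probit")
  cuts <- c(-Inf, unname(fit$zeta), Inf)
  k    <- as.integer(d$y)
  u    <- runif(nrow(d), pnorm(cuts[k] - fit$lp), pnorm(cuts[k + 1] - fit$lp))
  u    <- pmin(pmax(u, 1e-12), 1 - 1e-12)
  s    <- fit$lp + qnorm(u)
  summary(lm(s ~ x1 + x2, data = d))$r.squared
}

# Resample rows with replacement and recompute the measure each time
B <- 100
boot_r2 <- replicate(B, {
  idx <- sample.int(n, size = n, replace = TRUE)
  surrogate_r2_once(dat[idx, ])
})

# 95% percentile interval
ci <- quantile(boot_r2, probs = c(0.025, 0.975))
print(round(ci, 3))
```

Because the model is refitted on every resample, the cost grows linearly with `B` and with the sample size, which is why interval estimation is much slower than point estimation.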

It is also worth noting that [

We develop an R package

An illustration of a workflow for modeling categorical data. Grey boxes show statistical analysis steps that should be carried out before our goodness-of-fit analysis. Light blue boxes contain the main functions in the SurrogateRsq package. Orange boxes highlight inference outcomes produced by our SurrogateRsq package.

In empirical studies, goodness-of-fit analysis should be used jointly with other statistical tools, such as variable screening/selection and model diagnostics, in the model-building and refining process. In this section, we discuss how to follow the workflow in Figure

In

In

In

In this section, we demonstrate how to use our

Our analysis of the wine rating data follows the workflow discussed in Section

To start, we use the function

As the number of explanatory variables is small, we use the exhaustive search method to select variables.
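A minimal sketch of an exhaustive search is given here for orientation. It enumerates every non-empty subset of a small candidate set, fits an ordered probit model to each with `MASS::polr`, and ranks the fits by BIC. The simulated data, the variable names, and the use of BIC as the selection criterion are assumptions made for illustration, not a record of the settings used in the wine analysis.

```r
library(MASS)  # polr()

set.seed(7)

# Simulated ordinal data (hypothetical); x3 is pure noise by construction
n   <- 600
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
z   <- 1.2 * dat$x1 - 0.8 * dat$x2 + rnorm(n)
dat$y <- cut(z, breaks = c(-Inf, -1, 0, 1, Inf),
             labels = c("1", "2", "3", "4"), ordered_result = TRUE)

candidates <- c("x1", "x2", "x3")

# Enumerate every non-empty subset of the candidate predictors
subsets <- unlist(
  lapply(seq_along(candidates),
         function(m) combn(candidates, m, simplify = FALSE)),
  recursive = FALSE
)

# Fit one ordered probit model per subset and record its BIC
bic <- sapply(subsets, function(vars) {
  f   <- reformulate(vars, response = "y")
  fit <- polr(f, data = dat, method = "probit")
  AIC(fit, k = log(n))  # AIC with penalty log(n) is the BIC
})

results <- data.frame(model = sapply(subsets, paste, collapse = " + "),
                      BIC   = bic)
results <- results[order(results$BIC), ]
print(results, row.names = FALSE)

best <- subsets[[which.min(bic)]]
print(best)
```

With p candidate variables the loop fits 2^p - 1 models, so this approach is only practical when p is small.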

The selection results of the exhaustive search method for the red wine analysis.

Figure

We remark that if the number of explanatory variables is moderately large, we can instead use the step-wise selection method or regularization methods (e.g., with an L1, elastic net, minimax concave, or SCAD penalty). Example code is provided in the supplementary materials.

We conduct diagnostics of the model with the variables selected in the previous step. For this purpose, we use surrogate residuals [

Plots of surrogate residual versus

Among all the residual-vs-covariate plots, we find that the residual-vs-

Model development for the red wine by variable selection and model diagnostics.

Model | Naive | Selected | + sulphates^{2} | + sulphates^{3}
fixed.acidity | 0.026 | | |
 | (0.028) | | |
volatile.acidity | −1.868 | −1.722 | −1.534 | −1.491
 | (0.213) | (0.180) | (0.183) | (0.183)
citric.acid | −0.337 | | |
 | (0.256) | | |
residual.sugar | 0.011 | | |
 | (0.021) | | |
chlorides | −3.234 | −3.488 | −2.965 | −2.604
 | (0.733) | (0.699) | (0.707) | (0.715)
free.sulfur.dioxide | 0.010 | 0.011 | 0.010 | 0.010
 | (0.004) | (0.004) | (0.004) | (0.004)
total.sulfur.dioxide | −0.007 | −0.008 | −0.007 | −0.007
 | (0.001) | (0.001) | (0.001) | (0.001)
density | −6.679 | | |
 | (0.538) | | |
pH | −0.754 | −0.780 | −0.969 | −1.028
 | (0.277) | (0.205) | (0.208) | (0.209)
sulphates | 1.589 | 1.570 | 5.937 | 15.147
 | (0.195) | (0.193) | (0.678) | (2.591)
sulphates^{2} | | | −2.515 | −12.397
 | | | (0.374) | (2.707)
sulphates^{3} | | | | 3.092
 | | | | (0.839)
alcohol | 0.481 | 0.479 | 0.475 | 0.472
 | (0.032) | (0.031) | (0.031) | (0.031)

Table

In this subsection, we use our developed

First of all, we use the function

This function provides a point estimate of the surrogate

We can also use the same function

The result shows that the surrogate

The package

We apply the function

In the ranking table above, the contributions of

The output table above shows that the factor

One of the motives of [

Percentage contributions and ranks of the physicochemical variables in the analysis of the red wine and white wine samples.

 | Surrogate (red wine) | | Surrogate (white wine) |
Variable | Contribution | Ranking | Contribution | Ranking
alcohol | 25.80% | 1 | 77.16% | 1
sulphates (& higher-order terms) | 13.82% | 2 | 0.51% | 5
volatile.acidity | 7.12% | 3 | 20.39% | 2
total.sulfur.dioxide | 3.52% | 4 | |
pH | 2.78% | 5 | 0.06% | 7
chlorides | 1.21% | 6 | |
free.sulfur.dioxide | 0.96% | 7 | 1.42% | 4
residual.sugar | | | 5.34% | 3
fixed.acidity | | | 0.32% | 6

sulphates |
6.19% |
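To make the notion of a percentage contribution concrete, the sketch below ranks variables by the relative drop in a surrogate-based R-squared when each one is removed from the full model. Two choices are assumptions for illustration: the contribution is taken to be (full R-squared minus reduced R-squared) divided by the full R-squared, and a single surrogate draw from the full model is reused for every reduced model. The data and variable names are simulated and hypothetical.

```r
library(MASS)  # polr()

set.seed(11)

# Simulated ordinal data (hypothetical); x1 has the strongest effect by construction
n   <- 600
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
z   <- 1.2 * dat$x1 - 0.6 * dat$x2 + 0.2 * dat$x3 + rnorm(n)
dat$y <- cut(z, breaks = c(-Inf, -1, 0, 1, Inf),
             labels = c("1", "2", "3", "4"), ordered_result = TRUE)

vars <- c("x1", "x2", "x3")
full <- polr(y ~ x1 + x2 + x3, data = dat, method = "probit")

# One surrogate draw from the FULL model, reused for every reduced model
cuts <- c(-Inf, unname(full$zeta), Inf)
k    <- as.integer(dat$y)
u    <- runif(n, pnorm(cuts[k] - full$lp), pnorm(cuts[k + 1] - full$lp))
u    <- pmin(pmax(u, 1e-12), 1 - 1e-12)
dat$s <- full$lp + qnorm(u)

# OLS R-squared of the surrogate on a given set of predictors
r2_of <- function(v) {
  summary(lm(reformulate(v, response = "s"), data = dat))$r.squared
}

r2_full <- r2_of(vars)

# Relative reduction in R-squared when each variable is dropped in turn
contrib <- sapply(vars, function(v) (r2_full - r2_of(setdiff(vars, v))) / r2_full)

ranking <- data.frame(variable     = vars,
                      contribution = sprintf("%.2f%%", 100 * contrib),
                      rank         = rank(-contrib))
print(ranking[order(ranking$rank), ], row.names = FALSE)
```

Under this drop-one definition the contributions need not sum to 100%, since each is measured relative to the full model rather than as a share of a partition.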

By comparing the results in the two panels (red versus white wine) of Table

In this paper, we have developed the R package

We use the red wine data to examine the computational time of the functions in our package. Table

Computational time estimates of the functions in

Function: | n=1,597 | n=3,000 | n=6,000 | n=12,000
 | 0.048 | 0.067 | 0.123 | 0.213
 | 0.001 | 0.002 | 0.004 | 0.007
Function: | | | |
 | 241.95 | 403.73 | 777.29 | 1421.41
 | 149.40 | 258.97 | 528.27 | 977.23
 | 35.68 | 60.17 | 116.40 | 211.93
 | 21.54 | 35.85 | 118.11 | 144.95

If software developers want to build or modify this package for their specific scientific inquiries, they can modify one or all of the three components of our package. First, what we really need as an input for the functions in our

In this section, we provide sample code for variable selection using the step-wise selection method and the regularization method with an elastic net penalty.

The step-wise selection method starts with a null model (

We also use the function

Figure

Plots of surrogate residuals versus each of the explanatory variables for the full model after adding the squared and cubic terms of sulphates.
