Pub. online: 17 June 2024
Type: Software
Open Access
The authors thank the associate editor and two referees for their invaluable feedback during the review process. Their expertise and insights improved the quality of the work.
Accepted: 9 May 2024
Published: 17 June 2024
Abstract
Categorical data are prevalent in almost all research fields and business applications. Their statistical analysis and inference often rely on probit/logistic regression models. For these common models, however, there is no universally adopted goodness-of-fit measure. To this end, [26] proposed a so-called surrogate ${R^{2}}$ that resembles the ordinary least squares (OLS) ${R^{2}}$ for linear regression models. The surrogate ${R^{2}}$ uses the notion of surrogacy, namely, generating a continuous response S and using it as a surrogate for the original categorical response Y [24, 25, 8]. In this paper, we develop an R package $\mathbf{SurrogateRsq}$ to implement the surrogate ${R^{2}}$ method [43]. The package is compatible with existing model-fitting functions (e.g., glm(), polr(), clm(), and vglm()), and its features are demonstrated in a wine rating analysis. Our package can be used jointly with other R packages developed for variable selection and model diagnostics to form a complete model development process. This process is summarized and demonstrated in a categorical-data-modeling workflow that practitioners can follow. To illustrate a further use of surrogate-${R^{2}}$-based goodness-of-fit analysis, we also use the package to compare empirical models trained on different samples in the wine rating analysis. The results suggest that the package allows us to evaluate comparability across multiple samples/models/studies that address the same or similar scientific or business questions.
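The surrogate idea described above can be illustrated with a small simulation. The following is a minimal sketch, not the SurrogateRsq implementation: it assumes an ordinal probit model with a single predictor, and it uses the true simulation parameters (`beta`, `cutpoints`) as a stand-in for the estimates a fitted model would supply. All function names are hypothetical and chosen for illustration only. For each observation, a continuous surrogate S is drawn from the latent normal distribution truncated to the interval implied by the observed category Y, and an OLS ${R^{2}}$ is then computed from regressing S on the predictor.

```python
import random
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()  # standard normal, provides cdf() and inv_cdf()


def simulate_ordinal_probit(n, beta, cutpoints, rng):
    """Simulate (x, y) pairs; y counts how many cutpoints the latent
    variable z = beta * x + eps exceeds, so y is in 0..len(cutpoints)."""
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        z = beta * x + rng.gauss(0.0, 1.0)
        xs.append(x)
        ys.append(sum(z > c for c in cutpoints))
    return xs, ys


def draw_surrogate(x, y, beta, cutpoints, rng):
    """Draw S from N(beta * x, 1) truncated to the interval for category y,
    using inverse-CDF sampling."""
    mu = beta * x
    p_lo = STD_NORMAL.cdf(cutpoints[y - 1] - mu) if y > 0 else 0.0
    p_hi = STD_NORMAL.cdf(cutpoints[y] - mu) if y < len(cutpoints) else 1.0
    u = rng.uniform(p_lo, p_hi)
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # inv_cdf needs 0 < u < 1
    return mu + STD_NORMAL.inv_cdf(u)


def ols_r_squared(xs, ss):
    """R-squared of the simple OLS regression of ss on xs
    (the squared sample correlation)."""
    mx, ms = mean(xs), mean(ss)
    sxy = sum((x - mx) * (s - ms) for x, s in zip(xs, ss))
    sxx = sum((x - mx) ** 2 for x in xs)
    sss = sum((s - ms) ** 2 for s in ss)
    return sxy * sxy / (sxx * sss)


# Assumed illustrative values; a real analysis would use fitted estimates.
beta = 1.0
cutpoints = [-1.0, 0.0, 1.0]
rng = random.Random(2024)

xs, ys = simulate_ordinal_probit(2000, beta, cutpoints, rng)
surrogates = [draw_surrogate(x, y, beta, cutpoints, rng) for x, y in zip(xs, ys)]

# With beta = 1 and unit error variance, the latent-scale R-squared is
# beta^2 / (beta^2 + 1) = 0.5, so the printed value should land near 0.5.
print(round(ols_r_squared(xs, surrogates), 3))
```

Because S is a random draw, repeated runs with different seeds give slightly different values. In practice one would fit the model first (for example with glm() or polr() in R) and plug the estimated coefficients and thresholds into the surrogate-generation step.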
References
Analytics, R. and Weston, S. (2015). foreach: Provides foreach looping construct for R. R package version 1.4.3. https://CRAN.R-project.org/package=foreach.
Anderson, D. and Kurtz, T. Continuous time Markov chain models for chemical reaction networks. http://www.math.wisc.edu/~kurtz/papers/AndKurJuly10.pdf. Accessed 27 July 2010.
Blanchet, J., Leder, K. and Glynn, P. (2009). Efficient Simulation of Light-Tailed Sums: an Old-Folk Song Sung to a Faster New Tune... In Monte Carlo and Quasi-Monte Carlo Methods (P. L’Ecuyer and A. B. Owen, eds.) Springer, Berlin. https://doi.org/10.1007/978-3-642-04107-5_13. MR2743897
Blanchet, J., Leder, K. and Shi, Y. (2011). Analysis of a splitting estimator for rare event probabilities in Jackson networks. Stochastic Systems 1 306–339. https://doi.org/10.1214/11-SSY026. MR2949543
Breheny, P. (2013). ncvreg: Regularization paths for SCAD- and MCP-penalized regression models. R package version 2.6-0. https://pbreheny.github.io/ncvreg/.
Breheny, P. and Breheny, M. P. (2014). Package ‘grpreg’. https://pbreheny.github.io/grpreg/.
Cheng, C., Wang, R. and Zhang, H. (2021). Surrogate Residuals for Discrete Choice Models. Journal of Computational and Graphical Statistics 30(1) 67–77. https://doi.org/10.1080/10618600.2020.1775618.
Christensen, R. H. B. (2019). ordinal—Regression Models for Ordinal Data. R package version 2019.12-10. https://CRAN.R-project.org/package=ordinal. http://www2.uaem.mx/r-mirror/web/packages/ordinal/.
Cortez, P., Cerdeira, A., Almeida, F., Matos, T. and Reis, J. (2009). Modeling Wine Preferences by Data Mining from Physicochemical Properties. Decision Support Systems 47(4) 547–553. https://doi.org/10.1016/j.dss.2009.05.016.
Cox, D. R. and Wermuth, N. (1992). A Comment on the Coefficient of Determination for Binary Responses. The American Statistician 46(1) 1–4. https://doi.org/10.1080/00031305.1992.10475836.
Cox, D. and Snell, E. (1989). Analysis of Binary Data 32. https://doi.org/10.1201/9781315137391.
Cragg, J. G. and Uhler, R. S. (1970). The Demand for Automobiles. The Canadian Journal of Economics 3(3) 386–406. https://doi.org/10.2307/133656.
Efron, B. (1978). Regression and ANOVA with Zero-one Data: Measures of Residual Variation. Journal of the American Statistical Association 73(361) 113–121. https://doi.org/10.1080/01621459.1978.10480013. MR0501624
Efron, B. and Tibshirani, R. J. An Introduction to the Bootstrap. Springer US. http://link.springer.com/10.1007/978-1-4899-4541-9.
Fan, J. and Lv, J. (2008). Sure Independence Screening for Ultrahigh Dimensional Feature Space. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70(5) 849–911. https://doi.org/10.1111/j.1467-9868.2008.00674.x. MR2530322
Friedman, J., Hastie, T. and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33(1) 1. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2929880/.
Greenwell, B. M., McCarthy, A. J., Boehmke, B. C. and Liu, D. (2018). Residuals and Diagnostics for Binary and Ordinal Regression Models: An Introduction to the sure Package. The R Journal 10(1) 381–394. https://doi.org/10.32614/RJ-2018-004.
Hagle, T. M. and Mitchell, G. E. (1992). Goodness-of-Fit Measures for Probit and Logit. American Journal of Political Science 36(3) 762–784. https://doi.org/10.2307/2111590.
Harrell Jr, F. E. (2019). rms: Regression Modeling Strategies. R package version 5.1-4. https://CRAN.R-project.org/package=rms.
Hu, B., Shao, J. and Palta, M. (2006). Pseudo-R${^{2}}$ in Logistic Regression Model. Statistica Sinica 16(3) 847–860. https://www.jstor.org/stable/24307577.
Laitila, T. (1993). A Pseudo-R${^{2}}$ Measure for Limited and Qualitative Dependent Variable Models. Journal of Econometrics 56(3) 341–356. https://doi.org/10.1016/0304-4076(93)90125-O. MR1219168
Li, S., Zhu, X., Chen, Y. and Liu, D. (2021). PAsso: an R Package for Assessing Partial Association between Ordinal Variables. The R Journal 13(2) 135. https://doi.org/10.32614/RJ-2021-088.
Liu, D. and Zhang, H. (2018). Residuals and Diagnostics for Ordinal Regression Models: A Surrogate Approach. Journal of the American Statistical Association 113(522) 845–854. https://doi.org/10.1080/01621459.2017.1292915. MR3832231
Liu, D., Li, S., Yu, Y. and Moustaki, I. (2021). Assessing Partial Association Between Ordinal Variables: Quantification, Visualization, and Hypothesis Testing. Journal of the American Statistical Association 116(534) 955–968. https://doi.org/10.1080/01621459.2020.1796394. MR4270036
Liu, D., Zhu, X., Greenwell, B. and Lin, Z. (2023). A new goodness-of-fit measure for probit models: Surrogate R${^{2}}$. British Journal of Mathematical and Statistical Psychology 76(1) 192–210. https://doi.org/10.1111/bmsp.12289.
Liu, I. and Agresti, A. (2005). The Analysis of Ordered Categorical Data: An Overview and a Survey of Recent Developments (with discussion). Test 14(1) 1–73. https://doi.org/10.1007/BF02595397.
Lumley, T. and Lumley, M. T. (2013). Package ‘leaps’. Regression subset selection. Thomas Lumley, based on Fortran code by Alan Miller. http://CRAN.R-project.org/package=leaps. Accessed 18 March 2018.
McFadden, D. (1973). Conditional Logit Analysis of Qualitative Choice Behavior. In Frontiers in Econometrics (P. Zarembka, ed.) 105–142. https://eml.berkeley.edu/reprints/mcfadden/zarembka.pdf.
McKelvey, R. D. and Zavoina, W. (1975). A Statistical Model for the Analysis of Ordinal Level Dependent Variables. Journal of Mathematical Sociology 4(1) 103–120. https://doi.org/10.1080/0022250X.1975.9989847. MR0400610
Nagelkerke, N. J. (1991). A Note on a General Definition of the Coefficient of Determination. Biometrika 78(3) 691–692. https://doi.org/10.1093/biomet/78.3.691. MR1130937
Ripley, B., Venables, B., Bates, D. M., Hornik, K., Gebhardt, A., Firth, D. and Ripley, M. B. (2013). Package ‘mass’. CRAN R 538 113–120. http://www.stats.ox.ac.uk/pub/MASS4/.
Saldana, D. F. and Feng, Y. (2018). SIS: An R package for sure independence screening in ultrahigh-dimensional statistical models. Journal of Statistical Software 83 1–25. https://doi.org/10.18637/jss.v083.i02.
Simon, N., Friedman, J., Hastie, T. and Tibshirani, R. (2011). Regularization paths for Cox’s proportional hazards model via coordinate descent. Journal of Statistical Software 39(5) 1. https://doi.org/10.18637/jss.v039.i05.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58(1) 267–288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x. MR1379242
Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological) 58(1) 267–288. https://www.jstor.org/stable/2346178. MR1379242
Tjur, T. (2009). Coefficients of Determination in Logistic Regression Models—A New Proposal: The Coefficient of Discrimination. The American Statistician 63(4) 366–372. https://doi.org/10.1198/tast.2009.08210. MR2751755
Veall, M. R. and Zimmermann, K. F. (1996). Pseudo-R${^{2}}$ Measures for Some Common Limited Dependent Variable Models. Journal of Economic Surveys 10(3) 241–259. https://doi.org/10.1111/j.1467-6419.1996.tb00013.x.
Wurm, M. J., Rathouz, P. J. and Hanlon, B. M. (2021). Regularized Ordinal Regression and the ordinalNet R Package. Journal of Statistical Software 99 1–42. https://doi.org/10.18637/jss.v099.i06.
Yee, T. W. et al. (2010). The VGAM Package for Categorical Data Analysis. Journal of Statistical Software 32(10) 1–34. https://doi.org/10.18637/jss.v032.i10.
Zheng, B. and Agresti, A. (2000). Summarizing the Predictive Power of a Generalized Linear Model. Statistics in Medicine 19(13) 1771–1781. https://doi.org/10.1002/1097-0258(20000715)19:13<1771::AID-SIM485>3.0.CO;2-P.
Zhu, X., Lin, Z. and Liu, D. (2024). SurrogateRsq: Goodness-of-Fit Analysis for Categorical Data using the Surrogate R-Squared. R package version 0.2.1.9000. https://xiaorui.site/SurrogateRsq/.
Zhu, X., Li, S., Chen, Y. and Liu, D. (2020). PAsso: an R Package for Assessing Partial Association between Ordinal Variables. R package version 0.1.9. https://xiaorui.site/PAsso/.
Zou, H. (2006). The Adaptive Lasso and Its Oracle Properties. Journal of the American Statistical Association 101(476) 1418–1429. https://doi.org/10.1198/016214506000000735. MR2279469
Zou, H. and Hastie, T. (2005). Regularization and Variable Selection via the Elastic Net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67(2) 301–320. https://doi.org/10.1111/j.1467-9868.2005.00503.x. MR2137327