The New England Journal of Statistics in Data Science


SurrogateRsq: An R Package for Categorical Data Goodness-of-Fit Analysis Using the Surrogate ${R^{2}}$
Volume 3, Issue 1 (2025), pp. 94–105
Xiaorui Zhu 1, Zewei Lin 1, Dungang Liu 1 (4 authors in total)

https://doi.org/10.51387/24-NEJSDS67
Pub. online: 17 June 2024      Type: Software Tutorial and/or Review      Open Access
Area: Software

1 The authors appreciate the associate editor and two referees for their invaluable feedback during the review process. Their expertise and insights enriched the quality of the work.

Accepted
9 May 2024
Published
17 June 2024

Abstract

Categorical data are prevalent in almost all research fields and business applications. Their statistical analysis and inference often rely on probit/logistic regression models. For these common models, however, there is no universally adopted measure for performing goodness-of-fit analysis. To this end, [26] proposed a so-called surrogate ${R^{2}}$ that resembles the ordinary least squares (OLS) ${R^{2}}$ for linear regression models. The surrogate ${R^{2}}$ uses the notion of surrogacy, namely, generating a continuous response S and using it as a surrogate for the original categorical response Y [24, 25, 8]. In this paper, we develop an R package $\mathbf{SurrogateRsq}$ to implement the surrogate ${R^{2}}$ method [43]. The package is compatible with existing model fitting functions (e.g., glm(), polr(), clm(), and vglm()), and its features are exhibited in a wine rating analysis. Our package can be used jointly with other R packages developed for variable selection and model diagnostics so as to form a complete model development process. This process is summarized and demonstrated in a categorical-data-modeling workflow that practitioners can follow. To illustrate a further use of the surrogate-${R^{2}}$-based goodness-of-fit analysis, we also use this package to compare different empirical models trained from different samples in the wine rating analysis. The result suggests that the package allows us to evaluate comparability across multiple samples/models/studies that address the same or similar scientific or business questions.
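
The surrogacy idea described in the abstract can be illustrated with a small, self-contained simulation. The sketch below is not the SurrogateRsq API and is written in Python rather than R purely for illustration. It assumes a binary probit model with one predictor and, as a simplification, plugs in the true intercept and slope instead of estimates from a fitted model. The function names (`draw_surrogate`, `ols_r_squared`, `simulate_surrogate_r2`) are hypothetical. For each observation, a continuous surrogate S is drawn from the latent normal distribution truncated to the side of zero implied by Y, and the ordinary OLS ${R^{2}}$ of S on the predictor is then computed.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, provides cdf() and inv_cdf()


def draw_surrogate(linear_predictor, y, rng):
    """Draw S from N(linear_predictor, 1), truncated to (0, inf) if y == 1
    and to (-inf, 0] if y == 0, using inverse-CDF sampling."""
    p_zero = STD_NORMAL.cdf(-linear_predictor)  # P(latent variable <= 0)
    if y == 1:
        u = rng.uniform(p_zero, 1.0)
    else:
        u = rng.uniform(0.0, p_zero)
    # Keep u strictly inside (0, 1) so that inv_cdf is defined.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return linear_predictor + STD_NORMAL.inv_cdf(u)


def ols_r_squared(x, s):
    """R^2 of the simple linear regression of s on x (squared correlation)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_s = sum(s) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sss = sum((si - mean_s) ** 2 for si in s)
    sxs = sum((xi - mean_x) * (si - mean_s) for xi, si in zip(x, s))
    return sxs ** 2 / (sxx * sss)


def simulate_surrogate_r2(n=5000, intercept=0.0, slope=1.0, seed=1):
    """Simulate binary probit data and return the OLS R^2 of the surrogate.

    Assumption: the true coefficients stand in for fitted ones.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Binary response: Y = 1 when the latent variable exceeds zero.
    y = [1 if intercept + slope * xi + rng.gauss(0.0, 1.0) > 0 else 0 for xi in x]
    # Continuous surrogate response generated conditionally on Y.
    s = [draw_surrogate(intercept + slope * xi, yi, rng) for xi, yi in zip(x, y)]
    return ols_r_squared(x, s)


r2 = simulate_surrogate_r2()
print(round(r2, 3))
```

With a standard normal predictor, unit slope, and unit error variance, the latent-scale ${R^{2}}$ is $1/(1+1)=0.5$, so the printed value should be close to 0.5 when the model is correctly specified. In practice the coefficients would come from a fitted model, such as one returned by glm() or polr().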

References

[1] 
Analytics, R. and Weston, S. (2015). foreach: Provides foreach looping construct for R. R package version 1.4.3. https://CRAN.R-project.org/package=foreach.
[2] 
Anderson, D. and Kurtz, T. Continuous time Markov chain models for chemical reaction networks. http://www.math.wisc.edu/~kurtz/papers/AndKurJuly10.pdf. Accessed 27 July 2010.
[3] 
Blanchet, J., Leder, K. and Glynn, P. (2009). Efficient Simulation of Light-Tailed Sums: an Old-Folk Song Sung to a Faster New Tune... In Monte Carlo and Quasi-Monte Carlo Methods (P. L’Ecuyer and A. B. Owen, eds.) Springer, Berlin. https://doi.org/10.1007/978-3-642-04107-5_13. MR2743897
[4] 
Blanchet, J., Leder, K. and Shi, Y. (2011). Analysis of a splitting estimator for rare event probabilities in Jackson networks. Stochastic Systems 1 306–339. https://doi.org/10.1214/11-SSY026. MR2949543
[5] 
Breheny, P. (2013). ncvreg: Regularization paths for SCAD- and MCP-penalized regression models. R package version 2 6–0. https://pbreheny.github.io/ncvreg/.
[6] 
Breheny, P. and Breheny, M. P. (2014). Package ‘grpreg’. https://pbreheny.github.io/grpreg/.
[7] 
Chao, X., Miyazawa, M. and Pinedo, M. (1999) Queueing Networks: Customers, Signals and Product Form Solutions. Wiley, New York.
[8] 
Cheng, C., Wang, R. and Zhang, H. (2021). Surrogate Residuals for Discrete Choice Models. Journal of Computational and Graphical Statistics 30(1) 67–77. https://doi.org/10.1080/10618600.2020.1775618.
[9] 
Christensen, R. H. B. (2019). ordinal—Regression Models for Ordinal Data. R package version 2019.12-10. https://CRAN.R-project.org/package=ordinal. http://www2.uaem.mx/r-mirror/web/packages/ordinal/.
[10] 
Cortez, P., Cerdeira, A., Almeida, F., Matos, T. and Reis, J. (2009). Modeling Wine Preferences by Data Mining from Physicochemical Properties. Decision Support Systems 47(4) 547–553. https://doi.org/10.1016/j.dss.2009.05.016.
[11] 
Cox, D. R. and Wermuth, N. (1992). A Comment on the Coefficient of Determination for Binary Responses. The American Statistician 46(1) 1–4. https://doi.org/10.1080/00031305.1992.10475836.
[12] 
Cox, D. and Snell, E. (1989) Analysis of Binary Data 32. https://doi.org/10.1201/9781315137391.
[13] 
Cragg, J. G. and Uhler, R. S. (1970). The Demand for Automobiles. The Canadian Journal of Economics 3(3) 386–406. https://doi.org/10.2307/133656.
[14] 
Efron, B. (1978). Regression and ANOVA with Zero-one Data: Measures of Residual Variation. Journal of the American Statistical Association 73(361) 113–121. https://doi.org/10.1080/01621459.1978.10480013. MR0501624
[15] 
Efron, B. and Tibshirani, R. J. An Introduction to the Bootstrap. Springer US. http://link.springer.com/10.1007/978-1-4899-4541-9.
[16] 
Fan, J. and Lv, J. (2008). Sure Independence Screening for Ultrahigh Dimensional Feature Space. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70(5) 849–911. https://doi.org/10.1111/j.1467-9868.2008.00674.x. MR2530322
[17] 
Friedman, J., Hastie, T. and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33(1) 1. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2929880/.
[18] 
Greenwell, B. M., McCarthy, A. J., Boehmke, B. C. and Liu, D. (2018). Residuals and Diagnostics for Binary and Ordinal Regression Models: An Introduction to the sure Package. The R Journal 10(1) 381–394. https://doi.org/10.32614/RJ-2018-004.
[19] 
Hagle, T. M. and Mitchell, G. E. (1992). Goodness-of-Fit Measures for Probit and Logit. American Journal of Political Science 36(3) 762–784. https://doi.org/10.2307/2111590.
[20] 
Harrell Jr, F. E. (2019). rms: Regression Modeling Strategies. R package version 5.1-4. https://CRAN.R-project.org/package=rms.
[21] 
Hu, B., Shao, J. and Palta, M. (2006). Pseudo-R${^{2}}$ in Logistic Regression Model. Statistica Sinica 16(3) 847–860. https://www.jstor.org/stable/24307577.
[22] 
Laitila, T. (1993). A Pseudo-R${^{2}}$ Measure for Limited and Qualitative Dependent Variable Models. Journal of Econometrics 56(3) 341–356. https://doi.org/10.1016/0304-4076(93)90125-O. MR1219168
[23] 
Li, S., Zhu, X., Chen, Y. and Liu, D. (2021). PAsso: an R Package for Assessing Partial Association between Ordinal Variables. The R Journal 13(2) 135. https://doi.org/10.32614/RJ-2021-088.
[24] 
Liu, D. and Zhang, H. (2018). Residuals and Diagnostics for Ordinal Regression Models: A Surrogate Approach. Journal of the American Statistical Association 113(522) 845–854. https://doi.org/10.1080/01621459.2017.1292915. MR3832231
[25] 
Liu, D., Li, S., Yu, Y. and Moustaki, I. (2021). Assessing Partial Association Between Ordinal Variables: Quantification, Visualization, and Hypothesis Testing. Journal of the American Statistical Association 116(534) 955–968. https://doi.org/10.1080/01621459.2020.1796394. MR4270036
[26] 
Liu, D., Zhu, X., Greenwell, B. and Lin, Z. (2023). A new goodness-of-fit measure for probit models: Surrogate R${^{2}}$. British Journal of Mathematical and Statistical Psychology 76(1) 192–210. https://doi.org/10.1111/bmsp.12289.
[27] 
Liu, I. and Agresti, A. (2005). The Analysis of Ordered Categorical Data: An Overview and a Survey of Recent Developments (with discussion). Test 14(1) 1–73. https://doi.org/10.1007/BF02595397.
[28] 
Lumley, T. and Lumley, M. T. (2013). Package ‘leaps’. Regression subset selection. Thomas Lumley Based on Fortran Code by Alan Miller. Available online: http://CRAN.R-project.org/package=leaps (Accessed on 18 March 2018). https://cran.r-project.org/web/packages/leaps/index.html.
[29] 
McFadden, D. (1973). Conditional Logit Analysis of Qualitative Choice Behavior. In Frontiers in Econometrics (P. Zarembka, ed.) 105–142. https://eml.berkeley.edu/reprints/mcfadden/zarembka.pdf.
[30] 
McKelvey, R. D. and Zavoina, W. (1975). A Statistical Model for the Analysis of Ordinal Level Dependent Variables. Journal of Mathematical Sociology 4(1) 103–120. https://doi.org/10.1080/0022250X.1975.9989847. MR0400610
[31] 
Nagelkerke, N. J. (1991). A Note on a General Definition of the Coefficient of Determination. Biometrika 78(3) 691–692. https://doi.org/10.1093/biomet/78.3.691. MR1130937
[32] 
Pant, S., Blaauw, D., Zolotov, V., Sundareswaran, S. and Panda, R. (2004). A stochastic approach to power grid analysis. In Proceedings of the 41st annual Design Automation Conference. DAC ’04 171–176. ACM, New York.
[33] 
Ripley, B., Venables, B., Bates, D. M., Hornik, K., Gebhardt, A., Firth, D. and Ripley, M. B. (2013). Package ‘mass’. CRAN R 538 113–120. http://www.stats.ox.ac.uk/pub/MASS4/.
[34] 
Saldana, D. F. and Feng, Y. (2018). SIS: An R package for sure independence screening in ultrahigh-dimensional statistical models. Journal of Statistical Software 83 1–25. https://doi.org/10.18637/jss.v083.i02.
[35] 
Simon, N., Friedman, J., Hastie, T. and Tibshirani, R. (2011). Regularization paths for Cox’s proportional hazards model via coordinate descent. Journal of Statistical Software 39(5) 1. https://doi.org/10.18637/jss.v039.i05.
[36] 
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58(1) 267–288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x. MR1379242
[37] 
Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological) 58(1) 267–288. https://www.jstor.org/stable/2346178. MR1379242
[38] 
Tjur, T. (2009). Coefficients of Determination in Logistic Regression Models—A New Proposal: The Coefficient of Discrimination. The American Statistician 63(4) 366–372. https://doi.org/10.1198/tast.2009.08210. MR2751755
[39] 
Veall, M. R. and Zimmermann, K. F. (1996). Pseudo-R${^{2}}$ Measures for Some Common Limited Dependent Variable Models. Journal of Economic Surveys 10(3) 241–259. https://doi.org/10.1111/j.1467-6419.1996.tb00013.x.
[40] 
Wurm, M. J., Rathouz, P. J. and Hanlon, B. M. (2021). Regularized Ordinal Regression and the ordinalNet R Package. Journal of Statistical Software 99 1–42. https://doi.org/10.18637/jss.v099.i06.
[41] 
Yee, T. W. et al. (2010). The VGAM Package for Categorical Data Analysis. Journal of Statistical Software 32(10) 1–34. https://doi.org/10.18637/jss.v032.i10.
[42] 
Zheng, B. and Agresti, A. (2000). Summarizing the Predictive Power of a Generalized Linear Model. Statistics in Medicine 19(13) 1771–1781. https://doi.org/10.1002/1097-0258(20000715)19:13<1771::AID-SIM485>3.0.CO;2-P.
[43] 
Zhu, X., Lin, Z. and Liu, D. (2024). SurrogateRsq: Goodness-of-Fit Analysis for Categorical Data using the Surrogate R-Squared. R package version 0.2.1.9000. https://xiaorui.site/SurrogateRsq/.
[44] 
Zhu, X., Li, S., Chen, Y. and Liu, D. (2020). PAsso: an R Package for Assessing Partial Association between Ordinal Variables. R package Version 0.1.9. https://xiaorui.site/PAsso/.
[45] 
Zou, H. (2006). The Adaptive Lasso and Its Oracle Properties. Journal of the American Statistical Association 101(476) 1418–1429. https://doi.org/10.1198/016214506000000735. MR2279469
[46] 
Zou, H. and Hastie, T. (2005). Regularization and Variable Selection via the Elastic Net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67(2) 301–320. https://doi.org/10.1111/j.1467-9868.2005.00503.x. MR2137327


Copyright
© 2025 New England Statistical Society
Open access article under the CC BY license.

Keywords
Categorical data analysis; Goodness-of-fit measure; Logistic regression; Model comparison; Probit model; Surrogate method; Surrogate residual

ISSN: 2693-7166