Subdata Selection With a Large Number of Variables
Volume 1, Issue 3 (2023), pp. 426–438
Pub. online: 15 June 2023 | Type: Statistical Methodology | Open Access
Subdata selection from big data is an active area of research that enables inferences based on big data at limited computational expense. For linear regression models, the optimal design-inspired Information-Based Optimal Subdata Selection (IBOSS) method is a computationally efficient method for selecting subdata with excellent statistical properties. However, the method can only be used if the subdata size $k$ is at least twice the number of regression variables $p$. Moreover, even when $k\ge 2p$, under the assumption of effect sparsity one can expect to obtain subdata with better statistical properties by focusing on the active variables. Inspired by recent efforts to extend the IBOSS method to situations with a large number of variables $p$, we introduce a method called Combining Lasso And Subdata Selection (CLASS) that, as we show, improves on other proposed methods in terms of variable selection and of building a predictive model based on subdata when the full data size $n$ is very large and the number of variables $p$ is large. In terms of computational expense, CLASS is more expensive than recent competitors for moderately large values of $n$, but the roles reverse under effect sparsity for extremely large values of $n$.
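The abstract describes CLASS only at a high level. The Python sketch below is an illustration of the general idea under stated assumptions, not the authors' implementation: the lasso is fit on a small random pilot sample to identify a candidate set of active variables, and the standard IBOSS extreme-value rule is then applied to those variables only before fitting a least-squares model on the selected subdata. The function names (`iboss`, `class_subdata`), the pilot-sample size, and the use of `LassoCV` for tuning are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

def iboss(X, k):
    """D-optimality-inspired IBOSS rule: for each column, take the
    k/(2p) not-yet-selected rows with the smallest values and the
    k/(2p) with the largest values of that column."""
    n, p = X.shape
    r = k // (2 * p)
    if r < 1:
        raise ValueError("IBOSS requires the subdata size k >= 2p.")
    chosen = np.zeros(n, dtype=bool)
    for j in range(p):
        avail = np.flatnonzero(~chosen)
        order = np.argsort(X[avail, j])
        chosen[avail[order[:r]]] = True   # r smallest values
        chosen[avail[order[-r:]]] = True  # r largest values
    return np.flatnonzero(chosen)

def class_subdata(X, y, k, pilot=1000, seed=None):
    """Hypothetical CLASS-style sketch: lasso on a random pilot sample
    to screen for active variables, then IBOSS on those variables."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    idx = rng.choice(n, size=min(pilot, n), replace=False)
    lasso = LassoCV(cv=5).fit(X[idx], y[idx])
    active = np.flatnonzero(lasso.coef_ != 0)
    if active.size == 0:          # fall back to all variables
        active = np.arange(p)
    sub = iboss(X[:, active], k)
    model = LinearRegression().fit(X[np.ix_(sub, active)], y[sub])
    return sub, active, model

# Toy demo: large n, moderate p, only 5 truly active coefficients.
rng = np.random.default_rng(0)
n, p = 100_000, 50
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]
y = X @ beta + rng.standard_normal(n)
sub, active, model = class_subdata(X, y, k=200, seed=1)
print("selected variables:", active)
print("subdata size:", sub.size)
```

The screening step is what lets the extreme-value selection run with a small $k$: once the pilot lasso has reduced the problem to the active variables, the constraint $k \ge 2p$ only needs to hold for the reduced variable count.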
Supplementary Material
The Supplementary Material is available online and contains more performance results corresponding to the cases in Table 1.