Subdata Selection With a Large Number of Variables
Volume 1, Issue 3 (2023), pp. 426–438
Pub. online: 15 June 2023
Type: Statistical Methodology
Open Access
Accepted
18 May 2023
Published
15 June 2023
Abstract
Subdata selection from big data is an active area of research that facilitates inferences based on big data with limited computational expense. For linear regression models, the optimal design-inspired Information-Based Optimal Subdata Selection (IBOSS) method is a computationally efficient method for selecting subdata with excellent statistical properties. However, the method can only be used if the subdata size, k, is at least twice the number of regression variables, p. In addition, even when $k\ge 2p$, under the assumption of effect sparsity one can expect to obtain subdata with better statistical properties by focusing on the active variables. Inspired by recent efforts to extend the IBOSS method to situations with a large number of variables p, we introduce a method called Combining Lasso And Subdata Selection (CLASS). As we demonstrate, CLASS improves on other proposed methods in terms of variable selection and building a predictive model based on subdata when the full data size n is very large and the number of variables p is large. In terms of computational expense, CLASS is more expensive than recent competitors for moderately large values of n, but the roles reverse under effect sparsity for extremely large values of n.
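The abstract describes a two-stage recipe: a Lasso step to identify the active variables under effect sparsity, followed by IBOSS-style subdata selection restricted to those variables, which can relax the $k\ge 2p$ constraint since only the active variables need to be covered. The Python sketch below illustrates this general idea only; the pilot-sample size, the use of scikit-learn's LassoCV, and the exact extreme-value selection rule are illustrative assumptions, not the CLASS algorithm as specified in the paper.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def class_style_subdata(X, y, k, pilot_size=1000, seed=0):
    """Hypothetical CLASS-style pipeline: Lasso screening on a pilot
    sample, then IBOSS-style extreme-value selection restricted to the
    active variables. An illustration, not the authors' implementation."""
    rng = np.random.default_rng(seed)
    n, p = X.shape

    # Stage 1: cross-validated Lasso on a small random pilot sample to
    # estimate the set of active variables under effect sparsity.
    pilot = rng.choice(n, size=min(pilot_size, n), replace=False)
    active = np.flatnonzero(LassoCV(cv=5).fit(X[pilot], y[pilot]).coef_)
    if active.size == 0:
        active = np.arange(p)  # no variable selected: fall back to all

    # Stage 2: IBOSS-style selection on the active variables only: for
    # each such variable, keep the rows with its r smallest and r largest
    # values among the rows not yet selected (D-optimality inspired).
    r = max(1, k // (2 * active.size))
    chosen = np.zeros(n, dtype=bool)
    for j in active:
        remaining = np.flatnonzero(~chosen)
        order = remaining[np.argsort(X[remaining, j])]
        chosen[order[:r]] = True   # r smallest values of variable j
        chosen[order[-r:]] = True  # r largest values of variable j
    return np.flatnonzero(chosen)  # row indices of the selected subdata
```

After selecting the rows, one would typically refit least squares on the subdata using only the active variables to obtain the predictive model, in the spirit of the abstract's goal of building a predictive model based on subdata.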
Supplementary material
The Supplementary Material is available online and contains more performance results corresponding to the cases in Table 1.
References
Ai, M., Yu, J., Zhang, H. and Wang, H. (2019). Optimal subsampling algorithms for big data regressions. Statistica Sinica. https://doi.org/10.5705/ss.202018.0439
Ai, M., Wang, F., Yu, J. and Zhang, H. (2021). Optimal subsampling for large-scale quantile regression. Journal of Complexity 62 101512. https://doi.org/10.1016/j.jco.2020.101512. MR4174536
Chen, X. and Xie, M.-G. (2014). A split-and-conquer approach for analysis of extraordinarily large data. Statistica Sinica 24(4) 1655–1684. MR3308656
Cheng, Q., Wang, H. and Yang, M. (2020). Information-based optimal subdata selection for big data logistic regression. Journal of Statistical Planning and Inference 209 112–122. https://doi.org/10.1016/j.jspi.2020.03.004. MR4096258
Drineas, P., Mahoney, M. W. and Muthukrishnan, S. (2006). Sampling algorithms for l2 regression and applications. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms 1127–1136. https://doi.org/10.1145/1109557.1109682. MR2373840
Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70(5) 849–911. https://doi.org/10.1111/j.1467-9868.2008.00674.x. MR2530322
Fan, Y., Liu, Y. and Zhu, L. (2021). Optimal subsampling for linear quantile regression models. Canadian Journal of Statistics 49(4) 1039–1057. https://doi.org/10.1002/cjs.11590. MR4349634
Fithian, W. and Hastie, T. (2014). Local case-control sampling: Efficient subsampling in imbalanced data sets. The Annals of Statistics 42(5) 1693. https://doi.org/10.1214/14-AOS1220. MR3257627
Friedman, J., Hastie, T. and Tibshirani, R. (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software 33(1) 1–22. https://doi.org/10.18637/jss.v033.i01.
Han, L., Tan, K. M., Yang, T. and Zhang, T. (2020). Local uncertainty sampling for large-scale multiclass logistic regression. The Annals of Statistics 48(3) 1770–1788. https://doi.org/10.1214/19-AOS1867. MR4124343
Hedayat, A. S., Sloane, N. J. A. and Stufken, J. (1999). Orthogonal Arrays: Theory and Applications. Springer Science & Business Media. https://doi.org/10.1007/978-1-4612-1478-6. MR1693498
Joseph, V. R. and Mak, S. (2021). Supervised compression of big data. Statistical Analysis and Data Mining: The ASA Data Science Journal 14(3) 217–229. https://doi.org/10.1002/sam.11508. MR4303067
Joseph, V. R. and Vakayil, A. (2022). SPlit: An optimal method for data splitting. Technometrics 64(2) 166–176. https://doi.org/10.1080/00401706.2021.1921037. MR4410911
Kiefer, J. and Wolfowitz, J. (1960). The equivalence of two extremum problems. Canadian Journal of Mathematics 12 363–366. https://doi.org/10.4153/CJM-1960-030-4. MR0117842
Kleiner, A., Talwalkar, A., Sarkar, P. and Jordan, M. I. (2014). A scalable bootstrap for massive data. Journal of the Royal Statistical Society: Series B: Statistical Methodology 76(4) 795–816. https://doi.org/10.1111/rssb.12050. MR3248677
Knight, K. and Fu, W. (2000). Asymptotics for lasso-type estimators. The Annals of Statistics 28(5) 1356–1378. https://doi.org/10.1214/aos/1015957397. MR1805787
Lin, N. and Xi, R. (2011). Aggregated estimating equation estimation. Statistics and Its Interface 4(1) 73–83. https://doi.org/10.4310/SII.2011.v4.n1.a8. MR2775250
Ma, P. and Sun, X. (2015). Leveraging for big data regression. Wiley Interdisciplinary Reviews: Computational Statistics 7(1) 70–76. https://doi.org/10.1002/wics.1324. MR3348722
Ma, P., Mahoney, M. W. and Yu, B. (2015). A statistical perspective on algorithmic leveraging. The Journal of Machine Learning Research 16(1) 861–911. MR3361306
Mahoney, M. W. (2011). Randomized algorithms for matrices and data. arXiv preprint arXiv:1104.5557.
Mak, S. and Joseph, V. R. (2018). Support points. The Annals of Statistics 46(6A) 2562–2592. https://doi.org/10.1214/17-AOS1629. MR3851748
Meinshausen, N. (2007). Relaxed lasso. Computational Statistics & Data Analysis 52(1) 374–393. https://doi.org/10.1016/j.csda.2006.12.019. MR2409990
Meinshausen, N. and Bühlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72(4) 417–473. https://doi.org/10.1111/j.1467-9868.2010.00740.x. MR2758523
Meng, C., Xie, R., Mandal, A., Zhang, X., Zhong, W. and Ma, P. (2020). LowCon: A design-based subsampling approach in a misspecified linear model. Journal of Computational and Graphical Statistics 1–32. https://doi.org/10.1080/10618600.2020.1844215. MR4313470
Meng, C., Zhang, X., Zhang, J., Zhong, W. and Ma, P. (2020). More efficient approximation of smoothing splines via space-filling basis selection. Biometrika 107(3) 723–735. https://doi.org/10.1093/biomet/asaa019. MR4138986
Muecke, N., Reiss, E., Rungenhagen, J. and Klein, M. (2022). Data-splitting improves statistical performance in overparameterized regimes. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics (G. Camps-Valls, F. J. R. Ruiz and I. Valera, eds.). Proceedings of Machine Learning Research 151 10322–10350. PMLR. https://proceedings.mlr.press/v151/muecke22a.html.
Schifano, E. D., Wu, J., Wang, C., Yan, J. and Chen, M.-H. (2016). Online updating of statistical inference in the big data setting. Technometrics 58(3) 393–403. https://doi.org/10.1080/00401706.2016.1142900. MR3520668
Shao, L., Song, S. and Zhou, Y. (2022). Optimal subsampling for large-sample quantile regression with massive data. Canadian Journal of Statistics. MR4595236
Song, Q. and Liang, F. (2015). A split-and-merge Bayesian variable selection approach for ultrahigh dimensional regression. Journal of the Royal Statistical Society: Series B: Statistical Methodology 77 947–972. https://doi.org/10.1111/rssb.12095. MR3414135
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58(1) 267–288. MR1379242
Wang, C., Chen, M.-H., Schifano, E., Wu, J. and Yan, J. (2016). Statistical methods and computing for big data. Statistics and Its Interface 9(4) 399. https://doi.org/10.4310/SII.2016.v9.n4.a1. MR3553369
Wang, H. (2019). More Efficient Estimation for Logistic Regression with Optimal Subsamples. Journal of Machine Learning Research 20(132) 1–59. MR4002886
Wang, H. and Ma, Y. (2020). Optimal subsampling for quantile regression in big data. Biometrika. https://doi.org/10.1093/biomet/asaa043. MR4226192
Wang, H., Yang, M. and Stufken, J. (2019). Information-based optimal subdata selection for big data linear regression. Journal of the American Statistical Association 114(525) 393–405. https://doi.org/10.1080/01621459.2017.1408468. MR3941263
Wang, H., Zhu, R. and Ma, P. (2018). Optimal subsampling for large sample logistic regression. Journal of the American Statistical Association 113(522) 829–844. https://doi.org/10.1080/01621459.2017.1292914. MR3832230
Wang, L., Elmstedt, J., Wong, W. K. and Xu, H. (2021). Orthogonal subsampling for big data linear regression. The Annals of Applied Statistics 15(3) 1273–1290. https://doi.org/10.1214/21-aoas1462. MR4316648
Xue, Y., Wang, H., Yan, J. and Schifano, E. D. (2020). An online updating approach for testing the proportional hazards assumption with streams of survival data. Biometrics 76(1) 171–182. https://doi.org/10.1111/biom.13137. MR4098553
Yao, Y. and Wang, H. (2019). Optimal subsampling for softmax regression. Statistical Papers 60(2) 235–249. https://doi.org/10.1007/s00362-018-01068-6. MR3969047
Yao, Y. and Wang, H. (2021). A Selective Review on Statistical Techniques for Big Data. Modern Statistical Methods for Health Research 223–245. https://doi.org/10.1007/978-3-030-72437-5_11. MR4367515
Yu, J. and Wang, H. (2022). Subdata selection algorithm for linear model discrimination. Statistical Papers 1–24. https://doi.org/10.1007/s00362-022-01299-8. MR4512216
Yu, J., Wang, H., Ai, M. and Zhang, H. (2020). Optimal distributed subsampling for maximum quasi-likelihood estimators with massive data. Journal of the American Statistical Association 1–12. https://doi.org/10.1080/01621459.2020.1773832. MR4399084
Yuan, M. and Lin, Y. (2007). On the non-negative garrotte estimator. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69(2) 143–161. https://doi.org/10.1111/j.1467-9868.2007.00581.x. MR2325269
Zhang, H. and Wang, H. (2021). Distributed subdata selection for big data via sampling-based approach. Computational Statistics & Data Analysis 153 107072. https://doi.org/10.1016/j.csda.2020.107072. MR4144200
Zhang, T., Ning, Y. and Ruppert, D. (2021). Optimal sampling for generalized linear models under measurement constraints. Journal of Computational and Graphical Statistics 30(1) 106–114. https://doi.org/10.1080/10618600.2020.1778483. MR4235968
Zhao, P. and Yu, B. (2006). On model selection consistency of Lasso. The Journal of Machine Learning Research 7 2541–2563. MR2274449