Subdata Selection With a Large Number of Variables
Volume 1, Issue 3 (2023), pp. 426–438
Pub. online: 15 June 2023 | Type: Statistical Methodology | Open Access
Subdata selection from big data is an active area of research that enables inferences based on big data at limited computational expense. For linear regression models, the optimal design-inspired Information-Based Optimal Subdata Selection (IBOSS) method is a computationally efficient method for selecting subdata with excellent statistical properties. However, the method can only be used if the subdata size $k$ is at least twice the number of regression variables $p$. Moreover, even when $k\ge 2p$, under the assumption of effect sparsity one can expect to obtain subdata with better statistical properties by focusing on the active variables. Inspired by recent efforts to extend the IBOSS method to situations with a large number of variables $p$, we introduce a method called Combining Lasso And Subdata Selection (CLASS) that, as we show, improves on other proposed methods in terms of variable selection and of building a predictive model based on subdata when the full data size $n$ is very large and the number of variables $p$ is large. In terms of computational expense, CLASS is more expensive than recent competitors for moderately large values of $n$, but the roles reverse under effect sparsity for extremely large values of $n$.
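The abstract describes CLASS only at a high level. The Python sketch below is an illustration of the general idea under stated assumptions, not the authors' implementation: the lasso is fit on a small random pilot sample to identify a candidate set of active variables, and the standard IBOSS extreme-value rule is then applied to those variables only before fitting a least-squares model on the selected subdata. The function names (`iboss`, `class_subdata`), the pilot-sample size, and the use of `LassoCV` for tuning are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

def iboss(X, k):
    """D-optimality-inspired IBOSS rule: for each column, take the
    k/(2p) not-yet-selected rows with the smallest values and the
    k/(2p) with the largest values of that column."""
    n, p = X.shape
    r = k // (2 * p)
    if r < 1:
        raise ValueError("IBOSS requires the subdata size k >= 2p.")
    chosen = np.zeros(n, dtype=bool)
    for j in range(p):
        avail = np.flatnonzero(~chosen)
        order = np.argsort(X[avail, j])
        chosen[avail[order[:r]]] = True   # r smallest values
        chosen[avail[order[-r:]]] = True  # r largest values
    return np.flatnonzero(chosen)

def class_subdata(X, y, k, pilot=1000, seed=None):
    """Hypothetical CLASS-style sketch: lasso on a random pilot sample
    to screen for active variables, then IBOSS on those variables."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    idx = rng.choice(n, size=min(pilot, n), replace=False)
    lasso = LassoCV(cv=5).fit(X[idx], y[idx])
    active = np.flatnonzero(lasso.coef_ != 0)
    if active.size == 0:          # fall back to all variables
        active = np.arange(p)
    sub = iboss(X[:, active], k)
    model = LinearRegression().fit(X[np.ix_(sub, active)], y[sub])
    return sub, active, model

# Toy demo: large n, moderate p, only 5 truly active coefficients.
rng = np.random.default_rng(0)
n, p = 100_000, 50
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]
y = X @ beta + rng.standard_normal(n)
sub, active, model = class_subdata(X, y, k=200, seed=1)
print("selected variables:", active)
print("subdata size:", sub.size)
```

The screening step is what lets the extreme-value selection run with a small $k$: once the pilot lasso has reduced the problem to the active variables, the constraint $k \ge 2p$ only needs to hold for the reduced variable count.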
Supplementary Material
The Supplementary Material is available online and contains more performance results corresponding to the cases in Table 1.