The Supplementary Material is available online and contains more performance results corresponding to the cases in Table

Subdata selection from big data is an active area of research that facilitates inferences based on big data with limited computational expense. For linear regression models, the optimal design-inspired Information-Based Optimal Subdata Selection (IBOSS) method selects subdata with excellent statistical properties at low computational cost. But the method can only be used if the subdata size,

Unprecedented advancements in modern information technologies have resulted in an exponential growth of data and massive datasets. Data sizes are now measured in terabytes (TB) or petabytes (PB) and not in mere megabytes (MB) or gigabytes (GB). Big data facilitates and incentivizes data-driven decisions in almost every area of science, industry, and government. Given the challenges that big data presents due to its volume, variety, and complexity, extracting high-quality information from big data is a prerequisite for understanding the data meaningfully [

Some statistical methods for analyzing big data include bags of little bootstraps by [

The current literature on subdata selection is rapidly growing. Much of the relevant literature focuses on identifying subdata that yields precise estimates of parameters in a given statistical model, for example, for linear regression [see,

Model-free subdata selection methods also exist. For example, one could mirror the population distribution in the subdata [

While subdata selection methods focus on data reduction by drastically reducing the number of observations

In Section

Let

When using the full data and model (

Subdata of size

For high-dimensional data (i.e., large

For a linear regression model, current subdata selection methods can be broadly classified into two categories:

Probabilistic methods: The

Deterministic methods: These methods, some of which draw inspiration from the optimal design literature, aim to select subdata of size

In what follows, we use IBOSS as a subdata selection method owing to its computational and statistical superiority.
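For concreteness, the IBOSS selection rule can be sketched as follows: cycling through the p variables, it takes the r/(2p) not-yet-selected observations with the smallest values, and the r/(2p) with the largest values, of each variable. The function below is our illustrative sketch (the name and the naive tie handling are ours, not the authors' implementation):

```python
import numpy as np

def iboss(X, r):
    """Sketch of IBOSS-style subdata selection: for each variable, keep the
    r/(2p) not-yet-selected rows with the smallest values and the r/(2p)
    rows with the largest values of that variable."""
    n, p = X.shape
    k = r // (2 * p)                     # points taken from each tail
    if k < 1:
        raise ValueError("IBOSS needs subdata size r >= 2p")
    available = np.ones(n, dtype=bool)
    rows = []
    for j in range(p):
        idx = np.flatnonzero(available)
        order = np.argsort(X[idx, j])
        take = idx[np.concatenate([order[:k], order[-k:]])]
        rows.extend(take.tolist())
        available[take] = False          # do not select the same row twice
    return np.array(rows)
```

The `k < 1` check reflects the constraint discussed next: with r < 2p, no tail points can be taken for every variable, so the rule is inapplicable.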

Since IBOSS attempts to select at least 2 data points for each variable, it can only be applied if

We therefore develop a subdata selection method that mitigates these challenges for high-dimensional data.

With the ultimate goal of prediction, our method first screens variables to identify the active variables, and then performs the subdata selection using only the identified variables. Finally, a linear regression model with only the variables identified as active is fitted using the subdata and OLS estimation. Algorithm

CLASS.
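The CLASS pipeline can be sketched in three steps: (i) screen variables by running LASSO on several small uniform random subsamples and keeping variables selected in a large fraction of runs, (ii) run IBOSS using only the screened variables, and (iii) fit OLS on the resulting subdata. Everything below (the coordinate-descent LASSO, the run count, subsample size, penalty, and voting rule) is an illustrative sketch under our own default choices, not the paper's exact algorithm:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=100):
    """Plain coordinate-descent LASSO for 0.5*||y - X b||^2 + lam*||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.astype(float).copy()
    ss = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            rho = X[:, j] @ resid + ss[j] * beta[j]   # partial-residual correlation
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / ss[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta

def iboss(X, r):
    """IBOSS-style subdata: extreme rows of each column, per the IBOSS literature."""
    n, p = X.shape
    k = max(r // (2 * p), 1)
    avail = np.ones(n, dtype=bool)
    rows = []
    for j in range(p):
        idx = np.flatnonzero(avail)
        order = np.argsort(X[idx, j])
        take = idx[np.concatenate([order[:k], order[-k:]])]
        rows.extend(take.tolist())
        avail[take] = False
    return np.array(rows)

def class_sketch(X, y, r=1000, n_runs=20, m=500, lam=50.0, vote_frac=0.5, seed=None):
    """Illustrative CLASS pipeline: repeated-LASSO screening, then IBOSS on
    the screened variables only, then OLS on the resulting subdata."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    votes = np.zeros(p)
    for _ in range(n_runs):                          # step (i): screening
        rows = rng.choice(n, size=m, replace=False)
        votes += lasso_cd(X[rows], y[rows], lam) != 0
    selected = np.flatnonzero(votes >= vote_frac * n_runs)
    sub = iboss(X[:, selected], r)                   # step (ii): subdata
    design = np.column_stack([np.ones(len(sub)), X[np.ix_(sub, selected)]])
    coef, *_ = np.linalg.lstsq(design, y[sub], rcond=None)  # step (iii): OLS
    return selected, coef
```

Note that the expensive LASSO runs touch only m rows each, and IBOSS scans the full data only over the screened columns; this is what keeps the pipeline feasible for very large n.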

In the next two subsections, we provide evidence for the superior performance of CLASS.

We perform repeated applications of LASSO in Step 3 of Algorithm

The variable selection component of Algorithm

In the previous section, we demonstrated that the selected variables contain (a) the true active variables with very high probability and (b) some inactive variables with positive probability. In Steps 9–10 of Algorithm

IBOSS subdata attempts to maximize the determinant of the information matrix of the model based on the selected variables. Since we apply IBOSS only using the selected variables, IBOSS subdata for CLASS gives a larger determinant of the information matrix corresponding to these selected variables than a method that uses IBOSS on all variables. If the selected variables are precisely the active variables, again indexed by

Theorem 5 in [

For a multivariate normal or lognormal distribution of the variables, this would then imply that for the overall mean,

The discussion up to this point has focused on variable selection and parameter estimation rather than prediction even though our goal is good prediction. However, the strong variable selection properties of our method (see Section

First, no matter how large the full data is, variable selection can be done in a feasible time because CLASS runs LASSO only on small subsets of the full data. Second, CLASS is better than LASSO on the full data and than other competing subdata selection methods both at correctly selecting active variables and at not declaring inactive variables active. This superior performance holds irrespective of whether the variables are correlated. These claims are validated via simulations in the next section. Third, since CLASS employs IBOSS only on the selected variables when obtaining the subdata, and since the active variables are almost always among the selected variables, the subdata obtained by CLASS yields a larger determinant of the information matrix for the active variables than the subdata obtained from competing methods. As a result, CLASS provides better parameter estimates for the selected variables and better predictions.

The computational complexity of LASSO for data of size

In this section, we compare the performance of CLASS with competing methods through simulation studies. The comparison focuses on variable selection and prediction accuracy for test data.
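To make the setup concrete, the sketch below generates data from a sparse linear model with equicorrelated normal or lognormal variables, mirroring the flavor of the scenarios described next. The correlation level, coefficient values, and noise scale are illustrative placeholders of our choosing, not the exact values used in the paper's simulations:

```python
import numpy as np

def simulate(n, p, n_active, dist="normal", rho=0.5, sigma=1.0, seed=None):
    """Generate (X, y) from a sparse linear model; illustrative values only."""
    rng = np.random.default_rng(seed)
    # equicorrelated covariance to mimic correlated-variable scenarios
    cov = (1 - rho) * np.eye(p) + rho * np.ones((p, p))
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    if dist == "lognormal":
        X = np.exp(X)                    # lognormal variables, all positive
    beta = np.zeros(p)
    active = rng.choice(p, size=n_active, replace=False)
    beta[active] = 1.0                   # placeholder active-coefficient value
    y = X @ beta + rng.normal(scale=sigma, size=n)
    return X, y, active
```

Setting `rho=0.0` recovers independent variables, so both the correlated and uncorrelated scenarios can be produced from the same function.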

Data are generated from the linear model (

Simulation scenarios, with

500 | 10, 25, 50 | Normal, LogNormal
5000 | 75 | Normal, LogNormal

The simulation scenarios that we considered are summarized in Table

Variable selection: For variable selection we considered average power and average error. Power is the proportion of active variables correctly identified as active, whereas error is the proportion of inactive variables incorrectly declared active. A method with higher power and lower error is therefore preferred.

Prediction accuracy: We used the mean squared error (MSE) for test data,

Variable selection performance for

We compare CLASS to four other approaches:

Fitting the linear regression model with all

Fitting the linear regression model with all

Fitting the linear regression model with all

Fitting the linear regression model with all

For each scenario in Table

MSE for

Figures

For Figures

Variable selection performance for

MSE for

In Figure

Variable selection and MSE for

Finally, Figure

Variable selection and MSE for

For

CPU times (seconds) for different

Full | UNI | IBOSS | SIS(100)-IBOSS | CLASS | |

2.35 | 0.54 | 0.98 | 0.58 | 51.82 | |

33.96 | 0.53 | 1.81 | 0.91 | 50.39 | |

361.66 | 0.54 | 9.21 | 3.88 | 51.36 |

CPU times (seconds) for different

Full | UNI | IBOSS | SIS(100)-IBOSS | CLASS | |

100 | 16.78 | 0.19 | 1.95 | 2.31 | 18.31 |

250 | 55.03 | 0.42 | 4.79 | 2.95 | 43.11 |

500 | 361.66 | 0.54 | 9.21 | 3.88 | 51.36 |

CPU times (seconds) for

UNI | IBOSS | SIS(80)-IBOSS | CLASS | |

0.10 | 32.22 | 32.83 | 25.86 |

As suggested by one of the reviewers, since CLASS is computationally slower than the other subdata selection methods in Tables

MSE and variable selection performance for different subdata methods with approximately equal CPU times, different

Method | Subdata size | Time (s) | MSE | Power | Error

For

UNI | 90000 | 63.13 | 93.00642 | 0.9892 | 0.1957 |

IBOSS | 60000 | 54.53 | 53.00148 | 0.9932 | 0.2010 |

SIS(100)-IBOSS | 75000 | 53.73 | 49.58365 | 0.9936 | 0.2005 |

CLASS | 1000 | 50.13 | 0.084747 | 0.9998 | 0.0000 |

For

UNI | 80000 | 55.58 | 110.0874 | 0.9906 | 0.1932 |

IBOSS | 50000 | 59.72 | 103.6516 | 0.9800 | 0.1970 |

SIS(100)-IBOSS | 70000 | 56.03 | 96.07541 | 0.9808 | 0.1963 |

CLASS | 1000 | 52.47 | 0.000104 | 1 | 0.0000 |

MSE and variable selection performance for different subdata methods with approximately equal CPU times, different

Method | Subdata size | Time (s) | MSE | Power | Error

For

UNI | 90000 | 53.74 | 0.001053 | 1 | 0 |

IBOSS | 60000 | 51.60 | 0.000843 | 1 | 0 |

SIS(100)-IBOSS | 75000 | 46.89 | 0.000663 | 1 | 0 |

CLASS | 1000 | 43.56 | 0.044859 | 1 | 0 |

For

UNI | 80000 | 48.97 | 0.000687 | 1 | 0.0000 |

IBOSS | 50000 | 69.27 | 0.001016 | 1 | 0.0000 |

SIS(100)-IBOSS | 70000 | 64.64 | 0.000679 | 1 | 0.0002 |

CLASS | 1000 | 44.90 | 0.045218 | 1 | 0.0000 |

MSPE over 100 random training-test splits on the Blog Feedback data.

Full | IBOSS | CLASS |

902.04 | 1043.65 | 1003.25 |

Similar to [

With a very large number of observations

In this work, under the assumption of effect sparsity, we propose a method, CLASS, that attempts to do just that. We first devise a variable selection method that uses small uniform random samples of the full data to conduct multiple LASSO runs. As demonstrated, our variable selection approach is better than applying LASSO to IBOSS subdata, SIS(

Due to the repeated applications of LASSO, CLASS requires more computing time than the competing subdata selection methods. However, if