Construction of Supersaturated Designs with Small Coherence for Variable Selection

Supersaturated designs are often used to discover important factors in experiments with a large number of factors and a small number of runs. We propose a method for constructing supersaturated designs with small coherence. Such designs are useful for variable selection methods such as the Lasso. Examples are provided to illustrate the proposed method.


INTRODUCTION
Modern experiments can involve a large number of factors. Under the effect sparsity principle [3], an experiment with high-dimensional inputs has only a small number of significant factors. Two-level supersaturated designs are often used to identify important factors in screening experiments. Data from such a design can be modeled by a linear model [33] given by

y = Xβ + ε,   (1.1)

where X is the n × p design matrix at two levels −1 and +1, denoted by − and +, y is the response vector of the n observations, β is the regression vector of p coefficients, the error vector ε follows the multivariate normal distribution N_n(0, σ²I) and I is the n × n identity matrix. In X, the first column is the intercept with all +'s and the remaining columns represent the p − 1 factors. We call X balanced if all its factors consist of an equal number of −'s and +'s. If p = n, X is a saturated design since all the degrees of freedom are used to estimate β. If p > n, X is a supersaturated design since there are not enough degrees of freedom to estimate all components of β. Construction of supersaturated designs that can accommodate a large number of factors relative to the number of runs has been gaining increasing attention [5,16,30,26]. Popular criteria for comparing supersaturated designs include minimax [2], E(s²) [18,27] and mean square correlation [8].
It is increasingly common to use modern variable selection methods to analyze data from a supersaturated design and identify important factors [18,32,17]. For example, when the number of variables is greater than the number of observations, Osborne, Presnell and Turlach [22] advised to use the Lasso [28] for an initial selection, and then conduct a best subset selection (or a similar method) on the variables selected by the Lasso. We focus on discussing and demonstrating our method for the Lasso given its popularity; the proposed designs also work with other penalized regression methods. For variable selection methods such as the Lasso, Zhao and Yu [34] showed that performance depends more critically on the worst-case column correlation than on the average column correlation of the design matrix. More details about this point can be found in Appendix B.
Motivated by the importance of controlling the worst-case column correlation, we propose a new method to construct supersaturated designs under the coherence criterion [7], which measures the worst-case column correlation. The coherence of an n × p matrix X is

μ(X) = max_{1 ≤ i < j ≤ p} |⟨x_i, x_j⟩| / (‖x_i‖ ‖x_j‖),

where the subscript i denotes the ith column, ⟨·, ·⟩ is the dot product and ‖·‖ is the L_2-norm. Whenever there is no confusion, hereinafter we drop the symbol X in μ(X).
The smaller μ is, the closer the matrix is to orthogonal. If the columns of X are centered, μ reduces to the maximum absolute column correlation of X, which is equivalent to the correlation criterion used in [19]. Note that, for a two-level design,

μ(X) = s_max / n,

where s_max = max_{1 ≤ i < j ≤ p} |x_i′x_j| is the value used in the minimax criterion [2]. Different from s_max, μ(X) reflects the sample size n in its definition. Because the sample size matters in many data analysis procedures, we construct designs under μ(X) instead of s_max (we also compare designs with different sample sizes in Section 3.1).
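As a concrete illustration, coherence can be computed directly from the definition. The following Python sketch (function names are ours) also checks the identity μ(X) = s_max/n for a two-level design, whose columns all have norm √n:

```python
import numpy as np

def coherence(X: np.ndarray) -> float:
    """Maximum absolute cosine similarity between distinct columns of X."""
    G = X.T @ X                               # all pairwise dot products
    norms = np.linalg.norm(X, axis=0)
    C = np.abs(G) / np.outer(norms, norms)    # |<x_i, x_j>| / (||x_i|| ||x_j||)
    np.fill_diagonal(C, 0.0)                  # ignore the i == j pairs
    return float(C.max())

# A small two-level design (here a 4 x 3 orthogonal one, just for illustration).
X = np.array([[ 1,  1,  1],
              [ 1,  1, -1],
              [ 1, -1,  1],
              [ 1, -1, -1]])
n = X.shape[0]
s_max = int(np.abs(np.triu(X.T @ X, k=1)).max())
assert coherence(X) == s_max / n              # both are 0: the columns are orthogonal
```

For centered columns the same quantity is the maximum absolute column correlation, matching the discussion above.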
Our construction method allows some columns of the design matrix to be unbalanced. Factor balance in a supersaturated design guarantees an accurate estimate of the intercept, but allowing unbalance provides other significant benefits in constructing designs. Unbalanced columns provide flexibility in controlling the column correlations between every pair of main effects, which can achieve a lower average column correlation [29]; sacrificing some precision in the estimate of the intercept allows better estimation of the main effects. In addition, allowing unbalance enables us to construct designs with more flexible sizes.
The remainder of the article will unfold as follows. In Section 2, we detail our construction method. In Section 3, we use several examples to demonstrate the advantage of the constructed designs against some benchmark designs for a variable selection problem. We conclude the paper and provide some discussions in Section 4.

Notation and Definitions
Throughout, "#Balance" denotes the number of balanced columns of a design. Let 1_n denote the n-dimensional column vector of +1's. Let I denote the identity matrix. For a matrix A with entries in {−1, +1} and an even number of rows, let A* denote the matrix obtained by changing the signs of entries in all odd rows of A, and let A** denote the matrix obtained by changing the signs of entries in all even rows, where * and ** are two matrix operators. Note that (−A)** = A*. An n × n Hadamard matrix H is a matrix with entries in {−1, +1} satisfying H′H = nI, where n can only be 1, 2 or a multiple of 4.
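The two operators are simple row-wise sign flips. A minimal numpy sketch (function names are ours; we take "odd rows" to mean rows 1, 3, 5, … in the paper's 1-based counting) verifies the identity (−A)** = A*:

```python
import numpy as np

def op_star(A: np.ndarray) -> np.ndarray:
    """A*: flip the signs of all odd rows (1-based), i.e., 0-based indices 0, 2, 4, ..."""
    B = A.copy()
    B[0::2, :] *= -1
    return B

def op_dstar(A: np.ndarray) -> np.ndarray:
    """A**: flip the signs of all even rows (1-based), i.e., 0-based indices 1, 3, 5, ..."""
    B = A.copy()
    B[1::2, :] *= -1
    return B

A = np.array([[ 1,  1],
              [ 1, -1],
              [-1,  1],
              [-1, -1]])
# The identity stated in the text: (-A)** = A*.
assert np.array_equal(op_dstar(-A), op_star(A))
```

Both operators are involutions (applying one twice returns the original matrix), and applying either operator to two columns simultaneously leaves their dot product unchanged, which is what the appendix proofs rely on.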

Construction Steps
Our construction method expands a 6m × p supersaturated design D_0 with μ ≤ 1/3 into a 12m × 4p design with μ ≤ 1/3, where m and p are positive integers with 6m < p. Partition D_0 as

D_0 = [U; L],   (2.1)

where U and L have 4m and 2m rows, respectively, and [A; B] denotes stacking A on top of B. The proposed method has three steps.
• Step 1: Use two copies of D_0 to obtain a matrix

D_1 = [U; U; L; L].

• Step 2: Expand D_1 to obtain a matrix

D_2 = [D_1, E_2], with E_2 = [U; U*; L; −L],   (2.2)

where [A, B] denotes placing A and B side by side.

• Step 3: Expand D_2 to obtain a matrix

D_3 = [D_2, E_3],

where E_3 is obtained from D_2 by applying the operator ** to its first and fourth block rows and the operator * to its third block row.

If D_1 is viewed as a block matrix with four rows, Step 2 appends a copy of D_1 in which the operator * is applied to the second row and the signs of all entries in the fourth row are changed, which yields D_2. If D_2 is viewed as a block matrix with four rows and two columns, Step 3 appends a copy of D_2 in which the operator ** is applied to the first and fourth rows and the operator * is applied to the third row, which yields D_3. It can be proved that the coherence of D_1, D_2 and D_3 is no greater than 1/3 by Lemmas 1, 2, 3 and Theorem 1 in Appendix A.
Since the design D_3 from the construction is a 12m × 4p design with μ ≤ 1/3, repeating the above procedure with D_0 replaced by the current D_3 produces a class of designs with 2^k · 6m rows, 4^k · p columns and μ ≤ 1/3 for every positive integer k.
The only requirement for D 0 is to be a 6m × p two-level supersaturated design with μ ≤ 1/3. For a given pair of m and p, different choices of D 0 yield different forms of D 3 but all of them have guaranteed small coherence. As suggested by Chen and Lin [6] and Liu, Ruan and Dean [20], a supersaturated design with coherence of 1/3 can identify most of the important factors.
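Under our reading of the three steps (Step 1 stacks U, U, L, L; Steps 2 and 3 append sign-modified copies), the construction can be sketched and checked numerically. In this Python sketch the greedy search for a small starting design D_0 is our own device, not part of the paper; it merely produces some 6 × p design with μ ≤ 1/3 to feed into one round of expansion:

```python
import numpy as np
from itertools import product

def coherence(X):
    """mu of a two-level design: max |x_i' x_j| / n over distinct columns."""
    G = np.abs(X.T @ X).astype(float)
    np.fill_diagonal(G, 0.0)
    return G.max() / X.shape[0]

def op_star(A):                      # A*: flip odd rows (1-based)
    B = A.copy(); B[0::2] *= -1; return B

def op_dstar(A):                     # A**: flip even rows (1-based)
    B = A.copy(); B[1::2] *= -1; return B

def expand(D0: np.ndarray) -> np.ndarray:
    """One round: a 6m x p design with mu <= 1/3 -> a 12m x 4p design with mu <= 1/3."""
    n, p = D0.shape
    m = n // 6
    U, L = D0[:4 * m], D0[4 * m:]                    # partition (2.1)
    D1 = np.vstack([U, U, L, L])                     # Step 1
    E2 = np.vstack([U, op_star(U), L, -L])
    D2 = np.hstack([D1, E2])                         # Step 2, display (2.2)
    r1, r2 = D2[:4 * m], D2[4 * m:8 * m]             # block rows of D2
    r3, r4 = D2[8 * m:10 * m], D2[10 * m:]
    E3 = np.vstack([op_dstar(r1), r2, op_star(r3), op_dstar(r4)])
    return np.hstack([D2, E3])                       # Step 3

# Greedily assemble a 6-row starting design D0 with mu <= 1/3 (our device).
cols = [np.array(c) for c in product([-1, 1], repeat=6)]
D0 = np.ones((6, 1), dtype=int)                      # start from the intercept column
for c in cols:
    cand = np.hstack([D0, c.reshape(-1, 1)])
    if coherence(cand) <= 1 / 3 + 1e-12:
        D0 = cand

D3 = expand(D0)
assert D3.shape == (2 * D0.shape[0], 4 * D0.shape[1])
assert coherence(D3) <= 1 / 3 + 1e-12                # the guarantee of Theorem 1
```

Applying expand repeatedly doubles the run size and quadruples the number of columns each time while keeping μ ≤ 1/3, matching the k-fold expansion described above.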

Examples
We now provide several examples for the proposed method with different choices of D_0. In each example, we show the first several designs constructed by the proposed method and choose D_0 to have as many columns as possible.

Example 1. Let D_0 be a 6 × 16 design at the two levels − and + (denoting −1 and +1, respectively) with coherence 1/3. The coherence of this design attains the lower bound derived in [31]. The sizes, coherence values and numbers of balanced columns of the first several designs constructed by the proposed method are given in Table 1.

Example 2. Let

D_0 = [1_24, D_01, D_02, D_03],

where D_01 is the 24 × 23 Plackett and Burman design [23], D_02 is formed by taking the two-factor interaction columns of D_01 and D_03 is formed by taking the three-factor interaction columns of D_01. Then D_0 is a 24 × 2048 design with μ = 1/3. The first 277 columns of this design form the design in [32]. The sizes, coherence values and numbers of balanced columns of the first several designs constructed by the proposed method are given in Table 2.
Example 3. Let D 0 be the 30 × 59 design from Lin [18], where an additional column 1 30 is added as the first column. This design has μ = 0.2. The sizes, coherence values and numbers of the balanced columns of the first several designs constructed by the proposed method are given in Table 3. This example illustrates that our method can use a design D 0 with μ < 1/3.
Whenever necessary, one can select a subset of columns of the constructed designs for use. Selecting a subset of columns can leave the coherence unchanged or decrease it, but it can never increase the coherence. For example, for a given p, one can first select as many balanced columns as possible; if there are not enough balanced columns, one can then add unbalanced columns to reach p columns. In addition, one may also select the columns according to an additional criterion such as E(s²).
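The balanced-columns-first selection rule can be sketched as follows (a Python illustration with our own function name; it assumes, as in the paper, that the first column is the intercept):

```python
import numpy as np

def select_columns(D: np.ndarray, p: int) -> np.ndarray:
    """Keep the intercept, then balanced columns, then unbalanced ones, up to p columns."""
    col_sums = D.sum(axis=0)
    balanced = [j for j in range(1, D.shape[1]) if col_sums[j] == 0]
    unbalanced = [j for j in range(1, D.shape[1]) if col_sums[j] != 0]
    keep = ([0] + balanced + unbalanced)[:p]
    return D[:, keep]

# Toy 4 x 4 design: columns 1 and 2 are balanced, column 3 is not.
D = np.array([[1,  1,  1,  1],
              [1,  1, -1,  1],
              [1, -1,  1,  1],
              [1, -1, -1, -1]])
S = select_columns(D, 3)
assert np.array_equal(S, D[:, [0, 1, 2]])   # intercept plus the two balanced columns
```

Because a column subset can only remove pairs from the maximum in the coherence definition, the coherence of the selected design never exceeds that of the full design.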

Generalization
We now generalize the proposed method to obtain supersaturated designs with μ smaller than 1/3. Recall that our method partitions D_0 into two parts: the upper part U has 6mr rows and the lower part L has 6m(1 − r) rows, with partition ratio r = 2/3. We generalize the original construction by using a different partition ratio and stopping at Step 2. Suppose that D_0 is an n × p two-level supersaturated design with μ ≤ t/n for t ∈ {2, 4, . . . , n/2} and even n. Partition D_0 as in (2.1), where U has 2t rows and L has n − 2t rows, so that r = 2t/n. Define D_2 as in (2.2). Then D_2 is the supersaturated design constructed by our generalized method. It can be proved that the coherence of the D_2 constructed here is no greater than t/n by Theorem 2 in Appendix A.
The generalization indicates the possibility of expanding any two-level supersaturated design with μ ≤ 1/2 and an even number of runs while retaining its coherence. The original construction requires coherence no greater than 1/3 and an initial design with 6m runs, but it expands an initial design with p columns to a design with 4p columns and μ ≤ 1/3. The generalization only requires coherence no greater than 1/2 and an even number of runs, but it expands an initial design with p columns only to a design with 2p columns and the same coherence bound. In addition, the number of balanced columns can be fewer than the number of runs for the designs constructed by the generalization. An example is given in Table 4.
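The generalized construction (different partition ratio, stopping at Step 2) can be sketched in the same way. The block arrangement below follows our reading of display (2.2); the Hadamard-based starting design is our own choice for illustration, since any orthogonal design trivially satisfies μ = 0 ≤ t/n:

```python
import numpy as np

def coherence(X):
    G = np.abs(X.T @ X).astype(float)
    np.fill_diagonal(G, 0.0)
    return G.max() / X.shape[0]

def op_star(A):                                  # A*: flip odd rows (1-based)
    B = A.copy(); B[0::2] *= -1; return B

def expand_general(D0: np.ndarray, t: int) -> np.ndarray:
    """n x p design with mu <= t/n (n even) -> 2n x 2p design with mu <= t/n."""
    U, L = D0[:2 * t], D0[2 * t:]                # partition ratio r = 2t/n
    C1 = np.vstack([U, U, L, L])
    C2 = np.vstack([U, op_star(U), L, -L])
    return np.hstack([C1, C2])                   # D_2 as in (2.2)

# An 8 x 8 Sylvester Hadamard matrix: any two distinct columns are orthogonal,
# so mu(D0) = 0 <= t/n for every admissible t.
H2 = np.array([[1, 1], [1, -1]])
H8 = np.kron(np.kron(H2, H2), H2)
D2 = expand_general(H8, t=2)
assert D2.shape == (16, 16)
assert coherence(D2) <= 2 / 8 + 1e-12            # the t/n guarantee of Theorem 2
```

The resulting 16 × 16 design keeps coherence at most t/n = 1/4; a truly supersaturated starting design with p > n would be used in practice.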

SIMULATION STUDY
In this section, we compare the proposed designs with four popular classes of supersaturated designs for the Lasso problem by simulations: Lin's designs [18], Wu's designs [32], the Bayesian D-optimal supersaturated designs (Jones, Lin, and Nachtsheim 2008) and the UE(s²)-optimal designs [15]. These designs are denoted by "LIN", "WU", "BAYES" and "JM", respectively, and our proposed designs are denoted by "Proposed".
Lin's designs are constructed from the Plackett and Burman design [23]. According to Bulutoglu and Cheng [4], Lin's designs are E(s²)-optimal among all balanced two-level supersaturated designs. Wu's designs can be obtained by appending interaction columns to a Hadamard matrix; Bulutoglu and Cheng [4] showed that in certain cases, Wu's designs are E(s²)-optimal among all balanced two-level supersaturated designs. The Bayesian D-optimal supersaturated design is obtained by the coordinate exchange algorithm [9,14]. Any n rows of a p × p Hadamard matrix form a type of UE(s²)-optimal design [15,5].

We simulate data from the linear model in (1.1) with different designs and active coefficients, where the error vector follows the standard multivariate normal distribution N_n(0, I). We use the Lasso to select the active factors and fix the intercept term β_1 as active. We repeat the data generation and variable selection procedure N = 300 times. We use the "cv.glmnet" function [12] in the R software [24] and calculate β̂ with "s=lambda.min".
We conduct simulations using the eight active coefficient settings in Table 5 and denote them as Case 1, . . . , Case 8, respectively. We use "#Active" to denote the number of active factors. For each group of active coefficients, we consider both the case where they take large values and the case where they take small values. For example, for β_1, β_21 and β_23, we consider both the case where they take the values 1, 5 and 10, and the case where they take the values 0.1, 1 and 1.5. We let the regression coefficients of all inactive factors be zero and omit them from Table 5. We neither use a single setting of active coefficients for all comparisons, nor use all settings in Table 5 for each comparison, for the following reasons. First, the designs under consideration have different numbers of columns and some designs do not have enough columns for a given setting, which makes the setting inapplicable. Second, in practice, the more factors under consideration, the more active factors there tend to be, so we use settings with more active coefficients for designs with large numbers of columns, and settings with fewer active coefficients for designs with small numbers of columns.
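The paper's simulations use "cv.glmnet" in R with s=lambda.min; as a language-neutral illustration of the protocol, here is a minimal numpy-only coordinate-descent Lasso run on one simulated data set. The random ±1 design, the fixed penalty λ and the seed are our own choices for the sketch, not the paper's:

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=200):
    """Coordinate-descent Lasso for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]          # residual excluding column j
            z = X[:, j] @ r / n
            b[j] = np.sign(z) * max(abs(z) - lam, 0.0) / (X[:, j] @ X[:, j] / n)
    return b

rng = np.random.default_rng(0)
n, p = 20, 40                                       # p > n: a supersaturated setting
X = rng.choice([-1, 1], size=(n, p))
X[:, 0] = 1                                         # intercept column of all +1's
beta = np.zeros(p)
beta[[0, 5, 10]] = [1.0, 5.0, 10.0]                 # one "large values" setting
y = X @ beta + rng.standard_normal(n)               # data from model (1.1)
b_hat = lasso_cd(X, y, lam=0.5)
selected = np.flatnonzero(b_hat)                    # indices of discovered coefficients
assert 10 in selected                               # the strongest effect is recovered
```

In the actual study this generation-and-selection step is repeated N = 300 times per design and setting, with λ chosen by cross-validation rather than fixed.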
For each design in our simulations, we will show its size, coherence, and number of balanced columns. Although E(s 2 ) is not our focused criterion, we will also show the E(s 2 ) of each design for reference. We use the following four criteria to compare variable selection and model fitting accuracy of each design:

The four criteria are the Average False Discovery Rate (AFDR), the Average Miss Rate (AMR), the Mean Squared Error (MSE) and the EME.
Smaller values of AFDR, AMR, MSE and EME are desirable. Note that AFDR and AMR are computed based on the number of discovered or active coefficients, not the number of discovered or active factors.
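Assuming the standard definitions of the false discovery rate and the miss rate over discovered and active coefficients, a single-replicate computation might look like the following sketch (function name and definitions are our assumptions; averaging over replicates yields AFDR and AMR):

```python
def fdr_mr(selected: set, active: set):
    """One replicate: false discovery rate and miss rate over coefficient indices.
    Assumed definitions: FDR = |selected \ active| / max(|selected|, 1),
                         MR  = |active \ selected| / |active|."""
    fdr = len(selected - active) / max(len(selected), 1)
    mr = len(active - selected) / len(active)
    return fdr, mr

# Example: active coefficients {1, 21, 23}; the Lasso selected {1, 21, 40}.
fdr, mr = fdr_mr({1, 21, 40}, {1, 21, 23})
assert (fdr, mr) == (1 / 3, 1 / 3)   # one false discovery, one missed coefficient
```

A small MR (and hence AMR) indicates that few active coefficients are missed, which Section 3 argues is the more important property in screening.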

Comparison with Lin's Design
We obtain the 12 × 27 proposed design by selecting the intercept, the 24 balanced columns and the first two unbalanced columns of the 12 × 64 design in Example 1. We use Lin's 14 × 27 design directly. The result is shown in Table 6, where, as in the other tables below, the smaller values of the four criteria are shown in boldface. Table 6 indicates that the proposed design outperforms Lin's design in terms of variable selection and parameter estimation with the Lasso. In addition, the proposed design has fewer runs than Lin's design.
To verify that this result is not due to some better property of the selected active factors in one design versus another, we also conducted a follow-up comparison with randomly selected active factors. More specifically, for a given setting in Table 5, we randomly select #Active active factors with the same values given in the setting, and repeat this 10 times to obtain 10 sets of active factors. Then, for each set of randomly selected active factors, we repeat the data generation and variable selection procedure N = 30 times for the proposed design and Lin's design. The final criteria, averaged over the 10 sets of active factors, can be found in Table 7. According to Table 7, the proposed design still generally outperforms Lin's design, even with randomly selected active factors. To save computational cost and keep the paper concise, we omit such verification for the subsequent comparisons.

Comparison with Wu's Designs
We use the 12 × 64 and 24 × 256 proposed designs in Example 1 directly. We obtain the 48 × 300 proposed design by selecting the intercept and the first 299 balanced columns of the 48 × 8192 design in Example 2. We obtain Wu's 12 × 64, 24 × 256 and 48 × 300 designs by selecting the first 64 columns of Wu's 12 × 67 design, the first 256 columns of Wu's 24 × 277 design and the first 300 columns of Wu's 48 × 1129 design, respectively. The result shown in Table 8 indicates that the proposed designs are generally better than Wu's designs in terms of variable selection and parameter estimation with the Lasso: in almost all cases the proposed designs perform better, while in the first two cases Wu's designs perform relatively better.

Comparison with the Bayesian D-Optimal Supersaturated Designs
We use the 12 × 64 proposed design in Example 1 directly. We obtain the 24 × 253 proposed design by selecting the intercept, the 168 balanced columns and the first 84 unbalanced columns of the 24 × 256 design in Example 1. We first search for a 12 × 64 and a 24 × 256 Bayesian D-optimal supersaturated design with the JMP® software [13] using 500 random starting designs. We use the 12 × 64 design directly. Since the searched 24 × 256 Bayesian D-optimal supersaturated design has μ = 0.83, we remove its 22nd, 134th and 209th columns to obtain the 24 × 253 design used in the simulations. The result shown in Table 9 indicates that the proposed designs generally outperform the Bayesian D-optimal supersaturated designs in terms of variable selection and parameter estimation with the Lasso.

Comparison with the UE(s²)-Optimal Designs
According to Jones and Majumdar [15], for given n and p there are a large number of UE(s²)-optimal designs. In practice, one may randomly generate a single UE(s²)-optimal design and use it to screen important factors. Our comparison mimics this scenario: we ask whether a practitioner has a higher than 0.5 chance of getting a worse design (compared with ours) when generating a UE(s²)-optimal design in this way. Therefore, we randomly generate 500 UE(s²)-optimal designs and, for each of them, repeat the simulation procedure N = 30 times (rather than N = 300 as in the comparisons with the other competing designs). We take the median values of the four criteria introduced at the beginning of the section over the 500 randomly generated designs and compare these medians with the corresponding criteria of the proposed design. If the AFDR of a proposed design is less than the median AFDR (MAFDR) of the 500 UE(s²)-optimal designs, then the AFDR of the proposed design is less than that of a randomly generated UE(s²)-optimal design with probability greater than one half. The coherence, E(s²) and number of balanced columns reported in the table are also the corresponding medians. In addition, a generated UE(s²)-optimal design may have μ = 1; we avoid using such designs in the simulations because, in practice, a practitioner who obtains a bad design with μ = 1 would probably regenerate another one.
We use the 12 × 64 and 24 × 256 proposed designs in Example 1 directly. We obtain the UE(s²)-optimal designs by randomly sampling 12 rows of a 64 × 64 Hadamard matrix 500 times and 24 rows of a 256 × 256 Hadamard matrix 500 times, respectively. The result shown in Table 10 indicates that, with probability greater than one half, the proposed designs outperform the UE(s²)-optimal designs in terms of variable selection and parameter estimation with the Lasso. Furthermore, one may want to see the distributions of the criteria over the 500 UE(s²)-optimal designs; we take Case 8 as an example and show these distributions in Figure 1.
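Generating one such competing design amounts to sampling rows of a Hadamard matrix. A short Python sketch (the Sylvester construction and the seed are our choices for illustration):

```python
import numpy as np

def sylvester_hadamard(k: int) -> np.ndarray:
    """2^k x 2^k Hadamard matrix via repeated Kronecker products."""
    H = np.array([[1]])
    H2 = np.array([[1, 1], [1, -1]])
    for _ in range(k):
        H = np.kron(H, H2)
    return H

rng = np.random.default_rng(1)
H = sylvester_hadamard(6)                        # 64 x 64, satisfies H'H = 64 I
assert np.array_equal(H.T @ H, 64 * np.eye(64))
rows = rng.choice(64, size=12, replace=False)    # any 12 rows give a UE(s^2)-optimal
D = H[rows, :]                                   # 12 x 64 design per [15, 5]
assert D.shape == (12, 64)
```

Repeating the row sampling 500 times with fresh random draws yields the collection of designs whose median criteria are reported in Table 10.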

Summary of Comparisons
To show how the proposed design performs relative to competing designs of the same size and under the same active coefficient setting, we collect all comparisons between designs with exactly the same size and active coefficient setting in Table 11. Lin's design is not of size 12 × 64, so it is not included in the table (it is in fact impossible to construct a design of this size with Lin's method).
According to Table 11, the proposed design significantly outperforms the Bayesian D-optimal design and the UE(s²)-optimal design. In this scenario, Wu's design performs slightly better than the proposed design, but there are also scenarios where the proposed design performs better than Wu's design (see Table 8).
In summary, the proposed design is generally comparable to Wu's design, because the two have the same coherence. Compared with Lin's design, the Bayesian D-optimal design and the UE(s²)-optimal design, the proposed design generally performs better, except that it sometimes tends to select more factors and gives a larger AFDR. However, the proposed design always gives a smaller AMR than these three designs. In practice, a small AMR is more important than a small AFDR, because missing active factors is a more serious problem than falsely discovering inactive factors: follow-up experiments can screen out spurious factors, but active factors removed in the screening phase can never be recovered [21].

CONCLUSION AND DISCUSSION
We have proposed a method for constructing supersaturated designs with small coherence. The constructed designs are allowed to be unbalanced to achieve more flexible sample sizes. The proposed method uses direct constructions and it entails no intensive computing even for large p. Since the proposed method can efficiently construct a supersaturated design with a large number of columns, it can be applied to high-dimensional variable selection problems in marketing, biology, engineering and other areas.
Here are possible directions for future work. First, the proposed method can expand a 6m × p design D_0 with μ ≤ 1/3 to a larger design with μ ≤ 1/3. One may be interested in finding a special D_0 such that the expanded design has coherence strictly smaller than 1/3, or in using a different construction method to reduce the coherence upper bound. Second, beyond the Lasso, the proposed design can be applicable to other variable selection and high-dimensional problems where controlling the coherence of the design matrix is important. Moreover, many new powerful variable selection techniques have emerged. For example, Fan and Lv [10] and Fan, Feng and Song [11] proposed sure independence screening and nonparametric independence screening for ultrahigh dimensional variable selection. Another example is best subset regression: Shen, Pan, Zhu and Zhou [25] provided theoretical support for its use, and Bertsimas, King and Mazumder [1] proposed a mixed integer optimization approach for the best subset selection problem, in which coherence plays a role in parameter specification. In future work, the proposed designs will be applied to other penalized variable selection methods, with results reported elsewhere.

APPENDIX A

Lemma 1.

For a two-level matrix A with an even number of rows, the columns of A*, A** and −A have the same absolute column correlations as the columns of A.

Lemma 2.

The coherence of D_1 constructed in Section 2.2 is no greater than 1/3.

Proof.

Throughout, for two columns d_i and d_j of a two-level design D with n rows, we call |d_i′d_j|/n their absolute column correlation. Since D_1 is obtained from two copies of D_0, the absolute column correlations of any two columns of D_1 are the same as those of D_0. Thus, the coherence of D_1 equals that of D_0, which is no greater than 1/3.

Lemma 3.
The coherence of D_2 constructed in Section 2.2 is no greater than 1/3.

Proof.
Let d^(i) denote the ith column of D_2 and d^(i)_j denote the jth entry of d^(i). Write D_2 = [C_1, C_2], where C_1 and C_2 are 12m × p matrices.
Pick two arbitrary columns of D 2 .
(i) Suppose that both of them are from C 1 or C 2 . By Lemma 2, μ(C 1 ) is ≤ 1/3. By Lemmas 1 and 2, μ(C 2 ) is ≤ 1/3. Thus, the absolute column correlation of the two columns is ≤ 1/3. (ii) Suppose that one of the two columns is the ith column d (i) of C 1 and the other column is the jth column d (p+j) of C 2 with 1 ≤ i, j ≤ p. Partition the two columns as where A and B have 8m and 4m rows, respectively. By Step 2 in Section 2.2, Since the absolute dot products of the two columns in these blocks are 0, the absolute dot product of the two columns in B is 0. Thus, the absolute dot product of d (i) and d (p+j) is ≤ 4m and their absolute column correlation is ≤ 4m/12m = 1/3.

Theorem 1.
The coherence of D_3 constructed in Section 2.2 is no greater than 1/3.

Proof.
Let d^(i) denote the ith column of D_3 and d^(i)_j denote the jth entry of d^(i). Write D_3 = [C_1, C_2, C_3, C_4], where C_1, C_2, C_3 and C_4 are 12m × p matrices. Pick two arbitrary columns of D_3.
(i) Suppose that both of them are from [C_1, C_2] or both are from [C_3, C_4]. By Lemma 3, the coherence of [C_1, C_2] is ≤ 1/3. By Lemmas 1 and 3, the coherence of [C_3, C_4] is ≤ 1/3. Thus, the absolute column correlation of these two columns is ≤ 1/3.

(ii) Suppose that one of the two columns is the ith column d^(i) of C_1 and the other is the jth column d^(2p+j) of C_3 with 1 ≤ i, j ≤ p. Partition these columns as

d^(i) = [a_1; b_1],   d^(2p+j) = [a_2; b_2],

where the A parts have 8m rows and the B parts have 4m rows. By Step 3 in Section 2.2,