1 Introduction
In the modern computer age, deep learning has become a powerful tool for big data analysis, using sophisticated models to learn hidden patterns or intricate relationships among a large number of variables from monumental amounts of complex and challenging data. The performance of a deep learning technique, however, depends heavily on hyperparameters – parameters that control the learning process and whose values are set by the user in advance of training the learning model. An inappropriate hyperparameter setting can result in poor performance of the learning process [13]. Not only do the relevant hyperparameters vary from algorithm to algorithm, architecture to architecture, and dataset to dataset, but the most important ones can also differ across domains [3, 43]. These observations call for investigations of hyperparameter optimization.
Generally, work in the literature on hyperparameter optimization falls under either a model-free or a model-based framework. Model-based hyperparameter optimization tunes hyperparameters by finding the best possible approximation to the true learning algorithm, while model-free methods approach the optimization problem without making any parametric assumptions. Relevant work on model-based approaches can be found in [4, 9, 19, 26, 30, 34, 39, 46, 51]. There has also been extensive work under the model-free framework – see, for example, manual search, grid search, random search [3] and the orthogonal array tuning method [50].
For more sophisticated learning algorithms, however, these hyperparameter optimization approaches, whether model-free or model-based, may be too computationally expensive. This draws our attention to exploring, as expeditiously as possible, the relationship between the hyperparameters and a response that measures the performance of models learned by an algorithm under different hyperparameter combinations, using data collected by testing a minimal number of such combinations. This is precisely a primary goal of design and analysis of experiments: executing efficient experiments to collect informative data for studying the relationship between multiple input variables and an output variable. As such, it is natural to consider design strategies to efficiently collect data for exploring the performance surface of a hyperparameter-controlled learning algorithm, which raises the question of which design to use for this purpose, as diverse designs are available in the experimental design literature.
In this paper, we investigate the above question by comparing different types of designs in terms of their ability to efficiently collect informative data for studying how hyperparameters influence the performance of the learning algorithm they control. To keep the computational burden manageable, we use a deep neural network model with three hidden layers and carry out the comparison on a popular dataset, the MNIST dataset. Five hyperparameters and various factorial and space-filling designs for selecting hyperparameter combinations are considered. We choose the Kriging model to describe the complex relationship between the hyperparameters of the learning algorithm and the test accuracy that measures its performance. The results show that the 32-run strong orthogonal array outperforms all other comparable designs.
This paper is organized as follows. In Section 2, we provide preparation for the comparison, reviewing (deep) neural networks, hyperparameters, experimental designs and the Kriging model. Section 3 compares various designs by applying them to the MNIST dataset. We end this paper with some discussion in Section 4.
2 Preliminaries
Suppose that $\mathcal{A}$ is a machine learning algorithm with k hyperparameters. Let ${\Lambda _{i}}$ be the domain of the i-th hyperparameter and $\boldsymbol{\Lambda }={\Lambda _{1}}\times \cdots \times {\Lambda _{k}}$. We write the values of a combination of the k hyperparameters as a vector x. For any $x\in \boldsymbol{\Lambda }$, the learning algorithm based on the hyperparameter setting x is denoted by ${\mathcal{A}_{x}}$. Given a set of data D, we denote a measure of the performance of a model learned by algorithm ${\mathcal{A}_{x}}$ on data D by y. One commonly used performance measure is test accuracy, which assesses how accurately the model trained on the training dataset predicts the observed labels of the test data. The higher the test accuracy, the better the performance of the algorithm.
In order to find an appropriate hyperparameter combination, it is desirable to explore the relationship between the hyperparameter combination x and the test accuracy y. Because of the complexity of the algorithm ${\mathcal{A}_{x}}$, obtaining the accuracy y is computationally expensive. Therefore, a design strategy is needed to efficiently collect informative data for building a statistical model that specifies the relationship between x and y. Our goal here is to compare different types of experimental designs for this job.
2.1 Neural Networks and Deep Neural Networks
Neural networks [31], also known as artificial neural networks, are among the most well-known algorithms used to recognize patterns and solve common problems in statistics, data science and machine learning. A neural network is comprised of an input layer, one or more hidden layers and an output layer, with several units called neurons in each layer. Neurons in the input layer bring the initial data into the network for further processing, while the result of the network is produced by neurons in the output layer. Each neuron in a hidden layer receives the inputs produced by neurons in the preceding layer and passes its output to neurons in the next layer. Neural network models, in general, are very flexible, which, on the other hand, makes constructing the best network structure challenging. See [29] for more details on neural networks.
Nowadays, problems such as image classification, object detection and natural language processing become increasingly complex, which calls for deep neural networks. A deep neural network refers to an artificial neural network with more than one hidden layer [2, 37]. As an example, consider a deep neural network with three hidden layers, shown in Figure 1. The number of neurons in the hidden layers is one of the common hyperparameters of interest. The number of neurons in the input layer equals the number of features in the data, while the number of neurons in the output layer depends on the type of problem being solved. For example, a regression problem typically uses one neuron in the output layer, and for a multi-class classification problem a common choice is one neuron per class.
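For illustration, the following minimal Keras sketch builds such a three-hidden-layer network for $28\times 28$ grayscale inputs and 10 output classes. The ReLU activations, the placement of a dropout layer after each hidden layer and the use of a common width for all three hidden layers are illustrative assumptions rather than details taken from Figure 1.

```python
import tensorflow as tf

def build_network(n_neurons, dropout_rate):
    """A three-hidden-layer network for 28x28 grayscale images and 10 classes."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28)),
        tf.keras.layers.Flatten(),                             # 784 input neurons
        tf.keras.layers.Dense(n_neurons, activation="relu"),   # hidden layer 1
        tf.keras.layers.Dropout(dropout_rate),
        tf.keras.layers.Dense(n_neurons, activation="relu"),   # hidden layer 2
        tf.keras.layers.Dropout(dropout_rate),
        tf.keras.layers.Dense(n_neurons, activation="relu"),   # hidden layer 3
        tf.keras.layers.Dropout(dropout_rate),
        tf.keras.layers.Dense(10, activation="softmax"),       # one output neuron per class
    ])
```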
2.2 Hyperparameters
Hyperparameters for neural networks are variables whose values determine the network structure and/or the way the network is trained. The number of neurons in the hidden layers of Figure 1 is an example of a hyperparameter. Typically, a neural network requires us to determine values for a set of different hyperparameters including, but not limited to, learning rate, batch size, number of layers, dropout rate, number of epochs, number of training iterations, normalization, pooling, momentum and weight initialization. These hyperparameter settings drastically affect the success and accuracy of the network. As a consequence, finding an appropriate hyperparameter combination is increasingly important.
2.3 Design of Experiments
Design of experiments is a systematic and efficient methodology that allows scientists and engineers to explore the relationship between multiple input variables and output variables. Traditional physical experiments favor various factorial designs for studying systems or processes [45]. These designs are no longer adequate, however, when the systems or processes become complex, which calls for computer experiments and for space-filling designs, a family of designs appropriate for computer experiments [10, 36].
Factorial Designs At the early stage of an investigation, due to the lack of sufficient prior knowledge, the experimenter conducts a screening experiment involving as many variables as possible, with the primary goal of identifying the important factors among them using a first-order model. Two-level factorial designs [7, 33], thanks to their simple structure and nice statistical properties, are commonly used for this purpose. The experimenter often proceeds to the next stage by capturing the curvature of the response surface, which requires three-level factorial designs or composite designs such as orthogonal array composite designs (OACDs) [48].
Space-Filling Designs A space-filling design refers to a design that scatters its points evenly over the whole design region. It is tremendously difficult to find a space-filling design that provides good coverage of the entire input space, especially when the input space is high-dimensional. Instead, it is natural and reasonable to consider a design that enjoys the space-filling property in low-dimensional projections. This idea dates back to Latin hypercube designs (LHDs), proposed by [32]. Such designs are orthogonal arrays of strength one and achieve the maximum space-filling property in all univariate projections: if the range of each input variable is divided into as many equally spaced intervals as there are runs, every interval contains exactly one observation. The space-filling property of an LHD may be further evaluated by other optimality criteria such as orthogonality [5], a distance criterion [20] and a discrepancy criterion [11]. There has been extensive work along this line – see, for example, [12, 21, 27].
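The defining one-dimensional stratification property is easy to verify numerically. The sketch below, which assumes SciPy's qmc module rather than the R packages used later in this paper, generates a random LHD and checks that each equally spaced interval of every coordinate contains exactly one point.

```python
import numpy as np
from scipy.stats import qmc

# A random 8-run LHD in 2 dimensions on [0, 1)^2.
sampler = qmc.LatinHypercube(d=2, seed=1)
design = sampler.random(n=8)

# Each of the 8 equal-width intervals of every coordinate holds exactly one point.
for j in range(design.shape[1]):
    strata = np.floor(design[:, j] * 8).astype(int)
    assert np.array_equal(np.sort(strata), np.arange(8))
```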
Strong orthogonal arrays (SOAs), introduced and studied by [17], are another class of space-filling designs in the literature that focus on low-dimensional projections. An SOA of strength t achieves space-filling properties in all $g\le t$ dimensions. Such an array does as well as a comparable orthogonal array of strength t in all t dimensions but is more space-filling in all $1\le g\le t-1$ dimensions. It can also be turned into an LHD through level expansion to achieve the maximum space-filling property in all one-dimensional projections. As a result, this SOA-based LHD enjoys better space-filling properties than a comparable ordinary LHD in all $2\le g\le t$ dimensions. More developments on SOAs can be found in [15, 18, 28, 38, 42].
2.4 Kriging Model
The Kriging model originated in the geosciences [23] and is now popular in computer experiments for building surrogates of complicated computer models. In a Kriging model, the response is regarded as a realization of a Gaussian process, and the predicted response at a target point can be represented as a weighted average of the responses at the observed points. For more details on Kriging models, we refer to [8, 14, 22, 35, 44].
As most learning algorithms are stochastic, instead of using the Universal Kriging model for the deterministic case in computer experiments, we consider the Universal Kriging model with a noise term,
\[ y(x)={\sum _{i=1}^{m}}{\beta _{i}}{f_{i}}(x)+Z(x)+\epsilon ,\hspace{1em}(2.1)\]
where ${\textstyle\sum _{i=1}^{m}}{\beta _{i}}{f_{i}}(x)$ is the mean function with ${f_{i}}$ being the ith basis function and ${\beta _{i}}$ the ith coefficient, $Z(x)$ is a second-order stationary random process with constant mean 0 and covariance $\operatorname{Cov}[Z(u),Z(v)]={\sigma ^{2}}{\textstyle\prod _{i=1}^{k}}R\left(\left|{u_{i}}-{v_{i}}\right|\right)$, where $u=({u_{1}},\dots ,{u_{k}})$ and $v=({v_{1}},\dots ,{v_{k}})$ are two points (runs), $R(\cdot )$ is a spatial correlation function and ${\sigma ^{2}}$ is the variance of the random process, and $\epsilon \sim N(0,{\tau ^{2}})$ is independent of $Z(x)$. In this study, we adopt the Ordinary Kriging model with a noise term, where the mean function in equation (2.1) is a constant. We choose the Matern correlation function with parameter $\nu =5/2$, based on the conclusion of [47] that this correlation function provides a good balance between differentiability and smoothness. Its explicit form is
\[ R({h_{i}};{\theta _{i}})=\left(1+\frac{\sqrt{5}{h_{i}}}{{\theta _{i}}}+\frac{5{h_{i}^{2}}}{3{\theta _{i}^{2}}}\right)\exp \left(-\frac{\sqrt{5}{h_{i}}}{{\theta _{i}}}\right),\]
where ${h_{i}}=|{u_{i}}-{v_{i}}|$ and ${\theta _{i}}$ is an unknown correlation parameter that can be estimated by maximum likelihood.
3 Comparison Study Based on MNIST Dataset
Due to computational limitations, we compare the performance of various designs by exploring the test accuracy surface over the hyperparameters of a deep neural network model with three hidden layers, as shown in Figure 1, applied to a manageable dataset, MNIST.
3.1 Setups
MNIST Dataset The MNIST dataset (Modified National Institute of Standards and Technology database) is a subset of a larger dataset available at NIST, the National Institute of Standards and Technology, and has served as a canonical training dataset for many learning techniques and pattern recognition methods. It is a dataset of handwritten digits, first developed and released by [25], containing 70,000 grayscale images of the 10 digits, each $28\times 28$ pixels in width and height. Some examples are shown in Figure 2. These images come with labels and thus are used for supervised learning tasks. MNIST is user-friendly, as its images have been split into a training set of 60,000 images and a test set of 10,000 images. In this paper, we further hold out part of the training set as a validation set: each time, we randomly set aside 15% of the full training set for validation.
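As a minimal sketch of the data setup, the following code loads MNIST through Keras and randomly holds out 15% of the training images for validation; rescaling the pixel values to $[0,1]$ and the random seed are illustrative assumptions not specified in the paper.

```python
import numpy as np
import tensorflow as tf

# Standard 60,000/10,000 train/test split of MNIST.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0   # assumed preprocessing

# Randomly hold out 15% of the full training set as a validation set.
rng = np.random.default_rng(seed=2024)              # hypothetical seed
idx = rng.permutation(len(x_train))
n_val = int(0.15 * len(x_train))
x_val, y_val = x_train[idx[:n_val]], y_train[idx[:n_val]]
x_tr, y_tr = x_train[idx[n_val:]], y_train[idx[n_val:]]
```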
Hyperparameters We consider five common hyperparameters: learning rate, number of epochs, batch size, dropout rate and number of neurons in a hidden layer. All five are numerical factors: learning rate and dropout rate are continuous, while number of epochs, batch size and number of neurons are discrete. The domain of each hyperparameter is given in Table 1.
Table 1
Domains of hyperparameters.
Learning Rate | Number of Epochs | Batch Size | Dropout Rate | Number of Neurons
[0.0001, 0.01] | [1, 32] | [16, 128] | [0.5, 0.8] | [32, 256] |
Designs Various factorial and space-filling designs with 5 columns, one per hyperparameter, are examined. More specifically, we use three factorials: the ${2^{5}}$ and ${3^{5}}$ full factorials and a 34-run OACD combining the ${2_{V}^{5-1}}$ factorial with a three-level 18-run orthogonal array. The two full factorials are generated using the R package DoE.base [16]. The 34-run OACD can be constructed using Table 2 in [48] and is given in the supplementary material. In addition to the three factorials, this study includes a 243-run random LHD generated using the R package lhs [6] and several 32-run space-filling designs: an SOA of strength three, its corresponding LHD, and four other types of LHDs – a random LHD, a maximin LHD, a maximum projection LHD and a uniform LHD – generated using the R packages lhs, SLHD [1], MaxPro [21] and UniDOE [49], respectively. Each column of these LHDs has 32 levels, while the SOA of strength three has 8 levels and is listed in the supplementary material. An LHD based on this SOA can be obtained by expanding its 8 levels to 32 levels following [17]. The two largest designs, the ${3^{5}}$ factorial and the 243-run random LHD, serve as benchmarks for the designs of small run sizes.
The design matrices of the above designs in natural units are provided in the supplementary material for reference, where the lowest and highest levels of each factor are those given in Table 1. We linearly interpolate the other levels within this range and, for discrete variables, round the interpolated levels to the nearest integers.
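A minimal sketch of this mapping is given below, assuming the design columns have been coded to $[0,1]$; designs coded on other scales would need to be rescaled to $[0,1]$ first.

```python
import numpy as np

# Lowest and highest levels from Table 1, in the order
# (learning rate, epochs, batch size, dropout rate, neurons).
low = np.array([0.0001, 1, 16, 0.5, 32])
high = np.array([0.01, 32, 128, 0.8, 256])
discrete = np.array([False, True, True, False, True])

def to_natural_units(coded):
    """Linearly interpolate coded levels in [0, 1] to natural units and
    round the discrete factors to the nearest integers."""
    coded = np.asarray(coded, dtype=float)
    natural = low + coded * (high - low)
    natural[:, discrete] = np.round(natural[:, discrete])
    return natural
```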
Data Collection All neural network implementations are built in TensorFlow (https://www.tensorflow.org). There is a stochastic component to the achieved accuracy, as images are shuffled for each model trained. The network fitted on the training dataset is then assessed on the test dataset of 10,000 images using test accuracy. The test accuracy is appended to the tables of design matrices with natural units in the supplementary material. For interested readers, we also record the cross-entropy loss of each hyperparameter combination in the column immediately preceding the test accuracy column in these tables.
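The evaluation of a single hyperparameter combination can be sketched as follows, reusing `build_network` and the data splits from the earlier sketches; the choice of the Adam optimizer is an assumption, as the paper does not specify the training algorithm.

```python
import tensorflow as tf

def evaluate_combination(lr, n_epochs, batch_size, dropout_rate, n_neurons):
    """Train one network for a given hyperparameter combination and return
    its test accuracy and test cross-entropy loss."""
    model = build_network(int(n_neurons), dropout_rate)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),  # assumed optimizer
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_tr, y_tr,
              epochs=int(n_epochs), batch_size=int(batch_size),
              validation_data=(x_val, y_val), shuffle=True, verbose=0)
    test_loss, test_acc = model.evaluate(x_test, y_test, verbose=0)
    return test_acc, test_loss
```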
3.2 Analysis and Results
We evaluate designs by considering their ability to collect informative data for building a statistical model that specifies the relationship between hyperparameters and test accuracy. In our comparison, the Kriging model is an appropriate choice, as it can compensate for the effects of data clustering and gives a better estimate of the prediction error. More specifically, we use the Ordinary Kriging model with a noise term. As test accuracy takes values in $(0,1)$, we apply the arcsine transformation to the test accuracy when building the model.
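A minimal sketch of such a surrogate is given below using scikit-learn's Gaussian process regressor with an anisotropic Matern(5/2) kernel plus a white-noise term; the arcsine square-root form of the transformation, and treating `normalize_y=True` as a stand-in for the constant mean of Ordinary Kriging, are assumptions of this sketch rather than details taken from the paper.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel

def fit_kriging(X_train, acc_train):
    """Fit a Gaussian-process surrogate with a Matern(5/2) correlation and a
    noise term to (arcsine-transformed) test accuracy."""
    z_train = np.arcsin(np.sqrt(acc_train))          # assumed arcsine square-root variant
    kernel = (ConstantKernel(1.0)
              * Matern(length_scale=np.ones(X_train.shape[1]), nu=2.5)
              + WhiteKernel(noise_level=1e-3))
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True,
                                  n_restarts_optimizer=10)
    gp.fit(X_train, z_train)
    return gp
```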
Essentially, the model is built using a set of data consisting of the hyperparameter combinations selected by a design under consideration and their corresponding accuracies; ideally, it would then be evaluated on data comprising as many diverse hyperparameter combinations as possible over the whole hyperparameter region together with their accuracies. This, however, is infeasible, especially for the continuous hyperparameters. In this paper, we consider two extreme cases, generating test data from the ${3^{5}}$ factorial and the 243-run random LHD. The random LHD enjoys the maximum space-filling property in all one-dimensional projections, while the ${3^{5}}$ factorial covers the entire 5-dimensional input space in a uniform fashion. The root mean square error (RMSE) values for predicting the mean test accuracy on these two designs are listed in Table 2, which also includes results on the hybrid design combining the ${3^{5}}$ factorial with the 243-run random LHD.
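Given a fitted surrogate, the test RMSE on a test design can be sketched as below; back-transforming the predictions to the accuracy scale before computing the RMSE is an assumption, as the paper does not state on which scale the RMSE is computed.

```python
import numpy as np

def test_rmse(gp, X_test, acc_test):
    """RMSE for predicting test accuracy on a held-out test design."""
    z_pred = gp.predict(X_test)
    acc_pred = np.sin(np.clip(z_pred, 0.0, np.pi / 2)) ** 2   # invert arcsin(sqrt(.))
    return float(np.sqrt(np.mean((acc_pred - acc_test) ** 2)))
```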
Table 2
Test RMSE values when testing on the ${3^{5}}$ factorial, the 243-run random LHD and their hybrid, respectively, where Maxpro LHD stands for Maximum Projection LHD. Columns 2–4 correspond to the testing designs.
Training Designs | ${3^{5}}$ Factorial | 243-run Random LHD | ${3^{5}}$ Factorial + 243-run Random LHD |
${2^{5}}$ factorial | 0.131 | 0.223 | 0.183 |
34-run OACD | 0.092 | 0.112 | 0.102 |
${3^{5}}$ Factorial | 0.035 | 0.090 | 0.068 |
32-run SOA | 0.118 | 0.068 | 0.097 |
32-run SOA LHD | 0.164 | 0.083 | 0.130 |
32-run Maximin LHD | 0.175 | 0.067 | 0.133 |
32-run Maxpro LHD | 0.148 | 0.072 | 0.116 |
32-run Uniform LHD | 0.200 | 0.076 | 0.151 |
32-run Random LHD | 0.160 | 0.097 | 0.132 |
243-run Random LHD | 0.144 | 0.027 | 0.104 |
When tested on the ${3^{5}}$ factorial, Table 2 shows that a three-level design, the 34-run OACD, outperforms all other small designs. This is expected because the 34-run OACD consists of 34 runs taken from the ${3^{5}}$ factorial; in other words, a subset of the test data is used to train the Kriging model in this case. Surprisingly, although the SOA is tested on a different type of design, the ${3^{5}}$ factorial, it clearly outperforms all other small designs except the OACD. Notably, both the OACD and the SOA have smaller RMSE than the 243-run LHD when tested on the ${3^{5}}$ factorial. Moreover, the model built on the ${3^{5}}$ factorial produces the smallest test RMSE, because in that case the test RMSE is in fact the training RMSE.
For the test data generated from the 243-run random LHD, the results in Table 2 clearly show that all factorials are worse than the space-filling designs and that, when the Kriging model is built on the 243-run random LHD, the test RMSE is the training RMSE and is thus the smallest. Though the maximin LHD produces a test RMSE as small as that of the SOA in this case, its performance when tested on the ${3^{5}}$ factorial is not as good as that of the SOA – the SOA performs exceedingly well compared to the other small space-filling designs.
Both the OACD and the SOA perform exceptionally well when tested on the same type of design, the ${3^{5}}$ factorial and the 243-run random LHD, respectively. But their performances when tested on the other type of design differ considerably. The SOA is still very competitive on the ${3^{5}}$ factorial, as it produces the smallest test RMSE among all comparable designs except the OACD. On the 243-run random LHD, however, the OACD performs worst except for the ${2^{5}}$ factorial. The SOA is a clear winner among all small designs we considered. The numerical results in the last column of Table 2 further support this conclusion. In other words, for the setting we considered, this SOA allows us to efficiently collect the most informative data for building a Kriging model that specifies the relationship between hyperparameters and test accuracy. The SOA of strength three we use is indeed optimal in some sense: Theorem 4 of [40] implies that it is optimal under the uniform projection criterion. Moreover, it enjoys better space-filling properties in all two dimensions than an ordinary SOA of strength three. Notably, the model built on the ${2^{5}}$ design produces an extremely large test RMSE when tested on the hybrid design, which implies that this two-level design is inadequate for building a Kriging model to capture the true test accuracy surface.
Figures 3 and 4 also support the conclusion that the SOA performs superbly. Figure 3 displays the density plots of the observed test accuracy values for all designs. Each density plot in the first two rows of Figure 3 corresponds to a design used to generate training data. The density curve of observed test accuracy values for the hybrid design combining the ${3^{5}}$ factorial and the 243-run random LHD is presented in the last plot of Figure 3. It is bimodal, with a higher peak within $[0.8,1]$ and a lower peak at values no more than 0.4. Clearly, only the SOA correctly captures this feature, while all others fail to do so. We also provide the density curves of observed test accuracy values for the ${3^{5}}$ factorial and the 243-run random LHD, respectively, in the last row for reference. Figure 4 gives the histograms of the test errors, the differences between the observed test accuracy values on the hybrid design and the corresponding predictions. The histograms for the LHDs are skewed to the left, which implies that the models based on these designs tend to overpredict the test accuracy on the ${3^{5}}$ factorial, while models from the ${2^{5}}$ factorial and the OACD tend to underpredict the test accuracy on the 243-run LHD, as their histograms are skewed to the right. Only the SOA is superior, because its histogram depicts an approximately normal distribution that is symmetric around 0.
Figure 4
Histograms of the test errors, i.e., the differences between the observed and the predicted test accuracy.
The significant advantage of the SOA in building a Kriging model prompts us to take a closer look at these small designs themselves. For each design, every column is rescaled to $[-1,1]$; we then calculate the Euclidean distance from each design point to the center of the design region and display a histogram with a density curve of these distances for each design. These plots are given in Figure 5. Histograms of the distances for the ${3^{5}}$ factorial, the 243-run random LHD and their hybrid design, provided in the last row of Figure 5, serve as benchmarks for comparison. Only the SOA captures the feature of the hybrid design that the density has a light lower tail but a heavy upper tail, while all other designs fail to do so. Remarkably, though a space-filling design is expected to scatter its points uniformly over the design region, the space-filling designs we considered seem not as space-filling as expected, because Figure 5 shows that they do not provide points that are close to the center or at the corners of the region.
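For concreteness, the distance computation can be sketched as follows; rescaling each column from its observed minimum and maximum is an assumption about how the coding to $[-1,1]$ is done.

```python
import numpy as np

def distances_to_center(design):
    """Rescale each column of a design to [-1, 1] and return the Euclidean
    distance of every design point from the center of the region."""
    design = np.asarray(design, dtype=float)
    lo, hi = design.min(axis=0), design.max(axis=0)
    scaled = 2 * (design - lo) / (hi - lo) - 1
    return np.sqrt((scaled ** 2).sum(axis=1))
```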
In addition to using the Kriging model to describe the relationship between hyperparameters and test accuracy, we also consider a second-order polynomial model. Although the findings are similar to those above, the second-order model is incapable of adequately capturing the test accuracy surface, as its test RMSE values are consistently larger than those obtained using the Kriging model for all cases listed in Table 2.
4 Conclusion and Discussion
This paper compared small designs for exploring the relationship between hyperparameters and test accuracy. We considered five numerical hyperparameters: learning rate, number of epochs, batch size, dropout rate and number of neurons in a hidden layer. Various factorials and space-filling designs for selecting combinations of these hyperparameters were examined. We evaluated the performance of a design by building a Kriging model that describes the relationship between hyperparameters and test accuracy. The comparison was based on the MNIST dataset, with a deep neural network model with three hidden layers as our learning algorithm. Under the settings considered, the comparison demonstrates that the 32-run SOA is the best choice for exploring the test accuracy surface over the hyperparameters we used.
To further explore the usefulness of SOAs in hyperparameter optimization, more investigations are needed. For example, we may set up simulations for evaluating the performance of various designs, similar to the simulation in [42], use other datasets for the implementation, such as the CIFAR-10 dataset [24], or consider broader comparison settings with more designs, e.g., uniform projection designs [41], and more hyperparameters, e.g., the number of training iterations, normalization and weight initialization. The present paper centers on examining the effect of various designs on the performance of hyperparameter tuning, while the primary goal of optimizing hyperparameters is to find an optimal hyperparameter combination that maximizes the overall performance of a learning algorithm. In follow-up work, one may wish to consider a model-based method that uses a carefully selected design in the initial step of the hyperparameter tuning process. Moreover, in this study, the SOA that outperforms all other comparable designs has 8 levels, while all other space-filling designs have 32 levels and the factorials have no more than 3 levels. It would be interesting to consider the question of how many levels a design should have for investigating the test accuracy surface over hyperparameters. We leave all these to future research.