1 Introduction
In many applications, both quantitative and qualitative responses are collected to evaluate the quality of a system, and the two types of responses are often mutually dependent. We call a system with both types of quality responses a quantitative-qualitative system. Such systems are widely encountered in practice [27, 26, 25]. In [25], the authors studied an experiment on the lapping stage of the wafer manufacturing process. The qualitative response is the conformity of the site total indicator reading (STIR) of the wafer, which has two possible outcomes: whether or not the STIR of a wafer is within tolerance. The quantitative response is the total thickness variation (TTV) of the wafer. [26] focused on birth records and examined the mutual dependency of birth weight and preterm birth. The birth weight of an infant is a quantitative outcome, and preterm birth is a binary indicator of whether an infant is born before 36 gestational weeks. The two outcomes are correlated, as an infant born preterm is usually underweight. In [27], two case studies of quantitative-qualitative systems, from materials science and gene expression, are illustrated. In the gene expression study, the qualitative response has three possible outcomes: healthy individuals, patients with Crohn's disease, and patients with ulcerative colitis.
This work is motivated by a study of an etching process in wafer manufacturing. In the production of silicon wafers, the silicon ingot is sliced into wafers to fine geometric tolerances. Inevitably, this step leaves scratches on the wafers' surfaces. An etching process is used to improve the surface finish, during which the wafers are submerged in a container of etchant for chemical reaction. The quality of the wafers after etching is measured by two response variables: the total thickness variation (TTV) of the wafer and the binary judgment of whether the wafer has cloudy stains in its appearance. The two responses measure quality from different but connected aspects, and there is a hidden dependency between the continuous TTV and the binary judgment of stains. To improve the etching quality, expensive experiments must be carried out to reveal this hidden dependency and to model the quality-process relationship. Therefore, an ideal experimental design for continuous and binary responses should be able to extract such useful information with an economical run size.
Classic design-of-experiments methods mainly focus on experiments with a single continuous response. Various methods have also been developed for a single discrete response, including [40, 44, 38, 43, 42]. For multiple responses, [11] proposed the seminal work for continuous responses, and [8] developed a design method for bivariate binary responses modeled by copula functions. For mixed types of responses, the literature is very scarce. A naive design method is to combine two designs that are separately constructed for each type of response. However, by ignoring the dependency between the two types of responses, such a naive strategy can be reasonable for one type of response but problematic for the other, as shown in Example 1.
Example 1.
Denote by Y and Z a continuous response and a binary response, respectively. Assume that the true model of the binary response Z is $\mathbb{E}(Z|x)=\pi (x)=\exp (1+x)/(1+\exp (1+x))$. The true model of Y is related to Z in the form $Y|Z=z\sim N(1-(1-z){x^{2}},0.{3^{2}})$. Thus, $\mathbb{E}(Y|Z=1)=1$ and $\mathbb{E}(Y|Z=0)=1-{x^{2}}$. Using the naive design method, a 14-point design is constructed, which consists of an 8-point local D-optimal design for the model of Z with $\log (\pi (x)/(1-\pi (x)))={\eta _{0}}+{\eta _{1}}x$, and a 6-point D-optimal design for linear regression with a quadratic model in x. Given the design, we generate the responses from the true models of Y and Z. Figure 1 (a)-(c) show the data $({x_{i}},{y_{i}},{z_{i}})$ from the 8-point design, the 6-point design, and their 14-point combination, respectively.
Figure 1
(a) Observations from design for Z; (b) Observations from design for Y; (c) Observations from the combined design. Dashed line “- - -” denotes $\mathbb{E}(Y|Z=1)=1$; solid line “—” denotes $\mathbb{E}(Y|Z=0)=1-{x^{2}}$; point “+” denotes $({x_{i}},{y_{i}})$ with ${z_{i}}=1$; point “o” denotes $({x_{i}},{y_{i}})$ with ${z_{i}}=0$.
In this example, there is a strong dependency between the two responses, since the true underlying models of $\mathbb{E}(Y|Z)$ are different for $Z=1$ and $Z=0$. In both single-response designs, shown in Figure 1 (a) and (b), the design points are balanced and reasonably distributed for the targeted response. However, since there are no Y observations with $Z=0$ at $x=1.0$, as shown in Figure 1 (c), the quadratic model for $Y|Z=0$ is not estimable. Clearly, the combined design is not suitable here. Note that this problem is not caused by outliers: all the points with $Z=1$ (marked "+") vary around $Y=1$, and the points with $Z=0$ (marked "o") vary around $Y=1-{x^{2}}$. In fact, $P(Z=0|x=1)=0.12$, which is relatively small, so a Y observation with $Z=0$ at $x=1.0$ is unlikely. A simple remedy is to add more replications at $x=1.0$, but it is not clear how many replications would be sufficient, and a direct solution is even harder to spot when the experiment gets more complicated.
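To make the replication question concrete, the following sketch computes $P(Z=0|x=1)$ under the true logistic model of Example 1, and how many replications at $x=1.0$ would be needed so that at least one $Z=0$ observation occurs with high probability. The target level 0.95 is an illustrative choice, not a value used in the paper.

```python
import math

def pi(x):
    """True success probability pi(x) = exp(1+x) / (1 + exp(1+x))."""
    t = math.exp(1.0 + x)
    return t / (1.0 + t)

# Probability of observing Z = 0 at the design point x = 1.
p0 = 1.0 - pi(1.0)

# With n independent replications at x = 1, the chance of seeing no
# Z = 0 at all is (1 - p0)^n.  To push the probability of at least one
# Z = 0 observation above an illustrative target of 0.95, we need:
target = 0.95
n_min = math.ceil(math.log(1.0 - target) / math.log(1.0 - p0))

print(round(p0, 3))   # 0.119
print(n_min)          # 24
```

So even though the naive combined design looks balanced, roughly two dozen replications at $x=1.0$ would be needed just to see the $Z=0$ branch there reliably.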
Such experiments call for new experimental design methods that account for both continuous and binary responses. Under the experimental design framework, a linear model is typically used for the continuous response and a generalized linear model (GLM) for the qualitative response, and a joint model must be developed to incorporate both. Compared to the classic design methods for linear models or GLMs, design for the joint model is more challenging in the following aspects. First, the design criterion for the joint model is more complicated, as the joint model is more complicated than the separate ones. Second, experimental design for a GLM is itself more difficult than that for a linear model, a difficulty naturally inherited by the design for the joint model. Third, efficient construction algorithms are needed to handle the complexity of the design criterion based on the joint model. [23] proposed an A-optimal design for experiments with both quantitative and qualitative responses, where the A-optimality was derived under the Bayesian framework proposed in [25]. Although [23] addressed the three challenges to a degree, the A-optimality is not a commonly used criterion. More importantly, only informative priors were considered, which circumvents some of the difficulties brought by noninformative priors on the parameters.
In this paper, we adopt the most commonly used D-optimality criterion and propose a novel Bayesian design method for continuous and binary responses. The proposed method handles both noninformative and informative priors. With noninformative priors, the Bayesian framework is equivalent to the frequentist approach, and in this case we also establish some regularity conditions on the experimental run sizes. With informative priors, we develop the D-optimal design using conjugate priors. The derived design criterion has meaningful interpretations in terms of the D-optimality criteria for the models of both the continuous and the binary responses. Moreover, we develop an efficient point-exchange algorithm to construct the proposed designs. The construction algorithm can be applied to settings more general than factorial designs.
The rest of the paper is organized as follows. Section 2 reviews the general Bayesian quantitative-qualitative (QQ) model and the optimal design criterion. The Bayesian D-optimal design criterion is derived using noninformative prior distributions in Section 3. In Section 4, the design criterion is derived with conjugate informative priors. Efficient algorithms for constructing optimal designs are elaborated in Section 5. One artificial example and the etching experimental design are shown in Section 6. Section 7 concludes this paper with some discussions.
2 General Bayesian QQ Model and Design
We first review the general Bayesian QQ model introduced in [25], focusing on the scenario where Y is a continuous response and Z is a binary response. The input variable $\boldsymbol{x}={({x_{1}},\dots ,{x_{p}})^{\prime }}\in {\mathbb{R}^{p}}$ contains p dimensions. Denote the data as $({\boldsymbol{x}_{i}},{y_{i}},{z_{i}})$, $i=1,\dots ,n$, where ${y_{i}}\in \mathbb{R}$ and ${z_{i}}\in \{0,1\}$. The vectors $\boldsymbol{y}={({y_{1}},\dots ,{y_{n}})^{\prime }}$ and $\boldsymbol{z}={({z_{1}},\dots ,{z_{n}})^{\prime }}$ are the vectors of response observations. To jointly model the continuous response Y and the binary response Z given $\boldsymbol{x}$, consider the joint probability of $Y|Z$ and Z. The conditional model of $Y|Z$ is assumed to be a linear regression model, while the model of Z is a logistic regression model. Specifically, we consider the joint model of Y and Z as follows,
(2.1)
\[\begin{aligned}{}Z& =\left\{\begin{array}{c@{\hskip10.0pt}l}1,& \hspace{2.5pt}\text{with probability}\hspace{2.5pt}\pi (\boldsymbol{x})\\ {} 0,& \hspace{2.5pt}\text{with probability}\hspace{2.5pt}1-\pi (\boldsymbol{x})\end{array}\right.\\ {} & \hspace{2.5pt}\text{with}\hspace{2.5pt}\pi (\boldsymbol{x},\boldsymbol{\eta })=\frac{\exp (\boldsymbol{f}{(\boldsymbol{x})^{\prime }}\boldsymbol{\eta })}{1+\exp (\boldsymbol{f}{(\boldsymbol{x})^{\prime }}\boldsymbol{\eta })},\end{aligned}\]
where $\boldsymbol{f}(\boldsymbol{x})={\left({f_{1}}(\boldsymbol{x}),\dots ,{f_{q}}(\boldsymbol{x})\right)^{\prime }}$ contains the q model effects, including the intercept, main, interaction, and quadratic effects, etc., and $\boldsymbol{\eta }={({\eta _{1}},\dots ,{\eta _{q}})^{\prime }}$ is a vector of coefficients. Conditioning on $Z=z$, the quantitative variable Y has the distribution
(2.2)
\[ Y|Z=z\sim N({\mu _{0}}+z\boldsymbol{f}{(\boldsymbol{x})^{\prime }}{\boldsymbol{\beta }^{(1)}}+(1-z)\boldsymbol{f}{(\boldsymbol{x})^{\prime }}{\boldsymbol{\beta }^{(2)}},{\sigma ^{2}}),\]
where ${\boldsymbol{\beta }^{(i)}}={({\beta _{1}^{(i)}},\dots ,{\beta _{q}^{(i)}})^{\prime }}$, $i=1,2$, are the corresponding coefficients of the model effects, ${\mu _{0}}$ is the mean, and ${\sigma ^{2}}$ is the noise variance. The conditional model (2.2) indicates that $Y|Z=1\sim N({\mu _{0}}+\boldsymbol{f}{(\boldsymbol{x})^{\prime }}{\boldsymbol{\beta }^{(1)}},{\sigma ^{2}})$ and $Y|Z=0\sim N({\mu _{0}}+\boldsymbol{f}{(\boldsymbol{x})^{\prime }}{\boldsymbol{\beta }^{(2)}},{\sigma ^{2}})$. We assume the same variance ${\sigma ^{2}}$ for the two conditional distributions; the design method developed in this paper can be easily adapted to the case with different variances.
The association between the two responses Y and Z is represented through the conditional model $Y|Z$. When the two linear models for $Y|Z=0$ and $Y|Z=1$ are different, i.e., ${\boldsymbol{\beta }^{(1)}}\ne {\boldsymbol{\beta }^{(2)}}$, it is important to take into account the influence of the qualitative response Z when modeling the quantitative response Y. Let $\boldsymbol{X}={({\boldsymbol{x}_{1}},\dots ,{\boldsymbol{x}_{n}})^{\prime }}$ be the $n\times p$ design matrix with ${\boldsymbol{x}_{i}}$ as the ith design point. Based on the joint model (2.1)–(2.2), we can express the sampling distributions as
\[\begin{aligned}{}\boldsymbol{y}|\boldsymbol{z},& {\boldsymbol{\beta }^{(1)}},{\boldsymbol{\beta }^{(2)}},{\mu _{0}},{\sigma ^{2}},\boldsymbol{X}\sim \\ {} & N({\mu _{0}}\mathbf{1}+{\boldsymbol{V}_{1}}\boldsymbol{F}{\boldsymbol{\beta }^{(1)}}+{\boldsymbol{V}_{2}}\boldsymbol{F}{\boldsymbol{\beta }^{(2)}},{\sigma ^{2}}{\boldsymbol{I}_{n}}),\end{aligned}\]
\[\begin{aligned}{}& \boldsymbol{z}|\boldsymbol{\eta },\boldsymbol{X}\sim \text{Bernoulli}(\pi ({\boldsymbol{x}_{i}},\boldsymbol{\eta }))\hspace{2.5pt}\text{for}\hspace{2.5pt}i=1,\dots ,n,\hspace{2.5pt}\text{and}\hspace{2.5pt}\\ {} & p(\boldsymbol{z}|\boldsymbol{\eta },\boldsymbol{X})\propto \exp \left\{{\sum \limits_{i=1}^{n}}\left({z_{i}}\boldsymbol{f}{({\boldsymbol{x}_{i}})^{\prime }}\boldsymbol{\eta }-\log (1+{e^{\boldsymbol{f}{({\boldsymbol{x}_{i}})^{\prime }}\boldsymbol{\eta }}})\right)\right\},\end{aligned}\]
where $p(\cdot )$ denotes a general density function. Here ${\boldsymbol{V}_{1}}=\text{diag}\{{z_{1}},\dots ,{z_{n}}\}$ is a diagonal matrix, ${\boldsymbol{I}_{n}}$ is the $n\times n$ identity matrix, ${\boldsymbol{V}_{2}}={\boldsymbol{I}_{n}}-{\boldsymbol{V}_{1}}$, $\boldsymbol{F}$ is the model matrix whose ith row is $\boldsymbol{f}{({\boldsymbol{x}_{i}})^{\prime }}$, and $\mathbf{1}$ is a vector of ones.
Denote $p({\boldsymbol{\beta }^{(1)}})$, $p({\boldsymbol{\beta }^{(2)}})$, and $p(\boldsymbol{\eta })$ as the prior distributions of the parameters ${\boldsymbol{\beta }^{(1)}}$, ${\boldsymbol{\beta }^{(2)}}$, and $\boldsymbol{\eta }$. We focus on the estimation accuracy of these three groups of parameters; the mean ${\mu _{0}}$ and variance ${\sigma ^{2}}$ are treated as nuisance parameters and are thus excluded from the optimal design criterion. In this work, we assume that the priors of ${\boldsymbol{\beta }^{(1)}}$, ${\boldsymbol{\beta }^{(2)}}$, and $\boldsymbol{\eta }$ are independent. Under this assumption, the conditional posterior distributions of $\boldsymbol{\eta }$, ${\boldsymbol{\beta }^{(1)}}$, and ${\boldsymbol{\beta }^{(2)}}$ are also independent, as explained in Sections 3 and 4. Under the Bayesian framework, the conditional posterior distribution of the parameters $({\boldsymbol{\beta }^{(1)}},{\boldsymbol{\beta }^{(2)}},\boldsymbol{\eta })$ can be derived as
(2.3)
\[\begin{aligned}{}& p({\boldsymbol{\beta }^{(1)}},{\boldsymbol{\beta }^{(2)}},\boldsymbol{\eta }|\boldsymbol{y},\boldsymbol{z},{\mu _{0}},{\sigma ^{2}},\boldsymbol{X})\\ {} & \propto p(\boldsymbol{y}|\boldsymbol{z},{\boldsymbol{\beta }^{(1)}},{\boldsymbol{\beta }^{(2)}},{\mu _{0}},{\sigma ^{2}},\boldsymbol{X})p({\boldsymbol{\beta }^{(1)}})p({\boldsymbol{\beta }^{(2)}})p(\boldsymbol{z}|\boldsymbol{\eta },\boldsymbol{X})p(\boldsymbol{\eta }).\end{aligned}\]
Using (2.3), we develop the general Bayesian optimal design criterion. Let ψ be a criterion function on the conditional posterior distribution of the parameters; for example, the Shannon information (or equivalently, the Kullback-Leibler distance), A/I-optimality [16], or another design criterion. However, $\psi (\cdot )$ cannot be used directly as the final design criterion because its value depends on the random parameters $({\boldsymbol{\beta }^{(1)}},{\boldsymbol{\beta }^{(2)}},\boldsymbol{\eta })$ and on the experimental outputs $(\boldsymbol{y},\boldsymbol{z})$, which are not yet observed. The randomness of $({\boldsymbol{\beta }^{(1)}},{\boldsymbol{\beta }^{(2)}},\boldsymbol{\eta })$ is removed by taking the mean of ψ with respect to these parameters, and the uncertainty of $(\boldsymbol{y},\boldsymbol{z})$ is removed by taking the outer mean in $\mathbb{E}(\mathbb{E}(\psi |\boldsymbol{y},\boldsymbol{z}))$. Therefore, the general Bayesian optimal design criterion on the design matrix $\boldsymbol{X}$ is
(2.4)
\[\begin{aligned}{}& \Psi (\boldsymbol{X}|{\mu _{0}},{\sigma ^{2}})\\ {} =& \int p(\boldsymbol{y},\boldsymbol{z}|{\mu _{0}},{\sigma ^{2}},\boldsymbol{X})\times \\ {} & \left(\int \psi (p({\boldsymbol{\beta }^{(1)}},{\boldsymbol{\beta }^{(2)}},\boldsymbol{\eta }|\boldsymbol{y},\boldsymbol{z},{\mu _{0}},{\sigma ^{2}},\boldsymbol{X}))\times \right.\\ {} & \left.p({\boldsymbol{\beta }^{(1)}},{\boldsymbol{\beta }^{(2)}},\boldsymbol{\eta }|\boldsymbol{y},\boldsymbol{z},{\mu _{0}},{\sigma ^{2}},\boldsymbol{X})\mathrm{d}{\boldsymbol{\beta }^{(1)}}\mathrm{d}{\boldsymbol{\beta }^{(2)}}\mathrm{d}\boldsymbol{\eta }\right)\mathrm{d}\boldsymbol{y}\mathrm{d}\boldsymbol{z}.\end{aligned}\]
It is well known that the Bayesian D-optimal design is equivalent to the Shannon information criterion [3], omitting terms constant in $\boldsymbol{X}$; the corresponding criterion function $\psi (\cdot )$ is $\log (\cdot )$. Next, we develop the Bayesian D-optimal design criterion (2.4) under different prior distributions.
3 Optimal Design under Noninformative Priors
When lacking domain knowledge or proper historical data, experimenters often favor the frequentist approach as no priors need to be specified. The frequentist approach can be seen as the Bayesian approach using noninformative priors. In this section, we derive the optimal design criterion and the regularity conditions for noninformative priors.
3.1 Design Criterion
Assume the noninformative priors $p({\boldsymbol{\beta }^{(i)}})\propto 1$ for $i=1,2$ and $p(\boldsymbol{\eta })\propto 1$. The conditional posterior distribution in (2.3) is then proportional to the joint distribution of the data. It can be further factorized into
\[\begin{aligned}{}p({\boldsymbol{\beta }^{(1)}},{\boldsymbol{\beta }^{(2)}},& \boldsymbol{\eta }|\boldsymbol{y},\boldsymbol{z},{\mu _{0}},{\sigma ^{2}},\boldsymbol{X})\\ {} & \propto p(\boldsymbol{\eta }|\boldsymbol{z},\boldsymbol{X}){\prod \limits_{i=1}^{2}}p({\boldsymbol{\beta }^{(i)}}|\boldsymbol{y},\boldsymbol{z},{\mu _{0}},{\sigma ^{2}},\boldsymbol{X}),\end{aligned}\]
with the posterior distributions
(3.1)
\[\begin{aligned}{}& {\boldsymbol{\beta }^{(i)}}|\boldsymbol{y},\boldsymbol{z},{\mu _{0}},{\sigma ^{2}},\boldsymbol{X}\sim \\ {} & N\left({({\boldsymbol{F}^{\prime }}{\boldsymbol{V}_{i}}\boldsymbol{F})^{-1}}{\boldsymbol{F}^{\prime }}{\boldsymbol{V}_{i}}(\boldsymbol{y}-{\mu _{0}}\mathbf{1}),{\sigma ^{2}}{({\boldsymbol{F}^{\prime }}{\boldsymbol{V}_{i}}\boldsymbol{F})^{-1}}\right)\end{aligned}\]
for $i=1,2$, and
(3.2)
\[ p(\boldsymbol{\eta }|\boldsymbol{z},\boldsymbol{X})\propto \exp \left\{{\sum \limits_{i=1}^{n}}\left({z_{i}}\boldsymbol{f}{({\boldsymbol{x}_{i}})^{\prime }}\boldsymbol{\eta }-\log (1+{e^{\boldsymbol{f}{({\boldsymbol{x}_{i}})^{\prime }}\boldsymbol{\eta }}})\right)\right\}.\]
Conditioning on $\boldsymbol{z}$, the posterior distributions of $({\boldsymbol{\beta }^{(1)}},{\boldsymbol{\beta }^{(2)}})$ and $\boldsymbol{\eta }$ are independent. Note that although the noninformative prior $p(\boldsymbol{\eta })\propto 1$ is improper, it leads to a proper posterior $p(\boldsymbol{\eta }|\boldsymbol{z},\boldsymbol{X})$. Under the noninformative priors, the Bayesian estimation is identical to the frequentist estimation.
Using the posterior distributions (3.1)–(3.2) and the criterion function $\psi (\cdot )=\log (\cdot )$ in the general Bayesian optimal design criterion (2.4), we obtain the Bayesian D-optimal design criterion (3.3).
(3.3)
\[\begin{aligned}{}& \Psi (\boldsymbol{X}|{\mu _{0}},{\sigma ^{2}})={\mathbb{E}_{\boldsymbol{z},\boldsymbol{\eta }}}\left\{\log (p(\boldsymbol{\eta }|\boldsymbol{z},\boldsymbol{X}))\right\}\\ {} & +\frac{1}{2}{\sum \limits_{i=1}^{2}}{\mathbb{E}_{\boldsymbol{\eta }}}{\mathbb{E}_{\boldsymbol{z}|\boldsymbol{\eta }}}\left\{\log \det ({\boldsymbol{F}^{\prime }}{\boldsymbol{V}_{i}}\boldsymbol{F})\right\}+\text{constant}.\end{aligned}\]
The derivation is in Supplement S1. The first additive term in (3.3) is exactly the Bayesian D-optimal design criterion for GLMs. Unfortunately, its exact integration is not tractable. The common approach in experimental design for GLMs is to use a normal approximation of the posterior distribution $p(\boldsymbol{\eta }|\boldsymbol{z},\boldsymbol{X})$ [3, 28]. Such an approximation leads to
(3.4)
\[ {\mathbb{E}_{\boldsymbol{z},\boldsymbol{\eta }}}\left\{\log (p(\boldsymbol{\eta }|\boldsymbol{z},\boldsymbol{X}))\right\}\approx {\mathbb{E}_{\boldsymbol{\eta }}}\{\log \det \boldsymbol{I}(\boldsymbol{\eta }|\boldsymbol{X})\}+\text{constant},\]
where $\boldsymbol{I}(\boldsymbol{\eta }|\boldsymbol{X})$ is the Fisher information matrix. We can easily show that
\[\begin{aligned}{}\boldsymbol{I}(\boldsymbol{\eta }|\boldsymbol{X})& =-{\mathbb{E}_{\boldsymbol{z}}}\left(\frac{{\partial ^{2}}l(\boldsymbol{z},\boldsymbol{\eta }|\boldsymbol{X})}{\partial \boldsymbol{\eta }\partial {\boldsymbol{\eta }^{T}}}\right)\\ {} & ={\sum \limits_{i=1}^{n}}\boldsymbol{f}({\boldsymbol{x}_{i}})\boldsymbol{f}{({\boldsymbol{x}_{i}})^{\prime }}\pi ({\boldsymbol{x}_{i}},\boldsymbol{\eta })(1-\pi ({\boldsymbol{x}_{i}},\boldsymbol{\eta }))={\boldsymbol{F}^{\prime }}{\boldsymbol{W}_{0}}\boldsymbol{F},\end{aligned}\]
where ${\boldsymbol{W}_{0}}$ is the diagonal weight matrix
\[ {\boldsymbol{W}_{0}}=\text{diag}\{\pi ({\boldsymbol{x}_{1}},\boldsymbol{\eta })(1-\pi ({\boldsymbol{x}_{1}},\boldsymbol{\eta })),\dots ,\pi ({\boldsymbol{x}_{n}},\boldsymbol{\eta })(1-\pi ({\boldsymbol{x}_{n}},\boldsymbol{\eta }))\}.\]
Omitting the irrelevant constant, we approximate the exact criterion $\Psi (\boldsymbol{X}|{\mu _{0}},{\sigma ^{2}})$ in (3.3) by
(3.5)
\[\begin{aligned}{}\Psi (\boldsymbol{X}|{\mu _{0}},{\sigma ^{2}})& \approx {\mathbb{E}_{\boldsymbol{\eta }}}\{\log \det ({\boldsymbol{F}^{\prime }}{\boldsymbol{W}_{0}}\boldsymbol{F})\}\\ {} +& \frac{1}{2}{\sum \limits_{i=1}^{2}}{\mathbb{E}_{\boldsymbol{\eta }}}{\mathbb{E}_{\boldsymbol{z}|\boldsymbol{\eta }}}\left\{\log \det ({\boldsymbol{F}^{\prime }}{\boldsymbol{V}_{i}}\boldsymbol{F})\right\}.\end{aligned}\]
To construct the optimal design, we maximize the approximated $\Psi (\boldsymbol{X}|{\mu _{0}},{\sigma ^{2}})$ in (3.5). This is not trivial, because the second additive term involves the expectation over the ${Z_{i}}$'s. To overcome this challenge, [18] constructed optimal designs by simulating samples from the joint distribution of the responses and the unknown parameters, but this method can be computationally expensive even for moderately large numbers of experimental factors. Instead of simulating the ${Z_{i}}$'s, we derive the following Theorem 1, which gives a tractable upper bound $Q(\boldsymbol{X})$, and propose using $Q(\boldsymbol{X})$ as the design criterion.
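For a fixed $\boldsymbol{\eta }$, the first term of the approximated criterion reduces to the local D-objective $\log \det ({\boldsymbol{F}^{\prime }}{\boldsymbol{W}_{0}}\boldsymbol{F})$. The following minimal sketch evaluates it for a one-factor quadratic model; the model matrix, design points, and parameter values are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def logistic_info_logdet(x, eta):
    """log det of the logistic Fisher information F' W0 F for an
    assumed one-factor quadratic model f(x) = (1, x, x^2)."""
    F = np.column_stack([np.ones_like(x), x, x**2])   # model matrix
    p = 1.0 / (1.0 + np.exp(-F @ eta))                # pi(x_i, eta)
    W0 = np.diag(p * (1.0 - p))                       # logistic weights
    sign, logdet = np.linalg.slogdet(F.T @ W0 @ F)
    return logdet

x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0] * 3)   # toy replicated design
eta = np.array([1.0, 1.0, 0.0])                  # local parameter guess
print(logistic_info_logdet(x, eta))
```

Replicating the whole design doubles ${\boldsymbol{F}^{\prime }}{\boldsymbol{W}_{0}}\boldsymbol{F}$ and therefore increases the log-determinant by exactly $q\log 2$, a useful sanity check on any implementation.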
Theorem 1.
Assume that the matrices ${\boldsymbol{F}^{\prime }}{\boldsymbol{W}_{0}}\boldsymbol{F}$, ${\boldsymbol{F}^{\prime }}{\boldsymbol{V}_{1}}\boldsymbol{F}$, and ${\boldsymbol{F}^{\prime }}{\boldsymbol{V}_{2}}\boldsymbol{F}$ are all nonsingular. Omitting the irrelevant constant, an upper bound of the approximated $\Psi (\boldsymbol{X}|{\mu _{0}},{\sigma ^{2}})$ is
(3.6)
\[ Q(\boldsymbol{X})={\mathbb{E}_{\boldsymbol{\eta }}}\left\{\log \det ({\boldsymbol{F}^{\prime }}{\boldsymbol{W}_{0}}\boldsymbol{F})+\frac{1}{2}{\sum \limits_{i=1}^{2}}\log \det ({\boldsymbol{F}^{\prime }}{\boldsymbol{W}_{i}}\boldsymbol{F})\right\},\]
where ${\boldsymbol{W}_{1}}=\textit{diag}\{\pi ({\boldsymbol{x}_{1}},\boldsymbol{\eta }),\dots ,\pi ({\boldsymbol{x}_{n}},\boldsymbol{\eta })\}$ and ${\boldsymbol{W}_{2}}={\boldsymbol{I}_{n}}-{\boldsymbol{W}_{1}}$.
The proof of Theorem 1 is in Supplement S1. Note that Theorem 1 requires that ${\boldsymbol{F}^{\prime }}{\boldsymbol{W}_{i}}\boldsymbol{F}$ for $i=0,1,2$ are all nonsingular. Obviously ${\boldsymbol{W}_{0}}={\boldsymbol{W}_{1}}{\boldsymbol{W}_{2}}$. It is easy to see that ${\boldsymbol{F}^{\prime }}{\boldsymbol{W}_{0}}\boldsymbol{F}$ is nonsingular if and only if both ${\boldsymbol{F}^{\prime }}{\boldsymbol{W}_{1}}\boldsymbol{F}$ and ${\boldsymbol{F}^{\prime }}{\boldsymbol{W}_{2}}\boldsymbol{F}$ are nonsingular.
The matrices ${\boldsymbol{F}^{\prime }}{\boldsymbol{V}_{1}}\boldsymbol{F}$ and ${\boldsymbol{F}^{\prime }}{\boldsymbol{V}_{2}}\boldsymbol{F}$ involve the responses ${Z_{i}}$'s, which are not yet observed at the design stage. We can only choose the experimental run size and the design points so as to avoid the singularity problem with high probability for given values of $\boldsymbol{\eta }$. Once the run size is chosen, the design points can be optimally arranged by maximizing $Q(\boldsymbol{X})$. The weight matrix ${\boldsymbol{W}_{1}}$ (or ${\boldsymbol{W}_{2}}$) gives more weight to the feasible design points that are more likely to yield $Z=1$ (or $Z=0$) observations, so that the parameters ${\boldsymbol{\beta }^{(1)}}$ (or ${\boldsymbol{\beta }^{(2)}}$) of the linear model for $Y|Z=1$ (or $Y|Z=0$) are more likely to be estimable. Next, we introduce some regularity conditions on the run size and the number of replications to alleviate the singularity problem.
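For a fixed value of $\boldsymbol{\eta }$, the expectation in (3.6) drops and $Q(\boldsymbol{X})$ can be evaluated in closed form. A minimal numpy sketch, again with an assumed one-factor quadratic model and a toy design:

```python
import numpy as np

def Q_local(F, eta):
    """Local (fixed-eta) version of the criterion Q(X) in (3.6):
    log det(F'W0F) + 0.5 * [log det(F'W1F) + log det(F'W2F)]."""
    p = 1.0 / (1.0 + np.exp(-F @ eta))     # pi(x_i, eta)
    W1 = np.diag(p)                        # weight toward Z = 1 points
    W2 = np.diag(1.0 - p)                  # weight toward Z = 0 points
    W0 = np.diag(p * (1.0 - p))            # logistic information weights
    q_val = np.linalg.slogdet(F.T @ W0 @ F)[1]
    for W in (W1, W2):
        q_val += 0.5 * np.linalg.slogdet(F.T @ W @ F)[1]
    return q_val

# Toy one-factor quadratic model on a small candidate design.
x = np.linspace(-1.0, 1.0, 7)
F = np.column_stack([np.ones_like(x), x, x**2])
eta0 = np.array([1.0, 1.0, 0.0])
print(Q_local(F, eta0))
```

A point-exchange search, as developed in Section 5, would repeatedly swap candidate rows of $\boldsymbol{F}$ and keep swaps that increase this quantity.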
3.2 Regularity Conditions
Let m be the number of distinct design points in the design matrix $\boldsymbol{X}$ and ${n_{i}}$ be the number of replications of point ${\boldsymbol{x}_{i}}$ in $\boldsymbol{X}$ for $i=1,\dots ,m$. Thus $n={\textstyle\sum _{i=1}^{m}}{n_{i}}$ and ${n_{i}}\ge 1$ for $i=1,\dots ,m$. First, it is necessary that $m\ge q$ for the linear regression model to be estimable under the noninformative priors. Since ${\boldsymbol{W}_{0}}$ has positive diagonal entries whenever $\pi ({\boldsymbol{x}_{i}},\boldsymbol{\eta })\in (0,1)$, the matrix ${\boldsymbol{F}^{\prime }}{\boldsymbol{W}_{0}}\boldsymbol{F}$ is nonsingular if and only if $\text{rank}(\boldsymbol{F})=q$. If $m\ge q$ with $\text{rank}(\boldsymbol{F})=q$ and $\pi ({\boldsymbol{x}_{i}},\boldsymbol{\eta })\in (0,1)$ for $i=1,\dots ,m$, then ${\boldsymbol{F}^{\prime }}{\boldsymbol{W}_{i}}\boldsymbol{F}$ for $i=0,1,2$ are all nonsingular and thus positive definite. To ensure $\pi ({\boldsymbol{x}_{i}},\boldsymbol{\eta })\in (0,1)$ for $i=1,\dots ,m$, it is sufficient to assume that $\boldsymbol{\eta }$ is bounded. This condition is typically used for the frequentist D-optimal design for GLMs. For instance, [44] suggested using the centroids of the bounded space of $\boldsymbol{\eta }$ to develop the local D-optimal design for GLMs, and [13] clustered different local D-optimal designs with $\boldsymbol{\eta }$ randomly sampled from its bounded space.
The if-and-only-if condition for ${\boldsymbol{F}^{\prime }}{\boldsymbol{V}_{1}}\boldsymbol{F}$ and ${\boldsymbol{F}^{\prime }}{\boldsymbol{V}_{2}}\boldsymbol{F}$ being nonsingular is
(3.7)
\[ {\sum \limits_{i=1}^{m}}I\Bigg({\sum \limits_{j=1}^{{n_{i}}}}{Z_{ij}}\gt 0\Bigg)\ge q\hspace{1em}\text{and}\hspace{1em}{\sum \limits_{i=1}^{m}}I\Bigg({\sum \limits_{j=1}^{{n_{i}}}}{Z_{ij}}\lt {n_{i}}\Bigg)\ge q,\]
where ${Z_{ij}}$ is the jth random binary response at the unique design point ${\boldsymbol{x}_{i}}$ and $I(\cdot )$ is the indicator function. In the following, we discuss how to choose the sample sizes ${n_{i}}$ and n under two scenarios: (i) $m=q$ and (ii) $m\gt q$.
Proposition 1.
Assume $m=q$. Both ${\boldsymbol{F}^{\prime }}{\boldsymbol{V}_{1}}\boldsymbol{F}$ and ${\boldsymbol{F}^{\prime }}{\boldsymbol{V}_{2}}\boldsymbol{F}$ are nonsingular if and only if $I(0\lt {\textstyle\sum _{j=1}^{{n_{i}}}}{Z_{ij}}\lt {n_{i}})=1$ for $i=1,2,\dots ,m$. For any given $\kappa \in (0,1)$, a sufficient condition on ${n_{i}}$ for $\Pr (0\lt {\textstyle\sum _{j=1}^{{n_{i}}}}{Z_{ij}}\lt {n_{i}})\ge \kappa $ is
(3.8)
\[ {n_{i}}\ge 1+\left\lceil \frac{\log (1-\kappa )}{\log \left(\max \left\{\pi ({\boldsymbol{x}_{i}},\boldsymbol{\eta }),1-\pi ({\boldsymbol{x}_{i}},\boldsymbol{\eta })\right\}\right)}\right\rceil \]
for $i=1,2,\dots ,m$, and a necessary condition is
(3.9)
\[ {n_{i}}\ge \left\lceil \frac{2\log \left(\frac{1-\kappa }{2}\right)}{\log \pi ({\boldsymbol{x}_{i}},\boldsymbol{\eta })+\log (1-\pi ({\boldsymbol{x}_{i}},\boldsymbol{\eta }))}\right\rceil \]
for $i=1,2,\dots ,m$.
Proposition 2.
Assume $m\gt q$. To make both ${\boldsymbol{F}^{\prime }}{\boldsymbol{V}_{1}}\boldsymbol{F}$ and ${\boldsymbol{F}^{\prime }}{\boldsymbol{V}_{2}}\boldsymbol{F}$ nonsingular with large probability, or equivalently,
\[ \mathbb{E}\left\{{\sum \limits_{i=1}^{m}}I\Bigg({\sum \limits_{j=1}^{{n_{i}}}}{Z_{ij}}\gt 0\Bigg)\right\}\ge q\]
and
\[ \mathbb{E}\left\{{\sum \limits_{i=1}^{m}}I\Bigg({\sum \limits_{j=1}^{{n_{i}}}}{Z_{ij}}\lt {n_{i}}\Bigg)\right\}\ge q,\]
(i) a sufficient condition is
(3.10)
\[ {n_{0}}\ge \left\lceil \max \left\{1,\frac{\log (1-q/m)}{\log (1-{\pi _{\min }})},\frac{\log (1-q/m)}{\log {\pi _{\max }}}\right\}\right\rceil ,\]
(3.11)
\[ n\ge \left\lceil m\cdot \max \left\{1,\frac{\log (1-q/m)}{\log (1-{\pi _{\min }})},\frac{\log (1-q/m)}{\log {\pi _{\max }}}\right\}\right\rceil ,\]
where ${n_{0}}$ denotes the smallest number of replications among the distinct design points, and ${\pi _{\min }}$ and ${\pi _{\max }}$ are the minimum and maximum of $\pi ({\boldsymbol{x}_{i}},\boldsymbol{\eta })$ over the distinct design points.
Proposition 1 gives a sufficient condition on the lower bound of ${n_{i}}$ when a saturated design ($m=q$) is used. Under the sufficient condition, the nonsingularity of ${\boldsymbol{F}^{\prime }}{\boldsymbol{V}_{1}}\boldsymbol{F}$ and ${\boldsymbol{F}^{\prime }}{\boldsymbol{V}_{2}}\boldsymbol{F}$ holds with probability larger than ${\kappa ^{m}}$. For Example 1 in Section 1, suppose that the possible values of x are only $-1,0,1$; then $m=q=3$. Let $\boldsymbol{\eta }={(1,1)^{\prime }}$. If $\kappa =0.5$, then the numbers of replications for $x=-1,0,1$ need to satisfy ${n_{1}}\ge 2$, ${n_{2}}\ge 4$, and ${n_{3}}\ge 7$, respectively. If $\kappa =0.9$, then ${n_{1}}\ge 5$, ${n_{2}}\ge 9$, and ${n_{3}}\ge 20$. Proposition 1 is useful in Step 1 of Algorithm 2 in Section 5 for constructing the initial design.
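The replication numbers quoted above follow directly from the bound (3.8); the sketch below recomputes them for the logistic model of Example 1 (the function names are illustrative).

```python
import math

def pi(x, eta0=1.0, eta1=1.0):
    """pi(x, eta) for the single-factor logistic model of Example 1."""
    t = math.exp(eta0 + eta1 * x)
    return t / (1.0 + t)

def n_sufficient(x, kappa):
    """Sufficient lower bound (3.8) on the replications n_i at point x
    so that Pr(0 < sum_j Z_ij < n_i) >= kappa."""
    p = pi(x)
    return 1 + math.ceil(math.log(1.0 - kappa) / math.log(max(p, 1.0 - p)))

print([n_sufficient(v, 0.5) for v in (-1.0, 0.0, 1.0)])   # [2, 4, 7]
print([n_sufficient(v, 0.9) for v in (-1.0, 0.0, 1.0)])   # [5, 9, 20]
```

The bound grows quickly at points where $\pi $ is close to 0 or 1, which matches the intuition from Example 1: the rarer outcome needs many replications before it is observed.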
Proposition 2 provides a sufficient condition and a necessary condition on the smallest number of replications and the overall run size when $m\gt q$. These conditions only ensure the nonsingularity of the two matrices with large probability, which is weaker than the guarantee in Proposition 1. For a given value of $\boldsymbol{\eta }$, Algorithm 2 in Section 5.1 returns the local D-optimal design, and Proposition 2 can be used to check it, as ${\pi _{\min }}$ and ${\pi _{\max }}$ depend on $\boldsymbol{\eta }$. Take the artificial example in Section 6.1 for instance. The local D-optimal design for $\rho =0$ (Table S1 in Supplement S2) has $m=50$ unique design points and $q=22$ effects. According to Proposition 2, the sufficient condition requires ${n_{0}}\ge 7$ and the necessary condition requires ${n_{0}}\ge 1$. The local D-optimal design in Table S1 only satisfies the necessary condition; to meet the sufficient condition, n would have to be much larger. For the global optimal design considering all possible $\boldsymbol{\eta }$ values, Proposition 2 can provide guidelines for the design construction when $\boldsymbol{\eta }$ varies within a relatively small range.
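The sufficient bound (3.10) is easy to evaluate once ${\pi _{\min }}$ and ${\pi _{\max }}$ are known. The sketch below uses $q=22$ and $m=50$ from the artificial example, but the range $\pi \in [0.08,0.92]$ is an assumed value for illustration only; the paper does not report the actual range.

```python
import math

def n0_sufficient(q, m, pi_min, pi_max):
    """Sufficient lower bound (3.10) on the minimum number of
    replications n0 per distinct design point."""
    r = math.log(1.0 - q / m)
    return math.ceil(max(1.0,
                         r / math.log(1.0 - pi_min),
                         r / math.log(pi_max)))

# q = 22 effects, m = 50 distinct points; pi range is an assumption.
print(n0_sufficient(22, 50, 0.08, 0.92))
```

Note that the bound tightens as $\pi _{\max }$ approaches 1 (or $\pi _{\min }$ approaches 0), since the rarer response class then needs more replications per point.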
4 Optimal Design Under Conjugate Priors
When prior information for parameters ${\boldsymbol{\beta }^{(1)}}$, ${\boldsymbol{\beta }^{(2)}}$, and $\boldsymbol{\eta }$ is available, it would be desirable to consider the optimal design under the informative priors. In this section, we detail the proposed Bayesian D-optimal design using the conjugate priors.
4.1 Design Criterion
For the parameters ${\boldsymbol{\beta }^{(1)}}$ and ${\boldsymbol{\beta }^{(2)}}$, the conjugate priors are normal distributions, since $Y|Z$ follows a normal distribution. Thus we consider the priors
\[ {\boldsymbol{\beta }^{(1)}}\sim N(\mathbf{0},{\tau ^{2}}{\boldsymbol{R}_{1}}),\hspace{1em}{\boldsymbol{\beta }^{(2)}}\sim N(\mathbf{0},{\tau ^{2}}{\boldsymbol{R}_{2}}),\]
where ${\tau ^{2}}$ is the prior variance and ${\boldsymbol{R}_{i}}$ is the prior correlation matrix of ${\boldsymbol{\beta }^{(i)}}$ for $i=1,2$. We use the same prior variance ${\tau ^{2}}$ only for simplicity. The matrix ${\boldsymbol{R}_{i}}$ can be specified flexibly, for example as ${({\boldsymbol{F}^{\prime }}\boldsymbol{F})^{-1}}$ or as in [20] for factorial designs.
For the parameter vector $\boldsymbol{\eta }$, we choose the conjugate prior derived in [4]. It takes the form
(4.1)
\[\begin{aligned}{}\boldsymbol{\eta }& \sim D(s,\boldsymbol{b})\\ {} & \propto \exp \left\{{\sum \limits_{i=1}^{n}}s\left({b_{i}}\boldsymbol{f}{({\boldsymbol{x}_{i}})^{\prime }}\boldsymbol{\eta }-\log (1+{e^{\boldsymbol{f}{({\boldsymbol{x}_{i}})^{\prime }}\boldsymbol{\eta }}})\right)\right\},\end{aligned}\]
where $D(s,\boldsymbol{b})$ is the distribution with parameters $(s,\boldsymbol{b})$. Here s is a scalar factor and $\boldsymbol{b}\in {(0,1)^{n}}$ is the marginal mean of $\boldsymbol{z}$, as shown in [10]. The value of $\boldsymbol{b}$ can be interpreted as a prior prediction (or guess) for $\mathbb{E}(\boldsymbol{Z})$. Based on the priors for $({\boldsymbol{\beta }^{(1)}},{\boldsymbol{\beta }^{(2)}},\boldsymbol{\eta })$, we can derive the posteriors as follows.
Proposition 3.
For priors ${\boldsymbol{\beta }^{(i)}}\sim N(\mathbf{0},{\tau ^{2}}{\boldsymbol{R}_{i}})$, $i=1,2$, and $\boldsymbol{\eta }\sim D(s,\boldsymbol{b})$, the posterior distributions of ${\boldsymbol{\beta }^{(1)}}$, ${\boldsymbol{\beta }^{(2)}}$ and $\boldsymbol{\eta }$ are independent of each other with the following forms,
\[ {\boldsymbol{\beta }^{(i)}}|\boldsymbol{y},\boldsymbol{z},{\mu _{0}},{\sigma ^{2}},\boldsymbol{X}\sim N\left({\boldsymbol{H}_{i}^{-1}}{\boldsymbol{F}^{\prime }}{\boldsymbol{V}_{i}}(\boldsymbol{y}-{\mu _{0}}\mathbf{1}),{\sigma ^{2}}{\boldsymbol{H}_{i}^{-1}}\right),\]
for $i=1,2$ and
\[ \boldsymbol{\eta }|\boldsymbol{z},\boldsymbol{X}\sim D\left(1+s,\frac{\boldsymbol{z}+s\boldsymbol{b}}{1+s}\right),\]
where ${\boldsymbol{H}_{i}}={\boldsymbol{F}^{\prime }}{\boldsymbol{V}_{i}}\boldsymbol{F}+\rho {\boldsymbol{R}_{i}^{-1}}$ with $\rho =\frac{{\sigma ^{2}}}{{\tau ^{2}}}$.
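For concreteness, the posterior in Proposition 3 amounts to a few matrix operations. The sketch below is our own illustration of that computation (the function name and the toy inputs are not from the paper):

```python
import numpy as np

def beta_posterior(F, V, R, y, mu0, sigma2, tau2):
    """Posterior of beta^(i) in Proposition 3:
    N(H^{-1} F' V (y - mu0 * 1), sigma^2 H^{-1}),
    where H = F'VF + rho * R^{-1} and rho = sigma^2 / tau^2."""
    rho = sigma2 / tau2
    H = F.T @ V @ F + rho * np.linalg.inv(R)
    H_inv = np.linalg.inv(H)
    mean = H_inv @ F.T @ V @ (y - mu0 * np.ones(len(y)))
    cov = sigma2 * H_inv
    return mean, cov
```

As a sanity check, when the prior diffuses (${\tau ^{2}}\to \infty $, so $\rho \to 0$) the posterior mean reduces to the weighted least-squares estimate ${({\boldsymbol{F}^{\prime }}\boldsymbol{V}\boldsymbol{F})^{-1}}{\boldsymbol{F}^{\prime }}\boldsymbol{V}(\boldsymbol{y}-{\mu _{0}}\mathbf{1})$.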
The proof of Proposition 3 follows the standard Bayesian framework and is thus omitted. To derive the Bayesian D-optimal design criterion, we substitute the posterior distributions in Proposition 3 into (2.4) and set $\psi (\cdot )=\log (\cdot )$. The derivation is very similar to that in (3.3), and we obtain the exact design criterion as
(4.2)
\[\begin{aligned}{}& \Psi (\boldsymbol{X}|{\mu _{0}},{\sigma ^{2}})={\mathbb{E}_{\boldsymbol{z},\boldsymbol{\eta }}}\left\{\log (p(\boldsymbol{\eta }|\boldsymbol{z},\boldsymbol{X}))\right\}\\ {} +& \frac{1}{2}{\sum \limits_{i=1}^{2}}{\mathbb{E}_{\boldsymbol{\eta }}}{\mathbb{E}_{\boldsymbol{z}|\boldsymbol{\eta }}}\left\{\log \det ({\boldsymbol{F}^{\prime }}{\boldsymbol{V}_{i}}\boldsymbol{F}+\rho {\boldsymbol{R}_{i}^{-1}})\right\}+\text{constant}.\end{aligned}\]
As the integration in ${\mathbb{E}_{\boldsymbol{z},\boldsymbol{\eta }}}\left\{\log (p(\boldsymbol{\eta }|\boldsymbol{z},\boldsymbol{X}))\right\}$ is not tractable, we adopt the same normal approximation of the posterior distribution $p(\boldsymbol{\eta }|\boldsymbol{z},\boldsymbol{X})$ as in (3.4). A straightforward calculation gives the Fisher information matrix $\boldsymbol{I}(\boldsymbol{\eta }|\boldsymbol{X})=(1+s){\boldsymbol{F}^{\prime }}{\boldsymbol{W}_{0}}\boldsymbol{F}$. Thus we have
(4.3)
\[ {\mathbb{E}_{\boldsymbol{z},\boldsymbol{\eta }}}\left\{\log (p(\boldsymbol{\eta }|\boldsymbol{z},\boldsymbol{X}))\right\}\approx {\mathbb{E}_{\boldsymbol{\eta }}}\{\log \det ({\boldsymbol{F}^{\prime }}{\boldsymbol{W}_{0}}\boldsymbol{F})\}+\text{constant}.\]
Disregarding the constant, we can approximate the exact $\Psi (\boldsymbol{X}|{\mu _{0}},{\sigma ^{2}})$ by
\[\begin{aligned}{}\Psi (\boldsymbol{X}|{\mu _{0}},{\sigma ^{2}})& \approx {\mathbb{E}_{\boldsymbol{\eta }}}\{\log \det ({\boldsymbol{F}^{\prime }}{\boldsymbol{W}_{0}}\boldsymbol{F})\}\\ {} & +\frac{1}{2}{\sum \limits_{i=1}^{2}}{\mathbb{E}_{\boldsymbol{\eta }}}{\mathbb{E}_{\boldsymbol{z}|\boldsymbol{\eta }}}\left\{\log \det ({\boldsymbol{F}^{\prime }}{\boldsymbol{V}_{i}}\boldsymbol{F}+\rho {\boldsymbol{R}_{i}^{-1}})\right\}.\end{aligned}\]
The following Theorem 2 gives an upper bound of the approximated criterion $\Psi (\boldsymbol{X}|{\mu _{0}},{\sigma ^{2}})$ that avoids the integration with respect to $\boldsymbol{z}$; it plays the same role as Theorem 1.
Theorem 2.
Assume that the prior distributions of ${\boldsymbol{\beta }^{(i)}}$ are ${\boldsymbol{\beta }^{(i)}}\sim N(\mathbf{0},{\tau ^{2}}{\boldsymbol{R}_{i}})$ for $i=1,2$ and $\boldsymbol{\eta }$ has either the conjugate prior $\boldsymbol{\eta }\sim D(s,\boldsymbol{b})$ or the noninformative prior $p(\boldsymbol{\eta })\propto 1$. If ${\boldsymbol{F}^{\prime }}{\boldsymbol{W}_{0}}\boldsymbol{F}$ is nonsingular, an upper bound of the approximated $\Psi (\boldsymbol{X}|{\mu _{0}},{\sigma ^{2}})$ is
(4.4)
\[\begin{aligned}{}& Q(\boldsymbol{X})=\\ {} & {\mathbb{E}_{\boldsymbol{\eta }}}\left(\log \det ({\boldsymbol{F}^{\prime }}{\boldsymbol{W}_{0}}\boldsymbol{F})+\frac{1}{2}{\sum \limits_{i=1}^{2}}\log \det ({\boldsymbol{F}^{\prime }}{\boldsymbol{W}_{i}}\boldsymbol{F}+\rho {\boldsymbol{R}_{i}^{-1}})\right).\end{aligned}\]
By the same argument as in Section 3, we use the upper bound in (4.4) as the optimal design criterion. Note that since $\rho {\boldsymbol{R}_{i}^{-1}}$ is added to ${\boldsymbol{F}^{\prime }}{\boldsymbol{V}_{i}}\boldsymbol{F}$ and ${\boldsymbol{F}^{\prime }}{\boldsymbol{W}_{i}}\boldsymbol{F}$, the matrices ${\boldsymbol{F}^{\prime }}{\boldsymbol{V}_{i}}\boldsymbol{F}+\rho {\boldsymbol{R}_{i}^{-1}}$ and ${\boldsymbol{F}^{\prime }}{\boldsymbol{W}_{i}}\boldsymbol{F}+\rho {\boldsymbol{R}_{i}^{-1}}$ are nonsingular. The derivation of $\Psi (\boldsymbol{X}|{\mu _{0}},{\sigma ^{2}})$ and $Q(\boldsymbol{X})$ only needs ${\boldsymbol{F}^{\prime }}{\boldsymbol{W}_{0}}\boldsymbol{F}$ to be nonsingular, which requires $m\ge q$ and $\boldsymbol{\eta }$ to be finitely bounded, as in Section 3.
4.2 Interpretation
Note that the criterion in (4.4) has a formulation similar to $Q(\boldsymbol{X})$ in (3.6). The only difference is that (3.6) does not involve $\rho {\boldsymbol{R}_{i}^{-1}}$. For consistency, we use the formula (4.4) as the design criterion $Q(\boldsymbol{X})$ in both cases. When noninformative priors for ${\boldsymbol{\beta }^{(1)}}$ and ${\boldsymbol{\beta }^{(2)}}$ are used, we set $\rho =0$. From another point of view, as ${\tau ^{2}}\to \infty $, $\rho \to 0$, the variances in the priors $p({\boldsymbol{\beta }^{(1)}})$ and $p({\boldsymbol{\beta }^{(2)}})$ diffuse and result in noninformative priors.
The criterion $Q(\boldsymbol{X})$, consisting of three additive terms, can be interpreted intuitively. The first additive term ${\mathbb{E}_{\boldsymbol{\eta }}}\{\log \det ({\boldsymbol{F}^{\prime }}{\boldsymbol{W}_{0}}\boldsymbol{F})\}$ is known as the Bayesian D-optimal criterion for logistic regression and ${\mathbb{E}_{\boldsymbol{\eta }}}\{\log \det ({\boldsymbol{F}^{\prime }}{\boldsymbol{W}_{i}}\boldsymbol{F}+\rho {\boldsymbol{R}_{i}^{-1}})\}$ is the Bayesian D-optimal criterion for the linear regression model of Y. To explain the weights, we rewrite $Q(\boldsymbol{X})$ as follows.
\[\begin{aligned}{}Q(\boldsymbol{X})& =1\cdot {\mathbb{E}_{\boldsymbol{\eta }}}\left(\log \det ({\boldsymbol{F}^{\prime }}{\boldsymbol{W}_{0}}\boldsymbol{F})\right)\\ {} & +1\cdot \left(\frac{1}{2}{\sum \limits_{i=1}^{2}}{\mathbb{E}_{\boldsymbol{\eta }}}(\log \det ({\boldsymbol{F}^{\prime }}{\boldsymbol{W}_{i}}\boldsymbol{F}+\rho {\boldsymbol{R}_{i}^{-1}}))\right).\end{aligned}\]
Since there are equal numbers of binary and continuous response observations, the design criterion should put the same weight (equal to 1) on the design criteria for Z and for Y. For the two criteria for the linear regression models, the same weight 1/2 is used. This is also reasonable: because we assume ${\pi _{i}}\in (0,1)$, none of the diagonal entries of ${\boldsymbol{W}_{1}}$ and ${\boldsymbol{W}_{2}}$ are zero, so the two terms should split the total weight 1 assigned to the entire linear regression part. Therefore, even though $Q(\boldsymbol{X})$ is derived analytically, all the additive terms and their weights make sense intuitively.
4.3 Prior Parameters
Note that the conjugate prior $p(\boldsymbol{\eta })$ requires the prior parameters $(s,\boldsymbol{b})$ to be specified. Moreover, the prior distribution (4.1) contains $\boldsymbol{f}({\boldsymbol{x}_{i}})$, which depends on the design points. When sampling $\boldsymbol{\eta }$ from the prior (4.1), however, it does not matter whether the $\boldsymbol{f}({\boldsymbol{x}_{i}})$'s are actually taken from the design points. If relevant historical data are available, we can simply sample $\boldsymbol{\eta }$ from the likelihood of the data. Alternatively, one can adopt the method in [4] to estimate the parameters $(s,\boldsymbol{b})$. Without relevant data, we would use the noninformative prior for $\boldsymbol{\eta }$, i.e., $p(\boldsymbol{\eta })\propto 1$ on the bounded region for $\boldsymbol{\eta }$.
The design criterion $Q(\boldsymbol{X})$ contains some unknown parameters, including the noise-to-signal ratio $\rho ={\sigma ^{2}}/{\tau ^{2}}$ and the correlation matrices ${\boldsymbol{R}_{i}}$’s for $i=1,2$. The value of ρ has to be specified either from the historical data or from the domain knowledge. Typically we would assume $\rho \lt 1$ such that the measurement error has a smaller variance than the signal variance.
The setting of ${\boldsymbol{R}_{i}}$ can also be specified flexibly. If historical data are available, ${\boldsymbol{R}_{i}}$ can be set as the estimated correlation matrix of ${\boldsymbol{\beta }^{(i)}}$. Otherwise, we can use the correlation matrix in [20] and [24], which is targeted at factorial designs. Specifically, let $\boldsymbol{\beta }$ be the unknown coefficients of the linear regression model with prior distribution $\boldsymbol{\beta }\sim N(\mathbf{0},{\tau ^{2}}\boldsymbol{R})$. For 2-level factors coded as $-1$ and 1, [20] suggests that $\boldsymbol{R}$ is a diagonal matrix and the priors for the individual ${\beta _{j}}$'s are
(4.5)
\[\begin{aligned}{}& {\beta _{0}}\sim N(0,{\tau ^{2}}),\\ {} & {\beta _{j}}\sim N(0,{\tau ^{2}}r),\hspace{2em}j=1,\dots ,p,\\ {} & {\beta _{j}}\sim N(0,{\tau ^{2}}{r^{2}}),\hspace{2em}j=p+1,\dots ,p+\left(\genfrac{}{}{0.0pt}{}{p}{2}\right),\\ {} & \vdots \\ {} & {\beta _{{2^{p}}-1}}\sim N(0,{\tau ^{2}}{r^{p}}),\end{aligned}\]
where ${\beta _{j}}$, $j=1,\dots ,p$, are main effects, ${\beta _{j}}$, $j=p+1,\dots ,p+\left(\genfrac{}{}{0.0pt}{}{p}{2}\right)$, are 2-factor interactions, and so on up to the $p$-factor interaction ${\beta _{{2^{p}}-1}}$. The variance of ${\beta _{j}}$ decreases exponentially with the order of the corresponding effect by the factor $r\in (0,1)$, thus incorporating the effect hierarchy principle [46]. [20] showed that if $\boldsymbol{f}(\boldsymbol{x})$ contains all the ${2^{p}}$ effects of all $p$ orders, ${\tau ^{2}}\boldsymbol{R}$ can be represented alternatively by the Kronecker product ${\tau ^{2}}\boldsymbol{R}={\varsigma ^{2}}{\textstyle\bigotimes _{j=1}^{p}}{\boldsymbol{F}_{j}}{({x_{j}})^{-1}}{\boldsymbol{\Psi }_{j}}({x_{j}}){({\boldsymbol{F}_{j}}{({x_{j}})^{-1}})^{\prime }}$. The model matrix for a 2-level factor and the corresponding correlation matrix are
(4.6)
\[ {\boldsymbol{F}_{j}}({x_{j}})=\left(\begin{array}{c@{\hskip10.0pt}c}1& -1\\ {} 1& 1\end{array}\right)\hspace{0.2778em}\hspace{0.2778em}\text{and}\hspace{0.2778em}\hspace{0.2778em}{\boldsymbol{\Psi }_{j}}({x_{j}})=\left(\begin{array}{c@{\hskip10.0pt}c}1& \zeta \\ {} \zeta & 1\end{array}\right).\]
To keep the two different representations equivalent, let $\zeta =\frac{1-r}{1+r}$ and ${\tau ^{2}}={(\frac{1+\zeta }{2})^{p}}{\varsigma ^{2}}$. For mixed-level experiments with 2- and 3-level factors, [24] extended the 2-level case to the format
(4.7)
\[ {\tau ^{2}}\boldsymbol{R}={\varsigma ^{2}}{\underset{j=1}{\overset{{p_{2}}+{p_{3,c}}+{p_{3,q}}}{\bigotimes }}}{\boldsymbol{F}_{j}}{({x_{j}})^{-1}}{\boldsymbol{\Psi }_{j}}({x_{j}}){({\boldsymbol{F}_{j}}{({x_{j}})^{-1}})^{\prime }},\]
where ${p_{2}}$ is the number of 2-level factors, ${p_{3,c}}$ is the number of 3-level qualitative (categorical) factors, and ${p_{3,q}}$ is the number of 3-level quantitative factors. For all the 3-level factors, the model matrix is
(4.8)
\[ {\boldsymbol{F}_{j}}({x_{j}})=\left(\begin{array}{c@{\hskip10.0pt}c@{\hskip10.0pt}c}1& -\sqrt{\frac{3}{2}}& \sqrt{\frac{1}{2}}\\ {} 1& 0& -\sqrt{2}\\ {} 1& \sqrt{\frac{3}{2}}& \sqrt{\frac{1}{2}}\end{array}\right).\]
An isotropic correlation function is recommended for the 3-level qualitative factors and a Gaussian correlation function for the quantitative factors. Thus, the correlation matrices for the 3-level qualitative and quantitative factors are
(4.9)
\[ {\boldsymbol{\Psi }_{j}}({x_{j}})=\left(\begin{array}{c@{\hskip10.0pt}c@{\hskip10.0pt}c}1& \zeta & \zeta \\ {} \zeta & 1& \zeta \\ {} \zeta & \zeta & 1\end{array}\right)\hspace{0.2778em}\hspace{0.2778em}\text{and}\hspace{0.2778em}\hspace{0.2778em}{\boldsymbol{\Psi }_{j}}({x_{j}})=\left(\begin{array}{c@{\hskip10.0pt}c@{\hskip10.0pt}c}1& \zeta & {\zeta ^{4}}\\ {} \zeta & 1& \zeta \\ {} {\zeta ^{4}}& \zeta & 1\end{array}\right),\]
respectively. To keep the covariance (4.7) consistent with the 2-level case we still set $\zeta =\frac{1-r}{1+r}$. To keep the variance of the intercept equal to ${\tau ^{2}}$ [24], we set
\[ {\tau ^{2}}={\varsigma ^{2}}{\left(\frac{1+\zeta }{2}\right)^{{p_{2}}}}{\left(\frac{1+2\zeta }{3}\right)^{{p_{3,c}}}}{\left(\frac{3+4\zeta +2{\zeta ^{4}}}{9}\right)^{{p_{3,q}}}},\]
and thus
\[\begin{aligned}{}\boldsymbol{R}=& {\left({\left(\frac{1+\zeta }{2}\right)^{{p_{2}}}}{\left(\frac{1+2\zeta }{3}\right)^{{p_{3,c}}}}{\left(\frac{3+4\zeta +2{\zeta ^{4}}}{9}\right)^{{p_{3,q}}}}\right)^{-1}}\\ {} & \times {\underset{j=1}{\overset{{p_{2}}+{p_{3,c}}+{p_{3,q}}}{\bigotimes }}}{\boldsymbol{F}_{j}}{({x_{j}})^{-1}}{\boldsymbol{\Psi }_{j}}({x_{j}}){({\boldsymbol{F}_{j}}{({x_{j}})^{-1}})^{\prime }}.\end{aligned}\]
It is straightforward to prove that $\boldsymbol{R}$ is a diagonal matrix if only 2-level and 3-level qualitative factors are involved, but not if any 3-level quantitative factor is involved, and that the first diagonal entry of $\boldsymbol{R}$ is always 1.
To specify different prior distributions for ${\boldsymbol{\beta }^{(1)}}$ and ${\boldsymbol{\beta }^{(2)}}$, we only need to use different values ${r_{1}}$ (or ${\zeta _{1}}$) and ${r_{2}}$ (or ${\zeta _{2}}$) to construct the prior correlation matrices. If the prior knowledge assumes that the two responses Z and Y are independent, one can set ${r_{1}}={r_{2}}=r$ so that the two correlation matrices ${\boldsymbol{R}_{i}}$ are the same, denoted as $\boldsymbol{R}$. [24] used $r=1/3$ (equivalently $\zeta =1/2$) according to a meta-analysis of 113 data sets from published experiments [30]. Thus we also use $r=1/3$ in all the examples. Readers can specify different values for ${r_{1}}$ and ${r_{2}}$ if needed.
In computation, we construct $\boldsymbol{R}$ using the Kronecker product in (4.7). But such an $\boldsymbol{R}$ is for $\boldsymbol{f}(\boldsymbol{x})$ containing effects of all possible orders. Usually we assume the model contains only lower-order effects, so we select the rows and columns corresponding to the lower-order effects assumed in the model to form the correlation matrix.
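As an illustration of this construction restricted to 2-level factors (the function name, the `keep` argument, and the ${\varsigma ^{2}}=1$ normalization are our choices for the sketch, not the paper's):

```python
import numpy as np
from functools import reduce

def prior_R_2level(p, r, keep=None):
    """Prior correlation matrix R for p 2-level factors, built via the
    Kronecker-product representation (4.7) restricted to 2-level factors.
    `keep` selects the rows/columns of the effects retained in the model."""
    zeta = (1 - r) / (1 + r)
    Fj = np.array([[1., -1.], [1., 1.]])        # model matrix in (4.6)
    Psi = np.array([[1., zeta], [zeta, 1.]])    # correlation matrix in (4.6)
    Fj_inv = np.linalg.inv(Fj)
    block = Fj_inv @ Psi @ Fj_inv.T
    tau2R = reduce(np.kron, [block] * p)        # equals tau^2 R with varsigma^2 = 1
    R = tau2R / ((1 + zeta) / 2) ** p           # divide by tau^2 = ((1+zeta)/2)^p
    if keep is not None:
        R = R[np.ix_(keep, keep)]               # subset to the assumed effects
    return R
```

For $p=1$ this returns $\operatorname{diag}(1,r)$, matching the variances ${\tau ^{2}}$ and ${\tau ^{2}}r$ of the intercept and a main effect in (4.5).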
5 Design Search Algorithm
In this work, we focus on the construction of optimal design based on factorial design, which is suited for the prior distribution introduced in Section 4.3. For optimizing the design criterion $Q(\boldsymbol{X})$ we consider two cases. First, for fixed $\boldsymbol{\eta }$ value, we develop a point-exchange algorithm to construct a local optimal design that maximizes the criterion $Q(\boldsymbol{X}|\boldsymbol{\eta })$. Second, we construct a global optimal design based on the prior distribution of $\boldsymbol{\eta }$. Specifically, we construct the local optimal designs for different $\boldsymbol{\eta }$’s sampled from its prior distribution. Then the global optimal continuous design is obtained by accumulating the frequencies of design points selected into those local optimal designs.
5.1 Local Optimal Design for Fixed η
For a fixed $\boldsymbol{\eta }$, we adapt the point-wise exchange algorithm to maximize the criterion
\[ Q(\boldsymbol{X}|\boldsymbol{\eta })=\log \det ({\boldsymbol{F}^{\prime }}{\boldsymbol{W}_{0}}\boldsymbol{F})+\frac{1}{2}{\sum \limits_{i=1}^{2}}\log \det ({\boldsymbol{F}^{\prime }}{\boldsymbol{W}_{i}}\boldsymbol{F}+\rho {\boldsymbol{R}_{i}^{-1}}).\]
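For reference, this criterion can be evaluated directly from the model matrix. The sketch below assumes the logistic probabilities ${\pi _{i}}=\pi ({\boldsymbol{x}_{i}},\boldsymbol{\eta })$ and the diagonal weights ${\boldsymbol{W}_{0}}=\operatorname{diag}({\pi _{i}}(1-{\pi _{i}}))$, ${\boldsymbol{W}_{1}}=\operatorname{diag}({\pi _{i}})$, ${\boldsymbol{W}_{2}}=\operatorname{diag}(1-{\pi _{i}})$, consistent with the weights appearing in (5.1); the function name is ours.

```python
import numpy as np

def local_criterion(F, eta, R1, R2, rho):
    """Q(X|eta) = log det(F' W0 F)
                 + (1/2) sum_{i=1,2} log det(F' W_i F + rho * R_i^{-1})."""
    pi = 1.0 / (1.0 + np.exp(-F @ eta))     # logistic probabilities pi(x, eta)
    w = [pi * (1 - pi), pi, 1 - pi]         # diagonals of W0, W1, W2
    q = np.linalg.slogdet((F * w[0][:, None]).T @ F)[1]
    for wi, R in zip(w[1:], (R1, R2)):
        q += 0.5 * np.linalg.slogdet((F * wi[:, None]).T @ F
                                     + rho * np.linalg.inv(R))[1]
    return q
```

Using `slogdet` keeps the computation stable for large matrices; the `[1]` entry is the log-determinant, valid here since all three matrices are positive definite for a nonsingular design.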
The point-wise exchange algorithm is commonly used to construct D-optimal designs. It was first introduced by [16] and has since been widely used in many works [5, 35]. The point-wise exchange algorithm finds the optimal design from a candidate set. Here the candidate set is chosen to be the full factorial design without replicates. For now, we develop the method for 2- and 3-level factors, but it can be generalized to factors with more levels. As before, let ${p_{2}}$, ${p_{3,c}}$, and ${p_{3,q}}$ denote the numbers of 2-level, 3-level categorical, and 3-level quantitative factors. The total number of full factorial design points is $N={2^{{p_{2}}}}{3^{{p_{3,c}}+{p_{3,q}}}}$, which can be large if the experiment involves many factors. To make the algorithm efficient, we filter out the candidate points that are unlikely to be optimal design points. Following the suggestion in [12], we exclude the candidate design points whose corresponding probabilities $\pi (\boldsymbol{x},\boldsymbol{\eta })$ are outside of $[0.15,0.85]$. This range is used because the approximate variance of $\log \left(\frac{{\pi _{i}}}{1-{\pi _{i}}}\right)$ is nearly constant for ${\pi _{i}}\in (0.2,0.8)$ but increases rapidly if ${\pi _{i}}$ is outside that range [45]. Denote the reduced candidate set as ${\boldsymbol{X}_{c}}$ with size ${N^{\prime }}$.
Next we construct an initial design of size n such that ${\boldsymbol{F}^{\prime }}{\boldsymbol{W}_{0}}\boldsymbol{F}$ is nonsingular, and so are the ${\boldsymbol{F}^{\prime }}{\boldsymbol{W}_{i}}\boldsymbol{F}$'s for $i=1,2$ if $\rho =0$. If ${N^{\prime }}\ge q$, we construct the initial design by reduction. Starting with ${\boldsymbol{X}_{c}}$ as the initial design, we remove design points one by one until q points are left. The other $n-q$ design points are then sampled from these q points with probabilities proportional to the lower bounds in the sufficient condition of Proposition 1. To remove one design point, we select the one with the smallest value of the deletion function $d(\boldsymbol{x})$ defined in (5.1). Shortcut formulas are developed in Supplement S1 for updating the inverses of the matrices ${\boldsymbol{F}^{\prime }}{\boldsymbol{W}_{0}}\boldsymbol{F}$ and ${\boldsymbol{F}^{\prime }}{\boldsymbol{W}_{i}}\boldsymbol{F}+\rho {\boldsymbol{R}_{i}^{-1}}$ for $i=1,2$ after one design point is removed. If ${N^{\prime }}\lt q$, we restore the candidate set back to the full factorial design and construct the initial design in the same reduction fashion.
To simplify the notation for $d(\boldsymbol{x})$, we define ${v_{i}}({\boldsymbol{x}_{1}},{\boldsymbol{x}_{2}})=\boldsymbol{f}{({\boldsymbol{x}_{1}})^{\prime }}{\boldsymbol{M}_{i}}\boldsymbol{f}({\boldsymbol{x}_{2}})$ and ${v_{i}}(\boldsymbol{x})=\boldsymbol{f}{(\boldsymbol{x})^{\prime }}{\boldsymbol{M}_{i}}\boldsymbol{f}(\boldsymbol{x})$ for $i=0,1,2$, where ${\boldsymbol{M}_{0}}={\left({\boldsymbol{F}^{\prime }}{\boldsymbol{W}_{0}}\boldsymbol{F}\right)^{-1}}$ and ${\boldsymbol{M}_{i}}={\left({\boldsymbol{F}^{\prime }}{\boldsymbol{W}_{i}}\boldsymbol{F}+\rho {\boldsymbol{R}_{i}^{-1}}\right)^{-1}}$ for $i=1,2$. Denote $\boldsymbol{X}$ as the current design and ${\boldsymbol{X}_{-i}}$ the design of $\boldsymbol{X}$ with the ith row removed. Then the deletion function can be derived as
(5.1)
\[\begin{aligned}{}& d({\boldsymbol{x}_{i}})=Q(\boldsymbol{X}|\boldsymbol{\eta })-Q({\boldsymbol{X}_{-i}}|\boldsymbol{\eta })\\ {} & =-\log \left[1-\pi ({\boldsymbol{x}_{i}},\boldsymbol{\eta })(1-\pi ({\boldsymbol{x}_{i}},\boldsymbol{\eta })){v_{0}}({\boldsymbol{x}_{i}})\right]\\ {} & -\frac{1}{2}\log \left[1-\pi ({\boldsymbol{x}_{i}},\boldsymbol{\eta }){v_{1}}({\boldsymbol{x}_{i}})\right]-\frac{1}{2}\log \left[1-(1-\pi ({\boldsymbol{x}_{i}},\boldsymbol{\eta })){v_{2}}({\boldsymbol{x}_{i}})\right].\end{aligned}\]
The smaller $d({\boldsymbol{x}_{i}})$ is, the less the corresponding point contributes to the overall objective $Q(\boldsymbol{X}|\boldsymbol{\eta })$.
One key step of the point-wise exchange algorithm is to compute $\Delta (\boldsymbol{x},{\boldsymbol{x}_{i}})=Q({\boldsymbol{X}^{\ast }}|\boldsymbol{\eta })-Q(\boldsymbol{X}|\boldsymbol{\eta })$, the change in the criterion after the candidate design point $\boldsymbol{x}$ replaces ${\boldsymbol{x}_{i}}$ in the current design $\boldsymbol{X}$. Here ${\boldsymbol{X}^{\ast }}$ is the new design matrix after the exchange. $\Delta (\boldsymbol{x},{\boldsymbol{x}_{i}})$ can be computed efficiently via
(5.2)
\[\begin{aligned}{}\Delta (\boldsymbol{x},{\boldsymbol{x}_{i}})& =Q({\boldsymbol{X}^{\ast }}|\boldsymbol{\eta })-Q(\boldsymbol{X}|\boldsymbol{\eta })\\ {} & =\log {\Delta _{0}}(\boldsymbol{x},{\boldsymbol{x}_{i}})+\frac{1}{2}{\sum \limits_{j=1}^{2}}\log {\Delta _{j}}(\boldsymbol{x},{\boldsymbol{x}_{i}}),\end{aligned}\]
where the ${\Delta _{j}}(\boldsymbol{x},{\boldsymbol{x}_{i}})$ for $j=0,1,2$ are derived in Supplement S1. The matrices ${\boldsymbol{M}_{i}}$ for $i=0,1,2$ need to be updated after the exchange of design points. Denote the updated matrices as ${\boldsymbol{M}_{i}^{\ast }}$ for the updated design ${\boldsymbol{X}^{\ast }}$. We derive shortcut formulas to easily compute ${\boldsymbol{M}_{i}^{\ast }}$ for $i=0,1,2$, as shown in Supplement S1.
Given the initial design, we can iteratively exchange the current design points with candidate design points to improve the objective $Q(\boldsymbol{X}|\boldsymbol{\eta })$. The details are listed in the following Algorithm 1.
-
Step 0 Generate the candidate design set from full factorial design. Filter out the points with probabilities $\pi (\boldsymbol{x},\boldsymbol{\eta })$ outside of $[0.15,0.85]$.
-
Step 1 Generate the initial design. Based on the initial design $\boldsymbol{X}$, update the matrices $\boldsymbol{F}$, ${\boldsymbol{W}_{i}}$, and ${\boldsymbol{M}_{i}}$ for $i=0,1,2$. Compute the current objective value $Q(\boldsymbol{X}|\boldsymbol{\eta })$.
-
Step 2 Compute the deletion function $d({\boldsymbol{x}_{i}})$ for each ${\boldsymbol{x}_{i}}$ in $\boldsymbol{X}$. Randomly sample one design point with probability inversely proportional to $d({\boldsymbol{x}_{i}})$’s. Denote it as ${\boldsymbol{x}_{{i_{0}}}}$.
-
Step 3 Find ${\boldsymbol{x}^{\ast }}$ as the candidate point having the largest $\Delta (\boldsymbol{x},{\boldsymbol{x}_{{i_{0}}}})$. If $\Delta ({\boldsymbol{x}^{\ast }},{\boldsymbol{x}_{{i_{0}}}})\gt 0$, exchange ${\boldsymbol{x}^{\ast }}$ with ${\boldsymbol{x}_{{i_{0}}}}$ in $\boldsymbol{X}$ and update the objective function value to $Q(\boldsymbol{X}|\boldsymbol{\eta })+\Delta ({\boldsymbol{x}^{\ast }},{\boldsymbol{x}_{{i_{0}}}})$.
-
Step 4 Repeat Steps 2 and 3 until the objective function value stabilizes or the maximum number of iterations is reached.
Algorithm 1
Exchange-Point Algorithm for Local D-Optimal Design.
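A minimal sketch of Algorithm 1 follows. It is a simplification of the algorithm above, not the paper's implementation: $Q(\boldsymbol{X}|\boldsymbol{\eta })$ is recomputed from scratch instead of using the shortcut formulas of Supplement S1, a simple cycling start replaces the reduction-based initialization, and the function name is ours.

```python
import numpy as np

def point_exchange(cand_F, eta, n, rho, R1, R2, n_iter=50, seed=0):
    """Simplified Algorithm 1. `cand_F` is the model matrix of the candidate
    set (one row per candidate point). Returns the indices of the n selected
    runs (replication allowed) and the achieved Q(X|eta)."""
    rng = np.random.default_rng(seed)
    Rinv1, Rinv2 = np.linalg.inv(R1), np.linalg.inv(R2)

    def Q(rows):                       # Q(X|eta) recomputed directly
        F = cand_F[rows]
        pi = 1 / (1 + np.exp(-F @ eta))
        q = np.linalg.slogdet((F * (pi * (1 - pi))[:, None]).T @ F)[1]
        q += 0.5 * np.linalg.slogdet((F * pi[:, None]).T @ F + rho * Rinv1)[1]
        q += 0.5 * np.linalg.slogdet((F * (1 - pi)[:, None]).T @ F + rho * Rinv2)[1]
        return q

    design = [i % len(cand_F) for i in range(n)]   # cycling start (nonsingular)
    best = Q(design)
    for _ in range(n_iter):
        # Step 2: sample one point with prob. inversely proportional to d(x_i)
        d = np.array([best - Q(design[:i] + design[i + 1:]) for i in range(n)])
        w = 1.0 / np.maximum(d - d.min() + 1e-8, 1e-8)
        i0 = rng.choice(n, p=w / w.sum())
        # Step 3: best exchange over the whole candidate set
        gains = [Q(design[:i0] + [j] + design[i0 + 1:]) - best
                 for j in range(len(cand_F))]
        j_best = int(np.argmax(gains))
        if gains[j_best] > 1e-10:
            design[i0] = j_best
            best += gains[j_best]
    return design, best
```

Because exchanges are only accepted when the gain is positive, a singular trial design (criterion $-\infty $) is never accepted, mirroring remark (2) above.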
Algorithm 1 can return different optimal designs due to different initial designs and the random sampling in Step 2. Thus, we run Algorithm 1 several times and return the design with the best objective value. We have several remarks regarding the algorithm. (1) The initial design generated via reduction does not have singularity issues. (2) The updated design from point exchange does not have the singularity problem either, based on the way ${\boldsymbol{x}^{\ast }}$ is selected and ${\boldsymbol{M}_{i}^{\ast }}$ for $i=0,1,2$ are computed. (3) To avoid being trapped in a local maximum, in Step 2 we randomly sample the design point for an exchange instead of deterministically picking the "worst" point. (4) Different from some other point-exchange algorithms, the candidate set here remains the same throughout Steps 1-4, since points are not deleted from it when they are selected into the design. This enables the resultant optimal design to have replicated design points.
5.2 Global Optimal Design
Based on Algorithm 1 for the local D-optimal design, we can use the following Algorithm 2 to construct a global optimal design.
-
Step 0 If $p(\boldsymbol{\eta })$ is informative, simulate ${\boldsymbol{\eta }_{j}}\sim p(\boldsymbol{\eta })$ for $j=1,\dots ,B$. Otherwise, sample the ${\boldsymbol{\eta }_{j}}$'s uniformly from a bounded rectangular region.
-
Step 1 For each ${\boldsymbol{\eta }_{j}}$, call Algorithm 1 to construct the local optimal design ${\boldsymbol{X}_{j}}$.
-
Step 2 For each point in the candidate set, count its frequency of being selected in the local optimal designs. The continuous optimal design is formed by the normalized frequency as a discrete distribution.
-
Step 3 To obtain a discrete optimal design, sample n design points from the continuous optimal design.
Algorithm 2
Algorithm for Global D-Optimal Design.
In Step 0, when generating $\boldsymbol{\eta }$ uniformly, we can use a uniform design [14], a maximin Latin hypercube design [34], or other space-filling design methods [21, 31, 37] to select the samples ${\boldsymbol{\eta }_{j}}$ for $j=1,\dots ,B$. From Algorithm 2, it is likely that the discrete design obtained in Step 3 has some design points with ${n_{i}}=1$. When experimenters prefer to have replications at every design point, they can choose a saturated design by sampling $m=q$ unique design points in Step 3, and then sample some $\boldsymbol{\eta }$ values as in Step 0. Compute the lower bounds for the ${n_{i}}$'s for every $\boldsymbol{\eta }$ sample according to Proposition 1 and use the averaged lower bounds to set the ${n_{i}}$'s. If ${\textstyle\sum _{i=1}^{m}}{n_{i}}$ exceeds n, the experimenters have to either increase the experimental budget or reduce the κ value.
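The frequency-accumulation steps of Algorithm 2 can be sketched as follows; `local_design` stands in for any routine (such as Algorithm 1) that returns the candidate indices of a local optimal design, and the function name is ours.

```python
import numpy as np

def global_design(local_design, etas, n_cand, n, seed=0):
    """Steps 1-3 of Algorithm 2: run the local search for each sampled eta,
    accumulate how often each candidate point is selected, normalize into a
    continuous design, then sample a discrete n-run design."""
    rng = np.random.default_rng(seed)
    freq = np.zeros(n_cand)
    for eta in etas:                    # Step 1: local optimal designs
        for i in local_design(eta):     # Step 2: accumulate frequencies
            freq[i] += 1
    weights = freq / freq.sum()         # the continuous optimal design
    runs = rng.choice(n_cand, size=n, p=weights)   # Step 3: discrete design
    return weights, runs
```

Candidate points never selected by any local design receive weight zero and thus cannot appear in the sampled discrete design.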
6 Examples
In this section, we use two examples to demonstrate the proposed Bayesian D-optimal design and the construction method. For both examples, we set $r=1/3$ (equivalently $\zeta =1/2$) as explained in Section 4.3. Since there are few existing works on experimental design for continuous and binary responses, we compare the proposed method with three alternative designs: the optimal design for the quantitative response only, the optimal design for the binary response only, and the naively combined design mentioned in Example 1.
6.1 Artificial Example
In this artificial experiment, there are three 2-level factors ${x_{1}},{x_{2}},{x_{3}}$, one 3-level categorical factor ${x_{4}}$, and one 3-level quantitative factor ${x_{5}}$. The assumed underlying model is the complete quadratic model, and $\boldsymbol{f}(\boldsymbol{x})$ contains $q=22$ model effects: the intercept and the following effects.
First order effects:
\[ {x_{1}},{x_{2}},{x_{3}},{x_{4,1}},{x_{4,2}},{x_{5,l}}.\]
Second order effects:
\[\begin{aligned}{}& {x_{1}}{x_{2}},{x_{1}}{x_{3}},{x_{1}}{x_{4,1}},{x_{1}}{x_{4,2}},{x_{1}}{x_{5,l}},{x_{2}}{x_{3}},{x_{2}}{x_{4,1}},{x_{2}}{x_{4,2}},\\ {} & {x_{2}}{x_{5,l}},{x_{3}}{x_{4,1}},{x_{3}}{x_{4,2}},{x_{3}}{x_{5,l}},{x_{4,1}}{x_{5,l}},{x_{4,2}}{x_{5,l}},{x_{5,\text{quad}}}.\end{aligned}\]
Here for the 3-level factors ${x_{4}}$ and ${x_{5}}$, the effects ${x_{4,1}}$ (1st comparison) and ${x_{5,l}}$ (linear effect) take values $\left\{-\sqrt{\frac{3}{2}},0,\sqrt{\frac{3}{2}}\right\}$, and ${x_{4,2}}$ (2nd comparison) and ${x_{5,\text{quad}}}$ (quadratic effect) take values $\left\{\sqrt{\frac{1}{2}},-\sqrt{2},\sqrt{\frac{1}{2}}\right\}$, following the model matrix in (4.8). For the 2-level factors, the effects ${x_{i}}$, $i=1,2,3$, take the same values as the design settings $\{-1,1\}$. We consider an independent uniform distribution for each ${\eta _{i}}$: specifically, ${\eta _{i}}\sim \text{Uniform}[-1,1]$ for the intercept and the first order effects, and $\text{Uniform}[-0.5,0.5]$ for the second order effects. The ranges of the ${\eta _{i}}$'s satisfy the effect hierarchy principle.
Table 1
An example of $\boldsymbol{\eta }$ value for the local D-optimal design.
Effect | η | Effect | η | Effect | η | Effect | η |
Intercept | −0.0153 | ${x_{1}}$ | −0.6067 | ${x_{2}}$ | 0.7212 | ${x_{1}}{x_{2}}$ | 0.0080 |
${x_{3}}$ | −0.1682 | ${x_{1}}{x_{3}}$ | 0.0010 | ${x_{2}}{x_{3}}$ | 0.1349 | ${x_{4,1}}$ | 0.0283 |
${x_{1}}{x_{4,1}}$ | 0.0594 | ${x_{2}}{x_{4,1}}$ | −0.1719 | ${x_{3}}{x_{4,1}}$ | 0.1492 | ${x_{4,2}}$ | −0.1468 |
${x_{1}}{x_{4,2}}$ | 0.0553 | ${x_{2}}{x_{4,2}}$ | −0.0634 | ${x_{3}}{x_{4,2}}$ | −0.2629 | ${x_{5,l}}$ | −0.0660 |
${x_{1}}{x_{5,l}}$ | −0.1054 | ${x_{2}}{x_{5,l}}$ | −0.0857 | ${x_{3}}{x_{5,l}}$ | −0.0807 | ${x_{4,1}}{x_{5,l}}$ | −0.1198 |
${x_{4,2}}{x_{5,l}}$ | −0.0292 | ${x_{5,\text{quad}}}$ | −0.1336 |
We set the experimental run size to $n=66$. Table 1 lists the values of a randomly chosen $\boldsymbol{\eta }$. Using Algorithm 1 with this $\boldsymbol{\eta }$, we construct the proposed local D-optimal design for the QQ model, denoted ${D_{QQ}}$, for $\rho =0$ and $\rho =0.3$, respectively. For comparison, we also generate three alternative designs via the R package AlgDesign developed by [41]. Specifically, they are (i) the 66-run classic D-optimal design for the linear regression model, denoted as ${D_{L}}$, (ii) the 66-run local D-optimal design for the logistic regression model given the $\boldsymbol{\eta }$, denoted as ${D_{G}}$, and (iii) the naively combined design consisting of a 44-run local D-optimal design for the logistic regression model and a 22-run D-optimal design for the linear regression model, denoted as ${D_{C}}$. The details of these designs can be found in Table S1 in Supplement S2.
To evaluate the performance of the proposed design in comparison with the alternative designs, we consider the efficiency between two designs [44], defined as
(6.1)
\[ \text{eff}({D_{1}},{D_{2}}|\boldsymbol{\eta })=\exp \left\{\frac{1}{q}\left(Q({D_{1}}|\boldsymbol{\eta })-Q({D_{2}}|\boldsymbol{\eta })\right)\right\}.\]
Table 2 reports the efficiency of ${D_{QQ}}$ compared with ${D_{L}}$, ${D_{G}}$, and ${D_{C}}$, respectively. The proposed QQ optimal design ${D_{QQ}}$ is more efficient than all three alternative designs, and the combined design ${D_{C}}$ has the second-best design efficiency.
Table 2
Design efficiency between the proposed local design ${D_{QQ}}$ and three alternative designs.
ρ | $\text{eff}({D_{QQ}},{D_{L}}|\boldsymbol{\eta })$ | $\text{eff}({D_{QQ}},{D_{G}}|\boldsymbol{\eta })$ | $\text{eff}({D_{QQ}},{D_{C}}|\boldsymbol{\eta })$ |
0 | 1.08 | 1.11 | 1.05 |
0.3 | 1.10 | 1.14 | 1.07 |
Next, we focus on the comparison of ${D_{QQ}}$ with ${D_{C}}$ under different $\boldsymbol{\eta }$ values. We generate a maximin Latin hypercube design of $B=500$ runs (R package lhs by [2]) for $\boldsymbol{\eta }$ with the lower and upper bounds specified earlier. For each of the $\boldsymbol{\eta }$ values, we construct a local QQ optimal design ${D_{QQ}}$ and a combined design ${D_{C}}$. Figure 2 shows the histogram of the $\text{eff}({D_{QQ}},{D_{C}}|\boldsymbol{\eta })$ values for the different $\boldsymbol{\eta }$ values. All $\text{eff}({D_{QQ}},{D_{C}}|\boldsymbol{\eta })$ values are larger than 1, indicating that the local QQ optimal design outperforms the combined design.
Figure 2
Artificial example: the efficiency between each local ${D_{QQ}}$ and ${D_{C}}$ under different $\boldsymbol{\eta }$ values: (a) $\rho =0$ (b) $\rho =0.3$.
Based on Algorithm 2, we accumulate the frequencies of the locally optimal designs and obtain the global D-optimal designs shown in Figure 3. Let ${d_{QQ}}$ and ${d_{C}}$ denote the proposed global optimal design for the QQ model and the global optimal combined design, respectively. The bar plots show the normalized frequencies for all the candidate points, with the largest 22 frequencies colored blue. From Figure 3, for ${d_{QQ}}$, the points in the middle have much smaller frequencies than the other points. These middle points correspond to the points with ${x_{5}}=0$ in Table S1. Note that these points are only necessary for estimating the coefficient of ${x_{5,\text{quad}}}$, which has the smallest variance in the priors for the ${\boldsymbol{\beta }^{(i)}}$'s based on the effect hierarchy principle. In contrast, such a pattern is not observed for points with ${x_{4}}=0$. The reason is that ${x_{4}}$ is a categorical variable and ${x_{4}}=-1,0,1$ are equally necessary for estimating the effects ${x_{4,1}}$ and ${x_{4,2}}$. For ${d_{C}}$, the points with the largest 22 frequencies correspond to the 22-run D-optimal design for the linear regression model, which is independent of $\boldsymbol{\eta }$ and remains the same every time. The points with ${x_{5}}=1$ and $-1$ have only slightly higher frequencies than the ones with ${x_{5}}=0$, due to the way we specify the prior distribution of $\boldsymbol{\eta }$.
To compare the performances of the global designs, the design efficiency in (6.1) is used, with $Q(d|\boldsymbol{\eta })$ adapted as
\[\begin{aligned}{}Q(d|\boldsymbol{\eta })={}&\log \det \left(n{\sum \limits_{i=1}^{N}}d({\boldsymbol{x}_{i}})\pi ({\boldsymbol{x}_{i}},\boldsymbol{\eta })(1-\pi ({\boldsymbol{x}_{i}},\boldsymbol{\eta }))\boldsymbol{f}({\boldsymbol{x}_{i}})\boldsymbol{f}{({\boldsymbol{x}_{i}})^{\prime }}\right)\\ {} &+\frac{1}{2}\log \det \left(n{\sum \limits_{i=1}^{N}}d({\boldsymbol{x}_{i}})\pi ({\boldsymbol{x}_{i}},\boldsymbol{\eta })\boldsymbol{f}({\boldsymbol{x}_{i}})\boldsymbol{f}{({\boldsymbol{x}_{i}})^{\prime }}+\rho \boldsymbol{R}\right)\\ {} &+\frac{1}{2}\log \det \left(n{\sum \limits_{i=1}^{N}}d({\boldsymbol{x}_{i}})(1-\pi ({\boldsymbol{x}_{i}},\boldsymbol{\eta }))\boldsymbol{f}({\boldsymbol{x}_{i}})\boldsymbol{f}{({\boldsymbol{x}_{i}})^{\prime }}+\rho \boldsymbol{R}\right)\end{aligned}\]
for a global optimal design $d$ given the $\boldsymbol{\eta }$ value. Here $d({\boldsymbol{x}_{i}})$ is the probability weight for candidate design point ${\boldsymbol{x}_{i}}$ specified by the design $d$, with ${\textstyle\sum _{i=1}^{N}}d({\boldsymbol{x}_{i}})=1$. For the global optimal designs ${d_{QQ}}$ and ${d_{C}}$ obtained previously, Figure 4 shows the histograms of the $\text{eff}({d_{QQ}},{d_{C}}|\boldsymbol{\eta })$ values, where the $\boldsymbol{\eta }$ values are generated from another 100-run maximin Latin hypercube design. It is clear that ${d_{QQ}}$ is universally better than ${d_{C}}$; thus the proposed design is more robust to different values of $\boldsymbol{\eta }$.

6.2 Etching Experiment
In the etching process described in Section 1, the etchant circulates at a certain flow rate. The wafers are rotated and swung horizontally and vertically. Meanwhile, air is blown into the etchant at a certain pressure. Five factors are involved in the etching process: the wafer rotation speed (${x_{1}}$), the pressure for blowing the bubbles (${x_{2}}$), the horizontal and vertical frequencies for swinging the wafers (${x_{3}}$, ${x_{4}}$), and the flow rate of the circulating etchant (${x_{5}}$). The engineers intend to conduct an experiment to study the relationship between these factors and the two QQ responses.
Because the process is newly developed, historical data on similar processes are not directly applicable to this experiment. Based on some exploratory analysis, we set $\rho =0.5$. Both domain knowledge and data have shown that the wafer appearance is the worst when both the rotation speed (${x_{1}}$) and the bubble pressure (${x_{2}}$) are low. Accordingly, we set the prior of $\boldsymbol{\eta }$ as follows. The intercept follows ${\eta _{0}}\sim \text{Uniform}[0,6]$. The linear effects of rotation speed and bubble pressure follow $\text{Uniform}[1,5]$. The other linear effects follow $\text{Uniform}[-1,1]$, and the second-order interactions and quadratic effects follow $\text{Uniform}[-0.3,0.3]$. The experimental run size is set to $n=21\times 6=126$, six times the number of effects $q=21$.
We generate a maximin Latin hypercube design of $B=500$ runs for the $\boldsymbol{\eta }$ values. For each $\boldsymbol{\eta }$ value, we obtain the local optimal designs ${D_{QQ}}$ and ${D_{C}}$. Here the local combined design ${D_{C}}$ has $2/3$ of the runs generated from the local D-optimal design for logistic regression and $1/3$ of the runs from the D-optimal design for linear regression. The efficiency between each pair of local designs ${D_{QQ}}$ and ${D_{C}}$ is reported in Figure 6(a). We can see that almost every local design ${D_{QQ}}$ is better than ${D_{C}}$. Moreover, we obtain the global optimal designs ${d_{QQ}}$ and ${d_{C}}$ by accumulating the frequencies of the local designs. To compare ${d_{QQ}}$ and ${d_{C}}$, we generate another 100-run maximin Latin hypercube design for the $\boldsymbol{\eta }$ values and compute the efficiencies between ${d_{QQ}}$ and ${d_{C}}$ under different $\boldsymbol{\eta }$ values, shown in Figure 6(b). Clearly, ${d_{QQ}}$ is universally better and more robust to $\boldsymbol{\eta }$ than ${d_{C}}$.
The fractional factorial design [46] is another commonly used design in practice. We compare the proposed design with a ${3^{5-2}}$ minimum aberration (MA) fractional factorial design defined by the contrast subgroup $I=AB{D^{2}}=A{B^{2}}C{E^{2}}=A{C^{2}}DE=BCD{E^{2}}$. Each design point is replicated 5 times, so the overall run size is ${3^{5-2}}\times 5=135$. Figure 6(c) shows the histogram of the efficiencies between ${d_{QQ}}$ and the MA design; the proposed global optimal design is still superior.
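The 27 distinct points of this design can be generated from the basic factors $A$, $B$, $C$ by reading generators off the defining contrast subgroup: $I=AB{D^{2}}$ gives $D=A+B\ (\mathrm{mod}\ 3)$ and $I=A{B^{2}}C{E^{2}}$ gives $E=A+2B+C\ (\mathrm{mod}\ 3)$. A sketch with factor levels coded $0,1,2$:

```python
import numpy as np
from itertools import product

# basic factors A, B, C run over all 3^3 level combinations;
# D and E follow from the defining contrast subgroup
runs = []
for a, b, c in product(range(3), repeat=3):
    d = (a + b) % 3          # from I = ABD^2
    e = (a + 2 * b + c) % 3  # from I = AB^2CE^2
    runs.append((a, b, c, d, e))
design = np.array(runs)                    # 27 distinct design points
replicated = np.repeat(design, 5, axis=0)  # 5 replicates -> 135 runs
```

The remaining two words of the contrast subgroup, $A{C^{2}}DE$ and $BCD{E^{2}}$, are implied by the first two and also vanish modulo 3 on every run.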
Figure 5
Etching experiment: (a) the global Bayesian QQ D-optimal design for $\rho =0.5$ and (b) the global combined design.
Figure 6
Etching experiment: histograms of the efficiencies (a) efficiencies between local designs ${D_{QQ}}$ and ${D_{C}}$ for different $\boldsymbol{\eta }$’s; (b) efficiencies between global designs ${d_{QQ}}$ and ${d_{C}}$; (c) efficiencies between global design ${d_{QQ}}$ and the ${3^{5-2}}$ MA fractional factorial design.
7 Discussion
In this paper, we propose a Bayesian D-optimal design criterion for QQ responses. The Bayesian approach allows us to use both non-informative priors, as in the frequentist approach, and informative priors when domain knowledge or historical data are available. A new point-exchange algorithm is developed for efficiently constructing the proposed designs. This algorithm can also be used to construct non-factorial designs when the candidate set is not a full factorial design. Moreover, the proposed method can be directly generalized to sequential design with QQ responses. In the following, we discuss some other scenarios for the proposed method that were not considered in detail previously.
Non-conjugate Prior for $\boldsymbol{\eta }$
Instead of the conjugate prior $p(\boldsymbol{\eta })$, we can also use a non-conjugate prior distribution $\boldsymbol{\eta }\sim N(\mathbf{0},{\tau _{0}^{2}}{\boldsymbol{R}_{0}})$. In this situation, one can apply the normal approximation to the posterior distribution $p(\boldsymbol{\eta }|\boldsymbol{z})$. Then the design criterion for the binary response becomes [3] ${\mathbb{E}_{\boldsymbol{z},\boldsymbol{\eta }}}\{\log p(\boldsymbol{\eta }|\boldsymbol{z})\}\approx {\mathbb{E}_{\boldsymbol{\eta }}}\left\{\log \det ({\boldsymbol{F}^{\prime }}{\boldsymbol{W}_{0}}\boldsymbol{F}+{\rho _{0}}{\boldsymbol{R}_{0}^{-1}})\right\}$. The overall design criterion $Q(\boldsymbol{X})$ can be updated as
\[ Q(\boldsymbol{X})={\sum \limits_{i=0}^{2}}{\mathbb{E}_{\boldsymbol{\eta }}}\left\{\log \det \left({\boldsymbol{F}^{\prime }}{\boldsymbol{W}_{i}}\boldsymbol{F}+{\rho _{i}}{\boldsymbol{R}_{i}^{-1}}\right)\right\},\]
where ${\rho _{i}}={\sigma ^{2}}/{\tau _{i}}$, and ${\tau _{i}}$ and ${\boldsymbol{R}_{i}}$ are the prior variance and correlation matrix of $\boldsymbol{\eta }$, ${\boldsymbol{\beta }^{(1)}}$, and ${\boldsymbol{\beta }^{(2)}}$, respectively ($i=0,1,2$). The proposed design construction algorithm can still be applied with minor modifications.

Multiple QQ Responses
In this paper, we focus on optimal designs for one quantitative response and one qualitative response. The proposed method can also be generalized to accommodate QQ models with multiple quantitative responses ${Y_{1}},\dots ,{Y_{l}}$ and binary responses ${Z_{1}},\dots ,{Z_{k}}$. For example, a multi-level qualitative response can be transformed into a set of dummy binary responses. One idea is to generalize the QQ models in (2.1) and (2.2) for both $l\ge 1$ and $k\ge 1$. For $l\ge 1$ and $k=1$, we generalize the QQ model by introducing the correlation matrix between ${Y_{1}},\dots ,{Y_{l}}$ as in standard multi-response regression [1]. Then the corresponding optimal design can be established by studying its likelihood function. For $l\ge 1$ and $k\gt 1$ with multiple binary responses, considering all ${2^{k}}$ conditional models $({Y_{1}},\dots ,{Y_{l}}|{Z_{1}}=1,\dots ,{Z_{k}}=1),\dots ,({Y_{1}},\dots ,{Y_{l}}|{Z_{1}}=0,\dots ,{Z_{k}}=0)$ only works for a small $k$. Moreover, the construction algorithm can be more complicated, as it needs to involve the multi-logit model [32] for modeling the multiple binary responses. When $k$ is relatively large, we leave the development of an alternative QQ model and its corresponding optimal design method as a future research topic.
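The transformation of a multi-level qualitative response into dummy binary responses can be sketched as follows; this is a minimal illustration, and the three-level labels simply mirror the gene-expression example mentioned in Section 1.

```python
import numpy as np

def to_dummy_binaries(z, levels=None):
    """Turn a multi-level qualitative response z into a matrix of
    dummy binary responses Z_1, ..., Z_k, one indicator column per level."""
    z = np.asarray(z)
    if levels is None:
        levels = list(dict.fromkeys(z.tolist()))  # unique labels, first-seen order
    Z = np.column_stack([(z == lev).astype(int) for lev in levels])
    return Z, levels

Z, levels = to_dummy_binaries(
    ["healthy", "Crohn", "colitis", "healthy", "colitis"])
```

Each subject then carries $k$ binary responses, exactly one of which equals 1, which is the setting of the $l\ge 1$, $k\gt 1$ discussion above.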
Continuous Design
The point-exchange algorithm constructs exact discrete optimal designs, which are different from theoretical continuous optimal designs. As described in Sections 5 and 6, generating the frequencies of the local optimal designs is a heuristic approximation to the continuous design. Rigorously, the local continuous D-optimal design is the probability measure $\xi$ on the design space $\Omega$ that maximizes
\[\begin{aligned}{}Q(\xi |\boldsymbol{\eta })={}&\log \det \left(\int \pi (\boldsymbol{x})(1-\pi (\boldsymbol{x}))\boldsymbol{f}(\boldsymbol{x})\boldsymbol{f}{(\boldsymbol{x})^{\prime }}\mathrm{d}\xi (\boldsymbol{x})\right)\\ {} &+\log \det \left(\int \pi (\boldsymbol{x})\boldsymbol{f}(\boldsymbol{x})\boldsymbol{f}{(\boldsymbol{x})^{\prime }}\mathrm{d}\xi (\boldsymbol{x})+\rho {\boldsymbol{R}_{1}^{-1}}\right)\\ {} &+\log \det \left(\int (1-\pi (\boldsymbol{x}))\boldsymbol{f}(\boldsymbol{x})\boldsymbol{f}{(\boldsymbol{x})^{\prime }}\mathrm{d}\xi (\boldsymbol{x})+\rho {\boldsymbol{R}_{2}^{-1}}\right).\end{aligned}\]
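On a finite candidate grid, the measure $\xi$ reduces to nonnegative weights summing to one and the criterion can be evaluated directly. A minimal sketch, assuming a logistic $\pi$; the function and variable names are hypothetical:

```python
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

def Q_continuous(xi, F, eta, rho, R1, R2):
    """Discretized continuous D-criterion: xi holds the design-measure
    weights on a finite grid, F is the grid model matrix with rows f(x)'."""
    pi = logistic(F @ eta)

    def logdet(M):
        # slogdet is numerically safer than log(det(M)) for these matrices
        return np.linalg.slogdet(M)[1]

    # F.T @ (F * w[:, None]) computes sum_i w_i * f(x_i) f(x_i)'
    M0 = F.T @ (F * (xi * pi * (1 - pi))[:, None])
    M1 = F.T @ (F * (xi * pi))[:, None].squeeze() if False else \
         F.T @ (F * (xi * pi)[:, None]) + rho * np.linalg.inv(R1)
    M2 = F.T @ (F * (xi * (1 - pi))[:, None]) + rho * np.linalg.inv(R2)
    return logdet(M0) + logdet(M1) + logdet(M2)
```

A design measure concentrated on a single point makes the first information matrix singular, so its criterion value is $-\infty$, while a spread-out measure yields a finite value.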
[49] developed a method to obtain the optimal $\xi$ for nonlinear models. It would be interesting to extend their framework and develop a method to obtain the optimal $\xi$ for QQ models.

Different QQ Models
The proposed design is not restricted to the logit model for the binary response. For example, if the probit model is used, the Bayesian D-optimal design criterion can be obtained by replacing the logit transformation with the probit transformation in both $p(\boldsymbol{z}|\boldsymbol{\eta })$ and $p(\boldsymbol{\eta })$, and the criterion can be derived similarly with minor modifications. The criterion formula remains the same, with the following different diagonal matrices:
\[\begin{aligned}{}{\boldsymbol{W}_{0}}& =\text{diag}\left\{\frac{{\phi ^{2}}(\boldsymbol{f}{({\boldsymbol{x}_{1}})^{\prime }}\boldsymbol{\eta })}{\Phi (\boldsymbol{f}{({\boldsymbol{x}_{1}})^{\prime }}\boldsymbol{\eta })\left(1-\Phi (\boldsymbol{f}{({\boldsymbol{x}_{1}})^{\prime }}\boldsymbol{\eta })\right)},\dots ,\right.\\ {} & \hspace{2em}\left.\frac{{\phi ^{2}}(\boldsymbol{f}{({\boldsymbol{x}_{n}})^{\prime }}\boldsymbol{\eta })}{\Phi (\boldsymbol{f}{({\boldsymbol{x}_{n}})^{\prime }}\boldsymbol{\eta })\left(1-\Phi (\boldsymbol{f}{({\boldsymbol{x}_{n}})^{\prime }}\boldsymbol{\eta })\right)}\right\},\\ {} {\boldsymbol{W}_{1}}& =\text{diag}\left\{\Phi \left(\boldsymbol{f}{({\boldsymbol{x}_{1}})^{\prime }}\boldsymbol{\eta }\right),\dots ,\Phi \left(\boldsymbol{f}{({\boldsymbol{x}_{n}})^{\prime }}\boldsymbol{\eta }\right)\right\},\\ {} {\boldsymbol{W}_{2}}& =\boldsymbol{I}-{\boldsymbol{W}_{1}},\end{aligned}\]
where $\Phi$ and $\phi$ are the CDF and PDF of the standard normal distribution.

It is worth pointing out that the design criterion in this work is based on the QQ model constructed by the joint model of $Y|Z$ in (2.1) and $Z$ in (2.2). [26] created a new QQ model based on $Z|U$, where $U$ is a latent continuous variable assumed to be correlated with the observed continuous response variable $Y$. Besides the conditional model structures, other model structures such as mixed graphical models [47] can also be used, as long as the D-optimality criterion can be derived.