1 Introduction
Consider the standard multivariate linear regression
(1.1)
\[ \boldsymbol{Y}={\boldsymbol{\mu }_{\boldsymbol{Y}}}+{\boldsymbol{\beta }^{T}}\boldsymbol{X}+{\boldsymbol{\epsilon }_{\boldsymbol{Y}\mid \boldsymbol{X}}},\]
where the responses $\boldsymbol{Y}\in {\mathbb{R}^{r}}$ and the centered predictors $\boldsymbol{X}\in {\mathbb{R}^{p}}$ are both multivariate, ${\boldsymbol{\mu }_{\boldsymbol{Y}}}\in {\mathbb{R}^{r}}$ denotes the mean of the responses, $\boldsymbol{\beta }\in {\mathbb{R}^{p\times r}}$ the regression coefficients, and ${\boldsymbol{\epsilon }_{\boldsymbol{Y}\mid \boldsymbol{X}}}\sim {\mathcal{N}_{r}}(\mathbf{0},{\boldsymbol{\Sigma }_{\boldsymbol{Y}\mid \boldsymbol{X}}})$ the random error vector. This model has applications in a broad range of scientific fields, including economics [6], chemistry [35], agriculture [51], engineering [44], bioinformatics [34], etc. Specifically, we consider its application to the imaging genetics problem (IGP).
Thanks to rapid technological developments in medical imaging and genome sequencing, the diagnosis, prevention and treatment of mental illness have been greatly improved. Compared with traditional medical imaging analysis and genome-wide association studies (GWAS), which use only imaging or only genome data and have been studied extensively [67, 63], a more promising yet more challenging approach is to directly relate the imaging phenotypes (IPs) as responses to the genotypes (GPs) as predictors of interest, while also accounting for the effects of demographic covariates or other risk factors. This is called the IGP and is less explored in the statistical community [42]. Because of the Central Dogma, often stated as "DNA makes RNA and RNA makes protein", genetic markers are promising for revealing abnormal protein expression in cortical or sub-cortical structures of the brain, whose volumes are more informative quantitative traits than simple disease status.
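To make the setup concrete, the following minimal R sketch simulates data from (1.1) and fits it by ordinary least squares; all dimensions and parameter values are illustrative choices of our own, not taken from the paper.

```r
# Minimal sketch of model (1.1): multivariate responses regressed on
# centered predictors, fitted by multivariate OLS.
set.seed(1)
n <- 100; p <- 3; r <- 2
X <- scale(matrix(rnorm(n * p), n, p), scale = FALSE)   # centered predictors
beta <- matrix(c(1, 0, -1, 0.5, 2, 0), p, r)            # p x r coefficients
mu_Y <- c(0.5, -0.5)
eps <- matrix(rnorm(n * r, sd = 0.3), n, r)             # independent errors
Y <- matrix(mu_Y, n, r, byrow = TRUE) + X %*% beta + eps
fit <- lm(Y ~ X)      # one lm() call fits all r responses jointly
coef(fit)             # (p + 1) x r: intercepts recover mu_Y, slopes beta
```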
Some early practices in studying the IGP focus on univariate [54] or voxel-wise [25] methods, which explore the relationship between every (voxel, SNP) pair (SNP: single-nucleotide polymorphism), or between each individual voxel and all genetic markers together, respectively. These methods enjoy simplicity and can handle high-dimensional imaging or genomic datasets, but cannot leverage the spatial dependence among voxels. To this end, several multivariate regression models have been proposed to jointly relate tens of brain summary measures, such as the volumes of Regions of Interest (ROIs), to hundreds of SNPs [65, 68, 23, 64, 46]. Among them, [46] is, to the best of our knowledge, the only work that applies the envelope model to the IGP.
As a dimension reduction method developed in the past decade, the first envelope model [15] aims at providing an efficient estimator of the regression coefficients $\boldsymbol{\beta }$ in the multivariate linear regression (1.1), by assuming that some linear combinations of the responses $\boldsymbol{Y}$ are not related to the regression, and removing their variation as immaterial information (see Section 2.2 for more details). Given its success in achieving efficient estimation and modeling flexibility, the original envelope model has been adapted or extended to the predictor envelope [12], the groupwise envelope [46], the partial response envelope [57], the simultaneous envelope [14], sparsity learning [56], matrix-valued [18] or tensor-valued [39] responses or predictors, quantile regression [19], generalized linear models [11], etc. The envelope methods were recently reviewed by [37], with more details given in [13]. The aforementioned envelope methods are all developed from the frequentist perspective. The literature on Bayesian envelope methods, by contrast, is sparse, even though the Bayesian approach is attractive for its ability to incorporate prior information into inference and to quantify posterior uncertainty, compared with the unnatural prior incorporation and the bootstrap or asymptotic variance estimates of the frequentist setting.
[33] introduced the first framework for the Bayesian envelope model. However, this framework is hard to apply to other envelope contexts, since it requires the specific formulation of the response envelope to construct the conjugacy. Furthermore, it requires sampling from generalized matrix Bingham distributions and truncated inverse gamma distributions, which makes the MCMC algorithm computationally expensive. Later, [5] proposed a new flexible and convenient Bayesian framework for the envelope model, which bypasses the manifold restriction on the basis matrix of the envelope space through an unconstrained matrix re-parameterization. Since all parameters under this new framework are either vectors, unconstrained matrices, or positive definite matrices, the computation is much more efficient, with no sampling from generalized matrix Bingham distributions or truncated inverse gamma distributions.
The rapid development of the envelope family calls for a model that unifies various perspectives and offers practitioners a clearer choice of which envelope model to apply, especially when there is particular interest in a subset of predictors that may be continuous, discrete, or a mix of the two. The existing simultaneous envelope model [14] integrates both the response and predictor envelopes within a unified modeling framework. It simultaneously enjoys the benefits of the two envelope components, i.e., more efficient estimation of the regression coefficients offered by the response envelope, and improved prediction efficiency of the responses obtained by the predictor envelope; but it cannot focus on the predictors of main interest, and therefore cannot leverage the idea of the partial response envelope [57] to obtain further estimation efficiency. More importantly, without additional assumptions, it is guaranteed to outperform the predictor or the response envelope only for Normal predictors ([14], Propositions 2 and 5). Discrete predictors, however, do not follow a Normal distribution, and including them in the simultaneous envelope by implicitly treating them as Normal leads to biased estimates and invalid inference. The partial response envelope [57] envelopes only the coefficients of the predictors of main interest ${\boldsymbol{X}_{1}}$ (where in (1.1), $\boldsymbol{X}$ is partitioned into ${({\boldsymbol{X}_{1}^{T}},{\boldsymbol{X}_{2}^{T}})^{T}}$, with ${\boldsymbol{X}_{1}}$ denoting the predictors of main interest and ${\boldsymbol{X}_{2}}$ the nuisance covariates), for example the genetic markers in the IGP. Compared with the full response envelope [15], which indistinguishably envelopes the coefficients of all predictors $\boldsymbol{X}$, enhanced estimation efficiency is possible since the envelope space can shrink when concentrating only on the coefficients of ${\boldsymbol{X}_{1}}$; but this model cannot leverage the benefits of the predictor envelope [12], which has the potential to increase prediction efficiency. The envelope model proposed in this paper addresses these limitations while still providing a valid inference procedure.
To address the aforementioned limitations, and to improve the efficiency of estimating the regression coefficients of main interest (the coefficients of ${\boldsymbol{X}_{1}}$, denoted by ${\boldsymbol{\beta }_{1}}$ in (3.2)) and of predicting the responses ($\boldsymbol{Y}$ in (3.2)), this paper proposes a new envelope model, called the simultaneous partial envelope model (3.7)–(3.10), in Section 3. The proposed model combines the partial response envelope model [57] and the simultaneous envelope model [14] within a more general modeling framework, by further partitioning the predictors of main interest ${\boldsymbol{X}_{1}}$ into a continuous part ${\boldsymbol{X}_{1C}}$ and a discrete part ${\boldsymbol{X}_{1D}}$. Our model simultaneously imposes the partial response envelope structure on the coefficients of ${\boldsymbol{X}_{1}}$, and the partial predictor envelope on the coefficients of ${\boldsymbol{X}_{1C}}$ only, instead of the coefficients of the whole ${\boldsymbol{X}_{1}}$. By this construction, our model addresses the aforementioned limitations of the simultaneous and the partial response envelopes. Firstly, the two envelope components imposed simultaneously on our model are both partial envelopes, which focus on the predictors of interest only. Secondly, our model avoids the Normality requirement mentioned above for the discrete predictors in the simultaneous envelope model while still providing valid inference, by imposing only the partial response envelope (i.e. no partial predictor envelope) on the coefficients of ${\boldsymbol{X}_{1D}}$. It is also worth noting that ${\boldsymbol{X}_{1C}}$ is Normal with means depending on ${\boldsymbol{X}_{1D}}$ in our model. Lastly, to further improve the estimation and prediction efficiency over the partial response envelope, the simultaneous envelope structure is applied to the coefficients of ${\boldsymbol{X}_{1C}}$, which turns out to enhance not only the estimation of the coefficients of ${\boldsymbol{X}_{1C}}$, but also that of the coefficients of ${\boldsymbol{X}_{1D}}$ in our simulation (see the comparisons on the estimation of the coefficients of ${\boldsymbol{X}_{1D}}$ between our method and the Bayesian partial response envelope in Table 3 and Figure 2), a possible synergistic effect.
Our proposed model includes a number of important models in the envelope family as special cases, including the (partial) response envelope, the (partial) predictor envelope and the simultaneous envelope (see Table 1 for their relationships). In Section 4, we develop an efficient Bayesian inference procedure based on [5], which allows convenient incorporation of prior information and quantification of variability, compared with frequentist envelope methods.
The contributions of this paper are fivefold.
1. We propose a new simultaneous partial envelope model (3.7)–(3.10) that unifies several existing envelope models as special cases. Our proposed model allows both the predictors of interest ${\boldsymbol{X}_{1}}$ and the nuisance covariates ${\boldsymbol{X}_{2}}$ to be continuous, discrete, or a mix of the two. Compared with any of the envelopes that our model degenerates to, our method provides a more efficient, yet still valid, estimator of the regression coefficients of ${\boldsymbol{X}_{1}}$ and improved prediction of $\boldsymbol{Y}$ in (3.2). These advantages address the limitations mentioned above for the simultaneous envelope model [14] and the partial response envelope model [57]. The corresponding block Metropolis-within-Gibbs algorithm is developed for the posterior inference.
2. We establish theoretical properties including the posterior propriety for our model and the Harris ergodicity for our algorithm.
3. To the best of our knowledge, we are the first in the Bayesian envelope literature to investigate the performance of several popular dimension selection methods together.
4. We apply our method to an imaging genetics analysis of the ADNI study. We show the improved prediction accuracy of our method compared with all other competitors, and obtain some scientifically meaningful findings from posterior selection. By incorporating the weak imaging genetics relationship as prior information, the prediction accuracy is further improved, and the associated shrinkage effect is also investigated.
5. We build the R package $\mathrm{SIMP}$ that implements our algorithm, available at https://github.com/yanbowisc/SIMP.git.
Some notations frequently used in this paper are summarized here. For any $a=1,2,\dots $, let ${\mathbb{P}_{\mathcal{S}}}$ be a projection matrix onto a subspace $\mathcal{S}\subseteq {\mathbb{R}^{a}}$ and ${\mathbb{P}_{{\mathcal{S}^{\perp }}}}={\mathbf{I}_{a}}-{\mathbb{P}_{\mathcal{S}}}$ be the projection matrix onto ${\mathcal{S}^{\perp }}$, where ${\mathcal{S}^{\perp }}$ denotes the orthogonal complement of $\mathcal{S}$ and ${\mathbf{I}_{a}}$ denotes the a-dimensional identity matrix. Let ${\mathbb{S}_{+}^{a\times a}}$ denote the class of all $a\times a$ real valued positive definite matrices. Let ${\mathbf{1}_{n}}$ denote a vector of all ones with length n. For a matrix $\mathbf{A}\in {\mathbb{R}^{a\times a}}$ and a subspace $\mathcal{E}\subseteq {\mathbb{R}^{a}}$, $\mathbf{A}\mathcal{E}$ is defined as $\mathbf{A}\mathcal{E}=\{\mathbf{A}\boldsymbol{\epsilon }:\boldsymbol{\epsilon }\in \mathcal{E}\}$. Notations $|\mathbf{A}|$, $\| \mathbf{A}\| $, $\mathrm{vec}(\mathbf{A})$ and $\mathrm{span}(\mathbf{A})$ denote the determinant, the spectral norm, the vectorization by columns and the column space of the matrix A respectively. For ${a_{1}},{a_{2}},\dots ,{a_{n}}\in \mathbb{R}$ and a square matrix $\mathbf{S}\in {\mathbb{R}^{a\times a}}$, we describe a diagonal matrix by either specifying its diagonal elements by the notation $\mathrm{diag}({a_{1}},\dots ,{a_{n}})$, or using the notation $\mathrm{diag}\{\mathbf{S}\}$ to indicate that the diagonal elements of the square matrix S will be taken out to constitute this diagonal matrix. Moreover, ${\mathbf{S}_{ii}}$ denotes the i-th diagonal element of the square matrix S. For a random vector $\boldsymbol{Y}$, the distribution of $\boldsymbol{Y}\mid \boldsymbol{X}$ is interpreted as the distribution of $\boldsymbol{Y}$ given the fixed value of $\boldsymbol{X}$ for a non-stochastic vector $\boldsymbol{X}$, or the distribution of $\boldsymbol{Y}$ conditional on $\boldsymbol{X}$ for a random vector $\boldsymbol{X}$. For $a,b,c=1,2,\dots $, and random vectors $\boldsymbol{X}\in {\mathbb{R}^{a}}$, $\boldsymbol{Y}\in {\mathbb{R}^{b}}$, we use $\mathrm{Cov}((\boldsymbol{X},\boldsymbol{Y})\mid \boldsymbol{Z})$ to denote the $a\times b$ dimensional covariance matrix of $\boldsymbol{X}$ and $\boldsymbol{Y}$ conditional on a random (or given a non-stochastic) vector $\boldsymbol{Z}\in {\mathbb{R}^{c}}$. If multiple random (or non-stochastic) vectors are conditional on (or given), they are bracketed together after the vertical line, like $\boldsymbol{Y}\mid (\cdot )$ and $\mathrm{Cov}((\boldsymbol{X},\boldsymbol{Y})\mid (\cdot ))$. Also, $:=$ and $\boldsymbol{X}\stackrel{\mathcal{D}}{=}\boldsymbol{Y}$ indicate equating by definition and random vectors $\boldsymbol{X}$ and $\boldsymbol{Y}$ are equally distributed respectively. Lastly, for any $x,y\in \mathbb{R}$, $x\ll y$ indicates that x is much less than y, while $x\gg y$ means that x is much greater than y.
The rest of this paper is organized as follows. Section 2 reviews two existing envelope models, which are prototypes of two envelope components of our proposed model. Section 3 introduces the definition and the formulation of our proposed simultaneous partial envelope model. Section 4 develops a Bayesian inference procedure for our proposed model, with detailed procedure of MCMC algorithm left to Appendix A. Section 5 establishes the theoretical properties of our model and algorithm. Section 6 investigates the issue of model selection for our Bayesian envelope model. Sections 7 and 8 conduct comprehensive simulation studies and investigate a real data application respectively. Section 9 concludes our paper by further discussions.
2 Review of the Response and the Predictor Envelope Models
2.1 Preliminaries
In this section, we introduce two definitions and one proposition, which are necessary for the formal introduction of the response envelope model [15] and the predictor envelope model [12]. For convenience, the definition of a reducing subspace [10] is introduced first.
Definition 1 ([10], Chapter II, Definition 3.5).
A subspace $\mathcal{E}\subseteq {\mathbb{R}^{a}}$ is called a reducing subspace of $\boldsymbol{\Sigma }\in {\mathbb{R}^{a\times a}}$ for some integer $a=1,2,\dots $, if $\boldsymbol{\Sigma }\mathcal{E}\subseteq \mathcal{E}$ and $\boldsymbol{\Sigma }{\mathcal{E}^{\perp }}\subseteq {\mathcal{E}^{\perp }}$.
If $\mathcal{E}$ is a reducing subspace of Σ, there is a crucial decomposition of Σ as illustrated in the following proposition.
Proposition 1 ([15], Proposition 2.1).
$\mathcal{E}$ is a reducing subspace of Σ if and only if Σ could be decomposed as $\boldsymbol{\Sigma }={\mathbb{P}_{\mathcal{E}}}\boldsymbol{\Sigma }{\mathbb{P}_{\mathcal{E}}}+{\mathbb{P}_{{\mathcal{E}^{\perp }}}}\boldsymbol{\Sigma }{\mathbb{P}_{{\mathcal{E}^{\perp }}}}$, where ${\mathbb{P}_{\mathcal{E}}}$ and ${\mathbb{P}_{{\mathcal{E}^{\perp }}}}$ are the projection matrices onto subspaces $\mathcal{E}$ and ${\mathcal{E}^{\perp }}$.
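Proposition 1 is easy to verify numerically. The following R sketch, with an illustrative Σ of our own construction, builds a matrix reduced by a chosen subspace and checks the decomposition.

```r
# Numerical check of Proposition 1: if E is a reducing subspace of Sigma,
# then Sigma = P_E Sigma P_E + P_{E-perp} Sigma P_{E-perp}.
set.seed(1)
a <- 4
G  <- qr.Q(qr(matrix(rnorm(a * 2), a, 2)))        # orthonormal basis of E
G0 <- qr.Q(qr(G, complete = TRUE))[, 3:4]         # orthonormal basis of E-perp
Sigma <- G %*% diag(c(3, 2)) %*% t(G) +
         G0 %*% diag(c(1, 0.5)) %*% t(G0)         # E reduces this Sigma
PE  <- G %*% t(G)                                 # projection onto E
PE0 <- diag(a) - PE                               # projection onto E-perp
max(abs(Sigma - (PE %*% Sigma %*% PE + PE0 %*% Sigma %*% PE0)))  # ~ 1e-16
```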
Next, for a positive definite matrix M, we introduce the general definition of the M-envelope of a subspace $\mathcal{S}$. It will be used in defining the response and the predictor envelope models in Section 2.2 and Section 2.3 and our proposed model in Section 3.
Definition 2 ([15], Definition 2.1).
Let a matrix $\mathbf{M}\in {\mathbb{S}_{+}^{a\times a}}$ and a subspace $\mathcal{S}\subset \mathrm{span}(\mathbf{M})$. The M-envelope of $\mathcal{S}$ is the intersection of all reducing subspaces of M that contain $\mathcal{S}$, and is denoted by ${\mathcal{E}_{\mathbf{M}}}(\mathcal{S})$.
For convenience, we will frequently use the notation ${\mathcal{E}_{\mathbf{M}}}(\mathcal{S})$ to define envelope structures in the rest of this paper. Below we briefly review the response envelope model [15] in Section 2.2, and mention how this idea was adapted to define the predictor envelope model [12] in Section 2.3.
2.2 The Response Envelope Model
In this section, we review the definitions and the coordinate form of the response envelope model [15], which aims to improve the estimation efficiency of $\boldsymbol{\beta }$ in (1.1) by assuming that some linear combinations of $\boldsymbol{Y}$ do not depend on $\boldsymbol{X}$. Here, we assume $\boldsymbol{X}$ in (1.1) to be non-stochastic. The efficiency gains come from removing the redundant variation in $\boldsymbol{Y}$ by estimating the associated linear combination coefficients. Formally, the response envelope model is defined by the smallest subspace ${\mathcal{E}_{\boldsymbol{Y}\mid \boldsymbol{X}}}$ that satisfies Condition 1.
Condition 1 ([15], p. 928).
Assume in (1.1), the subspace ${\mathcal{S}_{\boldsymbol{Y}\mid \boldsymbol{X}}}\subset {\mathbb{R}^{r}}$ satisfies
$(a)$ ${\mathbb{P}_{{\mathcal{S}_{\boldsymbol{Y}\mid \boldsymbol{X}}^{\perp }}}}\boldsymbol{Y}\mid \boldsymbol{X}\stackrel{\mathcal{D}}{=}{\mathbb{P}_{{\mathcal{S}_{\boldsymbol{Y}\mid \boldsymbol{X}}^{\perp }}}}\boldsymbol{Y}$,
$(b)$ $\mathrm{Cov}(({\mathbb{P}_{{\mathcal{S}_{\boldsymbol{Y}\mid \boldsymbol{X}}}}}\boldsymbol{Y},{\mathbb{P}_{{\mathcal{S}_{\boldsymbol{Y}\mid \boldsymbol{X}}^{\perp }}}}\boldsymbol{Y})\mid \boldsymbol{X})=\mathbf{0}$.
Condition 1$(a)$ assumes that the distribution of ${\mathbb{P}_{{\mathcal{S}_{\boldsymbol{Y}\mid \boldsymbol{X}}^{\perp }}}}\boldsymbol{Y}$ does not depend on the value of $\boldsymbol{X}$. Meanwhile, Condition 1$(b)$ excludes the indirect effect of $\boldsymbol{X}$ on ${\mathbb{P}_{{\mathcal{S}_{\boldsymbol{Y}\mid \boldsymbol{X}}^{\perp }}}}\boldsymbol{Y}$ via ${\mathbb{P}_{{\mathcal{S}_{\boldsymbol{Y}\mid \boldsymbol{X}}}}}\boldsymbol{Y}$, by assuming that ${\mathbb{P}_{{\mathcal{S}_{\boldsymbol{Y}\mid \boldsymbol{X}}}}}\boldsymbol{Y}$ and ${\mathbb{P}_{{\mathcal{S}_{\boldsymbol{Y}\mid \boldsymbol{X}}^{\perp }}}}\boldsymbol{Y}$ are uncorrelated given $\boldsymbol{X}$. Therefore, ${\mathbb{P}_{{\mathcal{S}_{\boldsymbol{Y}\mid \boldsymbol{X}}^{\perp }}}}\boldsymbol{Y}$ contains only immaterial information, while ${\mathbb{P}_{{\mathcal{S}_{\boldsymbol{Y}\mid \boldsymbol{X}}}}}\boldsymbol{Y}$ carries all the material information and possibly some extra immaterial information for the regression. Instead of using an arbitrary ${\mathcal{S}_{\boldsymbol{Y}\mid \boldsymbol{X}}}$ satisfying the condition, the definition of the response envelope model ensures the uniqueness of ${\mathcal{E}_{\boldsymbol{Y}\mid \boldsymbol{X}}}$ by taking it to be the smallest such subspace, i.e. the intersection of all subspaces satisfying the condition. By using ${\mathcal{E}_{\boldsymbol{Y}\mid \boldsymbol{X}}}$ rather than an arbitrary ${\mathcal{S}_{\boldsymbol{Y}\mid \boldsymbol{X}}}$, ${\mathbb{P}_{{\mathcal{E}_{\boldsymbol{Y}\mid \boldsymbol{X}}}}}\boldsymbol{Y}$ is guaranteed to contain no immaterial information, so that the immaterial information is removed to the largest degree. We therefore call ${\mathbb{P}_{{\mathcal{E}_{\boldsymbol{Y}\mid \boldsymbol{X}}}}}\boldsymbol{Y}$ the material part and ${\mathbb{P}_{{\mathcal{E}_{\boldsymbol{Y}\mid \boldsymbol{X}}^{\perp }}}}\boldsymbol{Y}$ the immaterial part of $\boldsymbol{Y}$. Although intuitive, Condition 1 does not lend itself easily to formulating the response envelope model. [15] derived the following Condition 2, which is equivalent to Condition 1, so the response envelope model for (1.1) can be equivalently defined by Condition 2 with subspace ${\mathcal{E}_{\boldsymbol{Y}\mid \boldsymbol{X}}}$.
Condition 2 ([15], Definition 2.1).
Assume in (1.1), the subspace ${\mathcal{S}_{\boldsymbol{Y}\mid \boldsymbol{X}}}\subset {\mathbb{R}^{r}}$ satisfies
$({a^{\prime }})$ $\mathrm{span}({\boldsymbol{\beta }^{T}})\subseteq {\mathcal{S}_{\boldsymbol{Y}\mid \boldsymbol{X}}}$,
$({b^{\prime }})$ ${\boldsymbol{\Sigma }_{\boldsymbol{Y}\mid \boldsymbol{X}}}={\mathbb{P}_{{\mathcal{S}_{\boldsymbol{Y}\mid \boldsymbol{X}}}}}{\boldsymbol{\Sigma }_{\boldsymbol{Y}\mid \boldsymbol{X}}}{\mathbb{P}_{{\mathcal{S}_{\boldsymbol{Y}\mid \boldsymbol{X}}}}}+{\mathbb{P}_{{\mathcal{S}_{\boldsymbol{Y}\mid \boldsymbol{X}}^{\perp }}}}{\boldsymbol{\Sigma }_{\boldsymbol{Y}\mid \boldsymbol{X}}}{\mathbb{P}_{{\mathcal{S}_{\boldsymbol{Y}\mid \boldsymbol{X}}^{\perp }}}}$, i.e. ${\mathcal{S}_{\boldsymbol{Y}\mid \boldsymbol{X}}}$ is a reducing subspace of ${\boldsymbol{\Sigma }_{\boldsymbol{Y}\mid \boldsymbol{X}}}$.
Condition 2 assumes that ${\mathcal{S}_{\boldsymbol{Y}\mid \boldsymbol{X}}}$ is a reducing subspace of ${\boldsymbol{\Sigma }_{\boldsymbol{Y}\mid \boldsymbol{X}}}$ that contains $\mathrm{span}({\boldsymbol{\beta }^{T}})$. Recalling Definition 2, ${\mathcal{E}_{\boldsymbol{Y}\mid \boldsymbol{X}}}$ is exactly ${\mathcal{E}_{{\boldsymbol{\Sigma }_{\boldsymbol{Y}\mid \boldsymbol{X}}}}}(\mathrm{span}({\boldsymbol{\beta }^{T}}))$, i.e. the ${\boldsymbol{\Sigma }_{\boldsymbol{Y}\mid \boldsymbol{X}}}$-envelope of $\mathrm{span}({\boldsymbol{\beta }^{T}})$. Therefore, the response envelope model could be defined directly by ${\mathcal{E}_{{\boldsymbol{\Sigma }_{\boldsymbol{Y}\mid \boldsymbol{X}}}}}(\mathrm{span}({\boldsymbol{\beta }^{T}}))$ for (1.1).
Suppose that the orthonormal basis matrices of ${\mathcal{E}_{\boldsymbol{Y}\mid \boldsymbol{X}}}$ and ${\mathcal{E}_{\boldsymbol{Y}\mid \boldsymbol{X}}^{\perp }}$ are Γ and ${\boldsymbol{\Gamma }_{0}}$ respectively. Then according to Condition 2 with subspace ${\mathcal{E}_{\boldsymbol{Y}\mid \boldsymbol{X}}}$, the coordinate form of the response envelope model is
(2.1)
\[ \boldsymbol{Y}={\boldsymbol{\mu }_{\boldsymbol{Y}}}+\boldsymbol{\Gamma }\boldsymbol{\eta }\boldsymbol{X}+{\boldsymbol{\epsilon }_{\boldsymbol{Y}\mid \boldsymbol{X}}},\hspace{1em}{\boldsymbol{\beta }^{T}}=\boldsymbol{\Gamma }\boldsymbol{\eta },\]
(2.2)
\[ {\boldsymbol{\Sigma }_{\boldsymbol{Y}\mid \boldsymbol{X}}}=\boldsymbol{\Gamma }\boldsymbol{\Delta }{\boldsymbol{\Gamma }^{T}}+{\boldsymbol{\Gamma }_{0}}{\boldsymbol{\Delta }_{0}}{\boldsymbol{\Gamma }_{0}^{T}},\]
where $\boldsymbol{\eta }$, $\boldsymbol{\Delta }={\boldsymbol{\Gamma }^{T}}{\boldsymbol{\Sigma }_{\boldsymbol{Y}\mid \boldsymbol{X}}}\boldsymbol{\Gamma }$ and ${\boldsymbol{\Delta }_{0}}={\boldsymbol{\Gamma }_{0}^{T}}{\boldsymbol{\Sigma }_{\boldsymbol{Y}\mid \boldsymbol{X}}}{\boldsymbol{\Gamma }_{0}}$ carry the coordinates of ${\boldsymbol{\beta }^{T}}$ relative to Γ, and of ${\boldsymbol{\Sigma }_{\boldsymbol{Y}\mid \boldsymbol{X}}}$ relative to Γ and ${\boldsymbol{\Gamma }_{0}}$, respectively. The response envelope model can provide significant efficiency gains when $\| \boldsymbol{\Delta }\| \ll \| {\boldsymbol{\Delta }_{0}}\| $, since there is then a large amount of immaterial variation to remove.
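To illustrate where the gains come from, the following toy R sketch (our own construction, with the true Γ treated as known; actual envelope estimators must estimate Γ) simulates data with $\| \boldsymbol{\Delta }\| \ll \| {\boldsymbol{\Delta }_{0}}\| $ and compares OLS with the oracle projection of the OLS estimate onto $\mathrm{span}(\boldsymbol{\Gamma })$.

```r
# Toy illustration with oracle Gamma: projecting the OLS estimate onto
# span(Gamma) removes the large immaterial variation along Gamma0.
set.seed(2)
n <- 80; p <- 2; r <- 5; u <- 1
Gamma  <- qr.Q(qr(matrix(rnorm(r * u), r, u)))
Gamma0 <- qr.Q(qr(Gamma, complete = TRUE))[, (u + 1):r]
eta  <- matrix(c(1, -1), u, p)                         # coordinates of beta^T
beta <- t(Gamma %*% eta)                               # p x r true coefficients
Sigma <- Gamma %*% (0.1 * diag(u)) %*% t(Gamma) +      # Delta  (small)
         Gamma0 %*% (25 * diag(r - u)) %*% t(Gamma0)   # Delta0 (large)
mse_ols <- mse_env <- 0
for (s in 1:200) {
  X <- scale(matrix(rnorm(n * p), n, p), scale = FALSE)
  Y <- X %*% beta + matrix(rnorm(n * r), n, r) %*% chol(Sigma)
  b_ols <- solve(crossprod(X), crossprod(X, Y))        # p x r OLS estimate
  b_env <- b_ols %*% Gamma %*% t(Gamma)                # keep material part only
  mse_ols <- mse_ols + sum((b_ols - beta)^2) / 200
  mse_env <- mse_env + sum((b_env - beta)^2) / 200
}
c(OLS = mse_ols, oracle_envelope = mse_env)            # envelope MSE far smaller
```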
2.3 The Predictor Envelope Model
In this section, we review the predictor envelope model [12] by showing how the idea of the response envelope could be adapted to reduce the predictor space. The predictor envelope model has the potential to improve the prediction of $\boldsymbol{Y}$ in (1.1).
To define the predictor envelope, we assume $\boldsymbol{X}$ to be stochastic, following $\mathcal{N}(\mathbf{0},{\boldsymbol{\Sigma }_{\boldsymbol{X}}})$. Then (1.1) is called the predictor envelope model if the envelope structure ${\mathcal{E}_{{\boldsymbol{\Sigma }_{\boldsymbol{X}}}}}(\mathrm{span}(\boldsymbol{\beta }))$ is imposed. Unlike the response envelope, ${\mathcal{E}_{{\boldsymbol{\Sigma }_{\boldsymbol{X}}}}}(\mathrm{span}(\boldsymbol{\beta }))$ is the intersection of all reducing subspaces of ${\boldsymbol{\Sigma }_{\boldsymbol{X}}}$ that contain $\mathrm{span}(\boldsymbol{\beta })$. The material and immaterial parts of $\boldsymbol{X}$ can be defined similarly, and the immaterial part of $\boldsymbol{X}$ is assumed to carry no information for the regression. Suppose that the orthonormal basis matrices of ${\mathcal{E}_{{\boldsymbol{\Sigma }_{\boldsymbol{X}}}}}(\mathrm{span}(\boldsymbol{\beta }))$ and its orthogonal complement are Υ and ${\boldsymbol{\Upsilon }_{0}}$. Then the coordinate form of the predictor envelope model is
(2.3)
\[ \boldsymbol{Y}={\boldsymbol{\mu }_{\boldsymbol{Y}}}+{\boldsymbol{\psi }^{T}}{\boldsymbol{\Upsilon }^{T}}\boldsymbol{X}+{\boldsymbol{\epsilon }_{\boldsymbol{Y}\mid \boldsymbol{X}}},\hspace{1em}\boldsymbol{\beta }=\boldsymbol{\Upsilon }\boldsymbol{\psi },\]
(2.4)
\[ {\boldsymbol{\Sigma }_{\boldsymbol{X}}}=\boldsymbol{\Upsilon }\boldsymbol{\Xi }{\boldsymbol{\Upsilon }^{T}}+{\boldsymbol{\Upsilon }_{0}}{\boldsymbol{\Xi }_{0}}{\boldsymbol{\Upsilon }_{0}^{T}},\]
where $\boldsymbol{\psi }$, $\boldsymbol{\Xi }={\boldsymbol{\Upsilon }^{T}}{\boldsymbol{\Sigma }_{\boldsymbol{X}}}\boldsymbol{\Upsilon }$ and ${\boldsymbol{\Xi }_{0}}={\boldsymbol{\Upsilon }_{0}^{T}}{\boldsymbol{\Sigma }_{\boldsymbol{X}}}{\boldsymbol{\Upsilon }_{0}}$ carry the coordinates of $\boldsymbol{\beta }$ relative to Υ, and of ${\boldsymbol{\Sigma }_{\boldsymbol{X}}}$ relative to Υ and ${\boldsymbol{\Upsilon }_{0}}$, respectively. Large efficiency gains can be obtained when $\| \boldsymbol{\Xi }\| \gg \| {\boldsymbol{\Xi }_{0}}\| $.
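An analogous toy sketch for the predictor envelope (again with the true Υ treated as known, and a univariate response for brevity) shows the conditioning benefit of regressing on the material part ${\boldsymbol{\Upsilon }^{T}}\boldsymbol{X}$ when $\| \boldsymbol{\Xi }\| \gg \| {\boldsymbol{\Xi }_{0}}\| $.

```r
# Toy illustration with oracle Upsilon: when the immaterial variation of X
# is tiny, OLS on all of X is poorly conditioned, while regression on the
# reduced predictors Upsilon^T X is stable.
set.seed(3)
n <- 60; p <- 10; u <- 1
Ups  <- qr.Q(qr(matrix(rnorm(p * u), p, u)))
Ups0 <- qr.Q(qr(Ups, complete = TRUE))[, (u + 1):p]
SigmaX <- Ups %*% (25 * diag(u)) %*% t(Ups) +         # Xi  (large)
          Ups0 %*% (0.1 * diag(p - u)) %*% t(Ups0)    # Xi0 (small)
beta <- Ups %*% rep(1, u)                             # span(beta) in the envelope
X <- matrix(rnorm(n * p), n, p) %*% chol(SigmaX)
y <- X %*% beta + rnorm(n)
b_ols <- solve(crossprod(X), crossprod(X, y))         # noisy along Ups0
Z <- X %*% Ups                                        # material predictors only
b_env <- Ups %*% solve(crossprod(Z), crossprod(Z, y))
c(err_ols = sum((b_ols - beta)^2), err_env = sum((b_env - beta)^2))
```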
3 Proposed Simultaneous Partial Envelope Model
As mentioned in Section 1, suppose that our predictors $\boldsymbol{X}\in {\mathbb{R}^{p}}$ can be partitioned into three parts,
\[ \boldsymbol{X}=\left(\begin{array}{c}{\boldsymbol{X}_{1}}\\ {} {\boldsymbol{X}_{2}}\end{array}\right)=\left(\begin{array}{c}{\boldsymbol{X}_{1C}}\\ {} {\boldsymbol{X}_{1D}}\\ {} {\boldsymbol{X}_{2}}\end{array}\right),\]
where ${\boldsymbol{X}_{1}}={({\boldsymbol{X}_{1C}^{T}},{\boldsymbol{X}_{1D}^{T}})^{T}}\in {\mathbb{R}^{{p_{1}}}}$ are the predictors of main interest in the multivariate linear regression, with continuous part ${\boldsymbol{X}_{1C}}\in {\mathbb{R}^{{p_{C}}}}$ and discrete part ${\boldsymbol{X}_{1D}}\in {\mathbb{R}^{{p_{D}}}}$ (${p_{1}}={p_{C}}+{p_{D}}$), while ${\boldsymbol{X}_{2}}\in {\mathbb{R}^{{p_{2}}}}$ denotes the set of predictors that are not of main interest ($p={p_{1}}+{p_{2}}$). By separating ${\boldsymbol{X}_{1D}}$ from ${\boldsymbol{X}_{1C}}$, we can avoid the Normality restriction on ${\boldsymbol{X}_{1D}}$, and hence on the whole ${\boldsymbol{X}_{1}}$, while still providing a valid simultaneous envelope estimator for the coefficients of ${\boldsymbol{X}_{1C}}$. Assuming the predictors are not centered, the standard multivariate linear model can be written as
(3.1)
\[\begin{aligned}{}\boldsymbol{Y}=& {\boldsymbol{\mu }_{\boldsymbol{Y}}}+{\boldsymbol{\beta }_{1C}^{T}}({\boldsymbol{X}_{1C}}-{\boldsymbol{\mu }_{1C}})+{\boldsymbol{\beta }_{1D}^{T}}({\boldsymbol{X}_{1D}}-{\boldsymbol{\mu }_{1D}})+\\ {} & {\boldsymbol{\beta }_{2}^{T}}({\boldsymbol{X}_{2}}-{\boldsymbol{\mu }_{2}})+{\boldsymbol{\epsilon }_{\boldsymbol{Y}\mid \boldsymbol{X}}},\end{aligned}\]
where ${\boldsymbol{\mu }_{\boldsymbol{Y}}}\in {\mathbb{R}^{r}}$, ${\boldsymbol{\mu }_{1C}}\in {\mathbb{R}^{{p_{C}}}}$, ${\boldsymbol{\mu }_{1D}}\in {\mathbb{R}^{{p_{D}}}}$ and ${\boldsymbol{\mu }_{2}}\in {\mathbb{R}^{{p_{2}}}}$ denote the unknown means of $\boldsymbol{Y}$, ${\boldsymbol{X}_{1C}}$, ${\boldsymbol{X}_{1D}}$ and ${\boldsymbol{X}_{2}}$ respectively, ${\boldsymbol{\beta }_{1C}}\in {\mathbb{R}^{{p_{C}}\times r}}$, ${\boldsymbol{\beta }_{1D}}\in {\mathbb{R}^{{p_{D}}\times r}}$ and ${\boldsymbol{\beta }_{2}}\in {\mathbb{R}^{{p_{2}}\times r}}$ denote the unknown regression coefficients of ${\boldsymbol{X}_{1C}}$, ${\boldsymbol{X}_{1D}}$ and ${\boldsymbol{X}_{2}}$, and ${\boldsymbol{\epsilon }_{\boldsymbol{Y}\mid \boldsymbol{X}}}$ is the random error vector. Combining ${\boldsymbol{\beta }_{1C}}$ and ${\boldsymbol{\beta }_{1D}}$, ${\boldsymbol{\beta }_{1}}={({\boldsymbol{\beta }_{1C}^{T}},{\boldsymbol{\beta }_{1D}^{T}})^{T}}$ denotes the coefficients of main interest. Assume ${\boldsymbol{X}_{1C}}$ to be stochastic, and ${\boldsymbol{X}_{1D}}$ and ${\boldsymbol{X}_{2}}$ to be non-stochastic, with sample means ${\overline{\boldsymbol{X}}_{1D}}$ and ${\overline{\boldsymbol{X}}_{2}}$. Replacing ${\boldsymbol{\mu }_{1D}}$ and ${\boldsymbol{\mu }_{2}}$ with ${\overline{\boldsymbol{X}}_{1D}}$ and ${\overline{\boldsymbol{X}}_{2}}$ respectively, (3.1) becomes
(3.2)
\[\begin{aligned}{}\boldsymbol{Y}=& {\boldsymbol{\mu }_{\boldsymbol{Y}}}+{\boldsymbol{\beta }_{1C}^{T}}({\boldsymbol{X}_{1C}}-{\boldsymbol{\mu }_{1C}})+{\boldsymbol{\beta }_{1D}^{T}}({\boldsymbol{X}_{1D}}-{\overline{\boldsymbol{X}}_{1D}})+\\ {} & {\boldsymbol{\beta }_{2}^{T}}({\boldsymbol{X}_{2}}-{\overline{\boldsymbol{X}}_{2}})+{\boldsymbol{\epsilon }_{\boldsymbol{Y}\mid \boldsymbol{X}}}.\end{aligned}\]
Remark.
For expositional simplicity, we implicitly assume the discrete predictors in ${\boldsymbol{X}_{1D}}$ and ${\boldsymbol{X}_{2}}$ to be quantitative variables in (3.1), (3.2) and later in (3.7). However, it is worth noting that categorical predictors, either nominal or ordinal, are still applicable to our model by including their dummy variables in ${\boldsymbol{X}_{1D}}$ and ${\boldsymbol{X}_{2}}$ instead. In either case, centering these two sets of predictors ensures that ${\boldsymbol{\mu }_{\boldsymbol{Y}}}$ can be interpreted as the expectation (conditional on model parameters) of the sample mean of $\boldsymbol{Y}$.
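As a small illustration of this Remark, a nominal genotype factor (hypothetical data) can be encoded into dummy columns for ${\boldsymbol{X}_{1D}}$ in R as follows.

```r
# Encoding a categorical predictor into X_1D via dummy variables;
# the genotype factor below is illustrative.
geno <- factor(c("AA", "Aa", "aa", "Aa", "AA"))
X1D <- model.matrix(~ geno)[, -1, drop = FALSE]  # drop the intercept column
X1D  # two 0/1 columns encoding the three genotype levels
```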
Table 1
Relationship between $\mathrm{SIMP}$ (3.7)–(3.10) and several other envelope models. The meanings of r, ${p_{C}}$, ${p_{D}}$, ${p_{1}}$ and ${p_{2}}$ are introduced in (3.1), and ${d_{\boldsymbol{X}}}$ and ${d_{\boldsymbol{Y}}}$ are the dimensions of ${\mathcal{E}_{C\mid D}}$ and ${\mathcal{E}_{\boldsymbol{Y}\mid 1}}$ respectively. The effective number of parameters for each model is also listed. Note that for the response and the partial response envelopes, ${\boldsymbol{X}_{1D}}$ is only required to be non-stochastic and not necessarily discrete.
Model | Conditions for (3.7)–(3.10) | Effective number of parameters
$\mathrm{SIMP}$ | N.A. | ${d_{\boldsymbol{Y}}}({d_{\boldsymbol{X}}}+{p_{D}})+r(r+2{p_{2}}+3)/2+{p_{C}}({p_{C}}+3)/2$ |
response envelope [15] | ${p_{C}}={p_{2}}=0$ | ${d_{\boldsymbol{Y}}}p+r(r+3)/2$ |
predictor envelope [12] | ${p_{D}}={p_{2}}=0$ and ${d_{\boldsymbol{Y}}}=r$ | $r(r+2{d_{\boldsymbol{X}}}+3)/2+p(p+3)/2$ |
simultaneous envelope [14] | ${p_{D}}={p_{2}}=0$ | ${d_{\boldsymbol{Y}}}{d_{\boldsymbol{X}}}+r(r+3)/2+p(p+3)/2$ |
partial response envelope [57] | ${p_{C}}=0$ | ${d_{\boldsymbol{Y}}}{p_{1}}+r(r+2{p_{2}}+3)/2$ |
partial predictor envelope [45] | ${p_{2}}=0$ and ${d_{\boldsymbol{Y}}}=r$ | $r(r+2{d_{\boldsymbol{X}}}+2{p_{D}}+3)/2+{p_{C}}({p_{C}}+3)/2$ |
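The effective numbers of parameters in Table 1 are easy to compute for candidate dimensions; the following R helpers transcribe two rows of the table (the dimensions in the example calls are illustrative).

```r
# Effective numbers of parameters from Table 1, as small R helpers;
# useful for comparing model complexity at candidate (d_X, d_Y).
n_par_simp <- function(r, pC, pD, p2, dX, dY)
  dY * (dX + pD) + r * (r + 2 * p2 + 3) / 2 + pC * (pC + 3) / 2
n_par_partial_response <- function(r, p1, p2, dY)
  dY * p1 + r * (r + 2 * p2 + 3) / 2
# e.g., SIMP vs the partial response envelope at illustrative dimensions:
n_par_simp(r = 6, pC = 5, pD = 2, p2 = 3, dX = 2, dY = 2)
n_par_partial_response(r = 6, p1 = 7, p2 = 3, dY = 2)
```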
We first impose some distributional assumptions on (3.2). Assume ${\boldsymbol{\epsilon }_{\boldsymbol{Y}\mid \boldsymbol{X}}}\sim {\mathcal{N}_{r}}(\mathbf{0},{\boldsymbol{\Sigma }_{\boldsymbol{Y}\mid \boldsymbol{X}}})$. Given ${\boldsymbol{X}_{1D}}$, assume that ${\boldsymbol{X}_{1C}}={\boldsymbol{\mu }_{1C}}+{\boldsymbol{\gamma }^{T}}({\boldsymbol{X}_{1D}}-{\overline{\boldsymbol{X}}_{1D}})+{\boldsymbol{\epsilon }_{C\mid D}}$ for an unknown $\boldsymbol{\gamma }\in {\mathbb{R}^{{p_{D}}\times {p_{C}}}}$, where ${\boldsymbol{\epsilon }_{C\mid D}}\sim {\mathcal{N}_{{p_{C}}}}(\mathbf{0},{\boldsymbol{\Sigma }_{C\mid D}})$ is independent of ${\boldsymbol{\epsilon }_{\boldsymbol{Y}\mid \boldsymbol{X}}}$. Rigorously speaking, under this assumption ${\boldsymbol{\mu }_{1C}}$ in (3.2) should only be interpreted as the expectation of ${\overline{\boldsymbol{X}}_{1C}}$ (rather than the expectation of ${\boldsymbol{X}_{1C}}$) conditional on the model parameters, due to the non-stochasticity of ${\overline{\boldsymbol{X}}_{1D}}$. However, we keep using ${\boldsymbol{\mu }_{1C}}$ to avoid abusing notation.
To provide efficiency gains in estimating ${\boldsymbol{\beta }_{1}}$ and predicting $\boldsymbol{Y}$, we combine the advantages of the response envelope for estimation and the predictor envelope for prediction, by further imposing the following two partial envelope structures on (3.2) (by Definition 2) simultaneously:
$(i)$ Partial predictor envelope: ${\mathcal{E}_{{\boldsymbol{\Sigma }_{C\mid D}}}}(\mathcal{L})$, i.e. the ${\boldsymbol{\Sigma }_{C\mid D}}$-envelope of $\mathcal{L}$, shortened to ${\mathcal{E}_{C\mid D}}$, with dimension ${d_{\boldsymbol{X}}}$ $(d\le {d_{\boldsymbol{X}}}\le {p_{C}})$;
$(ii)$ Partial response envelope: ${\mathcal{E}_{{\boldsymbol{\Sigma }_{\boldsymbol{Y}\mid \boldsymbol{X}}}}}(\mathcal{R})$, i.e. the ${\boldsymbol{\Sigma }_{\boldsymbol{Y}\mid \boldsymbol{X}}}$-envelope of $\mathcal{R}$, shortened to ${\mathcal{E}_{\boldsymbol{Y}\mid 1}}$, with dimension ${d_{\boldsymbol{Y}}}$ $({d_{1}}\le {d_{\boldsymbol{Y}}}\le r)$;
where $\mathcal{L}$ and $\mathcal{R}$ denote $\mathrm{span}({\boldsymbol{\beta }_{1C}})$ and $\mathrm{span}({\boldsymbol{\beta }_{1}^{T}})$, and d and ${d_{1}}$ denote $\mathrm{rank}({\boldsymbol{\beta }_{1C}})$ and $\mathrm{rank}({\boldsymbol{\beta }_{1}})$ respectively. Model (3.2) is called the Simultaneous Partial Envelope Model ($\mathrm{SIMP}$) if these distributional assumptions and envelope structures are imposed. Note that $\mathrm{SIMP}$ avoids the Normality requirement for the whole ${\boldsymbol{X}_{1}}$ (i.e., no assumptions on ${\boldsymbol{X}_{1D}}$ other than non-stochasticity, while ${\boldsymbol{X}_{1C}}$ is Normal with means depending on ${\boldsymbol{X}_{1D}}$, which relaxes the identical Normal distribution assumption on each observation of the predictors in [14]), by imposing ${\mathcal{E}_{\boldsymbol{Y}\mid 1}}$ on the row space of the whole ${\boldsymbol{\beta }_{1}}$ but ${\mathcal{E}_{C\mid D}}$ on the column space of ${\boldsymbol{\beta }_{1C}}$ only.
It is worth noting that $\mathrm{SIMP}$ is a broad class of envelopes that degenerates to several popular envelope models under the conditions listed in Table 1. Below, we introduce Conditions 3 and 4 in Section 3.1, which equivalently define $\mathrm{SIMP}$ and offer more intuition on where the efficiency gains come from when assuming these two envelope components. For the convenience of developing the Bayesian inference procedure in Section 4, we give the coordinate form of $\mathrm{SIMP}$ in (3.7)–(3.10) in Section 3.3. In preparation for Section 3.3, Section 3.2 introduces a re-parameterization trick that avoids direct Bayesian inference on matrix parameters living in the Stiefel manifold.
3.1 Equivalent Conditions for $\mathrm{SIMP}$
In this section, we introduce Conditions 3 and 4, which can equivalently define $\mathrm{SIMP}$ for (3.2). These two conditions will not be used in the formulation of $\mathrm{SIMP}$ in Section 3.3, but they offer some intuition on the source of the efficiency gains. The equivalence between the conditions in this section and the envelope structures that define $\mathrm{SIMP}$ (by Definition 2) holds for the same reason as the equivalence between Conditions 1 and 2 for the response envelope model in Section 2.2.
Condition 3.
Assume in (3.2), the subspace ${\mathcal{S}_{C\mid D}}\subset {\mathbb{R}^{{p_{C}}}}$ satisfies,
$(i)$ $\mathrm{Cov}((\boldsymbol{Y},{\mathbb{P}_{{\mathcal{S}_{C\mid D}^{\perp }}}}{\boldsymbol{X}_{1C}})\mid ({\mathbb{P}_{{\mathcal{S}_{C\mid D}}}}{\boldsymbol{X}_{1C}},{\boldsymbol{X}_{1D}},{\boldsymbol{X}_{2}}))=\mathbf{0}$,
$(ii)$ $\mathrm{Cov}(({\mathbb{P}_{{\mathcal{S}_{C\mid D}}}}{\boldsymbol{X}_{1C}},{\mathbb{P}_{{\mathcal{S}_{C\mid D}^{\perp }}}}{\boldsymbol{X}_{1C}})\mid {\boldsymbol{X}_{1D}})=\mathbf{0}$.
Condition 4.
Assume in (3.2), the subspace ${\mathcal{S}_{\boldsymbol{Y}\mid 1}}\subset {\mathbb{R}^{r}}$ satisfies,
$(I)$ ${\mathbb{P}_{{\mathcal{S}_{\boldsymbol{Y}\mid 1}^{\perp }}}}\boldsymbol{Y}\mid ({\boldsymbol{X}_{1}},{\boldsymbol{X}_{2}})\stackrel{\mathcal{D}}{=}{\mathbb{P}_{{\mathcal{S}_{\boldsymbol{Y}\mid 1}^{\perp }}}}\boldsymbol{Y}\mid {\boldsymbol{X}_{2}}$,
$(II)$ $\mathrm{Cov}(({\mathbb{P}_{{\mathcal{S}_{\boldsymbol{Y}\mid 1}}}}\boldsymbol{Y},{\mathbb{P}_{{\mathcal{S}_{\boldsymbol{Y}\mid 1}^{\perp }}}}\boldsymbol{Y})\mid ({\boldsymbol{X}_{1}},{\boldsymbol{X}_{2}}))=\mathbf{0}$.
Condition 3 excludes the direct and indirect partial effects (via ${\mathbb{P}_{{\mathcal{S}_{C\mid D}}}}{\boldsymbol{X}_{1C}}$) of ${\mathbb{P}_{{\mathcal{S}_{C\mid D}^{\perp }}}}{\boldsymbol{X}_{1C}}$ on $\boldsymbol{Y}$. The intersection of all subspaces that satisfy Condition 3 gives the same partial predictor envelope space ${\mathcal{E}_{C\mid D}}$ as defined previously by ${\mathcal{E}_{{\boldsymbol{\Sigma }_{C\mid D}}}}(\mathcal{L})$. We call ${\mathbb{P}_{{\mathcal{E}_{C\mid D}}}}{\boldsymbol{X}_{1C}}$ and ${\mathbb{P}_{{\mathcal{E}_{C\mid D}^{\perp }}}}{\boldsymbol{X}_{1C}}$ the material part and immaterial part of ${\boldsymbol{X}_{1C}}$. Given ${\boldsymbol{X}_{1D}}$ and ${\boldsymbol{X}_{2}}$, the immaterial part of ${\boldsymbol{X}_{1C}}$ is assumed not to affect $\boldsymbol{Y}$ by Condition 3. Similarly, the partial response envelope space ${\mathcal{E}_{\boldsymbol{Y}\mid 1}}$ previously defined by ${\mathcal{E}_{{\boldsymbol{\Sigma }_{\boldsymbol{Y}\mid \boldsymbol{X}}}}}(\mathcal{R})$ is also the intersection of all subspaces that satisfy Condition 4. We call ${\mathbb{P}_{{\mathcal{E}_{\boldsymbol{Y}\mid 1}}}}\boldsymbol{Y}$ and ${\mathbb{P}_{{\mathcal{E}_{\boldsymbol{Y}\mid 1}^{\perp }}}}\boldsymbol{Y}$ the material part and immaterial part of $\boldsymbol{Y}$. Given ${\boldsymbol{X}_{2}}$, the immaterial part of $\boldsymbol{Y}$ is assumed not to be affected by ${\boldsymbol{X}_{1}}$.
3.2 Re-Parameterization of the Basis Matrices
We only consider the case of $0\lt {d_{\boldsymbol{X}}}\lt {p_{C}}$ and $0\lt {d_{\boldsymbol{Y}}}\lt r$ in this section. When ${d_{\boldsymbol{X}}}\in \{0,{p_{C}}\}$ or ${d_{\boldsymbol{Y}}}\in \{0,r\}$, we can simply take the orthonormal bases of envelope space ${\mathcal{E}_{C\mid D}}$ or ${\mathcal{E}_{\boldsymbol{Y}\mid 1}}$ and the orthogonal complement subspace to be either null or the identity matrix. When $0\lt {d_{\boldsymbol{X}}}\lt {p_{C}}$ and $0\lt {d_{\boldsymbol{Y}}}\lt r$, let $\mathbf{L}\in {\mathbb{R}^{{p_{C}}\times {d_{\mathbf{X}}}}}$, ${\mathbf{L}_{0}}\in {\mathbb{R}^{{p_{C}}\times ({p_{C}}-{d_{\mathbf{X}}})}}$, $\mathbf{R}\in {\mathbb{R}^{r\times {d_{\mathbf{Y}}}}}$ and ${\mathbf{R}_{0}}\in {\mathbb{R}^{r\times (r-{d_{\mathbf{Y}}})}}$ denote the orthonormal bases of ${\mathcal{E}_{C\mid D}}$, ${\mathcal{E}_{C\mid D}^{\perp }}$, ${\mathcal{E}_{\boldsymbol{Y}\mid 1}}$ and ${\mathcal{E}_{\boldsymbol{Y}\mid 1}^{\perp }}$, respectively.
To avoid the difficulty of direct Bayesian inference on L, ${\mathbf{L}_{0}}$, R and ${\mathbf{R}_{0}}$, which all live in the Stiefel manifold, re-parameterization of L, ${\mathbf{L}_{0}}$, R and ${\mathbf{R}_{0}}$ is considered in the coordinate form of $\mathrm{SIMP}$ in Section 3.3. We illustrate the re-parameterization of L and ${\mathbf{L}_{0}}$ here, and this idea can be similarly exploited to re-parameterize R and ${\mathbf{R}_{0}}$. Let ${\mathbf{L}_{1}}$ be the matrix formed by the upper ${d_{\boldsymbol{X}}}$ rows of L, and ${\mathbf{L}_{2}}$ be the matrix that contains the remaining rows. Without loss of generality, we assume that ${\mathbf{L}_{1}}$ is nonsingular. Otherwise, we could reorder the rows of L to make ${\mathbf{L}_{1}}$ nonsingular. Then,
(3.3)
\[ \mathbf{L}=\left(\begin{array}{c}{\mathbf{L}_{1}}\\ {} {\mathbf{L}_{2}}\end{array}\right)=\left(\begin{array}{c}{\mathbf{I}_{{d_{\boldsymbol{X}}}}}\\ {} \mathbf{A}\end{array}\right){\mathbf{L}_{1}}:={\mathbf{C}_{{d_{\mathbf{X}}}}}(\mathbf{A}){\mathbf{L}_{1}},\]
where $\mathbf{A}={\mathbf{L}_{2}}{\mathbf{L}_{1}^{-1}}$ and ${\mathbf{C}_{{d_{\mathbf{X}}}}}(\mathbf{A})=\left(\begin{array}{c}{\mathbf{I}_{{d_{\boldsymbol{X}}}}}\\ {} \mathbf{A}\end{array}\right)$. [56] shows that A depends on L only through $\mathrm{span}(\mathbf{L})$ and that there is a one-to-one correspondence between ${\mathcal{E}_{C\mid D}}$ and A. [8] further shows that if ${\mathbf{C}_{{d_{\mathbf{X}}}}}(\mathbf{A})$ is a basis of ${\mathcal{E}_{C\mid D}}$, then
(3.4)
\[ {\mathbf{D}_{{p_{C}}-{d_{\mathbf{X}}}}}(\mathbf{A})=\left(\begin{array}{c}-{\mathbf{A}^{T}}\\ {} {\mathbf{I}_{{p_{C}}-{d_{\mathbf{X}}}}}\end{array}\right)\]
is a basis of ${\mathcal{E}_{C\mid D}^{\perp }}$. Therefore, after normalization,
\[ \mathbf{L}(\mathbf{A})={\mathbf{C}_{{d_{\mathbf{X}}}}}(\mathbf{A}){\big({\mathbf{C}_{{d_{\mathbf{X}}}}}{(\mathbf{A})^{T}}{\mathbf{C}_{{d_{\mathbf{X}}}}}(\mathbf{A})\big)^{-1/2}}\]
and
\[ {\mathbf{L}_{0}}(\mathbf{A})={\mathbf{D}_{{p_{C}}-{d_{\mathbf{X}}}}}(\mathbf{A}){\big({\mathbf{D}_{{p_{C}}-{d_{\mathbf{X}}}}}{(\mathbf{A})^{T}}{\mathbf{D}_{{p_{C}}-{d_{\mathbf{X}}}}}(\mathbf{A})\big)^{-1/2}}\]
are a pair of orthonormal bases of ${\mathcal{E}_{C\mid D}}$ and ${\mathcal{E}_{C\mid D}^{\perp }}$ respectively. The re-parameterization
\[ \mathbf{R}(\mathbf{B})={\mathbf{C}_{{d_{\mathbf{Y}}}}}(\mathbf{B}){\big({\mathbf{C}_{{d_{\mathbf{Y}}}}}{(\mathbf{B})^{T}}{\mathbf{C}_{{d_{\mathbf{Y}}}}}(\mathbf{B})\big)^{-1/2}}\]
and
\[ {\mathbf{R}_{0}}(\mathbf{B})={\mathbf{D}_{r-{d_{\mathbf{Y}}}}}(\mathbf{B}){\big({\mathbf{D}_{r-{d_{\mathbf{Y}}}}}{(\mathbf{B})^{T}}{\mathbf{D}_{r-{d_{\mathbf{Y}}}}}(\mathbf{B})\big)^{-1/2}}\]
by an unconstrained matrix $\mathbf{B}\in {\mathbb{R}^{(r-{d_{\boldsymbol{Y}}})\times {d_{\boldsymbol{Y}}}}}$ can be constructed similarly.
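The following base-R sketch of the re-parameterization (3.3)–(3.4) maps an unconstrained A to $\mathbf{L}(\mathbf{A})$ and ${\mathbf{L}_{0}}(\mathbf{A})$ and checks their orthonormality; the inverse square root helper is our own construction, and the dimensions are illustrative.

```r
# Re-parameterization (3.3)-(3.4): from unconstrained A to orthonormal bases.
C_mat <- function(A) rbind(diag(ncol(A)), A)             # C_d(A)
D_mat <- function(A) rbind(-t(A), diag(nrow(A)))         # D_{pC-d}(A)
inv_sqrt <- function(S) {                                # S^{-1/2} for S > 0
  e <- eigen(S, symmetric = TRUE)
  e$vectors %*% diag(1 / sqrt(e$values), nrow(S)) %*% t(e$vectors)
}
pC <- 5; dX <- 2
A <- matrix(rnorm((pC - dX) * dX), pC - dX, dX)          # unconstrained matrix
L  <- C_mat(A) %*% inv_sqrt(crossprod(C_mat(A)))         # L(A):  pC x dX
L0 <- D_mat(A) %*% inv_sqrt(crossprod(D_mat(A)))         # L0(A): pC x (pC - dX)
stopifnot(isTRUE(all.equal(crossprod(L), diag(dX))),     # orthonormal columns
          isTRUE(all.equal(crossprod(L0), diag(pC - dX))),
          max(abs(crossprod(L, L0))) < 1e-12)            # L(A) orthogonal to L0(A)
```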
3.3 Coordinate Form of SIMP
Like (2.1)–(2.2) for the response envelope model and (2.3)–(2.4) for the predictor envelope model, we now give the coordinate form of $\mathrm{SIMP}$ explicitly as (3.7)–(3.10), i.e., we formulate the two envelope structures ${\mathcal{E}_{C\mid D}}$ and ${\mathcal{E}_{\boldsymbol{Y}\mid 1}}$ through analytical expressions for ${\boldsymbol{\beta }_{1C}}$, ${\boldsymbol{\beta }_{1D}}$, ${\boldsymbol{\Sigma }_{C\mid D}}$ and ${\boldsymbol{\Sigma }_{\boldsymbol{Y}\mid \boldsymbol{X}}}$, and rewrite (3.2) accordingly, in preparation for the Bayesian inference in Section 4.
Let ${\boldsymbol{\beta }_{1C}}=\mathbf{U}\mathbf{D}{\mathbf{V}^{T}}$ be the singular value decomposition, where $\mathbf{U}\in {\mathbb{R}^{{p_{C}}\times d}}$ and $\mathbf{V}\in {\mathbb{R}^{r\times d}}$ are semi-orthogonal matrices, and $\mathbf{D}=\mathrm{diag}({\lambda _{1}},\dots ,{\lambda _{d}})$, with ${\lambda _{i}}\ge 0$, $i=1,\dots ,d$, being the d singular values of ${\boldsymbol{\beta }_{1C}}$, where $d=\mathrm{rank}({\boldsymbol{\beta }_{1C}})$. Since $\mathcal{L}=\mathrm{span}({\boldsymbol{\beta }_{1C}})=\mathrm{span}(\mathbf{U})$ and $\mathcal{R}=\mathrm{span}({\boldsymbol{\beta }_{1}^{T}})=\mathrm{span}({\boldsymbol{\beta }_{1C}^{T}},{\boldsymbol{\beta }_{1D}^{T}})=\mathrm{span}(\mathbf{V},{\boldsymbol{\beta }_{1D}^{T}})$, according to the definitions of ${\mathcal{E}_{C\mid D}}$ and ${\mathcal{E}_{\boldsymbol{Y}\mid 1}}$ by Definition 2 at the start of Section 3, $\mathbf{U}=\mathbf{L}\mathbf{O}$ and $\mathbf{V}=\mathbf{R}\mathbf{P}$ for some $\mathbf{O}\in {\mathbb{R}^{{d_{\boldsymbol{X}}}\times d}}$ and $\mathbf{P}\in {\mathbb{R}^{{d_{\boldsymbol{Y}}}\times d}}$. Hence the coordinate form of ${\boldsymbol{\beta }_{1C}}$ is
(3.5)
\[ {\boldsymbol{\beta }_{1C}}=(\mathbf{L}\mathbf{O})\mathbf{D}{(\mathbf{R}\mathbf{P})^{T}}:=\mathbf{L}{\boldsymbol{\eta }_{C}^{T}}{\mathbf{R}^{T}},\]
where ${\boldsymbol{\eta }_{C}}=\mathbf{P}\mathbf{D}{\mathbf{O}^{T}}\in {\mathbb{R}^{{d_{\boldsymbol{Y}}}\times {d_{\boldsymbol{X}}}}}$ is the coordinate of ${\boldsymbol{\beta }_{1C}^{T}}$ relative to R and ${\mathbf{L}^{T}}$. Also,
(3.6)
\[ {\boldsymbol{\beta }_{1D}}={\boldsymbol{\eta }_{D}^{T}}{\mathbf{R}^{T}}\]
for some ${\boldsymbol{\eta }_{D}}\in {\mathbb{R}^{{d_{\boldsymbol{Y}}}\times {p_{D}}}}$, which carries the coordinate of ${\boldsymbol{\beta }_{1D}^{T}}$ relative to R. Then the coordinate form of $\mathrm{SIMP}$ is given by
(3.7)
\[\begin{aligned}{}\boldsymbol{Y}& ={\boldsymbol{\mu }_{\boldsymbol{Y}}}+\mathbf{R}(\mathbf{B}){\boldsymbol{\eta }_{C}}\mathbf{L}{(\mathbf{A})^{T}}({\boldsymbol{X}_{1C}}-{\boldsymbol{\mu }_{1C}})+\mathbf{R}(\mathbf{B}){\boldsymbol{\eta }_{D}}\\ {} & \hspace{1em}({\boldsymbol{X}_{1D}}-{\overline{\boldsymbol{X}}_{1D}})+{\boldsymbol{\beta }_{2}^{T}}({\boldsymbol{X}_{2}}-{\overline{\boldsymbol{X}}_{2}})+{\boldsymbol{\epsilon }_{\boldsymbol{Y}\mid \boldsymbol{X}}},\end{aligned}\]
(3.8)
\[\begin{aligned}{}{\boldsymbol{X}_{1C}}& ={\boldsymbol{\mu }_{1C}}+{\boldsymbol{\gamma }^{T}}({\boldsymbol{X}_{1D}}-{\overline{\boldsymbol{X}}_{1D}})+{\boldsymbol{\epsilon }_{C\mid D}},\end{aligned}\]
(3.9)
\[ {\boldsymbol{\Sigma }_{C\mid D}}=\mathbf{L}(\mathbf{A})\boldsymbol{\Omega }\mathbf{L}{(\mathbf{A})^{T}}+{\mathbf{L}_{0}}(\mathbf{A}){\boldsymbol{\Omega }_{0}}{\mathbf{L}_{0}}{(\mathbf{A})^{T}},\]
(3.10)
\[ {\boldsymbol{\Sigma }_{\boldsymbol{Y}\mid \boldsymbol{X}}}=\mathbf{R}(\mathbf{B})\boldsymbol{\Phi }\mathbf{R}{(\mathbf{B})^{T}}+{\mathbf{R}_{0}}(\mathbf{B}){\boldsymbol{\Phi }_{0}}{\mathbf{R}_{0}}{(\mathbf{B})^{T}},\]
where ${\boldsymbol{\beta }_{1C}}=\mathbf{L}(\mathbf{A}){\boldsymbol{\eta }_{C}^{T}}\mathbf{R}{(\mathbf{B})^{T}}$ and ${\boldsymbol{\beta }_{1D}}={\boldsymbol{\eta }_{D}^{T}}\mathbf{R}{(\mathbf{B})^{T}}$, as illustrated in (3.5) and (3.6). This implies that ${\boldsymbol{\beta }_{1C}}$ depends on both the partial predictor and the partial response envelopes, through its rows and columns, while ${\boldsymbol{\beta }_{1D}}$ depends on the partial response envelope only, through its columns. Note that $\boldsymbol{\Omega }=\mathbf{L}{(\mathbf{A})^{T}}{\boldsymbol{\Sigma }_{C\mid D}}\mathbf{L}(\mathbf{A})$, ${\boldsymbol{\Omega }_{0}}={\mathbf{L}_{0}}{(\mathbf{A})^{T}}{\boldsymbol{\Sigma }_{C\mid D}}{\mathbf{L}_{0}}(\mathbf{A})$, $\boldsymbol{\Phi }=\mathbf{R}{(\mathbf{B})^{T}}{\boldsymbol{\Sigma }_{\boldsymbol{Y}\mid \boldsymbol{X}}}\mathbf{R}(\mathbf{B})$ and ${\boldsymbol{\Phi }_{0}}={\mathbf{R}_{0}}{(\mathbf{B})^{T}}{\boldsymbol{\Sigma }_{\boldsymbol{Y}\mid \boldsymbol{X}}}{\mathbf{R}_{0}}(\mathbf{B})$ carry the coordinates of ${\boldsymbol{\Sigma }_{C\mid D}}$ relative to $\mathbf{L}(\mathbf{A})$ and ${\mathbf{L}_{0}}(\mathbf{A})$, and of ${\boldsymbol{\Sigma }_{\boldsymbol{Y}\mid \boldsymbol{X}}}$ relative to $\mathbf{R}(\mathbf{B})$ and ${\mathbf{R}_{0}}(\mathbf{B})$, respectively. Multiplying (3.7) on both sides by $\mathbf{R}{(\mathbf{B})^{T}}$ or ${\mathbf{R}_{0}}{(\mathbf{B})^{T}}$ gives
\[ \left\{\begin{aligned}{}\mathbf{R}{(\mathbf{B})^{T}}\boldsymbol{Y}=& \mathbf{R}{(\mathbf{B})^{T}}{\boldsymbol{\mu }_{\boldsymbol{Y}}}+{\boldsymbol{\eta }_{C}}(\mathbf{L}{(\mathbf{A})^{T}}({\boldsymbol{X}_{1C}}-{\boldsymbol{\mu }_{1C}}))+\\ {} & {\boldsymbol{\eta }_{D}}({\boldsymbol{X}_{1D}}-{\overline{\boldsymbol{X}}_{1D}})+\mathbf{R}{(\mathbf{B})^{T}}{\boldsymbol{\beta }_{2}^{T}}({\boldsymbol{X}_{2}}-{\overline{\boldsymbol{X}}_{2}})+\\ {} & \mathbf{R}{(\mathbf{B})^{T}}{\boldsymbol{\epsilon }_{\boldsymbol{Y}\mid \boldsymbol{X}}},\\ {} {\mathbf{R}_{0}}{(\mathbf{B})^{T}}\boldsymbol{Y}=& {\mathbf{R}_{0}}{(\mathbf{B})^{T}}{\boldsymbol{\mu }_{\boldsymbol{Y}}}+{\mathbf{R}_{0}}{(\mathbf{B})^{T}}{\boldsymbol{\beta }_{2}^{T}}({\boldsymbol{X}_{2}}-{\overline{\boldsymbol{X}}_{2}})+\\ {} & {\mathbf{R}_{0}}{(\mathbf{B})^{T}}{\boldsymbol{\epsilon }_{\boldsymbol{Y}\mid \boldsymbol{X}}}.\end{aligned}\hspace{1em}\right.\]
This pair of equations shows that ${\boldsymbol{X}_{2}}$ affects both the material and the immaterial parts of $\boldsymbol{Y}$, that ${\boldsymbol{X}_{1D}}$ affects only the material part of $\boldsymbol{Y}$, and that only the material part of ${\boldsymbol{X}_{1C}}$ affects the material part of $\boldsymbol{Y}$. The estimation of ${\boldsymbol{\beta }_{1C}}$ benefits from removing the redundant variation of both ${\mathbf{R}_{0}}{(\mathbf{B})^{T}}\boldsymbol{Y}$ and ${\mathbf{L}_{0}}{(\mathbf{A})^{T}}{\boldsymbol{X}_{1C}}$, whereas the efficiency gains for estimating ${\boldsymbol{\beta }_{1D}}$ come only from removing the redundant variation of ${\mathbf{R}_{0}}{(\mathbf{B})^{T}}\boldsymbol{Y}$.
Inherited from the (partial) predictor and the (partial) response envelope models, the partial predictor envelope component of $\mathrm{SIMP}$ offers large efficiency gains when $\| \boldsymbol{\Omega }\| \gg \| {\boldsymbol{\Omega }_{0}}\| $, and significant advantages of the partial response envelope component require $\| \boldsymbol{\Phi }\| \ll \| {\boldsymbol{\Phi }_{0}}\| $.
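For intuition, the following hedged R sketch simulates one dataset from the coordinate form (3.7)–(3.10); all dimensions and parameter values are illustrative, random semi-orthogonal matrices stand in for $\mathbf{L}(\mathbf{A})$, ${\mathbf{L}_{0}}(\mathbf{A})$, $\mathbf{R}(\mathbf{B})$ and ${\mathbf{R}_{0}}(\mathbf{B})$, and the Ω, ${\boldsymbol{\Omega }_{0}}$, Φ, ${\boldsymbol{\Phi }_{0}}$ are chosen so that both envelope components are favorable.

```r
# Simulate one dataset from (3.7)-(3.10); illustrative values throughout.
set.seed(1)
n <- 200; r <- 6; pC <- 5; pD <- 2; p2 <- 3; dX <- 2; dY <- 2
orth <- function(p, d) qr.Q(qr(matrix(rnorm(p * d), p, d)))
LA <- orth(pC, dX); LA0 <- qr.Q(qr(LA), complete = TRUE)[, (dX + 1):pC]
RB <- orth(r, dY);  RB0 <- qr.Q(qr(RB), complete = TRUE)[, (dY + 1):r]
etaC  <- matrix(rnorm(dY * dX), dY, dX)
etaD  <- matrix(rnorm(dY * pD), dY, pD)
beta2 <- matrix(rnorm(p2 * r), p2, r)
gam   <- matrix(rnorm(pD * pC), pD, pC)
Omega <- diag(dX);        Omega0 <- 0.01 * diag(pC - dX) # ||Omega|| >> ||Omega0||
Phi   <- 0.01 * diag(dY); Phi0   <- diag(r - dY)         # ||Phi||  << ||Phi0||
SigCD <- LA %*% Omega %*% t(LA) + LA0 %*% Omega0 %*% t(LA0)   # (3.9)
SigYX <- RB %*% Phi %*% t(RB) + RB0 %*% Phi0 %*% t(RB0)       # (3.10)
X1D <- matrix(rbinom(n * pD, 2, 0.3), n, pD)   # SNP-like minor allele counts
X2  <- matrix(rnorm(n * p2), n, p2)
X1Dc <- scale(X1D, scale = FALSE); X2c <- scale(X2, scale = FALSE)
X1C <- X1Dc %*% gam +                          # (3.8), with mu_1C = 0 here
       matrix(rnorm(n * pC), n, pC) %*% chol(SigCD)
Y <- X1C %*% LA %*% t(etaC) %*% t(RB) +        # material part of X1C only
     X1Dc %*% t(etaD) %*% t(RB) + X2c %*% beta2 +
     matrix(rnorm(n * r), n, r) %*% chol(SigYX)               # (3.7)
```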
4 Bayesian Inference
In this section, we develop a Bayesian procedure for the statistical inference of $\mathrm{SIMP}$, hence our method is called the Bayesian simultaneous partial envelope model. It is implemented through sampling from the posterior distribution of parameters. The determination of the posterior distribution requires both the likelihood function and prior distributions.
Suppose that we have n independent observations $\mathcal{D}={\{{\boldsymbol{Y}_{i}},{\boldsymbol{X}_{1C,i}},{\boldsymbol{X}_{1D,i}},{\boldsymbol{X}_{2,i}}\}_{i=1}^{n}}$ from SIMP. Let $\mathbb{Y}={({\boldsymbol{Y}_{1}},\dots ,{\boldsymbol{Y}_{n}})^{T}}\in {\mathbb{R}^{n\times r}}$, ${\mathbb{X}_{1C}}={({\boldsymbol{X}_{1C,1}},\dots ,{\boldsymbol{X}_{1C,n}})^{T}}\in {\mathbb{R}^{n\times {p_{C}}}}$, ${\mathbb{X}_{1D}}=({\boldsymbol{X}_{1D,1}},\dots $, ${\boldsymbol{X}_{1D,n}}{)^{T}}\in {\mathbb{R}^{n\times {p_{D}}}}$ and ${\mathbb{X}_{2}}={({\boldsymbol{X}_{2,1}},\dots ,{\boldsymbol{X}_{2,n}})^{T}}\in {\mathbb{R}^{n\times {p_{2}}}}$ be data matrices. For notational simplicity, we consider standardized data matrices $\widetilde{\mathbb{Y}}=\mathbb{Y}-{\mathbf{1}_{n}}{\boldsymbol{\mu }_{\boldsymbol{Y}}^{T}}$, ${\widetilde{\mathbb{X}}_{1C}}={\mathbb{X}_{1C}}-{\mathbf{1}_{n}}{\boldsymbol{\mu }_{1C}^{T}}$, ${\widetilde{\mathbb{X}}_{1D}}={\mathbb{X}_{1D}}-{\mathbf{1}_{n}}{\overline{\boldsymbol{X}}_{1D}^{T}}$ and ${\widetilde{\mathbb{X}}_{2}}={\mathbb{X}_{2}}-{\mathbf{1}_{n}}{\overline{\boldsymbol{X}}_{2}^{T}}$ for $\mathbb{Y}$, ${\mathbb{X}_{1C}}$, ${\mathbb{X}_{1D}}$ and ${\mathbb{X}_{2}}$, respectively. Then, for fixed envelope dimensions ${d_{\boldsymbol{X}}}$ and ${d_{\boldsymbol{Y}}}$, the log-likelihood function for $\mathrm{SIMP}$ (3.7)–(3.10) is given by
(4.1)
\[\begin{aligned}{}l(\boldsymbol{\Theta })=& \textit{const}-\frac{n}{2}\log (|\boldsymbol{\Omega }|)-\frac{n}{2}\log (|{\boldsymbol{\Omega }_{0}}|)-\frac{1}{2}\mathrm{tr}\big\{({\widetilde{\mathbb{X}}_{1C}}-\\ {} & {\widetilde{\mathbb{X}}_{1D}}\boldsymbol{\gamma }){\big(\mathbf{L}(\mathbf{A})\boldsymbol{\Omega }\mathbf{L}{(\mathbf{A})^{T}}+{\mathbf{L}_{0}}(\mathbf{A}){\boldsymbol{\Omega }_{0}}{\mathbf{L}_{0}}{(\mathbf{A})^{T}}\big)^{-1}}\\ {} & {({\widetilde{\mathbb{X}}_{1C}}-{\widetilde{\mathbb{X}}_{1D}}\boldsymbol{\gamma })^{T}}\big\}-\frac{n}{2}\log (|\boldsymbol{\Phi }|)-\frac{n}{2}\log (|{\boldsymbol{\Phi }_{0}}|)-\\ {} & \frac{1}{2}\mathrm{tr}\big\{\big(\widetilde{\mathbb{Y}}-{\widetilde{\mathbb{X}}_{1C}}\mathbf{L}(\mathbf{A}){\boldsymbol{\eta }_{C}^{T}}\mathbf{R}{(\mathbf{B})^{T}}-{\widetilde{\mathbb{X}}_{1D}}{\boldsymbol{\eta }_{D}^{T}}\mathbf{R}{(\mathbf{B})^{T}}-\\ {} & {\widetilde{\mathbb{X}}_{2}}{\boldsymbol{\beta }_{2}}\big){\big(\mathbf{R}(\mathbf{B})\boldsymbol{\Phi }\mathbf{R}{(\mathbf{B})^{T}}+{\mathbf{R}_{0}}(\mathbf{B}){\boldsymbol{\Phi }_{0}}{\mathbf{R}_{0}}{(\mathbf{B})^{T}}\big)^{-1}}\\ {} & \big(\widetilde{\mathbb{Y}}-{\widetilde{\mathbb{X}}_{1C}}\mathbf{L}(\mathbf{A}){\boldsymbol{\eta }_{C}^{T}}\mathbf{R}{(\mathbf{B})^{T}}-{\widetilde{\mathbb{X}}_{1D}}{\boldsymbol{\eta }_{D}^{T}}\mathbf{R}{(\mathbf{B})^{T}}-\\ {} & {\widetilde{\mathbb{X}}_{2}}{\boldsymbol{\beta }_{2}}\big){^{T}}\big\},\end{aligned}\]
where const denotes a constant that does not depend on the parameters of the model, and $\boldsymbol{\Theta }=\{{\boldsymbol{\mu }_{1C}}$, ${\boldsymbol{\mu }_{\boldsymbol{Y}}}$, ${\boldsymbol{\beta }_{2}}$, $\boldsymbol{\gamma }$, ${\boldsymbol{\eta }_{C}}$, ${\boldsymbol{\eta }_{D}}$, A, B, Ω, ${\boldsymbol{\Omega }_{0}}$, Φ, ${\boldsymbol{\Phi }_{0}}\}$.
To perform Bayesian inference on Θ for $\mathrm{SIMP}$, we put the following prior distributions on Θ:
\[\begin{aligned}{}\pi ({\boldsymbol{\mu }_{1C}},{\boldsymbol{\mu }_{\boldsymbol{Y}}})& \propto 1,\\ {} {\boldsymbol{\beta }_{2}}\mid \boldsymbol{\Phi },{\boldsymbol{\Phi }_{0}},\mathbf{B}& \sim {\mathcal{MN}_{{p_{2}},r}}\big({\mathbf{M}^{-1}}\mathbf{Z},{\mathbf{M}^{-1}},{\boldsymbol{\Sigma }_{\boldsymbol{Y}\mid \boldsymbol{X}}}\big),\\ {} \boldsymbol{\gamma }\mid \boldsymbol{\Omega },{\boldsymbol{\Omega }_{0}},\mathbf{A}& \sim {\mathcal{MN}_{{p_{D}},{p_{C}}}}\big({\boldsymbol{\Lambda }^{-1}}\mathbf{F},{\boldsymbol{\Lambda }^{-1}},{\boldsymbol{\Sigma }_{C\mid D}}\big),\\ {} \boldsymbol{\Omega }& \sim {\mathcal{IW}_{{d_{\boldsymbol{X}}}}}({\boldsymbol{\Psi }_{\boldsymbol{X}}},{w_{\boldsymbol{X}}}),\\ {} {\boldsymbol{\Omega }_{0}}& \sim {\mathcal{IW}_{{p_{C}}-{d_{\boldsymbol{X}}}}}({\boldsymbol{\Psi }_{{\boldsymbol{X}_{0}}}},{w_{{\boldsymbol{X}_{0}}}}),\\ {} \boldsymbol{\Phi }& \sim {\mathcal{IW}_{{d_{\boldsymbol{Y}}}}}({\boldsymbol{\Psi }_{\boldsymbol{Y}}},{w_{\boldsymbol{Y}}}),\\ {} {\boldsymbol{\Phi }_{0}}& \sim {\mathcal{IW}_{r-{d_{\boldsymbol{Y}}}}}({\boldsymbol{\Psi }_{{\boldsymbol{Y}_{0}}}},{w_{{\boldsymbol{Y}_{0}}}}),\\ {} \mathbf{A}& \sim {\mathcal{MN}_{{p_{C}}-{d_{\boldsymbol{X}}},{d_{\boldsymbol{X}}}}}({\mathbf{A}_{0}},{\mathbf{K}_{\mathbf{A}}},{\boldsymbol{\Sigma }_{\mathbf{A}}}),\\ {} \mathbf{B}& \sim {\mathcal{MN}_{r-{d_{\boldsymbol{Y}}},{d_{\boldsymbol{Y}}}}}({\mathbf{B}_{0}},{\mathbf{K}_{\mathbf{B}}},{\boldsymbol{\Sigma }_{\mathbf{B}}}),\\ {} {\boldsymbol{\eta }_{C}}\mid \boldsymbol{\Phi }& \sim {\mathcal{MN}_{{d_{\boldsymbol{Y}}},{d_{\boldsymbol{X}}}}}\big(\mathbf{W}{\mathbf{E}^{-1}},\boldsymbol{\Phi },{\mathbf{E}^{-1}}\big),\\ {} {\boldsymbol{\eta }_{D}}\mid \boldsymbol{\Phi }& \sim {\mathcal{MN}_{{d_{\boldsymbol{Y}}},{p_{D}}}}\big(\mathbf{T}{\mathbf{Q}^{-1}},\boldsymbol{\Phi },{\mathbf{Q}^{-1}}\big),\end{aligned}\]
where $\mathcal{MN}$ and $\mathcal{IW}$ denote the Matrix normal and Inverse-Wishart distributions; see Appendix E for a detailed introduction of these two distributions. Here we implicitly assume ${\mathbf{A}_{0}}\in {\mathbb{R}^{({p_{C}}-{d_{\boldsymbol{X}}})\times {d_{\boldsymbol{X}}}}}$, ${\mathbf{B}_{0}}\in {\mathbb{R}^{(r-{d_{\boldsymbol{Y}}})\times {d_{\boldsymbol{Y}}}}}$, $\mathbf{Z}\in {\mathbb{R}^{{p_{2}}\times r}}$, $\mathbf{F}\in {\mathbb{R}^{{p_{D}}\times {p_{C}}}}$, $\mathbf{W}\in {\mathbb{R}^{{d_{\boldsymbol{Y}}}\times {d_{\boldsymbol{X}}}}}$, $\mathbf{T}\in {\mathbb{R}^{{d_{\boldsymbol{Y}}}\times {p_{D}}}}$, ${w_{\boldsymbol{X}}}\gt {d_{\boldsymbol{X}}}-1$, ${w_{{\boldsymbol{X}_{0}}}}\gt {p_{C}}-{d_{\boldsymbol{X}}}-1$, ${w_{\boldsymbol{Y}}}\gt {d_{\boldsymbol{Y}}}-1$, ${w_{{\boldsymbol{Y}_{0}}}}\gt r-{d_{\boldsymbol{Y}}}-1$ and $\mathbf{M}\in {\mathbb{S}_{+}^{{p_{2}}\times {p_{2}}}}$, $\boldsymbol{\Lambda },\mathbf{Q}\in {\mathbb{S}_{+}^{{p_{D}}\times {p_{D}}}}$, ${\boldsymbol{\Psi }_{\boldsymbol{X}}},{\boldsymbol{\Sigma }_{\mathbf{A}}},\mathbf{E}\in {\mathbb{S}_{+}^{{d_{\boldsymbol{X}}}\times {d_{\boldsymbol{X}}}}}$, ${\boldsymbol{\Psi }_{\boldsymbol{Y}}},{\boldsymbol{\Sigma }_{\mathbf{B}}}\in {\mathbb{S}_{+}^{{d_{\boldsymbol{Y}}}\times {d_{\boldsymbol{Y}}}}}$, ${\boldsymbol{\Psi }_{{\boldsymbol{X}_{0}}}},{\mathbf{K}_{\mathbf{A}}}\in {\mathbb{S}_{+}^{({p_{C}}-{d_{\boldsymbol{X}}})\times ({p_{C}}-{d_{\boldsymbol{X}}})}}$, ${\boldsymbol{\Psi }_{{\boldsymbol{Y}_{0}}}},{\mathbf{K}_{\mathbf{B}}}\in {\mathbb{S}_{+}^{(r-{d_{\boldsymbol{Y}}})\times (r-{d_{\boldsymbol{Y}}})}}$. Prior information on the parameters can be easily incorporated through these hyperparameters. For example, if we know a priori that the partial predictor envelope ${\mathcal{E}_{C\mid D}}$ is likely to be ${\widehat{\mathcal{E}}_{C\mid D}^{prior}}$, then we can compute ${\widehat{\mathbf{A}}^{prior}}$ from the basis of ${\widehat{\mathcal{E}}_{C\mid D}^{prior}}$ by (3.3), and set ${\mathbf{A}_{0}}$ to ${\widehat{\mathbf{A}}^{prior}}$. Our confidence in the prior for A is encoded in ${\mathbf{K}_{\mathbf{A}}}$ and ${\boldsymbol{\Sigma }_{\mathbf{A}}}$. Specifically, the prior covariance matrices of the i-th row and the j-th column of A are ${\mathbf{K}_{\mathbf{A},ii}}{\boldsymbol{\Sigma }_{\mathbf{A}}}$ and ${\boldsymbol{\Sigma }_{\mathbf{A},jj}}{\mathbf{K}_{\mathbf{A}}}$ respectively, where ${\mathbf{K}_{\mathbf{A},ii}}$ and ${\boldsymbol{\Sigma }_{\mathbf{A},jj}}$ are the i-th and j-th diagonal elements of ${\mathbf{K}_{\mathbf{A}}}$ and ${\boldsymbol{\Sigma }_{\mathbf{A}}}$; small ${\mathbf{K}_{\mathbf{A},ii}}$'s or ${\boldsymbol{\Sigma }_{\mathbf{A},jj}}$'s reflect a strong prior belief in the corresponding row(s) or column(s) of ${\widehat{\mathbf{A}}^{prior}}$. The Bayesian inference is made via a Metropolis-within-Gibbs algorithm that generates samples from the posterior distribution of the parameters Θ. A number of posterior samples at early iterations are discarded as burn-in.
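As an illustration, the following R list collects one set of weakly informative hyperparameters satisfying the constraints above; the values are placeholders of our own choosing, not the defaults of the $\mathrm{SIMP}$ package.

```r
# Illustrative hyperparameters satisfying the stated constraints
# (e.g. w_X > d_X - 1); zero prior means and identity scale matrices.
r <- 6; pC <- 5; pD <- 2; p2 <- 3; dX <- 2; dY <- 2
hyper <- list(
  A0 = matrix(0, pC - dX, dX), KA = diag(pC - dX), SigA = diag(dX),
  B0 = matrix(0, r - dY, dY),  KB = diag(r - dY),  SigB = diag(dY),
  Z = matrix(0, p2, r),  M = diag(p2),         # prior for beta2
  F = matrix(0, pD, pC), Lambda = diag(pD),    # prior for gamma
  W = matrix(0, dY, dX), E = diag(dX),         # prior for etaC
  T = matrix(0, dY, pD), Q = diag(pD),         # prior for etaD
  PsiX  = diag(dX),      wX  = dX + 2,         # w_X  > d_X - 1
  PsiX0 = diag(pC - dX), wX0 = pC - dX + 2,    # w_X0 > pC - d_X - 1
  PsiY  = diag(dY),      wY  = dY + 2,         # w_Y  > d_Y - 1
  PsiY0 = diag(r - dY),  wY0 = r - dY + 2      # w_Y0 > r - d_Y - 1
)
```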
Bayesian inference is carried out via a Metropolis-within-Gibbs algorithm that generates samples from the posterior distribution of the parameters Θ, with a number of posterior samples at early iterations discarded as burn-in. Suppose the retained posterior samples are ${\mathcal{D}^{\ast }}={\{{\boldsymbol{\Theta }^{(s)}}\}_{s=1}^{S}}:=\{{\boldsymbol{\mu }_{1C}^{(s)}}$, ${\boldsymbol{\mu }_{\boldsymbol{Y}}^{(s)}}$, ${\boldsymbol{\beta }_{2}^{(s)}}$, ${\boldsymbol{\gamma }^{(s)}}$, ${\boldsymbol{\eta }_{C}^{(s)}}$, ${\boldsymbol{\eta }_{D}^{(s)}}$, ${\mathbf{A}^{(s)}}$, ${\mathbf{B}^{(s)}}$, ${\boldsymbol{\Omega }^{(s)}}$, ${\boldsymbol{\Omega }_{0}^{(s)}}$, ${\boldsymbol{\Phi }^{(s)}}$, ${\boldsymbol{\Phi }_{0}^{(s)}}{\}_{s=1}^{S}}$ with sample size S. The algorithm is provided in Appendix A, and the warm-start initial estimator that we propose for the MCMC algorithm is detailed in Appendix B.
5 Theoretical Properties
In this section, we establish two theoretical properties of our method. We first show, in Theorem 1, the propriety of the joint posterior density of all parameters, even though the priors of ${\boldsymbol{\mu }_{1C}}$ and ${\boldsymbol{\mu }_{\boldsymbol{Y}}}$ are improper. Proofs are left to Appendix D.
Theorem 1.
The posterior density of $({\boldsymbol{\mu }_{1C}},{\boldsymbol{\mu }_{\boldsymbol{Y}}},{\boldsymbol{\beta }_{2}},\boldsymbol{\gamma },{\boldsymbol{\eta }_{C}}$, ${\boldsymbol{\eta }_{D}}$, A, B, Ω, ${\boldsymbol{\Omega }_{0}}$, Φ, ${\boldsymbol{\Phi }_{0}})$ with respect to the Lebesgue measure on ${\mathbb{R}^{{p_{C}}}}\times {\mathbb{R}^{r}}\times {\mathbb{R}^{{p_{2}}\times r}}\times {\mathbb{R}^{{p_{D}}\times {p_{C}}}}\times {\mathbb{R}^{{d_{\boldsymbol{Y}}}\times {d_{\boldsymbol{X}}}}}\times {\mathbb{R}^{{d_{\boldsymbol{Y}}}\times {p_{D}}}}\times {\mathbb{R}^{({p_{C}}-{d_{\boldsymbol{X}}})\times {d_{\boldsymbol{X}}}}}\times {\mathbb{R}^{(r-{d_{\boldsymbol{Y}}})\times {d_{\boldsymbol{Y}}}}}\times {\mathbb{S}_{+}^{{d_{\boldsymbol{X}}}\times {d_{\boldsymbol{X}}}}}\times {\mathbb{S}_{+}^{({p_{C}}-{d_{\boldsymbol{X}}})\times ({p_{C}}-{d_{\boldsymbol{X}}})}}\times {\mathbb{S}_{+}^{{d_{\boldsymbol{Y}}}\times {d_{\boldsymbol{Y}}}}}\times {\mathbb{S}_{+}^{(r-{d_{\boldsymbol{Y}}})\times (r-{d_{\boldsymbol{Y}}})}}$ is proper.
Theorem 2 establishes the Harris ergodicity of the Metropolis-within-Gibbs algorithm developed for the Bayesian inference of $\mathrm{SIMP}$. This property ensures the asymptotic convergence of the Markov chain generated by our algorithm to the joint posterior distribution, irrespective of the starting point. Without this guarantee, convergence may fail for starting points in some set of measure zero; although such a set is null, starting points inside it can arise naturally in practice, as illustrated for example in [48]. Technical details, including definitions of the probabilistic terminology and proofs, are left to Appendix D.
Theorem 2.
Whenever $0\le {d_{\boldsymbol{X}}}\le {p_{C}}$ and $0\le {d_{\boldsymbol{Y}}}\le r$, the Markov chain generated by the Metropolis-within-Gibbs algorithm for the posterior sampling of Θ in $\mathrm{SIMP}$, as detailed in Appendix A, is Harris ergodic, i.e., (a) ϕ-irreducible with respect to some measure ϕ, (b) aperiodic, and (c) Harris recurrent.
6 Selection of Envelope Dimensions
To search the envelope dimensions effectively, we first estimate the minimum possible dimension of the two envelope components, namely the rank of ${\boldsymbol{\beta }_{1C}}$, which narrows down the range of dimensions to be searched and saves computational cost.
6.1 Selecting $d=\mathrm{rank}({\boldsymbol{\beta }_{1C}})$
To determine d, we adapt the Bura-Cook estimator [4] to our model setting. This estimator consists of a sequence of chi-squared tests. For $k=0,1,\dots ,\min ({p_{C}},r)-1$, the test statistic is ${T_{k}}=n{\textstyle\sum _{j=k+1}^{\min ({p_{C}},r)}}{\varphi _{j}}$, where ${\varphi _{1}}\ge \cdots \ge {\varphi _{\min ({p_{C}},r)}}$ are the eigenvalues of
\[ {\widehat{\boldsymbol{\beta }}_{1C,std}}={\big\{(n-{p_{C}}-1)/n\big\}^{1/2}}{\mathbf{S}_{1C}^{1/2}}{\widehat{\boldsymbol{\beta }}_{{R_{\boldsymbol{Y}\mid 1D,2}}\mid 1C}}{\mathbf{S}_{{R_{\boldsymbol{Y}\mid 1D,2}}\mid 1C}^{-1/2}},\]
where ${\mathbf{S}_{1C}}$ is the sample covariance matrix of ${\boldsymbol{X}_{1C}}$. The residuals from the regression of $\boldsymbol{Y}$ on ${\boldsymbol{X}_{1D}}$ and ${\boldsymbol{X}_{2}}$ are regressed on ${\boldsymbol{X}_{1C}}$, yielding the estimated regression coefficients ${\widehat{\boldsymbol{\beta }}_{{R_{\boldsymbol{Y}\mid 1D,2}}\mid 1C}}$ and the residual sample covariance matrix ${\mathbf{S}_{{R_{\boldsymbol{Y}\mid 1D,2}}\mid 1C}}$. Hereinafter, unless the method of a regression is specified, it refers to frequentist ordinary least squares. [4] shows that ${T_{k}}$ follows the ${\chi _{({p_{C}}-k)(r-k)}^{2}}$ distribution under the null hypothesis that $\mathrm{rank}({\boldsymbol{\beta }_{1C}})=k$. For each $k=0,1,\dots ,\min ({p_{C}},r)-1$, we compare ${T_{k}}$ with the upper α quantile of the ${\chi _{({p_{C}}-k)(r-k)}^{2}}$ distribution, where α is the pre-specified significance level, defaulting to 0.05 in our numerical studies. We choose d as the first non-significant value of k, or set $d=\min ({p_{C}},r)$ if every test is significant. The Bura-Cook estimator exhibits excellent ability in selecting d, as illustrated in the third column of Table 2. Details of the simulation set-up and numerical performance are introduced in Section 7.1.
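For concreteness, a minimal Python sketch of this sequential testing procedure is given below, assuming the eigenvalues ${\varphi _{j}}$ of ${\widehat{\boldsymbol{\beta }}_{1C,std}}$ have already been computed:

```python
import numpy as np
from scipy import stats

def bura_cook_rank(phi, n, p_C, r, alpha=0.05):
    """Sequential chi-squared tests for d = rank(beta_1C).

    phi: eigenvalues phi_1 >= ... >= phi_{min(p_C, r)} computed from the
    standardized coefficient estimate, assumed given here.
    """
    m = min(p_C, r)
    phi = np.sort(np.asarray(phi, dtype=float))[::-1]
    for k in range(m):
        T_k = n * phi[k:].sum()                 # T_k = n * sum_{j > k} phi_j
        df = (p_C - k) * (r - k)                # chi-squared degrees of freedom
        if T_k <= stats.chi2.ppf(1.0 - alpha, df):
            return k                            # first non-significant k
    return m                                    # all tests significant
```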
6.2 Selecting Envelope Dimensions
Dimensions ${d_{\boldsymbol{X}}}$ and ${d_{\boldsymbol{Y}}}$ should be specified before fitting $\mathrm{SIMP}$. In Section 7.1, we investigate the performance of Bayesian cross-validation (CV) and four information criteria, the Akaike information criterion (AIC-MCMC), the Bayesian information criterion (BIC-MCMC), the Deviance information criterion (DIC) and the Watanabe-Akaike information criterion (WAIC), in selecting these two envelope dimensions for $\mathrm{SIMP}$. These methods are chosen for their popularity in the Bayesian literature, and dimension selection is critical in envelope modeling. To the best of our knowledge, we are the first to investigate the performance of these methods together for the dimension selection of a Bayesian envelope model, whereas previous Bayesian envelope literature simply picks one of them directly [5, 38]. For each fixed ${d_{\boldsymbol{X}}}$ and ${d_{\boldsymbol{Y}}}$,
\[ \mathrm{AIC}\text{-}\mathrm{MCMC}({d_{\boldsymbol{X}}},{d_{\boldsymbol{Y}}})=2K({d_{\boldsymbol{X}}},{d_{\boldsymbol{Y}}})-2l({\widehat{\boldsymbol{\Theta }}_{{d_{\boldsymbol{X}}},{d_{\boldsymbol{Y}}}}^{max}}),\hspace{1em}\mathrm{BIC}\text{-}\mathrm{MCMC}({d_{\boldsymbol{X}}},{d_{\boldsymbol{Y}}})=\log (n)K({d_{\boldsymbol{X}}},{d_{\boldsymbol{Y}}})-2l({\widehat{\boldsymbol{\Theta }}_{{d_{\boldsymbol{X}}},{d_{\boldsymbol{Y}}}}^{max}}),\]
where $l({\widehat{\boldsymbol{\Theta }}_{{d_{\boldsymbol{X}}},{d_{\boldsymbol{Y}}}}^{max}})$ is the largest log-likelihood value attained by the MCMC samples after burn-in, with the envelope dimensions fixed at ${d_{\boldsymbol{X}}}$ and ${d_{\boldsymbol{Y}}}$ for the associated parameters, and $K({d_{\boldsymbol{X}}},{d_{\boldsymbol{Y}}})={d_{\boldsymbol{Y}}}({d_{\boldsymbol{X}}}+{p_{D}})+r(r+2{p_{2}}+3)/2+{p_{C}}({p_{C}}+3)/2$ is the effective number of free parameters in SIMP. Note that the definitions of AIC-MCMC and BIC-MCMC here differ slightly from the frequentist versions that readers might be used to: the maximum likelihood estimates of the parameters are replaced by the empirical maximizers over the retained MCMC samples. Definitions of DIC, WAIC and Bayesian CV involve more notation and hence are left to Appendix F. We choose ${d_{\boldsymbol{X}}}$ and ${d_{\boldsymbol{Y}}}$ as the minimizers of either the negative of the average out-of-sample estimate of the log predictive density for the Bayesian CV (see details in Appendix F), or one of these four information criteria, over the grid of pairs ${d_{\boldsymbol{X}}}=\widehat{d},\dots ,{p_{C}}$ and ${d_{\boldsymbol{Y}}}=\widehat{d},\dots ,r$, where $\widehat{d}$ is the estimate of d from the Bura-Cook estimator introduced in Section 6.1.
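The following minimal Python sketch illustrates this grid search; the routine fit_simp is a hypothetical placeholder for running our MCMC at fixed dimensions and returning the retained log-likelihood values:

```python
import numpy as np

def K_simp(d_X, d_Y, p_C, p_D, p_2, r):
    """Effective number of free parameters in SIMP."""
    return d_Y * (d_X + p_D) + r * (r + 2 * p_2 + 3) / 2 + p_C * (p_C + 3) / 2

def ic_mcmc(logliks, d_X, d_Y, n, p_C, p_D, p_2, r):
    """AIC-/BIC-MCMC from the retained log-likelihood values of one run."""
    l_max = float(np.max(logliks))   # empirical maximizer over MCMC samples
    K = K_simp(d_X, d_Y, p_C, p_D, p_2, r)
    return {"AIC-MCMC": 2 * K - 2 * l_max,
            "BIC-MCMC": np.log(n) * K - 2 * l_max}

# Hypothetical grid search over (d_X, d_Y); fit_simp(d_X, d_Y) stands in for
# a routine that runs the MCMC and returns the retained log-likelihoods.
# best = min(((dX, dY) for dX in range(d_hat, p_C + 1)
#                      for dY in range(d_hat, r + 1)),
#            key=lambda d: ic_mcmc(fit_simp(*d), d[0], d[1],
#                                  n, p_C, p_D, p_2, r)["BIC-MCMC"])
```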
Table 2
Percentages that the Bura-Cook estimator correctly identifies d (Column 3), and each of five methods correctly identifies ${d_{\boldsymbol{X}}}$ or ${d_{\boldsymbol{Y}}}$ individually or the $({d_{\boldsymbol{X}}},{d_{\boldsymbol{Y}}})$ pair (Columns 4-8) out of 500 repetitions. Among five methods, the best one for selecting $({d_{\boldsymbol{X}}},{d_{\boldsymbol{Y}}})$ pair under each scenario is highlighted in bold face.
$({d_{\boldsymbol{X}}},{d_{\boldsymbol{Y}}})$ | n | B-C | AIC-MCMC | BIC-MCMC | DIC | WAIC | 5-fold CV |
(2, 2) | 150 | 0.99 | 0.37/0.65/0.29 | 0.85/0.98/0.84 | 0.00/0.20/0.00 | 0.00/0.24/0.00 | 0.58/0.56/0.26 |
300 | 1.00 | 0.35/0.63/0.25 | 0.92/0.99/0.91 | 0.00/0.19/0.00 | 0.00/0.20/0.00 | 0.54/0.61/0.27 | |
500 | 1.00 | 0.44/0.68/0.35 | 0.96/0.99/0.95 | 0.00/0.15/0.00 | 0.00/0.24/0.00 | 0.53/0.63/0.30 | |
(6, 2) | 150 | 0.98 | 0.51/0.79/0.41 | 0.42/0.98/0.40 | 0.07/0.15/0.01 | 0.54/0.30/0.18 | 0.08/0.09/0.01 |
300 | 0.99 | 0.59/0.84/0.49 | 0.66/0.99/0.66 | 0.03/0.16/0.00 | 0.55/0.28/0.15 | 0.08/0.06/0.01 | |
500 | 0.99 | 0.61/0.87/0.53 | 0.75/0.99/0.75 | 0.02/0.13/0.00 | 0.53/0.27/0.15 | 0.15/0.05/0.01 | |
(2, 6) | 150 | 0.99 | 0.50/0.79/0.39 | 0.99/0.99/0.98 | 0.00/0.36/0.00 | 0.04/0.71/0.03 | 0.41/0.01/0.01 |
300 | 1.00 | 0.55/0.81/0.45 | 0.99/1.00/0.99 | 0.00/0.31/0.00 | 0.02/0.72/0.01 | 0.40/0.08/0.04 | |
500 | 1.00 | 0.63/0.81/0.51 | 1.00/1.00/0.99 | 0.00/0.31/0.00 | 0.01/0.69/0.01 | 0.44/0.16/0.08 | |
(6, 6) | 150 | 0.99 | 0.90/0.92/0.83 | 0.99/0.99/0.99 | 0.12/0.44/0.05 | 0.89/0.91/0.82 | 0.64/0.82/0.52 |
300 | 1.00 | 0.91/0.94/0.86 | 1.00/1.00/1.00 | 0.08/0.51/0.05 | 0.90/0.90/0.81 | 0.61/0.76/0.44 | |
500 | 0.99 | 0.89/0.95/0.85 | 0.99/0.99/0.99 | 0.05/0.46/0.01 | 0.87/0.88/0.78 | 0.57/0.71/0.41 |
7 Simulation Study
7.1 Performance of the Envelope Dimensions Selection
In this section, we investigate the numerical performance of the dimension selection methods for $\mathrm{SIMP}$ introduced in Section 6.2. We fix $r={p_{C}}=8$, ${p_{D}}={p_{2}}=2$, and consider four cases where the true dimensions $({d_{\boldsymbol{X}}},{d_{\boldsymbol{Y}}})$ are $(2,2)$, $(6,2)$, $(2,6)$ or $(6,6)$, with corresponding effective numbers of parameters 112, 120, 128 and 152. Three sample sizes, 150, 300 and 500, are considered, given the magnitude of the effective numbers of parameters. We generate Ω, ${\boldsymbol{\Omega }_{0}}$, Φ and ${\boldsymbol{\Phi }_{0}}$ as diagonal matrices whose ${L_{1,1}}$ norms are 100, 0.5, 1 and 10, with the proportions of each norm assigned to the diagonal elements generated from Dirichlet distributions with shape vectors of all 5, 5, 1 and 1, respectively. Entries of ${\boldsymbol{\beta }_{2}}$, $\boldsymbol{\gamma }$, ${\boldsymbol{\eta }_{C}}$ and ${\boldsymbol{\eta }_{D}}$ are generated independently from $Unif(-2,2)$, and those of A and B independently from $Unif(-1,1)$. All elements of ${\boldsymbol{\mu }_{1C}}$ and ${\boldsymbol{\mu }_{\boldsymbol{Y}}}$ are drawn independently from $Unif(0,10)$. Covariates in ${\mathbb{X}_{1D}}$ are generated independently from the discrete $Unif\{0,1,2\}$, and samples of ${\mathbb{X}_{2}}$ independently from ${\mathcal{N}_{{p_{2}}}}({\boldsymbol{\mu }_{2}},{\boldsymbol{\Sigma }_{2}})$ with ${\boldsymbol{\mu }_{2}}={(2,5)^{T}}$ and ${\boldsymbol{\Sigma }_{2}}$ a realization from ${\mathcal{IW}_{{p_{2}}}}({\mathbf{I}_{{p_{2}}}},{p_{2}})$.
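As an illustration of this covariance generation scheme (a sketch, not our exact simulation code), the four diagonal matrices can be produced as follows:

```python
import numpy as np

rng = np.random.default_rng(1)
p_C, r, d_X, d_Y = 8, 8, 2, 2

def diag_cov(dim, l11_norm, shape, rng):
    """Diagonal covariance matrix whose diagonal entries sum to l11_norm,
    with the proportions drawn from a symmetric Dirichlet distribution."""
    props = rng.dirichlet(np.full(dim, shape))
    return np.diag(l11_norm * props)

Omega   = diag_cov(d_X,       100.0, 5.0, rng)
Omega_0 = diag_cov(p_C - d_X,   0.5, 5.0, rng)
Phi     = diag_cov(d_Y,         1.0, 1.0, rng)
Phi_0   = diag_cov(r - d_Y,    10.0, 1.0, rng)
```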
For our method, we specify the hyperparameters Z, F, W, T, ${\mathbf{A}_{0}}$ and ${\mathbf{B}_{0}}$ to be zero matrices; M, Λ, E and Q to be ${10^{-6}}$ times the identity matrices; ${\mathbf{K}_{\mathbf{A}}}$, ${\mathbf{K}_{\mathbf{B}}}$, ${\boldsymbol{\Sigma }_{\mathbf{A}}}$ and ${\boldsymbol{\Sigma }_{\mathbf{B}}}$ to be ${10^{6}}$ times the identity matrices; ${w_{\boldsymbol{X}}}$, ${w_{{\boldsymbol{X}_{0}}}}$, ${w_{\boldsymbol{Y}}}$ and ${w_{{\boldsymbol{Y}_{0}}}}$ to be the row numbers of Ω, ${\boldsymbol{\Omega }_{0}}$, Φ and ${\boldsymbol{\Phi }_{0}}$ respectively; and ${\boldsymbol{\Psi }_{\boldsymbol{X}}}$, ${\boldsymbol{\Psi }_{{\boldsymbol{X}_{0}}}}$, ${\boldsymbol{\Psi }_{\boldsymbol{Y}}}$ and ${\boldsymbol{\Psi }_{{\boldsymbol{Y}_{0}}}}$ to be identity matrices multiplied by ${10^{-6}}$ and by their respective degrees of freedom (correspondingly, ${w_{\boldsymbol{X}}}$, ${w_{{\boldsymbol{X}_{0}}}}$, ${w_{\boldsymbol{Y}}}$ and ${w_{{\boldsymbol{Y}_{0}}}}$). The MCMC algorithm for our method is run for 20,000 iterations, with the first 50% of iterations as burn-in. For each case, performance is averaged over 500 repetitions. The percentages with which the Bura-Cook estimator correctly identifies d, and with which each of the five methods correctly identifies the true ${d_{\boldsymbol{X}}}$ or ${d_{\boldsymbol{Y}}}$ individually or the true $({d_{\boldsymbol{X}}},{d_{\boldsymbol{Y}}})$ pair jointly, are listed in Table 2.
Results in Table 2 reveal the advantage of BIC-MCMC in selecting ${d_{\boldsymbol{X}}}$ and ${d_{\boldsymbol{Y}}}$ for $\mathrm{SIMP}$ across the various settings of the true dimensions and sample sizes, except when ${d_{\boldsymbol{X}}}=6$, ${d_{\boldsymbol{Y}}}=2$ and the sample size is small, suggesting that the excellent performance of BIC-MCMC may require at least moderate sample sizes. AIC-MCMC and WAIC also perform relatively well in the large-${d_{\boldsymbol{X}}}$, large-${d_{\boldsymbol{Y}}}$ setting. These observations, namely that BIC-MCMC needs a large sample size to respond and that AIC-MCMC performs better under large envelope dimensions, are consistent with previous findings for envelope models [58]. Compared with underestimation, slight over-estimation of these two envelope dimensions is acceptable, since only some estimation efficiency, rather than material information, is lost [5]. The right panel of Figure 1 illustrates this phenomenon: the estimation biases (i.e., the square roots of the vertical gaps between the green and red lines, by the bias-variance decomposition) are large for underestimated envelope dimensions, whereas with correct or overestimated envelope dimensions the biases are nearly zero and the estimated variances increase only linearly. Although AIC-MCMC, DIC and WAIC all show some tendency to over-estimate ${d_{\boldsymbol{X}}}$ and ${d_{\boldsymbol{Y}}}$ in Table 2, DIC does so most severely (see the left panel of Figure 1). Hence, AIC-MCMC and WAIC may also be considered in practice, especially for small sample sizes or the large-${d_{\boldsymbol{X}}}$, large-${d_{\boldsymbol{Y}}}$ setting. The Bayesian CV chooses the envelope dimensions based on prediction performance. Although its performance in selecting the true envelope dimensions is far from the best in our simulation, it is still recommended in practice if prediction is the utmost concern and the computational burden is manageable.
Figure 1
Left: selection frequencies of ${d_{\boldsymbol{X}}}$ and ${d_{\boldsymbol{Y}}}$ (Rows 1 and 2) for the five criteria, under different settings of the true $({d_{\boldsymbol{X}}},{d_{\boldsymbol{Y}}})$ (Columns 1–4); colors represent the selection frequency. Right: $\mathrm{MSE}$ and estimated variances (Var) of ${\widehat{\boldsymbol{\beta }}_{1}}$ from $\mathrm{SIMP}$ across all possible choices of ${d_{\boldsymbol{Y}}}$ (Row 1) or ${d_{\boldsymbol{X}}}$ (Row 2), with ${d_{\boldsymbol{X}}}$ or ${d_{\boldsymbol{Y}}}$ fixed at 2 (the true value) respectively. The horizontal blue dotted line at the bottom of the right panel shows the $\mathrm{MSE}$ of ${\widehat{\boldsymbol{\beta }}_{1}}$ from the frequentist partial response envelope with envelope dimension 2. Results in both panels are based on a sample size of $n=300$ over 500 repetitions. Results with $n=150$ and $n=500$ are similar.
7.2 Estimation Performance
7.2.1 Setting
In this section, we compare the estimation performance of $\mathrm{SIMP}$ with the Bayesian partial predictor envelope (PX-env), the Bayesian partial response envelope (PY-env) and several other classical envelope methods or multivariate regression approaches, including the frequentist predictor envelope (FX-env) [12], the frequentist response envelope (FY-env) [15], the frequentist simultaneous envelope (FS-env) [14], the frequentist partial response envelope (FPY-env) [57], the frequentist ordinary least squares (FOLS), the principal component regression (PCR) [32], the partial least squares regression (PLSR) [16], the canonical correlation analysis (CCA) [26] and the reduced rank regression (RRR) [29]. Note that in Sections 7.2 and 8, PX-env and PY-env simply refer to $\mathrm{SIMP}$ with ${d_{\boldsymbol{Y}}}$ fixed at r and with ${d_{\boldsymbol{X}}}$ fixed at ${p_{C}}$, respectively, rather than the exact Bayesian versions of the partial predictor and partial response envelopes indicated in Table 1; the differences are small (see the simulation results comparing FPY-env and PY-env in Sections 7.2.2–7.2.5), and this convention facilitates the comparison. We choose the posterior mean as the point estimator here and later in Section 8 for all the Bayesian (envelope) approaches; it is a decision-theoretic estimator and is easily obtained once the posterior samples are available. In comparison, the posterior median and the maximum a posteriori (MAP) estimators are not considered in our paper: they are expected to perform similarly (see Table C.4 in Appendix C for numerical evidence on the posterior median estimator), owing to the bell-shaped posterior distributions observed in both the simulation studies (see Appendix G) and the real data application (see Appendix H.2), and the MAP estimator requires additional optimization.
In this section, we keep almost all set-ups from Section 7.1, except the following. We test nine cases, one for each combination of $r=3,8$ or 15 with $n=300,500$ or 1000. The covariates in ${\mathbb{X}_{1D}}$ are generated independently from the discrete $Unif\{0,1\}$. Dimensions ${d_{\boldsymbol{X}}}$ and ${d_{\boldsymbol{Y}}}$ are both fixed at 2. The specification of the hyperparameters (where still applicable), the number of iterations and the burn-in proportion of the MCMC algorithm for PX-env, PY-env and $\mathrm{SIMP}$ in this section are the same as those for $\mathrm{SIMP}$ in Section 7.1.
To make the comparison fair, for competitors that cannot account for the partial structure (i.e., all methods excluding $\mathrm{SIMP}$, PX-env, PY-env and FPY-env), $\boldsymbol{Y}$ is first regressed on ${\boldsymbol{X}_{2}}$, and the fitted residuals serve as the actual responses $\check{\boldsymbol{Y}}$ for the regression on the predictors of interest ${\boldsymbol{X}_{1}}$. This is a common way to adjust for the effects of prognostic factors that are not of main interest in IGP [68]. Tuning parameters and envelope dimensions for all methods are treated as unknown. The envelope dimensions for all Bayesian envelope methods ($\mathrm{SIMP}$, PX-env, PY-env) are chosen by BIC-MCMC, and those for the frequentist envelope methods (FPY-env, FX-env, FY-env and FS-env) by the traditional BIC. The numbers of components for PCR and PLSR are selected by minimizing the mean squared prediction error (MSPE) from 5-fold CV. For $k=1,2,\dots ,5$, let ${\mathcal{C}_{k}}$ denote the k-th index set among the five partition sets of the n samples. For $\ell =1,2,\dots ,{p_{1}}$, we determine the number of components as the ℓ that minimizes
(7.1)
\[ \mathrm{MSPE}(\ell )=\frac{1}{n}{\sum \limits_{k=1}^{5}}\sum \limits_{i\in {\mathcal{C}_{k}}}{\sum \limits_{j=1}^{r}}{\big({\check{y}_{ij}}-{\widehat{\check{y}}_{ij}^{-k}}(\ell )\big)^{2}}.\]
For sample i in ${\mathcal{C}_{k}}$, ${\check{y}_{ij}}$ denotes the j-th response adjusted for ${\boldsymbol{X}_{2}}$, and ${\widehat{\check{y}}_{ij}^{-k}}(\ell )$ is the prediction of ${\check{y}_{ij}}$ from ${\boldsymbol{X}_{1C,i}}$ and ${\boldsymbol{X}_{1D,i}}$, for either PCR or PLSR trained on all samples except those in ${\mathcal{C}_{k}}$ with ℓ components. The rank for RRR is determined by the original Bura-Cook estimator [4]. As the estimation performance for ${\boldsymbol{\mu }_{1C}}$, ${\boldsymbol{\mu }_{\boldsymbol{Y}}}$ and ${\boldsymbol{\beta }_{2}}$ is very similar across the chosen competitors and not of main interest to us, those results are not reported, for compactness. For the true data generating mechanisms, we consider three models (M1-M3) from $\mathrm{SIMP}$ and one model (M4) from RRR, all specified in the list below; (M4) is designed to investigate model mis-specification.
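A minimal sketch of the component-number selection rule (7.1), here for PLSR via scikit-learn and with a hypothetical fold partition, is:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold

def select_ncomp_plsr(X1, Y_check, p_1, n_splits=5, seed=0):
    """Pick the number of PLSR components minimizing MSPE(l) in (7.1);
    Y_check holds the responses already adjusted for X_2."""
    n = X1.shape[0]
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    mspe = np.empty(p_1)
    for l in range(1, p_1 + 1):
        sse = 0.0
        for train, test in kf.split(X1):
            fit = PLSRegression(n_components=l).fit(X1[train], Y_check[train])
            sse += np.sum((Y_check[test] - fit.predict(X1[test])) ** 2)
        mspe[l - 1] = sse / n
    return int(np.argmin(mspe)) + 1, mspe
```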
Note that we allow the true data generating model to be $\mathrm{SIMP}$ with $\| \boldsymbol{\Omega }\| \gg \| {\boldsymbol{\Omega }_{0}}\| $ and $\| \boldsymbol{\Phi }\| \ll \| {\boldsymbol{\Phi }_{0}}\| $ under (M1), $\| \boldsymbol{\Omega }\| \ll \| {\boldsymbol{\Omega }_{0}}\| $ and $\| \boldsymbol{\Phi }\| \ll \| {\boldsymbol{\Phi }_{0}}\| $ under (M2), or $\| \boldsymbol{\Omega }\| \gg \| {\boldsymbol{\Omega }_{0}}\| $ and $\| \boldsymbol{\Phi }\| \gg \| {\boldsymbol{\Phi }_{0}}\| $ under (M3). (M4) takes the true model to be RRR and is designed to test the robustness of $\mathrm{SIMP}$ under model mis-specification. Results under all scenarios are displayed in Sections 7.2.2 through 7.2.5.
- (M1) $\{\boldsymbol{\Omega },{\boldsymbol{\Omega }_{0}},\boldsymbol{\Phi },{\boldsymbol{\Phi }_{0}}\}=\{10{\mathbf{I}_{{d_{\boldsymbol{X}}}}},{\mathbf{I}_{{p_{C}}-{d_{\boldsymbol{X}}}}},{\mathbf{I}_{{d_{\boldsymbol{Y}}}}},5{\mathbf{I}_{r-{d_{\boldsymbol{Y}}}}}\}$,
- (M2) $\{\boldsymbol{\Omega },{\boldsymbol{\Omega }_{0}},\boldsymbol{\Phi },{\boldsymbol{\Phi }_{0}}\}=\{{\mathbf{I}_{{d_{\boldsymbol{X}}}}},10{\mathbf{I}_{{p_{C}}-{d_{\boldsymbol{X}}}}},{\mathbf{I}_{{d_{\boldsymbol{Y}}}}},5{\mathbf{I}_{r-{d_{\boldsymbol{Y}}}}}\}$,
- (M3) $\{\boldsymbol{\Omega },{\boldsymbol{\Omega }_{0}},\boldsymbol{\Phi },{\boldsymbol{\Phi }_{0}}\}=\{10{\mathbf{I}_{{d_{\boldsymbol{X}}}}},{\mathbf{I}_{{p_{C}}-{d_{\boldsymbol{X}}}}},5{\mathbf{I}_{{d_{\boldsymbol{Y}}}}},{\mathbf{I}_{r-{d_{\boldsymbol{Y}}}}}\}$,
- (M4) ${\boldsymbol{\beta }_{1C}}={\mathbf{G}_{C}}\mathbf{H}$, ${\boldsymbol{\beta }_{1D}}={\mathbf{G}_{D}}\mathbf{H}$, where ${\mathbf{G}_{C}}\in {\mathbb{R}^{{p_{C}}\times {d_{1}}}}$, ${\mathbf{G}_{D}}\in {\mathbb{R}^{{p_{D}}\times {d_{1}}}}$, $\mathbf{H}\in {\mathbb{R}^{{d_{1}}\times r}}$, with ${d_{1}}=1$, and their elements are all generated independently from $Unif(-1,1)$. Samples of ${\mathbb{X}_{1C}}$ are generated from ${\mathcal{N}_{{p_{C}}}}({\boldsymbol{\mu }_{1C}},{\boldsymbol{\Sigma }_{C}})$, and ${\boldsymbol{\Sigma }_{\boldsymbol{Y}\mid \boldsymbol{X}}}$ and ${\boldsymbol{\Sigma }_{C}}$ are realizations of $\mathcal{IW}({\mathbf{I}_{r}},r)$ and $\mathcal{IW}({\mathbf{I}_{{p_{C}}}},{p_{C}})$ respectively.
Table 3
$\mathrm{MSE}$ comparison between $\mathrm{SIMP}$ and other 11 competitors for estimating ${\boldsymbol{\beta }_{1C}}$ and ${\boldsymbol{\beta }_{1D}}$ over 500 repetitions under the data generating mechanism (M1). The lowest $\mathrm{MSE}$ under each combination of r and n is in bold face.
r | n | FOLS | RRR | PCR | PLSR | CCA | FX-env | FY-env | FS-env | FPY-env | PX-env | PY-env | $\mathrm{SIMP}$ | |
3 | 300 | ${\boldsymbol{\beta }_{1C}}$ | 0.16 | 0.07 | 0.16 | 0.17 | 3.95 | 0.10 | 0.05 | 0.05 | 0.05 | 0.02 | 0.05 | 0.02 |
${\boldsymbol{\beta }_{1D}}$ | 0.55 | 0.27 | 0.63 | 0.69 | 13.52 | 0.98 | 0.18 | 0.33 | 0.16 | 0.23 | 0.17 | 0.08 | ||
500 | ${\boldsymbol{\beta }_{1C}}$ | 0.09 | 0.04 | 0.09 | 0.09 | 3.81 | 0.05 | 0.03 | 0.02 | 0.03 | 0.01 | 0.03 | 0.01 | |
${\boldsymbol{\beta }_{1D}}$ | 0.34 | 0.16 | 0.34 | 0.36 | 12.87 | 0.38 | 0.10 | 0.14 | 0.09 | 0.14 | 0.10 | 0.05 | ||
1000 | ${\boldsymbol{\beta }_{1C}}$ | 0.04 | 0.02 | 0.04 | 0.04 | 3.80 | 0.02 | 0.01 | 0.01 | 0.01 | 0.00 | 0.01 | 0.00 | |
${\boldsymbol{\beta }_{1D}}$ | 0.16 | 0.07 | 0.16 | 0.17 | 13.68 | 0.17 | 0.05 | 0.06 | 0.05 | 0.07 | 0.05 | 0.02 | ||
8 | 300 | ${\boldsymbol{\beta }_{1C}}$ | 0.69 | 0.10 | 0.30 | 0.27 | 23.17 | 0.23 | 0.06 | 0.07 | 0.05 | 0.03 | 0.05 | 0.02 |
${\boldsymbol{\beta }_{1D}}$ | 2.58 | 0.39 | 3.23 | 3.14 | 86.62 | 2.77 | 0.19 | 0.52 | 0.18 | 0.93 | 0.18 | 0.11 | ||
500 | ${\boldsymbol{\beta }_{1C}}$ | 0.40 | 0.05 | 0.28 | 0.27 | 22.80 | 0.15 | 0.03 | 0.03 | 0.03 | 0.02 | 0.03 | 0.01 | |
${\boldsymbol{\beta }_{1D}}$ | 1.53 | 0.21 | 3.14 | 2.96 | 85.83 | 1.57 | 0.12 | 0.21 | 0.11 | 0.57 | 0.11 | 0.06 | ||
1000 | ${\boldsymbol{\beta }_{1C}}$ | 0.20 | 0.02 | 0.21 | 0.22 | 22.82 | 0.08 | 0.02 | 0.01 | 0.01 | 0.01 | 0.02 | 0.01 | |
${\boldsymbol{\beta }_{1D}}$ | 0.74 | 0.09 | 1.5 | 1.57 | 84.69 | 0.89 | 0.05 | 0.07 | 0.05 | 0.27 | 0.05 | 0.03 | ||
15 | 300 | ${\boldsymbol{\beta }_{1C}}$ | 1.45 | 0.16 | 0.16 | 0.11 | 30.38 | 0.10 | 0.09 | 0.13 | 0.08 | 0.06 | 0.08 | 0.05 |
${\boldsymbol{\beta }_{1D}}$ | 4.74 | 0.38 | 1.39 | 1.37 | 103.70 | 1.31 | 0.19 | 0.80 | 0.15 | 1.87 | 0.15 | 0.17 | ||
500 | ${\boldsymbol{\beta }_{1C}}$ | 0.86 | 0.09 | 0.16 | 0.10 | 30.18 | 0.08 | 0.05 | 0.06 | 0.04 | 0.04 | 0.04 | 0.03 | |
${\boldsymbol{\beta }_{1D}}$ | 2.86 | 0.21 | 1.34 | 1.32 | 103.07 | 1.31 | 0.10 | 0.42 | 0.09 | 1.13 | 0.09 | 0.09 | ||
1000 | ${\boldsymbol{\beta }_{1C}}$ | 0.42 | 0.04 | 0.14 | 0.11 | 30.20 | 0.07 | 0.02 | 0.03 | 0.02 | 0.02 | 0.02 | 0.01 | |
${\boldsymbol{\beta }_{1D}}$ | 1.39 | 0.09 | 1.29 | 1.24 | 102.97 | 1.30 | 0.05 | 0.15 | 0.04 | 0.54 | 0.04 | 0.04 |
7.2.2 Results on (M1): $\| \boldsymbol{\Omega }\| \gg \| {\boldsymbol{\Omega }_{0}}\| $ and $\| \boldsymbol{\Phi }\| \ll \| {\boldsymbol{\Phi }_{0}}\| $
Under (M1), the mean squared errors ($\mathrm{MSE}$) for ${\boldsymbol{\beta }_{1C}}$ and ${\boldsymbol{\beta }_{1D}}$, the two parameters of most interest, are reported in Table 3. For almost every combination of r and n, $\mathrm{SIMP}$ performs the best for the estimation of both ${\boldsymbol{\beta }_{1C}}$ and ${\boldsymbol{\beta }_{1D}}$. PX-env and PY-env, two special cases of $\mathrm{SIMP}$ obtained by setting ${d_{\boldsymbol{Y}}}=r$ or ${d_{\boldsymbol{X}}}={p_{C}}$, achieve competitive performance for only ${\boldsymbol{\beta }_{1C}}$ or for the whole ${\boldsymbol{\beta }_{1}}$, respectively, but are still less efficient than $\mathrm{SIMP}$. Comparing these two models, the former obtains slightly better performance for ${\boldsymbol{\beta }_{1C}}$ while the latter achieves much better performance for ${\boldsymbol{\beta }_{1D}}$. This observation is consistent with our assumptions for SIMP in (3.7), where both the partial predictor and partial response envelope structures are assumed for ${\boldsymbol{\beta }_{1C}}$, with a slightly stronger signal from the former component here ($\| \boldsymbol{\Omega }\| /\| {\boldsymbol{\Omega }_{0}}\| \gt \| {\boldsymbol{\Phi }_{0}}\| /\| \boldsymbol{\Phi }\| $), and only the partial response envelope structure is imposed on ${\boldsymbol{\beta }_{1D}}$. Although PY-env and $\mathrm{SIMP}$ share the same partial response envelope structure for ${\boldsymbol{\beta }_{1D}}$, the latter surprisingly outperforms the former for the estimation of ${\boldsymbol{\beta }_{1D}}$ when $r=3$ and 8, possibly because the more efficient estimator of ${\boldsymbol{\beta }_{1C}}$ improves the estimation of $\mathbf{R}(\mathbf{B})$, and hence the estimation of ${\boldsymbol{\beta }_{1D}}$, in $\mathrm{SIMP}$. Meanwhile, it is noteworthy that PY-env and FPY-env show similar estimation performance for both ${\boldsymbol{\beta }_{1C}}$ and ${\boldsymbol{\beta }_{1D}}$ in all cases, but the former allows more convenient posterior variability quantification and utilization of prior information. FX-env, FY-env and FS-env cannot outperform their partial envelope counterparts, and the other five non-envelope methods cannot outperform any of the partial envelope methods.
To further illustrate the advantage of $\mathrm{SIMP}$ from the perspective of posterior variability, the average posterior standard deviation (PSD) over the coordinates of $\mathrm{vec}({\boldsymbol{\beta }_{1C}})$ and $\mathrm{vec}({\boldsymbol{\beta }_{1D}})$ is calculated, as the average of the empirical standard deviations of the posterior samples of $\mathrm{vec}({\boldsymbol{\beta }_{1C}})$ and $\mathrm{vec}({\boldsymbol{\beta }_{1D}})$ across 500 repetitions. The average PSD is compared among $\mathrm{SIMP}$, the Bayesian linear regression (BLR) obtained by implementing $\mathrm{SIMP}$ with ${d_{\boldsymbol{X}}}={p_{C}}$ and ${d_{\boldsymbol{Y}}}=r$, PX-env and PY-env under (M1). The envelope dimensions are treated as unknown and selected by BIC-MCMC for all three envelope methods. The results are reported in Figure 2. In terms of both ${\boldsymbol{\beta }_{1C}}$ and ${\boldsymbol{\beta }_{1D}}$, the three envelope methods always significantly outperform BLR in all cases. $\mathrm{SIMP}$ and PX-env obtain the lowest average PSD for ${\boldsymbol{\beta }_{1C}}$ and are better than PY-env, especially when $r=3$. This is as expected, since we have imposed both the partial predictor and partial response envelope structures on ${\boldsymbol{\beta }_{1C}}$ but the signal of the former is stronger. As r increases to 8 or 15, the performance of PY-env becomes close to the other two envelope methods. In terms of ${\boldsymbol{\beta }_{1D}}$, $\mathrm{SIMP}$ achieves the lowest PSD under all scenarios. When $r=8$ or 15, the performance of PY-env is close to $\mathrm{SIMP}$ and is significantly better than PX-env. This is also as expected, since the partial response envelope structure is imposed on ${\boldsymbol{\beta }_{1D}}$ for both $\mathrm{SIMP}$ and PY-env, and the slight advantage of $\mathrm{SIMP}$ for the estimation of ${\boldsymbol{\beta }_{1D}}$ is possibly due to the more efficient estimation of ${\boldsymbol{\beta }_{1C}}$, a synergistic effect of imposing the two envelope components simultaneously. Although no envelope assumption is imposed on ${\boldsymbol{\beta }_{1D}}$ for PX-env, the posterior variability of the estimator of ${\boldsymbol{\beta }_{1D}}$ from PX-env is still smaller than that of BLR.
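For clarity, the average PSD of a coefficient matrix can be computed from its retained posterior draws as in the following sketch:

```python
import numpy as np

def average_psd(beta_samples):
    """Average posterior standard deviation over the coordinates of
    vec(beta); beta_samples has shape (S, p, r) for S retained draws."""
    return beta_samples.std(axis=0, ddof=1).mean()
```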
Table 4
$\mathrm{MSE}$ comparison between $\mathrm{SIMP}$ and other 11 competitors for estimating ${\boldsymbol{\beta }_{1C}}$ and ${\boldsymbol{\beta }_{1D}}$ over 500 repetitions under the data generating mechanism (M2). The lowest $\mathrm{MSE}$ under each combination of r and n is in bold face.
r | n | FOLS | RRR | PCR | PLSR | CCA | FX-env | FY-env | FS-env | FPY-env | PX-env | PY-env | $\mathrm{SIMP}$ | |
3 | 300 | ${\boldsymbol{\beta }_{1C}}$ | 0.06 | 0.05 | 0.06 | 0.06 | 1.31 | 0.05 | 0.04 | 0.02 | 0.04 | 0.06 | 0.04 | 0.03 |
${\boldsymbol{\beta }_{1D}}$ | 0.32 | 0.46 | 0.34 | 0.35 | 7.11 | 0.33 | 0.47 | 0.12 | 0.45 | 0.31 | 0.48 | 0.12 | ||
500 | ${\boldsymbol{\beta }_{1C}}$ | 0.04 | 0.03 | 0.04 | 0.04 | 1.29 | 0.03 | 0.01 | 0.01 | 0.01 | 0.04 | 0.01 | 0.02 | |
${\boldsymbol{\beta }_{1D}}$ | 0.20 | 0.22 | 0.20 | 0.20 | 7.04 | 0.21 | 0.07 | 0.07 | 0.07 | 0.19 | 0.08 | 0.06 | ||
1000 | ${\boldsymbol{\beta }_{1C}}$ | 0.02 | 0.01 | 0.02 | 0.02 | 1.26 | 0.02 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | |
${\boldsymbol{\beta }_{1D}}$ | 0.10 | 0.08 | 0.10 | 0.10 | 6.87 | 0.10 | 0.03 | 0.04 | 0.03 | 0.09 | 0.03 | 0.03 | ||
8 | 300 | ${\boldsymbol{\beta }_{1C}}$ | 0.30 | 0.17 | 0.31 | 0.31 | 5.74 | 0.25 | 0.04 | 0.05 | 0.04 | 0.23 | 0.04 | 0.04 |
${\boldsymbol{\beta }_{1D}}$ | 1.42 | 0.52 | 1.63 | 1.59 | 39.61 | 1.51 | 0.11 | 0.17 | 0.11 | 1.25 | 0.12 | 0.13 | ||
500 | ${\boldsymbol{\beta }_{1C}}$ | 0.17 | 0.09 | 0.18 | 0.18 | 5.63 | 0.14 | 0.02 | 0.03 | 0.02 | 0.13 | 0.02 | 0.02 | |
${\boldsymbol{\beta }_{1D}}$ | 0.86 | 0.29 | 1.00 | 1.00 | 39.68 | 0.91 | 0.07 | 0.10 | 0.07 | 0.76 | 0.08 | 0.08 | ||
1000 | ${\boldsymbol{\beta }_{1C}}$ | 0.08 | 0.04 | 0.08 | 0.08 | 5.42 | 0.07 | 0.01 | 0.01 | 0.01 | 0.06 | 0.01 | 0.01 | |
${\boldsymbol{\beta }_{1D}}$ | 0.41 | 0.13 | 0.41 | 0.43 | 38.19 | 0.41 | 0.03 | 0.04 | 0.03 | 0.37 | 0.04 | 0.04 | ||
15 | 300 | ${\boldsymbol{\beta }_{1C}}$ | 0.60 | 0.42 | 0.59 | 0.55 | 5.19 | 0.43 | 0.10 | 0.16 | 0.10 | 0.46 | 0.12 | 0.13 |
${\boldsymbol{\beta }_{1D}}$ | 2.32 | 0.29 | 1.83 | 1.19 | 49.09 | 1.36 | 0.09 | 0.40 | 0.09 | 2.04 | 0.11 | 0.17 | ||
500 | ${\boldsymbol{\beta }_{1C}}$ | 0.36 | 0.25 | 0.37 | 0.34 | 5.00 | 0.27 | 0.06 | 0.09 | 0.06 | 0.28 | 0.06 | 0.07 | |
${\boldsymbol{\beta }_{1D}}$ | 1.42 | 0.16 | 1.56 | 1.13 | 49.09 | 0.97 | 0.05 | 0.20 | 0.06 | 1.24 | 0.06 | 0.09 | ||
1000 | ${\boldsymbol{\beta }_{1C}}$ | 0.18 | 0.12 | 0.18 | 0.17 | 4.80 | 0.13 | 0.03 | 0.03 | 0.03 | 0.14 | 0.03 | 0.03 | |
${\boldsymbol{\beta }_{1D}}$ | 0.68 | 0.07 | 0.69 | 0.96 | 48.92 | 0.50 | 0.03 | 0.05 | 0.03 | 0.60 | 0.03 | 0.04 |
Table 5
$\mathrm{MSE}$ comparison between $\mathrm{SIMP}$ and other 11 competitors for estimating ${\boldsymbol{\beta }_{1C}}$ and ${\boldsymbol{\beta }_{1D}}$ over 500 repetitions under the data generating mechanism (M3). The lowest $\mathrm{MSE}$ under each combination of r and n is in bold face.
r | n | FOLS | RRR | PCR | PLSR | CCA | FX-env | FY-env | FS-env | FPY-env | PX-env | PY-env | $\mathrm{SIMP}$ | |
3 | 300 | ${\boldsymbol{\beta }_{1C}}$ | 0.25 | 0.23 | 0.28 | 0.27 | 1.84 | 0.22 | 0.19 | 0.20 | 0.18 | 0.03 | 0.23 | 0.04 |
${\boldsymbol{\beta }_{1D}}$ | 0.89 | 0.84 | 1.90 | 1.65 | 9.37 | 3.56 | 1.57 | 2.63 | 1.56 | 0.34 | 0.88 | 0.38 | ||
500 | ${\boldsymbol{\beta }_{1C}}$ | 0.14 | 0.13 | 0.15 | 0.15 | 1.44 | 0.14 | 0.13 | 0.11 | 0.13 | 0.01 | 0.13 | 0.02 | |
${\boldsymbol{\beta }_{1D}}$ | 0.52 | 0.48 | 0.68 | 0.69 | 8.69 | 2.34 | 0.71 | 1.52 | 0.69 | 0.21 | 0.46 | 0.20 | ||
1000 | ${\boldsymbol{\beta }_{1C}}$ | 0.07 | 0.07 | 0.07 | 0.07 | 1.21 | 0.05 | 0.06 | 0.04 | 0.06 | 0.01 | 0.06 | 0.01 | |
${\boldsymbol{\beta }_{1D}}$ | 0.25 | 0.23 | 0.26 | 0.27 | 8.18 | 0.78 | 0.23 | 0.39 | 0.23 | 0.10 | 0.23 | 0.10 | ||
8 | 300 | ${\boldsymbol{\beta }_{1C}}$ | 0.35 | 0.23 | 0.29 | 0.29 | 5.25 | 0.06 | 0.22 | 0.18 | 0.22 | 0.02 | 0.22 | 0.04 |
${\boldsymbol{\beta }_{1D}}$ | 1.28 | 0.84 | 2.86 | 2.71 | 19.95 | 3.55 | 0.81 | 2.21 | 0.80 | 0.47 | 0.80 | 0.34 | ||
500 | ${\boldsymbol{\beta }_{1C}}$ | 0.21 | 0.14 | 0.22 | 0.22 | 4.96 | 0.08 | 0.13 | 0.11 | 0.13 | 0.01 | 0.13 | 0.02 | |
${\boldsymbol{\beta }_{1D}}$ | 0.78 | 0.51 | 1.79 | 1.66 | 19.02 | 3.47 | 0.50 | 1.37 | 0.49 | 0.29 | 0.49 | 0.21 | ||
1000 | ${\boldsymbol{\beta }_{1C}}$ | 0.10 | 0.06 | 0.10 | 0.10 | 4.82 | 0.15 | 0.06 | 0.05 | 0.06 | 0.01 | 0.06 | 0.01 | |
${\boldsymbol{\beta }_{1D}}$ | 0.37 | 0.24 | 0.44 | 0.47 | 18.25 | 2.69 | 0.23 | 0.61 | 0.23 | 0.14 | 0.23 | 0.09 | ||
15 | 300 | ${\boldsymbol{\beta }_{1C}}$ | 0.51 | 0.25 | 0.16 | 0.14 | 7.08 | 0.05 | 0.24 | 0.12 | 0.23 | 0.03 | 0.23 | 0.05 |
${\boldsymbol{\beta }_{1D}}$ | 1.64 | 0.77 | 1.32 | 1.27 | 22.01 | 1.39 | 0.73 | 1.37 | 0.70 | 0.66 | 0.70 | 0.36 | ||
500 | ${\boldsymbol{\beta }_{1C}}$ | 0.30 | 0.14 | 0.14 | 0.13 | 6.67 | 0.04 | 0.14 | 0.08 | 0.13 | 0.02 | 0.13 | 0.03 | |
${\boldsymbol{\beta }_{1D}}$ | 0.98 | 0.45 | 1.29 | 1.23 | 21.46 | 1.38 | 0.43 | 1.16 | 0.41 | 0.40 | 0.41 | 0.21 | ||
1000 | ${\boldsymbol{\beta }_{1C}}$ | 0.15 | 0.07 | 0.12 | 0.11 | 6.39 | 0.03 | 0.07 | 0.05 | 0.07 | 0.01 | 0.07 | 0.01 | |
${\boldsymbol{\beta }_{1D}}$ | 0.48 | 0.22 | 1.09 | 1.09 | 21.05 | 1.38 | 0.21 | 0.66 | 0.21 | 0.20 | 0.21 | 0.10 |
7.2.3 Results on (M2): $\| \boldsymbol{\Omega }\| \ll \| {\boldsymbol{\Omega }_{0}}\| $ and $\| \boldsymbol{\Phi }\| \ll \| {\boldsymbol{\Phi }_{0}}\| $
Table 4 shows the performance under (M2). Since $\| \boldsymbol{\Omega }\| \ll \| {\boldsymbol{\Omega }_{0}}\| $, the partial predictor envelope component cannot provide much efficiency gain. Therefore, as expected, the best estimation performance comes from the envelope models containing the (partial) response envelope structure, including FY-env, FS-env, FPY-env, PY-env and $\mathrm{SIMP}$. Although the best performance is mostly achieved by FY-env and FPY-env, the differences among these five methods are small. Since the partial response envelope structure is imposed on both ${\boldsymbol{\beta }_{1C}}$ and ${\boldsymbol{\beta }_{1D}}$ in (3.7), large efficiency gains are observed in estimating both of these parameters for these five methods, as expected. Among them, the slightly worse performance of $\mathrm{SIMP}$ in some cases is largely due to model selection; see the performance with the true envelope dimensions known in Table C.3 in Appendix C.
7.2.4 Results on (M3): $\| \boldsymbol{\Omega }\| \gg \| {\boldsymbol{\Omega }_{0}}\| $ and $\| \boldsymbol{\Phi }\| \gg \| {\boldsymbol{\Phi }_{0}}\| $
The estimation performance under (M3) is shown in Table 5. Unsurprisingly, the best performance is obtained by the envelope models containing the partial predictor envelope structure, i.e., PX-env and $\mathrm{SIMP}$. Their estimation performances for ${\boldsymbol{\beta }_{1C}}$ are close, and the tiny difference between them mainly comes from model selection. It is worth noting that when $r=8$ or 15, $\mathrm{SIMP}$ offers some additional gains in estimating ${\boldsymbol{\beta }_{1D}}$ compared with PX-env: the partial response envelope structure is assumed in $\mathrm{SIMP}$ for ${\boldsymbol{\beta }_{1D}}$, and although its signal may not be strong ($\| \boldsymbol{\Phi }\| \gg \| {\boldsymbol{\Phi }_{0}}\| $), the synergistic effect of $\mathrm{SIMP}$ mentioned above could enhance this advantage.
Table 6
$\mathrm{MSE}$ comparison between $\mathrm{SIMP}$ and other 11 competitors for estimating ${\boldsymbol{\beta }_{1C}}$ and ${\boldsymbol{\beta }_{1D}}$ over 500 repetitions under the data generating mechanism (M4). The lowest $\mathrm{MSE}$ under each combination of r and n among all methods and within envelope methods are in bold face and blue respectively.
7.2.5 Results on (M4): Model Mis-Specification
To test the robustness of our method, an additional simulation under the model mis-specification scenario (M4) is conducted. From Table 6, the RRR estimator almost always produces the lowest $\mathrm{MSE}$ when $r=3$ or 8. Even under the mis-specified model, however, the envelope methods still perform quite well. Within the envelope methods, $\mathrm{SIMP}$ obtains relatively better performance for both ${\boldsymbol{\beta }_{1C}}$ and ${\boldsymbol{\beta }_{1D}}$ when r is 8, and FX-env is better for ${\boldsymbol{\beta }_{1D}}$ when r is 3 or 15. Note that when $r=15$, several envelope methods even outperform the RRR estimator. A more careful look reveals that the performance of all envelope methods, especially $\mathrm{SIMP}$, is largely contaminated by the poor envelope dimensions selected under this mis-specification scenario, since their likelihoods may depart significantly from the truth and hence mislead model selection. See Table C.5 in Appendix C for the estimation results when this factor is ruled out by using the oracle tuning parameter or envelope dimensions for each method, chosen as if the true values of ${\boldsymbol{\beta }_{1C}}$ and ${\boldsymbol{\beta }_{1D}}$ were known. In that scenario, the envelope methods almost always dominate the other methods (even RRR), and $\mathrm{SIMP}$ is almost always the best among the envelope methods except when $r=3$, where its performance is still very close to the best. Therefore, our simulation demonstrates the robustness of our method and the other envelope methods to the general low-rank model RRR.
Besides, in Appendix C we also display additional numerical results: the estimation performance of FPY-env, PX-env, PY-env and $\mathrm{SIMP}$ with the true envelope dimensions known under (M1)-(M3), of RRR with the true rank known under (M4), and the results for ${p_{2}}=4$ or ${p_{D}}=4$ with the other settings the same as in (M1).
8 Real Imaging Genetics Application on ADNI1
8.1 Background
Data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). Led by Principal Investigator Michael W. Weiner, MD, ADNI was launched in 2003 as a public-private partnership and has proceeded through four stages (ADNI1, ADNI-GO, ADNI2 and ADNI3) to date. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), genetic or other biological markers, and clinical and neuropsychological assessments can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer's disease (AD). The study has strict enrollment standards, follow-up and data checking protocols. All raw or preprocessed data are available through the Image and Data Archive (IDA) at https://ida.loni.usc.edu/, upon approval of an application. In this real data application, we apply $\mathrm{SIMP}$ to study the genetic effects, including the effect of the well-known AD gene Apolipoprotein E (APOE) ϵ4, on the volumes of some cortical and sub-cortical brain structures, while also accounting for the effects of other prognostic factors, and we compare with several other methods in terms of prediction performance. Following previous practices in analyzing ADNI data [68, 46], we focus on Caucasian participants in the ADNI1 phase to reduce population stratification.
8.2 Data Pre-processing Procedure
The imaging responses, i.e., $\boldsymbol{Y}$ in (3.2), are the volume measurements of the 12 ROIs listed in [42] (i.e., the Amygdala, the Cerebral cortex, the Cerebral white matter, the Hippocampus, the Inferior lateral ventricle and the Lateral ventricle, on both hemispheres), obtained from the 1.5T MRI scan at the screening visit of ADNI1. All imaging data were downloaded from the IDA and have already been pre-processed by FreeSurfer [21]. Among the 845 participants who took MRI scans, we retain only the 593 participants who passed quality control (QC). The log transformation is applied to reduce skewness.
Table 7
$\mathrm{MSPE}$ from 5-fold CV for various envelope and non-envelope methods. The envelope dimensions, number of components ($\mathrm{ncomp}$) or rank selected by 5-fold CV are given in parentheses.
Non envelope methods | Non-partial envelope methods | Partial envelope methods | |||
FOLS | 14.02 | FX-env (${d_{\boldsymbol{X}}}=1$) | 8.34 | PX-env (${d_{\boldsymbol{X}}}=1$) | 7.94 |
RRR ($\mathrm{rank}=1$) | 10.87 | FY-env (${d_{\boldsymbol{Y}}}=1$) | 9.77 | FPY-env (${d_{\boldsymbol{Y}}}=1$) | 9.82 |
PLSR ($\mathrm{ncomp}=1$) | 8.20 | FS-env (${d_{\boldsymbol{X}}}=7$, ${d_{\boldsymbol{Y}}}=1$) | 8.06 | $\mathrm{SIMP}$(${d_{\boldsymbol{X}}}=1$, ${d_{\boldsymbol{Y}}}=3$) | 7.88 |
In ADNI1, 620,901 SNPs on all autosomes and sex chromosomes were genotyped for 757 participants, who are not contained in, but overlap with, the 845 participants who have MRI scans. Among the 11,632 SNPs meta-analyzed by the International Genomics of Alzheimer's Project (IGAP) [36], 852 autosome SNPs are selected as our candidate SNPs for model fitting by checking two conditions: (1) genotyped by ADNI1; (2) p-value from IGAP smaller than 0.01. Developed by Shaun Purcell and Christopher Chang, PLINK [7] is an open-source software for performing QC and routine GWAS analyses efficiently, available at https://www.cog-genomics.org/plink/1.9/. Following [68], we perform two lines of QC on the selected SNP data with PLINK. The first line of QC includes (1) call-rate check per subject and per SNP marker, (2) gender check, (3) sibling-pair identification, (4) Hardy-Weinberg equilibrium test, and (5) population stratification check. The second line of QC removes SNPs with (1) more than $5\% $ missing values, (2) minor allele frequency smaller than $10\% $, or (3) Hardy-Weinberg equilibrium p-value $\lt {10^{-6}}$. This leaves us with 732 SNPs that passed QC. The free genotype imputation and haplotype phasing program IMPUTE2 ([27], [28]) is used to impute SNPs with missingness. We choose HapMap 3 as the reference panel for the imputation, since it shares the same genome build (b36 of the National Center for Biotechnology Information, NCBI) with ADNI1. We delete 39 SNPs that do not appear in the reference panel, leaving 693 SNPs. To reduce dimensionality, principal component analysis (PCA) is applied to these 693 SNPs. By thresholding the eigenvalues of their sample covariance matrix at 1, the first 186 principal components (PCs), which explain 87.19% of the total variation of the 693 SNP predictors, are selected as ${\boldsymbol{X}_{1C}}$, the continuous part of the predictors of main interest in (3.2).
APOE $\epsilon 4$, taking values in $\{0,1,2\}$, is included in ${\boldsymbol{X}_{1D}}$ as the only discrete (quantitative) genetic predictor in (3.2). We incorporate six other important prognostic factors in ${\boldsymbol{X}_{2}}$ in (3.2): gender, marital status, handedness, age, years of education and intracranial volume (ICV). The intersection of participants across all the abovementioned datasets yields the final 498 samples in our analysis. All predictors and responses are standardized before model fitting.
8.3 Prediction Performance
We first compare the prediction performance of $\mathrm{SIMP}$ with two other partial envelope methods (PX-env, FPY-env), three envelope models that cannot account for the partial structure (FX-env, FY-env, FS-env; we call them non-partial envelope methods), and three popular multivariate linear regression methods that are not based on envelope models (FOLS, RRR and PLSR; we call them non-envelope methods). Similar to Section 7.2, for all methods except the partial envelope methods (PX-env, FPY-env and $\mathrm{SIMP}$), $\boldsymbol{Y}$ is first adjusted as the residuals from the regression on ${\boldsymbol{X}_{2}}$, and model fitting is then performed with the genetic markers ${\boldsymbol{X}_{1}}$ and the adjusted responses. For the best prediction performance, all associated envelope dimensions for the envelope methods, the number of components for PLSR, and the rank for RRR are selected by minimizing the MSPE from 5-fold CV. The MCMC algorithm for $\mathrm{SIMP}$ and PX-env is run for 20,000 iterations with the first 50% of iterations as burn-in, and their specifications of the hyperparameters are the same as those in Section 7.
Table 7 reveals that the best prediction performance is achieved by $\mathrm{SIMP}$. All of the envelope methods in the last two columns significantly outperform FOLS and RRR. FS-env and $\mathrm{SIMP}$ achieve the best prediction performance within the non-partial envelope methods and the partial envelope methods we considered, respectively, which illustrates the advantage of combining the predictor and response envelopes in both the non-partial and partial contexts. Comparing the partial envelope methods with their non-partial counterparts, PX-env and $\mathrm{SIMP}$ both obtain lower MSPEs than FX-env and FS-env respectively. The finding that FPY-env cannot outperform FY-env seems surprising, but is consistent with the previous theoretical result that if the envelope spaces of FPY-env and FY-env are equal, then the asymptotic variance of the estimator of ${\boldsymbol{\beta }_{1}}$ from FPY-env cannot be smaller than that from FY-env (see Proposition 2 of [57]); we have verified that the estimated envelope spaces of FPY-env and FY-env are very close in this analysis. It is also interesting to note that FX-env attains an $\mathrm{MSPE}$ similar to that of PLSR; their connection has been discussed in [12].
Figure 3
Upper left: indicator of significance of each (SNP, IP) pair and of APOE $\epsilon 4$ with each IP; APOE $\epsilon 4$ corresponds to the last vertical line, in red. Upper right: indicator of significance of each (prognostic factor, IP) pair. In the upper row, red signifies significance. Bottom: heatmap of the estimated regression coefficient matrix ${\widehat{\boldsymbol{\beta }}_{SNP}^{T}}$ between the 693 SNPs and the 12 IPs.
8.4 Posterior Estimation and Selection Performance
Both AIC-MCMC and BIC-MCMC select ${d_{\boldsymbol{X}}}=87$ and ${d_{\boldsymbol{Y}}}=1$ for $\mathrm{SIMP}$. Using these dimensions, $\mathrm{SIMP}$ is fitted with the number of iterations, burn-in proportion and hyperparameter specification the same as those in Section 7 and Section 8.3. A run of 20,000 iterations under these envelope dimensions takes 14.82 hours on an Intel Xeon E5-2680 v3 processor (2.50 GHz).
To interpret our results with scientifically meaningful findings, and given the large share of variation explained by the chosen PCs, we approximate the posterior samples of ${\boldsymbol{\beta }_{SNP}}\in {\mathbb{R}^{693\times 12}}$, the regression coefficients between the standardized versions of the original 693 SNP predictors (i.e., the 693 SNP predictors before PCA) and the 12 IPs. For $s=1,2,\dots ,10,000$, the s-th retained posterior sample ${\boldsymbol{\beta }_{1C}^{(s)}}\in {\mathbb{R}^{186\times 12}}$ is mapped to ${\boldsymbol{\beta }_{SNP}^{(s)}}$, the s-th approximate posterior sample of ${\boldsymbol{\beta }_{SNP}}$, by
(8.1)
\[ {\boldsymbol{\beta }_{SNP}^{(s)}}={\big(\mathrm{diag}\{{\mathbf{S}_{SNP}}\}\big)^{1/2}}\cdot \mathbf{LD}\cdot {\big(\mathrm{diag}\{{\mathbf{S}_{1C}}\}\big)^{-1/2}}\cdot {\boldsymbol{\beta }_{1C}^{(s)}},\]
where $\mathrm{diag}\{{\mathbf{S}_{SNP}}\}\in {\mathbb{S}_{+}^{693\times 693}}$ and $\mathrm{diag}\{{\mathbf{S}_{1C}}\}\in {\mathbb{S}_{+}^{186\times 186}}$ are diagonal matrices whose diagonal elements are the sample variances of the original 693 SNPs and of the 186 PCs respectively, and $\mathbf{LD}\in {\mathbb{R}^{693\times 186}}$ is the loading matrix from PCA.
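A minimal sketch of this back-transformation for a single posterior draw is:

```python
import numpy as np

def beta_snp_draw(beta_1C_s, LD, var_snp, var_pc):
    """Map one posterior draw of beta_1C (PC scale, 186 x 12) to the SNP
    scale (693 x 12) following (8.1).

    LD: (693, 186) PCA loading matrix; var_snp, var_pc: sample variances
    of the 693 SNPs and the 186 PCs."""
    scaled = beta_1C_s / np.sqrt(var_pc)[:, None]      # diag(S_1C)^{-1/2} beta
    return np.sqrt(var_snp)[:, None] * (LD @ scaled)   # diag(S_SNP)^{1/2} LD ...
```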
Table 8
Out of the 37 significant SNPs from $\mathrm{SIMP}$ (besides APOE $\epsilon 4$), the 17 SNPs whose associations (of the SNP itself, its RefSeq gene, or its closest RefSeq gene, with names indicated in Columns 2 and 5) with AD have previously been reported in the literature are listed here. For each of these SNPs, one previous study on the reported association is selected and displayed in Columns 3 and 6.
SNP | Gene | SNP | Gene | ||
rs11685593 | BIN1 | [2] | rs10894473 | NTM | [66] |
rs7561528 | BIN1 | [24] | rs11064498 | C1S | [61] |
rs11706690 | CHL1 | [50] | rs757402 | OAS2 | [3] |
rs10513391 | P2RY14 | [53] | rs2274736 | PTPN21 | [62] |
rs2439538 | TBC1D7 | [17] | rs10498633 | SLC24A4/RIN3 | [60] |
rs9381563 | CD2AP | [30] | rs2554389 | ADAMTSL3 | [55] |
rs2280231 | NDUFS3 | [49] | rs4265771 | ADAMTSL3 | [55] |
rs7120548 | MTCH2 | [31] | rs17809911 | CCDC102B | [41] |
rs10501927 | CNTN5 | [24] |
The corresponding $(1-5.95\times {10^{-6}})\times 100\% $ two-sided credible intervals for ${\boldsymbol{\beta }_{SNP}}$, ${\boldsymbol{\beta }_{1D}}$ and ${\boldsymbol{\beta }_{2}}$ are calculated, where the confidence level is obtained from 0.95 with the Bonferroni correction over all 8,400 regression coefficients. The significance of any SNP, APOE $\epsilon 4$ or prognostic factor with respect to any of the 12 IPs is determined by whether the associated credible interval excludes zero.
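A minimal sketch of this significance rule, using equal-tailed empirical quantiles of the posterior draws, is:

```python
import numpy as np

def bonferroni_significant(samples, n_tests=8400, fwer=0.05):
    """Flag coefficients whose Bonferroni-corrected two-sided credible
    interval excludes zero; samples has shape (S, m) for m coefficients."""
    alpha = fwer / n_tests                        # 0.05 / 8400 ~ 5.95e-6
    lo = np.quantile(samples, alpha / 2, axis=0)
    hi = np.quantile(samples, 1.0 - alpha / 2, axis=0)
    return (lo > 0) | (hi < 0)                    # zero excluded
```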
The upper-left panel of Figure 3 shows which SNPs, as well as APOE $\epsilon 4$, are significant for each of the 12 IPs. Besides the well-known AD gene APOE $\epsilon 4$, we identify another 37 significant SNPs that are possibly related to AD (see Appendix H.3 for the full list), under control of the family-wise error rate (FWER, the probability of reporting at least one false positive) at 0.05. Among them, 17 SNPs are reported in Table 8 for their appearance in previous studies. Meanwhile, under the same FWER control, the upper-right panel of Figure 3 confirms the key role of age in AD and the impact of ICV on the volumes of brain structures. The significance of the SNPs appears consistent across all IPs, which is not observed for the prognostic factors. This is expected, since the rows of ${\widehat{\boldsymbol{\beta }}_{1C}^{T}}$ (and also of ${\widehat{\boldsymbol{\beta }}_{SNP}^{T}}$, the posterior mean estimator of ${\boldsymbol{\beta }_{SNP}^{T}}$) depend only on the estimated partial response envelope, which happens to contribute much weaker effects than the estimated partial predictor envelope in this dataset. No such pattern is observed in ${\widehat{\boldsymbol{\beta }}_{2}^{T}}$, since no envelope structure is imposed on ${\boldsymbol{\beta }_{2}}$.
The heatmap of the estimated regression coefficient matrix ${\widehat{\boldsymbol{\beta }}_{SNP}^{T}}$ is shown in the bottom panel of Figure 3. The elements of ${\widehat{\boldsymbol{\beta }}_{1D}}$ are much larger in scale than those of ${\widehat{\boldsymbol{\beta }}_{SNP}^{T}}$ and hence are not displayed in this figure; they are listed here instead (Amygdala.L: −0.19, Amygdala.R: −0.19, Cerebral cortex.L: −0.12, Cerebral cortex.R: −0.13, Cerebral white matter.L: −0.11, Cerebral white matter.R: −0.11, Hippocampus.L: −0.22, Hippocampus.R: −0.22, Inferior lateral ventricle.L: 0.18, Inferior lateral ventricle.R: 0.19, Lateral ventricle.L: 0.14, Lateral ventricle.R: 0.15). It is interesting that, for each genetic predictor, the estimated coefficients for each pair of brain measures on the two hemispheres are close. This is reasonable, since the bilateral correlations within all pairs of brain measures across the two hemispheres are strong (see Appendix H.4 for numerical evidence and [52] for another discussion of this bilateral correlation). The ROIs with relatively large estimated effects for some SNPs are the Inferior lateral ventricle, the Hippocampus and the Amygdala (in both hemispheres). The Hippocampus and the Amygdala are related to memory and motor behavior respectively, and the importance of these three brain structures for AD has already been verified in past studies [20, 47, 43].
8.5 Shrinkage Estimation by Incorporating the Prior Information of Weak Imaging Genetics Relationship
Past studies on AD have verified the weak effects of SNP predictors in predicting AD outcomes, whether disease status [36] or imaging phenotypes [68, 46]. This pattern of weak effects is also visible in the bottom panel of Figure 3 in Section 8.4. It is therefore reasonable to incorporate the prior information of "weak signals in ${\boldsymbol{\beta }_{1C}}$" (and hence "weak signals in ${\boldsymbol{\beta }_{SNP}}$") into our analysis; this should improve the estimation performance through the shrinkage it offers in this high-dimensional problem, and it illustrates the advantage of our Bayesian envelope method in incorporating prior information, compared with the frequentist envelope methods. The envelope dimensions are fixed at ${d_{\boldsymbol{X}}}=87$ and ${d_{\boldsymbol{Y}}}=1$. We adjust the (all equal) diagonal elements of the hyperparameter E (which we set to be a diagonal matrix) in the prior of ${\boldsymbol{\eta }_{C}}$ from ${10^{-6}}$ to ${10^{6}}$, while keeping the specification of the other hyperparameters the same as in the previous sections. This adjustment gradually strengthens our prior belief that ${\boldsymbol{\eta }_{C}}$ (and hence ${\boldsymbol{\beta }_{1C}}$) is a zero matrix, since the prior mean of ${\boldsymbol{\eta }_{C}}$ is fixed at the zero matrix by setting W to the zero matrix, and the prior covariance matrices of the rows of ${\boldsymbol{\eta }_{C}}$ are proportional to ${\mathbf{E}^{-1}}$. From the frequentist perspective, the Bayesian estimator (strictly speaking the MAP estimator, but here we use the posterior mean for simplicity) with increasingly strong prior belief in the weak signals of ${\boldsymbol{\eta }_{C}}$ corresponds to an estimator with gradually stronger shrinkage toward the zero matrix.
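The following small illustration (a sketch of the mechanism only, not our fitting code) shows how growing the diagonal of E tightens the prior around the zero matrix:

```python
import numpy as np

# Under the matrix-normal prior eta_C | Phi ~ MN(W E^{-1}, Phi, E^{-1})
# with W = 0, the prior covariance of each row of eta_C is proportional
# to E^{-1}, so growing the (equal) diagonal of E pulls eta_C (and hence
# beta_1C) toward the zero matrix.
d_X = 87  # envelope dimension selected in Section 8.4
for e in (1e-6, 1.0, 1e3, 1e6):
    E = e * np.eye(d_X)
    row_scale = np.linalg.inv(E)[0, 0]  # per-row prior variance scale
    print(f"diag(E) = {e:g} -> prior row-variance scale = {row_scale:g}")
```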
The left panel of Figure 4 shows improved prediction performance as the diagonal elements of E increase from ${10^{-6}}$ to ${10^{3}}$; the MSPE then deteriorates slightly as the prior belief strengthens beyond ${10^{3}}$, possibly due to the bias caused by an overly strong prior of weak signals, or over-penalization from the frequentist perspective. The middle and right panels of Figure 4 display the regularization paths of the row sums and the column sums of ${\widehat{\boldsymbol{\beta }}_{SNP}}$ respectively. Stronger shrinkage effects are observed under stronger prior belief in the weak signals in ${\boldsymbol{\beta }_{1C}}$ (and hence in ${\boldsymbol{\beta }_{SNP}}$).
Figure 4
MSPE from the 5-fold CV (left), the row sums of ${\widehat{\boldsymbol{\beta }}_{SNP}}$ (middle; each line corresponds to one SNP) and the column sums of ${\widehat{\boldsymbol{\beta }}_{SNP}}$ (right; each line corresponds to one IP) with respect to the diagonal elements of E (in the ${\log _{10}}$ scale).
9 Discussion
In this paper, we propose a new unified Bayesian envelope model by integrating the partial predictor envelope and the partial response envelope under a convenient Bayesian framework. The proposed model degenerates to several well-established envelope models in the literature under specific conditions, and it addresses the limitations noted for the simultaneous and the partial response envelopes. Specifically, our method improves the efficiency of estimating ${\boldsymbol{\beta }_{1}}$ and predicting $\boldsymbol{Y}$ in (3.2), and places no restriction that ${\boldsymbol{X}_{1}}$ be continuous or Normal, thanks to a subtle construction that separates the discrete part from the predictors of interest. Compared with the frequentist envelope approaches, our method is more flexible in incorporating prior information and in quantifying uncertainty through the posterior distribution of the parameters. Overall, our method is inclusive and can be regarded as a building block for future theoretical research on envelope models. It is also an appealing solution for practitioners who seek dimension reduction and want to apply envelope methods to their applications, including almost any problem that can be formulated as a multivariate linear regression with continuous, discrete or mixed predictors and possibly nuisance covariates, but who are unsure which specific envelope model to use. Meanwhile, to the best of our knowledge, we are the first in the Bayesian envelope literature to investigate the performance of several popular dimension selection methods together.
To be less conservative and make more meaningful discoveries, a well-designed multiple comparison procedure controlling the false discovery rate, rather than the FWER, could be considered in our real data application. A natural extension of our work is to generalize $\mathrm{SIMP}$ to the generalized linear model setting. However, for the multivariate probit model, as pointed out in [9], an identification issue exists for the error covariance matrix, which is suggested to be restricted to a correlation matrix. Hence, the extension of $\mathrm{SIMP}$ to the generalized linear model setting might require special strategies, for example the parameter expansion for data augmentation technique ([40], [59]) under a marginally uniform prior for the error correlation matrix [1].