Bayesian Simultaneous Partial Envelope Model with Application to an Imaging Genetics Analysis

As a prominent dimension reduction method for multivariate linear regression, the envelope model has received increased attention over the past decade due to its modeling flexibility and success in enhancing estimation and prediction efficiencies. Several enveloping approaches have been proposed in the literature; among these, the partial response envelope model [57] that focuses on only enveloping the coefficients for predictors of interest, and the simultaneous envelope model [14] that combines the predictor and the response envelope models within a unified modeling framework, are noteworthy. In this article we incorporate these two approaches within a Bayesian framework, and propose a novel Bayesian simultaneous partial envelope model that generalizes and addresses some limitations of the two approaches. Our method offers the flexibility of incorporating prior information if available, and aids coherent quantification of all modeling uncertainty through the posterior distribution of model parameters. A block Metropolis-within-Gibbs algorithm for Markov chain Monte Carlo (MCMC) sampling from the posterior is developed. The utility of our model is corroborated by theoretical results, comprehensive simulations, and a real imaging genetics data application for the Alzheimer's Disease Neuroimaging Initiative (ADNI) study.


Consider the standard multivariate linear regression

Y = μ_Y + β^T X + ε,  (1.1)

where the responses Y ∈ R^r and the centered predictors X ∈ R^p are both multivariate, μ_Y ∈ R^r is the mean of the responses, β ∈ R^{p×r} contains the regression coefficients, and the random error vector ε | X ∼ N_r(0, Σ_{Y|X}). Model (1.1) has applications in broad fields of scientific research, including economics [6], chemistry [35], agriculture [51], engineering [44], bioinformatics [34], etc. Specifically, we consider its application to the imaging genetics problem (IGP). Thanks to rapid technological development in medical imaging and genome sequencing, the diagnosis, prevention and treatment of mental illness have greatly improved. Traditional medical imaging analyses and genome-wide association studies (GWAS), which use only imaging or only genome data, have been studied extensively [67, 63]. A more promising yet more challenging alternative is to directly relate the imaging phenotypes (IPs), as responses, to genotypes (GPs), as predictors of interest, while also accounting for the effects of demographic covariates or other risk factors. This is called the IGP and is less explored in the statistical community [42]. Because of the Central Dogma, often stated as "DNA makes RNA and RNA makes protein," genetic markers should be promising in revealing abnormal protein expression in cortical or sub-cortical brain structures, whose volumes should be more informative quantitative traits than simple disease status.
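The multivariate linear regression just described can be simulated and fit by ordinary least squares in a few lines; the sketch below is illustrative only (all dimensions, parameter values and the seed are ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 200, 5, 3

# Simulate from the model Y = mu_Y + beta^T X + eps, eps ~ N_r(0, Sigma_{Y|X}).
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)                       # centered predictors, as in the model
beta = rng.normal(size=(p, r))            # true coefficient matrix, p x r
mu_Y = np.array([1.0, -2.0, 0.5])
Y = mu_Y + X @ beta + 0.1 * rng.normal(size=(n, r))

# OLS: with centered X, mu_Y is estimated by the mean of Y, and beta by
# (X^T X)^{-1} X^T Y applied jointly across the r response columns.
mu_hat = Y.mean(axis=0)
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
```

With the small noise level used here, `beta_hat` recovers `beta` to within a few hundredths entry-wise.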
Some early practices in studying the IGP focus on univariate [54] or voxel-wise [25] methods, which explore the relationship between every (voxel, SNP) pair (SNP: single-nucleotide polymorphism), or between each individual voxel and all genetic markers together, respectively. These methods enjoy simplicity and can handle high-dimensional imaging or genomic datasets, but cannot leverage spatial dependence among voxels. To this end, several multivariate regression models have been proposed to jointly relate tens of brain summary measures, like the volumes of Regions of Interest (ROIs), to hundreds of SNPs [65, 68, 23, 64, 46]. Among them, [46] is, to the best of our knowledge, the only work that applies the envelope model to the IGP.
As a dimension reduction method developed in the past decade, the first envelope model [15] aims at providing an efficient estimator of the regression coefficients β in the multivariate linear regression (1.1), by assuming that some linear combinations of the responses Y are not related to the regression, and removing their variation as immaterial information (see Section 2.2 for more details). With great success in achieving efficient estimation and modeling flexibility, the original envelope model has been adapted or extended to the predictor envelope [12], the groupwise envelope [46], the partial response envelope [57], the simultaneous envelope [14], sparsity learning [56], matrix-valued [18] or tensor-valued [39] responses or predictors, quantile regression [19], generalized linear models [11], etc. Envelope methods were recently reviewed by [37], and more details are given in [13]. The aforementioned envelope methods are all developed from the frequentist perspective. The literature on Bayesian envelope methods is sparse, yet such methods are appealing for their ability to incorporate prior information into inference and to perform posterior uncertainty quantification, in contrast to the unnatural prior incorporation and the bootstrap or asymptotic variance estimates available in the frequentist setting. [33] introduced the first framework for the Bayesian envelope model. However, this framework is hard to apply to other envelope contexts, since it requires the specific formulation of the response envelope to construct the conjugacy. Furthermore, it requires sampling from generalized matrix Bingham distributions and truncated inverse gamma distributions, which makes the MCMC algorithm computationally expensive. Later, [5] proposed a flexible and handy Bayesian framework for the envelope model, which bypasses the manifold restriction on the basis matrix of the envelope space through an unconstrained matrix re-parameterization. Since all parameters under this new framework are either vectors, unconstrained matrices, or positive definite matrices, the computation is much more efficient, with no sampling from generalized matrix Bingham distributions or truncated inverse gamma distributions.
The rapid development of the envelope family calls for a model that unifies various perspectives and gives practitioners a clearer choice of which envelope model to apply, especially when there is particular interest in a subset of predictors that may be continuous, discrete, or a mix of the two. The existing simultaneous envelope model [14] integrates both the response and predictor envelopes within a unified modeling framework. It simultaneously enjoys the benefits of the two envelope components, i.e., more efficient estimation of the regression coefficients offered by the response envelope, and improved prediction efficiency of the responses obtained by the predictor envelope. However, it cannot focus on the predictors of main interest, and therefore cannot leverage the idea of the partial response envelope [57] to obtain further estimation efficiency. More importantly, without additional assumptions, it is guaranteed to outperform the predictor or the response envelope only for Normal predictors ([14], Propositions 2 and 5). Discrete predictors do not follow a Normal distribution, and including them in the simultaneous envelope by implicitly treating them as Normal leads to biased estimates and invalid inference. The partial response envelope [57] envelopes only the coefficients of the predictors of main interest X_1 (where in (1.1), X is partitioned into (X_1^T, X_2^T)^T, X_1 denotes the predictors of main interest, and X_2 collects nuisance covariates), for example the genetic markers in the IGP. Compared with the full response envelope [15], which indistinguishably envelopes the coefficients of all predictors X, enhanced estimation efficiency is possible since the envelope space can shrink when one concentrates only on the coefficients of X_1; but this model cannot leverage the benefits of the predictor envelope [12], which is known to have the potential to increase prediction efficiency. The envelope model proposed in this paper addresses these limitations while still providing a valid inference procedure.
To address the aforementioned limitations, and to improve the efficiency of both the estimation of the regression coefficients of main interest (the coefficients of X_1, denoted β_1 in (3.2)) and the prediction of the responses (Y in (3.2)), this paper proposes a new envelope model, called the simultaneous partial envelope model (3.7)-(3.10), in Section 3. The proposed model combines the partial response envelope model [57] and the simultaneous envelope model [14] within a more general framework, by further partitioning the predictors of main interest X_1 into X_{1C} and X_{1D}, the continuous part and the discrete part. Our model simultaneously imposes the partial response envelope structure on the coefficients of X_1, and the partial predictor envelope on the coefficients of X_{1C} only, instead of the coefficients of the whole X_1. By this construction, our model addresses the aforementioned limitations of the simultaneous and the partial response envelopes. First, the two envelope components imposed simultaneously in our model are both partial envelopes, which focus only on the predictors of interest. Second, it avoids the Normality requirement mentioned above for the discrete predictors in the simultaneous envelope model while still providing valid inference, by imposing only the partial response envelope (i.e., no partial predictor envelope) on the coefficients of X_{1D}. It is also worth noting that X_{1C} is Normal with mean depending on X_{1D} in our model. Lastly, to further improve the estimation and prediction efficiencies over the partial response envelope, the simultaneous envelope structure is imposed on the coefficients of X_{1C}, which turns out to enhance not only the estimation of the coefficients of X_{1C}, but also that of the coefficients of X_{1D} in our simulations (see the comparisons on the estimation of the coefficients of X_{1D} between our method and the Bayesian partial response envelope in Table 3 and Figure 2), as a possible synergistic effect.
Our proposed model includes a number of important models in the envelope family as special cases, including the (partial) response envelope, the (partial) predictor envelope and the simultaneous envelope (see Table 1 for their relationship). In Section 4, we develop an efficient Bayesian inference procedure based on [5], which allows convenient prior information incorporation and variability quantification, compared with the frequentist envelope methods.
The contribution of this paper is fivefold.
1. We propose a new simultaneous partial envelope model (3.7)-(3.10) that unifies several existing envelope models as special cases. Our model allows both the predictors of interest X_1 and the nuisance covariates X_2 to be continuous, discrete, or a mix of the two. Compared with any envelope that our model degenerates to, our method provides a more efficient, still valid estimator of the regression coefficients of X_1 and improved prediction of Y in (3.2). These advantages address the limitations noted above for the simultaneous envelope model [14] and the partial response envelope model [57]. The corresponding block Metropolis-within-Gibbs algorithm is developed for posterior inference.
2. We establish theoretical properties, including the posterior propriety of our model and the Harris ergodicity of our algorithm.
3. To the best of our knowledge, we are the first in the Bayesian envelope literature to investigate the performance of several popular dimension selection methods together.
4. We apply our method to an imaging genetics analysis for the ADNI study. We show the improved prediction accuracy of our method compared with all other competitors, and obtain some scientifically meaningful findings from posterior selection. By incorporating the weak imaging genetics relationship as prior information, the prediction accuracy is further improved, and the associated shrinkage effect is also investigated.
5. We build the R package SIMP that implements our algorithm, which is available at https://github.com/yanbowisc/SIMP.git.
Some notation frequently used in this paper is summarized here. For any a = 1, 2, ..., let P_S be the projection matrix onto a subspace S ⊆ R^a and P_{S⊥} = I_a − P_S be the projection matrix onto S^⊥, where S^⊥ denotes the orthogonal complement of S and I_a denotes the a-dimensional identity matrix. Let S^{a×a}_+ denote the class of all a × a real-valued positive definite matrices. Let 1_n denote the vector of all ones with length n. For a matrix A ∈ R^{a×a} and a subspace E ⊆ R^a, AE is defined as AE = {Ax : x ∈ E}. The notations |A|, ‖A‖, vec(A) and span(A) denote the determinant, the spectral norm, the vectorization by columns and the column space of the matrix A, respectively. For a_1, a_2, ..., a_n ∈ R and a square matrix S ∈ R^{a×a}, we describe a diagonal matrix either by specifying its diagonal elements with the notation diag(a_1, ..., a_n), or with the notation diag{S} to indicate the diagonal matrix formed from the diagonal elements of S. Moreover, S_{ii} denotes the i-th diagonal element of the square matrix S. For a random vector Y, the distribution of Y | X is interpreted as the distribution of Y given the fixed value of X for a non-stochastic vector X, or the distribution of Y conditional on X for a random vector X. For a, b, c = 1, 2, ..., and random vectors X ∈ R^a, Y ∈ R^b, we use Cov((X, Y) | Z) to denote the a × b covariance matrix of X and Y conditional on a random (or given a non-stochastic) vector Z ∈ R^c. If multiple random (or non-stochastic) vectors are conditioned on (or given), they are bracketed together after the vertical line, as in Y | (•) and Cov((X, Y) | (•)). Also, := indicates equality by definition, and X =_D Y indicates that the random vectors X and Y are equally distributed. Lastly, for any x, y ∈ R, x ≪ y indicates that x is much less than y, while x ≫ y means that x is much greater than y.

The rest of this paper is organized as follows. Section 2 reviews two existing envelope models, which are prototypes of the two envelope components of our proposed model. Section 3 introduces the definition and formulation of our proposed simultaneous partial envelope model. Section 4 develops a Bayesian inference procedure for our model, with the detailed MCMC algorithm left to Appendix A. Section 5 establishes the theoretical properties of our model and algorithm. Section 6 investigates model selection for our Bayesian envelope model. Sections 7 and 8 conduct comprehensive simulation studies and a real data application, respectively. Section 9 concludes the paper with further discussion.

Preliminaries
In this section, we introduce two definitions and one proposition, which are necessary for the formal introduction of the response envelope model [15] and the predictor envelope model [12]. For convenience, the definition of a reducing subspace [10] is introduced first.

If E is a reducing subspace of Σ, there is a crucial decomposition of Σ, as illustrated in the following proposition. Next, for a positive definite matrix M, we introduce the general definition of the M-envelope of a subspace S. It will be used in defining the response and the predictor envelope models in Sections 2.2 and 2.3. For convenience, we will frequently use the notation E_M(S) to define envelope structures in the rest of this paper. Below we briefly review the response envelope model [15] in Section 2.2, and describe how this idea was adapted to define the predictor envelope model [12] in Section 2.3.
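A reducing subspace and the decomposition just mentioned can be checked numerically. The sketch below (dimensions and eigenvalues are ours, chosen for illustration) builds a positive definite Σ that is reduced by a subspace E and verifies Σ = P_E Σ P_E + P_{E⊥} Σ P_{E⊥}:

```python
import numpy as np

rng = np.random.default_rng(1)
a, u = 5, 2

# Build Sigma reduced by E = span(G): Sigma = G D1 G^T + G0 D0 G0^T,
# where (G, G0) is an orthogonal split of R^a obtained from a QR factorization.
Q, _ = np.linalg.qr(rng.normal(size=(a, a)))
G, G0 = Q[:, :u], Q[:, u:]
Sigma = G @ np.diag([3.0, 2.0]) @ G.T + G0 @ np.diag([1.0, 0.5, 0.1]) @ G0.T

P_E = G @ G.T                    # projection onto E (G has orthonormal columns)
P_Eperp = np.eye(a) - P_E        # projection onto the orthogonal complement

# The decomposition from the proposition: Sigma = P_E Sigma P_E + P_{E⊥} Sigma P_{E⊥}.
decomp = P_E @ Sigma @ P_E + P_Eperp @ Sigma @ P_Eperp
```

Equivalently, E reduces Σ exactly when Σ commutes with P_E, which also holds for this construction.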

The Response Envelope Model
In this section, we review the definitions and the coordinate form of the response envelope model [15], which aims to improve the estimation efficiency of β in (1.1) by assuming some linear combinations of Y do not depend on X.
Here, we assume X in (1.1) to be non-stochastic. The efficiency gains come from removing the redundant variation in Y by estimating the associated linear combination coefficients. Formally, the response envelope model is defined by the smallest subspace E_{Y|X} that satisfies Condition 1.
Condition 1 ([15], p. 928). Assume in (1.1) that there is a subspace E ⊆ R^r such that the distribution of P_{E⊥} Y | X does not depend on X, and P_E Y is uncorrelated with P_{E⊥} Y given X.

The smallest such subspace is the response envelope E_{Y|X} := E_{Σ_{Y|X}}(span(β^T)), with dimension d_Y. Let Γ ∈ R^{r×d_Y} and Γ_0 ∈ R^{r×(r−d_Y)} be orthonormal basis matrices of E_{Y|X} and E_{Y|X}^⊥. The coordinate form of the response envelope model is then

Y = μ_Y + Γ η X + ε,   Σ_{Y|X} = Γ Δ Γ^T + Γ_0 Δ_0 Γ_0^T,

where η ∈ R^{d_Y×p} carries the coordinates of β^T relative to Γ, and Δ and Δ_0 carry the coordinates of Σ_{Y|X} relative to Γ and Γ_0, respectively. The response envelope model can provide significant efficiency gains when ‖Δ‖ ≪ ‖Δ_0‖, since there is then a large amount of immaterial information to remove.
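A response envelope structure can be constructed numerically in the same spirit: below, Σ_{Y|X} = ΓΔΓ^T + Γ_0Δ_0Γ_0^T and β^T = Γη, so span(β^T) lies inside the envelope; the sizes of Δ and Δ_0 are chosen (arbitrarily, by us) to mimic the regime where the immaterial variation dominates:

```python
import numpy as np

rng = np.random.default_rng(2)
r, p, d_Y = 5, 3, 2

# Orthonormal bases Gamma (r x d_Y) and Gamma0 (r x (r - d_Y)) via QR.
Q, _ = np.linalg.qr(rng.normal(size=(r, r)))
Gamma, Gamma0 = Q[:, :d_Y], Q[:, d_Y:]

eta = rng.normal(size=(d_Y, p))        # coordinates of beta^T relative to Gamma
Delta = np.diag([1.0, 2.0])            # small material variation
Delta0 = 25.0 * np.eye(r - d_Y)        # large immaterial variation
beta_T = Gamma @ eta                   # beta^T = Gamma eta, an r x p matrix
Sigma_YX = Gamma @ Delta @ Gamma.T + Gamma0 @ Delta0 @ Gamma0.T
```

Both defining properties hold by construction: the columns of β^T lie in span(Γ), and span(Γ) reduces Σ_{Y|X}.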

The Predictor Envelope Model
In this section, we review the predictor envelope model [12] by showing how the idea of the response envelope could be adapted to reduce the predictor space.The predictor envelope model has the potential to improve the prediction of Y in (1.1).
To define the predictor envelope, we assume X to be stochastic, following N(0, Σ_X). Then (1.1) is called the predictor envelope model if the envelope structure E_{Σ_X}(span(β)) is imposed. Unlike the response envelope, E_{Σ_X}(span(β)) is the intersection of all reducing subspaces of Σ_X that contain span(β). The material and immaterial parts of X can be defined similarly, and the immaterial part of X is assumed not to carry information about the regression. Suppose the orthonormal basis matrices of E_{Σ_X}(span(β)) and its orthogonal complement subspace are Υ and Υ_0. The coordinate form of the predictor envelope model is then

Y = μ_Y + ψ^T Υ^T X + ε,   Σ_X = Υ Ξ Υ^T + Υ_0 Ξ_0 Υ_0^T,

where ψ carries the coordinates of β relative to Υ, and Ξ = Υ^T Σ_X Υ and Ξ_0 = Υ_0^T Σ_X Υ_0 carry the coordinates of Σ_X relative to Υ and Υ_0, respectively. Large efficiency gains can be obtained when ‖Ξ‖ ≫ ‖Ξ_0‖.

PROPOSED SIMULTANEOUS PARTIAL ENVELOPE MODEL
As mentioned in Section 1, suppose that our predictors X ∈ R^p can be partitioned into three parts, X = (X_{1C}^T, X_{1D}^T, X_2^T)^T, where X_1 = (X_{1C}^T, X_{1D}^T)^T ∈ R^{p_1} collects the predictors of main interest in the multivariate linear regression, with continuous part X_{1C} ∈ R^{p_C} and discrete part X_{1D} ∈ R^{p_D} (p_1 = p_C + p_D), and X_2 ∈ R^{p_2} denotes the set of predictors that are not of main interest (p = p_1 + p_2). By separating X_{1D} from X_{1C}, we can avoid the Normality restriction on X_{1D}, and hence on the whole X_1, while still providing a valid simultaneous envelope estimator for the coefficients of X_{1C}. Assuming the predictors are not centered, the standard multivariate linear model can be written as

Y = μ_Y + β_{1C}^T X_{1C} + β_{1D}^T X_{1D} + β_2^T X_2 + ε,  (3.2)

where ε | X ∼ N_r(0, Σ_{Y|X}). We first impose some distributional assumptions: X_{1D} is non-stochastic, and X_{1C} conditional on X_{1D} is Normal with covariance Σ_{C|D} and a mean that depends linearly on X_{1D}, with intercept μ_{1C}. Under this assumption, μ_{1C} should only be interpreted as the conditional expectation of X_{1C} (rather than the marginal expectation of X_{1C}) given the model parameters, due to the non-stochasticity of X_{1D}. However, we will keep using μ_{1C} to avoid abusing notation.
To provide efficiency gains in estimating β_1 and predicting Y, we combine the advantages of the response envelope for estimation and the predictor envelope for prediction, by imposing the following two partial envelope structures on (3.2) (by Definition 2) simultaneously:

E_{C|D} := E_{Σ_{C|D}}(L) with dimension d_X, and E_{Y|1} := E_{Σ_{Y|X}}(R) with dimension d_Y,

where L and R denote span(β_{1C}) and span(β_1^T), and d and d_1 denote rank(β_{1C}) and rank(β_1), respectively. Model (3.2) is called the Simultaneous Partial Envelope Model (SIMP) if these distributional assumptions and envelope structures are imposed. Note that SIMP avoids the Normality requirement for the whole X_1 (i.e., no assumption on X_{1D} besides non-stochasticity, while X_{1C} is Normal with mean depending on X_{1D}, which relaxes the identically Normal assumption on each observation of the predictors in [14]) by imposing E_{Y|1} on the row space of the whole β_1 but E_{C|D} on the column space of β_{1C} only.
It is worth noting that SIMP is a broad class of envelopes that degenerates to several popular envelope models under certain conditions, as listed in Table 1. Below, we introduce Conditions 3 and 4 in Section 3.1, which equivalently define SIMP and offer more intuition on where the efficiency gains come from when these two envelope components are assumed. For the convenience of developing the Bayesian inference procedure in Section 4, we give the coordinate form of SIMP in (3.7)-(3.10) in Section 3.3. In preparation for Section 3.3, Section 3.2 introduces a re-parameterization trick to avoid direct Bayesian inference on matrix parameters living in the Stiefel manifold.

Table 1: Envelope models that SIMP degenerates to, the corresponding conditions, and the numbers of free parameters.

  Model                            Condition                   Number of parameters
  response envelope [15]           p_C = p_2 = 0               d_Y p + r(r + 3)/2
  predictor envelope [12]          p_D = p_2 = 0 and d_Y = r   r(r + 2d_X + 3)/2 + p(p + 3)/2
  simultaneous envelope [14]       p_D = p_2 = 0               d_Y d_X + r(r + 3)/2 + p(p + 3)/2
  partial response envelope [57]   p_C = 0                     d_Y p_1 + r(r + 2p_2 + 3)/2
  partial predictor envelope [45]  p_2 = 0 and d_Y = r         r(r + 2d_X + 2p_D + 3)/2 + p_C(p_C + 3)/2

Equivalent Conditions for SIMP
In this section, we introduce Conditions 3 and 4, which can equivalently define SIMP for (3.2). These two conditions will not be used in the formulation of SIMP in Section 3.3, but offer some intuition on the source of the efficiency gains. The equivalence between the conditions in this section and the envelope structures that we have given for SIMP (by Definition 2) holds for the same reason as the equivalence between Conditions 1 and 2 for the response envelope model in Section 2.2.
Condition 3 excludes the direct and indirect (via P_{S_{C|D}} X_{1C}) partial effects of P_{S_{C|D}^⊥} X_{1C} on Y. The intersection of all subspaces satisfying Condition 3 gives the same partial predictor envelope space E_{C|D} as defined previously by E_{Σ_{C|D}}(L). We call P_{E_{C|D}} X_{1C} and P_{E_{C|D}^⊥} X_{1C} the material part and the immaterial part of X_{1C}. Given X_{1D} and X_2, the immaterial part of X_{1C} is assumed by Condition 3 not to affect Y. Similarly, the partial response envelope space E_{Y|1}, previously defined by E_{Σ_{Y|X}}(R), is also the intersection of all subspaces satisfying Condition 4. We call P_{E_{Y|1}} Y and P_{E_{Y|1}^⊥} Y the material part and the immaterial part of Y. Given X_2, the immaterial part of Y is assumed not to be affected by X_1.

Re-Parameterization of the Basis Matrices
We only consider the case 0 < d_X < p_C and 0 < d_Y < r in this section. When d_X ∈ {0, p_C} or d_Y ∈ {0, r}, we can simply take the orthonormal bases of the envelope space E_{C|D} or E_{Y|1} and of the orthogonal complement subspace to be either null or the identity matrix. When 0 < d_X < p_C and 0 < d_Y < r, let L, L_0, R and R_0 be orthonormal basis matrices of E_{C|D}, E_{C|D}^⊥, E_{Y|1} and E_{Y|1}^⊥, respectively. To avoid the difficulty of direct Bayesian inference on L, L_0, R and R_0, which all live in the Stiefel manifold, a re-parameterization of L, L_0, R and R_0 is used in the coordinate form of SIMP in Section 3.3. We illustrate the re-parameterization of L and L_0 here; the same idea applies to R and R_0. Let L_1 be the matrix formed by the upper d_X rows of L, and L_2 the matrix containing the remaining rows. Without loss of generality, we assume that L_1 is nonsingular; otherwise, we can reorder the rows of L to make L_1 nonsingular. Then define A := L_2 L_1^{-1} ∈ R^{(p_C−d_X)×d_X}. [56] shows that A depends on L only through span(L), and that there is a one-to-one correspondence between E_{C|D} and A. [8] further shows that (I_{d_X}, A^T)^T is a basis of E_{C|D} and (−A, I_{p_C−d_X})^T is a basis of E_{C|D}^⊥. Therefore, after normalization,

L(A) = (I_{d_X}, A^T)^T (I_{d_X} + A^T A)^{−1/2} and L_0(A) = (−A, I_{p_C−d_X})^T (I_{p_C−d_X} + A A^T)^{−1/2}

are a pair of orthonormal bases of E_{C|D} and E_{C|D}^⊥, respectively. The re-parameterization of R and R_0 into R(B) and R_0(B), with unconstrained B, is obtained analogously.
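A quick numerical check of this construction (the `inv_sqrt` helper and the dimensions are ours) confirms that the normalized bases are orthonormal and mutually orthogonal for an arbitrary unconstrained A:

```python
import numpy as np

def inv_sqrt(M):
    """Inverse symmetric square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(w ** -0.5) @ V.T

rng = np.random.default_rng(3)
p_C, d_X = 6, 2
A = rng.normal(size=(p_C - d_X, d_X))   # unconstrained parameter A

# L(A): stack I_{d_X} on top of A, then normalize by (I + A^T A)^{-1/2}.
L_A = np.vstack([np.eye(d_X), A]) @ inv_sqrt(np.eye(d_X) + A.T @ A)
# L0(A): stack -A^T on top of I_{p_C - d_X}, then normalize by (I + A A^T)^{-1/2}.
L0_A = np.vstack([-A.T, np.eye(p_C - d_X)]) @ inv_sqrt(np.eye(p_C - d_X) + A @ A.T)
```

The orthogonality L(A)^T L_0(A) = 0 follows because (I, A^T)(−A^T; I) = −A^T + A^T = 0 before normalization.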

Coordinate Form of SIMP
With the re-parameterization of Section 3.2, write β_{1C} = L(A) η_C R(B)^T for some η_C ∈ R^{d_X×d_Y}, which carries the coordinates of β_{1C} relative to L(A) and R(B). Also, β_{1D}^T = R(B) η_D for some η_D ∈ R^{d_Y×p_D}, which carries the coordinates of β_{1D}^T relative to R(B). Then the coordinate form of SIMP is given by

β_{1C} = L(A) η_C R(B)^T,  (3.7)
β_{1D}^T = R(B) η_D,  (3.8)
Σ_{C|D} = L(A) Ω L(A)^T + L_0(A) Ω_0 L_0(A)^T,  (3.9)
Σ_{Y|X} = R(B) Φ R(B)^T + R_0(B) Φ_0 R_0(B)^T,  (3.10)

as illustrated in (3.5) and (3.6). This implies that the rows and columns of β_{1C} depend on the partial predictor and partial response envelopes only, and the columns of β_{1D}^T depend on the partial response envelope only. Note that

R(B)^T Y = R(B)^T μ_Y + η_C^T L(A)^T X_{1C} + η_D X_{1D} + R(B)^T β_2^T X_2 + R(B)^T ε,
R_0(B)^T Y = R_0(B)^T μ_Y + R_0(B)^T β_2^T X_2 + R_0(B)^T ε.

This pair of equations shows that X_2 affects both the material and immaterial parts of Y, X_{1D} affects only the material part of Y, and only the material part of X_{1C} affects Y, and then only its material part. The estimation of β_{1C} benefits from removing the redundant variation of both R_0(B)^T Y and L_0(A)^T X_{1C}, whereas the efficiency gains in estimating β_{1D} come only from removing the redundant variation of R_0(B)^T Y. Inherited from the (partial) predictor and the (partial) response envelope models, the partial predictor envelope component of SIMP offers large efficiency gains when ‖Ω‖ ≫ ‖Ω_0‖, and significant advantages of the partial response envelope component require ‖Φ‖ ≪ ‖Φ_0‖.

BAYESIAN INFERENCE
In this section, we develop a Bayesian procedure for the statistical inference of SIMP, hence our method is called the Bayesian simultaneous partial envelope model.It is implemented through sampling from the posterior distribution of parameters.The determination of the posterior distribution requires both the likelihood function and prior distributions.

Suppose that we have n independent observations from SIMP, and let Θ collect all model parameters, so that the log-likelihood can be written as l(Θ) + const, where const denotes a constant that does not depend on the parameters of the model. To perform Bayesian inference on Θ for SIMP, we put the following prior distributions on Θ, where MN and IW denote the matrix normal and inverse-Wishart distributions; see Appendix E for a detailed introduction of these two distributions. Here we implicitly assume . Prior information about the parameters can easily be incorporated through the hyperparameters. For example, if we know a priori that the partial predictor envelope E_{C|D} is likely to be E^{prior}_{C|D}, then we can compute A^{prior} from a basis of E^{prior}_{C|D} by (3.3), and set A_0 to A^{prior}. Our confidence in the prior for A is encoded in K_A and Σ_A. Specifically, the prior covariance matrices for the i-th row and the j-th column of A are K_{A,ii} Σ_A and Σ_{A,jj} K_A, respectively, where K_{A,ii} and Σ_{A,jj} are the i-th and j-th diagonal elements of K_A and Σ_A, and small K_{A,ii}'s or Σ_{A,jj}'s reflect a strong prior belief about the corresponding row(s) or column(s) of A^{prior}. The Bayesian inference is made via a Metropolis-within-Gibbs algorithm that generates samples from the posterior distribution of Θ. A number of posterior samples at early iterations are discarded as burn-in; suppose the retained posterior samples have sample size S. The algorithm is provided in Appendix A, and the warm-start initial estimator that we propose for the MCMC algorithm is detailed in Appendix B.
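The block Metropolis-within-Gibbs idea can be sketched on a toy two-block target; this stands in only schematically for the actual sampler of Appendix A, and the target density, proposal scale and iteration counts below are our own choices. One block has a closed-form full conditional and is updated by Gibbs; the other is updated by a random-walk Metropolis step:

```python
import numpy as np

# Toy target: pi(x, y) proportional to exp(-(x - y)^2/2 - y^2/2),
# so x | y ~ N(y, 1) (Gibbs block) while y is updated by Metropolis.
rng = np.random.default_rng(4)

def log_target(x, y):
    return -0.5 * (x - y) ** 2 - 0.5 * y ** 2

n_iter, burn = 20000, 10000
x, y = 0.0, 0.0
samples = np.empty((n_iter, 2))
for t in range(n_iter):
    # Block 1: exact draw from the full conditional x | y.
    x = rng.normal(loc=y, scale=1.0)
    # Block 2: random-walk Metropolis step for y | x.
    y_prop = y + rng.normal(scale=1.0)
    if np.log(rng.uniform()) < log_target(x, y_prop) - log_target(x, y):
        y = y_prop
    samples[t] = (x, y)

kept = samples[burn:]    # discard burn-in, as in Section 4
```

Under this target the marginals are x ~ N(0, 2) and y ~ N(0, 1), which the retained samples should approximately reproduce.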

THEORETICAL PROPERTIES
In this section, we establish two theoretical properties of our method. We first establish the propriety of the joint posterior density of all parameters in Theorem 1, even though the priors of μ_{1C} and μ_Y are improper. Proofs are left to Appendix D.

Theorem 1. The joint posterior density of all the parameters of SIMP is proper.
Theorem 2 establishes the Harris ergodicity of the Metropolis-within-Gibbs algorithm developed for the Bayesian inference of SIMP. This property ensures the asymptotic convergence of the Markov chain generated by our algorithm to the joint posterior distribution, irrespective of the starting point. Without this guarantee, the convergence may fail on a set of starting points of measure zero; despite having measure zero, some starting points in this pathological null set can arise naturally, as illustrated in [48], for example. Technical details, including definitions of the probabilistic terminology and proofs, are left to Appendix D.

SELECTION OF ENVELOPE DIMENSIONS
To search the envelope dimensions effectively, we first select the minimum possible dimension of the two envelope components, which is the rank of β_{1C}, and thereby narrow the range of dimensions to be searched, saving computational cost.

Selecting d = rank(β_{1C})
To determine d, we adapt the Bura-Cook estimator [4] to our model setting. This estimator consists of a sequence of chi-squared tests. For k = 0, 1, ..., min(p_C, r) − 1, the null hypothesis rank(β_{1C}) = k is tested against the alternative rank(β_{1C}) > k, based on a standardized coefficient estimate that involves S_{1C}, the sample covariance matrix of X_{1C}. The residuals of the regression of Y on X_{1D} and X_2 are regressed on X_{1C} again, and the estimated regression coefficients and the residual sample covariance matrix of this second regression, the latter denoted S_{R_{Y|1D,2}|1C}, are used to form the test statistic. Hereinafter, if the method of a regression is not specified, it always refers to frequentist ordinary least squares. [4] shows that, under the null hypothesis, n times the sum of the smallest min(p_C, r) − k squared singular values of the standardized coefficient estimate asymptotically follows a chi-squared distribution with (p_C − k)(r − k) degrees of freedom; the null is rejected when the statistic exceeds the upper-α quantile of this distribution, where α is the pre-specified significance level, defaulting to 0.05 in our numerical studies. We choose d as the first non-significant value of k, or d = min(p_C, r) otherwise. The Bura-Cook estimator exhibits excellent ability in selecting d, as illustrated in the third column of Table 2. Details of the simulation set-up and numerical performance are given in Section 7.1.
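The sequential testing scheme can be sketched as follows. The function below is a simplified stand-in: it operates on the singular values of an already-standardized coefficient estimate (the standardization itself is omitted), and uses (p_C − k)(r − k) chi-squared degrees of freedom as in the Bura-Cook construction described above:

```python
import numpy as np
from scipy.stats import chi2

def select_rank(svals, n, p, r, alpha=0.05):
    """Sequential chi-squared rank selection (simplified sketch).

    svals: singular values of a suitably standardized coefficient estimate.
    Returns the first k whose test statistic is non-significant, else min(p, r).
    """
    svals = np.sort(np.asarray(svals, dtype=float))[::-1]
    m = min(p, r)
    for k in range(m):
        stat = n * float(np.sum(svals[k:] ** 2))   # m - k smallest values
        df = (p - k) * (r - k)
        if stat <= chi2.ppf(1 - alpha, df):        # first non-significant k
            return k
    return m

# One dominant singular value suggests rank 1.
print(select_rank([2.0, 0.001], n=500, p=3, r=2))   # -> 1
```

In the example, the statistic for k = 0 is about 2000 (highly significant against df = 6), while for k = 1 it is about 0.0005, so the test stops and returns 1.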

Selecting Envelope Dimensions
Dimensions d_X and d_Y must be specified before fitting SIMP. In Section 7.1, we investigate the performance of Bayesian cross-validation (CV) and four information criteria, the Akaike information criterion (AIC-MCMC), the Bayesian information criterion (BIC-MCMC), the deviance information criterion (DIC) and the Watanabe-Akaike information criterion (WAIC), in selecting these two envelope dimensions for SIMP. They are chosen for their popularity in the Bayesian literature. Dimension selection is critical in envelope modeling. To the best of our knowledge, we are the first to investigate the performance of these methods together for the dimension selection of a Bayesian envelope model; previous Bayesian envelope literature simply chooses one of them directly [5, 38]. For each fixed d_X and d_Y,

AIC-MCMC = −2 l(Θ̂^max_{d_X,d_Y}) + 2 N(d_X, d_Y),   BIC-MCMC = −2 l(Θ̂^max_{d_X,d_Y}) + log(n) N(d_X, d_Y),

where l(Θ̂^max_{d_X,d_Y}) is the largest log-likelihood value attained by the MCMC samples after burn-in, with the envelope dimensions fixed at d_X and d_Y for the associated parameters, and N(d_X, d_Y) is the effective number of free parameters in SIMP. Note that the definitions of AIC-MCMC and BIC-MCMC here differ slightly from the ones readers may be used to in the frequentist setting: we replace the maximum likelihood estimates of the parameters by the empirical maximizers over the retained MCMC samples. The definitions of DIC, WAIC and Bayesian CV involve more notation and hence are left to Appendix F. We choose d_X and d_Y as the minimizers of either the negative of the average out-of-sample estimate of the log predictive density for the Bayesian CV (see Appendix F), or one of the four information criteria, over each pair d_X = d̂, ..., p_C and d_Y = d̂, ..., r by grid search, where d̂ is the estimate of d from the Bura-Cook estimator introduced in Section 6.1.
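The criterion computation and grid search can be sketched as below; the log-likelihood traces and parameter counts in the example are fabricated purely for illustration:

```python
import numpy as np

def aic_bic_mcmc(loglik_samples, n_params, n_obs):
    """AIC-MCMC / BIC-MCMC: plug the best log-likelihood attained over
    retained MCMC samples into the usual information-criterion formulas."""
    l_max = float(np.max(loglik_samples))
    aic = -2.0 * l_max + 2.0 * n_params
    bic = -2.0 * l_max + np.log(n_obs) * n_params
    return aic, bic

# Toy grid search over candidate (d_X, d_Y) pairs: pick the pair that
# minimizes BIC-MCMC. All numbers below are made up for illustration.
loglik_grid = {(1, 1): [-120.0, -118.5], (2, 1): [-100.0, -99.0], (2, 2): [-98.0, -97.5]}
n_params_grid = {(1, 1): 10, (2, 1): 14, (2, 2): 19}
best = min(loglik_grid,
           key=lambda dd: aic_bic_mcmc(loglik_grid[dd], n_params_grid[dd], n_obs=300)[1])
print(best)   # -> (2, 1)
```

Here the middle model wins: its extra fit over (1, 1) outweighs the log(n)-scaled penalty, while the further gain of (2, 2) does not.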
For our method, we specify the hyperparameters Z, F, W, T, A_0 and B_0 to be zero matrices; M, Λ, E and Q to be 10^{-6} times the identity matrices; K_A, K_B, Σ_A and Σ_B to be 10^{6} times the identity matrices; w_X, w_{X0}, w_Y and w_{Y0} to be the row numbers of Ω, Ω_0, Φ and Φ_0, respectively; and Ψ_X, Ψ_{X0}, Ψ_Y and Ψ_{Y0} to be identity matrices multiplied by 10^{-6} and by their respective degrees of freedom (correspondingly, w_X, w_{X0}, w_Y and w_{Y0}). The MCMC algorithm for our method is run for 20,000 iterations, with the first 50% of iterations discarded as burn-in. For each case, performance is averaged over 500 repetitions. The percentages with which the Bura-Cook estimator correctly identifies d, and with which each of the five methods correctly identifies the true d_X or d_Y individually or the true (d_X, d_Y) jointly, are listed in Table 2.
Results in Table 2 reveal the advantage of BIC-MCMC in selecting d_X and d_Y for SIMP across various settings of the true d_X and d_Y and sample sizes, except when d_X = 6, d_Y = 2 and the sample size is small. This suggests that the excellent performance of BIC-MCMC may require at least moderate sample sizes. AIC-MCMC and WAIC also perform relatively well in the large-d_X, large-d_Y setting. These observations, that BIC-MCMC needs a large sample size to respond and that AIC-MCMC performs better under large envelope dimensions, are consistent with previous findings for envelope models [58]. Compared with under-estimation, slight over-estimation of these two envelope dimensions is acceptable, since only some estimation efficiency, rather than material information, is lost [5]. The right panel of Figure 1 illustrates this phenomenon: the estimation biases (i.e., the square roots of the vertical gaps between the green and the red lines, by the bias-variance decomposition) are large for under-estimated envelope dimensions, whereas with correct or over-estimated envelope dimensions, the biases are nearly zero and the estimated variances increase only linearly. Although AIC-MCMC, DIC and WAIC all show some tendency to over-estimate d_X and d_Y in Table 2, the most pronounced is DIC (see the left panel of Figure 1). Hence, AIC-MCMC and WAIC may also be considered in practice, especially for small sample sizes or the large-d_X, large-d_Y setting. The Bayesian CV chooses envelope dimensions based on prediction performance. Although its performance in selecting the true envelope dimensions is far from the best in our simulations, it is still recommended in practice if prediction performance is the utmost concern and the computational burden is manageable.

Setting
In this section, we compare the estimation performance of SIMP with the Bayesian partial predictor envelope (PX-env), the Bayesian partial response envelope (PY-env) and several other classical envelope methods or multivariate regression approaches, including the frequentist predictor envelope (FX-env) [12], the frequentist response envelope (FY-env) [15], the frequentist simultaneous envelope (FS-env) [14], the frequentist partial response envelope (FPY-env) [57], the frequentist ordinary least squares (FOLS), the principal component regression (PCR) [32], the partial least squares regression (PLSR) [16], the canonical correlation analysis (CCA) [26] and the reduced rank regression (RRR) [29]. Note that in Sections 7.2 and 8, PX-env and PY-env simply refer to SIMP with d Y and d X fixed at r and p C respectively, rather than the Bayesian versions of the partial predictor and partial response envelopes exactly as indicated in Table 1; the differences are small (see the simulation results comparing FPY-env and PY-env in Sections 7.2.2-7.2.5), and this convention facilitates the comparison. We choose the posterior mean as the point estimator here, and later in Section 8, for all the Bayesian (envelope) approaches. It is a decision-theoretic estimator and is easily obtained once the posterior samples are available. In comparison, the posterior median and the maximum a posteriori (MAP) estimators are not considered in our paper: they are expected to perform similarly (see Table C.4 in Appendix C for numerical evidence on the posterior median estimator) due to the bell-shaped posterior distribution observed in both the simulation studies (see Appendix G) and the real data application (see Appendix H.2), and the MAP estimator requires additional computation for the optimization.
In this section, we keep almost all set-ups from Section 7.1, except for the following. We test nine cases, one for each combination of r = 3, 8 or 15 with n = 300, 500 or 1000. The covariates of X 1D are generated independently from the discrete Unif{0, 1}. Dimensions d X and d Y are both fixed at 2. The specification of the hyperparameters (those still present), the number of iterations and the burn-in proportion of the MCMC algorithm for PX-env, PY-env and SIMP in this section are the same as those for SIMP in Section 7.1.
To make the comparison fair, for competitors that cannot account for the partial structures (i.e., all methods excluding SIMP, PX-env, PY-env and FPY-env), Y is first regressed on X 2 , and the fitted residuals serve as the actual responses for the regression on the predictors of interest X 1 . This is a common way to adjust for the effects of prognostic factors that are not of main interest in IGP [68]. Tuning parameters and envelope dimensions for all methods are assumed to be unknown. The envelope dimensions for all Bayesian envelope methods (SIMP, PX-env, PY-env) are chosen by BIC-MCMC, and those for the frequentist envelope methods (FPY-env, FX-env, FY-env and FS-env) are chosen by the traditional BIC. The numbers of components for PCR and PLSR are selected by minimizing the mean squared prediction error (MSPE) from 5-fold CV. For k = 1, 2, . . ., 5, let C k represent the k-th index set among the five partition sets of the n samples. For ℓ = 1, 2, . . ., p 1 , we determine the number of components as the ℓ that minimizes the 5-fold MSPE. For sample i in C k , y ij represents the j-th response adjusted by X 2 , while ŷ ij (−k) (ℓ) is the prediction of y ij from X 1C,i and X 1D,i , for either PCR or PLSR trained with ℓ components on all samples except those in C k . The rank for RRR is determined by the original Bura-Cook estimator [4]. As the estimation performance for μ 1C , μ Y and β 2 is very close among the competitors we consider, and not of main interest to us, their results are not reported for compactness. For the true data generating mechanisms, we consider three models (M1-M3) from SIMP, and one model (M4) from RRR to investigate model mis-specification. Under (M1), the mean squared errors (MSE) for β 1C and β 1D , the two parameters we are most interested in, are reported in Table 3. For almost every combination of r and n, SIMP performs the best for the estimation of both β 1C and β 1D . PX-env and PY-env, two special cases of SIMP obtained by setting d Y = r or d X = p C , can achieve competitive performance for either β 1C only, or the whole of β 1 , but are still less efficient than SIMP. Comparing these two models, we find that the former obtains slightly better performance for β 1C while the latter achieves much better performance for β 1D . This observation is consistent with our assumptions for SIMP in (3.7), where we have assumed both the partial predictor and partial response envelope structures for β 1C , with a slightly stronger signal from the former component here (‖Ω‖/‖Ω 0 ‖ > ‖Φ 0 ‖/‖Φ‖), and only the partial response envelope structure for β 1D . Although PY-env and SIMP share the same partial response envelope structure for β 1D , it is surprising that the latter outperforms the former for the estimation of β 1D when r = 3 and 8, possibly because the more efficient estimator of β 1C improves the estimation of R(B), and hence the estimation of β 1D , in SIMP. Meanwhile, it is noteworthy that PY-env and FPY-env show similar estimation performance for both β 1C and β 1D under all cases, but the former allows more convenient posterior variability quantification and utilization of prior information. FX-env, FY-env and FS-env cannot outperform their partial envelope counterparts, and the other five non-envelope methods cannot outperform any of the partial envelope methods.
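The 5-fold CV selection of the component count described above can be sketched for PCR as follows. This is a minimal sketch with toy data of our own; the helper names are hypothetical, and the PLSR case would only swap the inner fitting routine.

```python
import numpy as np

def pcr_fit_predict(Xtr, Ytr, Xte, n_comp):
    # Principal component regression: project centered X onto its top
    # n_comp principal directions, then run OLS of centered Y on the scores.
    mx, my = Xtr.mean(0), Ytr.mean(0)
    _, _, Vt = np.linalg.svd(Xtr - mx, full_matrices=False)
    V = Vt[:n_comp].T                                   # p x n_comp loadings
    coef, *_ = np.linalg.lstsq((Xtr - mx) @ V, Ytr - my, rcond=None)
    return (Xte - mx) @ V @ coef + my

def choose_n_comp_cv(X, Y, max_comp, k=5, seed=0):
    # Pick the component count minimizing the k-fold mean squared
    # prediction error (MSPE), mirroring the 5-fold CV criterion above.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    mspe = []
    for m in range(1, max_comp + 1):
        sse = 0.0
        for idx in folds:
            mask = np.ones(len(X), dtype=bool)
            mask[idx] = False
            pred = pcr_fit_predict(X[mask], Y[mask], X[idx], m)
            sse += np.sum((Y[idx] - pred) ** 2)
        mspe.append(sse / Y.size)
    return int(np.argmin(mspe)) + 1, mspe

# Toy data: 6 predictors, 2 responses, signal carried by 2 coordinates.
rng = np.random.default_rng(3)
X = rng.standard_normal((200, 6))
B = np.zeros((6, 2))
B[:2] = rng.standard_normal((2, 2))
Y = X @ B + 0.1 * rng.standard_normal((200, 2))
best, mspe = choose_n_comp_cv(X, Y, max_comp=6)
```

With isotropic toy predictors the PCs do not concentrate the signal, so the selected count need not equal the true signal dimension; the point is only the mechanics of the CV criterion.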
To further illustrate the advantage of SIMP from the perspective of posterior variability, the average of the posterior standard deviations (PSD) of each coordinate of vec(β 1C ) and vec(β 1D ) is calculated as the average of the empirical standard deviations of the posterior samples of vec(β 1C ) and vec(β 1D ) across 500 repetitions. The average PSD is compared among SIMP, Bayesian linear regression (BLR, implemented as SIMP with d X = p C and d Y = r), PX-env and PY-env under (M1). The envelope dimensions are assumed to be unknown and selected by BIC-MCMC for all three envelope methods. The results are reported in Figure 2. In terms of both β 1C and β 1D , the three envelope methods always significantly outperform BLR in all cases. SIMP and PX-env obtain the lowest average PSD for β 1C , and are better than PY-env especially when r = 3. This is as expected, since we have imposed both partial predictor and partial response envelope structures on β 1C but the signal of the former is stronger. As r becomes 8 or 15, the performance of PY-env approaches that of the other two envelope methods. In terms of β 1D , SIMP achieves the lowest PSD under all scenarios. When r = 8 or 15, the performance of PY-env is close to SIMP, and is significantly better than PX-env. This is also as expected, since the partial response envelope structure is imposed on β 1D for both SIMP and PY-env; the slight advantage of SIMP for the estimation of β 1D is possibly due to the more efficient estimation of β 1C , one synergistic effect of imposing the two envelope components simultaneously. Although no envelope assumption is imposed on β 1D for PX-env, the posterior variability of the estimator of β 1D from PX-env is still smaller than that of BLR.
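The PSD computation above reduces to a few lines. This is a minimal sketch under our own naming; `post_samples` stands for the retained MCMC draws of one repetition, and the outer average over 500 repetitions is indicated in the comment.

```python
import numpy as np

def average_psd(post_samples):
    # post_samples: array of shape (S, q) holding S retained posterior draws
    # of the q coordinates of vec(beta). The PSD of each coordinate is its
    # empirical SD over the draws; we then average across coordinates.
    return np.std(post_samples, axis=0, ddof=1).mean()

# Averaging over repetitions, as in the simulation:
# avg_psd = np.mean([average_psd(s) for s in samples_per_repetition])
```

A tighter average PSD (as SIMP achieves in Figure 2) means narrower marginal posteriors for the coefficients, i.e., less posterior variability per coordinate.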

Results on (M2): ‖Ω‖ ≪ ‖Ω 0 ‖ and ‖Φ‖ ≪ ‖Φ 0 ‖

Table 4 shows the performance under (M2). Since ‖Ω‖ ≪ ‖Ω 0 ‖, the partial predictor envelope component cannot provide many efficiency gains. Therefore, as expected, the best estimation performance comes from envelope models that contain the (partial) response envelope structure, including FY-env, FS-env, FPY-env, PY-env and SIMP. Although the best performance is mostly achieved by FY-env and FPY-env, the differences among these five methods are small. Since the partial response envelope structure is imposed on both β 1C and β 1D in (3.7), large efficiency gains are observed in estimating both of these two parameters.

Results on (M3): ‖Ω‖ ≫ ‖Ω 0 ‖ and ‖Φ‖ ≫ ‖Φ 0 ‖

The estimation performance under (M3) is shown in Table 5. Unsurprisingly, the best performance is obtained by envelope models containing the partial predictor envelope structure, i.e., PX-env and SIMP. Similarly, their estimation performances for β 1C are close, and the tiny difference between them mainly comes from model selection. It is worth noting that when r = 8 or 15, some additional gains in estimating β 1D are offered by SIMP compared with PX-env: the partial response envelope structure is assumed in SIMP for β 1D , and although its signal may not be strong (‖Φ‖ ≫ ‖Φ 0 ‖), the synergistic effect of SIMP mentioned above could enhance this advantage.

Results on (M4): Model Mis-Specification
To test the robustness of our method, an additional simulation under the model mis-specification scenario (M4) is studied. From Table 6, the RRR estimator almost always produces the lowest MSE when r = 3 or 8. Even under the mis-specified model, the envelope methods still perform quite well. Among the envelope methods, SIMP obtains relatively better performance for both β 1C and β 1D when r is 8, and FX-env is better for β 1D when r is 3 or 15. Note that when r = 15, several envelope methods even outperform the RRR estimator. A more cautious look at these results reveals that the performance of all envelope methods, especially SIMP, is largely contaminated by the "bad" envelope dimensions selected under this model mis-specification scenario, since the corresponding likelihoods might depart significantly from the truth and hence mislead model selection. See Table C.5 in Appendix C for the estimation results when we rule out this factor by using the "optimal" tuning parameter or envelope dimension for each method, as if the true values of β 1C and β 1D were known. Under that scenario, the envelope methods almost always dominate the other methods (even RRR), and SIMP is almost always the best among the envelope methods except when r = 3, where its performance is still very close to the best. Therefore, the robustness of our method and the other envelope methods against a general low-rank model (RRR) is observed in our simulation.
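The low-rank structure that (M4) imposes can be made concrete with a standard identity-weighted RRR fit. This is a sketch under our own assumptions, not necessarily the exact RRR estimator or data mechanism of the paper: full OLS, then projection of the fitted values onto the top right singular directions.

```python
import numpy as np

def rrr(X, Y, rank):
    # Reduced rank regression with identity weighting:
    # 1) full OLS fit, 2) SVD of the fitted values, 3) project the OLS
    # coefficients onto the top-`rank` right singular directions.
    B_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
    _, _, Vt = np.linalg.svd(X @ B_ols, full_matrices=False)
    V = Vt[:rank].T
    return B_ols @ V @ V.T            # p x r coefficient matrix, rank <= `rank`

# Toy (M4)-style mechanism: a true rank-1 coefficient matrix plus noise.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
B_true = rng.standard_normal((5, 1)) @ rng.standard_normal((1, 4))   # rank 1
Y = X @ B_true + 0.1 * rng.standard_normal((200, 4))
B_hat = rrr(X, Y, rank=1)
```

When the data truly follow such a mechanism, the rank constraint is the "correct" structure, which is why RRR is hard to beat in Table 6 unless the envelope dimensions are well chosen.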
Besides, in Appendix C, we display additional numerical results for the estimation performance of FPY-env, PX-env, PY-env and SIMP with the true envelope dimensions known under (M1)-(M3), of RRR with the true rank known under (M4), and for p 2 = 4 or p D = 4 with the other settings the same as those in (M1).

Background
Data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). Led by Principal Investigator Michael W. Weiner, MD, ADNI was launched in 2003 as a public-private partnership and has proceeded through four stages (ADNI1, ADNI-GO, ADNI2 and ADNI3) to date. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), genetic or other biological markers, and clinical and neuropsychological assessments can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer's disease (AD). The study has strict enrollment standards, follow-up and data checking protocols. All raw or preprocessed data are available through the Image and Data Archive (IDA) at https://ida.loni.usc.edu/, upon approval of the application. In this real data application, we apply SIMP to study the genetic effects, including the effect of the well-known AD gene Apolipoprotein E (APOE) ε4, on the volumes of some cortical or sub-cortical structures in the brain, while also accounting for the effects of some other prognostic factors, and we compare with several other methods in terms of prediction performance. Following previous practices in analyzing ADNI data [68, 46], we focus on Caucasian participants in the ADNI1 phase to reduce population stratification.

Data Pre-processing Procedure
The imaging responses, i.e., Y in (3.2), are the volume measurements of the 12 ROIs listed in [42] (the Amygdala, the Cerebral cortex, the Cerebral white matter, the Hippocampus, the Inferior lateral ventricle and the Lateral ventricle, on the two hemispheres), obtained from the 1.5T MRI scan at the screening visit of ADNI1. All imaging data were downloaded from IDA and have already been pre-processed by FreeSurfer [21]. Among the 845 participants who took MRI scans, we retain only the 593 participants who passed the quality control (QC). The log transformation is applied to reduce skewness. 620,901 SNPs on all autosomes and sex chromosomes were genotyped for 757 participants in ADNI1, who are not a subset of, but overlap with, the 845 participants who have MRI scans. Among the 11,632 SNPs meta-analyzed by the International Genomics of Alzheimer's Project (IGAP) [36], 852 autosomal SNPs are selected as our candidate SNPs for model fitting by checking two conditions: (1) genotyped by ADNI1; (2) p-value from IGAP smaller than 0.01. Developed by Shaun Purcell and Christopher Chang, PLINK [7] is open source software for performing QC and some routine GWAS analyses in an efficient manner, and is available at https://www.cog-genomics.org/plink/1.9/. Following [68], we perform two lines of QC on the selected SNP data with PLINK. The first line of QC includes (1) call rate check per subject and per SNP marker, (2) gender check, (3) sibling pair identification, (4) the Hardy-Weinberg equilibrium test, and (5) population stratification check. The second line of QC removes SNPs with (1) more than 5% missing values, (2) minor allele frequency smaller than 10%, and (3) Hardy-Weinberg equilibrium p-value < 10 −6 . This leaves us with 732 SNPs that passed QC. IMPUTE2 ([27], [28]), a free genotype imputation and haplotype phasing program, is used to impute SNPs with missingness. We choose HapMap 3 as the reference panel for the imputation, since it shares the same genome
build b36 of the National Center for Biotechnology Information (NCBI) with ADNI1. We delete 39 SNPs that do not appear in the reference panel, which leaves 693 SNPs finally. To reduce dimensionality, principal component analysis (PCA) is applied to these 693 SNPs. By thresholding the eigenvalues of their sample covariance matrix at 1, the first 186 principal components (PCs), which explain 87.19% of the total variation of the 693 SNP predictors, are selected as X 1C , the continuous part of our predictors of main interest in (3.2). APOE ε4, taking values in {0, 1, 2}, is included in X 1D as the only discrete (quantitative) genetic predictor in (3.2). We incorporate six other important prognostic factors in X 2 in (3.2): gender, marital status, handedness, age, years of education and intracranial volume (ICV). The intersection of participants across all the above-mentioned datasets finally leads to 498 samples in our analysis. All predictors and responses are standardized before the model fitting.
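The eigenvalue-thresholded PC selection described above can be sketched as follows. The function name and the toy genotype matrix are ours; the sketch assumes standardized predictors, so the eigenvalues are those of the sample correlation matrix and the threshold of 1 is the usual Kaiser-style rule.

```python
import numpy as np

def select_pcs(G, threshold=1.0):
    # G: n x p standardized genotype matrix. Keep the PCs whose sample
    # covariance eigenvalues exceed `threshold`, and report the share of
    # total variation that the retained PCs explain.
    G = (G - G.mean(0)) / G.std(0, ddof=1)
    evals, evecs = np.linalg.eigh(np.cov(G, rowvar=False))
    order = np.argsort(evals)[::-1]              # eigh returns ascending order
    evals, evecs = evals[order], evecs[:, order]
    keep = evals > threshold
    explained = evals[keep].sum() / evals.sum()
    return G @ evecs[:, keep], explained         # PC scores (X_1C), variance share

# Toy genotype-like matrix: two tightly correlated column pairs.
rng = np.random.default_rng(7)
z = rng.standard_normal((100, 2))
G = np.hstack([z, z + 0.05 * rng.standard_normal((100, 2))])
scores, explained = select_pcs(G)
```

In the toy example each correlated pair collapses onto one dominant PC, so two components carry nearly all of the standardized variation, mirroring how 186 PCs capture 87.19% of the variation of the 693 SNPs.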

Prediction Performance
We first compare the prediction performance of SIMP with two other partial envelope methods (PX-env, FPY-env), three envelope models that cannot account for the partial structures (FX-env, FY-env, FS-env; we call them non-partial envelope methods), and three popular multivariate linear regression methods that are not based on envelope models (FOLS, RRR and PLSR; we call them non-envelope methods). Similar to Section 7.2, for all methods except the partial envelope methods (PX-env, FPY-env and SIMP), Y is first adjusted as the residuals from the regression on X 2 , and the model fitting is done with the genetic markers X 1 and the adjusted responses. For the best prediction performance, all associated envelope dimensions for the envelope methods, the number of components for PLSR, and the rank for RRR are selected by minimizing the MSPE from 5-fold CV. The MCMC algorithm for SIMP and PX-env is run for 20,000 iterations with the first 50% of iterations as burn-in, and their specifications for the hyperparameters are the same as those in Section 7.
Table 7 reveals that the best prediction performance is achieved by SIMP. All of the envelope methods in the last two columns significantly outperform FOLS and RRR. FS-env and SIMP achieve the best prediction performance among the non-partial envelope methods and the partial envelope methods we considered, respectively. This illustrates the advantage of combining the predictor and response envelopes in both the non-partial and partial envelope contexts. Comparing partial envelope methods with their non-partial counterparts, PX-env and SIMP obtain lower MSPEs than FX-env and FS-env, respectively. The finding that FPY-env cannot outperform FY-env seems surprising, but is consistent with the previous theoretical result that if the envelope spaces of FPY-env and FY-env are equal, then the asymptotic variance of the estimator of β 1 from FPY-env cannot be smaller than that from FY-env (see Proposition 2 of [57]); we have verified that the envelope spaces estimated by FPY-env and FY-env are very close in this analysis. It is also interesting to note that FX-env shares a similar MSPE with PLSR; their connection has been discussed in [12].

Posterior Estimation and Selection Performance
Both AIC-MCMC and BIC-MCMC select d X = 87 and d Y = 1 for SIMP. Using d X = 87 and d Y = 1, SIMP is fitted with the number of iterations, burn-in proportion and specification of the hyperparameters the same as those in Sections 7 and 8.3. A run of 20,000 iterations under these envelope dimensions takes 14.82 hours on an Intel Xeon E5-2680 v3 processor (2.50 GHz CPU).
Aiming to interpret our results with some scientifically meaningful findings, and given the large variation explained by the PCs we choose, we approximate the posterior samples of β SNP ∈ R 693×12 , the regression coefficients between the standardized version of the original 693 SNP predictors (i.e., the 693 SNP predictors before PCA) and the 12 IPs. For s = 1, 2, . . ., 10,000, the s-th approximate posterior sample of β SNP is obtained from the s-th retained posterior sample of β 1C . The upper-left panel of Figure 3 shows the significance of a few SNPs and APOE ε4 for all 12 IPs. Besides the well-known AD gene APOE ε4, we identify another 37 significant SNPs that are possibly related to AD (see Appendix H.3 for the full list) under control of the family-wise error rate (FWER, the probability of reporting at least one false positive) at 0.05. Among them, 17 SNPs are reported in Table 8 for their appearance in previous studies. Meanwhile, under the same control of the FWER, the key role of age for AD and the impact of ICV on the volumes of brain structures are confirmed by the upper-right panel of Figure 3. The significance of the SNPs appears consistent across all IPs, but this is not observed for the prognostic factors. This is expected, since the rows of β T 1C (and also of β T SNP , the posterior mean estimator of β T SNP ) depend on the estimated partial response envelope only, which happens to contribute much weaker effects than the estimated partial predictor envelope in this dataset. Such a pattern is not observed in β T 2 , since we have not imposed any envelope structure on β 2 .
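The significance calls above can be sketched as follows. This sketch assumes (our reading of the text) that each posterior draw of β SNP is obtained by mapping the draw of β 1C back through the PCA loading matrix, e.g. `beta_snp = L @ beta_1c` with L the 693 × 186 loadings, and it uses Bonferroni-adjusted credible intervals as one simple way to control the FWER at level α; the paper's exact multiplicity adjustment may differ.

```python
import numpy as np

def flag_significant(beta_samples, alpha=0.05):
    # beta_samples: (S, p, r) posterior draws of a coefficient matrix.
    # Bonferroni-adjust the credible level over all p*r coefficients so the
    # family-wise error rate (probability of at least one false positive)
    # is controlled at alpha; flag coefficients whose interval excludes 0.
    S, p, r = beta_samples.shape
    a = alpha / (p * r)
    lo = np.quantile(beta_samples, a / 2, axis=0)
    hi = np.quantile(beta_samples, 1 - a / 2, axis=0)
    return (lo > 0) | (hi < 0)

# Toy check: one coefficient far from zero, one centered at zero.
rng = np.random.default_rng(0)
beta_samples = np.empty((4000, 1, 2))
beta_samples[:, 0, 0] = rng.normal(5.0, 0.1, 4000)   # clearly nonzero
beta_samples[:, 0, 1] = rng.normal(0.0, 1.0, 4000)   # centered at zero
sig = flag_significant(beta_samples)
```

Only the first toy coefficient is flagged; the interval of the second straddles zero at the adjusted level.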
The heatmap of the estimated regression coefficient matrix β T SNP is shown in the bottom panel of Figure 3. The elements of β 1D are much larger in scale than those of β T SNP and hence are not displayed in this figure; they are listed here instead (Amygdala.L: −0.19, Amygdala.R: −0.19, Cerebral cortex.L: −0.12, Cerebral cortex.R: −0.13, Cerebral white matter.L: −0.11, Cerebral white matter.R: −0.11, Hippocampus.L: −0.22, Hippocampus.R: −0.22, Inferior lateral ventricle.L: 0.18, Inferior lateral ventricle.R: 0.19, Lateral ventricle.L: 0.14, Lateral ventricle.R: 0.15). It is interesting that for each genetic predictor, the estimated coefficients for each pair of brain measures on the two hemispheres are close. This is reasonable, since the bilateral correlations within all pairs of brain measures over the two hemispheres are strong (see Appendix H.4 for the numerical evidence and [52] for another discussion of this bilateral correlation). The ROIs that have relatively large estimated effects with some SNPs are the Inferior lateral ventricle, the Hippocampus and the Amygdala (in both hemispheres). The Hippocampus and the Amygdala are related to memory and motor behavior respectively, and the importance of these three brain structures for AD has already been verified in past studies [20, 47, 43].

Shrinkage Estimation by Incorporating the Prior Information of Weak Imaging Genetics Relationship
Past studies on AD have verified the weak effects of the SNP predictors in predicting AD outcomes, either the disease status [36] or the imaging phenotypes ([68, 46], etc.). This pattern of weak effects is also seen in the bottom panel of Figure 3 in Section 8.4. Therefore, it is reasonable to incorporate the prior information of "weak signals in β 1C " (and hence "weak signals in β SNP ") into our analysis. Doing so should improve the estimation performance through the shrinkage effect it offers for this high dimensional problem, and helps illustrate the advantage of our Bayesian envelope method in incorporating prior information, compared with the frequentist envelope methods. The envelope dimensions are fixed at d X = 87 and d Y = 1. We adjust the (all equal) diagonal elements of the hyperparameter E (set to be a diagonal matrix) in the prior of η C from 10 −6 to 10 6 , while keeping the specification of the other hyperparameters the same as in previous sections. This adjustment gradually strengthens our prior belief that η C (and hence β 1C ) is a zero matrix, since the prior mean of η C is fixed at the zero matrix by assigning W to be the zero matrix, and the prior covariance matrices of the rows of η C are proportional to E −1 . From the frequentist perspective, the Bayesian estimator (strictly speaking, the MAP estimator, but here we use the posterior mean estimator for simplicity) with increasingly stronger prior belief in the weak signals of η C corresponds to an estimator with gradually stronger shrinkage toward the zero matrix.
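The frequentist reading of this adjustment can be illustrated with the simplest conjugate case. This is a sketch under a simplified model of our own (zero-mean Gaussian prior with scalar precision on the coefficients, unit error variance), not the SIMP prior itself: the posterior mean is then exactly a ridge estimator, and raising the prior precision, the analogue of raising the diagonal of E, shrinks it toward zero.

```python
import numpy as np

def posterior_mean_coef(X, y, prior_prec):
    # Zero-mean Gaussian prior with precision prior_prec * I on the
    # coefficients and unit error variance: the posterior mean is the ridge
    # solution (X'X + prior_prec I)^{-1} X'y, shrinking toward zero as
    # prior_prec grows.
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + prior_prec * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.standard_normal((150, 8))
y = X @ rng.standard_normal(8) + rng.standard_normal(150)
# Sweep the prior precision over several orders of magnitude, as with E:
norms = [np.linalg.norm(posterior_mean_coef(X, y, 10.0 ** k)) for k in (-6, 0, 3, 6)]
```

The norm of the estimate decreases monotonically along the sweep, which is the one-parameter analogue of the regularization paths in Figure 4.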
The left panel of Figure 4 shows improved prediction performance as the diagonal elements of E increase from 10 −6 to 10 3 ; the MSPE deteriorates slightly as the prior belief strengthens further beyond 10 3 , possibly due to the bias caused by over-strong prior information of weak signals, or over-penalization from the frequentist perspective. The middle and right panels of Figure 4 display the regularization paths for the row sums and the column sums of β SNP respectively. More significant shrinkage effects are observed for stronger prior belief in the weak signals in β 1C (and hence in β SNP ).

Figure 4: MSPE from the 5-fold CV (left), the row sums of β SNP (middle; each line corresponds to one SNP) and the column sums of β SNP (right; each line corresponds to one IP) with respect to the diagonal elements of E (on the log 10 scale).

DISCUSSION
In this paper, we propose a new unified Bayesian envelope model by integrating the partial predictor envelope and the partial response envelope within a convenient Bayesian framework. The proposed model degenerates to several well established envelope models in the literature under specific conditions, and it addresses the limitations noted for the simultaneous and the partial response envelopes. Specifically, our method improves the efficiency of estimating β 1 and predicting Y in (3.2), and places no restriction that X 1 be continuous or Normal, through a subtle construction that separates the discrete part from the predictors of interest. Compared with the frequentist envelope approaches, our method is more flexible in incorporating prior information and quantifies uncertainty through the posterior distribution of the parameters. Overall, our method is inclusive: it could be regarded as a building block for future theoretical research on the envelope model, and as a practical choice for practitioners who seek dimension reduction and want to apply envelope methods to their applications, covering almost any problem that can be formulated as a multivariate linear regression whose predictors are continuous, discrete or a mix of the two, possibly with nuisance covariates, without worrying about which specific envelope model to use. Meanwhile, to the best of our knowledge, we are the first in the Bayesian envelope literature to investigate the performance of several popular dimension selection methods together.
To be less conservative and make more meaningful discoveries, a well-designed multiple comparisons procedure controlling the false discovery rate, rather than the FWER, could be considered in our real data application. A natural extension of our work is to generalize SIMP to the generalized linear model setting. However, for the multivariate probit model, as pointed out in [9], an identification issue exists for the error covariance matrix, which is suggested to be restricted to a correlation matrix. Hence, the extension of SIMP to the generalized linear model setting might require some special strategies, for example the parameter expansion for data augmentation technique ([40], [59]) under a marginally uniform prior for the error correlation matrix [1].

APPENDIX A A.1 Metropolis-within-Gibbs MCMC Algorithm for SIMP
In this section, we show how to generate an MCMC chain of length S on Θ from SIMP, for a pre-specified pair of dimensions (d X , d Y ) ∈ {1, . . ., p C − 1} × {1, . . ., r − 1}. The initial value Θ (0) can be chosen arbitrarily, but the warm start initial estimator provided in Appendix B is recommended for faster convergence. For each s = 1, . . ., S, we iterate the following steps to update Θ.

Update
10. Update Φ 0 (s) from its Inverse-Wishart full conditional IW( · , w̃ Y0 ), where w̃ Y0 = n + p 2 + w Y0 .

11. Let h (s) (A) be the log full conditional density of A at the s-th iteration. We want to generate a Markov chain realization for A from the stationary density proportional to exp(h (s) (A)). We update A (s−1) to A (s) column-wise via Metropolis steps as follows. Let τ A > 0 be a given tuning parameter and let {i 1 , . . ., i d X } be a random permutation of {1, . . ., d X }.
(a) For k = i 1 , . . ., i d X in turn, propose a new k-th column A k * by perturbing A k (s−1) with the tuning parameter τ A . (b) Compute the Metropolis acceptance probability r from the log full conditional, where A * is the resulting matrix after replacing the k-th column of A (s−1) by A k * . (c) Generate a binary indicator with success probability r. If a success is achieved, update A k (s−1) to A k * as the k-th column of A (s) ; otherwise, retain A k (s−1) as the k-th column of A (s) . Update A (s−1) to A (s) after all columns are investigated. 12. Let h (s) (B) be the log full conditional density of B at the s-th iteration. The procedure to generate a Markov chain realization for B from the stationary density proportional to exp(h (s) (B)) is similar to that for A in Step 11. (When a corresponding envelope dimension is degenerate, e.g. when Φ 0 (s) or R (s) does not exist, any terms involving R (s) or Φ 0 (s) are omitted from the updates.) In our actual implementation of this Metropolis-within-Gibbs algorithm, once any parameter is updated, it is immediately used for updating the other parameters within the same iteration. Even when we are updating A and B, the previously updated columns are used immediately to update the remaining columns. Remark 3. The updating of each column of A and B in Steps 11 and 12 is implemented in a random order as illustrated; it could be in a deterministic order as well. Meanwhile, the order of updating the different parameter blocks within an iteration could also be random without affecting convergence.
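The column-wise Metropolis sweep of Steps 11-12 can be sketched generically. This is a schematic of our own: the Gaussian random-walk proposal and the toy standard-normal target stand in for the actual full conditional h (s) (·), and the function names are hypothetical.

```python
import numpy as np

def update_columns(A, log_dens, tau, rng):
    # One sweep of column-wise random-walk Metropolis updates for a matrix
    # parameter A, visiting the columns in a random order. log_dens(A) is
    # the log full conditional density of A up to an additive constant.
    A = A.copy()
    for k in rng.permutation(A.shape[1]):
        prop = A.copy()
        prop[:, k] = A[:, k] + tau * rng.standard_normal(A.shape[0])
        log_r = log_dens(prop) - log_dens(A)     # symmetric proposal
        if np.log(rng.uniform()) < log_r:        # accept w.p. min(1, r)
            A = prop                             # accepted column reused at once
    return A

# Toy target: independent standard normal entries of a 3 x 2 matrix.
rng = np.random.default_rng(42)
log_dens = lambda M: -0.5 * np.sum(M ** 2)
A = np.zeros((3, 2))
draws = []
for _ in range(4000):
    A = update_columns(A, log_dens, tau=1.0, rng=rng)
    draws.append(A.copy())
```

Because an accepted column is written back into `A` before the next column is visited, the sweep reuses updated columns immediately, matching the implementation note in the text.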

APPENDIX B B.1 Initial Estimator for MCMC Algorithm of SIMP
In this section, we propose a warm start initial estimator for all parameters in SIMP, including μ 1C , μ Y , β 2 , γ, η C , η D , A, B, Ω, Ω 0 , Φ and Φ 0 , if all exist (i.e., 0 < d X < p C , 0 < d Y < r and r, p C , p D , p 2 > 0). It will be used as the initial value Θ (0) in our implementation of the MCMC algorithm for SIMP, for faster convergence. Given the data matrices X 1C , X 1D , X 2 and Y, and any valid envelope dimensions d X , d Y as input, we implement the following steps.
Y ) T . 3. Perform the frequentist ordinary least squares regression of X 1C on X 1D , and obtain γ (0) and Σ (0) C|D as the estimates of the regression coefficient matrix and the residual covariance matrix, respectively. Regress Y on X 1D and X 2 . Regress X 1C on X 1D and X 2 .
The estimates L(A (0) ) and L 0 (A (0) ) can be immediately obtained from this regression. Then, Ω (0) and Ω (0) 0 can be calculated as . . . L 0 (A (0) ), and A (0) can be computed from L(A (0) ) by the idea in Section 3. . . . −3, 5) or p D = 4 respectively. We call these two data generating mechanisms (M5) and (M6). As Table C.1 reveals, the MSEs for both β 1C and β 1D from our method and our competitors remain roughly the same or become slightly inflated as p 2 increases from 2 to 4, with few exceptions. Table C.2 suggests that the MSE for β 1D grows linearly with respect to p D , while the MSE for β 1C remains almost the same for our method and most of our competitors, with few exceptions. Therefore, SIMP is observed to show more advantageous numerical performance with increased p D in our limited simulation studies. Table C.3 summarizes the performance of FPY-env, PX-env, PY-env and SIMP with the true envelope dimensions known under (M1)-(M3), (M5) and (M6), and the performance of RRR with the true rank known under (M4). With the true envelope dimensions known, SIMP almost always obtains the best estimation performance under all scenarios except the model mis-specification case (M4).
Table C.4 shows the performance of the three Bayesian envelope estimators (PX-env, PY-env and SIMP) of β 1C and β 1D under the data generating mechanism (M1), except that now the posterior median rather than the posterior mean is chosen as the point estimator. Comparing the results in this table with Table 3, we empirically find no observable difference in performance between the posterior mean and the posterior median as the point estimator under (M1).

APPENDIX D D.1 Theoretical Proofs
Proof of Theorem 1. The integrability of the posterior density follows immediately from the proof of Theorem 2.
Proof of Theorem 2. To prove the Harris ergodicity of the Markov chain generated by our Metropolis-within-Gibbs algorithm, we need to prove three properties, namely, (a) φ-irreducibility with respect to some measure φ, (b) aperiodicity, and (c) Harris recurrence. First, we prove φ-irreducibility, which requires that the Markov chain can reach any set of positive measure from any state in the state space. The rigorous definition is given in Definition 3.
Definition 3 (φ-irreducibility). For a Markov chain (X n ) ∞ n=1 , suppose that the state space is X with σ-algebra F, and the n-step transition kernel is κ n : X × F → [0, 1]. Given a nonzero measure φ, the Markov chain is called φ-irreducible if, for every state x ∈ X and every subset A ∈ F with φ(A) > 0, there exists an integer n such that κ n (x, A) > 0.
To prove the φ-irreducibility of the Markov chain generated by our Metropolis-within-Gibbs algorithm, we just need to check that the proposal densities and the acceptance probabilities for updating all parameter blocks are positive; then φ-irreducibility follows under a 1-step transition (take n = 1). For a 1-dimensional MCMC chain, this is obvious as long as the proposal density q(x, y) > 0 and the acceptance probability a(x, y) > 0 always hold. To extend this conclusion to the multi-dimensional case, the technique of induction can be applied; see the details in Section 4.1.8 of [22]. Among the 12 parameter blocks {μ 1C , μ Y , β 2 , γ, Ω, Ω 0 , Φ, Φ 0 , A, B, η C , η D }, if all exist, the updating of A and B requires Metropolis steps, while the remaining parameter blocks only need ordinary Gibbs steps by conjugacy. Since the exponential function and the vector Normal density function are always positive, the acceptance probabilities and the proposal densities of the Metropolis steps, if present, are always positive. For the Gibbs steps, the proposal distributions are Matrix normal and Inverse-Wishart distributions, both with everywhere-positive densities, and the acceptance probabilities are always 1, which is positive as well. This completes the proof of φ-irreducibility.
Next, we show aperiodicity, whose definition for a φ-irreducible Markov chain is given in Definition 4.

Definition 4 (Aperiodicity). For a φ-irreducible Markov chain (X_n)_{n=1}^∞, suppose that the state space is X with σ-algebra F, the (1-step) transition kernel is κ_1 : X × F → [0, 1], and the stationary probability distribution is π(·). The period of (X_n)_{n=1}^∞ is defined as the largest positive integer T for which there exist disjoint subsets A_1, ..., A_T ∈ F with π(A_1) > 0 such that κ_1(x, A_{i+1}) = 1 for all x ∈ A_i, i = 1, ..., T (with A_{T+1} := A_1). The chain is aperiodic if its period equals 1.

Aperiodicity is immediate here, since every set with nonzero probability (under the stationary probability distribution π) in the σ-algebra of the state space can be reached from any point in the state space through a one-step transition of the Markov chain.
Lastly, we prove Harris recurrence. The definition of Harris recurrence for a φ-irreducible Markov chain is given in Definition 5.

Definition 5 (Harris recurrence). For a φ-irreducible Markov chain (X_n)_{n=1}^∞, suppose that the state space is X with σ-algebra F, and the stationary probability distribution is π(·). The Markov chain is called Harris recurrent if, for all A ⊆ X with π(A) > 0 and any starting point x ∈ X, the conditional probability that the chain visits A infinitely often given X_1 = x equals 1.

When d_X ∈ {0, p_C} and d_Y ∈ {0, r}, A and B do not exist, and the Metropolis-within-Gibbs algorithm reduces to a standard Gibbs sampler; Harris recurrence is then directly implied by φ-irreducibility ([48], Corollary 13). When either 0 < d_X < p_C or 0 < d_Y < r, there exists a Metropolis step for either A or B, and Harris recurrence can be obtained by proving the Lebesgue integrability of the posterior density over every combination of parameter blocks ([48], Corollary 19). For simplicity, we assume 0 < d_X < p_C and 0 < d_Y < r, so we only need to prove that the posterior density is Lebesgue integrable. The argument bounds the posterior density term by term: inside the trace function is a product of two positive semi-definite matrices, so the trace is nonnegative and the corresponding exponential factor is bounded above by 1; the remaining factors are bounded by constants such as c_1 = 1 + (2π/n)^{r/2}, and the resulting bounding functions are shown to be integrable one parameter block at a time.

Definition 2 ([15], Definition 2.1). Let a matrix M ∈ S^{a×a}_+ and a subspace S ⊆ span(M). The M-envelope of S is the intersection of all reducing subspaces of M that contain S, and is denoted by E_M(S).

A re-parameterization of the corresponding semi-orthogonal basis matrix by an unconstrained matrix B ∈ R^{(r−d_Y)×d_Y} can be similarly constructed.
, and correspondingly rewrite (3.2), in preparation for the Bayesian inference in Section 4. Let β_1C = UDV^T be the singular value decomposition, where U ∈ R^{p_C×d} and V ∈ R^{r×d} are semi-orthogonal matrices and D = diag(λ_1, ..., λ_d), with λ_i ≥ 0, i = 1, ..., d, being the d singular values of β_1C, where d = rank(β_1C). Since L = span(β_1C) = span(U) and R = span(β^T_1) = span(β^T_1C, β^T_1D) = span(V, β^T_1D), according to the definitions of E_C|D and E_Y|1 by Definition 2 at the start of Section 3, U = LO and V = RP for some O ∈ R^{d_X×d} and P ∈ R^{d_Y×d}. Hence the coordinate form of β_1C is β_1C = UDV^T = (LO)D(RP)^T = L(ODP^T)R^T. Then, for fixed envelope dimensions d_X and d_Y, the log-likelihood function for SIMP (3.7)-(3.10) follows.
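As a quick numeric check of this decomposition (with hypothetical sizes standing in for p_C, r and d), the thin SVD recovers semi-orthogonal U and V and the singular values exactly as described:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes standing in for p_C, r and d = rank(beta_1C).
p_C, r, d = 6, 4, 2

# Build a rank-d coefficient matrix beta_1C.
beta_1C = rng.standard_normal((p_C, d)) @ rng.standard_normal((d, r))

# Thin SVD: U (p_C x d) and V (r x d) are semi-orthogonal, and D holds
# the d positive singular values, as in the re-parameterization.
U_full, s, Vt_full = np.linalg.svd(beta_1C, full_matrices=False)
U, D, V = U_full[:, :d], np.diag(s[:d]), Vt_full[:d, :].T

assert np.allclose(U.T @ U, np.eye(d))     # semi-orthogonality of U
assert np.allclose(V.T @ V, np.eye(d))     # semi-orthogonality of V
assert np.allclose(U @ D @ V.T, beta_1C)   # beta_1C = U D V^T
```

Once basis matrices L and R with U = LO and V = RP are available, the same identity yields the coordinate form β_1C = L(ODP^T)R^T.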

Theorem 2. Whenever 0 ≤ d_X ≤ p_C and 0 ≤ d_Y ≤ r, the Markov chain generated by the Metropolis-within-Gibbs algorithm for the posterior sampling of Θ in SIMP, as detailed in Appendix A, is Harris ergodic, i.e., (a) φ-irreducible with respect to some measure φ, (b) aperiodic, and (c) Harris recurrent.

Figure 1: Left: selection frequencies of d_X and d_Y (Rows 1 and 2) for five criteria, under different settings of the true (d_X, d_Y) (Columns 1-4); colors represent the selection frequency. Right: MSE and estimated variances (Var) of β_1 from SIMP across all possible choices of d_Y (Row 1) or d_X (Row 2), with d_X or d_Y fixed at 2 (the true value), respectively. The horizontal blue dotted line at the bottom of the right panel shows the MSE of β_1 from the frequentist partial response envelope with envelope dimension 2. The results in both panels are based on sample size n = 300 over 500 repetitions; results with n = 150 and n = 500 are similar.

Figure 3: Upper left: indicator of significance of each (SNP, IP) pair, or of APOE ε4 with each IP; APOE ε4 corresponds to the last vertical line, in red. Upper right: indicator of significance of each (prognostic factor, IP) pair. Red signifies significance in both upper panels. Bottom: heatmap of the estimated regression coefficient matrix β^T_SNP between the 693 SNPs and the 12 IPs.

where diag{S_SNP} ∈ S^{693×693}_+ and diag{S_1C} ∈ S^{186×186}_+ are diagonal matrices whose diagonal elements are the sample variances of the original 693 SNPs and of the 186 PCs, respectively, and LD ∈ R^{693×186} is the loading matrix from PCA. The corresponding (1 − 5.95 × 10^{−6}) × 100% two-sided credible intervals for β_SNP, β_1D and β_2 are calculated, where the level is determined by 0.95 with the Bonferroni correction for all 8,400 regression coefficients. Significance of any SNP, APOE ε4 or prognostic factor with any one of the 12 IPs is determined by the exclusion of zero from the associated credible interval.
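The Bonferroni-corrected credible level used here can be sketched in a few lines; the posterior draws below are simulated stand-ins for an entry of β_SNP, not output of the actual sampler:

```python
import numpy as np

# Bonferroni correction for m = 8,400 regression coefficients at an
# overall 0.95 level: each interval uses level 1 - 0.05/8400.
m, alpha = 8400, 0.05
level = 1 - alpha / m                      # 1 - alpha/m, about 1 - 5.95e-06
assert abs((1 - level) - 5.95e-6) < 1e-8   # matches the level quoted above

lo_q, hi_q = (1 - level) / 2, 1 - (1 - level) / 2

# Hypothetical posterior draws for one coefficient; the credible
# interval is the pair of empirical quantiles, and "significant" means
# the interval excludes zero.
rng = np.random.default_rng(2)
draws = rng.normal(loc=0.3, scale=0.1, size=50_000)
lo, hi = np.quantile(draws, [lo_q, hi_q])
significant = (lo > 0) or (hi < 0)
```

With such an extreme per-coefficient level, the empirical quantiles sit in the far tails of the MCMC sample, so a long chain is needed for the interval endpoints to be stable.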

Remark 1. Note that when d_X = 0 (or d_X = p_C), steps 5, 7 and 11 (or steps 8 and 11) should be skipped, L^(s) does not exist and L^(s)_0 = I_{p_C} always (or L^(s) = I_{p_C} always and L^(s)_0 does not exist), and any terms involving L^(s), Ω^(s) or η^(s)_A are omitted. Similarly, when d_Y = 0 (or d_Y = r), steps 5, 6, 9 and 12 (or steps 10 and 12) should be skipped, R^(s) does not exist and R^(s)_0 = I_r always (or R^(s) = I_r always and R^(s)_0 does not exist), and the corresponding terms are omitted.

4. Perform the frequentist partial response envelope model with dimension d_Y (via the function penv() in the R package Renvlp) on the regression of Y on X_1C, X_1D and X_2, where we treat X_1C and X_1D together as our predictors of interest. From this fit, β^(0), R(B^(0)), R_0(B^(0)), Φ^(0) and Φ^(0)_0 can be directly obtained. Compute B^(0) from R(B^(0)) by the re-parameterization trick illustrated in Section 3.2. Estimate η^(0)_D = (β^(0)_1D R(B^(0)))^T.

5. Perform the frequentist predictor envelope model with dimension d_X (via the function xenv() in the R package Renvlp) on the regression of Y* on X*_1C, where Y* and X*_1C are the residual matrices from two frequentist ordinary least squares regressions, respectively, and obtain the remaining initializations similarly. Estimate η^(0)_C = R(B^(0))^T (β^(0)_1C)^T L(A^(0)).

Note that when d_X = 0 (or d_X = p_C), Ω, A and η_C (or Ω_0 and A) do not exist; when d_Y = 0 (or d_Y = r), Φ, B, η_C and η_D (or Φ_0 and B) do not exist. Therefore, it is not necessary to initialize them in the corresponding cases.

Tables C.1 and C.2 show the parameter estimation performance for β_1C and β_1D under exactly the same setting as (M1) in Section 7.2, except that now p_2 = 4 with μ_2 = (2, −1,

Theorem 2 under the other two cases, (1) d_X ∈ {0, p_C} and 0 < d_Y < r and (2) 0 < d_X < p_C and d_Y ∈ {0, r}, can be proved similarly with simple modifications of the following proof. Theorem 1 then follows immediately by letting k take its maximal value 12, i.e., the Lebesgue integrability
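The residualization in step 5 can be sketched as follows. The data and the choice of adjusting covariates are hypothetical stand-ins (the text specifies the exact pair of OLS regressions); here we assume, for illustration only, that both Y and X_1C are adjusted for a generic block of remaining predictors X_rest:

```python
import numpy as np

rng = np.random.default_rng(6)

def ols_residuals(Y, X):
    """Residual matrix of a multivariate OLS regression of Y on X
    (an intercept column is appended before fitting)."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    return Y - X1 @ coef

# Hypothetical data standing in for Y, X_1C and the other predictors.
n, r, p_C, p_rest = 100, 3, 4, 2
Y = rng.standard_normal((n, r))
X_1C = rng.standard_normal((n, p_C))
X_rest = rng.standard_normal((n, p_rest))

# Y* and X*_1C: residuals after removing the adjusting covariates'
# linear effects, ready to be passed to the predictor envelope fit.
Y_star = ols_residuals(Y, X_rest)
X_1C_star = ols_residuals(X_1C, X_rest)
```

By construction the residual matrices are orthogonal to the columns of the adjusting design, which is what makes the subsequent envelope fit on (Y*, X*_1C) a fit on the "partialled-out" variation.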

Figure G.1 shows the convergence of the proposed MCMC algorithm through trace plots, autocorrelation plots and the evolution of the Gelman-Rubin shrink factor for a randomly selected and representative element of each of β_1C, β_1D, β_2, Σ_C|D and Σ_Y|X from SIMP, under the data generating mechanism (M1) of Section 7.2, with r = 15 and n = 300 (the most challenging case). All of these plots indicate convergence and weak autocorrelation between MCMC samples. Figure G.2 displays the empirical posterior density plots of a randomly selected and representative element of each of β_1C and β_1D under the same setting. These (estimated) posterior distributions are bell-shaped, and this shape is very common among the (estimated) posterior distributions of all elements of β_1C and β_1D.
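The Gelman-Rubin point estimate referenced here can be computed with a few lines. This is a generic implementation of the classic between/within-variance formula (the same point estimate reported by, e.g., the R coda package), not the exact code behind the figures:

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor (PSRF) for one scalar parameter.

    `chains` is an (m, n) array holding m parallel chains of length n.
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)            # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()      # within-chain variance
    var_hat = (n - 1) / n * W + B / n          # pooled variance estimate
    return np.sqrt(var_hat / W)

# Well-mixed chains drawn from the same distribution give PSRF near 1.
rng = np.random.default_rng(3)
mixed = rng.standard_normal((4, 5000))
psrf = gelman_rubin(mixed)
```

Values close to 1 indicate that the between-chain variability is negligible relative to the within-chain variability, which is the criterion read off the right-hand panels of Figure G.1.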


Figure G.1: The trace plots (left), the autocorrelation plots (middle; based only on Chain 1) and the evolution of the median and 97.5% upper confidence limit of the Gelman-Rubin shrink factor as the number of iterations increases (right; PSRF represents the point estimate, i.e., the potential scale reduction factor) for a randomly selected and representative element of each of β_1C, β_1D, β_2, Σ_C|D and Σ_Y|X from SIMP (rows 1 to 5, respectively), under the data generating mechanism (M1) of Section 7.2, with r = 15 and n = 300 (the most challenging case).

Figure G.2: The (marginal) empirical posterior density plots for a randomly selected and representative element of each of β_1C and β_1D from SIMP, under the data generating mechanism (M1) of Section 7.2, with r = 15 and n = 300. The density plots under the other simulated settings are similar.

Figure H.1: Plots of all selected participants with their i-th and (i + 1)-th SNP PCs as the X and Y coordinates, respectively, for i = 1, 3, 5, 7, 9, 11 (upper left, upper middle, upper right, lower left, lower middle and lower right panels; the percentage of the total variation explained by each PC is given in parentheses), for the real data application. No population stratification is observed in any of the six plots.
Proposition 1 ([15], Proposition 2.1). E is a reducing subspace of Σ if and only if Σ can be decomposed as Σ = P_E Σ P_E + P_{E⊥} Σ P_{E⊥}, where P_E and P_{E⊥} are the projection matrices onto the subspaces E and E⊥.
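Proposition 1 can be verified numerically on a toy example; the subspace and covariance matrices below are hypothetical constructions for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

# E = span of the first two coordinate vectors of R^4; build a
# covariance matrix that is block-diagonal with respect to E and E-perp,
# so that E is a reducing subspace of Sigma.
G = np.zeros((4, 2)); G[:2, :] = np.eye(2)    # orthonormal basis of E
G0 = np.zeros((4, 2)); G0[2:, :] = np.eye(2)  # orthonormal basis of E-perp

def spd(k):
    # A random symmetric positive definite k x k matrix.
    A = rng.standard_normal((k, k))
    return A @ A.T + k * np.eye(k)

Sigma = G @ spd(2) @ G.T + G0 @ spd(2) @ G0.T

P = G @ G.T           # projection onto E (orthonormal basis => G G^T)
Q = np.eye(4) - P     # projection onto E-perp

# Proposition 1: E reduces Sigma iff Sigma = P Sigma P + Q Sigma Q.
assert np.allclose(Sigma, P @ Sigma @ P + Q @ Sigma @ Q)

# A generic covariance matrix fails the decomposition for this E.
Sigma_bad = spd(4)
assert not np.allclose(Sigma_bad, P @ Sigma_bad @ P + Q @ Sigma_bad @ Q)
```

The decomposition holding is exactly the statement that Σ has no cross-covariance between E and its orthogonal complement.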
Condition 1(a) assumes that the distribution of P_{S⊥_Y|X} Y does not depend on the value of X. Meanwhile, Condition 1(b) excludes the indirect effect of X on P_{S⊥_Y|X} Y via P_{S_Y|X} Y, by assuming that P_{S_Y|X} Y and P_{S⊥_Y|X} Y are conditionally independent given X. Under Condition 1 with a subspace S_Y|X, P_{S_Y|X} Y carries all the material information, and possibly some extra immaterial information, for the regression. Instead of using an arbitrary satisfying subspace S_Y|X, the definition of the response envelope model ensures the uniqueness of E_Y|X by taking it to be the smallest one, i.e., the intersection of all satisfying subspaces. At the same time, by using E_Y|X rather than an arbitrary S_Y|X, immaterial information is guaranteed not to be contained in P_{E_Y|X} Y, and is removed to the largest degree. We therefore call P_{E_Y|X} Y the material part and P_{E⊥_Y|X} Y the immaterial part of Y, respectively. Although intuitive, the response envelope model is not easy to formulate under Condition 1. [15] derived the following Condition 2, which is equivalent to Condition 1: (a') span(β^T) ⊆ S_Y|X; (b') Σ_Y|X = P_{S_Y|X} Σ_Y|X P_{S_Y|X} + P_{S⊥_Y|X} Σ_Y|X P_{S⊥_Y|X}, i.e., S_Y|X is a reducing subspace of Σ_Y|X. That is, Condition 2 assumes that S_Y|X is a reducing subspace of Σ_Y|X that contains span(β^T). The response envelope model can thus be equivalently defined by Condition 2 with the subspace E_Y|X for (1.1), and recalling Definition 2, E_Y|X is exactly E_{Σ_Y|X}(span(β^T)), i.e., the Σ_Y|X-envelope of span(β^T). Therefore, the response envelope model can be defined directly by E_{Σ_Y|X}(span(β^T)) for (1.1). Suppose that the orthonormal basis matrices of E_Y|X and E⊥_Y|X are Γ and Γ_0, respectively. Then, according to Condition 2 with the subspace E_Y|X, the coordinate form of the response envelope model follows.

X_1D and X_2, replacing μ_1D and μ_2 with X̄_1D and X̄_2, respectively. Here μ_Y ∈ R^r, μ_1C ∈ R^{p_C}, μ_1D ∈ R^{p_D} and μ_2 ∈ R^{p_2} denote the unknown means of Y, X_1C, X_1D and X_2, respectively; β_1C ∈ R^{p_C×r}, β_1D ∈ R^{p_D×r} and β_2 ∈ R^{p_2×r} denote the unknown regression coefficients of X_1C, X_1D and X_2; and ε_Y|X is the random error vector. Specifically, combining β_1C and β_1D, β_1 = (β^T_1C, β^T_1D)^T denotes the coefficients of main interest to us. We assume X_1C to be stochastic, and X_1D and X_2 to be non-stochastic, with sample means X̄_1D and X̄_2.

Table 1. Relationship between SIMP (3.7)-(3.10) and several other envelope models. The meanings of r, p_C, p_D, p_1 and p_2 are introduced in (3.1), and d_X and d_Y are the dimensions of E_C|D and E_Y|1, respectively. The effective number of parameters for each model is also listed. Note that for the response and the partial response envelopes, X_1D is only required to be non-stochastic and need not be discrete.

Table 2. Percentages out of 500 repetitions that the Bura-Cook estimator correctly identifies d (Column 3), and that each of the five methods correctly identifies d_X or d_Y individually or the (d_X, d_Y) pair (Columns 4-8). Among the five methods, the best one for selecting the (d_X, d_Y) pair under each scenario is highlighted in bold face.
under (M3). (M4) assumes the true model to be RRR and is designed to test the robustness of SIMP under model mis-specification. Results under all scenarios are displayed in Sections 7.2.2 to 7.2.5.
1, and their elements are all generated independently from Unif(−1, 1). Samples of X_1C are generated from N_{p_C}(μ_1C, Σ_C), and Σ_Y|X and Σ_C are single realizations of IW(I_r, r) and IW(I_{p_C}, p_C), respectively.
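A minimal sketch of this generating mechanism follows, with hypothetical small dimensions in place of the simulation's actual p_C and r; the inverse-Wishart draw is implemented by inverting a Wishart draw rather than calling a distribution library:

```python
import numpy as np

rng = np.random.default_rng(5)

def inv_wishart(dim, df, rng):
    # One draw from IW(I_dim, df): if W ~ Wishart(I, df) then
    # W^{-1} ~ IW(I, df); here W = Z Z^T with Z a dim x df standard
    # normal matrix, which is invertible almost surely for df >= dim.
    Z = rng.standard_normal((dim, df))
    return np.linalg.inv(Z @ Z.T)

# Hypothetical small sizes standing in for the simulation's p_C and r.
p_C, r, n = 5, 3, 300

Sigma_C = inv_wishart(p_C, p_C, rng)   # one realization of IW(I_pC, pC)
Sigma_YX = inv_wishart(r, r, rng)      # one realization of IW(I_r, r)

# Mean and coefficient entries drawn independently from Unif(-1, 1).
mu_1C = rng.uniform(-1, 1, size=p_C)
beta_1C = rng.uniform(-1, 1, size=(p_C, r))

# X_1C ~ N_{p_C}(mu_1C, Sigma_C); responses from the linear model with
# centered predictors and errors from N_r(0, Sigma_Y|X).
X_1C = rng.multivariate_normal(mu_1C, Sigma_C, size=n)
E = rng.multivariate_normal(np.zeros(r), Sigma_YX, size=n)
Y = (X_1C - mu_1C) @ beta_1C + E
```

The covariance matrices are drawn once and held fixed across the 500 repetitions, matching the description of "one realization" of each inverse-Wishart distribution.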

Table 3. MSE comparison between SIMP and 11 other competitors for estimating β_1C and β_1D over 500 repetitions under the data generating mechanism (M1). The lowest MSE under each combination of r and n is in bold face.

Table 4. MSE comparison between SIMP and 11 other competitors for estimating β_1C and β_1D over 500 repetitions under the data generating mechanism (M2). The lowest MSE under each combination of r and n is in bold face.

Table 5. MSE comparison between SIMP and 11 other competitors for estimating β_1C and β_1D over 500 repetitions under the data generating mechanism (M3). The lowest MSE under each combination of r and n is in bold face.

Table 6. MSE comparison between SIMP and 11 other competitors for estimating β_1C and β_1D over 500 repetitions under the data generating mechanism (M4). The lowest MSE under each combination of r and n among all methods and within the envelope methods are in bold face and blue, respectively.

Table 7. MSPE from 5-fold CV for various envelope and non-envelope methods. The envelope dimensions, numbers of components (ncomp) or ranks selected by 5-fold CV are shown in parentheses.

Table 8. Out of the 37 significant SNPs from SIMP (besides APOE ε4), the 17 SNPs with previously reported associations with AD in the literature (of the SNP itself, its RefSeq gene, or its closest RefSeq gene, with names indicated in Columns 2 and 5) are reported here. For each of these SNPs, one previous study on the reported association is selected from the literature and displayed in Columns 3 and 6.

Table C.1. MSE comparison between SIMP and 11 other estimators of β_1C and β_1D over 500 repetitions under the data generating mechanism (M5). The lowest MSE under each combination of r and n is in bold face.

Table C.2. MSE comparison between SIMP and 11 other estimators of β_1C and β_1D over 500 repetitions under the data generating mechanism (M6). The lowest MSE under each combination of r and n is in bold face.

Table C.3. MSE comparison between FPY-env, PX-env, PY-env and SIMP for estimating β_1C and β_1D over 500 repetitions, with the true envelope dimensions known, under data generating mechanisms (M1)-(M3), (M5) and (M6), and the MSE of the RRR estimator with the true rank known under data generating mechanism (M4). "*" in the method names means that the true envelope dimension/rank is used. The lowest MSE for each combination of r and n under each true data generating mechanism is highlighted in bold face.

Table C.4. MSE comparison for the three Bayesian envelope estimators (PX-env, PY-env and SIMP) on estimating β_1C and β_1D over 500 repetitions under the data generating mechanism (M1), except that the posterior median rather than the posterior mean is chosen as the point estimator. The lowest MSE under each combination of r and n is in bold face. The tiny differences between the results in this table and those in Table 3 are colored in red.

Table C.5. MSE comparison between SIMP and 11 other competitors for estimating β_1C and β_1D over 500 repetitions under the data generating mechanism (M4). Different from Table 6, this table displays the estimation performance of the 12 methods if we pretend that the true values of β_1C and β_1D are known, and the "optimal" tuning parameters or envelope dimensions are selected by minimizing the ℓ_2 distances between the estimates and the true values of β_1C and β_1D separately, instead of applying the model selection methods introduced in Section 7.2.1. The lowest MSE under each combination of r and n among all methods and within the envelope methods are in bold face and blue, respectively.