1 Introduction
Dimension reduction (DR) techniques have become increasingly important in data science as a preliminary step in processing large-scale biological datasets, such as single-cell RNA sequencing data. These techniques aid in tasks like data visualization, structural pattern discovery, and subsequent biological analyses. Within this broader context, Supervised Dimension Reduction (SDR, also known as sufficient dimension reduction) methodologies have gained significant attention [16, 37]. The study by [61] compares SDR techniques with unsupervised counterparts such as Principal Component Analysis (PCA), Spherelets [40], and Spherical Rotation Component Analysis [46], emphasizing the growing prominence and utility of SDR in contemporary data science.
Given paired observations $(x,y)\in {\mathbb{R}^{p}}\times \mathbb{R}$, where x consists of p covariates and y is the corresponding response or output variable, the common assumption in SDR is that
(1.1)
\[ y=\varphi \big({V^{\top }}x,\epsilon \big),\hspace{2.5pt}\text{for some function}\hspace{2.5pt}\varphi ,\]
where $V\in {\mathbb{R}^{p\times d}}$ with $d\ll p$ is the projection matrix from the high-dimensional to a low-dimensional space, ϵ is the measurement error independent of x, and φ is an arbitrary unknown function. For example, in a single-cell RNA sequencing dataset, x could be the expression of genes for a cell and y could be the cell type.
Under assumption (1.1), although the low-dimensional representation ${V^{\top }}x$ is determined by a linear transformation, the function φ is arbitrary and unknown. In this paper, we adopt the assumption in (1.1) and focus on linear DR methods for two reasons. First, linear methods are computationally more efficient, particularly for large p and large n. Second, linear methods are more interpretable, which is essential in scientific applications. For instance, in the example above, each coordinate of ${V^{\top }}x$ is often interpreted as a genetic pathway [4]. Although our proposed method can be extended to the nonlinear case by the kernel method, we leave this for future work.
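As a concrete illustration of model (1.1), the following minimal Python sketch simulates data in which y depends on x only through ${V^{\top }}x$; the link function, noise level, and dimensions are hypothetical choices for illustration only, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, d = 500, 50, 2                       # illustrative sizes

# A random basis V with orthonormal columns (an element of St(p, d)).
V, _ = np.linalg.qr(rng.standard_normal((p, d)))

x = rng.standard_normal((n, p))            # covariates, one row per observation
eps = 0.1 * rng.standard_normal(n)         # measurement error, independent of x

# A hypothetical nonlinear link phi: y depends on x only through V^T x.
z = x @ V                                  # low-dimensional representation V^T x
y = np.sin(z[:, 0]) + z[:, 1] ** 2 + eps
```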
Sliced Inverse Regression (SIR, [41]) is a well-established technique for supervised dimension reduction that is widely applicable in multiple scenarios due to its roots in regression analysis. It has been shown to have strong consistency results in both fixed dimensional [32] and high-dimensional [45] settings. The goal of SIR is to capture the most relevant low-dimensional linear subspace without any parametric or nonparametric model-fitting process for φ.
Moreover, SIR offers a geometric interpretation by conditioning on the sufficient statistics of the input distribution [41, 18]. SIR incorporates the idea of linear dimension reduction with statistical sufficiency. In SIR, given a pair of features $x\in {\mathbb{R}^{p}}$ and univariate response $y\in \mathbb{R}$, the goal is to find a matrix $V\in {\mathbb{R}^{p\times d}}$, $d\lt p$ such that y is conditionally independent of x given ${V^{\top }}x$. Although the matrix V is not identifiable, the column space of V, denoted $\mathcal{C}(V)$, is identifiable.
Motivated by emerging modern high-dimensional [25, 44, 64] and biological datasets [28, 42], SIR has evolved into several generalizations, including localized SIR [68], kernel SIR [67], SIR with regularization [42], SIR for longitudinal data [34, 43], SIR for metric response values [60], and online SIR [10].
In this article, we focus on a specific type of high-dimensional biological data, where the dataset consists of two groups — a foreground group, also known as treatment group or case group, and a background group, also known as control group. The goal is to identify the low-dimensional structure, variation, and information unique to the foreground group for downstream analysis. This situation arises naturally in many scientific experiments with two subpopulations. For example, in Electronic Health Record (EHR) data, the foreground data could be health-related covariates from patients who received certain medical treatment, while the background data could be measurements from healthy patients who did not receive any treatment. In this case, the goal is to identify a unique structure in patients who received the treatment that can predict future outcomes. In a genomics context, the foreground data could be gene expression measurements from patients with a disease, and the background data could be measurements from healthy people. In this case, the goal is to predict a certain phenotype for the diseased patient for disease analysis and future therapy.
Previous contrastive models, such as the contrastive latent variable model (CLVM, [73]), contrastive principal component analysis (CPCA, [1]), probabilistic contrastive principal component analysis (PCPCA, [38]), and the contrastive Poisson latent variable model (CPLVM, [35]), have shown that using the case-control structure between foreground and background groups can greatly improve the effectiveness of dimension reduction over standard DR methods such as PCA and its variants. However, to the best of our knowledge, none of these unsupervised contrastive dimension reduction methods is directly applicable to the SDR setting.
In this work, we move from these unsupervised contrastive dimension reduction methods to a supervised contrastive dimension reduction setting. By combining a contrastive loss function with the sufficient dimension reduction framework underlying SIR, we propose the Contrastive Inverse Regression (CIR) model, which exactly recovers SIR when its contrastive weight is set to zero. The CIR model sheds light on how to explore and exploit contrastive structures in supervised dimension reduction.
Table 1
Categorization of DR methods by whether they are supervised or contrastive.
 | Supervised: No | Supervised: Yes
Contrastive: No | PCA, CCA | SIR, LDA, LASSO
Contrastive: Yes | CPCA, PCPCA | CIR
Table 1 lists several popular DR methods and their properties. The table categorizes each method as “contrastive” or not, based on whether it is designed for case-control data and can identify low-dimensional structure unique to the case group, and as “supervised” or not, based on whether it takes the response variable y into consideration and uses ${V^{\top }}x$ to predict y. For example, PCA, the most well-known DR method, is neither contrastive nor supervised. Similarly, canonical correlation analysis (CCA, [31]) utilizes neither y nor the unique information of one group, which makes it neither contrastive nor supervised. Methods such as CLVM, CPCA, PCPCA, and CPLVM are contrastive but not supervised. On the other hand, classical supervised DR methods including SIR [41], linear discriminant analysis (LDA, [26]), and the least absolute shrinkage and selection operator (LASSO, [58]) are supervised but not contrastive. Our proposed method, CIR, combines both features by utilizing both the response y and the case-control structure.
Note that assumption (1.1) allows the response variable y to be either continuous or categorical, so we do not distinguish between regression and classification. However, some methods listed in Table 1 are specifically designed for continuous y (regression, LASSO) or categorical y (classification, LDA). CIR handles both scenarios, with the only difference being the choice of slices, as explained in Section 2. Furthermore, not all existing DR methods are included in this table. For example, the recently proposed linear optimal low-rank projection (LOL, [61]) is designed for the classification setting and requires the number of classes to be smaller than the reduced dimension d. This can be restrictive; for example, when applied to a single-cell RNA sequencing dataset, d would need to be greater than the number of cell types. In contrast, CIR does not impose such data-dependent constraints on the reduced dimension d. Similarly, data visualization algorithms that require $d=2$, such as t-distributed stochastic neighbor embedding (tSNE, [59]) and uniform manifold approximation and projection (UMAP, [5]), are not listed in the table.
We now present our proposed methodology, including an algorithm for solving the associated nonconvex optimization problem on the Stiefel manifold. We also analyze the convergence of the algorithm and conduct extensive experiments demonstrating its superior performance on high-dimensional biomedical datasets compared to existing DR methods. All proofs are provided in the appendix, and additional experimental details are given in the supplementary material.
2 Method
To maintain consistency, we will use the terms “foreground group” and “background group” instead of “case-control” or “treatment-control” in the remaining sections. First, we briefly review SIR as our motivation.
Definition 1 (Stiefel manifold).
The set $\operatorname{St}(p,d):=\{V\in {\mathbb{R}^{p\times d}}:{V^{\top }}V={\operatorname{I}_{d}}\}$, equipped with its natural smooth manifold structure and a Riemannian metric, is called the Stiefel manifold.
Recall that under the assumption in Equation (1.1), the centered inverse regression curve, $\mathbb{E}[x|y]-\mathbb{E}[x]$, lies exactly in the linear space spanned by the columns of ${\Sigma _{xx}}V$, denoted by $\mathcal{C}({\Sigma _{xx}}V)$, where ${\Sigma _{xx}}$ is the covariance matrix of x. This linear subspace is called the effective dimension reduced (e.d.r.) space [41]. As a result, the objective of SIR is to minimize the expected squared distance between $\mathbb{E}[x|y]$ and $\mathcal{C}({\Sigma _{xx}}V)$:
(2.1)
\[ \underset{V\in \operatorname{St}(p,d)}{\min }{\mathbb{E}_{y}}\big[{d^{2}}\big(\mathbb{E}[x\hspace{2.5pt}|\hspace{2.5pt}y],\mathcal{C}({\Sigma _{xx}}V)\big)\big],\]
where d is the Euclidean distance.
In the contrastive setting, denote foreground data by $(x,y)\in {\mathbb{R}^{p}}\times \mathbb{R}$ and background data by $(\widetilde{x},\widetilde{y})\in {\mathbb{R}^{p}}\times \mathbb{R}$. For convenience, we assume that x and $\widetilde{x}$ are centered at the origin so that $\mathbb{E}[x]=\mathbb{E}[\widetilde{x}]=0$.
Our goal is to find a low-dimensional representation of x, denoted by ${V^{\top }}x$, such that y is determined by ${V^{\top }}x$ while $\widetilde{y}$ is not determined by ${V^{\top }}\widetilde{x}$. More precisely, CIR seeks a low-dimensional subspace represented by V such that
(2.2)
\[ \left\{\begin{array}{l@{\hskip10.0pt}l}y=\varphi ({V^{\top }}x,\epsilon ),\hspace{1em}& \text{for some function}\hspace{2.5pt}\varphi ,\\ {} \widetilde{y}\ne \psi ({V^{\top }}\widetilde{x},\widetilde{\epsilon }),\hspace{1em}& \text{for any function}\hspace{2.5pt}\psi .\end{array}\right.\]
That is, the column space of V captures the low-dimensional information unique to the foreground group, so that ${V^{\top }}x$ can predict y through φ, but ${V^{\top }}\widetilde{x}$ cannot predict $\widetilde{y}$ through any function ψ. Instead of optimizing a single loss as in SIR, CIR optimizes the subspace $\mathcal{C}({\Sigma _{xx}}V)$ to minimize the contrastive loss function
(2.3)
\[\begin{aligned}{}f(V)& :={\mathbb{E}_{y}}\big[{d^{2}}\big(\mathbb{E}[x\hspace{2.5pt}|\hspace{2.5pt}y],\mathcal{C}({\Sigma _{xx}}V)\big)\big]\\ {} & \hspace{1em}-\alpha {\mathbb{E}_{\widetilde{y}}}\big[{d^{2}}\big(\mathbb{E}[\widetilde{x}\hspace{2.5pt}|\hspace{2.5pt}\widetilde{y}],\mathcal{C}({\Sigma _{\widetilde{x}\widetilde{x}}}V)\big)\big],\end{aligned}\]
where $\alpha \ge 0$, ${\Sigma _{xx}}=\operatorname{Cov}(x)\hspace{2.5pt}\text{and}\hspace{2.5pt}{\Sigma _{\widetilde{x}\widetilde{x}}}=\operatorname{Cov}(\widetilde{x})\in {\mathbb{R}^{p\times p}}$, and d is the Euclidean distance. Define the following notation:
\[\begin{array}{l}\displaystyle {v_{y}}=\mathbb{E}[x\hspace{2.5pt}|\hspace{2.5pt}y],\hspace{2.5pt}{v_{\widetilde{y}}}=\mathbb{E}[\widetilde{x}\hspace{2.5pt}|\hspace{2.5pt}\widetilde{y}]\in {\mathbb{R}^{p}},\\ {} \displaystyle {\Sigma _{x}}=\operatorname{Cov}({v_{y}}),\hspace{2.5pt}{\Sigma _{\widetilde{x}}}=\operatorname{Cov}({v_{\widetilde{y}}})\in {\mathbb{R}^{p\times p}}.\end{array}\]
${v_{y}}$ and ${v_{\widetilde{y}}}$ are called the centered inverse regression curves [41, 60]. The resulting loss function f balances the effectiveness of dimension reduction between the foreground and background groups. The hyperparameter α can be adjusted to express our belief in the importance of the background group. Note that the parameter α appears naturally in other contrastive DR methods, including CPCA and PCPCA.
Proposition 1.
The objective function f given by Equation (2.3) can be simplified to
\[ f(V)=-\operatorname{tr}\big({V^{\top }}AV{\big({V^{\top }}BV\big)^{-1}}\big)+\alpha \operatorname{tr}\big({V^{\top }}\widetilde{A}V{\big({V^{\top }}\widetilde{B}V\big)^{-1}}\big)\]
where $A={\Sigma _{xx}}{\Sigma _{x}}{\Sigma _{xx}}$, $B={\Sigma _{xx}^{2}}$, $\widetilde{A}={\Sigma _{\widetilde{x}\widetilde{x}}}{\Sigma _{\widetilde{x}}}{\Sigma _{\widetilde{x}\widetilde{x}}}$, and $\widetilde{B}={\Sigma _{\widetilde{x}\widetilde{x}}^{2}}$.
Note that V is not identifiable; it is identifiable only up to a d-dimensional rotation. However, the contrastive loss f, determined by $V{V^{\top }}$, the projection matrix onto the column space of V, is invariant under such rotations. This nonidentifiability is common to other DR methods, including PCA, CPCA, and SIR, where the convention is to refer to the column space of V. Therefore, it is consistent with standard practice and does not affect the validity of our results.
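As a quick numerical illustration, the trace form in Proposition 1 is easy to evaluate once the plug-in matrices are available. The sketch below (NumPy) computes f(V); the function name and argument layout are ours and are not from any released implementation.

```python
import numpy as np

def contrastive_loss(V, A, B, A_bg, B_bg, alpha):
    """Evaluate f(V) in the trace form of Proposition 1."""
    def term(M, N):
        # tr(V^T M V (V^T N V)^{-1}), computed with a linear solve instead of an explicit inverse
        VtMV = V.T @ M @ V
        VtNV = V.T @ N @ V
        return np.trace(np.linalg.solve(VtNV, VtMV))
    # Foreground term enters with a minus sign; background term is weighted by alpha.
    return -term(A, B) + alpha * term(A_bg, B_bg)
```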
Observe that when $\alpha =0$, CIR reduces to SIR. In that case, the problem can be reparameterized by ${V^{\ast }}={B^{1/2}}V$ so that the columns of ${V^{\ast }}$ are orthonormal, which reduces the loss function to a quadratic form and yields a closed-form solution as a generalized eigenproblem. When $\alpha \gt 0$, however, the same trick cannot be applied to both B and $\widetilde{B}$, so we must resort to numerical approximation. We adopt gradient-based optimization algorithms on $\operatorname{St}(p,d)$, based on the gradient of f given by the following lemma.
Lemma 1.
The gradient of f is given by
\[\begin{aligned}{}& \operatorname{grad}f(V)\\ {} & =-2\big(AV{\big({V^{\top }}BV\big)^{-1}}\hspace{-0.1667em}-\hspace{-0.1667em}BV{\big({V^{\top }}BV\big)^{-1}}{V^{\top }}AV{\big({V^{\top }}BV\big)^{-1}}\big)\\ {} & \hspace{1em}+2\alpha \big(\widetilde{A}V{\big({V^{\top }}\widetilde{B}V\big)^{-1}}\\ {} & \hspace{1em}-\widetilde{B}V{\big({V^{\top }}\widetilde{B}V\big)^{-1}}{V^{\top }}\widetilde{A}V{\big({V^{\top }}\widetilde{B}V\big)^{-1}}\big).\end{aligned}\]
Note that the gradient $\operatorname{grad}f$ differs from the standard Euclidean gradient, denoted by $Df=\frac{\partial f}{\partial V}$: $\operatorname{grad}f$ lies in the tangent space of $\operatorname{St}(p,d)$ at V, while the Euclidean gradient need not.
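The displayed gradient can be transcribed directly into code. The sketch below is a literal NumPy transcription of Lemma 1, reusing the plug-in matrices A, B, $\widetilde{A}$, $\widetilde{B}$ from the previous snippet; it is only an illustration, not an official implementation.

```python
import numpy as np

def grad_f(V, A, B, A_bg, B_bg, alpha):
    """Gradient of f at V, transcribed from Lemma 1."""
    def part(M, N):
        # M V E - N V E (V^T M V) E,  with E = (V^T N V)^{-1}
        E = np.linalg.inv(V.T @ N @ V)
        return M @ V @ E - N @ V @ E @ (V.T @ M @ V) @ E
    return -2.0 * part(A, B) + 2.0 * alpha * part(A_bg, B_bg)
```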
Theorem 1.
If V is a local minimizer of the optimization problem (2.3), then
(2.4)
\[ AVE(V)-BVF(V)=\alpha \big(\widetilde{A}V\widetilde{E}(V)-\widetilde{B}V\widetilde{F}(V)\big),\]
where $E(V)={({V^{\top }}BV)^{-1}}$, $\widetilde{E}(V)={({V^{\top }}\widetilde{B}V)^{-1}}$, $F(V)={({V^{\top }}BV)^{-1}}{V^{\top }}AV{({V^{\top }}BV)^{-1}}$, and $\widetilde{F}(V)={({V^{\top }}\widetilde{B}V)^{-1}}{V^{\top }}\widetilde{A}V{({V^{\top }}\widetilde{B}V)^{-1}}$.
Let $G(V)={V^{\top }}AV$ and $\widetilde{G}(V)={V^{\top }}\widetilde{A}V$. Since $F(V)=E(V)G(V)E(V)$ and $\widetilde{F}(V)=\widetilde{E}(V)\widetilde{G}(V)\widetilde{E}(V)$, the local optimality condition (2.4) is equivalent to
\[ AVE(V)-BVE(V)G(V)E(V)=\alpha \big(\widetilde{A}V\widetilde{E}(V)-\widetilde{B}V\widetilde{E}(V)\widetilde{G}(V)\widetilde{E}(V)\big).\]
In Appendix G, we discuss how Equation (2.4) may lead to a gradient-free algorithm that involves solving an asymmetric algebraic Riccati equation.
So far, we have discussed the population version, which relies on the unknown distributions of x, $\widetilde{x}$, y, and $\widetilde{y}$. In real applications, we observe finite samples ${({x_{i}},{y_{i}})_{i=1}^{n}}$ as foreground data and ${({\widetilde{x}_{j}},{\widetilde{y}_{j}})_{j=1}^{m}}$ as background data. We denote by $X\in {\mathbb{R}^{n\times p}}$ and $\widetilde{X}\in {\mathbb{R}^{m\times p}}$ the matrices whose rows are the samples; similarly, each entry of $Y\in {\mathbb{R}^{n}}$ and $\widetilde{Y}\in {\mathbb{R}^{m}}$ is a response value. We replace expectations by sample means to obtain estimates of ${\Sigma _{x}}$, ${\Sigma _{\widetilde{x}}}$, ${\Sigma _{xx}}$, and ${\Sigma _{\widetilde{x}\widetilde{x}}}$, and hence the corresponding plug-in estimates of A, B, $\widetilde{A}$, and $\widetilde{B}$. Throughout this paper, we assume $p\lt n$ so that all covariance matrices are nonsingular; the extension to $p\gt n$ is discussed in Section 5. After computing these matrices, the problem reduces to a manifold optimization problem [2].

The estimates of ${\Sigma _{x}}$ and ${\Sigma _{\widetilde{x}}}$ deserve further discussion. As shown by [41] and [10], among others, for a continuous response y, the observed support of y can be discretized into slices ${I_{h}}=({q_{h-1}},{q_{h}}]$, for $h=1,\dots ,H$. An estimate of ${\Sigma _{x}}$ is then given by ${\textstyle\sum _{h=1}^{H}}{m_{h}}{m_{h}^{\top }}$, where ${m_{h}}=\frac{1}{n{p_{h}}}{\textstyle\sum _{{y_{i}}\in {I_{h}}}}{x_{i}}$ and ${p_{h}}=\frac{1}{n}{\textstyle\sum _{i=1}^{n}}I({y_{i}}\in {I_{h}})$. If y and $\widetilde{y}$ are categorical, the slices are naturally chosen as the possible values of y and $\widetilde{y}$. Combining these pieces, we present the empirical version of the CIR algorithm in Algorithm 1.
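As a sketch of the plug-in step just described (assuming a continuous response and quantile-based slice boundaries, which is one common choice), the estimate of ${\Sigma _{x}}$ can be computed as follows; the background quantities are obtained in the same way, and the variable names are ours.

```python
import numpy as np

def slice_mean_cov(X, y, H=10):
    """Estimate Sigma_x = Cov(E[x|y]) by slicing the response, as described above.

    X is the centered (n, p) foreground data matrix; y is a continuous response of length n.
    """
    n, p = X.shape
    edges = np.quantile(y, np.linspace(0.0, 1.0, H + 1))  # slice boundaries q_0, ..., q_H
    Sigma = np.zeros((p, p))
    for h in range(H):
        # Slice I_h = (q_{h-1}, q_h]; the first slice also keeps the left endpoint.
        mask = (y <= edges[h + 1]) if h == 0 else (y > edges[h]) & (y <= edges[h + 1])
        if mask.sum() == 0:
            continue
        m_h = X[mask].mean(axis=0)          # slice mean m_h
        Sigma += np.outer(m_h, m_h)         # accumulate m_h m_h^T
    return Sigma

# Plug-in matrices for the foreground group (the background group is analogous):
# Sigma_xx = np.cov(X, rowvar=False); Sigma_x = slice_mean_cov(X, y)
# A = Sigma_xx @ Sigma_x @ Sigma_xx;  B = Sigma_xx @ Sigma_xx
```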
It is worth noting that our optimization of f as a function of $V\in \operatorname{St}(p,d)$ cannot be considered as an optimization problem in ${\mathbb{R}^{p\times d}}$ with orthogonality constraints ${V^{\top }}V={\operatorname{I}_{d}}$ [23, 7]. Because the term ${({V^{\top }}BV)^{-1}}$ in f is not well defined unless V is full rank, our loss function f cannot be extended to the full Euclidean space ${\mathbb{R}^{p\times d}}$. We consider it as an optimization problem intrinsically defined on $\operatorname{St}(p,d)$ as laid out by [2]. This key property excludes some commonly used optimizers on manifolds, and we will discuss more details in the next section.
3 Theory
In this section, we discuss two concrete optimization algorithms for the last step in Algorithm 1 to find ${V^{\ast }}$ and show their convergence. The optimization problem outlined in Equation (2.3) does not follow the classic Li-Duan theorem for regression-based dimension reduction due to its nonconvex nature (see, e.g., [17, Prop. 8.1]). The convergence of the algorithm is discussed in detail below.
The first algorithm we consider is the scaled gradient projection method (SGPM), specifically designed for optimization on the Stiefel manifold [50]. We first define an analog of the Lagrangian, $\mathcal{L}(V,\Lambda ):=f(V)-\frac{1}{2}\langle \Lambda ,{V^{\top }}V-{\operatorname{I}_{d}}\rangle $. The SGPM algorithm is summarized in Algorithm 2, where $\pi (X)=\arg {\min _{Q\in \operatorname{St}(p,d)}}{\| X-Q\| _{F}}$ is the orthogonal projection onto the Stiefel manifold. Note that the update for μ and ${C_{k+1}}$ in Algorithm 2 is intricate; see [50] for more details.
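The projection π has a well-known closed form: if $X=U\Sigma {W^{\top }}$ is a thin singular value decomposition, then $\pi (X)=U{W^{\top }}$, the orthogonal factor of the polar decomposition. A minimal sketch:

```python
import numpy as np

def stiefel_projection(X):
    """pi(X) = argmin_{Q in St(p,d)} ||X - Q||_F, via the thin SVD (orthogonal polar factor)."""
    U, _, Wt = np.linalg.svd(X, full_matrices=False)
    return U @ Wt
```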
To study the convergence of Algorithm 2, we need to study the Karush-Kuhn-Tucker (KKT) conditions for CIR:
Lemma 2 ([50]).
If $V\in \operatorname{St}(p,d)$ is a local minimizer of the optimization problem (2.3), then there exists a symmetric matrix $\Lambda \in {\mathbb{R}^{d\times d}}$ such that
\[ {\nabla _{V}}\mathcal{L}(V,\Lambda )=Df(V)-V\Lambda =0,\hspace{1em}{V^{\top }}V={\operatorname{I}_{d}}.\]
Now we can state the convergence theorem of Algorithm 2:
Theorem 2.
Let ${\{{V_{k}}\}_{k=1}^{\infty }}$ be an infinite sequence generated by Algorithm 2. Then any accumulation point ${V_{\ast }}$ of ${\{{V_{k}}\}_{k=1}^{\infty }}$ satisfies the KKT conditions in Lemma 2.
Although Algorithm 2 is guaranteed to converge, it has two drawbacks. First, the accumulation point ${V_{\ast }}$ is only guaranteed to satisfy the KKT conditions and is not necessarily a critical point. Second, we do not know how fast ${V_{k}}$ converges to ${V_{\ast }}$. Next, we introduce an accelerated line search (ALS) algorithm as an alternative to SGPM that converges to a critical point with a known convergence rate. ALS is summarized in Algorithm 3, where ${t_{k}^{A}}$ is the Armijo step size for given $\bar{\alpha }$, β, σ, and ${\eta _{k}}$, and R is a retraction onto $\operatorname{St}(p,d)$; see [2] for more details.
Algorithm 3 can be shown to have linear convergence to critical points if the hyperparameters are chosen properly. For other choices of ${\eta _{k}}$, see Appendix E for more details. The following adaptation of Theorem 4.5.6 in [2] indicates linear convergence to stationary points.
Theorem 3.
Let ${\{{V_{k}}\}_{k=1}^{\infty }}$ be an infinite sequence generated by Algorithm 3 with ${\eta _{k}}=-\operatorname{grad}f({V_{k}})$, converging to an accumulation point ${V_{\ast }}$ of ${\{{V_{k}}\}_{k=1}^{\infty }}$. Then ${V_{\ast }}$ is a critical point of f and ${\lim \nolimits_{k\to \infty }}\| \operatorname{grad}f({V_{k}})\| =0$.
Furthermore, assume that ${V_{\ast }}$ is a local minimizer of f with $0\lt {\lambda _{l}}:=\min \operatorname{eig}(\operatorname{Hess}(f)({V_{\ast }}))$ and ${\lambda _{u}}:=\max \operatorname{eig}(\operatorname{Hess}(f)({V_{\ast }}))$. Then, for any $r\in ({r_{\ast }},1)$, where
\[ {r_{\ast }}:=1-\min \bigg(2\sigma \bar{\alpha }{\lambda _{l}},4\sigma (1-\sigma )\beta \frac{{\lambda _{l}}}{{\lambda _{u}}}\bigg),\]
there exists an integer $K\ge 0$ such that
\[ f({V_{k+1}})-f({V_{\ast }})\le \big(r+(1-r)(1-c)\big)\big(f({V_{k}})-f({V_{\ast }})\big)\]
for all $k\ge K$, where $\bar{\alpha }$, β, σ, and c are the hyperparameters in Algorithm 3.

The difference between Algorithms 2 and 3 deserves further comment. While we empirically observe that Algorithm 2 often converges faster, Algorithm 3 has theoretical properties that allow for a proof of linear convergence in terms of an upper bound on the number of iterations. In practice, we suggest running Algorithm 3 for smaller datasets, while for larger datasets we recommend Algorithm 2 for efficiency.
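To make the ALS update concrete, the following sketch performs a single Armijo backtracking step along $-\operatorname{grad}f$ with a QR-based retraction. It reuses the grad_f helper sketched earlier, takes the loss as a callable, and is only an illustration of the general scheme in [2] under assumed default hyperparameters, not a transcription of Algorithm 3.

```python
import numpy as np

def qr_retraction(V, xi):
    """Retract V + xi back onto St(p, d) via the Q factor of a QR decomposition."""
    Q, R = np.linalg.qr(V + xi)
    return Q * np.sign(np.diag(R))          # fix column signs so the map is continuous

def armijo_step(V, f, grad, alpha_bar=1.0, beta=0.5, sigma=1e-4, max_backtracks=30):
    """One line-search update: backtrack until the Armijo sufficient-decrease condition holds."""
    g = grad(V)
    eta = -g                                # search direction eta_k = -grad f(V_k)
    fV = f(V)
    slope = sigma * np.sum(g * eta)         # sigma * <grad f(V), eta>, a negative number
    t = alpha_bar
    for _ in range(max_backtracks):
        V_new = qr_retraction(V, t * eta)
        if f(V_new) <= fV + t * slope:      # Armijo condition with step size t
            return V_new
        t *= beta                           # shrink the step
    return V                                # no acceptable step found; keep the current iterate
```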
The computational complexity of CIR, for both the SGPM-based and the ALS-based optimization, is compared to that of various competitors in Table 2. Here, we assume that $1\le d\lt p\lt m,n$, where the background and foreground data have m and n samples, respectively, and we let ϵ denote the stopping error, i.e., $f({V_{k}})-f({V_{\ast }})\le \epsilon $. It is noteworthy that a p-dimensional singular value decomposition can be computed in $\mathcal{O}({p^{3}})$ time.
Table 2
Computational time-complexity of CIR and competitors.
Algorithm | Theoretical Algorithmic Complexity |
CIR, SGPM-based | $\mathcal{O}((m+n){p^{2}})$ |
CIR, ALS-based | $\mathcal{O}(-\log (\epsilon ){p^{3}}+(m+n){p^{2}})$ |
LDA | $\mathcal{O}(n{p^{2}})$ |
PCA | $\mathcal{O}(n{p^{2}})$ |
CPCA | $\mathcal{O}((m+n){p^{2}})$ |
SIR | $\mathcal{O}(n{p^{2}})$ |
4 Application
When applying CIR, several hyperparameters must be tuned, such as the weight α, the reduced dimension d, and the slices ${I_{h}}$ and ${I_{\widetilde{h}}}$ for estimation of ${\Sigma _{x}}$ and ${\Sigma _{\widetilde{x}}}$. In some cases, it may also be necessary to determine the definition of the foreground and background groups and to assign background labels $\widetilde{Y}$.
The value of $\alpha \ge 0$ can be determined by cross-validation. In particular, we suggest the following two-step method. First, identify a rough range for α on a logarithmic scale. Because CIR coincides with SIR when $\alpha =0$, running SIR initially can provide insight: better performance of SIR suggests a smaller α, and vice versa. Once a rough range is identified, standard cross-validation can be used within this range. Our numerical experiments show that CIR is robust to the choice of α; that is, the performance of the method changes continuously with α. Tables supporting this observation are provided in the supplementary material.
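A sketch of the second step is given below; fit_cir is a hypothetical fitting routine returning the projection V, a single validation split is used for brevity, and downstream KNN accuracy serves as the validation score.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def select_alpha(X, y, X_bg, y_bg, alphas, d, fit_cir, seed=0):
    """Pick alpha from a candidate grid by validation accuracy of KNN on the reduced data.

    fit_cir(X, y, X_bg, y_bg, alpha, d) -> V is a hypothetical CIR fitting routine.
    """
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=seed)
    scores = []
    for alpha in alphas:
        V = fit_cir(X_tr, y_tr, X_bg, y_bg, alpha, d)
        clf = KNeighborsClassifier().fit(X_tr @ V, y_tr)
        scores.append(clf.score(X_val @ V, y_val))
    return alphas[int(np.argmax(scores))]

# e.g., a logarithmic grid once the rough range from step 1 is known:
# best_alpha = select_alpha(X, y, X_bg, y_bg, alphas=[0.1, 0.3, 1.0, 3.0, 10.0], d=2, fit_cir=fit_cir)
```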
Additionally, the choice of reduced dimension d may depend on the goal of the investigator. If visualization is considered important, $d=2$ is appropriate. If the goal is prediction, the elbow point of the d versus prediction error plot may suggest an optimal d. However, as with other DR methods, determining the optimal value of d is still a topic of ongoing research [12, 13].
The definition of foreground data X and Y should be the data and the target variable of interest, while the choices of background data $\widetilde{X}$ and $\widetilde{Y}$ may not be as straightforward. These data are intended to represent “noise” that is “subtracted” from the foreground data. For example, in the biomedical context, if the population of interest is a group of sick patients, the background dataset may include observations of healthy individuals. In other contexts, however, it may be appropriate to use $\widetilde{X}=X$. In this case, the choice of background label $\widetilde{Y}$ may be unclear. If another outcome variable was collected, it could be used as $\widetilde{Y}$; otherwise, randomly selected values in the support of Y could be used to represent “noise”.
The estimates for ${\Sigma _{x}}$ and ${\Sigma _{\widetilde{x}}}$ are partly determined by whether y and $\widetilde{y}$ are categorical or continuous. If these variables are categorical, then each value of y (or $\widetilde{y}$) can be considered as a separate slice, resulting in $|\text{supp}(Y)|$ (or $|\text{supp}(\widetilde{Y})|$) total slices. On the other hand, if these variables are continuous, slices can be taken to represent an equally spaced partition of the range of Y (or $\widetilde{Y}$), with the number of slices being tunable hyperparameters.
4.1 Mouse Protein Expression
The first dataset we consider was collected for the purpose of identifying proteins critical to learning in a mouse model of Down syndrome [27]. The data contain 1095 observations of expression levels of 77 different proteins, along with genotype (t=Ts65Dn, c=control), behavior (CS=context-shock, SC=shock-context), and treatment (m=memantine, s=saline). The behavior of CS corresponds to the scenario in which the mouse was first placed in a new cage and permitted to explore for a few minutes before being exposed to a brief electric shock; conversely, SC corresponds to mice immediately given an electric shock upon being placed in a new cage, and then being permitted to explore. Of the data, 543 samples contain at least one missing value. Taking into account the relatively large sample size, we consider only the 552 observations with complete data. We do not perform any normalization or any other type of preprocessing to the raw data prior to analysis.
In this example, $X\in {\mathbb{R}^{552\times 77}}$ represents the expression of 77 proteins of all mice without a missing value, while ${y_{i}}\in \{0,1,\dots ,7\}$ represents the combination of 3 binary variables: genotype, treatment, and behavior. For example, ${y_{i}}=1$ means that the i-th mouse received memantine, was exposed to context-shock, and has genotype Ts65Dn. To visualize the data, we apply unsupervised DR algorithms PCA, tSNE and UMAP to X and supervised DR methods LDA, LASSO and SIR to $(X,Y)$, with $d=2$ for all algorithms. The 2-dimensional representation is given in Figure 1, where each color represents a class of mice among 8 total classes.
Figure 1
2-d representation of mouse protein data. Silhouette scores: (PCA, $-.20$), (CPCA, $-.13$), (LDA, .42), (LASSO, $-.17$), (SIR, .03), (CIR, .29), (tSNE, $-.14$), (UMAP, $-.00$).
PCA, LASSO, SIR, tSNE, and UMAP fail to distinguish between classes, whereas LDA successfully separates 5 classes but with 3 classes (c-CS-m, c-CS-s, t-CS-m) mixed together. Now we take advantage of the background data. We let $\widetilde{X}$ be the protein expression from mice with genotype = control, which coincides with the background group used in previous studies of this application [1, 38]. We set $\widetilde{Y}$ as the binary variable representing behavior and apply CPCA to $(X,\widetilde{X})$ and CIR to $(X,Y,\widetilde{X},\widetilde{Y})$ with $d=2$ as well. The 2-dimensional representations with their Silhouette scores [53] are shown in Figure 1, which indicates that CIR outperforms all other competitors except LDA in terms of the Silhouette score. In particular, the three classes that were not separated in LDA are less mixed in CIR, supported by the Silhouette scores for these three classes: $-0.10$ for LDA and 0.23 for CIR. We provide other objective scores [11, 22] in the supplementary material.
Next, we show the classification accuracy based on $XV$, the dimension-reduced data. Here, we vary d from 2 to 7 because for higher d the accuracy is close to 1. The mean prediction accuracy of KNN, the best classifier for this example, over 10 replicates is shown against the reduced dimension d in Figure 2, clearly indicating that CIR outperforms all competitors, especially when d is small. We present the accuracy of other classifiers and their standard deviations in the supplementary material.
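The evaluation protocol used here can be sketched as follows with scikit-learn; fit_cir is again a hypothetical routine returning the projection V, the value of α is illustrative, and the repeated shuffled splits mirror the 10 replicates reported above.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def knn_accuracy_curve(X, y, X_bg, y_bg, fit_cir, dims=range(2, 8), n_repeats=10):
    """Mean KNN accuracy on the reduced data X @ V for each target dimension d."""
    curve = {}
    for d in dims:
        V = fit_cir(X, y, X_bg, y_bg, alpha=1.0, d=d)   # hypothetical CIR fit; alpha illustrative
        accs = []
        for rep in range(n_repeats):
            cv = KFold(n_splits=5, shuffle=True, random_state=rep)  # fresh random split per repeat
            accs.append(cross_val_score(KNeighborsClassifier(), X @ V, y, cv=cv).mean())
        curve[d] = float(np.mean(accs))
    return curve
```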
4.2 Single Cell RNA Sequencing
The second dataset we consider is from a study of single-cell RNA sequencing used to classify cells into cell types based on their transcriptional profile [3]. The data include 3500 observations of expression levels of 32838 different genes, along with cell labels as one of the 9 different cell types, namely CD8 T cell, CD4 T cell, classical monocyte, B cell, NK cell, plasmacytoid dendritic cell, non-classical monocyte, classic dendritic cell, and plasma cell. We select the top 100 most variable genes for our analysis to be consistent with previous analyses of these data [71, 1]. In this example, $X\in {\mathbb{R}^{3500\times 100}}$ represents the expression of 100 genes, while ${y_{i}}\in \{0,1,\dots ,8\}$ represents the cell type. For example, ${y_{i}}=1$ means that the i-th cell is a CD4 T cell.
To visualize the data, we apply unsupervised DR algorithms PCA, tSNE, and UMAP to X and supervised DR methods LDA, LASSO, and SIR to $(X,Y)$, for $d=2$ for all algorithms. In this case, there is no obvious choice of background data. So, we use $\widetilde{X}=X$ and randomly draw independent and identically distributed samples $\widetilde{Y}\sim \text{uniform}\{0,\dots ,8\}$ in order to apply CPCA and CIR. The 2-dimensional representations with their Silhouette scores are given in Figure 3, which indicates that CIR has the best performance.
For each $d=2,\dots ,10$, we compare the accuracy of a KNN classifier based on the dimension-reduced data among the various methods, with the raw data as the baseline. We repeat this process 10 times to reduce the impact of the random split in cross-validation. The prediction accuracy versus reduced dimension d is shown in Figure 4, where CIR has the best overall performance, especially when $d=2,3$. We show the accuracy of other classifiers and their standard deviations in the supplementary material.
Figure 3
2-d representation of single-cell RNA sequencing data. Silhouette scores: (PCA, $-.25$), (CPCA, $-.25$), (LDA, .11), (LASSO, $-.30$), (SIR, $-.10$), (CIR, .15), (tSNE, $-.17$), (UMAP, $-.17$).
The improved performance of CIR over SIR deserves further comment. Because the background label is chosen at random, the background data and labels $(\widetilde{X},\widetilde{Y})$ used by CIR add no information beyond what SIR already uses; we therefore attribute the improved performance to CIR “denoising” the foreground data.
4.3 COVID-19 Cell States
The third dataset we consider is also a single-cell RNA sequencing dataset, with samples from 90 patients with COVID-19 and 23 healthy volunteers [56]. In total, the dataset contains 48083 cells from diseased patients and 14426 cells from healthy volunteers. We treat the cells from the patients with disease as the foreground and the cells from the healthy volunteers as the background. For each cell, RNA expression levels of 24727 different genes were measured. For the features, we selected the 500 genes with the largest variances in RNA expression, in accordance with a prior analysis of this dataset [21].
For each cell in the dataset, its cell type is identified, which we use as the label. As recommended in a previous analysis [21], we consider only the cell types for which at least 250 observations were available. This filtering resulted in 14 distinct cell types, with 40411 observations in the foreground and 13041 in the background. As in the previous examples, $X\in {\mathbb{R}^{40411\times 500}}$ represents the expression of 500 genes in the cells of patients with COVID-19 and $\widetilde{X}\in {\mathbb{R}^{13041\times 500}}$ represents the gene expression in the healthy volunteers, while ${y_{i}},{\widetilde{y}_{j}}\in \{0,1,\dots ,13\}$ represent the cell types.
As with the previous two examples, we first apply DR methods to visualize the data. With $d=2$, we apply PCA, tSNE, and UMAP to X, and we apply LDA, LASSO, and SIR to $(X,Y)$. We also apply CPCA to $(X,\widetilde{X})$ and CIR to $(X,Y,\widetilde{X},\widetilde{Y})$. The 2-dimensional representations with their Silhouette scores are provided in Figure 5. Although CIR is not the best overall in terms of the Silhouette score, it outperforms its direct competitors, CPCA and SIR.
Figure 5
2-d representation of COVID-19 data. Silhouette scores: (PCA, $-.29$), (CPCA, $-.29$), (LDA, .02), (LASSO, $-.48$), (SIR, $-.27$), (CIR, $-.05$), (tSNE, $-.03$), (UMAP, .11).
For $d=2,\dots ,7$, we compare the accuracy of a KNN classifier based on the dimension-reduced data among various DR methods, with the raw data as baseline. As with the previous examples, we repeat this process 10 times for each method to reduce the effect of the cross-validation random split on the results. The prediction accuracy for each reduced dimension d is shown in Figure 6, where we see that CIR is an improvement over all other methods for $d=2,3$. We show the accuracy of both the KNN-based classifier and the tree-based classifier in the supplementary material.
4.4 Plasma Retinol
The final dataset we consider is the plasma retinol dataset [49]. The dataset contains 315 observations of 14 variables, including age, sex, smoking status, BMI, vitamin use, calories, fat, fiber, cholesterol, dietary beta-carotene, dietary retinol consumed per day, number of alcoholic drinks consumed per week, and levels of plasma beta-carotene and plasma retinol.
In this example, $X\in {\mathbb{R}^{315\times 12}}$ represents measurements of the first 12 variables listed above for all subjects, while ${y_{i}}$ represents the measurement of plasma beta-carotene, a variable of particular interest to scientists [49]. In contrast to the previous examples, note that here ${y_{i}}$ is continuous, not categorical.
We apply unsupervised DR algorithms PCA, tSNE, and UMAP to X and supervised DR algorithms LDA, LASSO, and SIR to $(X,Y)$ for $d=1,\dots ,8$. Similarly to the single-cell RNA sequencing application, we let $\widetilde{X}=X$ because there is no natural choice of background data. For the background label, we set $\widetilde{Y}$ as the continuous variable representing the level of plasma retinol, which shares certain information with ${y_{i}}$, and apply CPCA to $(X,\widetilde{X})$ and CIR to $(X,Y,\widetilde{X},\widetilde{Y})$ for $d=1,\dots ,8$. We skip the visualization step in this case due to the poor visibility in terms of ${y_{i}}$.
After trying a few regression methods, namely linear regression [24], regression trees [9], Gaussian process regression [14], and neural networks [30], we present the prediction mean squared error (MSE) for the best method for this dataset, linear regression. That is, for each d and the output V from each DR algorithm, we fit a linear regression model to $(XV,Y)$. We also compare to a linear regression model based on raw data $(X,Y)$ as the baseline. Figure 7 demonstrates that CIR outperforms all other competitors, but matches SIR when $d\ge 3$.
Note that because Y and $\widetilde{Y}$ are continuous, the number of slices used to estimate ${\Sigma _{x}}$ and ${\Sigma _{\widetilde{x}}}$ needs to be chosen carefully to ensure good performance. We use cross-validation to select a partition of the support of Y and $\widetilde{Y}$ into 4 equally spaced slices. In the first three applications presented above, CIR demonstrates superior overall performance over its supervised, unsupervised, contrastive, and non-contrastive competitors, especially in low dimensions, i.e., $d=2,3$, which are the most crucial for visualization purposes.
In all four examples we considered, we consistently observed that CIR is the best among all methods when $d=2,3$. However, the gain is not obvious for higher dimensions. A possible reason for this is that when d is relatively large, methods that use only the foreground data, such as SIR or LDA, capture both shared information and unique information in the foreground. Consequently, the improvement from incorporating the background group, or any contrastive model, becomes incremental. Fortunately, $d\le 3$ are often the most important dimensions because they allow for visualization and interpretation.
5 Discussion and Future Work
In this work, we propose the CIR model and the associated optimization algorithm for supervised dimension reduction for datasets that are split into foreground and background groups. We provide a theoretical guarantee of the convergence of the CIR algorithm under mild conditions. We have shown that our CIR model outperforms competitors in multiple biomedical datasets, including mouse protein expression data, single-cell RNA sequencing data, and plasma retinol data. However, there are several important future directions that remain unaddressed in this paper, as outlined below.
Multi-Treatment It is natural to consider how our model can be extended to studies with multiple treatments. For example, in medical settings there might be more than one treatment for patients with a certain disease. It has been shown in [61] that the number of treatment groups puts a hard constraint on the target dimension. It would be interesting to generalize from a single-treatment structure to a multi-treatment structure (e.g., [47]), where the loss function requires a more sophisticated design.
Another direction in the multi-group scenario is to combine multiple CIR models trained on different foreground-background pairs. As pointed out by [65], the generalization error in stacking contrastive regression models needs to be controlled; one possible approach is to follow the divergence mixing proposed by [29], with careful normalization. The major difficulty in training such a stacking model is devising a sequential optimization scheme for model training.
Consistency and Sufficient Dimension Reduction The consistency of the proposed CIR model remains open. Theorems 2 and 3 ensure that the resulting solution must be a stationary point, but we did not discuss whether these stationary points are consistent estimates. The consistency of the estimates is also affected by the choice of α, because $\alpha =0$ reduces CIR to classical SIR for the foreground group. This consistency problem also has practical importance, as it explicitly expresses the trade-off between contrastiveness and the emphasis on the effective lower-dimensional structure. The group information and statistical sufficiency compete against each other, as we observed in the experiments; thus a range of α that balances these two factors is of interest and might be identified by a consistency result.
Furthermore, SIR has the drawback of failing to recover the entire central subspace when the symmetry assumption on x is violated [37]. [18] proposed the sliced average variance estimator (SAVE) to address this problem, which raises the natural question of how to generalize such higher-moment SDR methods to the contrastive setting.
Loss Function Our loss function (2.3) is nonstandard, which raises many questions. For example, the relation between the number of local minima and A, B, $\widetilde{A}$, $\widetilde{B}$, and α remains open. Moreover, although f cannot be continuously extended to the full Euclidean space ${\mathbb{R}^{p\times d}}$, if we restrict the domain to a submanifold of $\operatorname{St}(p,d)$, f might be extended to the convex hull of that submanifold. Such an extension would enable us to apply other efficient optimization algorithms with strong theoretical guarantees [7]. Furthermore, Appendix G raises the question of the validity of a fixed-point algorithm based on Riccati equations, which may lead to a more efficient algorithm for minimizing f without involving the gradient.
Scalability The method we presented in this paper does not handle high-dimensional data in the sense of $p\gt n,m$, because matrices B and $\widetilde{B}$ are singular in this situation. A possible extension of CIR to $p\gt n$ is to use the same technique as sparse PCA [72], which introduces a penalty term to enforce sparsity.