1 Introduction
Binary regression models are essential tools in statistical modeling, particularly for understanding relationships between binary outcomes and predictor variables. Traditional approaches often rely on the logit or probit link functions because of their theoretical appeal and interpretability. However, these standard link functions may fail when the underlying response curve exhibits heavy tails or extreme responses [10, 21, 29], as is often encountered in genetics and biomedical research. For example, binary outcomes in genetic studies, such as the presence or absence of a specific disease, are influenced by complex biological processes that standard links may capture inadequately. Moreover, inference from these models may become unreliable in the presence of outlying covariates or incoherent responses [22, 24, 40]. Addressing these limitations requires alternative approaches that better account for the complexities of such data.
In this context, the t-distribution emerges as an attractive alternative due to its flexibility in modeling heavy-tailed data, controlled by the degrees of freedom parameter v. This study aims to further investigate the t-link model’s potential to enhance robustness and mitigate sensitivity to outliers. Specifically, we explore its application to high-dimensional variable selection, a scenario that has received little, if any, attention in the current literature.
Bayesian variable selection is a theoretically intricate and computationally demanding problem. These challenges stem from the balance required between numerical computation, prior specification, and analytical evaluation, as highlighted in earlier works [18, 12, 35]. High-dimensional variable selection introduces additional computational burdens due to the increasing number of covariates. To address this, [32] proposed the skinny Gibbs sampler using the continuous spike-and-slab prior introduced by [31], demonstrating its consistency in selecting the true model in logistic regression. [38] extended this algorithm to the probit model, further proving its theoretical properties.
Building on this earlier work, we propose the hierarchical skinny Gibbs sampler with t-link (HSGT). The HSGT method extends the skinny Gibbs sampler framework by incorporating the t-distribution in the link function. Furthermore, we enhance the algorithms of [32] and [38] by introducing a hierarchical prior on the variance component of the active covariates. This update eliminates the time-consuming and problematic tuning process required by previous approaches. The advantages of our method are twofold: (1) the t-link provides a flexible and robust family of link models, with probit and logit as special cases, enabling a more accurate representation of underlying biological and medical phenomena; theoretically, the probit link is recovered as $v\to \infty $, the logit link is well approximated by a t-link with a specific value of v [1, 32], and the robustness of the t-link in estimation and inference under outliers, both in covariates and responses, is well established in the literature [40]; and (2) the skinny Gibbs sampler ensures computational efficiency and scalability to large datasets, a critical requirement in genomics and other high-dimensional fields.
The paper is organized as follows. In Section 2, we introduce the model for variable selection in t-link binary regression and discuss the hierarchical Gibbs algorithms, including the exact Gibbs (HEGT) and the skinny Gibbs (HSGT) methods. The theoretical foundations of the HSGT method, along with consistency results, are detailed in Section 3, with comprehensive proofs provided in Appendix A. Simulation studies evaluating model selection under various scenarios are presented in Section 4, with a particular focus on comparing the performance of the t-link and probit link models, both with and without outliers. In Section 5, we apply the proposed method to a genomic dataset, providing a thorough comparison in prediction accuracy among models. Section 6 concludes with a discussion.
2 Methodology
We consider binary classification in a high-dimensional setting, focusing on a t-link function to model binary responses in medical or genomic studies where outliers and heavy-tailed distributions are observed, for instance, when identifying key genetic variants (SNPs) associated with the presence or absence of a disease (e.g., cancer). Let ${E_{i}}\in \{0,1\}$ represent the binary response variable, where ${E_{i}}=1$ indicates that the ith subject has the disease and ${E_{i}}=0$ otherwise, $i=1,\dots ,n$. The covariate vector ${\boldsymbol{x}_{i}}={({x_{i1}},{x_{i2}},\dots ,{x_{ip}})^{\prime }}\in {\mathcal{R}^{p}}$ contains the expression levels of p genes for the ith subject. The probability of ${E_{i}}=1$ is modeled through the generalized linear regression framework as
(2.1)
\[ P({E_{i}}=1\mid {x_{i}},\boldsymbol{\beta },v)=\Psi ({\boldsymbol{x}^{\prime }_{i}}\boldsymbol{\beta },v),\]
where $\Psi (\cdot )$ is the cumulative distribution function of a distribution, such as the standard normal distribution for the probit link and the standard t-distribution for the t-link. The parameter v may be a scalar or a vector of parameters that is fixed or estimated in the regression model. In the t-link regression, v represents the degrees of freedom. The vector of regression coefficients $\boldsymbol{\beta }$ is $p\times 1$ and satisfies $\| \boldsymbol{\beta }{\| _{0}}\ll p$, meaning that only a small subset of the components of $\boldsymbol{\beta }$ are non-zero. This sparsity assumption is especially relevant in high-dimensional contexts where the number of predictors p is much larger than the sample size n.
The observed-data likelihood function of all n observations is given as
(2.2)
\[ {L_{n}}\left(\boldsymbol{\beta }\mid \boldsymbol{E},\boldsymbol{X},v\right)={\prod \limits_{i=1}^{n}}{\left[\Psi ({\boldsymbol{x}^{\prime }_{i}}\boldsymbol{\beta },\hspace{2.5pt}v)\right]^{{E_{i}}}}{\left[1-\Psi ({\boldsymbol{x}^{\prime }_{i}}\boldsymbol{\beta },\hspace{2.5pt}v)\right]^{(1-{E_{i}})}}.\]
Using the latent variable approach [1] and representing the t-distribution as a scale mixture of Gaussians, the complete-data likelihood function is expressed as
(2.3)
\[\begin{array}{r}\displaystyle L(\boldsymbol{\beta },\boldsymbol{Z},\boldsymbol{\omega }\mid \boldsymbol{E},\boldsymbol{X},v)\hspace{99.58464pt}\\ {} \displaystyle \hspace{2em}={\prod \limits_{i=1}^{n}}\mathcal{N}({Z_{i}}\mid {\boldsymbol{x}^{\prime }_{i}}\boldsymbol{\beta },{\omega _{i}^{2}})\mathcal{IG}({\omega _{i}^{2}}\mid v/2,v/2)\times \\ {} \displaystyle \hspace{2em}\bigg({E_{i}}\mathcal{I}({Z_{i}}\ge 0)+(1-{E_{i}})\mathcal{I}({Z_{i}}\lt 0)\bigg),\end{array}\]
where ${Z_{i}}\mid {\omega _{i}^{2}}$ is normal with mean ${\boldsymbol{x}^{\prime }_{i}}\boldsymbol{\beta }$ and variance ${\omega _{i}^{2}}$, and ${\omega _{i}^{2}}$ follows an inverse-gamma distribution with both shape and scale equal to $v/2$. The aim is to efficiently infer the sparsity pattern encoded in the regression coefficients $\boldsymbol{\beta }$ through a hierarchical Bayesian model with consistency guarantees. Throughout this methodological exposition, we assume that the columns of $\boldsymbol{X}$ are standardized to have zero mean and unit variance.
2.1 Variable Selection with Spike-and-Slab Priors
The spike-and-slab prior is used to induce sparsity on $\boldsymbol{\beta }$ [17, 25]. This approach employs a binary latent vector $\boldsymbol{\gamma }$ of dimension p, corresponding to the number of covariates, where each element ${\gamma _{j}}$ indicates whether the jth covariate is active $({\gamma _{j}}=1)$ or inactive $({\gamma _{j}}=0)$, $j=1,\dots ,p$. The prior distribution for the regression coefficients reflects this status: a “spike” component centered near zero when ${\gamma _{j}}=0$, or a “slab” component with a diffuse probability density when ${\gamma _{j}}=1$. Building on this framework, [33] demonstrated that allowing the variances of the spike-and-slab components to depend on the sample size ensures that, as the sample size grows, the posterior probability of selecting the true model converges to one.
Let ${\boldsymbol{\beta }_{k}}$ denote the coefficients of the active covariates and ${\boldsymbol{\beta }_{{k^{c}}}}$ those of the inactive covariates, where $k=\{j:{\gamma _{j}}\ne 0\}$. Then ${\textstyle\sum _{j=1}^{p}}{\gamma _{j}}=|k|$ is the number of non-zero entries in $\boldsymbol{\gamma }$. Similarly, let ${\boldsymbol{X}_{k}}$ represent the sub-matrix of $\boldsymbol{X}$ with columns corresponding to the nonzero indices of $\boldsymbol{\gamma }$, while ${\boldsymbol{X}_{{k^{c}}}}$ includes the remaining columns associated with ${\gamma _{j}}=0$. The priors for the binary latent variables ${\gamma _{j}}$ and the corresponding regression coefficients ${\beta _{j}}$, $j=1,\dots ,p$, are specified as follows:
for some positive constants ${\tau _{0}^{2}}$, r, s and $q\in (0,1)$. The hyperparameters r and s are the shape and scale parameters of the inverse-gamma prior on ${\tau _{1}^{2}}$. The parameter ${\tau _{0}^{2}}$ is the variance of the spike component, and ${\tau _{1}^{2}}$ controls the variance of the slab component. The prior probability q is the probability that ${\beta _{j}}\ne 0$. The rates governing ${\tau _{0}^{2}}$ and q are as specified in [33, 32] for asymptotic considerations (i.e., $n{\tau _{0}^{2}}=o(1)$ and $q\sim {p^{-1}}$).
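To make the prior concrete, the following Python sketch draws $(\boldsymbol{\gamma },\boldsymbol{\beta },{\tau _{1}^{2}})$ from the spike-and-slab hierarchy described above. The specific rate choices (${\tau _{0}^{2}}=1/(n\log n)$ and $q=1/p$) are illustrative stand-ins for the asymptotic scalings $n{\tau _{0}^{2}}=o(1)$ and $q\sim {p^{-1}}$, not values prescribed by the paper, and the function name is ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_spike_slab_prior(p, n, r=1.0, s=1.0, rng=rng):
    """Draw (gamma, beta, tau1_sq) from the spike-and-slab prior.

    Illustrative rate choices consistent with the text:
    n * tau0^2 -> 0 and q ~ 1/p.
    """
    tau0_sq = 1.0 / (n * np.log(n))   # spike variance, n * tau0^2 = o(1)
    q = 1.0 / p                       # prior inclusion probability
    # IG(r, s) draw via the reciprocal of a Gamma(r, rate=s) variable
    tau1_sq = 1.0 / rng.gamma(shape=r, scale=1.0 / s)
    gamma = rng.binomial(1, q, size=p)                 # active indicators
    sd = np.where(gamma == 1, np.sqrt(tau1_sq), np.sqrt(tau0_sq))
    beta = rng.normal(0.0, sd)                         # spike or slab draw
    return gamma, beta, tau1_sq
```

Under these choices, inactive coefficients concentrate tightly around zero while active ones are drawn from the diffuse slab.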
2.2 Hierarchical Exact Gibbs Sampler (HEGT)
The full joint posterior distributions of $\boldsymbol{\beta }$, $\boldsymbol{Z}$, $\boldsymbol{\gamma }$, $\boldsymbol{\omega }$ and ${\tau _{1}^{2}}$ conditioned on X, E and v using the complete-data likelihood in (2.3) and the spike-and-slab prior specified in (2.4)-(2.7) is given as
(2.8)
\[\begin{array}{r}\displaystyle \pi (\boldsymbol{\beta }\mathbf{,}\boldsymbol{Z}\mathbf{,}\boldsymbol{\gamma }\mathbf{,}\boldsymbol{\omega },{\tau _{1}^{2}}\mid \boldsymbol{E}\mathbf{,}\boldsymbol{X},v)\hspace{119.50148pt}\\ {} \displaystyle \propto L(\boldsymbol{\beta },\boldsymbol{Z}\mathbf{,}\boldsymbol{\omega }\mid \boldsymbol{E}\mathbf{,}\boldsymbol{X},v)\cdot \pi (\boldsymbol{\beta }\mid \boldsymbol{\gamma },{\tau _{1}^{2}})\cdot \pi (\boldsymbol{\gamma })\cdot \pi ({\tau _{1}^{2}})\hspace{5.69046pt}\\ {} \displaystyle \hspace{2em}\propto \exp \left\{-\frac{1}{2}\Big[{(\mathbf{Z}-\boldsymbol{X}\boldsymbol{\beta })^{\prime }}\mathbf{W}(\mathbf{Z}-\boldsymbol{X}\boldsymbol{\beta })+{\boldsymbol{\beta }^{\mathbf{\prime }}}\boldsymbol{D}\boldsymbol{\beta }\Big]\right\}\times \\ {} \displaystyle {\prod \limits_{i=1}^{n}}\bigg({E_{i}}\mathcal{I}({Z_{i}}\ge 0)+(1-{E_{i}})\mathcal{I}({Z_{i}}\lt 0)\bigg)\times \hspace{28.45274pt}\\ {} \displaystyle {\prod \limits_{i=1}^{n}}\left({\left(\frac{1}{{\omega _{i}^{2}}}\right)^{\frac{v+3}{2}}}\exp \left(-\frac{v}{2{\omega _{i}^{2}}}\right)\right){\left[\frac{1-q}{{\tau _{0}}\sqrt{2\pi }}\right]^{p}}\times \hspace{19.91684pt}\\ {} \displaystyle {\left[\frac{{\tau _{0}}q}{{\tau _{1}}(1-q)}\right]^{|\boldsymbol{k}|}}\left({\bigg[\frac{1}{{\tau _{1}^{2}}}\bigg]^{r+1}}\exp \left(-\frac{s}{{\tau _{1}^{2}}}\right)\right),\hspace{36.98866pt}\end{array}\]
where $\mathbf{W}=\text{Diag}\left(1/{\omega _{1}^{2}},\dots ,1/{\omega _{n}^{2}}\right)$, $|\boldsymbol{k}|={\textstyle\sum _{j=1}^{p}}{\gamma _{j}}$ and $\mathbf{D}=\text{Diag}\left({\tau _{0}^{-2}}(\mathbf{1}-\boldsymbol{\gamma })+{\tau _{1}^{-2}}\boldsymbol{\gamma }\right)$.
(i) The conditional posterior distribution of ${\gamma _{j}}$ is
\[ {\gamma _{j}}|\cdot \sim \text{Bernoulli}\left(\frac{{d_{j}}}{1+{d_{j}}}\right),\hspace{0.2222em}\hspace{0.2222em}\hspace{0.2222em}j=1,\dots ,p,\]
where
\[ {d_{j}}=\frac{\pi ({\gamma _{j}}=1\mid {\boldsymbol{\gamma }_{-j}},\boldsymbol{\beta }\mathbf{,}\boldsymbol{X}\mathbf{,}\boldsymbol{W}\mathbf{,}\boldsymbol{E})}{\pi ({\gamma _{j}}=0\mid {\boldsymbol{\gamma }_{-j}},\boldsymbol{\beta }\mathbf{,}\boldsymbol{X}\mathbf{,}\boldsymbol{W}\mathbf{,}\boldsymbol{E})}=\frac{q\phi \left({\beta _{j}};\hspace{0.2222em}\hspace{0.2222em}0,{\tau _{1}^{2}}\right)}{(1-q)\phi \left({\beta _{j}};\hspace{0.2222em}\hspace{0.2222em}0,{\tau _{0}^{2}}\right)}.\]
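The odds ${d_{j}}$ involve only the two Gaussian densities, so step (i) reduces to a single Bernoulli draw per coordinate. A minimal sketch (with illustrative argument values; the helper names are ours):

```python
from math import sqrt, pi, exp

def normal_pdf(x, var):
    """Density of N(0, var) evaluated at x."""
    return exp(-x * x / (2.0 * var)) / sqrt(2.0 * pi * var)

def inclusion_prob(beta_j, q, tau0_sq, tau1_sq):
    """P(gamma_j = 1 | beta_j, ...) = d_j / (1 + d_j), as in step (i)."""
    d_j = (q * normal_pdf(beta_j, tau1_sq)) / (
        (1.0 - q) * normal_pdf(beta_j, tau0_sq))
    return d_j / (1.0 + d_j)
```

As expected, larger $|{\beta _{j}}|$ favors the slab and pushes the inclusion probability toward one.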
(ii) The conditional posterior distribution of ${\omega _{i}^{2}}$ is
\[ {\omega _{i}^{2}}\mid \cdot \sim \mathcal{IG}\left(\frac{1+v}{2},\hspace{1em}\frac{v+{({Z_{i}}-{\boldsymbol{x}_{\boldsymbol{i}}^{\mathbf{\prime }}}\boldsymbol{\beta })^{2}}}{2}\right).\]
(iii) The conditional posterior distribution of $\boldsymbol{\beta }$ is
\[ \boldsymbol{\beta }\mid \cdot \sim {\mathcal{N}_{p}}\left({\boldsymbol{V}^{\boldsymbol{-}\mathbf{1}}}{\boldsymbol{X}^{\mathbf{\prime }}}\boldsymbol{W}\boldsymbol{Z},\hspace{2.5pt}{\boldsymbol{V}^{-1}}\right),\]
where $\mathbf{V}={\mathbf{X}^{\prime }}\boldsymbol{W}\mathbf{X}+\mathbf{D}$.
(iv) The conditional posterior of ${Z_{i}}$ is
\[ \pi ({Z_{i}}\mid \cdot )\propto \left\{\begin{array}{l}\mathcal{N}({Z_{i}};\hspace{0.1667em}{\boldsymbol{x}^{\prime }_{i}}\boldsymbol{\beta },\hspace{0.1667em}{\omega _{i}^{2}})\mathcal{I}({Z_{i}}\ge 0),\hspace{1em}\text{if}\hspace{2.5pt}{E_{i}}=1,\hspace{1em}\\ {} \mathcal{N}({Z_{i}};\hspace{0.1667em}{\boldsymbol{x}^{\prime }_{i}}\boldsymbol{\beta },\hspace{0.1667em}{\omega _{i}^{2}})\mathcal{I}({Z_{i}}\lt 0),\hspace{1em}\text{if}\hspace{2.5pt}{E_{i}}=0.\hspace{1em}\end{array}\right.\]
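Steps (ii) and (iv) are a conjugate inverse-gamma draw and a truncated-normal draw, respectively. A minimal Python sketch, using scipy for the truncated normal (the function names are ours):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def sample_omega_sq(z_i, mu_i, v, rng=rng):
    """Step (ii): omega_i^2 | . ~ IG((1+v)/2, (v + (z_i - mu_i)^2)/2)."""
    shape = (1.0 + v) / 2.0
    scale = (v + (z_i - mu_i) ** 2) / 2.0
    # IG draw via the reciprocal of a Gamma(shape, rate=scale) variable
    return 1.0 / rng.gamma(shape, 1.0 / scale)

def sample_z(e_i, mu_i, omega_sq):
    """Step (iv): N(mu_i, omega_i^2) truncated to [0, inf) if E_i = 1,
    and to (-inf, 0) if E_i = 0."""
    sd = np.sqrt(omega_sq)
    if e_i == 1:
        a, b = (0.0 - mu_i) / sd, np.inf      # standardized lower bound 0
    else:
        a, b = -np.inf, (0.0 - mu_i) / sd     # standardized upper bound 0
    return stats.truncnorm.rvs(a, b, loc=mu_i, scale=sd)
```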
(v) The conditional posterior distribution of ${\tau _{1}^{2}}$ is
\[ {\tau _{1}^{2}}\mid \cdot \sim \mathcal{IG}\left(r+\frac{|\boldsymbol{k}|}{2},\hspace{1em}s+\frac{{\boldsymbol{\beta }^{\prime }_{k}}{\boldsymbol{\beta }_{k}}}{2}\right),\]
which follows from combining the $\mathcal{IG}(r,s)$ prior with the Gaussian slab components in (2.8).
2.3 Hierarchical Skinny Gibbs Sampler(HSGT)
The computational complexity of the above algorithm is primarily driven by sampling $\boldsymbol{\beta }$ in step (iii), which involves handling a $p\times p$ precision matrix $\boldsymbol{V}$. Calculating its inverse $({\boldsymbol{V}^{-1}})$ has a complexity of $O({p^{3}})$, while computing the product ${\boldsymbol{V}^{-1}}{\boldsymbol{X}^{\prime }}\boldsymbol{W}\boldsymbol{Z}$ introduces an additional complexity of $O({p^{2}}n)$. As a result, the total computational cost of HEGT is $O({p^{2}}(p\vee n))$, which increases rapidly with p. This high complexity makes the algorithm computationally expensive in terms of both time and memory, particularly within a Markov chain Monte Carlo (MCMC) framework. [32] introduced the skinny Gibbs algorithm to address this computational bottleneck. The skinny Gibbs algorithm is a simple yet effective modification of the Gibbs sampler, designed for high-dimensional settings with many predictors. At each MCMC iteration, $\boldsymbol{\beta }$ is divided into two components: the active component (${\boldsymbol{\beta }_{k}}$) and the inactive component (${\boldsymbol{\beta }_{{k^{c}}}}$). Sparsity is imposed by structuring the precision matrix ($\boldsymbol{V}$) of $\boldsymbol{\beta }$ as a two-block diagonal matrix, assuming full independence between the active and inactive components of $\boldsymbol{\beta }$. That is,
\[\begin{aligned}{}\boldsymbol{V}& =\left(\begin{array}{c@{\hskip10.0pt}c}{\boldsymbol{X}^{\prime }_{k}}{\boldsymbol{W}\boldsymbol{X}_{k}}+{\tau _{1}^{-2}}{\boldsymbol{I}_{|\boldsymbol{k}|}}& {\boldsymbol{X}^{\prime }_{k}}{\boldsymbol{W}\boldsymbol{X}_{{k^{c}}}}\\ {} {\boldsymbol{X}^{\prime }_{{k^{c}}}}{\boldsymbol{W}\boldsymbol{X}_{k}}& {\mathbf{X}^{\prime }_{{k^{c}}}}{\boldsymbol{X}_{{k^{c}}}}+{\tau _{0}^{-2}}{\boldsymbol{I}_{(p-|\boldsymbol{k}|)}}\end{array}\right)\\ {} & \approx \left(\begin{array}{c@{\hskip10.0pt}c}{\boldsymbol{X}^{\prime }_{k}}{\boldsymbol{W}\boldsymbol{X}_{k}}+{\tau _{1}^{-2}}{\boldsymbol{I}_{|\boldsymbol{k}|}}& \mathbf{0}\\ {} \mathbf{0}& (n-1+{\tau _{0}^{-2}}){\boldsymbol{I}_{(p-|\boldsymbol{k}|)}}\end{array}\right).\end{aligned}\]
The joint posterior of $\boldsymbol{\beta }$, $\boldsymbol{Z}$, $\boldsymbol{\omega }$, $\boldsymbol{\gamma }$ and ${\tau _{1}^{2}}$ under the skinny Gibbs sampler, conditioned on $\boldsymbol{X}$, $\boldsymbol{E}$ and v, is given as
(2.9)
\[\begin{array}{r}\displaystyle \pi (\boldsymbol{\beta },\boldsymbol{Z},\boldsymbol{\gamma },\boldsymbol{\omega },{\tau _{1}^{2}}\mid \boldsymbol{X},\boldsymbol{E},v)\hspace{142.26378pt}\\ {} \displaystyle \propto L({\boldsymbol{\beta }_{k}},\hspace{2.5pt}\boldsymbol{Z},\boldsymbol{\omega }\mid \boldsymbol{E},\hspace{2.5pt}{\boldsymbol{X}_{k}},v)\cdot \pi (\boldsymbol{\beta }\mid \boldsymbol{\gamma },{\tau _{1}^{2}})\cdot \pi (\boldsymbol{\gamma })\pi ({\tau _{1}^{2}})\hspace{14.22636pt}\\ {} \displaystyle \propto \exp \left\{-\frac{1}{2}\left({(\boldsymbol{Z}-{\boldsymbol{X}_{\boldsymbol{k}}}{\boldsymbol{\beta }_{\boldsymbol{k}}})^{\prime }}\boldsymbol{W}(\boldsymbol{Z}-{\boldsymbol{X}_{\boldsymbol{k}}}{\boldsymbol{\beta }_{\boldsymbol{k}}})+\frac{{\boldsymbol{\beta }^{\prime }_{k}}{\boldsymbol{\beta }_{k}}}{{\tau _{1}^{2}}}\right)\right\}\\ {} \displaystyle \times \exp \left\{-\frac{1}{2}\left(n-1+{\tau _{0}^{-2}}\right){\boldsymbol{\beta }^{\prime }_{{k^{c}}}}{\boldsymbol{\beta }_{{k^{c}}}}\right\}{\left[\frac{(1-q){\tau _{1}}}{q{\tau _{0}}}\right]^{-|\boldsymbol{k}|}}\times \\ {} \displaystyle {\prod \limits_{i=1}^{n}}\bigg({E_{i}}\mathcal{I}({Z_{i}}\ge 0)+(1-{E_{i}})\mathcal{I}({Z_{i}}\lt 0)\bigg)\times \hspace{48.36958pt}\\ {} \displaystyle {\prod \limits_{i=1}^{n}}\left({\left(\frac{1}{{\omega _{i}^{2}}}\right)^{\frac{v+3}{2}}}\exp \left(-\frac{v}{2{\omega _{i}^{2}}}\right)\right){\bigg[\frac{1}{{\tau _{1}^{2}}}\bigg]^{r+1}}\exp \left\{-\frac{s}{{\tau _{1}^{2}}}\right\}.\end{array}\]
(i) The conditional posterior distribution of $\boldsymbol{\beta }$ is
\[\begin{array}{r@{\hskip10.0pt}c@{\hskip10.0pt}l}\displaystyle {\boldsymbol{\beta }_{k}}\mid \cdot & \displaystyle \sim & \displaystyle {\mathcal{N}_{|\boldsymbol{k}|}}\left({\boldsymbol{V}_{k}^{-1}}{\boldsymbol{X}^{\prime }_{k}}\boldsymbol{W}\boldsymbol{Z},\hspace{2.5pt}{\mathbf{V}_{k}^{-1}}\right),\\ {} \displaystyle {\boldsymbol{\beta }_{{k^{c}}}}\mid \cdot & \displaystyle \sim & \displaystyle {\mathcal{N}_{p-|\boldsymbol{k}|}}\left(0,\hspace{2.5pt}{\boldsymbol{V}_{{k^{c}}}^{-1}}\right),\end{array}\]
where ${\boldsymbol{V}_{k}}={\boldsymbol{X}^{\prime }_{k}}{\boldsymbol{W}\boldsymbol{X}_{k}}+{\tau _{1}^{-2}}{\boldsymbol{I}_{|\boldsymbol{k}|}}$ and ${\boldsymbol{V}_{{k^{c}}}}=(n-1+{\tau _{0}^{-2}}){\boldsymbol{I}_{p-|\boldsymbol{k}|}}$. Compared with step (iii) in Section 2.2, the computation here has a much lower complexity of $O(n(p\vee |\boldsymbol{k}{|^{2}}))$. This reduction not only minimizes memory usage but also improves computational efficiency, making the MCMC algorithm more practical for high-dimensional data settings.
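A sketch of the skinny update for $\boldsymbol{\beta }$, assuming the relevant quantities are available as NumPy arrays: the active block is sampled via a $|\boldsymbol{k}|\times |\boldsymbol{k}|$ Cholesky factor and the inactive block as independent normals, which is exactly where the complexity saving comes from. This is our illustrative implementation, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_beta_skinny(X, W_diag, Z, active, tau0_sq, tau1_sq, rng=rng):
    """Skinny Gibbs step (i): solve the active block exactly and draw the
    inactive block as independent normals, never forming the p x p matrix."""
    n, p = X.shape
    k = np.flatnonzero(active)
    Xk = X[:, k]
    # Active block: V_k = Xk' W Xk + I / tau1^2  (size |k| x |k|)
    Vk = Xk.T @ (W_diag[:, None] * Xk) + np.eye(len(k)) / tau1_sq
    L = np.linalg.cholesky(Vk)
    mean_k = np.linalg.solve(Vk, Xk.T @ (W_diag * Z))
    # mean + L^{-T} eps has covariance (L L')^{-1} = V_k^{-1}
    beta_k = mean_k + np.linalg.solve(L.T, rng.standard_normal(len(k)))
    # Inactive block: independent N(0, 1 / (n - 1 + 1/tau0^2))
    var_kc = 1.0 / (n - 1.0 + 1.0 / tau0_sq)
    beta = np.sqrt(var_kc) * rng.standard_normal(p)
    beta[k] = beta_k
    return beta
```

The Cholesky solve touches only the active block, so the per-iteration cost scales with $|\boldsymbol{k}|$ rather than p.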
(ii) The conditional posterior distribution of ${\omega _{i}^{2}}$ is
\[ {\omega _{i}^{2}}\mid \cdot \sim \mathcal{IG}\left(\frac{1+v}{2},\hspace{1em}\frac{v+{({Z_{i}}-{\boldsymbol{x}^{\prime }_{ik}}{\boldsymbol{\beta }_{k}})^{2}}}{2}\right),\]
for $i=1,\dots ,n$.
(iii) The conditional posterior distribution of ${Z_{i}}$ is
\[ \pi ({Z_{i}}\mid \cdot )\propto \left\{\begin{array}{l}\mathcal{N}({Z_{i}};\hspace{0.1667em}{\boldsymbol{x}^{\prime }_{i,k}}{\boldsymbol{\beta }_{k}},\hspace{0.1667em}{\omega _{i}^{2}})\mathcal{I}({Z_{i}}\ge 0),\hspace{1em}\text{if}\hspace{2.5pt}{E_{i}}=1,\hspace{1em}\\ {} \mathcal{N}({Z_{i}};\hspace{0.1667em}{\boldsymbol{x}^{\prime }_{i,k}}{\boldsymbol{\beta }_{k}},\hspace{0.1667em}{\omega _{i}^{2}})\mathcal{I}({Z_{i}}\lt 0),\hspace{1em}\text{if}\hspace{2.5pt}{E_{i}}=0,\hspace{1em}\end{array}\right.\]
for $i=1,\dots ,n$.
(iv) The conditional posterior distribution of ${\gamma _{j}}$ is
\[ {\gamma _{j}}|\cdot \sim \text{Bernoulli}\left(\frac{{\tilde{d}_{j}}}{1+{\tilde{d}_{j}}}\right),\hspace{0.2778em}\hspace{0.2222em}\hspace{0.2222em}j=1,\dots ,p,\]
where
\[\begin{array}{r@{\hskip10.0pt}c@{\hskip10.0pt}l}\displaystyle {\tilde{d}_{j}}& \displaystyle =& \displaystyle \frac{\pi ({\gamma _{j}}=1\mid {\boldsymbol{\gamma }_{-j}},\boldsymbol{\beta }\mathbf{,}\boldsymbol{X}\mathbf{,}\boldsymbol{W}\mathbf{,}\boldsymbol{E})}{\pi ({\gamma _{j}}=0\mid {\boldsymbol{\gamma }_{-j}},\boldsymbol{\beta }\mathbf{,}\boldsymbol{X}\mathbf{,}\boldsymbol{W}\mathbf{,}\boldsymbol{E})}\\ {} & \displaystyle =& \displaystyle {d_{j}}\cdot \exp \left\{\frac{1}{2}{\boldsymbol{X}^{\prime }_{j}}(\boldsymbol{I}-\boldsymbol{W}){\boldsymbol{X}_{j}}{\beta _{j}^{2}}+\right.\\ {} & & \displaystyle \left.{(\boldsymbol{Z}-{\boldsymbol{X}_{{k_{-j}}}}{\boldsymbol{\beta }_{{k_{-j}}}})^{\prime }}{\boldsymbol{W}\boldsymbol{X}_{j}}{\beta _{j}}\right\}.\end{array}\]
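The adjusted odds can be transcribed directly into NumPy (our naming; `W_diag` holds the diagonal of $\boldsymbol{W}$, and `d_j` is the unadjusted odds from Section 2.2):

```python
import numpy as np

def skinny_inclusion_odds(d_j, X, W_diag, Z, beta, j, active):
    """Adjusted odds for step (iv):
    d_tilde_j = d_j * exp{ 0.5 * X_j'(I - W) X_j * beta_j^2
                           + (Z - X_{k_-j} beta_{k_-j})' W X_j * beta_j }."""
    Xj = X[:, j]
    # active coordinates other than j
    k_minus_j = [l for l in np.flatnonzero(active) if l != j]
    resid = Z - X[:, k_minus_j] @ beta[k_minus_j]
    log_adj = (0.5 * Xj @ ((1.0 - W_diag) * Xj) * beta[j] ** 2
               + (resid * W_diag) @ Xj * beta[j])
    return d_j * np.exp(log_adj)
```

Note that when $\boldsymbol{W}=\boldsymbol{I}$ and ${\beta _{j}}=0$, the adjustment vanishes and ${\tilde{d}_{j}}={d_{j}}$.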
(v) The conditional posterior distribution of ${\tau _{1}^{2}}$ is
\[ {\tau _{1}^{2}}\mid \cdot \sim \mathcal{IG}\left(r+\frac{|\boldsymbol{k}|}{2},\hspace{1em}s+\frac{{\boldsymbol{\beta }^{\prime }_{k}}{\boldsymbol{\beta }_{k}}}{2}\right),\]
as in Section 2.2.
3 Theoretical Properties
In this section, we present the asymptotic properties of the hierarchical prior setup of the algorithm for binary regression with the Student t-link. Let $\boldsymbol{t}=\{{t_{1}},{t_{2}},\dots ,{t_{|\boldsymbol{t}|}}\}\subseteq [p]$ represent the true model, i.e., the set of non-zero locations of the true coefficient vector. We assume $|\boldsymbol{t}|$ is fixed. Let ${\boldsymbol{\beta }_{0}}\in {\mathcal{R}^{p}}$ be the true coefficient vector, and ${\boldsymbol{\beta }_{0,t}}\in {\mathcal{R}^{|\boldsymbol{t}|}}$ the vector of true non-zero coefficients.
For a given model with active components indexed by $\boldsymbol{k}\subseteq [p]$ of size $|\boldsymbol{k}|$ at most ${m_{n}}$, we define ${\ell _{n}}({\boldsymbol{\beta }_{k}})$ as the log-likelihood and ${S_{n}}({\boldsymbol{\beta }_{k}})=\partial {\ell _{n}}({\boldsymbol{\beta }_{k}})/\partial {\boldsymbol{\beta }_{k}}$ as the score function. Restricting the largest model size is common practice in Bayesian high-dimensional asymptotic theory to control the posterior ratio and achieve the desired consistency results. Variable selection consistency of the skinny Gibbs sampler has been established under the logit link [32] and the probit link [38], respectively. Here, we extend the consistency results to a t-link model with known degrees of freedom.
The Hessian matrix of ${\ell _{n}}({\boldsymbol{\beta }_{k}})$ is defined as
\[\begin{array}{r@{\hskip10.0pt}c@{\hskip10.0pt}l}\displaystyle {H_{n}}({\boldsymbol{\beta }_{k}})=& \displaystyle -\frac{{\partial ^{2}}{\ell _{n}}}{\partial {\boldsymbol{\beta }_{k}}\partial {\boldsymbol{\beta }^{\prime }_{k}}}& \displaystyle ={\sum \limits_{i=1}^{n}}{\boldsymbol{x}_{i,k}}{\boldsymbol{x}^{\prime }_{i,k}}{w_{i}}({\boldsymbol{\beta }_{k}})\\ {} & & \displaystyle ={\boldsymbol{X}^{\prime }_{k}}\boldsymbol{W}({\boldsymbol{\beta }_{k}}){\boldsymbol{X}_{k}},\end{array}\]
where $\boldsymbol{W}({\boldsymbol{\beta }_{k}})=\text{Diag}\{{w_{i}}({\boldsymbol{\beta }_{k}})\}$ for $i=1,\dots ,n$ with ${w_{i}}({\boldsymbol{\beta }_{k}})$ as
\[\begin{array}{r@{\hskip10.0pt}c@{\hskip10.0pt}l}& & \displaystyle {E_{i}}\left\{\frac{(v+1)}{v}{\left(1+\frac{{\mu _{i,k}^{2}}}{v}\right)^{-1}}{\mu _{i,k}}\hspace{2.5pt}{r_{i,k}}+{r_{i,k}^{2}}\right\}+\\ {} & & \displaystyle (1-{E_{i}})\cdot \left\{-\frac{(v+1)}{v}{\left(1+\frac{{\mu _{i,k}^{2}}}{v}\right)^{-1}}{\mu _{i,k}}\hspace{2.5pt}{R_{i,k}}+{R_{i,k}^{2}}\right\}\end{array}\]
with ${r_{i,k}}=\psi ({\mu _{i,k}},\hspace{2.5pt}v)/\Psi ({\mu _{i,k}},\hspace{2.5pt}v)$ and ${R_{i,k}}=\psi ({\mu _{i,k}},\hspace{2.5pt}v)/[1-\Psi ({\mu _{i,k}},\hspace{2.5pt}v)]$, where $\psi ({\mu _{i,k}},\hspace{2.5pt}v)$ and $\Psi ({\mu _{i,k}},\hspace{2.5pt}v)$ are the density and cumulative distribution functions of a standard t distribution with v degrees of freedom evaluated at ${\mu _{i,k}}$, and ${\mu _{i,k}}={\boldsymbol{x}^{\prime }_{i,k}}{\boldsymbol{\beta }_{k}}$.
Before presenting the main results, we introduce the following notation: (i) for any $x,y\in \mathbb{R}$, $x\vee y$ and $x\wedge y$ represent the maximum and minimum of x and y, respectively. For any positive real sequences ${x_{n}}$ and ${y_{n}}$, (ii) ${x_{n}}\lesssim {y_{n}}$ or ${x_{n}}=O({y_{n}})$ means there exists a constant $C\gt 0$ such that $|{x_{n}}|\le C|{y_{n}}|$ for all large n; (iii) ${x_{n}}\ll {y_{n}}$ or ${x_{n}}=o({y_{n}})$ if ${x_{n}}/{y_{n}}\to 0$ as $n\to \infty $; (iv) ${x_{n}}\sim {y_{n}}$ if there exist constants ${K_{1}}\gt {K_{2}}\gt 0$ such that ${K_{2}}\lt {x_{n}}/{y_{n}}\lt {K_{1}}$ for all large n.
Let $\boldsymbol{a}={({a_{1}},{a_{2}},\dots ,{a_{p}})^{\prime }}\in {\mathbb{R}^{p}}$; then $\| \boldsymbol{a}{\| _{2}}$ and $\| \boldsymbol{a}{\| _{1}}$ are the ${\ell _{2}}$ and ${\ell _{1}}$ norms of $\boldsymbol{a}$, respectively, and $\| \boldsymbol{a}{\| _{\text{max}}}={\text{max}_{1\le j\le p}}|{a_{j}}|$. For any real symmetric matrix $\boldsymbol{M}$, ${\lambda _{\max }}(\boldsymbol{M})$ and ${\lambda _{\min }}(\boldsymbol{M})$ are the maximum and minimum eigenvalues of $\boldsymbol{M}$, respectively. The following conditions are assumed to obtain the asymptotic results:
Condition (C1) (Condition on dimension (n and p)) For some constant $0\lt {d^{\prime }}\lt 1$, let $p\gg n$, $\log p=o(n)$ and ${m_{n}}=O\Big({(n/\log p)^{\frac{1-{d^{\prime }}}{2}}}\wedge p\Big)$, as $n\to \infty $.
Condition (C1) guarantees that our proposed method can handle high-dimensional settings in which the number of predictors increases at a sub-exponential rate relative to n. The parameter ${m_{n}}$ restricts the analysis to models whose size is not too large. To avoid over-fitting, it is also required that ${m_{n}}\ll n$. Similar model-size restrictions are frequently encountered in the sparse estimation literature [14, 34, 28].
Condition (C2) (Condition on Design Matrix $\boldsymbol{X}$) For ${\boldsymbol{x}_{\boldsymbol{i}}}\in {\mathcal{R}^{p}},\hspace{1em}i=1,2,\dots ,n$.
-
i. (Boundedness) $P(||{\boldsymbol{x}_{\boldsymbol{i}}}|{|_{max}}\le {C_{0}})=1$ for some ${C_{0}}\gt 0$, where $P(\cdot )$ is the probability measure.
-
ii. (Sub-Gaussianity) For every $\boldsymbol{\alpha }\in {\mathcal{R}^{p}}$, there exists a constant ${C_{\alpha }}\gt 0$ such that $\bar{E}\exp ({\boldsymbol{x}^{\prime }_{i}}\boldsymbol{\alpha })\le \exp ({C_{\alpha }}||\boldsymbol{\alpha }|{|_{2}^{2}})$.
Condition (C2)[i] requires each component of ${\boldsymbol{x}_{i}}$ to be bounded with probability 1, as adopted by [32] for a deterministic design matrix. Condition (C2)[ii] implies that any linear combination of ${\boldsymbol{x}_{i}}={({x_{i1}},\dots ,{x_{ip}})^{\prime }}$ has a sufficiently light tail satisfying sub-Gaussianity. See similar conditions under logistic regression in [32, 8].
Condition (C3) (Conditions on True Model) For some constant ${c_{0}}\gt 0$,
\[ \underset{j\in t}{\min }\| {\beta _{0,j}}{\| _{2}^{2}}\ge {c_{0}}|\boldsymbol{t}|{\Lambda _{|\boldsymbol{t}|}}\Big(\frac{\log p}{n}\vee \frac{1}{\log p}\Big),\]
and ${\Lambda _{|\boldsymbol{t}|}}={\max _{\boldsymbol{k}:\hspace{0.2222em}|\boldsymbol{k}|\le {c_{0}}}}{\lambda _{\max }}({n^{-1}}{\boldsymbol{X}^{\prime }_{k}}{\boldsymbol{X}_{k}})$ for any integer ${c_{0}}\gt 0$.
This condition (also known as the beta-min condition) guarantees that the true regression coefficients ${\boldsymbol{\beta }_{0}}$ have a finite number of nonzero entries, each of sufficient magnitude, and a bounded ${\ell _{1}}$-norm. It holds if we assume that the true inactive components (${\boldsymbol{\beta }_{{t^{c}}}}$) may be nonzero but satisfy $\| {\boldsymbol{X}_{{t^{c}}}}{\boldsymbol{\beta }_{{t^{c}}}}{\| _{2}}=O(1)$, in line with a similar assumption in [33].
Condition (C4) (Conditions on prior hyperparameters) $n{\tau _{0}^{2}}=O(1)$ and, for some constants ${a_{1}},{a_{2}}\gt 0$, the hyperparameters satisfy ${a_{1}}\lt r,s\lt {a_{2}}$ and ${q^{2}}\sim {p^{-1}}$, where r and s are the shape and scale parameters of the inverse-gamma prior on ${\tau _{1}^{2}}$ defined in Section 2.1.
Condition (C4) specifies appropriate conditions for the hyperparameters. Similar assumptions have been considered in [43], [26] and [9]. We impose a prior on ${\tau _{1}^{2}}$ with its value estimated within the MCMC iterations, which eliminates the need for the tuning procedure required in [32].
Lemma 1 (Pseudoposterior of $\boldsymbol{\gamma }$).
The marginal posterior probability of any model $\boldsymbol{k}$, ${m_{k}}(\boldsymbol{E})$, under the skinny Gibbs sampler is given by
\[\begin{array}{r@{\hskip10.0pt}c}\displaystyle {m_{k}}(\boldsymbol{E})=\pi (\boldsymbol{k}\mid \boldsymbol{X}\mathbf{,}\boldsymbol{E})\propto & \displaystyle {\big\{2\pi {v_{n}^{-2}}(n-1+{\tau _{0}^{-2}})\big\}^{\frac{|\boldsymbol{k}|}{2}}}\times \\ {} & \displaystyle \textstyle\int \textstyle\int {({\tau _{1}^{2}})^{-\frac{|\boldsymbol{k}|}{2}}}\pi ({\tau _{1}^{2}})\exp \big\{{\ell _{n}}({\boldsymbol{\beta }_{k}})\big\}\\ {} & \displaystyle \exp \Big(-\frac{1}{2{\tau _{1}^{2}}}\| {\boldsymbol{\beta }_{k}}{\| _{2}^{2}}\Big)d{\boldsymbol{\beta }_{k}}d{\tau _{1}^{2}},\end{array}\]
where ${v_{n}}=(1-q)/({\tau _{0}}q)$.
Proof of Lemma 1.
\[\begin{aligned}{}{m_{k}}(\boldsymbol{E})=& \pi (\boldsymbol{k}\mid \boldsymbol{X}\mathbf{,}\boldsymbol{E})\\ {} \propto & {v_{n}^{-|\boldsymbol{k}|}}\int \int \exp \left\{{\ell _{n}}\left({\boldsymbol{\beta }_{k}}\right)\right\}\exp \left\{-\frac{1}{2{\tau _{1}^{2}}}{\boldsymbol{\beta }^{\prime }_{k}}{\boldsymbol{\beta }_{k}}\right\}\times \\ {} & \exp \left\{-\frac{1}{2}\left(n-1+{\tau _{0}^{-2}}\right){\boldsymbol{\beta }^{\prime }_{{k^{c}}}}{\boldsymbol{\beta }_{{k^{c}}}}\right\}d{\boldsymbol{\beta }_{{k^{c}}}}d{\boldsymbol{\beta }_{k}}d{\tau _{1}^{2}}\\ {} \propto & {v_{n}^{-|\boldsymbol{k}|}}{\left\{2\pi \left(n-1+{\tau _{0}^{-2}}\right)\right\}^{-\frac{p-|\boldsymbol{k}|}{2}}}\times \\ {} & \int \int \exp \left\{{\ell _{n}}\left({\boldsymbol{\beta }_{k}}\right)\right\}\exp \left(-\frac{1}{2{\tau _{1}^{2}}}{\boldsymbol{\beta }^{\prime }_{k}}{\boldsymbol{\beta }_{k}}\right)d{\boldsymbol{\beta }_{k}}d{\tau _{1}^{2}}\\ {} \propto & {\left\{2\pi {v_{n}^{-2}}\left(n-1+{\tau _{0}^{-2}}\right)\right\}^{\frac{|\boldsymbol{k}|}{2}}}\times \\ {} & \int \int \exp \left\{{\ell _{n}}\left({\boldsymbol{\beta }_{k}}\right)\right\}\exp \left(-\frac{1}{2{\tau _{1}^{2}}}{\boldsymbol{\beta }^{\prime }_{k}}{\boldsymbol{\beta }_{k}}\right)d{\boldsymbol{\beta }_{k}}d{\tau _{1}^{2}}.\end{aligned}\]
□
Lemma 1 is used to prove Theorem 1, which demonstrates that the proposed method guarantees posterior ratio consistency.
3.1 Model Selection Consistency
Theorem 1 (Posterior ratio consistency).
Assume Conditions (C1)–(C4) hold. Then, for any $\boldsymbol{k}\ne \boldsymbol{t}$, the posterior ratio $PR(\boldsymbol{k},\boldsymbol{t})=\frac{\pi (\boldsymbol{k}\mid \boldsymbol{X}\mathbf{,}\boldsymbol{E})}{\pi (\boldsymbol{t}\mid \boldsymbol{X}\mathbf{,}\boldsymbol{E})}$ satisfies
(3.1)
\[ PR(\boldsymbol{k},\boldsymbol{t})\stackrel{P}{\to }0\hspace{1em}\hspace{2.5pt}\textit{as}\hspace{2.5pt}n\to \infty ,\]
where $\boldsymbol{t}\subseteq [p]$ is the true model.
Theorem 1 shows that, asymptotically, our posterior avoids over-fitting by not including an excessive number of unnecessary variables. However, this does not guarantee that the posterior will concentrate on the true model. To capture all significant variables, the nonzero entries in ${\boldsymbol{\beta }_{0,t}}$ must have sufficiently large magnitudes. In the following theorem, we establish that our posterior achieves strong selection consistency, meaning that the posterior probability assigned to the true model t converges to 1. This requires a suitable condition on the lower bound for the magnitudes of the nonzero entries in ${\boldsymbol{\beta }_{0,t}}$. It is important to note that Theorem 1 does not require the smallest nonzero entries to be bounded away from 0. Furthermore, Theorem 2 guarantees posterior ratio consistency, which implies that the true model t will become the mode of the posterior as the sample size grows.
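In practice, posterior model probabilities, and hence the ratio $PR(\boldsymbol{k},\boldsymbol{t})$, can be estimated from the MCMC output by the visit frequencies of the sampled $\boldsymbol{\gamma }$ vectors. A minimal sketch (our helper, not part of the method):

```python
from collections import Counter

def posterior_ratio(gamma_draws, k, t):
    """Estimate PR(k, t) = pi(k | data) / pi(t | data) from MCMC draws of
    gamma, treating each model as a tuple of 0/1 indicators and using
    visit frequencies as Monte Carlo estimates of posterior probabilities."""
    counts = Counter(tuple(g) for g in gamma_draws)
    n_t = counts.get(tuple(t), 0)
    if n_t == 0:
        raise ValueError("true model t was never visited")
    return counts.get(tuple(k), 0) / n_t
```

Under posterior ratio consistency, this estimate should shrink toward zero for any $\boldsymbol{k}\ne \boldsymbol{t}$ as n grows.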
The proofs of Theorems 1 and 2 are provided in the appendix. The authors of [8] established consistency under hierarchical nonlocal priors in logistic regression, and we adapt their proof strategy to show that the asymptotic results also hold under spike-and-slab priors within the skinny Gibbs framework.
The current consistency results are established for variable selection under a specified regression model. That is, under Conditions (C1)–(C4), posterior concentration for $\boldsymbol{\beta }$ holds with all the other components of the regression model fixed, including the random distribution of the response variable and the link function. Allowing the degrees of freedom parameter ν to be unknown incorporates the link function determination into the model selection process. Because of the interplay between the regression coefficients and the link function, additional conditions beyond (C1)–(C4) are required to ensure model identifiability and consistency.
4 Simulation
We assess the performance of the t-links, the logit link, and the probit link in variable selection, particularly in the presence of outlying observations. In addition, we evaluate performance under both the exact and the skinny Gibbs algorithms. For each data replicate, ten Bayesian models are fitted and compared: the hierarchical skinny Gibbs t-link model (HSGT) and the hierarchical exact Gibbs t-link model (HEGT), each with $\nu =1$ (Cauchy), 3, and 7; the skinny Gibbs logit model (SLogit) [37]; the exact Gibbs logit model (ELogit); the skinny Gibbs probit model (SProbit); and the exact Gibbs probit model (EProbit) [38]. Among integer values of ν, $\nu =1$ gives the heaviest tails, while $\nu =7$ is sometimes used as an approximation to the logit link. In addition, three LASSO models with different links are fitted to provide a frequentist benchmark and to further illustrate the robustness of the link functions under alternative estimation methods.
Here we treat ν as known and fix it at a specific value. The proposed algorithm is flexible and can be applied to any value of ν. The “best” ν for a given dataset can be chosen ad hoc using standard model selection criteria, such as cross-validation, DIC, or WAIC. All the theoretical properties hold for any fixed value of ν.
In the simulation, the first $|\boldsymbol{k}|$ columns of the design matrix $\boldsymbol{X}$ correspond to active covariates with nonzero coefficients, while the remaining columns represent inactive covariates with zero regression coefficients. Each row ${\boldsymbol{x}_{i}}$ of $\boldsymbol{X}$ is independently generated from a normal distribution with a p-dimensional covariance matrix Σ. The binary response $\boldsymbol{E}$ is generated from a Bernoulli distribution with success probability $\Psi (\boldsymbol{X}\boldsymbol{\beta },\nu )$, where $\Psi (\cdot )$ is the cumulative distribution function of a standard normal for the probit link ($\nu \to \infty $), of a standard logistic distribution for the logit link, or of a standard t distribution with degrees of freedom $\nu =1,3,7$ for the t-links. We set $|\boldsymbol{k}|=4$, $n=100$, and $\boldsymbol{\beta }={(3,1.5,1.0,0.5)^{\prime }}$. We vary the true link ${\Psi ^{-1}}$ (probit, logit, or Cauchy ($\nu =1$)), the dimension of $\boldsymbol{X}$ ($p=100$ or 500), and its correlation structure (independent or dependent). For independent $\boldsymbol{X}$, the covariance matrix is $\boldsymbol{\Sigma }=\boldsymbol{I}$; for correlated covariates, Σ is structured into two blocks corresponding to ${\boldsymbol{X}_{k}}$ and ${\boldsymbol{X}_{{k^{c}}}}$, with a constant correlation of 0.1 between blocks and, within each block, an AR(1) correlation structure with autocorrelation $\rho =0.5$ across the covariates. This yields a total of 12 scenarios, and 50 data replicates are generated under each.
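As an illustration, the data-generating scheme above can be sketched as follows. This is not the authors' code: the function and argument names are ours, and the block-covariance construction follows our reading of the description. The t CDF with ν = 1 gives the Cauchy link, and large ν approximates the probit.

```python
import numpy as np
from scipy import stats

def simulate_tlink_data(n=100, p=500, beta_active=(3.0, 1.5, 1.0, 0.5),
                        nu=1, rho=0.5, between=0.1, correlated=True,
                        rng=None):
    """Simulate binary responses from a t-link model (nu=1 gives the Cauchy link)."""
    rng = np.random.default_rng(rng)
    k = len(beta_active)
    beta = np.zeros(p)
    beta[:k] = beta_active

    if correlated:
        # Constant correlation 0.1 between blocks, AR(1) with autocorrelation
        # rho within the active block (first k columns) and the inactive block.
        Sigma = np.full((p, p), between)
        for lo, hi in [(0, k), (k, p)]:
            idx = np.arange(lo, hi)
            Sigma[np.ix_(idx, idx)] = rho ** np.abs(idx[:, None] - idx[None, :])
    else:
        Sigma = np.eye(p)

    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    # Success probability: CDF of a standard t with nu df at x_i' beta.
    prob = stats.t.cdf(X @ beta, df=nu)
    E = rng.binomial(1, prob)
    return X, E, beta
```

With `nu=7` the fitted probabilities are close to those of a logit link, which is the sense in which $\nu =7$ approximates it.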
For each simulated dataset, we also generate two modified datasets identical to the original except for one observation. The first is altered to create a “bad leverage point,” as described in [22]: an observation is randomly selected from among those with ${E_{i}}=1$, and the value of its first covariate is changed to −10 (note that the regression coefficient associated with the first covariate is 3). It is called a “bad leverage point” because it exerts leverage while contradicting the association between the response and the explanatory variable observed in the rest of the data. The second creates a “non-leverage outlier” by selecting the observation with the largest magnitude of ${E_{i}}-\Psi ({\mathbf{x}_{i}}\boldsymbol{\beta })$ and setting its binary response to $1-{E_{i}}$. Comparing the fits of the ten Bayesian models and the three LASSO models, with and without the outlier, highlights the robustness of the t-link model relative to the probit link.
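A minimal sketch of the two contamination schemes described above; the function names are illustrative, not the authors' code:

```python
import numpy as np
from scipy import stats

def add_bad_leverage(X, E, rng=None):
    """Turn one randomly chosen observation with E_i = 1 into a bad
    leverage point by setting its first covariate to -10 (the associated
    true coefficient is +3, so the point contradicts the signal)."""
    rng = np.random.default_rng(rng)
    X, E = X.copy(), E.copy()
    i = rng.choice(np.flatnonzero(E == 1))
    X[i, 0] = -10.0
    return X, E

def add_nonleverage_outlier(X, E, beta, nu):
    """Flip the response of the observation with the largest residual
    |E_i - Psi(x_i' beta)| under the t-link with nu degrees of freedom."""
    E = E.copy()
    resid = np.abs(E - stats.t.cdf(X @ beta, df=nu))
    i = int(np.argmax(resid))
    E[i] = 1 - E[i]
    return X, E
```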
For all Bayesian models, we set ${\tau _{0}^{2}}$ to $1/n$ and select $q=P({\gamma _{j}}=1)$ such that $P\left({\textstyle\sum _{j=1}^{p}}{\gamma _{j}}\gt {K_{\gamma }}\right)=0.1$, with ${K_{\gamma }}=\max (10,\log (n))$. As discussed, unlike [32] and [38], the slab variance parameter ${\tau _{1}^{2}}$ is assigned an inverse-gamma prior with hyperparameters s and r set to 2 and 1, respectively, giving a prior mean of 1 and an infinite prior variance. A total of 6,000 iterations are run for the posterior sampling, with a burn-in period of 2,000 iterations.
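Since the inclusion indicators are independent Bernoulli(q) a priori, ${\textstyle\sum _{j}}{\gamma _{j}}$ is Binomial(p, q) and the calibration of q reduces to solving a binomial tail equation. The paper does not say how the equation is solved; one way is simple root finding, sketched here with illustrative names:

```python
import numpy as np
from scipy import stats
from scipy.optimize import brentq

def choose_q(p, n, target=0.10):
    """Pick the prior inclusion probability q so that, a priori,
    P(sum_j gamma_j > K) = target, where K = max(10, log n) and
    sum_j gamma_j ~ Binomial(p, q)."""
    K = max(10, np.log(n))
    # stats.binom.sf(K, p, q) = P(S > K); find the q where it hits `target`.
    f = lambda q: stats.binom.sf(K, p, q) - target
    return brentq(f, 1e-10, 1 - 1e-10)
```

For $p=500$ and $n=100$ this gives a small q, encoding the prior belief that only a handful of the 500 covariates are active.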
The metrics used for evaluating the methods are divided into two categories: variable selection performance and prediction performance. The variable selection metrics include sensitivity (SEN), specificity (SPE), and the Matthews correlation coefficient (MCC), defined as follows:
\[\begin{aligned}{}\text{SEN}& =\text{Sensitivity}=\frac{TP}{TP+FN},\\ {} \text{SPE}& =\text{Specificity}=\frac{TN}{TN+FP},\\ {} \text{MCC}& =\frac{TP\times TN\hspace{-0.1667em}-\hspace{-0.1667em}FN\times FP}{\sqrt{(TP\hspace{-0.1667em}+\hspace{-0.1667em}FP)(TP\hspace{-0.1667em}+\hspace{-0.1667em}FN)(TN\hspace{-0.1667em}+\hspace{-0.1667em}FP)(TN\hspace{-0.1667em}+\hspace{-0.1667em}FN)}},\end{aligned}\]
where TP, TN, FP, and FN represent true positives, true negatives, false positives, and false negatives, respectively. The MCC score provides a more comprehensive measure of performance than other metrics, as it incorporates all elements of the confusion matrix into a single summary statistic [13]. In addition, we present the out-of-sample mean square prediction error (MSPE) defined as:
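These metrics reduce to the four confusion-matrix counts; a direct sketch (inputs are boolean inclusion vectors, and the function name is ours):

```python
import numpy as np

def selection_metrics(selected, truth):
    """Sensitivity, specificity, and MCC from boolean inclusion vectors.

    `selected`: variables chosen by the method; `truth`: truly active ones.
    """
    selected = np.asarray(selected, bool)
    truth = np.asarray(truth, bool)
    tp = np.sum(selected & truth)    # true positives
    tn = np.sum(~selected & ~truth)  # true negatives
    fp = np.sum(selected & ~truth)   # false positives
    fn = np.sum(~selected & truth)   # false negatives
    sen = tp / (tp + fn)
    spe = tn / (tn + fp)
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fn * fp) / denom if denom > 0 else 0.0
    return sen, spe, mcc
```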
\[ \text{MSPE}=\frac{1}{{n_{\text{test}}}}{\sum \limits_{i=1}^{{n_{\text{test}}}}}{({E_{i}}-{\hat{p}_{i}})^{2}},\]
where ${\hat{p}_{i}}$ is the estimated $\text{Prob}({E_{i}}=1)$ for the ith observation in the held-out test set.
Table 1 presents the comparison results, reporting average metrics from simulations based on data generated from the probit link model under two covariate correlation structures with $p=500$. Results for the remaining 11 scenarios are provided in Tables S1–S5 of the Supplementary Materials.
Table 1
Simulation Results: Probit as the True Model (p = 500).
| | Cauchy ($v=1$) SG | Cauchy ($v=1$) EG | t ($v=3$) SG | t ($v=3$) EG | t ($v=7$) SG | t ($v=7$) EG | logit SG | logit EG | probit SG | probit EG | LASSO Cauchy | LASSO logit | LASSO probit |
| $\Sigma =I$ | |||||||||||||
| No Outlier | |||||||||||||
| SEN | 0.6100 | 0.5850 | 0.6000 | 0.6100 | 0.6400 | 0.5900 | 0.6300 | 0.6050 | 0.6400 | 0.6000 | 0.7450 | 0.6850 | 0.6800 |
| SPE | 0.9995 | 0.9995 | 0.9996 | 0.9997 | 0.9995 | 0.9996 | 0.9994 | 0.9998 | 1.0000 | 0.9998 | 0.9729 | 0.9898 | 0.9878 |
| MCC | 0.7435 | 0.7316 | 0.7432 | 0.7574 | 0.7673 | 0.7386 | 0.7562 | 0.7621 | 0.7772 | 0.7574 | 0.4188 | 0.5590 | 0.5510 |
| FP | 0.2600 | 0.2600 | 0.2000 | 0.1600 | 0.2600 | 0.1800 | 0.2800 | 0.0800 | 0.1600 | 0.1200 | 13.4200 | 5.0600 | 5.9000 |
| FN | 1.6600 | 1.6600 | 1.6000 | 1.5600 | 1.4400 | 1.6400 | 1.4800 | 1.5800 | 1.4400 | 1.6000 | 1.0200 | 1.2600 | 1.2800 |
| MSPE | 0.0994 | 0.1034 | 0.0923 | 0.0994 | 0.0930 | 0.1027 | 0.0916 | 0.0947 | 0.0902 | 0.1022 | 0.1233 | 0.1222 | 0.1194 |
| Bad Leverage Outlier | |||||||||||||
| SEN | 0.6050 | 0.5750 | 0.5900 | 0.5600 | 0.5600 | 0.3950 | 0.5400 | 0.4200 | 0.4750 | 0.3550 | 0.7100 | 0.0400 | 0.0000 |
| SPE | 0.9999 | 0.9998 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9982 | 1.0000 | 0.9864 | 1.0000 | 1.0000 |
| MCC | 0.7647 | 0.7427 | 0.7564 | 0.7374 | 0.7358 | 0.5946 | 0.7228 | 0.6098 | 0.5236 | 0.5260 | 0.5381 | 0.0626 | 0.0000 |
| FP | 0.0400 | 0.0800 | 0.0400 | 0.0400 | 0.0400 | 0.0400 | 0.0400 | 0.0600 | 1.4800 | 0.2200 | 6.7600 | 0.0000 | 0.0000 |
| FN | 1.5800 | 1.7000 | 1.6400 | 1.7600 | 1.7600 | 2.4200 | 1.8400 | 2.3200 | 2.1000 | 2.5800 | 1.1600 | 3.8400 | 4.0000 |
| MSPE | 0.0943 | 0.0956 | 0.0918 | 0.0984 | 0.0976 | 0.1480 | 0.1043 | 0.1377 | 0.1940 | 0.1838 | 0.1383 | 0.2472 | 0.2500 |
| Non-leverage Outlier | |||||||||||||
| SEN | 0.5450 | 0.5650 | 0.5400 | 0.4850 | 0.4950 | 0.4700 | 0.5000 | 0.4600 | 0.4850 | 0.4500 | 0.6600 | 0.4900 | 0.4450 |
| SPE | 0.9998 | 0.9998 | 0.9995 | 0.9998 | 0.9994 | 0.9994 | 0.9996 | 0.9995 | 0.9992 | 1.0000 | 0.9849 | 0.9978 | 0.9988 |
| MCC | 0.7189 | 0.7291 | 0.7122 | 0.6680 | 0.6583 | 0.6414 | 0.6774 | 0.6356 | 0.6150 | 0.6354 | 0.4916 | 0.5978 | 0.5974 |
| FP | 0.1000 | 0.1200 | 0.2400 | 0.1200 | 0.2800 | 0.3000 | 0.1800 | 0.2600 | 0.7000 | 0.2200 | 7.4800 | 1.2800 | 0.7800 |
| FN | 1.8200 | 1.7400 | 1.8400 | 2.0600 | 2.0200 | 2.1200 | 2.0000 | 2.1600 | 2.0600 | 2.2000 | 1.3600 | 2.0400 | 2.2200 |
| MSPE | 0.1001 | 0.0998 | 0.1014 | 0.1146 | 0.1138 | 0.1229 | 0.1073 | 0.1176 | 0.1236 | 0.1250 | 0.1415 | 0.1562 | 0.1652 |
| $\Sigma ={\Sigma _{ar1(\rho =0.50)}}$ | |||||||||||||
| No Outlier | |||||||||||||
| SEN | 0.5600 | 0.5400 | 0.5500 | 0.5350 | 0.6100 | 0.5650 | 0.5850 | 0.5700 | 0.5950 | 0.5450 | 0.8700 | 0.8450 | 0.8200 |
| SPE | 0.9997 | 0.9997 | 0.9997 | 0.9997 | 0.9996 | 0.9996 | 0.9997 | 0.9997 | 1.0000 | 1.0000 | 0.9690 | 0.9828 | 0.9846 |
| MCC | 0.7238 | 0.7116 | 0.7225 | 0.7088 | 0.7530 | 0.7245 | 0.7400 | 0.7355 | 0.7548 | 0.7322 | 0.4192 | 0.5624 | 0.5790 |
| FP | 0.1600 | 0.1600 | 0.1600 | 0.1400 | 0.2000 | 0.1800 | 0.1600 | 0.1400 | 0.1200 | 0.0400 | 15.3800 | 8.4200 | 7.7200 |
| FN | 1.7600 | 1.8400 | 1.8000 | 1.8600 | 1.5600 | 1.7400 | 1.6600 | 1.7200 | 1.6200 | 1.8200 | 0.5200 | 0.6200 | 0.7200 |
| MSPE | 0.0662 | 0.0689 | 0.0653 | 0.0719 | 0.0638 | 0.0719 | 0.0650 | 0.0672 | 0.0616 | 0.0702 | 0.0785 | 0.0806 | 0.0816 |
| Bad Leverage Outlier | |||||||||||||
| SEN | 0.5600 | 0.5300 | 0.5450 | 0.4650 | 0.5100 | 0.4000 | 0.5150 | 0.4300 | 0.5000 | 0.4200 | 0.8400 | 0.6200 | 0.1500 |
| SPE | 0.9997 | 0.9998 | 0.9998 | 0.9995 | 0.9988 | 0.9994 | 0.9995 | 0.9996 | 0.9996 | 1.0000 | 0.9888 | 0.9958 | 1.0000 |
| MCC | 0.7257 | 0.7115 | 0.7187 | 0.6447 | 0.6384 | 0.5788 | 0.6762 | 0.6211 | 0.6108 | 0.6104 | 0.6381 | 0.6844 | 0.2308 |
| FP | 0.1600 | 0.0800 | 0.1200 | 0.2400 | 0.6200 | 0.3200 | 0.2400 | 0.1800 | 0.7600 | 0.2000 | 5.5800 | 2.0832 | 0.0200 |
| FN | 1.7600 | 1.8800 | 1.8200 | 2.1400 | 1.9600 | 2.4000 | 1.9400 | 2.2800 | 2.0000 | 2.3200 | 0.6400 | 1.4400 | 3.4000 |
| MSPE | 0.0715 | 0.0783 | 0.0765 | 0.1133 | 0.1255 | 0.1491 | 0.1169 | 0.1365 | 0.1394 | 0.1454 | 0.1090 | 0.1694 | 0.2344 |
| Non-leverage Outlier | |||||||||||||
| SEN | 0.5700 | 0.5300 | 0.5550 | 0.4750 | 0.5200 | 0.3700 | 0.5100 | 0.4200 | 0.4450 | 0.3750 | 0.8500 | 0.7050 | 0.6300 |
| SPE | 0.9994 | 0.9998 | 0.9994 | 0.9999 | 0.9996 | 0.9998 | 0.9997 | 0.9996 | 0.9994 | 1.0000 | 0.9822 | 0.9984 | 0.9974 |
| MCC | 0.7197 | 0.7124 | 0.7133 | 0.6745 | 0.6890 | 0.5808 | 0.6925 | 0.6139 | 0.5472 | 0.5588 | 0.5287 | 0.7171 | 0.7176 |
| FP | 0.2800 | 0.0800 | 0.2800 | 0.0600 | 0.2200 | 0.1200 | 0.1400 | 0.2000 | 1.0200 | 0.3400 | 8.8400 | 2.0832 | 1.2600 |
| FN | 1.7200 | 1.8800 | 1.7800 | 2.1000 | 1.9200 | 2.5200 | 1.9600 | 2.3200 | 2.2200 | 2.5000 | 0.6000 | 1.1800 | 1.4800 |
| MSPE | 0.0697 | 0.0698 | 0.0697 | 0.0768 | 0.0775 | 0.0962 | 0.0784 | 0.0905 | 0.0996 | 0.1020 | 0.0944 | 0.1242 | 0.1460 |
Across all scenarios without data contamination (“No outlier”), the different link models exhibit similar performance. As expected, the true model generally performs slightly better than the others in terms of variable selection (higher MCC) and prediction accuracy (lower MSPE). Notably, the skinny Gibbs algorithm often outperforms the exact Gibbs algorithm for both the probit and t-link models, consistent with the findings of [32] and [38].
A comparison between the “No outlier” and “Bad Leverage Outlier” scenarios shows that the performance of the t-link models with $\nu =1$ or 3 experiences only minimal changes. In contrast, the results for the probit link show noticeable deterioration, both in variable selection (evidenced by lower MCC and higher FP and FN) and prediction (reflected in higher MSPE). Due to this one outlying observation, the probit model performs much worse than the t-link even when the datasets are generated under the probit link. These findings highlight the sensitivity of the probit link and the robustness of the t-link model. The t-link with $\nu =7$ and the logit link show generally comparable performance, both more robust than the probit link but less robust than the t-links with $\nu =1$ or $\nu =3$.
When comparing the “Non-leverage outlier” scenario with the other two cases, the t-links with $\nu =1$ and $\nu =3$ again demonstrate robustness in both variable selection and prediction. The other link functions show some deterioration, though to a lesser degree than under the “Bad Leverage Outlier” scenario, with the probit link remaining the most vulnerable.
Overall, the LASSO models perform worse than their Bayesian counterparts fitted with either the skinny Gibbs or exact Gibbs algorithms using the same link functions. Regarding the link functions, similar patterns are observed for the Cauchy, logit, and probit links under the LASSO method.
Figure 1 displays the runtime (in seconds) per iteration for the four methods, highlighting the computational efficiency of the skinny methods compared to the exact Gibbs methods. The computation time for HEGT and EProbit appears to increase quadratically with p, whereas HSGT and SProbit show a linear increase, consistent with their respective computational complexities. The t-link requires slightly more time than the probit under either the exact or the skinny algorithm (by a margin of microseconds). This difference stems from the additional time spent sampling the extra inverse-gamma latent variables required by the t-link method.
The Fast Sampling Algorithm in [3] has a typical computational complexity of ${n^{2}}p$, which is slightly higher than the $n(p\vee |\boldsymbol{k}{|^{2}})$ complexity of Skinny Gibbs, where $|\boldsymbol{k}|$ denotes the active model size and typically satisfies $|\boldsymbol{k}|\ll p$. The Skinny Gibbs method is specifically designed to ensure scalability under sparsity, making it a more suitable choice for sparse Bayesian variable selection.
5 Application: PCR Data
The PCR dataset originates from a study conducted by [27] to explore the genetics of two inbred mouse populations, B6 and BTBR. It includes gene expression data for 22,575 genes measured across 60 F2 mice, comprising 31 females and 29 males. The physiological phenotype, glycerol-3-phosphate acyltransferase (GPAT), was quantified using quantitative real-time Polymerase Chain Reaction (PCR). Both the phenotypic data (including GPAT measurements) and the gene expression data are publicly available on the GEO website (http://www.ncbi.nlm.nih.gov/geo) under accession number GSE3330.
GPAT is known to influence Hepatic Steatosis, a condition associated with obesity, with lower GPAT levels reducing its prevalence. Following the approach in [32], we define a binary response variable (E) based on GPAT values as $E=\mathbf{I}(\text{GPAT}\lt Q(0.4))$, where $Q(0.4)$ represents the 40th percentile of GPAT.
Given the high dimensionality of the gene expression data, an initial screening step was performed using simple binary regressions. This process identified the 99 most significant genes based on p-values, which were combined with the sex variable, yielding a total of 100 predictors for further analysis. Figure 2 displays the boxplots of the standardized covariates in both datasets. The plot highlights the presence of numerous covariates with outlying values, a common occurrence in real-world data applications.
The six models, incorporating three link functions and two computational algorithms, are then fitted to the data. The initial number of active genes influencing GPAT is set to 10 ($|k|=10$), and the $\boldsymbol{\beta }$ vector is initialized to zero. Bayesian inference is performed using 4,000 MCMC iterations, following a burn-in period of 2,000 iterations. To assess the performance of these variable selection methods, we use 10-fold cross-validation and evaluate performance based on the mean squared prediction error (MSPE), classification accuracy, and the area under the curve (AUC). The dataset T is partitioned into 10 subsets, denoted ${T_{1}},{T_{2}},\dots ,{T_{10}}$. For each subset ${T_{r}}$ (where $r=1,\dots ,10$), variable selection is performed on the remaining data ($T\setminus {T_{r}}$), and the model is used to compute predicted probabilities for the responses in ${T_{r}}$. The MSPE for subset r is calculated as ${\text{MSPE}_{CV(r)}}={\textstyle\sum _{i\in {T_{r}}}}{({E_{i}}-{\hat{p}_{i}})^{2}}/{n_{r}}$ and ${\text{Accuracy}_{CV(r)}}=(T{P_{r}}+T{N_{r}})/(T{P_{r}}+T{N_{r}}+F{P_{r}}+F{N_{r}})$, where ${n_{r}}=|{T_{r}}|$, ${\hat{p}_{i}}$ is the predicted probability for the ith observation, and ${E_{i}}$ is the observed response. The process is repeated for $r=1,\dots ,10$, and the overall cross-validated MSPE, accuracy, and AUC are obtained by averaging across all 10 subsets.
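The cross-validation loop described above can be sketched as follows; `fit_predict` is a placeholder standing in for any one of the six fitted models (train on $T\setminus {T_{r}}$, return predicted probabilities for ${T_{r}}$), and is not part of the paper's code:

```python
import numpy as np

def cv_mspe_accuracy(X, E, fit_predict, n_folds=10, rng=None):
    """10-fold cross-validated MSPE and classification accuracy.

    `fit_predict(X_tr, E_tr, X_te)` must return predicted probabilities
    for the test rows X_te after fitting on (X_tr, E_tr).
    """
    rng = np.random.default_rng(rng)
    n = len(E)
    folds = np.array_split(rng.permutation(n), n_folds)
    mspe, acc = [], []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        p_hat = fit_predict(X[train_idx], E[train_idx], X[test_idx])
        mspe.append(np.mean((E[test_idx] - p_hat) ** 2))          # per-fold MSPE
        acc.append(np.mean((p_hat > 0.5) == E[test_idx]))         # per-fold accuracy
    return float(np.mean(mspe)), float(np.mean(acc))
```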
Figure 3 depicts the 10-fold cross-validation errors based on the selected number of covariates (model size). The ${\text{MSPE}_{CV}}$ is plotted against model sizes ranging from $|k|=1$ to 6, with lower values indicating better predictive performance. It is evident that HSGT and HEGT yield the smallest ${\text{MSPE}_{CV}}$, although SProbit has the smallest MSPE when the number of active covariates is assumed to be one. The overall out-of-sample prediction accuracy in Figure 4 shows that the two t-link models demonstrate superior classification accuracy compared to the logit and probit methods, except at model size 1.
Figure 5 illustrates the performance measured by AUC (Area Under the Curve) across varying model sizes. The plot shows that HSGT consistently achieves the highest AUC values as the model size increases, attaining the highest AUC from model size 3 to 6. The second-best performing model is HEGT. These results demonstrate the robustness of the t-link model and the practical performance of the scalable skinny Gibbs algorithm.
6 Discussion
In this study, we establish theoretical consistency results for the scalable algorithm implementing the t-link (HSGT) in high-dimensional variable selection. The performance of different links and algorithms is illustrated through simulation studies and an application to a real dataset. The t-link model is preferable to the probit link, given its robustness in variable selection and prediction in the presence of outlying cases. The HSGT method also offers competitive computational efficiency. As observed in [37], HSGT has lower variability of the MCMC chains for active variables than HEGT and thus better performance in variable selection. Our application to real data further supports the effectiveness of HSGT. This outcome underscores its utility in handling high-dimensional sparse data, a common characteristic of modern datasets.
Our work opens the door to several promising research directions. First, it would be valuable to further investigate the interplay between link function choice and variable selection in high-dimensional data settings. While [35] explored this interplay using reversible jump Markov chain Monte Carlo (RJMCMC) to study the simultaneous selection of link functions and variables, we aim to address this question in high-dimensional settings by removing restrictions on the degrees of freedom ν. Such an investigation would enhance the theoretical foundations and practical applications of Bayesian methods in modern data science.
Additionally, we are interested in establishing consistency results for more general link function models. We also plan to explore the use of Beta priors on q, as suggested by [41] and [11], and implement scalable spike-and-slab algorithms introduced by [4]. Another promising avenue is extending the methods to mixed-effects models, leveraging the normal scale mixture prior, as discussed by [45].
Robustness in generalized linear models can be pursued through the choice of link function or through robust estimation procedures such as M-estimators with adaptive weighting. In this study, we focus on the former. The weighting-based methods [23, 20] achieve robustness by downweighting observations with large residuals or leverage, thereby directly limiting the impact of contamination. Robustness by link modification instead stabilizes inference through the latent structures associated with the chosen link. For example, [19] showed that the t-link’s influence function is bounded and its weight function down-weights extreme observations, thereby providing robust estimation under contamination and outperforming the probit or logit links in such settings. While weighting-based methods offer more flexibility, alternative links provide robustness while retaining standard estimation procedures.