1 Introduction
Binary regression models are essential tools in statistical modeling, particularly for understanding relationships between binary outcomes and predictor variables. Traditional approaches often rely on the logit or probit link functions because of their theoretical appeal and interpretability. However, these standard link functions may fail when the underlying response curve exhibits heavy tails or extreme responses [10, 21, 29], as is often encountered in genetics and biomedical research. For example, binary outcomes in genetic studies, such as the presence or absence of a specific disease, are influenced by complex biological processes that standard links may capture inadequately. Moreover, inference from these models may become unreliable in the presence of outlying covariates or incoherent responses [22, 24, 40]. Addressing these limitations requires alternative approaches that better account for the complexities of such data.
In this context, the t-distribution emerges as an attractive alternative due to its flexibility in modeling heavy-tailed data, controlled by the degrees of freedom parameter v. This study aims to further investigate the t-link model’s potential to enhance robustness and mitigate sensitivity to outliers. Specifically, we explore its application to high-dimensional variable selection, a scenario that has received little, if any, attention in the current literature.
Bayesian variable selection is a theoretically intricate and computationally demanding problem. These challenges stem from the balance required between numerical computation, prior specification, and analytical evaluation, as highlighted in earlier works [18, 12, 35]. High-dimensional variable selection introduces additional computational burdens due to the increasing number of covariates. To address this, [32] proposed the skinny Gibbs sampler using the continuous spike-and-slab prior introduced by [31], demonstrating its consistency in selecting the true model in logistic regression. [38] extended this algorithm to the probit model, further proving its theoretical properties.
Building on this earlier work, we propose the hierarchical skinny Gibbs sampler with t-link (HSGT). The HSGT method extends the skinny Gibbs sampler framework by incorporating the t-distribution in the link function. Furthermore, we enhance the algorithms of [32] and [38] by introducing a hierarchical prior on the variance component of the active covariates. This update eliminates the time-consuming and problematic tuning process required by previous approaches. The advantages of our method are twofold: (1) the t-link provides a flexible and robust family of link models, with probit and logit as special cases, enabling a more accurate representation of underlying biological and medical phenomena; theoretically, the probit link is recovered as $v\to \infty $, the logit link is well approximated by a t-link with a specific value of v [1, 32], and the robustness of the t-link in estimation and inference under outliers, both in covariates and responses, is well established in the literature [40]; and (2) the skinny Gibbs sampler ensures computational efficiency and scalability to large datasets, a critical requirement in genomics and other high-dimensional fields.
The paper is organized as follows. In Section 2, we introduce the model for variable selection in t-link binary regression and discuss the hierarchical Gibbs algorithms, including the exact Gibbs (HEGT) and the skinny Gibbs (HSGT) methods. The theoretical foundations of the HSGT method, along with consistency results, are detailed in Section 3, with comprehensive proofs provided in Appendix A. Simulation studies evaluating model selection under various scenarios are presented in Section 4, with a particular focus on comparing the performance of the t-link and probit link models, both with and without outliers. In Section 5, we apply the proposed method to a genomic dataset, providing a thorough comparison in prediction accuracy among models. Section 6 concludes with a discussion.
2 Methodology
We consider binary classification in a high-dimensional setting, focusing on a t-link function to model binary responses in medical or genomic studies where outliers and heavy-tailed distributions are observed, for instance, when identifying key genetic variants (SNPs) associated with the presence or absence of a disease (e.g., cancer). Let ${E_{i}}\in \{0,1\}$ represent the binary response variable, where ${E_{i}}=1$ indicates that the ith subject has the disease and ${E_{i}}=0$ otherwise, $i=1,\dots ,n$. The covariate vector ${\boldsymbol{x}_{i}}={({x_{i1}},{x_{i2}},\dots ,{x_{ip}})^{\prime }}\in {\mathcal{R}^{p}}$ contains the expression levels of p genes for the ith subject. The probability of ${E_{i}}=1$ is modeled through the generalized linear regression framework as
(2.1)
\[ P({E_{i}}=1\mid {x_{i}},\boldsymbol{\beta },v)=\Psi ({\boldsymbol{x}^{\prime }_{i}}\boldsymbol{\beta },v),\]
where $\Psi (\cdot )$ is the cumulative distribution function of a distribution, such as the standard normal distribution for the probit link and the standard t-distribution for the t-link. The parameter v may be a scalar or a vector of parameters that is fixed or estimated in the regression model. In the t-link regression, v represents the degrees of freedom. The vector of regression coefficients $\boldsymbol{\beta }$ is $p\times 1$ and satisfies $\| \boldsymbol{\beta }{\| _{0}}\ll p$, meaning that only a small subset of the components of $\boldsymbol{\beta }$ are non-zero. This sparsity assumption is especially relevant in high-dimensional contexts where the number of predictors p is much larger than the sample size n.
The observed-data likelihood function of all n observations is given as
(2.2)
\[ {L_{n}}\left(\boldsymbol{\beta }\mid \boldsymbol{E},\boldsymbol{X},v\right)={\prod \limits_{i=1}^{n}}{\left[\Psi ({\boldsymbol{x}^{\prime }_{i}}\boldsymbol{\beta },\hspace{2.5pt}v)\right]^{{E_{i}}}}{\left[1-\Psi ({\boldsymbol{x}^{\prime }_{i}}\boldsymbol{\beta },\hspace{2.5pt}v)\right]^{(1-{E_{i}})}}.\]
Using the latent variable approach [1] and representing the t-distribution as a scale mixture of Gaussians, the complete-data likelihood function is expressed as
(2.3)
\[\begin{array}{r}\displaystyle L(\boldsymbol{\beta },\boldsymbol{Z},\boldsymbol{\omega }\mid \boldsymbol{E},\boldsymbol{X},v)\hspace{99.58464pt}\\ {} \displaystyle \hspace{2em}={\prod \limits_{i=1}^{n}}\mathcal{N}({Z_{i}}\mid {\boldsymbol{x}^{\prime }_{i}}\boldsymbol{\beta },{\omega _{i}^{2}})\mathcal{IG}({\omega _{i}^{2}}\mid v/2,v/2)\times \\ {} \displaystyle \hspace{2em}\bigg({E_{i}}\mathcal{I}({Z_{i}}\ge 0)+(1-{E_{i}})\mathcal{I}({Z_{i}}\lt 0)\bigg),\end{array}\]
where ${Z_{i}}\mid {\omega _{i}^{2}}$ is normal with mean ${\boldsymbol{x}^{\prime }_{i}}\boldsymbol{\beta }$ and variance ${\omega _{i}^{2}}$, and ${\omega _{i}^{2}}$ follows an inverse-gamma distribution with both shape and scale equal to $v/2$. The aim is to efficiently infer the sparsity pattern encoded in the regression coefficients $\boldsymbol{\beta }$ through a hierarchical Bayesian model with consistency guarantees. Throughout this methodological exposition, we assume that the columns of $\boldsymbol{X}$ are standardized to have zero mean and unit variance.
2.1 Variable Selection with Spike-and-Slab Priors
The spike-and-slab prior is used to induce sparsity on $\boldsymbol{\beta }$ [17, 25]. This approach employs a binary latent vector $\boldsymbol{\gamma }$ of dimension p, corresponding to the number of covariates, where each element ${\gamma _{j}}$ indicates whether the jth covariate is active $({\gamma _{j}}=1)$ or inactive $({\gamma _{j}}=0)$, $j=1,\dots ,p$. The prior distribution for the regression coefficients reflects this status: a “spike” component centered near zero when ${\gamma _{j}}=0$, or a “slab” component with a diffuse probability density when ${\gamma _{j}}=1$. Building on this framework, [33] demonstrated that allowing the variances of the spike-and-slab components to depend on the sample size ensures that, as the sample size grows, the posterior probability of selecting the true model converges to one.
Let ${\boldsymbol{\beta }_{k}}$ denote the coefficients of the active covariates and ${\boldsymbol{\beta }_{{k^{c}}}}$ those of the inactive covariates, where $k=\{j:{\gamma _{j}}\ne 0\}$. Then ${\textstyle\sum _{j=1}^{p}}{\gamma _{j}}=|k|$ is the number of non-zero entries in $\boldsymbol{\gamma }$. Similarly, let ${\boldsymbol{X}_{k}}$ represent the sub-matrix of $\boldsymbol{X}$ with columns corresponding to the nonzero indices of $\boldsymbol{\gamma }$, while ${\boldsymbol{X}_{{k^{c}}}}$ includes the remaining columns associated with ${\gamma _{j}}=0$. The priors for the binary latent variables ${\gamma _{j}}$ and the corresponding regression coefficients ${\beta _{j}}$, $j=1,\dots ,p$, are specified as follows:
for some positive constants ${\tau _{0}^{2}}$, r, s and $q\in (0,1)$. The hyperparameters r and s are the shape and scale parameters of the inverse-gamma prior on ${\tau _{1}^{2}}$. The parameter ${\tau _{0}^{2}}$ is the variance of the spike component, and ${\tau _{1}^{2}}$ controls the variance of the slab component. The prior probability q is the probability that ${\beta _{j}}\ne 0$. The rates governing ${\tau _{0}^{2}}$ and q are as specified in [33, 32] for asymptotic considerations (i.e., $n{\tau _{0}^{2}}=o(1)$ and $q\sim {p^{-1}}$).
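To make the prior concrete, the following Python sketch draws $(\boldsymbol{\gamma },\boldsymbol{\beta },{\tau _{1}^{2}})$ from the spike-and-slab hierarchy described above. The specific rate choices (${\tau _{0}^{2}}=1/(n\log n)$ and $q=1/p$) are illustrative stand-ins for the asymptotic scalings $n{\tau _{0}^{2}}=o(1)$ and $q\sim {p^{-1}}$, not values prescribed by the paper, and the function name is ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_spike_slab_prior(p, n, r=1.0, s=1.0, rng=rng):
    """Draw (gamma, beta, tau1_sq) from the spike-and-slab prior.

    Illustrative rate choices consistent with the text:
    n * tau0^2 -> 0 and q ~ 1/p.
    """
    tau0_sq = 1.0 / (n * np.log(n))   # spike variance, n * tau0^2 = o(1)
    q = 1.0 / p                       # prior inclusion probability
    # IG(r, s) draw via the reciprocal of a Gamma(r, rate=s) variable
    tau1_sq = 1.0 / rng.gamma(shape=r, scale=1.0 / s)
    gamma = rng.binomial(1, q, size=p)                 # active indicators
    sd = np.where(gamma == 1, np.sqrt(tau1_sq), np.sqrt(tau0_sq))
    beta = rng.normal(0.0, sd)                         # spike or slab draw
    return gamma, beta, tau1_sq
```

Under these choices, inactive coefficients concentrate tightly around zero while active ones are drawn from the diffuse slab.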
2.2 Hierarchical Exact Gibbs Sampler (HEGT)
The full joint posterior distributions of $\boldsymbol{\beta }$, $\boldsymbol{Z}$, $\boldsymbol{\gamma }$, $\boldsymbol{\omega }$ and ${\tau _{1}^{2}}$ conditioned on X, E and v using the complete-data likelihood in (2.3) and the spike-and-slab prior specified in (2.4)-(2.7) is given as
(2.8)
\[\begin{array}{r}\displaystyle \pi (\boldsymbol{\beta }\mathbf{,}\boldsymbol{Z}\mathbf{,}\boldsymbol{\gamma }\mathbf{,}\boldsymbol{\omega },{\tau _{1}^{2}}\mid \boldsymbol{E}\mathbf{,}\boldsymbol{X},v)\hspace{119.50148pt}\\ {} \displaystyle \propto L(\boldsymbol{\beta },\boldsymbol{Z}\mathbf{,}\boldsymbol{\omega }\mid \boldsymbol{E}\mathbf{,}\boldsymbol{X},v)\cdot \pi (\boldsymbol{\beta }\mid \boldsymbol{\gamma },{\tau _{1}^{2}})\cdot \pi (\boldsymbol{\gamma })\cdot \pi ({\tau _{1}^{2}})\hspace{5.69046pt}\\ {} \displaystyle \hspace{2em}\propto \exp \left\{-\frac{1}{2}\Big[{(\mathbf{Z}-\boldsymbol{X}\boldsymbol{\beta })^{\prime }}\mathbf{W}(\mathbf{Z}-\boldsymbol{X}\boldsymbol{\beta })+{\boldsymbol{\beta }^{\mathbf{\prime }}}\boldsymbol{D}\boldsymbol{\beta }\Big]\right\}\times \\ {} \displaystyle {\prod \limits_{i=1}^{n}}\bigg({E_{i}}\mathcal{I}({Z_{i}}\ge 0)+(1-{E_{i}})\mathcal{I}({Z_{i}}\lt 0)\bigg)\times \hspace{28.45274pt}\\ {} \displaystyle {\prod \limits_{i=1}^{n}}\left({\left(\frac{1}{{\omega _{i}^{2}}}\right)^{\frac{v+3}{2}}}\exp \left(-\frac{v}{2{\omega _{i}^{2}}}\right)\right){\left[\frac{1-q}{{\tau _{0}}\sqrt{2\pi }}\right]^{p}}\times \hspace{19.91684pt}\\ {} \displaystyle {\left[\frac{{\tau _{0}}q}{{\tau _{1}}(1-q)}\right]^{|\boldsymbol{k}|}}\left({\bigg[\frac{1}{{\tau _{1}^{2}}}\bigg]^{r+1}}\exp \left(-\frac{s}{{\tau _{1}^{2}}}\right)\right),\hspace{36.98866pt}\end{array}\]
where $\mathbf{W}=\text{Diag}\left(1/{\omega _{1}^{2}},\dots ,1/{\omega _{n}^{2}}\right)$, $|\boldsymbol{k}|={\textstyle\sum _{j=1}^{p}}{\gamma _{j}}$ and $\mathbf{D}=\text{Diag}\left({\tau _{0}^{-2}}(\mathbf{1}-\boldsymbol{\gamma })+{\tau _{1}^{-2}}\boldsymbol{\gamma }\right)$.
(i) The conditional posterior distribution of ${\gamma _{j}}$ is
\[ {\gamma _{j}}|\cdot \sim \text{Bernoulli}\left(\frac{{d_{j}}}{1+{d_{j}}}\right),\hspace{0.2222em}\hspace{0.2222em}\hspace{0.2222em}j=1,\dots ,p,\]
where
\[ {d_{j}}=\frac{\pi ({\gamma _{j}}=1\mid {\boldsymbol{\gamma }_{-j}},\boldsymbol{\beta }\mathbf{,}\boldsymbol{X}\mathbf{,}\boldsymbol{W}\mathbf{,}\boldsymbol{E})}{\pi ({\gamma _{j}}=0\mid {\boldsymbol{\gamma }_{-j}},\boldsymbol{\beta }\mathbf{,}\boldsymbol{X}\mathbf{,}\boldsymbol{W}\mathbf{,}\boldsymbol{E})}=\frac{q\phi \left({\beta _{j}};\hspace{0.2222em}\hspace{0.2222em}0,{\tau _{1}^{2}}\right)}{(1-q)\phi \left({\beta _{j}};\hspace{0.2222em}\hspace{0.2222em}0,{\tau _{0}^{2}}\right)}.\]
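The odds ${d_{j}}$ involve only the two Gaussian densities, so step (i) reduces to a single Bernoulli draw per coordinate. A minimal sketch (with illustrative argument values; the helper names are ours):

```python
from math import sqrt, pi, exp

def normal_pdf(x, var):
    """Density of N(0, var) evaluated at x."""
    return exp(-x * x / (2.0 * var)) / sqrt(2.0 * pi * var)

def inclusion_prob(beta_j, q, tau0_sq, tau1_sq):
    """P(gamma_j = 1 | beta_j, ...) = d_j / (1 + d_j), as in step (i)."""
    d_j = (q * normal_pdf(beta_j, tau1_sq)) / (
        (1.0 - q) * normal_pdf(beta_j, tau0_sq))
    return d_j / (1.0 + d_j)
```

As expected, larger $|{\beta _{j}}|$ favors the slab and pushes the inclusion probability toward one.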
(ii) The conditional posterior distribution of ${\omega _{i}^{2}}$ is
\[ {\omega _{i}^{2}}\mid \cdot \sim \mathcal{IG}\left(\frac{1+v}{2},\hspace{1em}\frac{v+{({Z_{i}}-{\boldsymbol{x}_{\boldsymbol{i}}^{\mathbf{\prime }}}\boldsymbol{\beta })^{2}}}{2}\right).\]
(iii) The conditional posterior distribution of $\boldsymbol{\beta }$ is
\[ \boldsymbol{\beta }\mid \cdot \sim {\mathcal{N}_{p}}\left({\boldsymbol{V}^{\boldsymbol{-}\mathbf{1}}}{\boldsymbol{X}^{\mathbf{\prime }}}\boldsymbol{W}\boldsymbol{Z},\hspace{2.5pt}{\boldsymbol{V}^{-1}}\right),\]
where $\mathbf{V}={\mathbf{X}^{\prime }}\boldsymbol{W}\mathbf{X}+\mathbf{D}$.
(iv) The conditional posterior of ${Z_{i}}$ is
\[ \pi ({Z_{i}}\mid \cdot )\propto \left\{\begin{array}{l}\mathcal{N}({Z_{i}};\hspace{0.1667em}{\boldsymbol{x}^{\prime }_{i}}\boldsymbol{\beta },\hspace{0.1667em}{\omega _{i}^{2}})\mathcal{I}({Z_{i}}\ge 0),\hspace{1em}\text{if}\hspace{2.5pt}{E_{i}}=1,\hspace{1em}\\ {} \mathcal{N}({Z_{i}};\hspace{0.1667em}{\boldsymbol{x}^{\prime }_{i}}\boldsymbol{\beta },\hspace{0.1667em}{\omega _{i}^{2}})\mathcal{I}({Z_{i}}\lt 0),\hspace{1em}\text{if}\hspace{2.5pt}{E_{i}}=0.\hspace{1em}\end{array}\right.\]
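Steps (ii) and (iv) are a conjugate inverse-gamma draw and a truncated-normal draw, respectively. A minimal Python sketch, using scipy for the truncated normal (the function names are ours):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def sample_omega_sq(z_i, mu_i, v, rng=rng):
    """Step (ii): omega_i^2 | . ~ IG((1+v)/2, (v + (z_i - mu_i)^2)/2)."""
    shape = (1.0 + v) / 2.0
    scale = (v + (z_i - mu_i) ** 2) / 2.0
    # IG draw via the reciprocal of a Gamma(shape, rate=scale) variable
    return 1.0 / rng.gamma(shape, 1.0 / scale)

def sample_z(e_i, mu_i, omega_sq):
    """Step (iv): N(mu_i, omega_i^2) truncated to [0, inf) if E_i = 1,
    and to (-inf, 0) if E_i = 0."""
    sd = np.sqrt(omega_sq)
    if e_i == 1:
        a, b = (0.0 - mu_i) / sd, np.inf      # standardized lower bound 0
    else:
        a, b = -np.inf, (0.0 - mu_i) / sd     # standardized upper bound 0
    return stats.truncnorm.rvs(a, b, loc=mu_i, scale=sd)
```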
(v) The conditional posterior distribution of ${\tau _{1}^{2}}$ is
\[ {\tau _{1}^{2}}\mid \cdot \sim \mathcal{IG}\left(r+\frac{|\boldsymbol{k}|}{2},\hspace{1em}s+\frac{{\boldsymbol{\beta }^{\prime }_{k}}{\boldsymbol{\beta }_{k}}}{2}\right),\]
which follows from combining the $\mathcal{IG}(r,s)$ prior with the Gaussian slab components in (2.8).
2.3 Hierarchical Skinny Gibbs Sampler(HSGT)
The computational complexity of the above algorithm is primarily driven by sampling $\boldsymbol{\beta }$ in step (iii), which involves handling a $p\times p$ precision matrix $\boldsymbol{V}$. Calculating its inverse $({\boldsymbol{V}^{-1}})$ has a complexity of $O({p^{3}})$, while computing the product ${\boldsymbol{V}^{-1}}{\boldsymbol{X}^{\prime }}\boldsymbol{W}\boldsymbol{Z}$ introduces an additional complexity of $O({p^{2}}n)$. As a result, the total computational cost of HEGT is $O({p^{2}}(p\vee n))$, which increases rapidly with p. This high complexity makes the algorithm computationally expensive in terms of both time and memory, particularly within a Markov chain Monte Carlo (MCMC) framework. [32] introduced the skinny Gibbs algorithm to address this computational bottleneck. The skinny Gibbs algorithm is a simple yet effective modification of the Gibbs sampler, designed for high-dimensional settings with many predictors. At each MCMC iteration, $\boldsymbol{\beta }$ is divided into two components: the active component (${\boldsymbol{\beta }_{k}}$) and the inactive component (${\boldsymbol{\beta }_{{k^{c}}}}$). Sparsity is imposed by structuring the precision matrix ($\boldsymbol{V}$) of $\boldsymbol{\beta }$ as a two-block diagonal matrix, assuming full independence between the active and inactive components of $\boldsymbol{\beta }$. That is,
\[\begin{aligned}{}\boldsymbol{V}& =\left(\begin{array}{c@{\hskip10.0pt}c}{\boldsymbol{X}^{\prime }_{k}}{\boldsymbol{W}\boldsymbol{X}_{k}}+{\tau _{1}^{-2}}{\boldsymbol{I}_{|\boldsymbol{k}|}}& {\boldsymbol{X}^{\prime }_{k}}{\boldsymbol{W}\boldsymbol{X}_{{k^{c}}}}\\ {} {\boldsymbol{X}^{\prime }_{{k^{c}}}}{\boldsymbol{W}\boldsymbol{X}_{k}}& {\mathbf{X}^{\prime }_{{k^{c}}}}{\boldsymbol{X}_{{k^{c}}}}+{\tau _{0}^{-2}}{\boldsymbol{I}_{(p-|\boldsymbol{k}|)}}\end{array}\right)\\ {} & \approx \left(\begin{array}{c@{\hskip10.0pt}c}{\boldsymbol{X}^{\prime }_{k}}{\boldsymbol{W}\boldsymbol{X}_{k}}+{\tau _{1}^{-2}}{\boldsymbol{I}_{|\boldsymbol{k}|}}& \mathbf{0}\\ {} \mathbf{0}& (n-1+{\tau _{0}^{-2}}){\boldsymbol{I}_{(p-|\boldsymbol{k}|)}}\end{array}\right).\end{aligned}\]
The joint posterior of $\boldsymbol{\beta }$, $\boldsymbol{Z}$, $\boldsymbol{\omega }$, $\boldsymbol{\gamma }$ and ${\tau _{1}^{2}}$ under the skinny Gibbs sampler, conditioned on $\boldsymbol{X}$, $\boldsymbol{E}$ and v, is given as
(2.9)
\[\begin{array}{r}\displaystyle \pi (\boldsymbol{\beta },\boldsymbol{Z},\boldsymbol{\gamma },\boldsymbol{\omega },{\tau _{1}^{2}}\mid \boldsymbol{X},\boldsymbol{E},v)\hspace{142.26378pt}\\ {} \displaystyle \propto L({\boldsymbol{\beta }_{k}},\hspace{2.5pt}\boldsymbol{Z},\boldsymbol{\omega }\mid \boldsymbol{E},\hspace{2.5pt}{\boldsymbol{X}_{k}},v)\cdot \pi (\boldsymbol{\beta }\mid \boldsymbol{\gamma },{\tau _{1}^{2}})\cdot \pi (\boldsymbol{\gamma })\pi ({\tau _{1}^{2}})\hspace{14.22636pt}\\ {} \displaystyle \propto \exp \left\{-\frac{1}{2}\left({(\boldsymbol{Z}-{\boldsymbol{X}_{\boldsymbol{k}}}{\boldsymbol{\beta }_{\boldsymbol{k}}})^{\prime }}\boldsymbol{W}(\boldsymbol{Z}-{\boldsymbol{X}_{\boldsymbol{k}}}{\boldsymbol{\beta }_{\boldsymbol{k}}})+\frac{{\boldsymbol{\beta }^{\prime }_{k}}{\boldsymbol{\beta }_{k}}}{{\tau _{1}^{2}}}\right)\right\}\\ {} \displaystyle \times \exp \left\{-\frac{1}{2}\left(n-1+{\tau _{0}^{-2}}\right){\boldsymbol{\beta }^{\prime }_{{k^{c}}}}{\boldsymbol{\beta }_{{k^{c}}}}\right\}{\left[\frac{(1-q){\tau _{1}}}{q{\tau _{0}}}\right]^{-|\boldsymbol{k}|}}\times \\ {} \displaystyle {\prod \limits_{i=1}^{n}}\bigg({E_{i}}\mathcal{I}({Z_{i}}\ge 0)+(1-{E_{i}})\mathcal{I}({Z_{i}}\lt 0)\bigg)\times \hspace{48.36958pt}\\ {} \displaystyle {\prod \limits_{i=1}^{n}}\left({\left(\frac{1}{{\omega _{i}^{2}}}\right)^{\frac{v+3}{2}}}\exp \left(-\frac{v}{2{\omega _{i}^{2}}}\right)\right){\bigg[\frac{1}{{\tau _{1}^{2}}}\bigg]^{r+1}}\exp \left\{-\frac{s}{{\tau _{1}^{2}}}\right\}.\end{array}\]
(i) The conditional posterior distribution of $\boldsymbol{\beta }$ is
\[\begin{array}{r@{\hskip10.0pt}c@{\hskip10.0pt}l}\displaystyle {\boldsymbol{\beta }_{k}}\mid \cdot & \displaystyle \sim & \displaystyle {\mathcal{N}_{|\boldsymbol{k}|}}\left({\boldsymbol{V}_{k}^{-1}}{\boldsymbol{X}^{\prime }_{k}}\boldsymbol{W}\boldsymbol{Z},\hspace{2.5pt}{\mathbf{V}_{k}^{-1}}\right),\\ {} \displaystyle {\boldsymbol{\beta }_{{k^{c}}}}\mid \cdot & \displaystyle \sim & \displaystyle {\mathcal{N}_{p-|\boldsymbol{k}|}}\left(0,\hspace{2.5pt}{\boldsymbol{V}_{{k^{c}}}^{-1}}\right),\end{array}\]
where ${\boldsymbol{V}_{k}}={\boldsymbol{X}^{\prime }_{k}}{\boldsymbol{W}\boldsymbol{X}_{k}}+{\tau _{1}^{-2}}{\boldsymbol{I}_{|\boldsymbol{k}|}}$ and ${\boldsymbol{V}_{{k^{c}}}}=(n-1+{\tau _{0}^{-2}}){\boldsymbol{I}_{p-|\boldsymbol{k}|}}$. Compared with step (iii) in Section 2.2, the computation here has a much lower complexity of $O(n(p\vee |\boldsymbol{k}{|^{2}}))$. This reduction not only minimizes memory usage but also improves computational efficiency, making the MCMC algorithm more practical for high-dimensional data settings.
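A sketch of the skinny update for $\boldsymbol{\beta }$, assuming the relevant quantities are available as NumPy arrays: the active block is sampled via a $|\boldsymbol{k}|\times |\boldsymbol{k}|$ Cholesky factor and the inactive block as independent normals, which is exactly where the complexity saving comes from. This is our illustrative implementation, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_beta_skinny(X, W_diag, Z, active, tau0_sq, tau1_sq, rng=rng):
    """Skinny Gibbs step (i): solve the active block exactly and draw the
    inactive block as independent normals, never forming the p x p matrix."""
    n, p = X.shape
    k = np.flatnonzero(active)
    Xk = X[:, k]
    # Active block: V_k = Xk' W Xk + I / tau1^2  (size |k| x |k|)
    Vk = Xk.T @ (W_diag[:, None] * Xk) + np.eye(len(k)) / tau1_sq
    L = np.linalg.cholesky(Vk)
    mean_k = np.linalg.solve(Vk, Xk.T @ (W_diag * Z))
    # mean + L^{-T} eps has covariance (L L')^{-1} = V_k^{-1}
    beta_k = mean_k + np.linalg.solve(L.T, rng.standard_normal(len(k)))
    # Inactive block: independent N(0, 1 / (n - 1 + 1/tau0^2))
    var_kc = 1.0 / (n - 1.0 + 1.0 / tau0_sq)
    beta = np.sqrt(var_kc) * rng.standard_normal(p)
    beta[k] = beta_k
    return beta
```

The Cholesky solve touches only the active block, so the per-iteration cost scales with $|\boldsymbol{k}|$ rather than p.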
(ii) The conditional posterior distribution of ${\omega _{i}^{2}}$ is
\[ {\omega _{i}^{2}}\mid \cdot \sim \mathcal{IG}\left(\frac{1+v}{2},\hspace{1em}\frac{v+{({Z_{i}}-{\boldsymbol{x}^{\prime }_{ik}}{\boldsymbol{\beta }_{k}})^{2}}}{2}\right),\]
for $i=1,\dots ,n$.
(iii) The conditional posterior distribution of ${Z_{i}}$ is
\[ \pi ({Z_{i}}\mid \cdot )\propto \left\{\begin{array}{l}\mathcal{N}({Z_{i}};\hspace{0.1667em}{\boldsymbol{x}^{\prime }_{i,k}}{\boldsymbol{\beta }_{k}},\hspace{0.1667em}{\omega _{i}^{2}})\mathcal{I}({Z_{i}}\ge 0),\hspace{1em}\text{if}\hspace{2.5pt}{E_{i}}=1,\hspace{1em}\\ {} \mathcal{N}({Z_{i}};\hspace{0.1667em}{\boldsymbol{x}^{\prime }_{i,k}}{\boldsymbol{\beta }_{k}},\hspace{0.1667em}{\omega _{i}^{2}})\mathcal{I}({Z_{i}}\lt 0),\hspace{1em}\text{if}\hspace{2.5pt}{E_{i}}=0,\hspace{1em}\end{array}\right.\]
for $i=1,\dots ,n$.
(iv) The conditional posterior distribution of ${\gamma _{j}}$ is
\[ {\gamma _{j}}|\cdot \sim \text{Bernoulli}\left(\frac{{\tilde{d}_{j}}}{1+{\tilde{d}_{j}}}\right),\hspace{0.2778em}\hspace{0.2222em}\hspace{0.2222em}j=1,\dots ,p,\]
where
\[\begin{array}{r@{\hskip10.0pt}c@{\hskip10.0pt}l}\displaystyle {\tilde{d}_{j}}& \displaystyle =& \displaystyle \frac{\pi ({\gamma _{j}}=1\mid {\boldsymbol{\gamma }_{-j}},\boldsymbol{\beta }\mathbf{,}\boldsymbol{X}\mathbf{,}\boldsymbol{W}\mathbf{,}\boldsymbol{E})}{\pi ({\gamma _{j}}=0\mid {\boldsymbol{\gamma }_{-j}},\boldsymbol{\beta }\mathbf{,}\boldsymbol{X}\mathbf{,}\boldsymbol{W}\mathbf{,}\boldsymbol{E})}\\ {} & \displaystyle =& \displaystyle {d_{j}}\cdot \exp \left\{\frac{1}{2}{\boldsymbol{X}^{\prime }_{j}}(\boldsymbol{I}-\boldsymbol{W}){\boldsymbol{X}_{j}}{\beta _{j}^{2}}+\right.\\ {} & & \displaystyle \left.{(\boldsymbol{Z}-{\boldsymbol{X}_{{k_{-j}}}}{\boldsymbol{\beta }_{{k_{-j}}}})^{\prime }}{\boldsymbol{W}\boldsymbol{X}_{j}}{\beta _{j}}\right\}.\end{array}\]
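The adjusted odds can be transcribed directly into NumPy (our naming; `W_diag` holds the diagonal of $\boldsymbol{W}$, and `d_j` is the unadjusted odds from Section 2.2):

```python
import numpy as np

def skinny_inclusion_odds(d_j, X, W_diag, Z, beta, j, active):
    """Adjusted odds for step (iv):
    d_tilde_j = d_j * exp{ 0.5 * X_j'(I - W) X_j * beta_j^2
                           + (Z - X_{k_-j} beta_{k_-j})' W X_j * beta_j }."""
    Xj = X[:, j]
    # active coordinates other than j
    k_minus_j = [l for l in np.flatnonzero(active) if l != j]
    resid = Z - X[:, k_minus_j] @ beta[k_minus_j]
    log_adj = (0.5 * Xj @ ((1.0 - W_diag) * Xj) * beta[j] ** 2
               + (resid * W_diag) @ Xj * beta[j])
    return d_j * np.exp(log_adj)
```

Note that when $\boldsymbol{W}=\boldsymbol{I}$ and ${\beta _{j}}=0$, the adjustment vanishes and ${\tilde{d}_{j}}={d_{j}}$.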
(v) The conditional posterior distribution of ${\tau _{1}^{2}}$ is
\[ {\tau _{1}^{2}}\mid \cdot \sim \mathcal{IG}\left(r+\frac{|\boldsymbol{k}|}{2},\hspace{1em}s+\frac{{\boldsymbol{\beta }^{\prime }_{k}}{\boldsymbol{\beta }_{k}}}{2}\right),\]
as in Section 2.2.
3 Theoretical Properties
In this section, we present the asymptotic properties of the hierarchical prior setup of the algorithm for binary regression with the Student t-link. Let $\boldsymbol{t}=\{{t_{1}},{t_{2}},\dots ,{t_{|\boldsymbol{t}|}}\}\subseteq [p]$ represent the true model, i.e., the set of non-zero locations of the true coefficient vector. We assume $|\boldsymbol{t}|$ is fixed. Let ${\boldsymbol{\beta }_{0}}\in {\mathcal{R}^{p}}$ be the true coefficient vector, and ${\boldsymbol{\beta }_{0,t}}\in {\mathcal{R}^{|\boldsymbol{t}|}}$ the vector of true non-zero coefficients.
For a given model with active components indexed by $\boldsymbol{k}\subseteq [p]$ of size $|\boldsymbol{k}|$ at most ${m_{n}}$, we define ${\ell _{n}}({\boldsymbol{\beta }_{k}})$ as the log-likelihood and ${S_{n}}({\boldsymbol{\beta }_{k}})=\partial {\ell _{n}}({\boldsymbol{\beta }_{k}})/\partial {\boldsymbol{\beta }_{k}}$ as the score function. Restricting the largest model size is common practice in Bayesian high-dimensional asymptotic theory to control the posterior ratio and achieve the desired consistency results. Variable selection consistency of the skinny Gibbs sampler has been established under the logit link [32] and the probit link [38], respectively. Here, we extend the consistency results to a t-link model with known degrees of freedom.
The Hessian matrix of ${\ell _{n}}({\boldsymbol{\beta }_{k}})$ is defined as
\[\begin{array}{r@{\hskip10.0pt}c@{\hskip10.0pt}l}\displaystyle {H_{n}}({\boldsymbol{\beta }_{k}})=& \displaystyle -\frac{{\partial ^{2}}{\ell _{n}}}{\partial {\boldsymbol{\beta }_{k}}\partial {\boldsymbol{\beta }^{\prime }_{k}}}& \displaystyle ={\sum \limits_{i=1}^{n}}{\boldsymbol{x}_{i,k}}{\boldsymbol{x}^{\prime }_{i,k}}{w_{i}}({\boldsymbol{\beta }_{k}})\\ {} & & \displaystyle ={\boldsymbol{X}^{\prime }_{k}}\boldsymbol{W}({\boldsymbol{\beta }_{k}}){\boldsymbol{X}_{k}},\end{array}\]
where $\boldsymbol{W}({\boldsymbol{\beta }_{k}})=\text{Diag}\{{w_{i}}({\boldsymbol{\beta }_{k}})\}$ for $i=1,\dots ,n$ with ${w_{i}}({\boldsymbol{\beta }_{k}})$ as
\[\begin{array}{r@{\hskip10.0pt}c@{\hskip10.0pt}l}& & \displaystyle {E_{i}}\left\{\frac{(v+1)}{v}{\left(1+\frac{{\mu _{i,k}^{2}}}{v}\right)^{-1}}{\mu _{i,k}}\hspace{2.5pt}{r_{i,k}}+{r_{i,k}^{2}}\right\}+\\ {} & & \displaystyle (1-{E_{i}})\cdot \left\{-\frac{(v+1)}{v}{\left(1+\frac{{\mu _{i,k}^{2}}}{v}\right)^{-1}}{\mu _{i,k}}\hspace{2.5pt}{R_{i,k}}+{R_{i,k}^{2}}\right\}\end{array}\]
with ${r_{i,k}}=\psi ({\mu _{i,k}},\hspace{2.5pt}v)/\Psi ({\mu _{i,k}},\hspace{2.5pt}v)$ and ${R_{i,k}}=\psi ({\mu _{i,k}},\hspace{2.5pt}v)/[1-\Psi ({\mu _{i,k}},\hspace{2.5pt}v)]$, where $\psi ({\mu _{i,k}},\hspace{2.5pt}v)$ and $\Psi ({\mu _{i,k}},\hspace{2.5pt}v)$ are the density and cumulative distribution functions of a standard t distribution with v degrees of freedom evaluated at ${\mu _{i,k}}$, and ${\mu _{i,k}}={\boldsymbol{x}^{\prime }_{i,k}}{\boldsymbol{\beta }_{k}}$.
Before presenting the main results, we introduce the following notation: (i) for any $x,y\in \mathbb{R}$, $x\vee y$ and $x\wedge y$ represent the maximum and minimum of x and y, respectively. For any positive real sequences ${x_{n}}$ and ${y_{n}}$, (ii) ${x_{n}}\lesssim {y_{n}}$ or ${x_{n}}=O({y_{n}})$ means there exists a constant $C\gt 0$ such that $|{x_{n}}|\le C|{y_{n}}|$ for all large n; (iii) ${x_{n}}\ll {y_{n}}$ or ${x_{n}}=o({y_{n}})$ if ${x_{n}}/{y_{n}}\to 0$ as $n\to \infty $; (iv) ${x_{n}}\sim {y_{n}}$ if there exist constants ${K_{1}}\gt {K_{2}}\gt 0$ such that ${K_{2}}\lt {x_{n}}/{y_{n}}\lt {K_{1}}$ for all large n.
Let $\boldsymbol{a}={({a_{1}},{a_{2}},\dots ,{a_{p}})^{\prime }}\in {\mathbb{R}^{p}}$; then $\| \boldsymbol{a}{\| _{2}}$ and $\| \boldsymbol{a}{\| _{1}}$ are the ${\ell _{2}}$ and ${\ell _{1}}$ norms of $\boldsymbol{a}$, respectively, and $\| \boldsymbol{a}{\| _{\text{max}}}={\text{max}_{1\le j\le p}}|{a_{j}}|$. For any real symmetric matrix $\boldsymbol{M}$, ${\lambda _{\max }}(\boldsymbol{M})$ and ${\lambda _{\min }}(\boldsymbol{M})$ are the maximum and minimum eigenvalues of $\boldsymbol{M}$, respectively. The following conditions are assumed to obtain the asymptotic results:
Condition (C1) (Condition on dimension (n and p)) For some constant $0\lt {d^{\prime }}\lt 1$, let $p\gg n$, $\log p=o(n)$ and ${m_{n}}=O\Big({(n/\log p)^{\frac{1-{d^{\prime }}}{2}}}\wedge p\Big)$, as $n\to \infty $.
Condition (C1) guarantees that our proposed method can handle high-dimensional settings in which the number of predictors increases at a sub-exponential rate relative to n. The parameter ${m_{n}}$ restricts the analysis to models whose size is not too large. To avoid over-fitting, it is also required that ${m_{n}}\ll n$. Similar model-size restrictions are frequently encountered in the sparse estimation literature [14, 34, 28].
Condition (C2) (Condition on Design Matrix $\boldsymbol{X}$) For ${\boldsymbol{x}_{\boldsymbol{i}}}\in {\mathcal{R}^{p}},\hspace{1em}i=1,2,\dots ,n$.
-
i. (Boundedness) $P(||{\boldsymbol{x}_{\boldsymbol{i}}}|{|_{max}}\le {C_{0}})=1$ for some ${C_{0}}\gt 0$, where $P(\cdot )$ is the probability measure.
-
ii. (Sub-Gaussianity) For every $\boldsymbol{\alpha }\in {\mathcal{R}^{p}}$, there exists a constant ${C_{\alpha }}\gt 0$ such that $\bar{E}\exp ({\boldsymbol{x}^{\prime }_{i}}\boldsymbol{\alpha })\le \exp ({C_{\alpha }}||\boldsymbol{\alpha }|{|_{2}^{2}})$.
Condition (C2)[i] requires each component of ${\boldsymbol{x}_{i}}$ to be bounded with probability 1, as adopted by [32] for a deterministic design matrix. Condition (C2)[ii] implies that any linear combination of ${\boldsymbol{x}_{i}}={({x_{i1}},\dots ,{x_{ip}})^{\prime }}$ has a sufficiently light tail satisfying sub-Gaussianity. See similar conditions under logistic regression in [32, 8].
Condition (C3) (Conditions on True Model) For some constant ${c_{0}}\gt 0$,
\[ \underset{j\in t}{\min }\| {\beta _{0,j}}{\| _{2}^{2}}\ge {c_{0}}|\boldsymbol{t}|{\Lambda _{|\boldsymbol{t}|}}\Big(\frac{\log p}{n}\vee \frac{1}{\log p}\Big),\]
and ${\Lambda _{|\boldsymbol{t}|}}={\max _{\boldsymbol{k}:\hspace{0.2222em}|\boldsymbol{k}|\le {c_{0}}}}{\lambda _{\max }}({n^{-1}}{\boldsymbol{X}^{\prime }_{k}}{\boldsymbol{X}_{k}})$ for any integer ${c_{0}}\gt 0$.
This condition (also known as the beta-min condition) guarantees that the true regression coefficients ${\boldsymbol{\beta }_{0}}$ have a finite number of nonzero entries, each of sufficient magnitude, and a bounded ${\ell _{1}}$-norm. It holds if we assume that the true inactive components (${\boldsymbol{\beta }_{{t^{c}}}}$) may be nonzero but satisfy $\| {\boldsymbol{X}_{{t^{c}}}}{\boldsymbol{\beta }_{{t^{c}}}}{\| _{2}}=O(1)$, in line with a similar assumption in [33].
Condition (C4) (Conditions on prior hyperparameters) $n{\tau _{0}^{2}}=O(1)$ and, for some constants ${a_{1}},{a_{2}}\gt 0$, the hyperparameters satisfy ${a_{1}}\lt r,s\lt {a_{2}}$ and ${q^{2}}\sim {p^{-1}}$, where r and s are the shape and scale parameters of the inverse-gamma prior on ${\tau _{1}^{2}}$ defined in Section 2.1.
Condition (C4) specifies appropriate conditions for the hyperparameters. Similar assumptions have been considered in [43], [26] and [9]. We impose a prior on ${\tau _{1}^{2}}$ with its value estimated within the MCMC iterations, which eliminates the need for the tuning procedure required in [32].
Lemma 1 (Pseudoposterior of $\boldsymbol{\gamma }$).
The marginal posterior probability of any model $\boldsymbol{k}$, ${m_{k}}(\boldsymbol{E})$, under the skinny Gibbs sampler is given by
\[\begin{array}{r@{\hskip10.0pt}c}\displaystyle {m_{k}}(\boldsymbol{E})=\pi (\boldsymbol{k}\mid \boldsymbol{X}\mathbf{,}\boldsymbol{E})\propto & \displaystyle {\big\{2\pi {v_{n}^{-2}}(n-1+{\tau _{0}^{-2}})\big\}^{\frac{|\boldsymbol{k}|}{2}}}\times \\ {} & \displaystyle \textstyle\int \textstyle\int {({\tau _{1}^{2}})^{-\frac{|\boldsymbol{k}|}{2}}}\pi ({\tau _{1}^{2}})\exp \big\{{\ell _{n}}({\boldsymbol{\beta }_{k}})\big\}\\ {} & \displaystyle \exp \Big(-\frac{1}{2{\tau _{1}^{2}}}\| {\boldsymbol{\beta }_{k}}{\| _{2}^{2}}\Big)d{\boldsymbol{\beta }_{k}}d{\tau _{1}^{2}},\end{array}\]
where ${v_{n}}=(1-q)/({\tau _{0}}q)$.
Proof of Lemma 1.
\[\begin{aligned}{}{m_{k}}(\boldsymbol{E})=& \pi (\boldsymbol{k}\mid \boldsymbol{X}\mathbf{,}\boldsymbol{E})\\ {} \propto & {v_{n}^{-|\boldsymbol{k}|}}\int \int \exp \left\{{\ell _{n}}\left({\boldsymbol{\beta }_{k}}\right)\right\}\exp \left\{-\frac{1}{2{\tau _{1}^{2}}}{\boldsymbol{\beta }^{\prime }_{k}}{\boldsymbol{\beta }_{k}}\right\}\times \\ {} & \exp \left\{-\frac{1}{2}\left(n-1+{\tau _{0}^{-2}}\right){\boldsymbol{\beta }^{\prime }_{{k^{c}}}}{\boldsymbol{\beta }_{{k^{c}}}}\right\}d{\boldsymbol{\beta }_{{k^{c}}}}d{\boldsymbol{\beta }_{k}}d{\tau _{1}^{2}}\\ {} \propto & {v_{n}^{-|\boldsymbol{k}|}}{\left\{2\pi \left(n-1+{\tau _{0}^{-2}}\right)\right\}^{-\frac{p-|\boldsymbol{k}|}{2}}}\times \\ {} & \int \int \exp \left\{{\ell _{n}}\left({\boldsymbol{\beta }_{k}}\right)\right\}\exp \left(-\frac{1}{2{\tau _{1}^{2}}}{\boldsymbol{\beta }^{\prime }_{k}}{\boldsymbol{\beta }_{k}}\right)d{\boldsymbol{\beta }_{k}}d{\tau _{1}^{2}}\\ {} \propto & {\left\{2\pi {v_{n}^{-2}}\left(n-1+{\tau _{0}^{-2}}\right)\right\}^{\frac{|\boldsymbol{k}|}{2}}}\times \\ {} & \int \int \exp \left\{{\ell _{n}}\left({\boldsymbol{\beta }_{k}}\right)\right\}\exp \left(-\frac{1}{2{\tau _{1}^{2}}}{\boldsymbol{\beta }^{\prime }_{k}}{\boldsymbol{\beta }_{k}}\right)d{\boldsymbol{\beta }_{k}}d{\tau _{1}^{2}}.\end{aligned}\]
□
Lemma 1 is used to prove Theorem 1, which demonstrates that the proposed method guarantees posterior ratio consistency.
3.1 Model Selection Consistency
Theorem 1 (Posterior ratio consistency).
Assume Conditions (C1)–(C4) hold. Then, for any $\boldsymbol{k}\ne \boldsymbol{t}$, the posterior ratio $PR(\boldsymbol{k},\boldsymbol{t})=\frac{\pi (\boldsymbol{k}\mid \boldsymbol{X}\mathbf{,}\boldsymbol{E})}{\pi (\boldsymbol{t}\mid \boldsymbol{X}\mathbf{,}\boldsymbol{E})}$ satisfies
(3.1)
\[ PR(\boldsymbol{k},\boldsymbol{t})\stackrel{P}{\to }0\hspace{1em}\hspace{2.5pt}\textit{as}\hspace{2.5pt}n\to \infty ,\]
where $\boldsymbol{t}\subseteq [p]$ is the true model.
Theorem 1 shows that, asymptotically, our posterior avoids over-fitting by not including an excessive number of unnecessary variables. However, this does not guarantee that the posterior will concentrate on the true model. To capture all significant variables, the nonzero entries in ${\boldsymbol{\beta }_{0,t}}$ must have sufficiently large magnitudes. In the following theorem, we establish that our posterior achieves strong selection consistency, meaning that the posterior probability assigned to the true model t converges to 1. This requires a suitable condition on the lower bound for the magnitudes of the nonzero entries in ${\boldsymbol{\beta }_{0,t}}$. It is important to note that Theorem 1 does not require the smallest nonzero entries to be bounded away from 0. Furthermore, Theorem 2 guarantees posterior ratio consistency, which implies that the true model t will become the mode of the posterior as the sample size grows.
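In practice, posterior model probabilities, and hence the ratio $PR(\boldsymbol{k},\boldsymbol{t})$, can be estimated from the MCMC output by the visit frequencies of the sampled $\boldsymbol{\gamma }$ vectors. A minimal sketch (our helper, not part of the method):

```python
from collections import Counter

def posterior_ratio(gamma_draws, k, t):
    """Estimate PR(k, t) = pi(k | data) / pi(t | data) from MCMC draws of
    gamma, treating each model as a tuple of 0/1 indicators and using
    visit frequencies as Monte Carlo estimates of posterior probabilities."""
    counts = Counter(tuple(g) for g in gamma_draws)
    n_t = counts.get(tuple(t), 0)
    if n_t == 0:
        raise ValueError("true model t was never visited")
    return counts.get(tuple(k), 0) / n_t
```

Under posterior ratio consistency, this estimate should shrink toward zero for any $\boldsymbol{k}\ne \boldsymbol{t}$ as n grows.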
The proofs of Theorems 1 and 2 are provided in the appendix. The authors of [8] established consistency under hierarchical nonlocal priors in logistic regression, and we adapt their proof strategy to show that the asymptotic results also hold under spike-and-slab priors within the skinny Gibbs framework.
The current consistency results are established for variable selection under a specified regression model. That is, under Conditions (C1)–(C4), posterior concentration for $\boldsymbol{\beta }$ holds with all the other components of the regression model fixed, including the random distribution of the response variable and the link function. Allowing the degrees of freedom parameter ν to be unknown incorporates the link function determination into the model selection process. Because of the interplay between the regression coefficients and the link function, additional conditions beyond (C1)–(C4) are required to ensure model identifiability and consistency.
4 Simulation
We assess the performance of the t-links, the logit link, and the probit link in variable selection, particularly in the presence of outlying observations. In addition, we evaluate performance under both the exact and the skinny Gibbs algorithms. For each data replicate, ten Bayesian models are fitted and compared: the hierarchical skinny Gibbs t-link model (HSGT) and the hierarchical exact Gibbs t-link model (HEGT), each with $\nu =1$ (Cauchy), 3, and 7; the skinny Gibbs logit model (SLogit) [37]; the exact Gibbs logit model (ELogit); the skinny Gibbs probit model (SProbit); and the exact Gibbs probit model (EProbit) [38]. Among integer values of ν, $\nu =1$ gives the heaviest tails, while $\nu =7$ is sometimes used as an approximation to the logit link. In addition, three LASSO models with different links are fitted to provide a frequentist benchmark and to further illustrate the robustness of the link functions under alternative estimation methods.
Here we treat ν as known and fix it at a specific value. The proposed algorithm is flexible and can be applied to any value of ν. The “best” ν for a given dataset can be chosen ad hoc using standard model selection criteria, such as cross-validation, DIC, or WAIC. All the theoretical properties hold for any fixed value of ν.
In the simulation, the first $|\boldsymbol{k}|$ columns of the design matrix $\boldsymbol{X}$ correspond to active covariates with nonzero coefficients, while the remaining columns represent inactive covariates with zero regression coefficients. Each row ${\boldsymbol{x}_{i}}$ of $\boldsymbol{X}$ is independently generated from a normal distribution with a p-dimensional covariance matrix Σ. The binary response $\boldsymbol{E}$ is generated from a Bernoulli distribution with success probability $\Psi (\boldsymbol{X}\boldsymbol{\beta },\nu )$, where $\Psi (\cdot )$ is the cumulative distribution function of a standard normal for the probit link ($\nu \to \infty $), of a standard logistic distribution for the logit link, or of a standard t distribution with degrees of freedom $\nu =1,3,7$ for the t-links. We set $|\boldsymbol{k}|=4$, $n=100$, and $\boldsymbol{\beta }={(3,1.5,1.0,0.5)^{\prime }}$. We vary the true link ${\Psi ^{-1}}$ (probit, logit, or Cauchy ($\nu =1$)), the dimension of $\boldsymbol{X}$ ($p=100$ or 500), and its correlation structure (independent or dependent). For independent $\boldsymbol{X}$, the covariance matrix is $\boldsymbol{\Sigma }=\boldsymbol{I}$; for correlated covariates, Σ is structured into two blocks corresponding to ${\boldsymbol{X}_{k}}$ and ${\boldsymbol{X}_{{k^{c}}}}$, with a constant correlation of 0.1 between blocks and, within each block, an AR(1) correlation structure with autocorrelation $\rho =0.5$ across the covariates. This yields a total of 12 scenarios, and 50 data replicates are generated under each.
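As an illustration, the data-generating scheme above can be sketched as follows. This is not the authors' code: the function and argument names are ours, and the block-covariance construction follows our reading of the description. The t CDF with ν = 1 gives the Cauchy link, and large ν approximates the probit.

```python
import numpy as np
from scipy import stats

def simulate_tlink_data(n=100, p=500, beta_active=(3.0, 1.5, 1.0, 0.5),
                        nu=1, rho=0.5, between=0.1, correlated=True,
                        rng=None):
    """Simulate binary responses from a t-link model (nu=1 gives the Cauchy link)."""
    rng = np.random.default_rng(rng)
    k = len(beta_active)
    beta = np.zeros(p)
    beta[:k] = beta_active

    if correlated:
        # Constant correlation 0.1 between blocks, AR(1) with autocorrelation
        # rho within the active block (first k columns) and the inactive block.
        Sigma = np.full((p, p), between)
        for lo, hi in [(0, k), (k, p)]:
            idx = np.arange(lo, hi)
            Sigma[np.ix_(idx, idx)] = rho ** np.abs(idx[:, None] - idx[None, :])
    else:
        Sigma = np.eye(p)

    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    # Success probability: CDF of a standard t with nu df at x_i' beta.
    prob = stats.t.cdf(X @ beta, df=nu)
    E = rng.binomial(1, prob)
    return X, E, beta
```

With `nu=7` the fitted probabilities are close to those of a logit link, which is the sense in which $\nu =7$ approximates it.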
For each simulated dataset, we also generate two modified datasets identical to the original except for one observation. The first is altered to create a “bad leverage point,” as described in [22]: an observation is randomly selected from among those with ${E_{i}}=1$, and the value of its first covariate is changed to −10 (note that the regression coefficient associated with the first covariate is 3). It is called a “bad leverage point” because it exerts leverage while contradicting the association between the response and the explanatory variable observed in the rest of the data. The second creates a “non-leverage outlier” by selecting the observation with the largest magnitude of ${E_{i}}-\Psi ({\mathbf{x}_{i}}\boldsymbol{\beta })$ and setting its binary response to $1-{E_{i}}$. Comparing the fits of the ten Bayesian models and the three LASSO models, with and without the outlier, highlights the robustness of the t-link model relative to the probit link.
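A minimal sketch of the two contamination schemes described above; the function names are illustrative, not the authors' code:

```python
import numpy as np
from scipy import stats

def add_bad_leverage(X, E, rng=None):
    """Turn one randomly chosen observation with E_i = 1 into a bad
    leverage point by setting its first covariate to -10 (the associated
    true coefficient is +3, so the point contradicts the signal)."""
    rng = np.random.default_rng(rng)
    X, E = X.copy(), E.copy()
    i = rng.choice(np.flatnonzero(E == 1))
    X[i, 0] = -10.0
    return X, E

def add_nonleverage_outlier(X, E, beta, nu):
    """Flip the response of the observation with the largest residual
    |E_i - Psi(x_i' beta)| under the t-link with nu degrees of freedom."""
    E = E.copy()
    resid = np.abs(E - stats.t.cdf(X @ beta, df=nu))
    i = int(np.argmax(resid))
    E[i] = 1 - E[i]
    return X, E
```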
For all Bayesian models, we set ${\tau _{0}^{2}}$ to $1/n$ and select $q=P({\gamma _{j}}=1)$ such that $P\left({\textstyle\sum _{j=1}^{p}}{\gamma _{j}}\gt {K_{\gamma }}\right)=0.1$, with ${K_{\gamma }}=\max (10,\log (n))$. As discussed, unlike [32] and [38], the slab variance parameter ${\tau _{1}^{2}}$ is assigned an inverse-gamma prior with hyperparameters s and r set to 2 and 1, respectively, giving a prior mean of 1 and an infinite prior variance. A total of 6,000 iterations are run for the posterior sampling, with a burn-in period of 2,000 iterations.
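Since the inclusion indicators are independent Bernoulli(q) a priori, ${\textstyle\sum _{j}}{\gamma _{j}}$ is Binomial(p, q) and the calibration of q reduces to solving a binomial tail equation. The paper does not say how the equation is solved; one way is simple root finding, sketched here with illustrative names:

```python
import numpy as np
from scipy import stats
from scipy.optimize import brentq

def choose_q(p, n, target=0.10):
    """Pick the prior inclusion probability q so that, a priori,
    P(sum_j gamma_j > K) = target, where K = max(10, log n) and
    sum_j gamma_j ~ Binomial(p, q)."""
    K = max(10, np.log(n))
    # stats.binom.sf(K, p, q) = P(S > K); find the q where it hits `target`.
    f = lambda q: stats.binom.sf(K, p, q) - target
    return brentq(f, 1e-10, 1 - 1e-10)
```

For $p=500$ and $n=100$ this gives a small q, encoding the prior belief that only a handful of the 500 covariates are active.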
The metrics used for evaluating the methods are divided into two categories: variable selection performance and prediction performance. The variable selection metrics include sensitivity (SEN), specificity (SPE), and the Matthews correlation coefficient (MCC), defined as follows:
\[\begin{aligned}{}\text{SEN}& =\text{Sensitivity}=\frac{TP}{TP+FN},\\ {} \text{SPE}& =\text{Specificity}=\frac{TN}{TN+FP},\\ {} \text{MCC}& =\frac{TP\times TN\hspace{-0.1667em}-\hspace{-0.1667em}FN\times FP}{\sqrt{(TP\hspace{-0.1667em}+\hspace{-0.1667em}FP)(TP\hspace{-0.1667em}+\hspace{-0.1667em}FN)(TN\hspace{-0.1667em}+\hspace{-0.1667em}FP)(TN\hspace{-0.1667em}+\hspace{-0.1667em}FN)}},\end{aligned}\]
where TP, TN, FP, and FN represent true positives, true negatives, false positives, and false negatives, respectively. The MCC score provides a more comprehensive measure of performance than other metrics, as it incorporates all elements of the confusion matrix into a single summary statistic [13]. In addition, we present the out-of-sample mean square prediction error (MSPE) defined as:
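These metrics reduce to the four confusion-matrix counts; a direct sketch (inputs are boolean inclusion vectors, and the function name is ours):

```python
import numpy as np

def selection_metrics(selected, truth):
    """Sensitivity, specificity, and MCC from boolean inclusion vectors.

    `selected`: variables chosen by the method; `truth`: truly active ones.
    """
    selected = np.asarray(selected, bool)
    truth = np.asarray(truth, bool)
    tp = np.sum(selected & truth)    # true positives
    tn = np.sum(~selected & ~truth)  # true negatives
    fp = np.sum(selected & ~truth)   # false positives
    fn = np.sum(~selected & truth)   # false negatives
    sen = tp / (tp + fn)
    spe = tn / (tn + fp)
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fn * fp) / denom if denom > 0 else 0.0
    return sen, spe, mcc
```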
\[ \text{MSPE}=\frac{1}{{n_{\text{test}}}}{\sum \limits_{i=1}^{{n_{\text{test}}}}}{({E_{i}}-{\hat{p}_{i}})^{2}},\]
where ${\hat{p}_{i}}$ is the estimated $\text{Prob}({E_{i}}=1)$ for the ith observation in the held-out test set.
Table 1 presents the comparison results, reporting average metrics from simulations based on data generated from the probit link model under two covariate correlation structures with $p=500$. Results for the remaining 11 scenarios are provided in Tables S1–S5 of the Supplementary Materials.
Table 1
Simulation Results: Probit as the True Model (p = 500).
| | Cauchy ($v=1$) SG | Cauchy ($v=1$) EG | t ($v=3$) SG | t ($v=3$) EG | t ($v=7$) SG | t ($v=7$) EG | logit SG | logit EG | probit SG | probit EG | LASSO Cauchy | LASSO logit | LASSO probit |
| $\Sigma =I$ | |||||||||||||
| No Outlier | |||||||||||||
| SEN | 0.6100 | 0.5850 | 0.6000 | 0.6100 | 0.6400 | 0.5900 | 0.6300 | 0.6050 | 0.6400 | 0.6000 | 0.7450 | 0.6850 | 0.6800 |
| SPE | 0.9995 | 0.9995 | 0.9996 | 0.9997 | 0.9995 | 0.9996 | 0.9994 | 0.9998 | 1.0000 | 0.9998 | 0.9729 | 0.9898 | 0.9878 |
| MCC | 0.7435 | 0.7316 | 0.7432 | 0.7574 | 0.7673 | 0.7386 | 0.7562 | 0.7621 | 0.7772 | 0.7574 | 0.4188 | 0.5590 | 0.5510 |
| FP | 0.2600 | 0.2600 | 0.2000 | 0.1600 | 0.2600 | 0.1800 | 0.2800 | 0.0800 | 0.1600 | 0.1200 | 13.4200 | 5.0600 | 5.9000 |
| FN | 1.6600 | 1.6600 | 1.6000 | 1.5600 | 1.4400 | 1.6400 | 1.4800 | 1.5800 | 1.4400 | 1.6000 | 1.0200 | 1.2600 | 1.2800 |
| MSPE | 0.0994 | 0.1034 | 0.0923 | 0.0994 | 0.0930 | 0.1027 | 0.0916 | 0.0947 | 0.0902 | 0.1022 | 0.1233 | 0.1222 | 0.1194 |
| Bad Leverage Outlier | |||||||||||||
| SEN | 0.6050 | 0.5750 | 0.5900 | 0.5600 | 0.5600 | 0.3950 | 0.5400 | 0.4200 | 0.4750 | 0.3550 | 0.7100 | 0.0400 | 0.0000 |
| SPE | 0.9999 | 0.9998 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9982 | 1.0000 | 0.9864 | 1.0000 | 1.0000 |
| MCC | 0.7647 | 0.7427 | 0.7564 | 0.7374 | 0.7358 | 0.5946 | 0.7228 | 0.6098 | 0.5236 | 0.5260 | 0.5381 | 0.0626 | 0.0000 |
| FP | 0.0400 | 0.0800 | 0.0400 | 0.0400 | 0.0400 | 0.0400 | 0.0400 | 0.0600 | 1.4800 | 0.2200 | 6.7600 | 0.0000 | 0.0000 |
| FN | 1.5800 | 1.7000 | 1.6400 | 1.7600 | 1.7600 | 2.4200 | 1.8400 | 2.3200 | 2.1000 | 2.5800 | 1.1600 | 3.8400 | 4.0000 |
| MSPE | 0.0943 | 0.0956 | 0.0918 | 0.0984 | 0.0976 | 0.1480 | 0.1043 | 0.1377 | 0.1940 | 0.1838 | 0.1383 | 0.2472 | 0.2500 |
| Non-leverage Outlier | |||||||||||||
| SEN | 0.5450 | 0.5650 | 0.5400 | 0.4850 | 0.4950 | 0.4700 | 0.5000 | 0.4600 | 0.4850 | 0.4500 | 0.6600 | 0.4900 | 0.4450 |
| SPE | 0.9998 | 0.9998 | 0.9995 | 0.9998 | 0.9994 | 0.9994 | 0.9996 | 0.9995 | 0.9992 | 1.0000 | 0.9849 | 0.9978 | 0.9988 |
| MCC | 0.7189 | 0.7291 | 0.7122 | 0.6680 | 0.6583 | 0.6414 | 0.6774 | 0.6356 | 0.6150 | 0.6354 | 0.4916 | 0.5978 | 0.5974 |
| FP | 0.1000 | 0.1200 | 0.2400 | 0.1200 | 0.2800 | 0.3000 | 0.1800 | 0.2600 | 0.7000 | 0.2200 | 7.4800 | 1.2800 | 0.7800 |
| FN | 1.8200 | 1.7400 | 1.8400 | 2.0600 | 2.0200 | 2.1200 | 2.0000 | 2.1600 | 2.0600 | 2.2000 | 1.3600 | 2.0400 | 2.2200 |
| MSPE | 0.1001 | 0.0998 | 0.1014 | 0.1146 | 0.1138 | 0.1229 | 0.1073 | 0.1176 | 0.1236 | 0.1250 | 0.1415 | 0.1562 | 0.1652 |
| $\Sigma ={\Sigma _{ar1(\rho =0.50)}}$ | |||||||||||||
| No Outlier | |||||||||||||
| SEN | 0.5600 | 0.5400 | 0.5500 | 0.5350 | 0.6100 | 0.5650 | 0.5850 | 0.5700 | 0.5950 | 0.5450 | 0.8700 | 0.8450 | 0.8200 |
| SPE | 0.9997 | 0.9997 | 0.9997 | 0.9997 | 0.9996 | 0.9996 | 0.9997 | 0.9997 | 1.0000 | 1.0000 | 0.9690 | 0.9828 | 0.9846 |
| MCC | 0.7238 | 0.7116 | 0.7225 | 0.7088 | 0.7530 | 0.7245 | 0.7400 | 0.7355 | 0.7548 | 0.7322 | 0.4192 | 0.5624 | 0.5790 |
| FP | 0.1600 | 0.1600 | 0.1600 | 0.1400 | 0.2000 | 0.1800 | 0.1600 | 0.1400 | 0.1200 | 0.0400 | 15.3800 | 8.4200 | 7.7200 |
| FN | 1.7600 | 1.8400 | 1.8000 | 1.8600 | 1.5600 | 1.7400 | 1.6600 | 1.7200 | 1.6200 | 1.8200 | 0.5200 | 0.6200 | 0.7200 |
| MSPE | 0.0662 | 0.0689 | 0.0653 | 0.0719 | 0.0638 | 0.0719 | 0.0650 | 0.0672 | 0.0616 | 0.0702 | 0.0785 | 0.0806 | 0.0816 |
| Bad Leverage Outlier | |||||||||||||
| SEN | 0.5600 | 0.5300 | 0.5450 | 0.4650 | 0.5100 | 0.4000 | 0.5150 | 0.4300 | 0.5000 | 0.4200 | 0.8400 | 0.6200 | 0.1500 |
| SPE | 0.9997 | 0.9998 | 0.9998 | 0.9995 | 0.9988 | 0.9994 | 0.9995 | 0.9996 | 0.9996 | 1.0000 | 0.9888 | 0.9958 | 1.0000 |
| MCC | 0.7257 | 0.7115 | 0.7187 | 0.6447 | 0.6384 | 0.5788 | 0.6762 | 0.6211 | 0.6108 | 0.6104 | 0.6381 | 0.6844 | 0.2308 |
| FP | 0.1600 | 0.0800 | 0.1200 | 0.2400 | 0.6200 | 0.3200 | 0.2400 | 0.1800 | 0.7600 | 0.2000 | 5.5800 | 2.0832 | 0.0200 |
| FN | 1.7600 | 1.8800 | 1.8200 | 2.1400 | 1.9600 | 2.4000 | 1.9400 | 2.2800 | 2.0000 | 2.3200 | 0.6400 | 1.4400 | 3.4000 |
| MSPE | 0.0715 | 0.0783 | 0.0765 | 0.1133 | 0.1255 | 0.1491 | 0.1169 | 0.1365 | 0.1394 | 0.1454 | 0.1090 | 0.1694 | 0.2344 |
| Non-leverage Outlier | |||||||||||||
| SEN | 0.5700 | 0.5300 | 0.5550 | 0.4750 | 0.5200 | 0.3700 | 0.5100 | 0.4200 | 0.4450 | 0.3750 | 0.8500 | 0.7050 | 0.6300 |
| SPE | 0.9994 | 0.9998 | 0.9994 | 0.9999 | 0.9996 | 0.9998 | 0.9997 | 0.9996 | 0.9994 | 1.0000 | 0.9822 | 0.9984 | 0.9974 |
| MCC | 0.7197 | 0.7124 | 0.7133 | 0.6745 | 0.6890 | 0.5808 | 0.6925 | 0.6139 | 0.5472 | 0.5588 | 0.5287 | 0.7171 | 0.7176 |
| FP | 0.2800 | 0.0800 | 0.2800 | 0.0600 | 0.2200 | 0.1200 | 0.1400 | 0.2000 | 1.0200 | 0.3400 | 8.8400 | 2.0832 | 1.2600 |
| FN | 1.7200 | 1.8800 | 1.7800 | 2.1000 | 1.9200 | 2.5200 | 1.9600 | 2.3200 | 2.2200 | 2.5000 | 0.6000 | 1.1800 | 1.4800 |
| MSPE | 0.0697 | 0.0698 | 0.0697 | 0.0768 | 0.0775 | 0.0962 | 0.0784 | 0.0905 | 0.0996 | 0.1020 | 0.0944 | 0.1242 | 0.1460 |
Across all scenarios without data contamination (“No outlier”), the different link models exhibit similar performance. As expected, the true model generally performs slightly better than the others in terms of variable selection (higher MCC) and prediction accuracy (lower MSPE). Notably, the skinny Gibbs algorithm often outperforms the exact Gibbs algorithm for both the probit and t-link models, consistent with the findings of [32] and [38].
A comparison between the “No outlier” and “Bad Leverage Outlier” scenarios shows that the performance of the t-link models with $\nu =1$ or 3 experiences only minimal changes. In contrast, the results for the probit link show noticeable deterioration, both in variable selection (evidenced by lower MCC and higher FP and FN) and prediction (reflected in higher MSPE). Due to this one outlying observation, the probit model performs much worse than the t-link even when the datasets are generated under the probit link. These findings highlight the sensitivity of the probit link and the robustness of the t-link model. The t-link with $\nu =7$ and the logit link show generally comparable performance, both more robust than the probit link but less robust than the t-links with $\nu =1$ or $\nu =3$.
When comparing the “Non-leverage outlier” scenario with the other two cases, the t-links with $\nu =1$ and $\nu =3$ again demonstrate robustness in both variable selection and prediction. The other link functions show some deterioration, though to a lesser degree than under the “Bad Leverage Outlier” scenario, with the probit link remaining the most vulnerable.
Overall, the LASSO models perform worse than their Bayesian counterparts fitted with either the skinny Gibbs or exact Gibbs algorithms using the same link functions. Regarding the link functions, similar patterns are observed for the Cauchy, logit, and probit links under the LASSO method.
Figure 1 displays the runtime (in seconds) per iteration for the four methods, highlighting the computational efficiency of the skinny methods compared to the exact Gibbs methods. The computation time for HEGT and EProbit appears to increase quadratically with p, whereas HSGT and SProbit show a linear increase, consistent with their respective computational complexities. The t-link requires slightly more time than the probit under either the exact or the skinny algorithm (by a margin of microseconds). This difference stems from the additional time spent sampling the extra inverse-gamma latent variables required by the t-link method.
The Fast Sampling Algorithm in [3] has a typical computational complexity of ${n^{2}}p$, which is slightly higher than the $n(p\vee |\boldsymbol{k}{|^{2}})$ complexity of Skinny Gibbs, where $|\boldsymbol{k}|$ denotes the active model size and typically satisfies $|\boldsymbol{k}|\ll p$. The Skinny Gibbs method is specifically designed to ensure scalability under sparsity, making it a more suitable choice for sparse Bayesian variable selection.
5 Application: PCR Data
The PCR dataset originates from a study conducted by [27] to explore the genetics of two inbred mouse populations, B6 and BTBR. It includes gene expression data for 22,575 genes measured across 60 F2 mice, comprising 31 females and 29 males. The physiological phenotype, glycerol-3-phosphate acyltransferase (GPAT), was quantified using quantitative real-time Polymerase Chain Reaction (PCR). Both the phenotypic data (including GPAT measurements) and the gene expression data are publicly available on the GEO website (http://www.ncbi.nlm.nih.gov/geo) under accession number GSE3330.
GPAT is known to influence Hepatic Steatosis, a condition associated with obesity, with lower GPAT levels reducing its prevalence. Following the approach in [32], we define a binary response variable (E) based on GPAT values as $E=\mathbf{I}(\text{GPAT}\lt Q(0.4))$, where $Q(0.4)$ represents the 40th percentile of GPAT.
Given the high dimensionality of the gene expression data, an initial screening step was performed using simple binary regressions. This process identified the 99 most significant genes based on p-values, which were combined with the sex variable, yielding a total of 100 predictors for further analysis. Figure 2 displays the boxplots of the standardized covariates in both datasets. The plot highlights the presence of numerous covariates with outlying values, a common occurrence in real-world data applications.
The six models, incorporating three link functions and two computational algorithms, are then fitted to the data. The initial number of active genes influencing GPAT is set to 10 ($|k|=10$), and the $\boldsymbol{\beta }$ vector is initialized to zero. Bayesian inference is performed using 4,000 MCMC iterations, following a burn-in period of 2,000 iterations. To assess the performance of these variable selection methods, we use 10-fold cross-validation and evaluate performance based on the mean squared prediction error (MSPE), classification accuracy, and the area under the curve (AUC). The dataset T is partitioned into 10 subsets, denoted ${T_{1}},{T_{2}},\dots ,{T_{10}}$. For each subset ${T_{r}}$ (where $r=1,\dots ,10$), variable selection is performed on the remaining data ($T\setminus {T_{r}}$), and the model is used to compute predicted probabilities for the responses in ${T_{r}}$. The MSPE for subset r is calculated as ${\text{MSPE}_{CV(r)}}={\textstyle\sum _{i\in {T_{r}}}}{({E_{i}}-{\hat{p}_{i}})^{2}}/{n_{r}}$ and ${\text{Accuracy}_{CV(r)}}=(T{P_{r}}+T{N_{r}})/(T{P_{r}}+T{N_{r}}+F{P_{r}}+F{N_{r}})$, where ${n_{r}}=|{T_{r}}|$, ${\hat{p}_{i}}$ is the predicted probability for the ith observation, and ${E_{i}}$ is the observed response. The process is repeated for $r=1,\dots ,10$, and the overall cross-validated MSPE, accuracy, and AUC are obtained by averaging across all 10 subsets.
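The cross-validation loop described above can be sketched as follows; `fit_predict` is a placeholder standing in for any one of the six fitted models (train on $T\setminus {T_{r}}$, return predicted probabilities for ${T_{r}}$), and is not part of the paper's code:

```python
import numpy as np

def cv_mspe_accuracy(X, E, fit_predict, n_folds=10, rng=None):
    """10-fold cross-validated MSPE and classification accuracy.

    `fit_predict(X_tr, E_tr, X_te)` must return predicted probabilities
    for the test rows X_te after fitting on (X_tr, E_tr).
    """
    rng = np.random.default_rng(rng)
    n = len(E)
    folds = np.array_split(rng.permutation(n), n_folds)
    mspe, acc = [], []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        p_hat = fit_predict(X[train_idx], E[train_idx], X[test_idx])
        mspe.append(np.mean((E[test_idx] - p_hat) ** 2))          # per-fold MSPE
        acc.append(np.mean((p_hat > 0.5) == E[test_idx]))         # per-fold accuracy
    return float(np.mean(mspe)), float(np.mean(acc))
```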
Figure 3 depicts the 10-fold cross-validation errors based on the selected number of covariates (model size). The ${\text{MSPE}_{CV}}$ is plotted against model sizes ranging from $|k|=1$ to 6, with lower values indicating better predictive performance. It is evident that HSGT and HEGT yield the smallest ${\text{MSPE}_{CV}}$, although SProbit has the smallest MSPE when the number of active covariates is assumed to be one. The overall out-of-sample prediction accuracy in Figure 4 shows that the two t-link models demonstrate superior classification accuracy compared to the logit and probit methods, except at model size 1.
Figure 5 illustrates the performance measured by AUC (Area Under the Curve) across varying model sizes. The plot shows that HSGT consistently achieves the highest AUC values as the model size increases, attaining the highest AUC from model size 3 to 6. The second-best performing model is HEGT. These results demonstrate the robustness of the t-link model and the practical performance of the scalable skinny Gibbs algorithm.
6 Discussion
In this study, we establish theoretical consistency results for the scalable algorithm implementing the t-link (HSGT) in high-dimensional variable selection. The performance of different links and algorithms is illustrated through simulation studies and an application to a real dataset. The t-link model is preferable to the probit link, given its robustness in variable selection and prediction in the presence of outlying cases. The HSGT method also offers competitive computational efficiency. As observed in [37], HSGT has lower variability of the MCMC chains for active variables than HEGT and thus better performance in variable selection. Our application to real data further supports the effectiveness of HSGT. This outcome underscores its utility in handling high-dimensional sparse data, a common characteristic of modern datasets.
Our work opens the door to several promising research directions. First, it would be valuable to further investigate the interplay between link function choice and variable selection in high-dimensional data settings. While [35] explored this interplay using reversible jump Markov chain Monte Carlo (RJMCMC) to study the simultaneous selection of link functions and variables, we aim to address this question in high-dimensional settings by removing restrictions on the degrees of freedom ν. Such an investigation would enhance the theoretical foundations and practical applications of Bayesian methods in modern data science.
Additionally, we are interested in establishing consistency results for more general link function models. We also plan to explore the use of Beta priors on q, as suggested by [41] and [11], and implement scalable spike-and-slab algorithms introduced by [4]. Another promising avenue is extending the methods to mixed-effects models, leveraging the normal scale mixture prior, as discussed by [45].
Robustness in generalized linear models can be pursued through the choice of link function or through robust estimation procedures such as M-estimators with adaptive weighting. In this study, we focus on the former. The weighting-based methods [23, 20] achieve robustness by downweighting observations with large residuals or leverage, thereby directly limiting the impact of contamination. Robustness by link modification instead stabilizes inference through the latent structures associated with the chosen link. For example, [19] showed that the t-link’s influence function is bounded and its weight function down-weights extreme observations, thereby providing robust estimation under contamination and outperforming the probit or logit links in such settings. While weighting-based methods offer more flexibility, alternative links provide robustness while retaining standard estimation procedures.