1 Introduction
Analysis of data in the presence of model uncertainty is a critical problem in statistical modeling applications. Accounting for model uncertainty, rather than selecting a single statistical model, improves predictive performance and robustness in estimation and inference of model parameters [40, 37].
One common instance of model uncertainty is variable selection in the linear regression model. Given an n-dimensional continuous response variable, $\boldsymbol{Y}$, and a set of p potential predictor variables $\boldsymbol{X}=({X_{1}},\dots ,{X_{p}})\in {\mathbb{R}^{n\times p}}$, the aim is to carry out statistical analysis of the data when it is not known in advance which of the ${2^{p}}$ possible models is most appropriate. Consider a binary indexing vector $\boldsymbol{\gamma }=({\gamma _{1}},{\gamma _{2}},\dots ,{\gamma _{p}})$ for the model space which indicates which explanatory variables are included in a model ${\mathcal{M}_{\boldsymbol{\gamma }}}$. Under ${\mathcal{M}_{\boldsymbol{\gamma }}}$, the linear regression model can then be expressed as:
\[ {\mathcal{M}_{\boldsymbol{\gamma }}}:\boldsymbol{Y}={\mathbf{1}_{n}}\alpha +{X_{\boldsymbol{\gamma }}}{\boldsymbol{\beta }_{\boldsymbol{\gamma }}}+\boldsymbol{\epsilon },\]
where $\boldsymbol{\epsilon }\sim \mathcal{N}(\mathbf{0},{\sigma ^{2}}{I_{n}})$, ${X_{\boldsymbol{\gamma }}}$ is an $n\times {p_{\boldsymbol{\gamma }}}$ matrix in which each column is centered around its mean, and ${p_{\boldsymbol{\gamma }}}$ denotes the number of explanatory variables in the model ${\mathcal{M}_{\boldsymbol{\gamma }}}$.
The Bayesian framework provides a straightforward way to account for model uncertainty by treating the model as a parameter itself, using Bayesian model averaging (BMA) [29, 38, 20, 34]. BMA requires the specification of a model space prior and a parameter space prior. However, subjective elicitation of these priors is often not feasible, particularly when p is large, motivating the use of default reference priors.
Several default parameter prior choices have been proposed in the last thirty years (see Porwal and Raftery [37] for an overview), and several comparisons of these methods have been carried out [4, 8, 11, 13, 15, 32]. However, similar comparisons of default model space priors remain limited. In this article, our focus is on understanding the effect of model space priors on the statistical tasks of parameter estimation, interval estimation, statistical inference, and point and interval prediction.
We compare combinations of three default parameter priors with eight choices of model space priors that have been advocated in the literature. These model space priors correspond to different flavors of Bayesian inference: (i) fixed hyper-parameter choices, (ii) fully Bayesian treatment of hyper-parameters, and (iii) empirical Bayes (EB) estimation of hyper-parameters. The comparison is carried out through an extensive simulation study closely based on 14 real datasets that span a wide range of practical data analysis situations.
The article is organized as follows. Section 2 provides a brief review of BMA and discusses in detail the parameter and model prior choices considered in this article. We discuss the simulation design, metrics and datasets used for comparison, and the results in Section 3, followed by discussion and concluding remarks in Section 4.
2 Bayesian Model Averaging: A Review
Bayesian model averaging [24, 28] provides a formal way to account for model uncertainty in statistical inference. Several reviews of the BMA literature are available [28, 25, 13, 50, 7, 15]. The basic idea of BMA is to place prior probabilities on the models considered and to use Bayes’ theorem to account for model uncertainty by computing their posterior probabilities.
Assuming that there is one true model among the set of ${2^{p}}$ candidate models, the posterior probability of a model ${\mathcal{M}_{\boldsymbol{\gamma }}}$ is
\[ P({\mathcal{M}_{\boldsymbol{\gamma }}}|\boldsymbol{Y})=\frac{P(\boldsymbol{Y}|{\mathcal{M}_{\boldsymbol{\gamma }}})P({\mathcal{M}_{\boldsymbol{\gamma }}})}{{\textstyle\sum _{{\boldsymbol{\gamma }^{\mathbf{\prime }}}}}P(\boldsymbol{Y}|{\mathcal{M}_{{\boldsymbol{\gamma }^{\prime }}}})P({\mathcal{M}_{{\boldsymbol{\gamma }^{\mathbf{\prime }}}}})},\]
where $P({\mathcal{M}_{\boldsymbol{\gamma }}})$ is the prior model probability of ${\mathcal{M}_{\boldsymbol{\gamma }}}$, and $P(\boldsymbol{Y}|{\mathcal{M}_{\boldsymbol{\gamma }}})$ is the marginal likelihood of the model after integrating out parameters with respect to the prior ${\pi _{\boldsymbol{\gamma }}}$, namely:
\[\begin{aligned}{}P(\boldsymbol{Y}|{\mathcal{M}_{\boldsymbol{\gamma }}})=\int \mathcal{N}(\boldsymbol{Y}|{\mathbf{1}_{n}}\alpha +& {X_{\boldsymbol{\gamma }}}{\boldsymbol{\beta }_{\boldsymbol{\gamma }}},{\sigma ^{2}}{I_{n}})\\ {} & {\pi _{\boldsymbol{\gamma }}}({\boldsymbol{\beta }_{\boldsymbol{\gamma }}},\alpha ,{\sigma ^{2}})d{\boldsymbol{\beta }_{\boldsymbol{\gamma }}}d\alpha d{\sigma ^{2}}.\end{aligned}\]
Under BMA inference, we can express the predictive distribution of a quantity of interest, Δ, such as a parameter or an observable future quantity, as a weighted average of its predictive distributions under the different candidate models:
\[ P(\Delta |\boldsymbol{Y})=\sum \limits_{\boldsymbol{\gamma }}P(\Delta |{\mathcal{M}_{\boldsymbol{\gamma }}})P({\mathcal{M}_{\boldsymbol{\gamma }}}|\boldsymbol{Y}),\]
where the posterior model probabilities $P({\mathcal{M}_{\boldsymbol{\gamma }}}|\boldsymbol{Y})$ serve as averaging weights. In the case where Δ is a regression coefficient, the resulting posterior distribution, $P(\Delta |\boldsymbol{Y})$, is a mixture of a point mass at 0 and a continuous density.
BMA has several desirable theoretical properties [39]. When choosing between two models, one of which is nested within the other, choosing the one with the higher posterior probability minimizes the total error rate (the sum of the Type I and Type II error probabilities); BMA point estimators and predictions minimize mean squared error (MSE); BMA estimation and prediction intervals are calibrated; and BMA predictive distributions have optimal performance in the log score sense.
The next subsection discusses the choice of parameter and model space priors that need to be specified by the user when implementing BMA.
2.1 Choice of Parameter Priors
Despite the wide adoption of Bayesian methods in linear models, prior elicitation for linear models is still an open problem. The parameter prior distribution ${\pi _{\boldsymbol{\gamma }}}$ can be expressed as
\[ {\pi _{\boldsymbol{\gamma }}}({\boldsymbol{\beta }_{\boldsymbol{\gamma }}},\alpha ,{\sigma ^{2}})={\pi _{\boldsymbol{\gamma }}}({\boldsymbol{\beta }_{\boldsymbol{\gamma }}}|\alpha ,{\sigma ^{2}}){\pi _{\boldsymbol{\gamma }}}(\alpha ,{\sigma ^{2}}).\]
A standard Jeffreys’ prior is often used for the intercept and error variance, which are common to all models considered, i.e. ${\pi _{\boldsymbol{\gamma }}}(\alpha ,{\sigma ^{2}})\propto {\sigma ^{-2}}$. One of the most popular priors for the model parameters ${\boldsymbol{\beta }_{\boldsymbol{\gamma }}}$ is Zellner’s g-prior [53], namely
\[ {\boldsymbol{\beta }_{\boldsymbol{\gamma }}}|\alpha ,{\sigma ^{2}}\sim \mathcal{N}\left(\mathbf{0},g{\sigma ^{2}}{({X_{\boldsymbol{\gamma }}^{T}}{X_{\boldsymbol{\gamma }}})^{-1}}\right).\]
This is popular because of its computational efficiency in evaluating marginal likelihoods and performing model search. It is also attractive because of its intuitive interpretation, arising from the analysis of a conceptual sample generated using the same design matrix ${X_{\boldsymbol{\gamma }}}$ as in the data at hand. It is a special case of the spike-and-slab family, with the slab density given by the Normal density above and the spike being a point mass at zero. Also, g-priors are appealing in variable selection problems since they require the user to specify only one value (or hyper-prior) for the scalar hyper-parameter g. This controls the prior variance of the parameters; the effective prior sample size is $n/g$. Several choices of g have been proposed [6, 13, 19, 23, 28, 32, 48, 54].
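For illustration, the unit-information choice $g=n$ corresponds to an effective prior sample size of $n/g=1$, while the choice $g=\sqrt{n}$ used below corresponds to an effective prior sample size of $n/\sqrt{n}=\sqrt{n}$.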
Based on an extensive simulation study, Porwal and Raftery [37] found that in comparing parameter priors for BMA, three adaptive g-priors performed the best among many popular choices across the statistical tasks of parameter estimation, interval estimation, model inference, point prediction and interval prediction. In what follows, we shall focus only on these three parameter prior choices, namely:
• $\boldsymbol{g}\mathbf{=}\sqrt{\boldsymbol{n}}$: First proposed by [13], it corresponds to a prior sample size equal to $\sqrt{n}$ and has been found to work well in high dimensional settings [52]. The complexity penalty for a model using this specification is effectively half that in the BIC [37].
• EB-local: An alternative to fixing g to a pre-specified value is to estimate it from the data in an empirical Bayes (EB) manner. The local EB approach estimates a different g for each model. Let $P(\boldsymbol{Y}|{\mathcal{M}_{\boldsymbol{\gamma }}},g)$ denote the marginal likelihood of the data under a g-prior. Then\[ {\hat{g}_{\boldsymbol{\gamma }}}=\underset{g\ge 0}{\operatorname{arg\,max}}P(\boldsymbol{Y}|{\mathcal{M}_{\boldsymbol{\gamma }}},g).\]For a linear model, Hansen and Yu [23] showed that this reduces to ${\hat{g}_{\boldsymbol{\gamma }}}=\max \{{F_{\boldsymbol{\gamma }}}-1,0\}$, where ${F_{\boldsymbol{\gamma }}}$ is the F statistic for testing ${\boldsymbol{\beta }_{\boldsymbol{\gamma }}}=\mathbf{0}$ (a small numerical sketch is given after this list).
• Hyper-g: A natural Bayesian way to account for uncertainty about the scale parameter g is to specify a hyper-prior for g and perform fully Bayesian inference. Liang et al [32] proposed the hyper-g prior
\[ \pi (g)=\frac{a-2}{2}{(1+g)^{-a/2}},\hspace{1em}g\gt 0,\]
which is proper for $a\gt 2$. Liang et al [32] recommended $a=3$ as a default choice for the hyper-g prior. One advantage of using a hyper-g prior is that the posterior distribution of g given the model ${\mathcal{M}_{\boldsymbol{\gamma }}}$ is available in closed form, simplifying Bayesian inference.
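Returning to the EB-local item above, the following is a minimal R sketch (our own illustration, not code from the paper) of the closed-form local EB estimate ${\hat{g}_{\boldsymbol{\gamma }}}=\max \{{F_{\boldsymbol{\gamma }}}-1,0\}$ for a single model, using the overall F statistic from an ordinary least-squares fit:
set.seed(1)
n <- 100
X_gamma <- matrix(rnorm(n * 3), n, 3)                  # design matrix of the model under consideration
y <- drop(1 + X_gamma %*% c(0.5, 0, -0.3) + rnorm(n))  # simulated response
fit <- lm(y ~ X_gamma)                                 # OLS fit of model M_gamma
F_gamma <- summary(fit)$fstatistic["value"]            # F statistic for testing beta_gamma = 0
g_hat <- max(F_gamma - 1, 0)                           # local EB estimate of g
g_hat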
In terms of theoretical properties, all three priors are model-selection consistent [32], except when the true model is the null model. This means that if the true model, denoted by ${\mathcal{M}_{{\boldsymbol{\gamma }_{T}}}}$ belongs to the model space, then the posterior probability of the true model, $P({\mathcal{M}_{\boldsymbol{\gamma }}}={\mathcal{M}_{{\boldsymbol{\gamma }_{T}}}}|\boldsymbol{y})\to 1$ as the sample size $n\to \infty $. None of the above priors suffers from Bartlett’s paradox [1]. BMA with $g=\sqrt{n}$ is subject to the “information paradox”, but it has been argued that information consistency is of little practical importance in real data applications [37]. The EB-local and Hyper-g BMA methods are not subject to the information paradox.
2.2 Choice of Model Space Priors
Model space priors require specification of the prior probabilities of all models ${\mathcal{M}_{\boldsymbol{\gamma }}}$, indexed by the binary variable inclusion vector $\boldsymbol{\gamma }$. A common approach is to consider the inclusion of each variable as an independent and exchangeable Bernoulli random variable with a common prior probability of inclusion θ, i.e.
(2.1)
\[ p({\mathcal{M}_{\boldsymbol{\gamma }}}|\theta )={\prod \limits_{i=1}^{p}}{\theta ^{{\gamma _{i}}}}{(1-\theta )^{1-{\gamma _{i}}}}={\theta ^{{p_{\boldsymbol{\gamma }}}}}{(1-\theta )^{p-{p_{\boldsymbol{\gamma }}}}},\]
where θ is the prior expected fraction of the ${\beta _{j}}$’s that are not zero and ${p_{\boldsymbol{\gamma }}}={\textstyle\sum _{i=1}^{p}}{\gamma _{i}}$ is the total number of covariates included in the model ${\mathcal{M}_{\boldsymbol{\gamma }}}$.
In the absence of prior information, a common choice is to set $\theta =0.5$. This induces a uniform prior over all models with $p({\mathcal{M}_{\boldsymbol{\gamma }}})={2^{-p}}$ for all $\boldsymbol{\gamma }\in {\{0,1\}^{p}}$, where p is the total number of covariates considered. The expected prior model size under the uniform model prior is $p/2$. However, choosing $\theta =0.5$ does not provide any multiplicity control [19].
For a fixed value of θ, the above prior induces a binomial prior for model size $S={\textstyle\sum _{i=1}^{p}}{\gamma _{i}}$ i.e. $S\sim Bin(p,\theta )$, with prior mean $p\theta $ and prior variance $p\theta (1-\theta )$. Another way to specify θ is by using the researcher’s prior belief about expected model size $E[S]$. Sala-i-Martin et al [46] (hereafter SDM) recommended a prior expected model size of 7 based on their growth regression analysis. Similar priors for expected model size have also been proposed elsewhere [45, 30]. Hence, we can define an SDM version of the above prior with ${\theta _{SDM}}=7/p$.
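For illustration, for the Boston Housing dataset analyzed in Section 3 ($p=103$), the SDM choice gives ${\theta _{SDM}}=7/103\approx 0.068$, so that the induced model size prior is
\[ S\sim Bin(103,0.068),\hspace{1em}E[S]=7,\hspace{1em}Var(S)=7\times \frac{96}{103}\approx 6.5,\]
a distribution that is quite tightly concentrated around its mean.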
Any fixed choice of θ can lead to rather informative priors on model size ${p_{\boldsymbol{\gamma }}}$. One way to address this issue is to estimate θ from the data using an empirical Bayes (EB) approach [19]. The EB approach involves maximizing the marginal likelihood of the data given θ:
(2.2)
\[ {\hat{\theta }_{EB}}=\underset{\theta \in [0,1]}{\operatorname{arg\,max}}\sum \limits_{\boldsymbol{\gamma }}P(\boldsymbol{Y}|{\mathcal{M}_{\boldsymbol{\gamma }}})P({\mathcal{M}_{\boldsymbol{\gamma }}}|\theta ).\]
However, maximization of (2.2) can be computationally challenging, especially when p is large, since the sum is over all models. Moreover, when p is large, marginal likelihood evaluation for all models is not feasible and the sum is approximated based on the models explored by MCMC. To optimize the marginal likelihood in (2.2), we implement Algorithm 1, iterating between fitting the BMA approach to find likely models given θ and solving (2.2) using the fitted models to find a new θ.
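Algorithm 1 is not reproduced here; the following R sketch shows one possible implementation of the iteration just described, under the assumptions that the fitted BAS object exposes the sampled models’ log marginal likelihoods and sizes through its logmarg and size fields, that the size field counts the intercept, and that models not visited by the MCMC run can be ignored in the sum:
library(BAS)

log_sum_exp <- function(x) { m <- max(x); m + log(sum(exp(x - m))) }

eb_theta <- function(formula, data, theta0 = 0.5, max_iter = 10, tol = 1e-4) {
  p <- ncol(model.matrix(formula, data)) - 1       # number of candidate covariates
  theta <- theta0
  for (it in seq_len(max_iter)) {
    # Step 1: run BMA with the current theta to collect likely models
    fit <- bas.lm(formula, data = data,
                  prior = "g-prior", alpha = sqrt(nrow(data)),
                  modelprior = Bernoulli(theta),
                  method = "MCMC", MCMC.iterations = 10000)
    logmarg <- fit$logmarg                         # log P(Y | M_gamma) for sampled models
    size <- fit$size - 1                           # covariates per model (assumes size counts the intercept)
    # Step 2: maximize the approximate marginal likelihood (2.2) over theta
    objective <- function(th)
      log_sum_exp(logmarg + size * log(th) + (p - size) * log(1 - th))
    theta_new <- optimize(objective, interval = c(1e-6, 1 - 1e-6), maximum = TRUE)$maximum
    if (abs(theta_new - theta) < tol) { theta <- theta_new; break }
    theta <- theta_new
  }
  theta
}
# Example call, for a hypothetical data frame df with response y:
# theta_eb <- eb_theta(y ~ ., data = df)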
An alternative way to reduce the sensitivity of the posterior distribution to prior assumptions is to use hierarchical modeling and specify a weak hyper-prior for θ. One choice of such a hyper-prior is a Beta distribution, $\theta \sim \text{Beta}(a,b)$. Marginalizing out θ in (2.1), gives
(2.3)
\[\begin{aligned}{}P({\mathcal{M}_{\boldsymbol{\gamma }}}|a,b)& ={\int _{0}^{1}}p({\mathcal{M}_{\boldsymbol{\gamma }}}|\theta )p(\theta )d\theta \\ {} & =\frac{B({p_{\boldsymbol{\gamma }}}+a,p-{p_{\boldsymbol{\gamma }}}+b)}{B(a,b)},\end{aligned}\]
where $B({a^{\prime }},{b^{\prime }})=\frac{\Gamma ({a^{\prime }})\Gamma ({b^{\prime }})}{\Gamma ({a^{\prime }}+{b^{\prime }})}$ is the Beta function. It thus induces a Beta-Binomial$(a,b)$ prior on the model size S with probability mass function
\[ P(S=s|a,b)=\left(\genfrac{}{}{0pt}{}{p}{s}\right)\frac{B(s+a,p-s+b)}{B(a,b)},\hspace{1em}s=0,1,\dots ,p.\]
Under a uniform prior on θ, i.e. when $a=b=1$, (2.3) simplifies to $p({\mathcal{M}_{\boldsymbol{\gamma }}})=\frac{1}{p+1}{\left(\genfrac{}{}{0pt}{}{p}{{p_{\boldsymbol{\gamma }}}}\right)^{-1}}$. This combines a uniform prior over model size with a uniform prior over the models of the same size given the model size.
Under a Beta-Binomial (BB) prior, the prior expected model size is $E[S]=\frac{a}{a+b}p$. As with the Bernoulli prior, we can elicit the prior in terms of the prior expected model size $E[S]$. To facilitate prior elicitation, we fix $a=1$. We can then define an SDM version of the BB prior (BB-SDM) with an expected prior model size, such as $E[S]=7$, by setting ${b_{SDM}}=\frac{p}{E[S]}-1$. Note that SDM themselves [46] did not use a Beta-Binomial prior on models, but only a Bernoulli prior.
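For illustration, again using the Boston Housing dimensions ($p=103$) with a target prior expected model size of $E[S]=7$, the BB-SDM specification sets
\[ {b_{SDM}}=\frac{p}{E[S]}-1=\frac{103}{7}-1\approx 13.7,\]
i.e. a Beta-Binomial$(1,13.7)$ prior on model size, which is considerably more spread out than the $Bin(103,7/103)$ prior above.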
Alternatively, we can use an EB approach to learn b from the data. This can be done by maximizing the marginal likelihood given b, namely
(2.4)
\[ {\hat{b}_{EB}}=\underset{b\in (0,\infty )}{\operatorname{arg\,max}}\sum \limits_{\boldsymbol{\gamma }}P(\boldsymbol{Y}|{\mathcal{M}_{\boldsymbol{\gamma }}})P({\mathcal{M}_{\boldsymbol{\gamma }}}|a=1,b).\]
We find the optimal value, ${\hat{b}_{EB}}$, using Algorithm 2, which is analogous to Algorithm 1.
For Zellner’s g-prior, we require the model size to be no larger than the number of regression coefficients that can be identified from the data, so that ${p_{\boldsymbol{\gamma }}}\lt n-2$. Thus for higher-dimensional datasets $(p\gt n)$, we require that $P({\mathcal{M}_{\boldsymbol{\gamma }}})=0$ for all models with model size greater than $n-2$. Hence, we use truncated versions of the priors defined in (2.1) and (2.3), namely
(2.5)
\[ p({\mathcal{M}_{\boldsymbol{\gamma }}}|\theta )\propto {\theta ^{{p_{\boldsymbol{\gamma }}}}}{(1-\theta )^{p-{p_{\boldsymbol{\gamma }}}}}\mathbb{1}\{{p_{\boldsymbol{\gamma }}}\le n-2\},\]
(2.6)
\[ P({\mathcal{M}_{\boldsymbol{\gamma }}}|a,b)\propto \frac{B({p_{\boldsymbol{\gamma }}}+a,p-{p_{\boldsymbol{\gamma }}}+b)}{B(a,b)}\mathbb{1}\{{p_{\boldsymbol{\gamma }}}\le n-2\}.\]
Castillo et al [3] introduced complexity priors, also known in the literature as diffusing [35] or power priors [5]. Here the marginal probability of inclusion of any variable decays at the rate ${p^{-\kappa }}$ for some $\kappa \gt 0$, where p is the total number of possible covariates. This specifies a vanishing prior probability of large models and leads to a faster rate of rejection of spurious parameters, at the cost of slower rates of detection of active parameters [44]. Similar priors have also been used elsewhere [51, 35, 42].
The complexity prior is defined as
\[ p({\mathcal{M}_{\boldsymbol{\gamma }}})\propto {p^{-\kappa |\boldsymbol{\gamma }|}}\mathbb{1}\{|\boldsymbol{\gamma }|\le {s_{0}}\},\]
where ${s_{0}}$ is a pre-specified integer specifying the maximum number of important covariates and $|\boldsymbol{\gamma }|$ is the model size. In the absence of external information, we set ${s_{0}}=\min \{n-2,p\}$. This prior is implemented in the BAS package as tr.power.prior(kappa,trunc). We implement complexity priors with $\kappa =1$ [43, 44], and with $\kappa =2$, which is the default choice in the BAS package.
2.3 Model Space Priors – A Graphical Illustration
To illustrate the effect of different model space priors, we use two datasets from our analysis: Boston Housing $(n=506,p=103)$ and Nutrimouse $(n=40,p=120)$ (Figure 1). The solid lines show the independent Bernoulli prior from (2.1), while the dashed lines represent the Beta-Binomial prior in (2.3) and the dash-dotted lines illustrate Complexity priors. For the Nutrimouse dataset $(p\gt n)$, we use the truncated versions as discussed above. The colors group different flavors of methods: (i) Uniform versions with $\theta =0.5$ or $b=1$ (blue), (ii) SDM versions with expected prior model size 7 (red), (iii) EB versions with θ or b learned from the data (green), (iv) Complexity prior with $\kappa =1$ (orange), and (v) Complexity prior with $\kappa =2$ (brown).
The Bernoulli model space priors are very concentrated around their mean, $p\theta $. The complexity priors are concentrated around smaller model sizes, with a mode at 0. The BB priors, on the other hand, are more diffuse, implying more prior uncertainty about model size. For the Nutrimouse dataset, all the model space priors assign zero probability to any model with size greater than $(n-2)=38$. Among the Bernoulli versions, $\theta =0.5$ implies a prior mode around $\min \{p/2,n-2\}$, while Ber$({\theta _{SDM}})$ has a prior modal model size of 7 (the same as the prior mean). The prior model size distribution induced by the Ber$({\theta _{EB}})$ prior adapts to the data, with a prior mode between those of the $\theta =0.5$ and ${\theta _{SDM}}$ priors for the Boston Housing dataset, and the lowest prior mode among the Bernoulli priors considered for the Nutrimouse dataset. The BB$(1,1)$ prior corresponds to a uniform prior over model size. The BB$(1,{b_{SDM}})$ and BB$(1,{b_{EB}})$ priors both induce a model size distribution with prior mode at zero.
Table 1
Summary of prior moments of model size S under different model space priors and BAS code to implement them.
Model prior | E[S] | Var(S) | BAS code |
Bernoulli$(\theta )$ | $p\theta $ | $p\theta (1-\theta )$ | bas.lm($\dots \hspace{0.1667em}$, model.prior=bernoulli(probs)) |
Beta-Binomial$(a,b)$ | $\frac{pa}{a+b}$ | $\frac{pab(a+b+p)}{{(a+b)^{2}}(a+b+1)}$ | bas.lm($\dots \hspace{0.1667em}$, model.prior=beta.binomial(alpha,beta)) |
Complexity$(\kappa )$ | – | – | bas.lm($\dots \hspace{0.1667em}$, model.prior=tr.power.prior(kappa,trunc)) |
3 Numerical Comparison
We investigate the performance of different combinations of model space priors and parameter priors using an extensive simulation study based closely on real datasets. We evaluate the effect of prior choices on the statistical tasks of parameter point and interval estimation, inference, and point and interval prediction, and we also record computation time.
All the parameter and model space prior combinations were implemented using the BAS R package [5], with skeleton code shown in Table 1. Model space exploration uses the MC3 Metropolis-Hastings algorithm for sampling from the posterior distribution over models [40], combined with a random swap between a currently included and a currently excluded variable. This is implemented by setting the option method="MCMC" in the bas.lm() function. We used a default of 10,000 MCMC iterations for all methods.
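For concreteness, a minimal sketch of the reference fit used below ($g=\sqrt{n}$ with a BB$(1,1)$ model space prior) might look as follows; argument names reflect our reading of the BAS package, and the data frame df with response y is a placeholder:
library(BAS)
fit <- bas.lm(y ~ ., data = df,
              prior = "g-prior", alpha = sqrt(nrow(df)),   # g = sqrt(n)
              modelprior = beta.binomial(1, 1),            # BB(1,1) model space prior
              method = "MCMC", MCMC.iterations = 10000)    # MC3 + random-swap exploration
summary(fit)                                               # posterior inclusion probabilities, top models
pred <- predict(fit, estimator = "BMA", se.fit = TRUE)     # BMA point predictions with uncertainty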
For the EB methods, we used Algorithm 1 (or 2) to find the optimal ${\theta _{EB}}$ (or ${b_{EB}}$) before fitting a BAS model with the estimated hyperparameter value. For higher-dimensional datasets $(p\gt n)$, the truncated version of the Beta-Binomial prior (2.6) was implemented by setting the option model.prior="tr.beta.binomial(alpha,beta,trunc=n-2)" in BAS. Similarly, a truncated version of the complexity prior is implemented in the BAS package. A truncated version of the Bernoulli prior (2.5) is not available in BAS. We implemented it by (i) running bas.lm() with tr.beta.binomial(1,1,trunc=n-2), and then (ii) using importance sampling to calculate updated posterior model probabilities, with weights proportional to the ratio of the prior model space densities in (2.5) and (2.6).
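A minimal sketch of step (ii), assuming the BAS fit object exposes posterior model probabilities and model sizes through its postprobs and size fields (with size counting the intercept):
reweight_to_tr_bernoulli <- function(fit, theta, p) {
  size <- fit$size - 1                                       # covariates in each sampled model
  # log prior under the truncated Bernoulli prior (2.5), up to a constant;
  # sampled models already satisfy the size <= n - 2 truncation
  log_prior_bern <- size * log(theta) + (p - size) * log(1 - theta)
  # log prior under the truncated Beta-Binomial(1,1) prior (2.6), up to a constant
  log_prior_bb <- lbeta(size + 1, p - size + 1)
  w <- fit$postprobs * exp(log_prior_bern - log_prior_bb)    # importance reweighting
  w / sum(w)                                                 # updated posterior model probabilities
}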
3.1 Datasets
We based our analysis on 14 publicly available datasets, six of which are available from the UCI Machine Learning Repository, while the others are examples available in the literature. The sample size and number of candidate variables, along with the data sources for the different datasets, are listed in Table 2. These include the classical statistical setting with $n\gt p$, high-dimensional datasets with $p\gt n$, and intermediate settings where $n\approx p$. For each dataset, continuous covariates are standardized to have zero mean and variance 1, and the response variable is centered to have zero mean. Details of the choice of datasets and additional pre-processing can be found in [37].
Table 2
Datasets used in the study.
Dataset Name | Sample size (N) | Covariates (p) | Source |
College | 777 | 14 | ISLR [27] |
Bias Correction-Tmax | 7590 | 21 | UCI ML repository |
Bias Correction-Tmin | 7590 | 21 | UCI ML repository |
SML2010 | 1373 | 22 | UCI ML repository |
Bike sharing-daily | 731 | 28 | UCI ML repository |
Bike sharing-hourly | 17379 | 32 | UCI ML repository |
Superconductivity | 21263 | 81 | UCI ML repository |
Diabetes | 442 | 64 | spikeslab [26] |
Ozone | 330 | 44 | gss [22] |
Boston housing | 506 | 103 | mlbench [36] |
NIR | 166 | 225 | chemometrics [14] |
Nutrimouse | 40 | 120 | mixOmics [41] |
Multidrug | 60 | 853 | mixOmics [41] |
Liver toxicity | 64 | 3116 | mixOmics [41] |
3.2 Simulation Design
For each dataset, we selected a data generating model that closely approximates the real dataset. For datasets with $p\lt 30$, we carried out all-subsets regression using the leaps package [33] in R and selected the largest model with all variables significant at the 0.05 level. For datasets with $p\gt 30$, all-subsets regression is not computationally feasible. For these datasets, we obtained a filtered list of variables using iterative sure independence screening [12]. If the filtered list contained more than 30 variables, we selected the 30 variables with the highest ${R^{2}}$ values under univariate regression. All-subsets regression was then applied to the filtered list of covariates to obtain the data generating model for our study. A summary of the data generating model used for each dataset can be found in the supplementary materials.
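A rough R sketch of the all-subsets step, under our reading of the procedure above (leaps::regsubsets returns one best model per size, and we keep the largest size at which every coefficient is significant at the 0.05 level):
library(leaps)
select_dgm <- function(X, y) {
  ss <- summary(regsubsets(x = X, y = y, nvmax = ncol(X), method = "exhaustive"))
  chosen <- NULL
  for (k in seq_len(nrow(ss$which))) {
    vars <- setdiff(colnames(ss$which)[ss$which[k, ]], "(Intercept)")
    fit <- lm(y ~ ., data = data.frame(y = y, X[, vars, drop = FALSE]))
    if (all(summary(fit)$coefficients[-1, "Pr(>|t|)"] < 0.05)) chosen <- vars
  }
  chosen                                   # largest model in which all coefficients are significant
}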
We used the data generating model and parametric bootstrapping to generate 100 bootstrapped datasets with the same design matrix $\boldsymbol{X}$ and different simulated response vectors $\boldsymbol{Y}$. Each of the resulting simulated datasets had the same design matrix and error distribution as the real dataset on which it was based.
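A minimal sketch of the parametric bootstrap step, assuming the selected data generating model is available as an lm fit (dg_fit) on the original data:
simulate_responses <- function(dg_fit, n_boot = 100) {
  mu <- fitted(dg_fit)                      # X beta_DG + intercept, design matrix held fixed
  sigma <- summary(dg_fit)$sigma            # residual standard error of the data generating model
  replicate(n_boot, mu + rnorm(length(mu), mean = 0, sd = sigma))   # one column per bootstrapped dataset
}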
We compared the performance of different parameter and model space prior combinations on these simulated datasets using the following metrics:
• PointEst: We use root mean squared error (RMSE) as a metric to evaluate the parameter point estimation performance of different combinations. RMSE is evaluated as
(3.1)
\[ RMSE=\sqrt{\frac{1}{p}{\sum \limits_{i=1}^{p}}{({\beta _{i,DG}}-{\hat{\beta }_{i}})^{2}}},\]
where ${\beta _{i,DG}}$ is the coefficient of the i-th covariate in the data generating model and ${\hat{\beta }_{i}}$ is its BMA estimate.
• IntEst: The interval score (IS) [21] evaluates the performance of interval estimators in terms of both their coverage and their width. It is the sum of two terms, the first of which penalizes wide intervals while the second penalizes failures of coverage. For a variable z, the IS is
(3.2)
\[ \begin{aligned}{}I{S_{\alpha }}(l,u,z)=(u-l)+& \frac{2}{\alpha }(l-z)\mathbb{1}\{z\lt l\}\\ {} & +\frac{2}{\alpha }(z-u)\mathbb{1}\{u\lt z\},\end{aligned}\]
where l and u are the lower and upper endpoints of the $(1-\alpha )\times 100\% $ interval and $\mathbb{1}\{\cdot \}$ is the indicator function.
• Inference: We calculate the area under the precision recall curve (AUPRC) using the posterior inclusion probabilities of the covariates to evaluate the model selection performance of different combinations of priors. We assess the quality of the resulting inference using ($1-$AUPRC) as our metric, where a lower value is better.
We also compared methods based on their out-of-sample predictive performance. We divided each dataset into 100 random 75%–25% train-test splits. We trained the methods on the training data and used the test data to assess the predictive performance using the metrics described below; a short R sketch of the evaluation metrics is given after this list:
• Prediction: We calculate ${R_{test}^{2}}$ to evaluate the accuracy of point predictions as follows:
(3.3)
\[ {R_{test}^{2}}=1-\frac{{\textstyle\sum _{i\in test}}{({y_{i}}-\hat{{y_{i}}})^{2}}}{{\textstyle\sum _{i\in test}}{({y_{i}}-{\bar{y}_{train}})^{2}}},\]
where $\hat{{y_{i}}}$ is the BMA point prediction for test observation i and ${\bar{y}_{train}}$ is the mean of the response over the training set.
• IntPred: To assess the quality of the prediction intervals, we calculate the interval score using (3.2) for each of the test set observations. Here, l and u represent the lower and upper bounds of the $(1-\alpha )\times 100\% $ posterior predictive interval for the test observation. We calculate the mean interval score (MIS), averaging IS over test set observations for each of the train-test splits. A lower MIS corresponds to a better interval forecast.
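The following minimal R sketch implements these evaluation metrics: the RMSE in (3.1), the interval score in (3.2) together with its test-set average (MIS), and ${R_{test}^{2}}$ in (3.3).
rmse <- function(beta_dg, beta_hat) sqrt(mean((beta_dg - beta_hat)^2))      # (3.1)

interval_score <- function(l, u, z, alpha = 0.05) {                         # (3.2)
  (u - l) + (2 / alpha) * (l - z) * (z < l) + (2 / alpha) * (z - u) * (z > u)
}
mis <- function(l, u, y_test, alpha = 0.05) mean(interval_score(l, u, y_test, alpha))

r2_test <- function(y_test, y_hat, ybar_train) {                            # (3.3)
  1 - sum((y_test - y_hat)^2) / sum((y_test - ybar_train)^2)
}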
We also recorded the average size of the sampled models for each dataset and the average CPU time (in seconds) to carry out BMA for one bootstrapped dataset.
3.3 Results
The results are shown in Table 3. We used the combination of the g-prior with $g=\sqrt{n}$ as the parameter prior and the Beta-Binomial$(1,1)$ model space prior as the reference. Note that the g-prior with $g=\sqrt{n}$ was found to be the best parameter prior by [37]. Metrics for all other combinations were calculated relative to the reference metric and averaged across datasets. Detailed results of performance metrics for the simulation studies based on each of the 14 datasets can be found in the Supplementary materials. The “Score” column contains the average of the scores for PointEst, IntEst, Inference, Prediction and IntPred under each method. We used the Score column to rank the methods.
For each metric, we color the methods based on their performance relative to the reference metric. A method is colored green if it performed similarly or better than the reference method, yellow if it performed somewhat worse, and orange if it performed substantially worse.
For all choices of parameter prior, Beta-Binomial$(1,1)$ was the top scoring model space prior. The three Beta-Binomial versions with $g=\sqrt{n}$ were the top three methods across statistical tasks. The Complexity priors with $\kappa =1$ and $\kappa =2$ were the worst performing model space priors. The uniform prior denoted by Ber$(\theta =0.5)$ also performed less well than the Beta-Binomial priors. This ranking of methods was consistent across different performance metrics.
Most parameter and model prior combinations selected sparser models than the $g=\sqrt{n}$ and BB$(1,1)$ prior combination, with the exception of methods involving the uniform model space prior. The complexity priors selected very sparse models compared to our baseline, which may be explained by the strong sparsity induced by the prior. This may also explain the poor performance of the complexity priors across statistical tasks. Notably, the rankings of the prior combinations are similar for the different tasks. In particular, the rankings for prediction are consistent with those for point estimation and parameter inference, with a correlation of 0.77 between scores for point estimation and point prediction.
We also note that the EB model space priors tended to outperform the corresponding SDM model space priors when combined with the Hyper-g and EB-local parameter priors. However, the results with the EB model space priors took longer to compute on average because of the optimization procedure. The hyper-g parameter priors are the slowest because of the integral calculations required in the posterior computation. In general, the Beta-Binomial priors performed better than the Bernoulli and complexity priors.
Table 3
Performance of different parameter prior and model space prior combinations for inference in linear regression under model uncertainty: “PointEst” is the RMSE for point estimation, “IntEst” is the Mean Interval Score (MIS) for interval estimation, “Inference” is one minus the area under the precision-recall curve (AUPRC), “Prediction” is the RMSE for point prediction, while “IntPred” is the MIS for interval prediction. “N vars” is the average number of variables used for the task. All metrics are standardized to equal 1 for the $g=\sqrt{n}$ prior with the BB$(1,1)$ prior on model space. For each column, lower values are better.
4 Discussion
We have compared BMA techniques with different choices of model space priors and parameter priors using an empirical study based closely on real datasets. We found that the Beta-Binomial$(1,1)$ model space prior performed the best across various statistical tasks and choices of parameter priors. We found that the hierarchical model space priors with a hyper-prior on the prior inclusion probability θ were more diffuse and led to more efficient exploration of the model space. Fixed choices of θ led to worse performance across statistical tasks and were often quite concentrated. Complexity priors, which induce strong sparsity on model complexity, performed worst among all the methods considered.
We are not the first to compare model space priors in the presence of model uncertainty. Past comparisons have either focused on a subset of the model priors discussed here, or evaluated BMA methods for only a subset of the statistical tasks considered here. In several cases, they also tended to use simulation designs that are at best loosely related to empirical data observed in practice.
Ley and Steel [31] evaluated the effect of different model priors on model selection performance using three real economic growth regressions datasets. However, they used only two fixed choices of g-priors: the Unit Information prior (UIP) with $g=n$ [28] and the risk inflation criterion (RIC) with $g={p^{2}}$ [16], motivated by the simulation study of [13]. Porwal and Raftery [37] found both of these parameter prior choices to be outperformed by the parameter priors used in this study. Also, their comparison was based only on tall datasets $(n\gt p)$ and their comparison of methods was limited to the statistical tasks of inference and probabilistic prediction using the log-predictive score. They also did not consider EB versions of the Binomial$(p,\theta )$ and Beta-Binomial$(1,b)$ model space priors and complexity priors. Like Ley and Steel, we found that random θ versions (or Beta-Binomial versions) performed better since the hierarchical prior is less sensitive to the choice of prior model size $E[S]$. Similarly, they found that priors specified by a fixed θ tended to be quite informative, casting doubt on their appropriateness as default reference priors.
Scott and Berger [47] discussed the multiplicity correction effect of a subset of the model space priors discussed here, specifically Ber$(\theta =0.5)$, Ber$({\theta _{EB}})$ and BB$(1,1)$. They used a non-empirical simulation design and did not compare methods based on the statistical tasks discussed here. Eicher et al [11] compared 12 parameter priors (of which only $g=\sqrt{n}$ is also considered here) along with two fixed model priors: the uniform model prior with $\theta =0.5$ and Ber$({\theta _{SDM}})$ with a prior expected model size of 7. Their comparison was based on non-empirical simulation studies and one real growth regression dataset, using predictive performance and inference measures. They found that the UIP with a uniform model prior performed better than Ber$({\theta _{SDM}})$ on the three statistical tasks we have in common. In contrast, we found that Ber$({\theta _{SDM}})$ ranked higher than the uniform model prior for all three of our preferred parameter priors across the statistical tasks considered.
We found the complexity priors [3] to perform relatively poorly. At first sight, this seems to conflict with the theoretical results of Castillo et al [3], who showed that under certain assumptions the posterior distribution contracts optimally to recover an unknown sparse parameter vector and gives optimal predictions. However, their theoretical results are derived under a spike-and-slab prior with the Laplace distribution as the slab density, and assume that the error variance ${\sigma ^{2}}$ is known, which rarely holds in practice. Also, Rossell [44] argued that complexity priors can introduce very strong sparsity a priori, and showed empirically that when the true model is not sparse, complexity priors may perform suboptimally for finite n. This is consistent with our results.
We have focused attention on independent model priors, i.e. priors in which the inclusion of each variable is statistically independent of that of the other variables. However, non-independent default priors have been proposed as well. George [17, 18] proposed dilution priors which dilute the prior model probability within subsets of similar models with highly correlated predictors. There is also research designing dependent model priors based on domain knowledge [2, 10]. Dellaportas et al [9] proposed a joint specification of the prior distribution across models so that the sensitivity of posterior model probabilities to the dispersion of prior distributions for the parameters of individual models (Lindley’s paradox) is diminished. Villa and Walker [49] assigned prior mass to models on the basis of their worth, based on the KL-divergence between densities under different models. However, all of these dependent model space priors lead to increased computational complexity and have been shown to work only when p is relatively small. They have also not yet been implemented in publicly available software.