1 Introduction
Spatial processes have occupied center stage in statistical theory and applications for the last few decades. Their widespread use can largely be explained by geographically tagged data becoming increasingly commonplace in modern applications. Such data are often composed of complex variables which are no longer amenable to a Gaussian assumption. For example, spatially indexed counts [see for e.g., 61, 9, 2, 39], proportions [see, for e.g., 23, 26, 21, 20], and time-to-event or survival outcomes [see, for e.g. 4, 44, 67] are some frequently occurring variables where spatial processes have proved invaluable in performing uncertainty quantification, the purpose being to quantify unobserved dependence introduced within the variable of interest by varying location. The cornerstone for such studies is a spatially indexed process variable of interest, often termed the response process and denoted by $y(\mathbf{s})$. This is accompanied by covariate information $\mathbf{X}(\mathbf{s})=[{\mathbf{x}_{1}}(\mathbf{s}),{\mathbf{x}_{2}}(\mathbf{s}),\dots ,{\mathbf{x}_{p}}(\mathbf{s})]$. Here $\mathbf{s}\in \mathcal{S}=\{{\mathbf{s}_{1}},{\mathbf{s}_{2}},\dots ,{\mathbf{s}_{L}}\}$ is the spatial index and $\mathcal{S}$ is a finite set of indices, or locations, over which the response variable and covariates are observed. The investigator often encounters observation-level covariates that account for response-specific characteristics when learning such processes. It becomes important to understand which of these covariates are important contributors to variation in the data; from a model parsimony standpoint, model choice therefore becomes an important issue for the investigator. In statistical theory, this problem is often addressed by performing shrinkage or variable selection on the model coefficients. Moreover, performing spatial uncertainty quantification produces accurate inference for model coefficients, which also raises the question of a more “honest” subset of covariates within $\mathbf{X}(\mathbf{s})$ that primarily determines the variation in $y(\mathbf{s})$. The crown jewel of our contribution is a Bayesian methodology for performing spatial uncertainty quantification and model choice simultaneously.
Spatial process modeling generally requires a hierarchical specification of an unobserved random effect within the model [14]. Maintaining a hierarchical approach allows for exploration of the joint effect of the covariates, $\mathbf{X}(\mathbf{s})$, and the random effect $\mathbf{w}(\mathbf{s})$ on the response process $y(\mathbf{s})$. In particular, for generalized linear spatial process modeling, it is assumed that $y(\mathbf{s})\mid \boldsymbol{\beta },\mathbf{w}(\mathbf{s})$ arises in a conditionally independent fashion from a member of the exponential family with mean $\mu (\mathbf{s})$ such that $g(\mu (\mathbf{s}))=\mathbf{X}(\mathbf{s})\boldsymbol{\beta }+\mathbf{F}(\mathbf{s})\mathbf{w}(\mathbf{s})$, where g is a monotonic link function, $\boldsymbol{\beta }$ are model coefficients, or fixed effects, and $\mathbf{F}(\mathbf{s})$ is a spatial incidence matrix. In contrast to Gaussian response processes, where a direct hierarchical specification on the response is feasible, modeling a non-Gaussian spatial process leverages the generalized linear model framework to employ a latent process specification [see for e.g., 64, 17, 16, 65, 5]. This is facilitated by the existence of a valid joint probability distribution, $\pi \left(y({\mathbf{s}_{1}}),y({\mathbf{s}_{2}}),\dots ,y({\mathbf{s}_{L}})\mid \boldsymbol{\beta },{\boldsymbol{\theta }_{pr}}\right)$, where ${\boldsymbol{\theta }_{pr}}$ denotes process parameters required for the specification of $\mathbf{w}(\mathbf{s})$ [see discussion in section 6.2, 5]. This hierarchical specification gives us a natural way to perform variable selection by incorporating shrinkage priors into the hierarchical prior formulation. Several choices of shrinkage priors exist that differ in their prior specification, such as the discrete spike-and-slab prior [see, for e.g., 46, 27], other priors based on the Gaussian-gamma family in linear Gaussian models [see, for e.g., 51, 7], the continuous spike-and-slab prior [see, for e.g., 33], the Bayesian counterparts of the LASSO and elastic net [see, for e.g., 50, 41], mixtures of g–priors [see, for e.g., 42], and the horseshoe prior and its variants [see, for e.g., 13], among several others. We use the continuous spike-and-slab prior to effect shrinkage on model coefficients.
We focus on a subset of probability distributions within the exponential family, termed the exponential dispersion family [see, for e.g., 34, 35, 36, 37]. It allows the dispersion, along with the mean, to vary across observations, removing the need for a constant dispersion across observations. We focus on a particular member of the family, the Tweedie compound Poisson-gamma (CP-g), more commonly referred to as the Tweedie (probability) distribution [see, 58]. The corresponding random variable is constructively defined as a Poisson sum of independently distributed Gamma random variables. Allowing for a varying dispersion across observations enables exploration of the effect of covariates $\mathbf{X}(\mathbf{s})$ on the mean and the dispersion separately, by employing two separate generalized linear models (GLMs). This gives rise to the double generalized linear model (DGLM) [see, for e.g., 54, 59, 56]. Hierarchical frameworks for specifications of DGLMs were first developed in [40]. Although not mandatory, it is customary to use the same covariates, $\mathbf{X}(\mathbf{s})$, for both the mean and dispersion GLMs to avoid ambiguities in model specification. Previous work [see for e.g., 29, 30] uses this approach and develops inference for DGLMs under a frequentist framework, where inference on spatial effects is obtained through penalizing the graph Laplacian. In this paper we adopt a Bayesian discourse by supplementing the DGLM framework with the continuous version of the spike-and-slab prior to effect shrinkage and thereby achieve better model specification. We integrate the spike-and-slab prior into our hierarchical prior formulation for both the mean and dispersion models. We show that these priors provide a natural way of incorporating sparsity into the model, while offering straightforward posterior sampling in the context of our spatial DGLMs.
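To fix ideas, the following is a minimal R sketch of this constructive definition, drawing CP-g observations as a Poisson number of independent Gamma summands; it uses the standard compound Poisson-gamma identities for the $(\mu ,\phi ,\xi )$ parameterization introduced in Section 2.1, and the helper rcpg() is purely illustrative rather than part of our software.

```r
## Illustrative only: draw n CP-g observations with mean mu, dispersion phi and
## index xi in (1, 2) as a Poisson sum of iid Gamma random variables.
rcpg <- function(n, mu, phi, xi) {
  stopifnot(xi > 1, xi < 2, mu > 0, phi > 0)
  lambda <- mu^(2 - xi) / (phi * (2 - xi))   # Poisson rate
  shape  <- (2 - xi) / (xi - 1)              # Gamma shape
  scale  <- phi * (xi - 1) * mu^(xi - 1)     # Gamma scale
  counts <- rpois(n, lambda)                 # number of Gamma summands per draw
  vapply(counts,
         function(k) if (k == 0) 0 else sum(rgamma(k, shape = shape, scale = scale)),
         numeric(1))                         # zero counts give exact zeros
}

set.seed(1)
y <- rcpg(1e4, mu = 2, phi = 1.5, xi = 1.5)
mean(y == 0)        # proportion of exact zeros
c(mean(y), var(y))  # roughly mu and phi * mu^xi
```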
The scale for spatial indexing is assumed to be point-referenced, for example latitude-longitude or easting-northing, so that generally $\mathbf{s}\in {\mathbb{R}^{2}}$. Specification of a neighborhood structure, or proximity, is naturally important when attempting to quantify the behavior of the response at locations that are near each other. We measure proximity through the Euclidean distance between locations and place a Gaussian process [see, for e.g., 60] prior on the spatial process $\mathbf{w}(\mathbf{s})$. Other choices exist for specifying such a spatial process, for e.g. log-Gamma processes [see 10, 11]. We particularly focus on a Gaussian process specification, $\mathbf{w}(\mathbf{s})\sim GP(\mathbf{0},K(\cdot ))$, where K is a covariance function. For arbitrary locations $\mathbf{s}$ and ${\mathbf{s}^{\prime }}$, dependence between $y(\mathbf{s})$ and $y({\mathbf{s}^{\prime }})$ is specified through $K(\mathbf{s},{\mathbf{s}^{\prime }})$, which governs the covariance between $w(\mathbf{s})$ and $w({\mathbf{s}^{\prime }})$. For point-referenced data, the Matérn family [see, for e.g., 45] provides the most generic and widely adopted covariance specification.
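As an illustration, the R sketch below evaluates a Matérn covariance matrix over a set of locations in the $({\sigma ^{2}},{\phi _{s}},\nu )$ parameterization used later in Section 2.3; we assume the standard normalized Matérn correlation (so the diagonal equals ${\sigma ^{2}}$), and matern_cov() is an illustrative helper, not package code.

```r
## Illustrative Matern covariance: sigma2 * 2^(1 - nu) / gamma(nu) *
## (phi_s * d)^nu * K_nu(phi_s * d), with d the Euclidean distance matrix.
matern_cov <- function(coords, sigma2, phi_s, nu) {
  d <- as.matrix(dist(coords))                        # ||s_i - s_i'||_2
  h <- phi_s * d
  R <- (2^(1 - nu) / gamma(nu)) * h^nu * besselK(h, nu)
  R[d == 0] <- 1                                      # limit of the correlation as d -> 0
  sigma2 * R
}

## Example: 5 random locations, exponential special case (nu = 0.5)
set.seed(1)
matern_cov(cbind(runif(5), runif(5)), sigma2 = 1, phi_s = 3, nu = 0.5)
```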
Next, we address Bayesian model specification. In the absence of variable selection concerns, prior specification for the hierarchical process models discussed above follows the generic framework $[\text{data}\mid \text{process},\widetilde{\boldsymbol{\theta }}]\times [\text{process}\mid \widetilde{\boldsymbol{\theta }}]\times [{\boldsymbol{\theta }_{m}}]\times [{\boldsymbol{\theta }_{pr}}]$, where $\widetilde{\boldsymbol{\theta }}=\{{\boldsymbol{\theta }_{m}},{\boldsymbol{\theta }_{pr}}\}$ denotes the model parameters [see for e.g. 8, 24, 5, Chapter 6, p. 125]. In particular, ${\boldsymbol{\theta }_{pr}}$ constitutes the parameters instrumental in specifying the process, while ${\boldsymbol{\theta }_{m}}$ collects the remaining model parameters. We adopt a proper prior on ${\boldsymbol{\theta }_{pr}}$ to avoid the risk of generating improper posteriors [see, for e.g., 6]. Building a Bayesian variable selection framework that facilitates model specification for ${\boldsymbol{\theta }_{m}}$ requires an additional layer of hierarchical prior specification, appending the latter framework with variable selection parameters, ${\boldsymbol{\theta }_{vs}}$, thereby producing
(1.1)
\[ [\text{data}\mid \text{process},\boldsymbol{\theta }]\times [\text{process}\mid \boldsymbol{\theta }]\times [{\boldsymbol{\theta }_{pr}}]\times [{\boldsymbol{\theta }_{m}}]\times [{\boldsymbol{\theta }_{vs}}],\]
where $\boldsymbol{\theta }=\{\widetilde{\boldsymbol{\theta }},{\boldsymbol{\theta }_{vs}}\}$. We resort to Markov chain Monte Carlo (MCMC) sampling [see, for e.g. 12, 28] for performing posterior inference on $\boldsymbol{\theta }$. The novelty of our approach lies in the simple Bayesian computation devised, employing only Gibbs sampling updates for ${\boldsymbol{\theta }_{vs}}$. To the best of our knowledge, hierarchical Bayesian frameworks for fitting (a) Tweedie DGLMs and (b) spatial Tweedie DGLMs with (or without) variable selection do not exist in the statistical literature. Evidently, the proposed methodology in (1.1) remedies that.
The ensuing developments in the paper are organized as follows. In Section 2 we detail the proposed statistical framework, outlining the Tweedie likelihood and parameterization, the model formulation, and the hierarchical prior specification. Section 3 provides comprehensive synthetic experiments that document the efficacy of our proposed statistical framework for Bayesian variable selection in spatial DGLMs. Section 4 outlines the contents of the online Supplement, and Section 5 considers an application of the developed framework to automobile insurance premiums in Connecticut, USA during 2008. Additional synthetic experiments capturing various performance aspects of the models are provided in the Supplementary Material.
2 Statistical Framework
The Tweedie distribution produces observations composed of exact zeros together with a continuous Gamma tail. This ability to jointly model mixed data types, featuring exact zeros and continuous measurements, makes it suitable for modeling responses arising from a variety of domains. Current applications include actuarial science [see 55, 62, 29, 30], ecology [see for e.g., 57, 53], public health [63], the environment [see for e.g., 38], and gene expression studies [43]. As evidenced by these applications, unobserved dependence between observations, which affects the quality of inference, can reasonably be expected. In the following subsections we provide more details on Tweedie distributions, followed by the model formulation and the hierarchical prior specification.
2.1 The Exponential Dispersion Family: Tweedie Distributions
The Tweedie family of distributions belongs to the exponential dispersion (ED) family of models whose probability density/mass function has the generic form
(2.1)
\[ \pi (y\mid \theta ,\phi )=a(y,\phi )\exp \left\{{\phi ^{-1}}(y\theta -\kappa (\theta ))\right\},\]
where θ is the natural or canonical parameter, ϕ is the dispersion parameter, and $\kappa (\theta )$ is the cumulant function. The Tweedie family is characterized by an index parameter ξ, varying values of which produce different members of the family. For example, the CP-g is obtained for $\xi \in (1,2)$, $\xi =1$ yields the Poisson, $\xi =2$ produces a Gamma distribution, and no member exists for $\xi \in (0,1)$ [for further details see Table 1, 29]. We are particularly interested in the CP-g distributions in this work. In general, for the ED family we have the mean, $\mu =E(y)={\kappa ^{\prime }}(\theta )$, and the variance, $Var(y)=\phi {\kappa ^{\prime\prime }}(\theta )$. For the CP-g we have $\kappa (\theta )={(2-\xi )^{-1}}{\{(1-\xi )\theta \}^{(2-\xi )/(1-\xi )}}$. Using the relation ${\kappa ^{\prime }}(\theta )=\mu $, some straightforward algebra yields $\kappa (\theta )={(2-\xi )^{-1}}{\mu ^{2-\xi }}$ and ${\kappa ^{\prime\prime }}(\theta )={\mu ^{\xi }}$, implying $Var(y)=\phi {\mu ^{\xi }}$. Denoting $\alpha ={(1-\xi )^{-1}}(2-\xi )$ we have,
\[ a(y,\phi )=1\cdot I(y=0)+{y^{-1}}{\sum \limits_{j=1}^{\infty }}{\left[\frac{{y^{-\alpha }}{(\xi -1)^{\alpha }}}{{\phi ^{1-\alpha }}(2-\xi )}\right]^{j}}\frac{1}{j!\Gamma (-j\alpha )}\cdot I(y\gt 0).\]
Evidently, $\pi (0\mid \theta ,\phi )=\exp \{-{\phi ^{-1}}\kappa (\theta )\}$. We now introduce some notation. Let ${y_{ij}}({\mathbf{s}_{i}})$ denote the j-th response at the i-th location ${\mathbf{s}_{i}}\in \mathcal{S}$, where $j=1,2,\dots ,{n_{i}}$ and $i=1,2,\dots ,L$ with ${\textstyle\sum _{i=1}^{L}}{n_{i}}=N$. Together we denote $\mathbf{y}=\mathbf{y}(\mathbf{s})={\{{\{{y_{ij}}({\mathbf{s}_{i}})\}_{j=1}^{{n_{i}}}}\}_{i=1}^{L}}$ as the $N\times 1$ response. Similarly, $\boldsymbol{\mu }=\boldsymbol{\mu }(\mathbf{s})={\{{\{{\mu _{ij}}({\mathbf{s}_{i}})\}_{j=1}^{{n_{i}}}}\}_{i=1}^{L}}$ and $\boldsymbol{\phi }={\{{\{{\phi _{ij}}\}_{j=1}^{{n_{i}}}}\}_{i=1}^{L}}$ denote the mean and dispersion vectors respectively. If $\mathbf{y}\mid \boldsymbol{\mu },\boldsymbol{\phi },\xi $ arises independently from a CP-g distribution, then the likelihood is given by
(2.2)
\[ \begin{aligned}{}& \pi (\mathbf{y}\mid \boldsymbol{\mu },\boldsymbol{\phi },\xi )={\prod \limits_{i=1}^{L}}{\prod \limits_{j=1}^{{n_{i}}}}{a_{ij}}({y_{ij}}({\mathbf{s}_{i}})\mid {\phi _{ij}})\times \\ {} & \hspace{14.22636pt}\exp \left[{\phi _{ij}^{-1}}\left(\frac{{y_{ij}}({\mathbf{s}_{i}}){\mu _{ij}}{({\mathbf{s}_{i}})^{1-\xi }}}{1-\xi }-\frac{{\mu _{ij}}{({\mathbf{s}_{i}})^{2-\xi }}}{2-\xi }\right)\right].\end{aligned}\]
Working with the likelihood $\pi (\cdot )$ when devising computation requires evaluating the infinite series representation of $a(y,\phi )$. The two commonly used methods are the saddle-point approximation [see for e.g. 49, 55, 18, 66] and Fourier inversion [see, for e.g. 19]. The saddle-point approximation to (2.1) uses a deviance-function-based representation, $\widetilde{\pi }(y\mid \mu ,\phi )=b(y,\phi )\exp \{-{(2\phi )^{-1}}d(y\mid \mu )\}$. For CP-g distributions, the deviance function is $d(y\mid \mu )=d(y\mid \mu ,\xi )=2\{({y^{2-\xi }}-y{\mu ^{1-\xi }}){(1-\xi )^{-1}}-({y^{2-\xi }}-{\mu ^{2-\xi }}){(2-\xi )^{-1}}\}$, and $b(y\mid \phi ,\xi )={(2\pi \phi {y^{\xi }})^{-1/2}}I(y\gt 0)+1\cdot I(y=0)\approx a(y,\phi )\exp \{{\phi ^{-1}}{y^{2-\xi }}{(1-\xi )^{-1}}{(2-\xi )^{-1}}\}$. We performed experiments which showed that the saddle-point approximation performs well when fewer zeros are present in the data; under higher proportions of zeros its performance was sub-optimal. The Fourier inversion based method, albeit computationally more intensive, showed stable performance in all scenarios. Hence, in this paper we use the evaluation of $a(y,\phi )$ based on Fourier inversion. The adopted Bayesian approach requires MCMC computation that relies on accurate likelihood evaluations; hence, we emphasize the importance of choosing the appropriate likelihood evaluation for application purposes. We denote the likelihood in (2.2) as $Tw(\boldsymbol{\mu },\boldsymbol{\phi },\xi )$. Tweedie distributions are the only members of the ED family that possess a scale invariance property [see, for e.g., 37, Theorem 4.1]. This implies that for ${c_{ij}}\gt 0$, ${c_{ij}}{y_{ij}}({\mathbf{s}_{i}})\sim Tw({c_{ij}}{\mu _{ij}}({\mathbf{s}_{i}}),{c_{ij}^{2-\xi }}{\phi _{ij}},\xi )$, allowing observations with different scales of measurement to be modeled jointly.
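For illustration, the sketch below evaluates the CP-g log-density in (2.1) by truncating the series in $a(y,\phi )$ after a fixed number of terms; this is only a didactic stand-in for the Fourier-inversion evaluation we actually use, and the name dcpg() and the truncation point jmax are arbitrary.

```r
## Illustrative series evaluation of the CP-g log-density; the computation in the
## package relies on Fourier inversion instead, as discussed above.
dcpg <- function(y, mu, phi, xi, jmax = 200) {
  stopifnot(xi > 1, xi < 2, mu > 0, phi > 0, y >= 0)
  kappa <- mu^(2 - xi) / (2 - xi)                 # kappa(theta) evaluated at theta(mu)
  if (y == 0) return(-kappa / phi)                # log pi(0 | theta, phi)
  alpha <- (2 - xi) / (1 - xi)                    # alpha < 0 for xi in (1, 2)
  j <- seq_len(jmax)
  ## log of the j-th summand in a(y, phi), accumulated on the log scale for stability
  lterms <- j * (-alpha * log(y) + alpha * log(xi - 1) -
                 (1 - alpha) * log(phi) - log(2 - xi)) -
            lgamma(j + 1) - lgamma(-j * alpha)
  log_a <- -log(y) + max(lterms) + log(sum(exp(lterms - max(lterms))))
  log_a + (y * mu^(1 - xi) / (1 - xi) - kappa) / phi
}

dcpg(y = 2.3, mu = 1.5, phi = 2, xi = 1.6)   # log-density of a single observation
```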
2.2 Model Formulation
Formulating DGLMs with spatial effects, in principle, involves specifying a spatial random effect in both the mean and dispersion models. In such a scenario, complex dependencies can be specified to account for varied degrees of uncertainty quantification: in the simplest case the corresponding spatial random effects for the mean and dispersion models are independent Gaussian processes, while more complex scenarios can feature dependent Gaussian processes, where the dependence arises from a cross-covariance matrix. Spatial random effects in the mean model are readily interpretable as the risk faced owing to location (in our case, an adjustment to the mean premium paid); spatial random effects in the dispersion model are not readily interpretable. Consequently, we choose to include spatial random effects only in the mean model for this work.
Let ${\mathbf{x}_{ij}}({\mathbf{s}_{i}})$ denote a $p\times 1$ vector of observed covariates for the mean model and $\boldsymbol{\beta }$ the corresponding $p\times 1$ vector of coefficients; ${\mathbf{f}_{ij}}({\mathbf{s}_{i}})$ denotes an $L\times 1$ vector specifying the location, and $w({\mathbf{s}_{i}})$ is the spatial effect at location ${\mathbf{s}_{i}}$, with $\mathbf{w}=\mathbf{w}(\mathbf{s})={(w({\mathbf{s}_{1}}),w({\mathbf{s}_{2}}),\dots ,w({\mathbf{s}_{L}}))^{T}}$ denoting the $L\times 1$ vector of spatial effects; ${\mathbf{z}_{ij}}({\mathbf{s}_{i}})$ denotes a $q\times 1$ vector of known covariates for the dispersion model and $\boldsymbol{\gamma }$ is the corresponding $q\times 1$ vector of coefficients. A Bayesian hierarchical double generalized linear model (DGLM) using a non-canonical logarithmic link function is specified as
(2.3)
\[ \begin{aligned}{}\log {\mu _{ij}}({\mathbf{s}_{i}})& ={\mathbf{x}_{ij}^{T}}({\mathbf{s}_{i}})\boldsymbol{\beta }+{\mathbf{f}_{ij}}{({\mathbf{s}_{i}})^{T}}\mathbf{w}({\mathbf{s}_{i}}),\\ {} \log {\phi _{ij}}& ={\mathbf{z}_{ij}^{T}}\boldsymbol{\gamma },\end{aligned}\]
which implies ${\mu _{ij}}({\mathbf{s}_{i}})={\mu _{ij}}(\boldsymbol{\beta },\mathbf{w})=\exp ({\mathbf{x}_{ij}^{T}}({\mathbf{s}_{i}})\boldsymbol{\beta }+{\mathbf{f}_{ij}^{T}}({\mathbf{s}_{i}})\mathbf{w}({\mathbf{s}_{i}}))$ and ${\phi _{ij}}={\phi _{ij}}(\boldsymbol{\gamma })=\exp ({\mathbf{z}_{ij}^{T}}\boldsymbol{\gamma })$.
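To make the role of the incidence vectors ${\mathbf{f}_{ij}}({\mathbf{s}_{i}})$ concrete, the short R sketch below assembles the $N\times L$ incidence matrix from a vector of location labels and evaluates the two linear predictors in (2.3); every object name here is an illustrative placeholder.

```r
## Sketch: rows of Fmat are the f_ij(s_i)^T of (2.3), a single 1 in the column
## corresponding to the observation's location. All objects are toy placeholders.
set.seed(1)
L <- 4; N <- 10; p <- 3; q <- 2
loc   <- sample(1:L, N, replace = TRUE)                  # location index of each observation
X     <- matrix(rnorm(N * p), N, p); beta  <- rnorm(p)   # mean-model design and coefficients
Z     <- matrix(rnorm(N * q), N, q); gamma <- rnorm(q)   # dispersion-model design and coefficients
w     <- rnorm(L)                                        # spatial effects w(s_1), ..., w(s_L)
Fmat  <- model.matrix(~ factor(loc, levels = 1:L) - 1)   # N x L spatial incidence matrix
mu    <- exp(X %*% beta + Fmat %*% w)                    # mu_ij(s_i) under the log link
phi   <- exp(Z %*% gamma)                                # phi_ij under the log link
```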
2.3 Hierarchical Prior Specification
In this section, we first present the hierarchical prior formulation for the model and process parameters, $\widetilde{\boldsymbol{\theta }}$, followed by the prior formulation for the variable selection parameters ${\boldsymbol{\theta }_{vs}}$. The prior specification for the model and process parameters is as follows:
(2.4)
\[ \begin{aligned}{}& \text{Model Parameters:}\hspace{2.5pt}\xi \sim U({a_{\xi }},{b_{\xi }});\\ {} & \hspace{28.45274pt}\boldsymbol{\beta }\sim {N_{p}}\left({\mathbf{0}_{p}},\mathrm{diag}({\boldsymbol{\lambda }_{\beta }})\right);\hspace{2.5pt}\boldsymbol{\gamma }\sim {N_{q}}\left({\mathbf{0}_{q}},\mathrm{diag}({\boldsymbol{\lambda }_{\gamma }})\right),\\ {} & \text{Process Parameters:}\hspace{2.5pt}{\phi _{s}}\sim U\left({a_{{\phi _{s}}}},{b_{{\phi _{s}}}}\right);\\ {} & \hspace{28.45274pt}{\sigma ^{-2}}\sim \mathrm{Gamma}\left({a_{\sigma }},{b_{\sigma }}\right);\hspace{2.5pt}\nu \sim U({a_{\nu }},{b_{\nu }}),\\ {} & \text{Process:}\hspace{2.5pt}\mathbf{w}\sim {N_{L}}\left({\mathbf{0}_{L}},{\sigma ^{2}}\mathbf{R}({\phi _{s}})\right),\end{aligned}\]
where ${\mathbf{0}_{m}}$ is the $m\times 1$ zero vector; $\mathbf{X}\in {\mathbb{R}^{N\times p}}$ and $\mathbf{Z}\in {\mathbb{R}^{N\times q}}$ are design matrices corresponding to the mean and dispersion models, with coefficients $\boldsymbol{\beta }\in {\mathbb{R}^{p}}$ and $\boldsymbol{\gamma }\in {\mathbb{R}^{q}}$, respectively; $\mathbf{F}\in {\mathbb{R}^{N\times L}}$ is the spatial incidence matrix, $\mathbf{w}\in {\mathbb{R}^{L}}$ is the spatial effect, and $\mathbf{R}({\phi _{s}})$ is the Matérn correlation matrix with entries ${\{\mathbf{R}({\phi _{s}})\}_{i{i^{\prime }}}}=\frac{{2^{1-\nu }}}{\Gamma (\nu )}{({\phi _{s}}{\{||\Delta ||\}_{i{i^{\prime }}}})^{\nu }}{K_{\nu }}({\phi _{s}}{\{||\Delta ||\}_{i{i^{\prime }}}})$, where ${K_{\nu }}$ is the modified Bessel function of the second kind of order ν [1]. Here ${\{||\Delta ||\}_{i{i^{\prime }}}}=||{\mathbf{s}_{i}}-{\mathbf{s}_{{i^{\prime }}}}|{|_{2}}$ is the Euclidean distance between locations ${\mathbf{s}_{i}}$ and ${\mathbf{s}_{{i^{\prime }}}}$. $U(\cdot \mid a,b)$ denotes the uniform distribution, ${N_{m}}(\cdot \mid \mathbf{0},\Sigma )$ is the m-dimensional Gaussian with zero mean and covariance matrix Σ, and $\mathrm{Gamma}(\cdot \mid a,b)$ is the Gamma distribution with shape-rate parameterization. Note that the priors on ${\boldsymbol{\lambda }_{\beta }}=({\lambda _{\beta ,1}},\dots ,{\lambda _{\beta ,p}})$ and ${\boldsymbol{\lambda }_{\gamma }}=({\lambda _{\gamma ,1}},\dots ,{\lambda _{\gamma ,q}})$ are a part of the variable selection priors and are discussed next. Referring to the framework in (1.1), the resulting posterior from (2.4) establishes the $[\text{data}\mid \text{process},\boldsymbol{\theta }]\times [\text{process}\mid \widetilde{\boldsymbol{\theta }}]$ step. Conditional posteriors for $\widetilde{\boldsymbol{\theta }}$ are outlined in Section S1 of the Supplementary Materials.
For the continuous spike-and-slab prior formulation, ${\boldsymbol{\theta }_{vs}}=\{{\zeta _{\beta }},{\zeta _{\gamma }},{\sigma _{\beta }^{2}},{\sigma _{\gamma }^{2}},{\alpha _{\beta }},{\alpha _{\gamma }}\}$ [see, for e.g., 33]. Note that we have separate prior formulations for the mean and dispersion models. Let $\boldsymbol{\beta }={({\beta _{1}},{\beta _{2}},\dots ,{\beta _{p}})^{T}}$ and $\boldsymbol{\gamma }={({\gamma _{1}},{\gamma _{2}},\dots ,{\gamma _{q}})^{T}}$ be the model coefficients corresponding to the mean and the dispersion models. We define ${\lambda _{\beta ,u}}={\zeta _{\beta ,u}}{\sigma _{\beta ,u}^{2}}$ and ${\lambda _{\gamma ,v}}={\zeta _{\gamma ,v}}{\sigma _{\gamma ,v}^{2}}$ for $u=1,\dots ,p$ and $v=1,\dots ,q$, respectively. We consider the following prior formulation:
(2.5)
\[ \displaystyle \pi ({\boldsymbol{\theta }_{vs}})=\left\{\begin{array}{l}{\zeta _{\beta ,u}}\stackrel{iid}{\sim }(1-{\alpha _{\beta }}){\delta _{{\nu _{0}}}}(\cdot )+{\alpha _{\beta }}{\delta _{1}}(\cdot ),\hspace{1em}\\ {} {\zeta _{\gamma ,v}}\stackrel{iid}{\sim }(1-{\alpha _{\gamma }}){\delta _{{\nu _{0}}}}(\cdot )+{\alpha _{\gamma }}{\delta _{1}}(\cdot ),\hspace{1em}\\ {} {\alpha _{\beta }}\sim U(0,1),\hspace{2.5pt}{\alpha _{\gamma }}\sim U(0,1),\hspace{1em}\\ {} {\sigma _{\beta ,u}^{-2}}\stackrel{iid}{\sim }\mathrm{Gamma}({a_{{\sigma _{\beta }}}},{b_{{\sigma _{\beta }}}}),\hspace{1em}\\ {} {\sigma _{\gamma ,v}^{-2}}\stackrel{iid}{\sim }\mathrm{Gamma}({a_{{\sigma _{\gamma }}}},{b_{{\sigma _{\gamma }}}}),\hspace{1em}\end{array}\right.\]
where ${\beta _{u}}$ and ${\gamma _{v}}$ have normal priors with mean 0 and variances ${\zeta _{\beta ,u}}{\sigma _{\beta ,u}^{2}}$ and ${\zeta _{\gamma ,v}}{\sigma _{\gamma ,v}^{2}}$, respectively. Here, ${\delta _{c}}(\cdot )$ denotes the discrete measure at c; hence, ${\zeta _{\beta ,u}}$ and ${\zeta _{\gamma ,v}}$ are indicators taking the value 1 or ${\nu _{0}}$ (a small number close to 0) depending on whether their corresponding covariates are selected. The probabilities of these indicators taking the value 1 are given by ${\alpha _{\beta }}$ and ${\alpha _{\gamma }}$ respectively. We place a uniform prior on these selection probabilities and an inverse-Gamma prior on the parameters ${\sigma _{\beta ,u}^{2}}$ and ${\sigma _{\gamma ,v}^{2}}$. The choice of the shape and rate parameters (${a_{{\sigma _{\beta }}}},{b_{{\sigma _{\beta }}}};{a_{{\sigma _{\gamma }}}},{b_{{\sigma _{\gamma }}}}$) of these inverse-Gamma priors induces a continuous bimodal distribution on ${\zeta _{\beta ,u}}{\sigma _{\beta ,u}^{2}}$ and ${\zeta _{\gamma ,v}}{\sigma _{\gamma ,v}^{2}}$ with a spike at ${\nu _{0}}$ and a right continuous tail. Combining the priors in (2.4) and (2.5) completes the hierarchical prior formulation for the parameters $\boldsymbol{\theta }$ as defined in (1.1). Evidently, the above prior formulation allows for sufficient flexibility regarding variations in implementation. For instance, a hierarchical Bayesian framework for a simple DGLM can be obtained by omitting the process specification and variable selection; analogously, DGLMs featuring only variable selection or only spatial effects are obtained by omitting the respective components from the prior specification outlined above.
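As an indication of why the variable selection parameters admit Gibbs updates, the sketch below gives the conjugate full-conditional draws for the indicators ${\zeta _{\beta ,u}}$ and the inclusion probability ${\alpha _{\beta }}$ implied by (2.4)–(2.5) (the dispersion-model updates are analogous); the exact conditionals used in our implementation are derived in Section S1 of the Supplement, and the function names here are only illustrative.

```r
## Two-point full conditional for each zeta_{beta,u}: slab (zeta = 1) vs. spike
## (zeta = nu0), given beta_u ~ N(0, zeta * sigma2_{beta,u}).
update_zeta_beta <- function(beta, sigma2_beta, alpha_beta, nu0) {
  l_slab  <- log(alpha_beta)     + dnorm(beta, 0, sqrt(sigma2_beta),       log = TRUE)
  l_spike <- log(1 - alpha_beta) + dnorm(beta, 0, sqrt(nu0 * sigma2_beta), log = TRUE)
  p_slab  <- 1 / (1 + exp(l_spike - l_slab))
  ifelse(runif(length(beta)) < p_slab, 1, nu0)
}

## Uniform prior on alpha_beta gives a Beta full conditional given the indicators.
update_alpha_beta <- function(zeta_beta) {
  rbeta(1, 1 + sum(zeta_beta == 1), 1 + sum(zeta_beta != 1))
}
```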
2.4 Bayesian Estimation and Inference
In its full capacity (a model with spatial effects and variable selection), the model structure with prior specifications in (2.4) and (2.5) contains $3p+3q+L+4$ parameters. Depending on the dimensions of $\mathbf{X}(\mathbf{s})$ and $\mathbf{Z}$, and the number of locations L, posterior inference can be a daunting task. Traditional Metropolis–Hastings (M-H) random walk strategies are sub-optimal, involving costly pilot runs to determine viable starting points and unreasonably long chains while performing MCMC sampling. To avoid such issues, we use adaptive rejection sampling, leveraging the log-concavity of the posteriors, to perform effective inference free of the issues described above [for more details, see, for e.g, 28]. In the following, we describe (i) briefly, our adaptive MCMC sampling approach (more details are provided in the Supplementary Materials), (ii) the identifiability issue for the overall intercept that arises due to inclusion of a spatial effect, and a strategy to address it, and (iii) a false discovery rate (FDR)–based approach for performing variable selection.
The joint posterior $\pi (\boldsymbol{\theta }\mid \mathbf{y})$ generated by the hierarchical priors in (2.4) and (2.5) is sampled using a hybrid strategy that combines M-H random walk updates and the Metropolis-adjusted Langevin algorithm (MALA) [52, 28]. We use MALA updates for the mean-model parameters $\{\boldsymbol{\beta },\mathbf{w}\}$. The dispersion model coefficients are sampled depending on the choice of likelihood evaluation: $\boldsymbol{\gamma }$ is sampled using MALA if the saddle-point approximation of the likelihood is considered; otherwise $\boldsymbol{\gamma }$ is sampled using MALA with a numerical approximation to the gradient of its conditional posterior, or using an M-H random walk. The parameters $\{\xi ,{\phi _{s}}\}$ are updated using M-H random walks. All remaining parameters are sampled using Gibbs updates. In particular, we employ block updates for ${\boldsymbol{\beta }_{\mathbf{w}}}=\{\boldsymbol{\beta },\mathbf{w}\}$ and for $\boldsymbol{\gamma }$. Proposal variances in the MALA updates feature adaptive scaling such that the optimal acceptance rate for capturing Langevin dynamics ($\approx 58\% $) is achieved upon convergence [see, 12, 28]; proposal variances in the M-H updates are likewise adaptively scaled towards the optimal random walk acceptance rate ($\approx 33\% $). We outline the full sampling algorithm at the end of Section S1 of the Supplement. For the hierarchical DGLM in (2.3), the specification of a spatial effect translates to fitting a random intercept mean model. Consequently, having an additional overall intercept ${\beta _{0}}$ in the model renders it unidentifiable [see, 25, 22]. Hence ${\beta _{0}}$ is not estimable, although ${\beta _{0}}+\mathbf{w}$ is; ${\beta _{0}}$ is recovered through hierarchical centering of the posterior for $\mathbf{w}$ [see, for e.g., 22].
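For concreteness, a generic MALA step with the Metropolis correction is sketched below; logpost and grad stand in for a conditional log-posterior and its gradient (the actual gradients appear in Section S1), and eps is the step size that the adaptive scaling drives towards the roughly 58% acceptance rate mentioned above.

```r
## One MALA step (sketch): Langevin drift proposal plus M-H accept/reject.
mala_step <- function(theta, logpost, grad, eps) {
  prop <- theta + 0.5 * eps^2 * grad(theta) + eps * rnorm(length(theta))
  ## log proposal densities q(prop | theta) and q(theta | prop); they are asymmetric
  lq_fwd <- sum(dnorm(prop,  theta + 0.5 * eps^2 * grad(theta), eps, log = TRUE))
  lq_rev <- sum(dnorm(theta, prop  + 0.5 * eps^2 * grad(prop),  eps, log = TRUE))
  log_acc <- logpost(prop) - logpost(theta) + lq_rev - lq_fwd
  if (log(runif(1)) < log_acc) list(theta = prop,  accepted = TRUE)
  else                         list(theta = theta, accepted = FALSE)
}

## Toy usage with a standard normal target
out <- mala_step(theta   = c(0, 0),
                 logpost = function(x) -0.5 * sum(x^2),
                 grad    = function(x) -x,
                 eps     = 0.5)
```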
The MCMC samples of $\boldsymbol{\beta }$ and $\boldsymbol{\gamma }$ explore their conditional posterior distributions, and point estimates for these model parameters can be obtained using maximum a posteriori (MAP) estimates or posterior means. However, these point estimates do not yield exact zeros, since we have used a continuous spike-and-slab prior with a spike at ${\nu _{0}}$ (a small positive number), and they do not make use of all the MCMC samples. We therefore use a Bayesian model averaging based strategy that leverages all the MCMC samples to build inference [see, for e.g., 32]. Specifically, we use an FDR-based strategy, combining Bayesian model averaging with point estimates [see for e.g., 48, 47]. Let ${\beta _{u}^{(m)}}$ for $m=1,\dots ,M$ denote the MCMC samples (after burn-in and thinning) for the coefficients of the mean model. We compute ${p_{u}}=\frac{1}{M}{\textstyle\sum _{m}}I\big(|{\beta _{u}^{(m)}}|\le c\big)$ for a small threshold c, where $I(\cdot )$ is the indicator function; these probabilities ${p_{u}}$ can be interpreted as local FDRs [see, for e.g., 48], and $(1-{p_{u}})$ as the probability that covariate u is significantly associated with the response. We use the ${p_{u}}$s to decide which covariates to select while controlling the FDR at level α. Explicitly, we infer that covariate u has a non-zero coefficient if ${p_{u}}\le {\kappa _{\alpha }}$ for a threshold ${\kappa _{\alpha }}\in (0,1)$, computed as follows: we first sort the probabilities ${p_{u}}$ in increasing order and denote the sorted probabilities by ${p_{(u)}}$ for $u=1,\dots ,p$; we then assign ${\kappa _{\alpha }}={p_{({u^{\ast }})}}$, where ${u^{\ast }}=\max \{\widetilde{u}\mid \frac{1}{\widetilde{u}}{\textstyle\sum _{u=1}^{\widetilde{u}}}{p_{(u)}}\le \alpha \}$. This approach caps the false discovery proportion among the selected variables at $100\alpha \% $. We employ the same approach using the MCMC samples of $\boldsymbol{\gamma }$ to select significant coefficients for the dispersion model.
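The selection rule above reduces to a few lines of post-processing on the MCMC output; the sketch below computes the local FDRs ${p_{u}}$, the cutoff ${\kappa _{\alpha }}$ and the selected set, where beta_samples is a hypothetical $M\times p$ matrix of post burn-in draws and the near-zero threshold c is an assumed default.

```r
## FDR-based selection (sketch): p_u estimates the local FDR of covariate u and
## kappa_alpha caps the average FDR of the selected set at alpha.
select_fdr <- function(beta_samples, c = 1e-3, alpha = 0.05) {
  p_u <- colMeans(abs(beta_samples) <= c)           # local FDR per covariate
  p_sorted <- sort(p_u)                             # ascending order
  ok <- which(cumsum(p_sorted) / seq_along(p_sorted) <= alpha)
  if (length(ok) == 0) return(rep(FALSE, length(p_u)))
  kappa_alpha <- p_sorted[max(ok)]
  p_u <= kappa_alpha                                # TRUE = coefficient deemed non-zero
}
```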
Table 1
Proposed Bayesian Hierarchical Double Generalized Linear Modeling Frameworks.
Models | Frameworks | Specification ($\boldsymbol{\theta }$) | Number of Parameters |
M1 | DGLM | ${\boldsymbol{\theta }_{m}}$ | $p+q+1$ |
M2 | DGLM + Variable Selection | ${\boldsymbol{\theta }_{m}}$, ${\boldsymbol{\theta }_{vs}}$ | $3p+3q+1$ |
M3 | DGLM + Spatial Effect | ${\boldsymbol{\theta }_{m}}$, ${\boldsymbol{\theta }_{pr}}$ | $p+q+L+4$ |
M4 | DGLM + Spatial Effect + Variable Selection | ${\boldsymbol{\theta }_{m}}$, ${\boldsymbol{\theta }_{vs}}$, ${\boldsymbol{\theta }_{pr}}$ | $3p+3q+L+4$ |
Posterior inference on $\boldsymbol{\beta }$ and $\boldsymbol{\gamma }$ is performed using MAP (point) estimates along with the posterior mean, median, standard deviation and highest posterior density (HPD) intervals. For $\mathbf{w}$ we use the posterior mean, median, standard deviation and HPD intervals. Next, we present synthetic experiments that document the performance of our proposed models using these metrics. All computation is performed in the R statistical environment; the required subroutines can be accessed via an open-source repository at: https://github.com/arh926/sptwdglm.
3 Synthetic Experiments
We begin with an observation: the spatial heterogeneity that our models aim to quantify is not directly observed in real data. Hence, it is imperative to document the accuracy of estimating such effects through synthetic experiments. The settings used are as follows. We consider varying proportions of zeros (15%, 30%, 60%, 80% and 95%) under which the quality of posterior inference for $\boldsymbol{\theta }$ is assessed; the proportion of zeros can be interpreted as an inverse signal-to-noise ratio for the synthetic response. For the sake of brevity, we only show the results for synthetic experiments pertaining to Bayesian variable selection in the presence of spatial effects; additional simulations can be found in the online Supplement. To construct the synthetic data we consider three scenarios pertaining to model structure: (a) no overlap (i.e., the selected β's and γ's do not intersect), (b) 50% overlap (in the union of all selected variables across the mean and dispersion models), and (c) 100% overlap between the mean and dispersion model specifications. We use 10 covariates including an intercept, where the columns of the synthetic design matrices $\mathbf{X}$ and $\mathbf{Z}$ are independently sampled Gaussian variables with mean 0 and variance 1, centered and scaled. Naturally, the specification of the true $\boldsymbol{\beta }$, $\mathbf{w}$ and $\boldsymbol{\gamma }$ determines the proportion of zeros in the synthetic response; Table 2 contains the parameter specifications used, and the true value of the index parameter is $\xi =1.5$. In an attempt to produce a synthetic setup that resembles reality we simulate $\mathbf{w}\sim N(5(\sin (3\pi {\mathbf{s}_{1}})+\cos (3\pi {\mathbf{s}_{2}})),1)$ (see Figure 1, second row); the alternative route would be to fix values of ${\sigma ^{2}}$ and ${\phi _{s}}$ and generate a realization $\mathbf{w}\sim {N_{L}}({\mathbf{0}_{L}},{\sigma ^{2}}\mathbf{R}({\phi _{s}}))$ (see Figure 1, first row). Under each setting we consider $M=10$ replications, and within each replication we fit all the proposed modeling frameworks shown in Table 1. The hyper-parameter settings used while specifying priors for the models are ${a_{\xi }}=1$, ${b_{\xi }}=2$, ${a_{\sigma }}={a_{{\sigma _{\beta }}}}={a_{{\sigma _{\gamma }}}}=2$, ${b_{\sigma }}={b_{{\sigma _{\beta }}}}={b_{{\sigma _{\gamma }}}}=1$ (producing inverse-Gamma priors with mean 1 and infinite variance), ${a_{{\phi _{s}}}}=0$, ${b_{{\phi _{s}}}}=30$, ${\sigma _{\beta }^{-2}}={\sigma _{\gamma }^{-2}}={10^{-6}}$ and ${\nu _{0}}=5\times {10^{-4}}$, producing a vague and non-informative hierarchical prior. We maintain an FDR of 5% for all settings while performing model selection. The sample size varies over $N\in \{2\times {10^{3}},5\times {10^{3}},1\times {10^{4}}\}$ and the number of locations is $L=1\times {10^{2}}$. Across replications, the false positive rate (FPR) and true positive rate (TPR) are computed to measure the accuracy of our model selection procedure. To record the quality of estimation, we compute the mean squared error (MSE), for e.g. $MSE(\boldsymbol{\beta })=\frac{1}{p}{\textstyle\sum _{{i_{\beta }}=1}^{p}}{({\beta _{{i_{\beta }}}}-{\widehat{\beta }_{{i_{\beta }}}})^{2}}$, computed similarly for the other parameters. We also compute average coverage probabilities:
for $\boldsymbol{\beta }$, we define $CP(\boldsymbol{\beta })=\frac{1}{M}{\textstyle\sum _{m=1}^{M}}I({\boldsymbol{\beta }_{true}}\in ({l_{m}}(\boldsymbol{\beta }),{u_{m}}(\boldsymbol{\beta })))$, where ${l_{m}}(\boldsymbol{\beta })$ and ${u_{m}}(\boldsymbol{\beta })$ are the lower and upper 95% HPD limits respectively for $\boldsymbol{\beta }$ in replication m; coverage probabilities for $\mathbf{w}$ and $\boldsymbol{\gamma }$ are obtained similarly. The results obtained under the above settings are shown in Table 3. The first column, named configuration (abbreviated as config.), has entries denoting the proportion of overlap between the selected coefficients in the mean and dispersion models, which is indicative of model structure. This is estimated by observing the overlap between selected variables following the model fit (for models M2 and M4); no variable selection is performed for models M1 and M3.
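For reference, one synthetic replicate under pattern 2 (the sinusoidal surface) can be generated along the following lines, reusing the illustrative rcpg() helper from Section 2.1; the sparse true coefficient vectors shown here are placeholders standing in for the settings of Table 2.

```r
## Sketch of one synthetic replicate (pattern 2); not the exact script used.
set.seed(42)
L <- 100; N <- 5000; p <- q <- 10; xi <- 1.5
S    <- cbind(runif(L), runif(L))                          # locations on the unit square
w    <- rnorm(L, 5 * (sin(3 * pi * S[, 1]) + cos(3 * pi * S[, 2])), 1)
loc  <- sample(1:L, N, replace = TRUE)                     # location of each observation
X    <- cbind(1, scale(matrix(rnorm(N * (p - 1)), N)))     # mean-model design
Z    <- cbind(1, scale(matrix(rnorm(N * (q - 1)), N)))     # dispersion-model design
beta  <- c(0,   rnorm(p - 1, 0.5, 0.1) * rbinom(p - 1, 1, 0.5))  # sparse truth (no intercept)
gamma <- c(2.5, rnorm(q - 1, 0.5, 0.1) * rbinom(q - 1, 1, 0.5))  # gamma_0 as in Table 2 (60% zeros)
mu   <- exp(X %*% beta + w[loc])
phi  <- exp(Z %*% gamma)
y    <- mapply(function(m, f) rcpg(1, m, f, xi), mu, phi)
mean(y == 0)                                               # realized proportion of exact zeros
```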
Table 2
Parameter settings used to obtain varying proportion of zeros in the synthetic data.
Proportion of 0s | ${\mu _{{\boldsymbol{\beta }_{-0}}}}({\sigma _{{\boldsymbol{\beta }_{-0}}}})$ | ${\gamma _{0}}$ | ${\mu _{{\boldsymbol{\gamma }_{-0}}}}$ $({\sigma _{{\boldsymbol{\gamma }_{-0}}}})$ |
15% | 0.50 (0.1) | −1.50 | 0.50 (0.1) |
30% | 0.50 (0.1) | 0.70 | 0.50 (0.1) |
60% | 0.50 (0.1) | 2.50 | 0.50 (0.1) |
80% | 1.00 (0.1) | 4.50 | 0.50 (0.1) |
95% | 1.00 (0.1) | 7.00 | 0.50 (0.1) |
Figure 1
Plots showing synthetic spatial patterns, pattern 1 (top, left column) and pattern 2 (bottom, left column) and corresponding logarithm of aggregated synthetic response (right column).
Figure 2
Results for synthetic experiments showing model performance: MSE is plotted against the proportion of zeros in the synthetic response, tabulated across models (M1, M2, M3, M4) versus $\{{\boldsymbol{\theta }_{m}},{\boldsymbol{\theta }_{pr}}\}\times $ configuration.
From the results shown, we see that models M1 and M2 perform poorly. The estimates $\widehat{\boldsymbol{\beta }}$ remain fairly unaffected, whereas for $\widehat{\boldsymbol{\gamma }}$ and $\widehat{\xi }$ all of the variation present in the synthetic data, yet not quantified by the model, spills over and compromises the quality of the estimates. Nor do M1 and M2 produce reliable results pertaining to model structure recovery. However, significant improvements show up with M3 and M4. In particular, under higher proportions of zeros in the synthetic data (low signal-to-noise ratio) the performance of M4 remains stable with respect to model structure recovery and estimation of parameters (refer to Table 3), thereby producing the most robust inference among the models in comparison. As an example within our simulation setting, under 95% zeros in the data and low sample sizes (for example 2,000 or 5,000), the estimates of model coefficients and spatial effects in M3 and M4 are adversely affected by locations having fewer non-zero observations. This observation addresses the concern around specifying DGLMs without spatial random effects in scenarios where the data display spatial variation. The results demonstrate the expected gains when our model is used in its full capacity instead of a usual DGLM.
Table 3
Table showing results of synthetic experiments for model selection for models M2 and M4. Corresponding standard deviations are shown in brackets below.
N | True Overlap | Prop. of 0s | M2 Overlap | M2 FPR | M2 TPR | M4 Overlap | M4 FPR | M4 TPR |
5000 | 0.00 | 0.15 | 0.03 | 0.04 | 0.69 | 0.00 | 0.00 | 0.90 |
(0.06) | (0.06) | (0.07) | (0.00) | (0.00) | (0.00) | |||
0.30 | 0.05 | 0.05 | 0.89 | 0.00 | 0.00 | 1.00 | ||
(0.09) | (0.09) | (0.10) | (0.00) | (0.00) | (0.00) | |||
0.60 | 0.01 | 0.01 | 0.98 | 0.00 | 0.00 | 1.00 | ||
(0.04) | (0.04) | (0.04) | (0.00) | (0.00) | (0.00) | |||
0.80 | 0.04 | 0.04 | 0.99 | 0.00 | 0.00 | 1.00 | ||
(0.06) | (0.06) | (0.03) | (0.00) | (0.00) | (0.00) | |||
0.95 | 0.10 | 0.06 | 0.89 | 0.08 | 0.04 | 0.95 | ||
(0.20) | (0.10) | (0.20) | (0.10) | (0.11) | (0.08) | |||
0.50 | 0.15 | 0.24 | 0.01 | 0.67 | 0.50 | 0.00 | 0.90 | |
(0.16) | (0.05) | (0.09) | (0.00) | (0.00) | (0.03) | |||
0.30 | 0.49 | 0.01 | 0.94 | 0.50 | 0.00 | 1.00 | ||
(0.08) | (0.05) | (0.06) | (0.00) | (0.00) | (0.00) | |||
0.60 | 0.51 | 0.00 | 0.93 | 0.50 | 0.00 | 1.00 | ||
(0.03) | (0.00) | (0.06) | (0.00) | (0.00) | (0.00) | |||
0.80 | 0.56 | 0.07 | 0.97 | 0.49 | 0.01 | 1.00 | ||
(0.09) | (0.08) | (0.04) | (0.02) | (0.05) | (0.00) | |||
0.95 | 0.65 | 0.20 | 0.85 | 0.48 | 0.04 | 0.92 | ||
(0.31) | (0.17) | (0.22) | (0.10) | (0.15) | (0.18) | |||
1.00 | 0.15 | 0.40 | 0.03 | 0.66 | 1.00 | 0.00 | 0.90 | |
(0.25) | (0.05) | (0.07) | (0.00) | (0.00) | (0.00) | |||
0.30 | 0.90 | 0.04 | 0.86 | 1.00 | 0.00 | 1.00 | ||
(0.14) | (0.08) | (0.08) | (0.00) | (0.00) | (0.00) | |||
0.60 | 1.00 | 0.00 | 0.96 | 1.00 | 0.00 | 1.00 | ||
(0.00) | (0.00) | (0.05) | (0.00) | (0.00) | (0.00) | |||
0.80 | 1.00 | 0.00 | 1.00 | 1.00 | 0.00 | 1.00 | ||
(0.00) | (0.00) | (0.00) | (0.00) | (0.00) | (0.00) | |||
0.95 | 0.35 | 0.15 | 0.85 | 0.89 | 0.10 | 0.89 | ||
(0.21) | (0.16) | (0.22) | (0.21) | (0.11) | (0.10) |
We use the MCMC algorithm featuring MALA updates for $\boldsymbol{\beta }$, $\mathbf{w}$ and $\boldsymbol{\gamma }$. Chain lengths are set to $1\times {10^{4}}$, with the initial 5,000 samples treated as burn-in and the rest thinned by selecting every 10-th sample, which reduces any remaining auto-correlation and produces 500 approximately independent posterior samples for each setting. The posterior estimate $\widehat{\boldsymbol{\theta }}$ is then obtained from these samples by computing the median or a MAP estimate, as applicable for the model. Coverage probabilities for model M4 remained sufficiently high ($\approx 1$) across all settings, declining only marginally for $\mathbf{w}$ (remaining above $90\% $) under high proportions of zeros (low signal-to-noise ratio) in the data. We performed additional synthetic experiments to showcase (a) the performance of M3 with respect to the quality of estimation for spatial effects and (b) the performance of M2; they are detailed in the Supplementary Materials, whose contents we briefly outline in the next section.
4 Supplementary Analysis
The online Supplement to this paper contains details on the derivations of the posteriors essential for constructing the MCMC subroutines; these are outlined in Section S1. Section S2 features additional simulation experiments that supplement those outlined in Section 3: it documents the performance of M2 (Table S1), contains results of experiments for scenarios featuring spatial covariates (Table S2), and considers the varying spatial patterns shown in Figure 1 (Tables S3 and S4). Convergence diagnostics are shown for selected model parameters (the index parameter ξ) in Section S2.2. Contents of the R package are described in Section S2.3. Finally, results for models M1 and M3 pertaining to the real data analysis described in the next section appear in Section S3, Tables S5, S6, S7 and S8.
5 Automobile Insurance Premiums, Connecticut, 2008
We analyze automobile insurance premiums for collision coverage in the state of Connecticut for the year 2008. The data are obtained from a comprehensive repository, the Highway Loss Data Institute (HLDI), maintained by the Insurance Institute for Highway Safety (IIHS) (https://www.iihs.org/), an independent non-profit working towards reducing losses arising from motor vehicle collisions in North America. We briefly describe the variables contained in the HLDI data, which records covariate information for an insured individual at three levels, as follows:
a. Individual level: (i) accident year and model year of the vehicle, (ii) age, gender, marital status.
b. Policy level: (i) policy payments, measured in United States dollars, (ii) exposure, measured in policy years (for e.g., 0.5 indicates a coverage period of 6 months, or half a year), (iii) policy risk, having two levels, assigned by the insurance company based on information provided by the individual, (iv) deductible limit, with 8 categories.
c. Spatial: 5-digit zip code.
Derived variables like age categories and vehicle age in years, and interactions like gender × marital status, are computed and used as covariates in the model. For the state of Connecticut, 1,513,655 (≈ 1.5 million) data records were obtained for the year 2008, across 281 zip codes. Zip codes are areal units; we consider the latitude-longitude of each zip code's centroid as its point-referenced counterpart for our application, so that the distance between two zip codes is specified as the Euclidean distance between their centroids. The proportion of zeros in the payments is 95.73%. From an insurer's perspective, policy rate-making is the problem of assigning a policy premium to a new customer's policy based on their covariate information (for instance, individual level, policy level and residence zip code); we achieve this via out-of-sample prediction. To that end, we consider a 60-40 split of the data, performed by stratified sampling without replacement over zip codes, such that all 281 zip codes are also available in the out-sample. The training data then contains ${N_{tr}}=908,741$ observations, with ${N_{pr}}=604,914$ observations kept in reserve for prediction, constituting the out-sample data.
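The stratified 60-40 split can be sketched in R as below; the data frame dat, its columns, and the toy records are hypothetical stand-ins for the HLDI data, and this is only an indicative recipe for the sampling described above.

```r
## Toy stand-in for the HLDI records: 'zip' identifies the zip code of each policy.
set.seed(2008)
dat <- data.frame(zip     = sample(sprintf("zip%03d", 1:281), 5000, replace = TRUE),
                  payment = rexp(5000))
## Stratified 60-40 split over zip codes so that every zip code contributes to
## both the training and the out-sample data.
rows_by_zip <- split(seq_len(nrow(dat)), dat$zip)
idx_train   <- unlist(lapply(rows_by_zip, function(rows)
  rows[sample.int(length(rows), ceiling(0.6 * length(rows)))]))
train <- dat[idx_train, ]    # plays the role of the N_tr = 908,741 training records
test  <- dat[-idx_train, ]   # plays the role of the N_pr = 604,914 out-sample records
```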
Figure 3
(left) Spatial plot of zip-code level aggregated pure-premium $\times {10^{-6}}$ for the state of Connecticut, 2008 (right) histogram for the pure-premium overlaid with a probability density estimate (in red).
Table 4
Estimated coefficients for fixed effects corresponding to model M2. We show the MAP, median, mean, standard deviation and HPD intervals. The variables listed are a result of FDR-based variable selection at 1%.
Parameters | Covariate | Levels | MAP | Median | Mean | SD | Lower HPD | Upper HPD |
$\boldsymbol{\beta }$ | (Intercept) | – | 5.815 | 5.965 | 5.955 | 0.151 | 5.706 | 6.227 |
age.car | 1 | 0.113 | 0.112 | 0.110 | 0.033 | 0.044 | 0.173 | |
2 | 0.234 | 0.239 | 0.240 | 0.030 | 0.186 | 0.304 | ||
5 | 0.124 | 0.122 | 0.121 | 0.033 | 0.062 | 0.189 | ||
risk | S | −0.213 | −0.217 | −0.217 | 0.023 | −0.258 | −0.168 | |
agec | 3 | −0.399 | −0.399 | −0.400 | 0.030 | −0.454 | −0.341 | |
4 | −0.518 | −0.515 | −0.515 | 0.035 | −0.580 | −0.444 | ||
5 | −0.712 | −0.710 | −0.708 | 0.065 | −0.832 | −0.575 | ||
gender | F | 0.493 | 0.502 | 0.513 | 0.067 | 0.400 | 0.648 | |
M | 0.608 | 0.614 | 0.619 | 0.053 | 0.521 | 0.731 | ||
marital | M | −0.376 | −0.392 | −0.403 | 0.067 | −0.577 | −0.302 | |
deductible | B | 0.898 | 1.051 | 1.104 | 0.244 | 0.740 | 1.525 | |
E | 0.616 | 0.482 | 0.478 | 0.158 | 0.192 | 0.741 | ||
F | 0.580 | 0.428 | 0.435 | 0.154 | 0.135 | 0.656 | ||
G | 0.287 | 0.348 | 0.358 | 0.159 | 0.073 | 0.633 | ||
$\boldsymbol{\gamma }$ | (Intercept) | – | 7.354 | 7.346 | 7.345 | 0.048 | 7.249 | 7.435 |
age.car | 0 | −0.820 | −0.811 | −0.811 | 0.053 | −0.911 | −0.714 | |
1 | −1.024 | −1.017 | −1.016 | 0.051 | −1.115 | −0.922 | ||
2 | −0.888 | −0.874 | −0.873 | 0.052 | −0.969 | −0.776 | ||
3 | −0.864 | −0.854 | −0.851 | 0.052 | −0.953 | −0.760 | ||
4 | −0.843 | −0.838 | −0.838 | 0.052 | −0.936 | −0.743 | ||
5 | −0.788 | −0.781 | −0.780 | 0.052 | −0.877 | −0.682 | ||
6 | −0.770 | −0.765 | −0.765 | 0.053 | −0.862 | −0.665 | ||
7 | −0.730 | −0.723 | −0.723 | 0.053 | −0.829 | −0.627 | ||
risk | S | 0.114 | 0.114 | 0.114 | 0.011 | 0.093 | 0.135 | |
agec | 5 | −0.289 | −0.293 | −0.293 | 0.029 | −0.345 | −0.234 | |
deductible | B | −0.971 | −0.963 | −0.951 | 0.099 | −1.136 | −0.759 | |
C | −0.519 | −0.522 | −0.517 | 0.049 | −0.614 | −0.414 | ||
D | −0.565 | −0.568 | −0.564 | 0.044 | −0.657 | −0.470 | ||
E | −0.506 | −0.501 | −0.495 | 0.042 | −0.568 | −0.394 | ||
F | −0.348 | −0.346 | −0.340 | 0.039 | −0.410 | −0.242 | ||
H | 0.411 | 0.419 | 0.426 | 0.097 | 0.251 | 0.618 | ||
genderMarital | A | −0.113 | −0.106 | −0.097 | 0.043 | −0.168 | −0.001 | |
ξ | – | – | 1.673 | 1.673 | 1.673 | 0.001 | 1.671 | 1.676 |
Table 5
Estimated coefficients for fixed effects corresponding to model M4. We show the MAP, median, mean, standard deviation and HPDs. The variables listed are a result of FDR-based variable selection at 1%.
Parameters | Covariate | Levels | MAP | Median | Mean | SD | Lower HPD | Upper HPD |
$\boldsymbol{\beta }$ | (Intercept) | – | 5.215 | 5.213 | 5.216 | 0.135 | 5.165 | 5.327 |
age.car | 0 | 0.252 | 0.251 | 0.251 | 0.063 | 0.123 | 0.377 | |
1 | 0.329 | 0.317 | 0.318 | 0.058 | 0.217 | 0.431 | ||
2 | 0.424 | 0.423 | 0.425 | 0.061 | 0.318 | 0.549 | ||
3 | 0.209 | 0.193 | 0.195 | 0.060 | 0.084 | 0.318 | ||
4 | 0.236 | 0.252 | 0.255 | 0.062 | 0.149 | 0.387 | ||
5 | 0.297 | 0.302 | 0.307 | 0.059 | 0.191 | 0.428 | ||
6 | 0.182 | 0.209 | 0.214 | 0.061 | 0.113 | 0.343 | ||
7 | 0.155 | 0.154 | 0.157 | 0.063 | 0.049 | 0.286 | ||
risk | S | −0.190 | −0.184 | −0.183 | 0.023 | −0.226 | −0.137 | |
agec | 2 | −0.121 | −0.126 | −0.128 | 0.030 | −0.186 | −0.068 | |
3 | −0.461 | −0.467 | −0.470 | 0.031 | −0.531 | −0.412 | ||
4 | −0.608 | −0.617 | −0.619 | 0.034 | −0.685 | −0.554 | ||
5 | −0.721 | −0.699 | −0.696 | 0.065 | −0.808 | −0.561 | ||
gender | F | 0.535 | 0.540 | 0.541 | 0.066 | 0.407 | 0.679 | |
M | 0.670 | 0.705 | 0.720 | 0.100 | 0.535 | 0.919 | ||
deductible | B | 1.337 | 1.383 | 1.408 | 0.216 | 1.036 | 1.866 | |
C | 0.209 | 0.256 | 0.303 | 0.152 | 0.073 | 0.624 | ||
D | 0.603 | 0.627 | 0.672 | 0.161 | 0.397 | 0.982 | ||
E | 0.800 | 0.825 | 0.864 | 0.157 | 0.625 | 1.189 | ||
F | 0.753 | 0.774 | 0.819 | 0.155 | 0.592 | 1.154 | ||
G | 0.756 | 0.786 | 0.825 | 0.158 | 0.580 | 1.169 | ||
H | 0.820 | 0.829 | 0.824 | 0.190 | 0.463 | 1.149 | ||
genderMarital | B | −0.368 | −0.236 | −0.169 | 0.233 | −0.457 | 0.311 | |
$\boldsymbol{\gamma }$ | (Intercept) | – | 6.423 | 6.429 | 6.415 | 0.073 | 6.310 | 6.504 |
age.car | 0 | −0.790 | −0.809 | −0.811 | 0.040 | −0.891 | −0.731 | |
1 | −1.006 | −1.025 | −1.028 | 0.039 | −1.099 | −0.954 | ||
2 | −0.875 | −0.885 | −0.889 | 0.039 | −0.972 | −0.819 | ||
3 | −0.842 | −0.861 | −0.863 | 0.039 | −0.950 | −0.799 | ||
4 | −0.831 | −0.844 | −0.848 | 0.039 | −0.923 | −0.777 | ||
5 | −0.781 | −0.792 | −0.795 | 0.039 | −0.879 | −0.728 | ||
6 | −0.762 | −0.766 | −0.769 | 0.039 | −0.849 | −0.703 | ||
7 | −0.725 | −0.732 | −0.735 | 0.039 | −0.814 | −0.662 | ||
risk | S | 0.110 | 0.110 | 0.110 | 0.012 | 0.087 | 0.132 | |
agec | 5 | −0.262 | −0.262 | −0.261 | 0.030 | −0.316 | −0.201 | |
deductible | B | −1.023 | −1.026 | −1.029 | 0.158 | −1.328 | −0.721 | |
C | −0.534 | −0.553 | −0.571 | 0.069 | −0.700 | −0.461 | ||
D | −0.591 | −0.611 | −0.634 | 0.067 | −0.760 | −0.534 | ||
E | −0.533 | −0.552 | −0.575 | 0.066 | −0.702 | −0.478 | ||
F | −0.377 | −0.390 | −0.416 | 0.065 | −0.543 | −0.335 | ||
H | 0.352 | 0.361 | 0.369 | 0.107 | 0.170 | 0.591 | ||
ξ | – | – | 1.667 | 1.667 | 1.667 | 0.001 | 1.665 | 1.670 |
We denote the payment towards a policy made by individual j residing in zip code ${\mathbf{s}_{i}}$ as ${y_{ij}}$, with an exposure of ${t_{ij}}$. We assume that the policy premium, defined as ${y_{ij}^{\ast }}=\frac{{y_{ij}}}{{t_{ij}}}$, follows ${y_{ij}^{\ast }}\sim Tw\left({\mu _{ij}},{\phi _{ij}},\xi \right)$, which implies ${y_{ij}}\sim Tw({t_{ij}}{\mu _{ij}},{t_{ij}^{2-\xi }}{\phi _{ij}},\xi )$ using the scale invariance property. The following hierarchical DGLM is then specified,
(5.1)
\[ \begin{aligned}{}\log {\mu _{ij}}({\mathbf{s}_{i}})& =-\log {t_{ij}}+{\mathbf{x}_{ij}^{T}}({\mathbf{s}_{i}})\boldsymbol{\beta }+{\mathbf{f}_{ij}}{({\mathbf{s}_{i}})^{T}}\mathbf{w}({\mathbf{s}_{i}}),\\ {} \log {\phi _{ij}}& =-(2-\xi )\log {t_{ij}}+{\mathbf{z}_{ij}^{T}}\boldsymbol{\gamma },\end{aligned}\]
when considered with a spatial specification, where the terms $-\log {t_{ij}}$ and $-(2-\xi )\log {t_{ij}}$ act as offsets for the respective mean and dispersion models. Given the covariates described at the beginning, $p=q=29$, producing an ${N_{tr}}\times (p-1)$ design matrix for the mean model and an ${N_{tr}}\times q$ design matrix for the dispersion model. The model in (5.1) specifies model M3 from Table 1; M4 is obtained by specifying $\pi ({\boldsymbol{\theta }_{vs}})$ from (2.5) on ${\boldsymbol{\theta }_{m}}$. M1 is obtained by setting ${\mathbf{f}_{ij}}({\mathbf{s}_{i}})=\mathbf{0}$, and M2 is obtained by specifying $\pi ({\boldsymbol{\theta }_{vs}})$ on ${\boldsymbol{\theta }_{m}}$ for the resulting model. For M1 and M2 we include an intercept in the mean model. We fit models M1–M4 on the training data. Model selection is performed using FDR-based variable selection on the posterior MCMC samples obtained from fitting models M2 and M4, controlling the FDR at 1%. The performance of M1–M4 is assessed using the Akaike information criterion (AIC) [3]. While specifying (2.4) and (2.5), the hyper-parameter settings used are ${\sigma _{\beta }^{2}}={\sigma _{\gamma }^{2}}={10^{6}}$, ${a_{{\sigma _{\beta }}}}={a_{\sigma }}={a_{{\sigma _{\gamma }}}}=2$ and ${b_{{\sigma _{\beta }}}}={b_{\sigma }}={b_{{\sigma _{\gamma }}}}=1$ (generating inverse-Gamma priors with mean 1 and infinite variance), ${a_{{\phi _{s}}}}=0$, ${b_{{\phi _{s}}}}=60$. We maintain ${a_{\xi }}=1$ and ${b_{\xi }}=2$ for all models. For ease of implementation, the smoothness parameter ν is fixed at 0.5, producing the exponential covariance kernel. We consider $1\times {10^{5}}$ MCMC iterations for generating samples from the respective posteriors, with burn-in diagnosed at $5\times {10^{4}}$, and include every 20-th sample thereafter to compute posterior estimates as our thinning strategy. Convergence is assessed by inspecting trace plots. Proposal variances were scaled adaptively to provide optimal acceptance rates of 58% (MALA) and 33% (M-H). Predictive performance for models M1–M4 is judged based on the square root of the deviance on the out-sample data. The results are shown in Table 6, with optimal values marked in bold. We show the results for estimated model coefficients featuring Bayesian variable selection (models M2 and M4) in Tables 4 and 5; results for M1 and M3 are deferred to Tables S5, S6, S7 and S8 in the Supplementary Materials. Posterior estimates for the spatial effects in models M3 and M4 are shown in Figure 4, where zip codes with significant effects are color coded appropriately.
Figure 4
Spatial plot showing the posterior estimates for 281 zipcodes from (left) M3, and (right) M4 in Connecticut. The locations are color coded based on significance, with white indicating a location with 0 in its HPD interval, blue (red) indicating HPD interval with both endpoints negative (positive).
Comparing the models, we observe that M4 produces the best model fit criteria among the models considered; this extends to out-sample performance when predicting policy premiums. Plots produced for spatial effects in models M3 and M4 are mean adjusted. Since the specifications of M3 and M4 differ only in the presence or absence of the hierarchical Bayesian variable selection component, the produced spatial effects mimic each other after adjusting for the mean. Comparing the results in Tables 4 and 5, we observe that including the spatial effect results in more categories of vehicle age, driver age and deductible being selected. Overall, the findings remain consistent with our earlier research [see, 29], but within a more robust model choice framework; this is evident when comparing model estimates between Table 5 and Tables S7 and S8 in the Supplement. Marital status and the interaction between gender and marital status are not selected. We conclude by observing that the spatial effects are significantly positive in the major cities of Connecticut, indicating a higher spatial risk, as opposed to sparsely populated regions showing significantly lower risk.
6 Discussion
Double generalized linear models have not seen much use since their inception by [40]; hindrances presented by ambiguities around model specification and choice are addressed in this paper. We propose Bayesian modeling frameworks that perform model selection using continuous spike-and-slab priors for hierarchical double generalized linear Tweedie spatial process models. Leveraging Langevin dynamics, we are able to produce practical implementations of the proposed frameworks that would otherwise remain unachievable with standard MCMC techniques. The proposed algorithms are available as a publicly accessible package for the R statistical environment. Although the formulation considers CP-g densities, such modeling could evidently be effected under any probabilistic framework that allows for varying dispersion. The application offers some key insights into the actuarial domain: it is generally believed that marital status and gender play a key role in determining premiums, but the model inference suggests otherwise, with marital status not being selected as a significant feature.
Future work is aimed at extending this framework in multiple directions. Firstly, with the advent of modern Bayesian variable selection priors, for example the Bayesian LASSO and the horseshoe prior, a comparison of model selection performance within hierarchical DGLM formulations remains to be carried out. Secondly, with emerging techniques for handling large spatial and spatiotemporal data [see, for e.g., 31], the DGLM framework could be extended to model spatially or spatio-temporally indexed observations over massive geographical domains; with respect to our application, this would allow us to investigate properties of the premium surface over much larger domains, for instance a country-wide study. Finally, extending these models to a spatiotemporal setting could be achieved using commonly used spatiotemporal covariance kernels; depending on the nature of the spatial and temporal interaction, separable and non-separable kernels are at our disposal [see, 15, and references therein]. Bayesian variable selection could then be effected to examine the resulting changes in model specification upon inclusion of random effects that address spatiotemporal variation in the data.