Bayesian Variable Selection in Double Generalized Linear Tweedie Spatial Process Models
Volume 1, Issue 2 (2023), pp. 187–199
Pub. online: 19 June 2023
Type: Methodology Article
Open Access
Area: Statistical Methodology
1 Equal contribution.
Accepted: 31 May 2023
Published: 19 June 2023
Abstract
Double generalized linear models provide a flexible framework for modeling data by allowing both the mean and the dispersion to vary across observations. Common members of the exponential dispersion family, including the Gaussian, Poisson, compound Poisson-gamma (CP-g), gamma and inverse-Gaussian distributions, are known to admit such models. Their limited use can be attributed to ambiguities in model specification when the number of covariates is large and to complications that arise when data display complex spatial dependence. In this work we consider a hierarchical specification for the CP-g model with a spatial random effect. The spatial effect supports uncertainty quantification by modeling dependence in the data that arises from location-based indexing of the response. We focus on a Gaussian process specification for the spatial effect. Simultaneously, we tackle the problem of model specification for such models using Bayesian variable selection, effected through a continuous spike-and-slab prior on the model parameters, specifically the fixed effects. The novelty of our contribution lies in the Bayesian frameworks developed for such models. We perform various synthetic experiments to showcase the accuracy of our frameworks. They are then applied to analyze automobile insurance premiums in Connecticut for the year 2008.
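To make the model class in the abstract concrete, the following is a minimal simulation sketch of a CP-g double generalized linear model with an optional Gaussian process spatial effect. It is not the interface of the sptwdglm package. The function name, argument names, log links for both mean and dispersion, the placement of the spatial effect in the mean predictor, and the exponential covariance are all assumptions made for illustration. The only standard ingredient is the compound Poisson-gamma representation of a Tweedie variable with power 1 < p < 2, mean mu and dispersion phi.

```python
import numpy as np


def simulate_cpg_dglm(X, Z, beta, gamma_coef, p, coords=None,
                      sigma2=1.0, decay=1.0, seed=0):
    """Simulate CP-g responses under a double GLM (hypothetical helper).

    Assumed structure (illustrative, not taken from the paper):
      log(mu_i)  = x_i' beta + w(s_i)     mean model, log link
      log(phi_i) = z_i' gamma_coef        dispersion model, log link
      w ~ GP(0, sigma2 * exp(-decay * distance))   spatial effect
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]

    # Optional spatial random effect drawn from a zero-mean Gaussian process.
    w = np.zeros(n)
    if coords is not None:
        dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
        K = sigma2 * np.exp(-decay * dist)
        L = np.linalg.cholesky(K + 1e-8 * np.eye(n))  # small jitter for stability
        w = L @ rng.standard_normal(n)

    mu = np.exp(X @ beta + w)        # observation-specific mean
    phi = np.exp(Z @ gamma_coef)     # observation-specific dispersion

    # Compound Poisson-gamma representation of Tweedie(mu, phi, p), 1 < p < 2:
    # N ~ Poisson(lam), Y | N ~ Gamma(shape = N * alpha, scale = scale).
    lam = mu ** (2.0 - p) / (phi * (2.0 - p))
    alpha = (2.0 - p) / (p - 1.0)
    scale = phi * (p - 1.0) * mu ** (p - 1.0)

    N = rng.poisson(lam)
    y = np.zeros(n)
    pos = N > 0                      # zero counts give exact zeros in y
    y[pos] = rng.gamma(shape=N[pos] * alpha, scale=scale[pos])
    return y, mu, phi, N


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    n = 100
    coords = rng.uniform(size=(n, 2))
    X = np.column_stack([np.ones(n), rng.standard_normal(n)])
    Z = np.column_stack([np.ones(n), rng.standard_normal(n)])
    y, mu, phi, N = simulate_cpg_dglm(
        X, Z, beta=np.array([0.5, 0.3]), gamma_coef=np.array([0.0, 0.2]),
        p=1.5, coords=coords,
    )
    print("proportion of exact zeros:", np.mean(y == 0))
```

Under this representation the response has mean mu and variance phi * mu^p, and it places positive probability on exact zeros, which is the feature that makes the CP-g family attractive for insurance premium data of the kind analyzed in the paper.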
Supplementary material
Supplementary Material containing further details as described in Section 4 is available online. The R package is available for installation and deployment at: https://github.com/arh926/sptwdglm.