Variable selection is widely used across application areas of data analytics, ranging from the optimal selection of genes in large-scale microarray studies, to the optimal selection of biomarkers for targeted therapy in cancer genomics, to the optimal selection of predictors in business analytics. A formal way to perform this selection under the Bayesian approach is to select the model with the highest posterior probability. The problem may be viewed as an optimization problem over the model space in which the objective function is the posterior probability of a model. We propose to carry out this optimization using simulated annealing and illustrate its feasibility in high dimensional problems. Various simulation studies show this new approach to be efficient. Theoretical justifications are provided and applications to high dimensional datasets are discussed. The proposed method is implemented in the R package sahpm for general use and is made available on CRAN.
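As a rough illustration of the optimization the abstract describes, the following minimal Python sketch runs simulated annealing over binary predictor-inclusion vectors. The log_post callable is a placeholder for a model's (unnormalized) log posterior probability; this is an illustrative sketch, not the sahpm implementation, whose interface may differ.

import numpy as np

def anneal_model_search(log_post, p, n_iter=5000, t0=1.0, cooling=0.999, rng=None):
    """Simulated annealing over {0,1}^p model-inclusion vectors.

    log_post: callable returning the (unnormalized) log posterior
              probability of the model encoded by an inclusion vector.
    """
    rng = np.random.default_rng(rng)
    gamma = rng.integers(0, 2, size=p)          # random starting model
    best, best_val = gamma.copy(), log_post(gamma)
    cur_val, temp = best_val, t0
    for _ in range(n_iter):
        cand = gamma.copy()
        j = rng.integers(p)
        cand[j] = 1 - cand[j]                   # flip one predictor in or out
        cand_val = log_post(cand)
        # accept uphill moves always; downhill moves with Boltzmann probability
        if cand_val > cur_val or rng.random() < np.exp((cand_val - cur_val) / temp):
            gamma, cur_val = cand, cand_val
            if cur_val > best_val:
                best, best_val = gamma.copy(), cur_val
        temp *= cooling                         # geometric cooling schedule
    return best, best_val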
Tail probability plays an important part in extreme value theory. The conclusions from the two approaches for estimating the tail probability of extreme events, the Bayesian and the frequentist methods, can sometimes differ substantially. In 1999, a rainfall event that caused more than 30,000 deaths in Venezuela was not captured by simple frequentist extreme value techniques. This catastrophic rainfall was, however, not surprising when Bayesian inference was used to allow for parameter uncertainty and the full available data were exploited [4].
In this paper, we investigate the reasons why the Bayesian estimator of the tail probability is always higher than the frequentist estimator. Sufficient conditions for this phenomenon are established both by using Jensen's inequality and by examining Taylor series approximations, both of which point to the convexity of the distribution function.
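A minimal sketch of the Jensen's-inequality direction, in generic notation that is not necessarily the paper's: write $p(\theta) = P(X > t \mid \theta)$ for the tail probability and $\pi$ for the posterior. If $\theta \mapsto p(\theta)$ is convex, then
\[
  \widehat{p}_{\mathrm{Bayes}}
  \;=\; \mathbb{E}_{\pi}\!\left[\,P(X > t \mid \theta)\,\right]
  \;\ge\; P\!\left(X > t \mid \mathbb{E}_{\pi}[\theta]\right),
\]
and the right-hand side is close to the frequentist plug-in estimator whenever the posterior mean is close to the frequentist point estimate.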
Data matrix centering is an ever-present yet under-examined aspect of data analysis. Functional data analysis (FDA) often operates with a default of centering such that the vectors along one dimension have mean zero. We find that centering along the other dimension identifies a novel and useful mode of variation beyond those familiar in FDA. We explore ambiguities in both matrix orientation and nomenclature. Differences between centerings, and their potential interaction, can be easily misunderstood. We propose a unified framework and new terminology for centering operations. We clearly demonstrate the intuition behind, and consequences of, each centering choice with informative graphics. We also propose a new direction energy hypothesis test as part of a series of diagnostics for determining which choice of centering is best for a given data set. We explore the application of these diagnostics in several FDA settings.
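To make the two centering directions concrete, a small NumPy sketch follows; the orientation is an assumption here, with rows as observational units (e.g., curves) and columns as measurement points.

import numpy as np

X = np.random.default_rng(0).normal(size=(50, 100))   # rows: units, columns: measurement points

col_centered = X - X.mean(axis=0, keepdims=True)       # remove the mean curve (the usual FDA default)
row_centered = X - X.mean(axis=1, keepdims=True)       # remove each unit's own mean level
double_centered = (X - X.mean(axis=0, keepdims=True)
                     - X.mean(axis=1, keepdims=True)
                     + X.mean())                        # remove both, adding back the grand mean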
As a prominent dimension reduction method for multivariate linear regression, the envelope model has received increased attention over the past decade due to its modeling flexibility and its success in enhancing estimation and prediction efficiency. Several enveloping approaches have been proposed in the literature; among these, the partial response envelope model [57], which envelops only the coefficients of the predictors of interest, and the simultaneous envelope model [14], which combines the predictor and response envelope models within a unified modeling framework, are noteworthy. In this article, we incorporate these two approaches within a Bayesian framework and propose a novel Bayesian simultaneous partial envelope model that generalizes, and addresses some limitations of, the two approaches. Our method offers the flexibility of incorporating prior information when available, and aids coherent quantification of all modeling uncertainty through the posterior distribution of the model parameters. A block Metropolis-within-Gibbs algorithm is developed for Markov chain Monte Carlo (MCMC) sampling from the posterior. The utility of our model is corroborated by theoretical results, comprehensive simulations, and a real imaging genetics data application from the Alzheimer's Disease Neuroimaging Initiative (ADNI) study.
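As a hedged sketch of the general sampler structure only (not the paper's specific envelope updates), a block Metropolis-within-Gibbs loop alternates Metropolis steps for parameter blocks without tractable full conditionals and exact Gibbs draws for blocks that have them; the block names and update functions below are placeholders.

import numpy as np

def block_mwg(log_post, init, proposers, gibbs_updates, n_iter=2000, rng=None):
    """Generic block Metropolis-within-Gibbs skeleton.

    log_post(state): unnormalized log posterior of the full parameter state (a dict).
    proposers: {block: fn(state, rng) -> candidate value}, assumed symmetric proposals,
               so the proposal density cancels in the acceptance ratio.
    gibbs_updates: {block: fn(state, rng) -> exact draw from its full conditional}.
    """
    rng = np.random.default_rng(rng)
    state = dict(init)
    draws = []
    for _ in range(n_iter):
        for name, propose in proposers.items():
            cand = dict(state)
            cand[name] = propose(state, rng)
            # accept or reject the candidate block, holding all other blocks fixed
            if np.log(rng.random()) < log_post(cand) - log_post(state):
                state = cand
        for name, draw in gibbs_updates.items():
            state[name] = draw(state, rng)      # exact draw from the full conditional
        draws.append(dict(state))
    return draws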
Approximate confidence distribution computing (ACDC) offers a new take on the rapidly developing field of likelihood-free inference from within a frequentist framework. The appeal of this computational method for statistical inference hinges upon the concept of a confidence distribution, a special type of estimator defined with respect to the repeated sampling principle. An ACDC method provides frequentist validation for computational inference in problems with unknown or intractable likelihoods. The main theoretical contribution of this work is the identification of a matching condition necessary for the frequentist validity of inference from this method. In addition to providing an example of how a modern understanding of confidence distribution theory can be used to connect the Bayesian and frequentist inferential paradigms, we present a case for expanding the current scope of so-called approximate Bayesian inference to include non-Bayesian inference, by targeting a confidence distribution rather than a posterior. The main practical contribution of this work is the development of a data-driven approach to ACDC in both Bayesian and frequentist contexts. The ACDC algorithm is made data-driven through the selection of a data-dependent proposal function, the structure of which is quite general and adaptable to many settings. We explore three numerical examples that both verify the theoretical arguments in the development of ACDC and suggest instances in which ACDC outperforms approximate Bayesian computing methods computationally.
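For intuition only, a minimal accept-reject skeleton with a data-dependent proposal is sketched below in Python. The names r_n, simulate, and summary are placeholders, and the matching condition and any weighting required for frequentist validity are as developed in the paper, not reproduced here.

import numpy as np

def acdc_sketch(observed, r_n, simulate, summary, eps, n_prop=100000, rng=None):
    """Accept-reject skeleton for likelihood-free inference with a
    data-dependent proposal r_n (an illustration, not the paper's exact algorithm).

    summary() should return a numeric vector; draws of theta whose simulated
    summaries land within eps of the observed summary are retained.
    """
    rng = np.random.default_rng(rng)
    s_obs = summary(observed)
    kept = []
    for _ in range(n_prop):
        theta = r_n(rng)                        # draw from the data-dependent proposal
        s_sim = summary(simulate(theta, rng))   # forward-simulate from the model
        if np.linalg.norm(s_sim - s_obs) < eps:
            kept.append(theta)                  # accepted draws form the likelihood-free output
    return np.array(kept)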
Graphical models have witnessed significant growth and usage in spatial data science for modeling data referenced over a massive number of spatial-temporal coordinates. Much of this literature has focused on a single outcome or a relatively small number of spatially dependent outcomes. Recent attention has turned to modeling and inference for a substantially larger number of outcomes. While spatial factor models and multivariate basis expansions occupy a prominent place in this domain, this article elucidates a recent approach, graphical Gaussian processes, that exploits the notion of conditional independence among a very large number of spatial processes to build scalable graphical models for fully model-based Bayesian analysis of multivariate spatial data.
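For background intuition only, the elementary Gaussian graphical-model fact that such approaches build on (not the graphical Gaussian process construction itself) is that a missing edge between two outcomes corresponds to a zero in the precision matrix, i.e., conditional independence of those outcomes given the rest.

import numpy as np

# Toy example: outcomes 1 and 3 share no edge, so the (1,3) precision entry is zero,
# yet the marginal covariance obtained by inversion is dense.
Q = np.array([[ 2.0, -0.8,  0.0],
              [-0.8,  2.0, -0.8],
              [ 0.0, -0.8,  2.0]])   # precision matrix encoding the chain graph 1 - 2 - 3
Sigma = np.linalg.inv(Q)             # dense covariance: marginal dependence persists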