1 Introduction
Multivariate spatial data consist of multiple variables that exhibit inherent associations among themselves as well as underlying spatial associations unique to each of them. While modeling each variable separately captures its spatial distribution independent of other variables, the resulting data analysis fails to account for associations among the variables, which can substantially impair prediction or interpolation [see, e.g., 9, 11, 21, 50]. Joint modeling of multivariate spatial data proceeds from vector-valued latent spatial stochastic processes or random fields, such as a multivariate Gaussian process. These are specified with matrix-valued cross-covariance functions [see, e.g., 24, 43, 33, and references therein] that model pairwise associations at distinct locations. While theoretical characterizations of cross-covariance functions are well established, modeling implications and practicability depend upon the specific application [see, e.g., 5, 34, 46, 32, 20, 44].
Multivariate spatial processes are customarily specified using a vector of mean functions and a cross-covariance matrix function. Let $z(s)={({z_{1}}(s),\dots ,{z_{q}}(s))^{\mathrm{T}}}$ be a $q\times 1$ stochastic process, where each ${z_{i}}(s)$ is a real-valued random variable at location $s\in \mathcal{D}\subseteq {\mathrm{\Re }^{d}}$. The process is specified by its means $\text{E}[{z_{i}}(s)]={\mu _{i}}(s)$ and the second-order covariance functions ${C_{ij}}(s,{s^{\prime }})=\text{Cov}\{{z_{i}}(s),{z_{j}}({s^{\prime }})\}$ for $s,{s^{\prime }}\in {\mathrm{\Re }^{d}}$ and $i,j=1,2,\dots ,q$. These covariances define the matrix-valued $q\times q$ cross-covariance function $C(s,{s^{\prime }})=\{{C_{ij}}(s,{s^{\prime }})\}$ with $(i,j)$-th entry ${C_{ij}}(s,{s^{\prime }})$. Since the mean structure can be estimated by absorbing it into a regression component, i.e., ${z_{i}}(s)={\mu _{i}}(s)+{w_{i}}(s)$, where ${w_{i}}(s)$ has zero mean, we focus on constructing a valid cross-covariance function for a zero-mean latent process. From its definition, $C(s,{s^{\prime }})$ need not be symmetric, but must satisfy $C{(s,{s^{\prime }})^{\mathrm{T}}}=C({s^{\prime }},s)$. Also, since $\text{var}\{{\textstyle\sum _{i=1}^{n}}{a_{i}^{\mathrm{T}}}z({s_{i}})\}\gt 0$ for any finite set of distinct locations ${s_{1}},{s_{2}},\dots ,{s_{n}}\in \mathcal{D}$ and any set of nonzero constant vectors ${a_{1}},{a_{2}},\dots ,{a_{n}}\in {\mathrm{\Re }^{q}}\setminus \{0\}$, we have ${\textstyle\sum _{i=1}^{n}}{\textstyle\sum _{j=1}^{n}}{a_{i}^{\mathrm{T}}}C({s_{i}},{s_{j}}){a_{j}}\gt 0$. See [24] for a review.
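As a toy numerical check of this requirement, the following sketch (an assumed example, not code from the paper) builds the block covariance matrix for a separable cross-covariance $C(s,{s^{\prime }})=\exp (-\phi \| s-{s^{\prime }}\| )\,T$ with T positive definite and verifies positive definiteness:

```python
# Hypothetical illustration: assemble the nq x nq block covariance matrix
# [C(s_i, s_j)] for a separable cross-covariance and check positive definiteness.
import numpy as np

rng = np.random.default_rng(0)
n, q, phi = 20, 3, 2.0
S = rng.uniform(size=(n, 2))                               # n locations in 2-D
rho = np.exp(-phi * np.linalg.norm(S[:, None] - S[None, :], axis=-1))
A = rng.normal(size=(q, q))
T = A @ A.T + np.eye(q)                                    # valid intra-site covariance
C = np.kron(rho, T)                                        # (i, j)-th q x q block is rho_ij * T
assert np.all(np.linalg.eigvalsh(C) > 0)                   # sum_i sum_j a_i^T C(s_i, s_j) a_j > 0
```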
Spatial factor models [35, 40, 47, 53] are among the most widely used approaches to build multivariate spatial models when the number of spatially dependent variables is large. These models build upon the popular linear model of coregionalization (LMC) [26, 44], which specifies the multivariate random field as a linear combination of r univariate random fields. Choosing $r\ll q$ produces low-rank or spatial factor models. Computational benefits ensue; one need specify only a small number r of univariate processes to ensure non-negative definiteness, and, hence, the approach scales to large q. While computationally convenient, multivariate spatial analysis using the LMC limits interpretation of the spatial structure of each variable. For example, the LMC endows each ${z_{j}}(s)$ with the same smoothness (the smoothness of the roughest latent process). This is implausible in most applications because different spatial variables typically exhibit very different degrees of smoothness. An alternative approach constructs cross-covariances by convolving univariate processes with kernel functions [48, 49, 36]. However, barring certain very special cases, the resulting cross-covariance functions are analytically intractable, hence less interpretable, and may require cumbersome numerical integration for estimating process parameters. Some of the aforementioned difficulties are obviated by a conditional approach developed in [12], where the univariate GPs are specified sequentially, each conditional on the previous GPs, assuming some ordering of the q variables. Other notable approaches include [1], who considered using latent dimensions to embed all the variables in a larger-dimensional space and apply standard covariance functions on this augmented space. However, this embedding restricts all pairwise correlations among the different variables to be positive. [25] and [2] directly formulated multivariate Matérn cross-covariance functions where both the univariate covariance functions for each variable and the cross-covariance functions between each pair of variables are members of the Matérn family. The multivariate Matérn GP is appealing in terms of interpretability, allowing direct spatial inference for each variable via estimates of the parameters of the corresponding component Matérn GP.
While such methods have been applied to diverse data sets, they have been restricted to a moderate number of outcomes. This article focuses on the high-dimensional multivariate setting where a significantly larger number of outcomes (tens to hundreds of variables) can be measured at each spatial location. Such settings are becoming increasingly commonplace in the environmental and physical sciences where inference is sought on large numbers of dependent outcomes. [17] developed a novel class of “Graphical Gaussian Process” (GGP) models for scalable and interpretable analysis of high-dimensional multivariate spatial data. The underlying idea is to exploit conditional independence among the variables using an undirected graph with the components of the multivariate GP as the nodes. Absence of an edge between two nodes represents conditional independence between the corresponding pair of component processes given all the other nodes.
Graphical models are extensively used to represent the joint distribution of several variables for many types of non-spatial data. Applications in spatial settings have been concerned with reducing the computational burden for large n by replacing the complete graph between locations with nearest-neighbor graphs [14]. Current multivariate GP covariance functions do not lend themselves naturally to incorporating an inter-variable graphical model. [30] proposed a method for parsimonious joint analysis of such high-dimensional multivariate spatial data via a common basis function expansion, imposing sparse graphical models for estimating the covariances of the coefficient vectors. This approach, akin to similar approaches common in multivariate functional modeling [52, 54], relies on multiple replicate measurements of each variable at each location, which may not be available in many multivariate spatial applications.
The approach of Dey et al. [17] constructs a high-dimensional multivariate model by adapting graphical Gaussian models to process-based settings and does not require replicate data. This process-level graphical model approach addresses some key properties of multivariate GPs that are deemed critical for handling highly multivariate data, including the retention of the flexibility and interpretation of spatial and non-spatial dependencies. The balance of the manuscript proceeds as follows. We offer a brief overview of inference from graphical models in spatial statistics, followed by an elucidation of the GGP and its implementation in a fully Bayesian modeling framework. We illustrate with some simulation examples, offer an analysis of a highly multivariate environmental data set, and conclude with some discussion and pointers for future research.
2 Graphical Models for Spatial Data Analysis
Undirected graphical models have an established history in spatial statistics in the context of specifying spatial dependencies via Markov random fields (MRF) for regionally aggregated or areal data [see, e.g., 6, 7, 10, 41, 22, 5, and references therein]. In areal modeling [37, 23, 28, 27] the nodes of the graph represent regions and the presence of an edge between two nodes indicates that the two regions are neighbors (or adjacent). Spatial models are constructed from this graph by assuming dependence between nodes with edges between them. For example, the popular conditional auto-regression (CAR) models assume conditional independence between two nonadjacent nodes given all other nodes.
Our current development departs from the multivariate areal setting in two aspects. First, we consider point-referenced settings where inference is desired at the process-level. Unlike areal settings where inference is limited to the fixed set of areas, point-referenced datasets allow inference about the variables over every conceivable location in an entire region. Second, unlike MRFs where the graph is on the set of areas specifying spatial dependencies, here the undirected graph posits inter-variable conditional independence relationships. Thus each node corresponds to the stochastic process describing the distribution of the respective variable over space.
In point-referenced settings, graphical models among locations based on spatial proximity have been used to specify nearest neighbor Gaussian processes (NNGP) [14, 18] that offer a scalable solution to the use of GP priors in Bayesian hierarchical models. However, akin to the MRF models for areal data, the NNGP uses a graph among the locations to parsimoniously specify spatial dependence. Neither approach models inter-variable relationships using graphs.
Recent developments for graphical modeling of multivariate functional or spatial data represent the multiple variable processes using a common univariate basis function expansion with multivariate (vector-valued) coefficients [52, 30]. Graphical modeling among the variables is induced by a series of graphical models, one for each of the vector-valued coefficients. Since, in practice, basis functions are truncated to a finite number of coefficients, the representation reduces to standard graphical model estimation for vector-valued data using graphical Lasso type techniques [19]. The advantage of the approach is that the graph is allowed to vary for each coefficient, thereby allowing conditional independence structures to be resolution-specific. However, these methods rely on replicate data on each variable at each location, which is atypical in many spatial settings. Also, the assumption of a common univariate basis function expansion for the multivariate process may be inadequate.
We first present a simple alternative approach to build a multivariate spatial model that explicitly respects a given inter-variable graphical model. Let $\mathcal{G}=\{\mathcal{V},\mathcal{E}\}$ be an undirected graph with $\mathcal{V}$ being the set of vertices and $\mathcal{E}$ the set of edges. A customary approach for developing probabilistic models on such undirected graphs is to specify full conditional distributions for each node given the others and then derive a joint density from the set of full conditional distributions. This is achieved using Brook’s Lemma [6, 37], which can be adapted to our setting as follows.
Let us consider a multivariate spatial process $z(s)$ with q univariate spatial processes and let ${z_{i}}={({z_{i}}({s_{i1}}),{z_{i}}({s_{i2}}),\dots ,{z_{i}}({s_{i{n_{i}}}}))^{\mathrm{T}}}$ be an ${n_{i}}\times 1$ random vector corresponding to the realizations of the i-th variable, ${z_{i}}(s)$, for each $i=1,2,\dots ,q$. We specify the following sequence of full conditional distributions for modeling ${z_{i}}$’s given all other ${z_{j}}$’s:
(2.1)
\[ {z_{i}}\hspace{0.1667em}|\hspace{0.1667em}{z_{-(i)}}\sim N\left({\sum \limits_{j=1}^{q}}{A_{ij}}{z_{j}},{\Gamma _{i}}\right)\hspace{0.2778em},\hspace{1em}i=1,2,\dots ,q\hspace{0.2778em},\]
where ${z_{-(i)}}=\{{z_{1}},{z_{2}},\dots ,{z_{q}}\}\setminus \{{z_{i}}\}$, i.e., the collection of ${z_{j}}$’s for $j=1,2,\dots ,q$ excluding ${z_{i}}$, the ${A_{ij}}$’s are fixed ${n_{i}}\times {n_{j}}$ matrices with ${A_{ii}}=O$ (the matrix of zeroes), and the ${\Gamma _{i}}$’s are fixed positive definite matrices. Since each variable can exhibit its own spatial dependence structure, ${\Gamma _{i}}$ varies by variable. Brook’s Lemma provides a straightforward method for deriving the joint density from (2.1) using the identity
(2.2)
\[\begin{array}{cc}& \displaystyle \pi ({z_{1}},{z_{2}},\dots ,{z_{q}})={\prod \limits_{i=1}^{q}}\frac{\pi ({z_{i}}\hspace{0.1667em}|\hspace{0.1667em}{\tilde{z}_{1}},\dots ,{\tilde{z}_{i-1}},{z_{i+1}},\dots ,{z_{q}})}{\pi ({\tilde{z}_{i}}\hspace{0.1667em}|\hspace{0.1667em}{\tilde{z}_{1}},\dots ,{\tilde{z}_{i-1}},{z_{i+1}},\dots ,{z_{q}})}\\ {} & \displaystyle \times \pi ({\tilde{z}_{1}},{\tilde{z}_{2}},\dots ,{\tilde{z}_{q}})\hspace{0.2778em},\end{array}\]
where $\tilde{z}={({\tilde{z}_{1}^{\mathrm{T}}},{\tilde{z}_{2}^{\mathrm{T}}},\dots ,{\tilde{z}_{q}^{\mathrm{T}}})^{\mathrm{T}}}$ is any fixed point in the support of $\pi (z)$ and we assume that the joint density $\pi (\cdot )\gt 0$ over its entire support. The proof is a straightforward verification proceeding from the last factor on the right hand side (i.e., the joint density evaluated at $\tilde{z}$). Note that, for $i=q$,
\[ \pi ({\tilde{z}_{1}},\dots ,{\tilde{z}_{q-1}},{z_{q}})=\frac{\pi ({z_{q}}\hspace{0.1667em}|\hspace{0.1667em}{\tilde{z}_{1}},\dots ,{\tilde{z}_{q-1}})}{\pi ({\tilde{z}_{q}}\hspace{0.1667em}|\hspace{0.1667em}{\tilde{z}_{1}},\dots ,{\tilde{z}_{q-1}})}\,\pi ({\tilde{z}_{1}},\dots ,{\tilde{z}_{q}})\hspace{0.2778em}.\]
Proceeding as above continues to replace ${\tilde{z}_{i}}$ with ${z_{i}}$ in the joint density on the right hand side of (2.2) for each $i=q-1,q-2,\dots ,1$, and we eventually arrive at $\pi ({z_{1}},{z_{2}},\dots ,{z_{q}})$.
Applying (2.2) to the full conditional distributions in (2.1) with ${\tilde{z}_{i}}=0$ for each $i=1,\dots ,q$ yields the joint density
(2.3)
\[ \begin{aligned}{}\pi (z)& \propto \exp \left\{-\frac{1}{2}\left({\sum \limits_{i=1}^{q}}{z_{i}^{\mathrm{T}}}{\Gamma _{i}^{-1}}{z_{i}}-{\sum \limits_{i=1}^{q}}{\sum \limits_{j\ne i}^{q}}{z_{i}^{\mathrm{T}}}{\Gamma _{i}^{-1}}{A_{ij}}{z_{j}}\right)\right\}\\ {} & \propto \exp \left(-\frac{1}{2}{z^{\mathrm{T}}}Qz\right)\hspace{0.2778em},\end{aligned}\]
where $z={({z_{1}^{\mathrm{T}}},\dots ,{z_{q}^{\mathrm{T}}})^{\mathrm{T}}}$ is $({\textstyle\sum _{i=1}^{q}}{n_{i}})\times 1$ and $Q={M^{-1}}(I-A)$ is $({\textstyle\sum _{i=1}^{q}}{n_{i}})\times ({\textstyle\sum _{i=1}^{q}}{n_{i}})$, with $M={\oplus _{i=1}^{q}}{\Gamma _{i}}$ block-diagonal having $(i,i)$-th block ${\Gamma _{i}}$ for $i=1,\dots ,q$, and $A=({A_{ij}})$ the $({\textstyle\sum _{i=1}^{q}}{n_{i}})\times ({\textstyle\sum _{i=1}^{q}}{n_{i}})$ block matrix with ${A_{ij}}$ as the $(i,j)$-th block. For (2.3) to be a valid density, Q must be symmetric and positive definite, in which case $z\sim N(0,{Q^{-1}})$.
How, then, can we construct Q to be symmetric and positive definite while also respecting the conditional independence relationships among the outcomes from a given undirected graph? To be precise, if two distinct nodes i and j in the graph do not have an edge, then we must ensure that ${z_{i}}\perp {z_{j}}\hspace{0.1667em}|\hspace{0.1667em}{z_{-(i,j)}}$ or, equivalently, that the ${n_{i}}\times {n_{j}}$ block of the precision matrix ${Q_{ij}}=-{\Gamma _{i}^{-1}}{A_{ij}}=O$, i.e., the $(i,j)$-th block of Q must be an ${n_{i}}\times {n_{j}}$ matrix of zeros. Note that since ${A_{ii}}=O$ in (2.1), the $(i,i)$-th block of Q is ${\Gamma _{i}^{-1}}$.
Given the inter-variable graph $\mathcal{G}=\{\mathcal{V},\mathcal{E}\}$, let $\Lambda =({\lambda _{ij}})=D-\rho W$ be the $q\times q$ graph Laplacian, where $W=({w_{ij}})$ is the adjacency matrix with nonzero ${w_{ij}}$ only if there is an edge between i and j, $D=({d_{ii}})$ is a $q\times q$ diagonal matrix with the row sums of W along the diagonal, i.e., ${d_{ii}}={\textstyle\sum _{j=1}^{q}}{w_{ij}}$, and ρ is a scalar parameter that ensures positive-definiteness of Λ as long as $\rho \in (1/{\zeta _{min}},1/{\zeta _{max}})$, where ${\zeta _{min}}$ and ${\zeta _{max}}$ are the minimum and maximum eigenvalues of ${D^{-1/2}}W{D^{-1/2}}$, respectively. Since each node corresponds to the realizations of a spatial process, which is modeled as a latent (unobserved) process, we can assume without much loss of generality that each ${z_{i}}$ is $n\times 1$, i.e., ${n_{i}}=n$ for each $i=1,\dots ,q$.
Let ${R_{i}}$ be the $n\times n$ upper triangular factor in the Cholesky decomposition of the positive definite covariance matrix of ${z_{i}}$, i.e., $\text{var}({z_{i}})={C_{ii}}={R_{i}^{\mathrm{T}}}{R_{i}}$ for each $i=1,2,\dots ,q$. We now set ${\Gamma _{i}^{-1}}={\lambda _{ii}}{R_{i}^{\mathrm{T}}}{R_{i}}$ and ${A_{ij}}=-({\lambda _{ij}}/{\lambda _{ii}}){R_{i}^{-1}}{R_{j}}$ in (2.1). Then ${Q_{ij}}={\lambda _{ij}}{R_{i}^{\mathrm{T}}}{R_{j}}$ and $Q={\tilde{R}^{\mathrm{T}}}(\Lambda \otimes I)\tilde{R}$, where $\tilde{R}={\oplus _{i=1}^{q}}{R_{i}}$. Since each ${R_{i}}$ is nonsingular and Λ is positive definite, it follows that Q is positive definite and $\det (Q)={\textstyle\prod _{i=1}^{q}}{(\det ({R_{i}}))^{2}}{(\det (\Lambda ))^{n}}$. The special case where every variable has the same spatial covariance function, so that ${R_{i}}=R$ for $i=1,2,\dots ,q$, yields the separable model $Q=\Lambda \otimes ({R^{\mathrm{T}}}R)$. We accrue computational benefits by modeling the spatial process so that the ${R_{i}}$’s are easily computed for massive data [see, e.g., 14, 4, 29, 39, and references therein]. Other, even more general, structures for Q can also be derived by adapting multivariate Markov random fields [27] to incorporate spatial processes into the nodes of a graph.
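The construction above is concrete enough to verify numerically. The following sketch (hypothetical code, not from [17]; exponential covariances, i.e., Matérn with $\nu =1/2$, stand in for general spatial covariances) assembles $Q={\tilde{R}^{\mathrm{T}}}(\Lambda \otimes I)\tilde{R}$ for a three-node path graph and checks that Q is positive definite with a zero block exactly where the edge is missing:

```python
# Sketch: Q = R~^T (Lambda kron I) R~ for the path graph 1-2-3 (no edge 1-3).
import numpy as np

rng = np.random.default_rng(0)
n, q = 50, 3
locs = rng.uniform(size=(n, 2))
dists = np.linalg.norm(locs[:, None, :] - locs[None, :, :], axis=-1)

phis = [2.0, 4.0, 6.0]                                   # variable-specific decays
# upper Cholesky factors R_i with C_ii = R_i^T R_i
Rs = [np.linalg.cholesky(np.exp(-phi * dists)).T for phi in phis]

W = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)   # adjacency: edges 1-2, 2-3
D = np.diag(W.sum(axis=1))
Dh = np.diag(1 / np.sqrt(np.diag(D)))
zeta = np.linalg.eigvalsh(Dh @ W @ Dh)
rho = 0.9 / zeta.max()                                   # inside (1/zeta_min, 1/zeta_max)
Lam = D - rho * W                                        # positive definite

Rtil = np.zeros((n * q, n * q))
for i, Ri in enumerate(Rs):
    Rtil[i * n:(i + 1) * n, i * n:(i + 1) * n] = Ri
Q = Rtil.T @ np.kron(Lam, np.eye(n)) @ Rtil

assert np.all(np.linalg.eigvalsh(Q) > 0)                 # Q positive definite
assert np.allclose(Q[0:n, 2 * n:3 * n], 0)               # no edge 1-3 => zero block
```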
The above construction yields proper densities in (2.3) that conform to conditional independence among the variables represented by the nodes of a posited undirected graph and can be adapted for analyzing high-dimensional multivariate spatial data, where one or both of n and q are large. An important drawback, however, is that it is challenging to characterize the marginal distributions of such a multivariate process parsimoniously in terms of a few parameters. To elucidate, even if ${\Gamma _{i}}$ is chosen from an interpretable parametric family of covariances like the Matérn family, the construction does not ensure that ${z_{i}}$ will retain a Matérn covariance. Similarly, the cross-covariances will not correspond to standard families of valid cross-covariance functions [24]. Furthermore, the construction of Q described above is essentially finite-dimensional, and the notion of conditional independence is restricted to the finite set of locations ${s_{1}},\dots ,{s_{n}}$. It is unclear whether this construction can lead to a highly multivariate spatial process over the entire domain $\mathcal{D}$ that respects process-level conditional independences. To see how this can be achieved, we elucidate the GGP [17] using a paradigm for graphical models fundamentally different from Markov random fields: covariance selection.
3 Graphical Gaussian Processes
Rather than building a Gaussian graphical model whose nodes are finite-dimensional random vectors, the GGP builds a graphical model whose nodes are the component Gaussian processes of a multivariate spatial GP, with edges representing process-level conditional dependence between the incident nodes given all other nodes. To do so, we first clarify what it means for two spatial processes to be conditionally independent. Let the q nodes of $\mathcal{G}$ represent q different spatial processes, $\{{z_{i}}(s):s\in \mathcal{D}\}$ for $i=1,2,\dots ,q$. Two distinct processes ${z_{i}}(\cdot )$ and ${z_{j}}(\cdot )$ are conditionally independent given the remaining $q-2$ processes, which we denote by ${z_{i}}(\cdot )\perp {z_{j}}(\cdot )\hspace{0.1667em}|\hspace{0.1667em}{z_{-(ij)}}(\cdot )$, if the covariance between ${z_{i}}(s)$ and ${z_{j}}({s^{\prime }})$ is zero for any pair of locations $s,{s^{\prime }}\in \mathcal{D}$ conditional on full realizations of all of the remaining processes. A q-variate process $z(s)$ is a GGP with respect to an inter-variable graph $\mathcal{G}=\{\mathcal{V},\mathcal{E}\}$ if ${z_{i}}(\cdot )$ and ${z_{j}}(\cdot )$ are conditionally independent given the remaining processes whenever $(i,j)\notin \mathcal{E}$.
Any collection of q independent spatial processes is a GGP with respect to any graph $\mathcal{G}$, but such a trivial GGP is of limited use as it does not leverage inter-variable dependencies. Sparse graphical models are often used as parsimonious working models serving as scalable approximations to a richer (or fuller) data generating model. Hence, the more relevant question is: given a reasonable inter-variable graph $\mathcal{G}$, possibly inferred from scientific knowledge, how do we construct a GGP that respects $\mathcal{G}$ and best approximates a multivariate GP with a given cross-covariance function? Thus, we wish to construct a multivariate spatial process that conforms to the posited graphical relationships among the q dependent variables and best approximates the spatial structure of a given $q\times q$ cross-covariance matrix function $C(s,{s^{\prime }})=({C_{ij}}(s,{s^{\prime }}))$. [17] proved that such an optimal GGP exists and is unique: it is the one that preserves the marginal spatial covariance structure ${C_{ii}}(s,{s^{\prime }})$ of each variable i as well as the cross-covariances ${C_{ij}}(s,{s^{\prime }})$ for all pairs of variables $(i,j)\in \mathcal{E}$.
In this regard, we recall a seminal paper by Dempster [16] on covariance selection. The key result can be described as follows. Let $\mathcal{G}=\{\mathcal{V},\mathcal{E}\}$ be any undirected graph with q nodes and let $F=({f_{ij}})$ be any positive definite covariance matrix whose elements ${f_{ij}}$ give the covariance between random variables at nodes i and j. Then, the best approximation of F (in terms of Kullback–Leibler divergence) among the class of covariance matrices satisfying the conditional independence relationships specified by the Gaussian graphical model on $\mathcal{G}$ is a unique positive definite matrix $\tilde{F}=({\tilde{f}_{ij}})$ such that:
1. ${\tilde{f}_{ii}}={f_{ii}}$ for all $i\in \mathcal{V}$;
2. ${\tilde{f}_{ij}}={f_{ij}}$ for all $(i,j)\in \mathcal{E}$; and
3. ${({\tilde{F}^{-1}})_{ij}}=0$ for all $i\ne j$ with $(i,j)\notin \mathcal{E}$.
The first two conditions state that the optimal graphical model preserves all marginal variances and the cross-covariances for pairs joined by an edge, while the third ensures adherence to the conditional independence relations specified by $\mathcal{G}$. Observe that covariance selection does not explicitly require Markovian assumptions, nor does it require modeling full conditionals as in (2.1).
Covariance selection, as described above, is applicable to finite-dimensional inference, where, for example, we restrict attention to parameter estimation and random effects at a specified set of n spatial locations. Let $C=(C({s_{i}},{s_{j}}))$ be an $nq\times nq$ spatial covariance matrix, where each $C({s_{i}},{s_{j}})$ is a $q\times q$ submatrix evaluated using a valid cross-covariance function. Given an undirected graph that posits conditional independence relations among the q variables, we can compute the unique $\tilde{C}$ preserving marginal variances and covariances and conforming to the graph using an iterative proportional scaling (IPS) algorithm [45, 51].
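A minimal scalar-node sketch of the IPS idea follows (assumed illustration; Algorithm 1, referenced below, operates on the $n\times n$ blocks of each variable, so each "node" there expands to n indices). Starting from a diagonal precision, each pass rescales one clique block so its implied marginal matches F; entries outside the cliques remain zero throughout:

```python
# Sketch of iterative proportional scaling (IPS) for covariance selection.
import numpy as np

def ips(F, cliques, tol=1e-10, max_iter=500):
    """KL-optimal approximation of covariance F under the graphical model
    whose edges are covered by `cliques` (lists of variable indices)."""
    K = np.diag(1.0 / np.diag(F))                # diagonal start: off-graph zeros
    for _ in range(max_iter):
        K_old = K.copy()
        for c in cliques:
            idx = np.ix_(c, c)
            Sigma = np.linalg.inv(K)             # covariance implied by current K
            # force the clique marginal of the model to match F exactly
            K[idx] += np.linalg.inv(F[idx]) - np.linalg.inv(Sigma[idx])
        if np.max(np.abs(K - K_old)) < tol:
            break
    return np.linalg.inv(K)                      # F~, the covariance-selection solution

# toy check with cliques {1,2}, {2,3}, {3,4} (0-indexed)
A = np.random.default_rng(1).normal(size=(4, 6))
F = A @ A.T + np.eye(4)
Ft = ips(F, cliques=[[0, 1], [1, 2], [2, 3]])
print(np.max(np.abs(np.diag(Ft) - np.diag(F))))           # ~0: variances preserved
print(np.linalg.inv(Ft)[0, 2], np.linalg.inv(Ft)[0, 3])   # ~0: graph zeros
```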
Consider the graph in Figure 1 specifying the conditional independence structure among 4 variables. We consider each variable being observed at 10 locations uniformly sampled from the unit square $(0,1)\times (0,1)$. We assume a multivariate Matérn covariance structure [2] between the processes and compute the $40\times 40$ cross-covariance matrix C. Algorithm 1 then lays out the steps to obtain the unique cross-covariance matrix $\tilde{C}$ corresponding to the GGP that conforms to the graph in Figure 1. The resulting precision matrix, plotted in Figure 2, has zero entries only in the $(1,3)$- and $(2,4)$-th blocks, i.e., exactly where there are no edges between the variable nodes.
Figure 2
$40\times 40$ precision matrix of the GGP after the IPS algorithm. Zero entries are plotted as circles and non-zero entries as dots.
However, this finite-dimensional application of covariance selection or IPS does not immediately extend to the process-level formulation of conditional dependence in a multivariate GP. We now illustrate how the GGP of [17] extends covariance selection to an infinite-dimensional framework that allows predictive inference at arbitrary spatial locations over the entire domain of interest $\mathcal{D}$.
We begin the construction with a given inter-variable graph $\mathcal{G}$, a multivariate cross-covariance function $C(s,{s^{\prime }})$, and a finite set of locations $\mathcal{L}$. Using covariance selection, we obtain a covariance matrix $\tilde{C}:=\tilde{C}(\mathcal{L})$ with which we can define a multivariate spatial model on $\mathcal{L}$ as
(3.1)
\[ \tilde{w}(\mathcal{L})={({\tilde{w}_{1}}{(\mathcal{L})^{\mathrm{T}}},\dots ,{\tilde{w}_{q}}{(\mathcal{L})^{\mathrm{T}}})^{\mathrm{T}}}\sim N(0,\tilde{C})\hspace{0.2778em}.\]
The properties of covariance selection ensure that ${\tilde{w}_{i}}(\mathcal{L})\sim N(0,{C_{ii}}(\mathcal{L}))$, $\text{Cov}({\tilde{w}_{i}}(\mathcal{L}),{\tilde{w}_{j}}(\mathcal{L}))={C_{ij}}(\mathcal{L})$ for all $(i,j)\in \mathcal{E}$, and, for all $(i,j)\notin \mathcal{E}$, ${({\tilde{C}^{-1}})_{ij}}=O$, i.e., $\text{Cov}({\tilde{w}_{i}}(\mathcal{L}),{\tilde{w}_{j}}(\mathcal{L})\hspace{0.1667em}|\hspace{0.1667em}{\tilde{w}_{-(ij)}}(\mathcal{L}))=O$. Thus, on $\mathcal{L}$, each component of $\tilde{w}$ retains the marginal covariances specified by C, the cross-covariances for variable pairs included in the graph are preserved, and the conditional dependencies align with the graph.
To extend this finite-dimensional framework on $\mathcal{L}$ to a process-level graphical model for a multivariate GP on $\mathcal{D}$, we leverage the property that a univariate GP ${w_{i}}(s)$ with covariance function ${C_{ii}}$ can be constructed as the sum of two orthogonal GPs, ${w_{i}^{\ast }}(s)$ and ${r_{i}}(s)$. Here ${w_{i}^{\ast }}(s)$ is a finite-rank predictive process that can be represented as ${w_{i}^{\ast }}(s)={b_{i}}{(s)^{\mathrm{T}}}{\tilde{w}_{i}}(\mathcal{L})$, where ${\tilde{w}_{i}}(\mathcal{L})$ is a random vector such that ${\tilde{w}_{i}}(\mathcal{L})\stackrel{d}{=}{w_{i}}(\mathcal{L})$ and ${b_{i}}(s)$ is chosen so that the predictive process ${w_{i}^{\ast }}(\cdot )$ agrees with ${w_{i}}(\cdot )$ on $\mathcal{L}$ [3]. The second GP, ${r_{i}}(\cdot )$, is often referred to as the residual GP, as it explains the residual covariance among realizations of ${w_{i}}(\cdot )$ beyond what is explained by the predictive process. Choosing the ${\tilde{w}_{i}}(\mathcal{L})$’s to be the components of $\tilde{w}(\mathcal{L})$ from (3.1) and the ${r_{i}}(\cdot )$’s to be independent across variables, one creates a multivariate GP $w(s)=B(s)\tilde{w}(\mathcal{L})+r(s)$, where $B(s)={\oplus _{i=1}^{q}}{b_{i}}{(s)^{\mathrm{T}}}$ and $r(s)={({r_{1}}(s),\dots ,{r_{q}}(s))^{\mathrm{T}}}$.
Covariance selection and the predictive process harmonize to ensure that the resulting process $w(\cdot )$ exactly or approximately satisfies all three criteria for being an optimal multivariate process given a covariance function C and a graphical model $\mathcal{G}$. Each component process satisfies ${w_{i}}(s)\sim GP(0,{C_{ii}})$, thereby preserving the marginals. Cross-covariances for $(i,j)\in \mathcal{E}$ are also preserved exactly on $\mathcal{L}$ and approximately elsewhere when $\mathcal{L}$ is chosen to be sufficiently representative of the domain. Finally, the conditional independence of $\tilde{w}(\mathcal{L})$ with respect to $\mathcal{G}$ induced by covariance selection, together with the component-wise independent residual processes ${r_{i}}(\cdot )$, ensures that the resulting multivariate process $w(\cdot )$ conforms to the process-level conditional independences specified by $\mathcal{G}$ and is thus a GGP. This construction is referred to as stitching: the multiple processes can be envisioned as multiple layers of fabric that are connected to each other only at the locations in $\mathcal{L}$, via the edges of $\mathcal{G}$ serving as threads.
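For a single variable, stitching can be sketched as follows (assumed illustration with an exponential covariance; names like cov are ours). The vector $b_i(s)$ is the usual kriging weight ${C_{ii}}(s,\mathcal{L}){C_{ii}}{(\mathcal{L},\mathcal{L})^{-1}}$, and the residual covariance is the corresponding Schur complement, so the stitched draw has exactly the marginal covariance ${C_{ii}}$:

```python
# Sketch: stitch w_i(s) = b_i(s)^T w~_i(L) + r_i(s) for one variable.
import numpy as np

rng = np.random.default_rng(2)
phi = 3.0
L = rng.uniform(size=(30, 2))               # reference locations
S = rng.uniform(size=(200, 2))              # arbitrary locations in the domain

def cov(x, y):
    d = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    return np.exp(-phi * d)                 # exponential (Matern nu = 1/2)

C_LL, C_SL, C_SS = cov(L, L), cov(S, L), cov(S, S)
B = C_SL @ np.linalg.inv(C_LL)              # rows are b_i(s)^T
R = C_SS - B @ C_SL.T                       # residual covariance (Schur complement)

# in the GGP, w~(L) is drawn jointly across variables from N(0, C~);
# here we draw a single component for illustration
w_L = np.linalg.cholesky(C_LL) @ rng.normal(size=30)
r = np.linalg.cholesky(R + 1e-8 * np.eye(len(S))) @ rng.normal(size=len(S))
w_S = B @ w_L + r                           # Cov(w_S) = B C_LL B^T + R = C_SS
```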
3.1 Inference from the Graphical Matérn GP
We consider a spatial dataset on q outcomes. Each variable is recorded at a set of locations ${\mathcal{D}_{i}}$ and is associated with a set of covariates ${X_{i}}$. Both the measurement-locations and covariates are allowed to be variable-specific. If a sparse graphical model $\mathcal{G}$ can be justified, empirically or scientifically, among the q variables, then a GGP model for analyzing such data will be specified at the process-level as
(3.2)
\[\begin{aligned}{}{y_{i}}(s)& ={X_{i}}{(s)^{\mathrm{T}}}{\beta _{i}}+{w_{i}}(s)+{\epsilon _{i}}(s)\hspace{2.5pt}\text{for}\hspace{2.5pt}i=1,\dots ,q,\\ {} {\epsilon _{i}}(s)& \stackrel{iid}{\sim }N(0,{\tau _{i}^{2}})\hspace{2.5pt}\text{for}\hspace{2.5pt}i=1,\dots ,q,\hspace{2.5pt}s\in \mathcal{D},\\ {} w(\cdot )& ={({w_{1}}(\cdot ),\dots ,{w_{q}}(\cdot ))^{\mathrm{T}}}\sim GGP(0,C,\mathcal{G}).\end{aligned}\]
Here C denotes the full q-variate cross-covariance function used to derive the covariance of the GGP. Without loss of generality, we use a reference set $\mathcal{L}$ that does not overlap with the data locations. The data likelihood corresponding to this hierarchical model can be formulated as
(3.3)
\[ \begin{aligned}{}{\prod \limits_{i=1}^{q}}& \prod \limits_{s\in {\mathcal{D}_{i}}}N({y_{i}}(s)\hspace{0.1667em}|\hspace{0.1667em}{X_{i}}{(s)^{\mathrm{T}}}{\beta _{i}}+{w_{i}}(s),{\tau _{i}^{2}})\\ {} & \hspace{2em}\times {\prod \limits_{i=1}^{q}}N({w_{i}}({\mathcal{D}_{i}})\hspace{0.1667em}|\hspace{0.1667em}{B_{i}}{w_{i}}(\mathcal{L}),{R_{ii}})\times N(w(\mathcal{L})\hspace{0.1667em}|\hspace{0.1667em}0,\tilde{C})\hspace{0.2778em}.\end{aligned}\]
Here ${B_{i}}$ stacks up the ${b_{i}}{(s)^{\mathrm{T}}}$’s for $s\in {\mathcal{D}_{i}}$ and ${R_{ii}}$ is the covariance matrix of the residual process ${r_{i}}(\cdot )$ over ${\mathcal{D}_{i}}$.
If C is chosen from the popular multivariate Matérn class, the resulting cross-covariance for the GGP is termed the graphical Matérn and retains the desirable properties of the multivariate Matérn family. In particular, each component process ${w_{i}}(\cdot )$ is exactly a Matérn GP with its own set of parameters that allows interpretation of the spatial variance, smoothness, and decay of the process. The cross-covariances for variable pairs included in $\mathcal{G}$ are also exactly or approximately Matérn, while, unlike the multivariate Matérn, the graphical Matérn allows process-level conditional independencies.
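For reference, one common parameterization of the univariate Matérn covariance underlying these components can be sketched as follows (conventions vary; here $\phi$ is the decay parameter used in this article, and the code is our illustration rather than the authors' implementation):

```python
# One common Matern parameterization; nu = 0.5 recovers sigma2 * exp(-phi * d).
import numpy as np
from scipy.special import gamma, kv

def matern(d, sigma2=1.0, phi=1.0, nu=0.5):
    d = np.asarray(d, dtype=float)
    out = np.full(d.shape, sigma2)          # C(0) = sigma2
    pos = d > 0
    x = phi * d[pos]
    out[pos] = sigma2 * (2 ** (1 - nu) / gamma(nu)) * (x ** nu) * kv(nu, x)
    return out
```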
For a general graph, the data likelihood for the GGP in (3.3) involves manipulation of the $O(q)\times O(q)$ matrix $\tilde{C}$ (suppressing the dependence on the number of locations n, which, for this article, is assumed to be small or moderate). This matrix is derived from the full covariance matrix $C(\mathcal{L},\mathcal{L})$ using the IPS algorithm. However, in its full generality, the IPS requires $O({q^{2}})$ floating point operations (FLOPS) [51]. Additionally, the parent covariance C needs to represent a valid cross-covariance class, which, for the multivariate Matérn family, involves $O({q^{2}})$ parameters with constraints that require $O({q^{3}})$ FLOPS to assess. Hence, in highly multivariate settings, a GGP with a general graph can remain computationally prohibitive in terms of both memory and time.
The computational advantages of a GGP with a sparse graphical model are maximized for decomposable (or chordal) graphs. Such graphs are popular in Bayesian graphical models because of their computational conveniences and are often justifiably used in place of non-chordal graphs, as any non-chordal graph can be embedded in a chordal one. Decomposable graphs can be represented in terms of a set of cliques K and a set of separators S such that the likelihood for the GGP on $\mathcal{L}$ can be decomposed as
(3.4)
\[ N(w(\mathcal{L})\hspace{0.1667em}|\hspace{0.1667em}0,\tilde{C})=\frac{{\textstyle\prod _{A\in K}}N({w_{A}}(\mathcal{L})\hspace{0.1667em}|\hspace{0.1667em}0,C)}{{\textstyle\prod _{A\in S}}N({w_{A}}(\mathcal{L})\hspace{0.1667em}|\hspace{0.1667em}0,C)}\hspace{0.2778em},\]
where for any $A\subset \{1,\dots ,q\}$, ${w_{A}}(\mathcal{L})$ denotes the subset of $w(\mathcal{L})$ corresponding to the indices in A. Note that this obviates the need to explicitly obtain the $O(q)$-dimensional matrix $\tilde{C}$ via the IPS algorithm, as the right hand side can be written in terms of multivariate densities based on the parent covariance C. The cost of inverting any matrix involved is $O({c^{3}})$ (again suppressing n), where c denotes the largest clique size, and there are at most $O(q)$ such matrices. Also, the resulting likelihood depends only on the cross-covariances ${C_{ij}}$ for either $i=j$ or $(i,j)\in \mathcal{E}$. Thus the full parent multivariate Matérn GP with $O({q^{2}})$ parameters need not be specified; only $O(q+|\mathcal{E}|)$ parameters are required, resulting in significant dimension reduction for sparse graphs. The GGP likelihood with a Matérn covariance and a sparse graph $\mathcal{G}$ offers drastic reductions in memory, time, and parameter dimensionality while retaining the benefits of spatial modeling with the Matérn family.
3.2 Implementation
We elucidate, with examples, the construction of a GGP given, in particular, a multivariate Matérn cross-covariance function and an undirected decomposable graph.
3.2.1 Setup
Here, we consider $q=10$ variables, with each variable observed at 250 locations uniformly chosen from the unit square $(0,1)\times (0,1)$. The latent spatial random process $w(s)$ in (3.2) is taken as the $10\times 1$ multivariate graphical Matérn GP [17] with respect to a decomposable variable graph $\mathcal{G}$ (Figure 3). The marginal scale, variance, and smoothness parameters of each component Matérn process ${w_{i}}(\cdot )$ are denoted by ${\phi _{ii}}$, ${\sigma _{ii}}$, and ${\nu _{ii}}$, respectively, for each $i=1,2,\dots ,q$.
To ensure a valid multivariate Matérn cross-covariance function, a sufficient condition is to limit the intra-site covariance matrix $\Sigma =({\sigma _{ij}})$ to be [see Theorem 1 of 2]
(3.5)
\[ {\sigma _{ij}}={b_{ij}}\frac{\Gamma (\frac{1}{2}({\nu _{ii}}+{\nu _{jj}}+d))\Gamma ({\nu _{ij}})}{{\phi _{ij}^{2{\Delta _{A}}+{\nu _{ii}}+{\nu _{jj}}}}\Gamma ({\nu _{ij}}+\frac{d}{2})}\hspace{0.2778em},\]
where ${\Delta _{A}}\ge 0$ and $B=({b_{ij}})\gt 0$, i.e., is positive definite. This implies $\Sigma =(B\odot ({\gamma _{ij}}))$, where the ${\gamma _{ij}}$’s are constants collecting the components in (3.5) involving only the ${\nu _{ij}}$’s and ${\phi _{ij}}$’s, and ⊙ denotes the Hadamard (element-wise) product. For simplicity, we assume ${\nu _{ii}}={\nu _{jj}}={\nu _{ij}}=0.5$ and ${\phi _{ij}^{2}}=({\phi _{ii}^{2}}+{\phi _{jj}^{2}})/2$. The constraints in (3.5) then simplify to
(3.6)
\[ {\sigma _{ij}}={({\sigma _{ii}}{\sigma _{jj}})^{0.5}}\frac{{\phi _{ii}^{0.5}}{\phi _{jj}^{0.5}}}{{\phi _{ij}^{0.5}}}{r_{ij}}\hspace{0.2778em},\]
where $R=({r_{ij}})$ is positive definite. Hence, as part of our simulation exercise, we only need to estimate the marginal scale (${\phi _{ii}}$) and variance (${\sigma _{ii}}$) parameters and the cross-correlation parameters ${r_{ij}}$.
Observe that the graph in Figure 3 posits the full set of conditional dependencies among the 10 spatially indexed variables. We now calculate a perfect ordering of the cliques of $\mathcal{G}$ to obtain the cliques and separators (Figure 4). Using (3.4), the likelihood of the decomposable GGP can be written as
(3.7)
\[ \begin{aligned}{}N(w(\mathcal{L})\hspace{0.1667em}|\hspace{0.1667em}0,\tilde{C})& =\frac{N({w_{(1,2,3)}}(\mathcal{L})\hspace{0.1667em}|\hspace{0.1667em}0,C)\hspace{0.1667em}N({w_{(2,3,4)}}(\mathcal{L})\hspace{0.1667em}|\hspace{0.1667em}0,C)\hspace{0.1667em}N({w_{(4,5,6)}}(\mathcal{L})\hspace{0.1667em}|\hspace{0.1667em}0,C)}{N({w_{(2,3)}}(\mathcal{L})\hspace{0.1667em}|\hspace{0.1667em}0,C)\hspace{0.1667em}N({w_{(4)}}(\mathcal{L})\hspace{0.1667em}|\hspace{0.1667em}0,C)}\\ {} & \hspace{1em}\times \frac{N({w_{(6,7,8)}}(\mathcal{L})\hspace{0.1667em}|\hspace{0.1667em}0,C)\hspace{0.1667em}N({w_{(8,9,10)}}(\mathcal{L})\hspace{0.1667em}|\hspace{0.1667em}0,C)}{N({w_{(6)}}(\mathcal{L})\hspace{0.1667em}|\hspace{0.1667em}0,C)\hspace{0.1667em}N({w_{(8)}}(\mathcal{L})\hspace{0.1667em}|\hspace{0.1667em}0,C)}\hspace{0.2778em}.\end{aligned}\]
3.2.2 Data Generation
We choose the length-10 vectors of marginal variance (${\sigma _{ii}}$) and scale (${\phi _{ii}}$) parameters as two different permutations of the equally spaced sequence $(1,1.444,1.889,2.333,2.778,3.222,3.667,4.111,4.556,5)$. The cross-correlation matrix R in (3.6) is generated as the standardized random matrix ${R_{0}}{R_{0}^{\mathrm{T}}}$, where ${R_{0}}$ has independent entries from the $Uniform(-1,1)$ distribution.
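The following sketch (our illustration of the recipe just described) generates R and assembles the intra-site covariances ${\sigma _{ij}}$ via (3.6); note that np.linspace(1, 5, 10) reproduces the equally spaced sequence above:

```python
# Sketch of the data-generation step for the cross-correlation structure.
import numpy as np

rng = np.random.default_rng(3)
q = 10
R0 = rng.uniform(-1, 1, size=(q, q))
M = R0 @ R0.T
d = np.sqrt(np.diag(M))
R = M / np.outer(d, d)                           # standardized: unit diagonal, PD

sigma = rng.permutation(np.linspace(1, 5, q))    # marginal variances sigma_ii
phi = rng.permutation(np.linspace(1, 5, q))      # marginal scales phi_ii
phi_ij = np.sqrt((phi[:, None] ** 2 + phi[None, :] ** 2) / 2)
Sigma = (np.sqrt(np.outer(sigma, sigma))         # eq. (3.6), elementwise
         * np.sqrt(np.outer(phi, phi)) / np.sqrt(phi_ij) * R)
```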
The precision matrix of the GGP $w(\mathcal{L})$ is calculated as [see Lemma 5.5 of 31]
(3.8)
\[ \begin{aligned}{}\tilde{C}{(\mathcal{L},\mathcal{L})^{-1}}& ={\sum \limits_{m=1}^{p}}{[{{C_{[{K_{m}}]}}(\mathcal{L},\mathcal{L})^{-1}}]^{\mathcal{V}\times \mathcal{L}}}\\ {} & \hspace{2em}-{\sum \limits_{m=2}^{p}}{[{{C_{[{S_{m}}]}}(\mathcal{L},\mathcal{L})^{-1}}]^{\mathcal{V}\times \mathcal{L}}}\hspace{0.2778em},\end{aligned}\]
where, for any symmetric matrix $A=({a_{ij}})$ with rows and columns indexed by $\mathcal{U}\subset \mathcal{V}\times \mathcal{L}$, ${A^{\mathcal{V}\times \mathcal{L}}}$ is the $|\mathcal{V}\times \mathcal{L}|\times |\mathcal{V}\times \mathcal{L}|$ matrix with ${({A^{\mathcal{V}\times \mathcal{L}}})_{ij}}={a_{ij}}$ if $(i,j)\in \mathcal{U}$ and ${({A^{\mathcal{V}\times \mathcal{L}}})_{ij}}=0$ otherwise. Equations (3.7) and (3.8) show that inverting the full GGP cross-covariance matrix requires inverting only the clique- and separator-specific covariance matrices. Hence, the computational complexity of evaluating the likelihood of a multivariate GGP reduces to $O({n^{3}}{c^{3}})$, e.g., $O({250^{3}}\times {3^{3}})$ in our example, since the maximum clique size in Figure 4 is 3. By contrast, the likelihood of a full multivariate Matérn GP would require $O({n^{3}}{q^{3}})$ computations, i.e., $O({250^{3}}\times {10^{3}})$.
We use the precision matrix in (3.8) to simulate the latent process $w(\cdot )$ as a 2500-dimensional ($250\times 10$) multivariate normal vector. Next, we use (3.2) to simulate the multivariate outcomes ${y_{i}}(\cdot ),\hspace{2.5pt}i=1,\dots ,10$. To do so, we generate random covariates ${X_{i}}(\cdot )$ from a $N(0,2)$ distribution and simulate the error processes ${\epsilon _{i}}(\cdot )$ independently from $N(0,{\tau _{i}^{2}})$, taking ${\tau _{i}^{2}}={\tau ^{2}}=0.25$ for all i.
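A sketch of this computation follows (our illustration; cov_block is a hypothetical stand-in for evaluating the parent cross-covariance C over the variables in a clique or separator at all n reference locations):

```python
# Sketch: assemble the GGP precision via (3.8) from clique/separator inverses.
import numpy as np

n, q = 250, 10
cliques = [[1, 2, 3], [2, 3, 4], [4, 5, 6], [6, 7, 8], [8, 9, 10]]
separators = [[2, 3], [4], [6], [8]]

def pad(Kblock, A):
    """Embed a small inverse into the full nq x nq matrix (zeros elsewhere)."""
    out = np.zeros((n * q, n * q))
    idx = np.concatenate([np.arange((a - 1) * n, a * n) for a in A])
    out[np.ix_(idx, idx)] = Kblock
    return out

def precision(cov_block):
    Qt = sum(pad(np.linalg.inv(cov_block(A)), A) for A in cliques)
    Qt -= sum(pad(np.linalg.inv(cov_block(A)), A) for A in separators)
    return Qt        # = C~(L, L)^{-1}; only clique-sized inverses are needed

# a draw w(L) ~ N(0, C~) then follows from the Cholesky factor of the precision:
# with Qt = L L^T, w = solve(L^T, z) for z ~ N(0, I) has covariance Qt^{-1}
```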
To assess the predictive efficiency of our method, we randomly hold out $20\% $ of the locations for each outcome variable and treat them as missing. We treat the full set of 250 locations as our reference set $\mathcal{L}$ and predict the held-out observations at every step of our sampler for later comparison with the true values.
3.2.3 Data Analysis
The analysis of our simulated GGP data can be broken down into the following steps:
1. Marginal parameter estimation: We estimate the marginal scale (${\phi _{ii}}$), variance (${\sigma _{ii}^{2}}$), and smoothness (${\nu _{ii}}$) parameters of the component Gaussian processes. We also estimate the error variance (${\tau _{i}^{2}}$) for each marginal process (details in Algorithm 2).
2. Gibbs sampler initialization: We process the variable graph to calculate its cliques and separators. Moreover, we color the nodes of the variable graph, which allows us to simulate the latent processes belonging to the same color in parallel in the Gibbs sampler. We also construct a new edge-graph ${\mathcal{G}_{E}}(\mathcal{G})=({E_{\mathcal{V}}},{E^{\ast }})$ on the set of edges ${E_{\mathcal{V}}}$, i.e., there is an edge $((i,j),({i^{\prime }},{j^{\prime }}))$ in ${\mathcal{G}_{E}}(\mathcal{G})$ if $\{i,{i^{\prime }},j,{j^{\prime }}\}$ lie in some clique K of $\mathcal{G}$. This edge-graph facilitates parallel updates of the cross-correlation parameters corresponding to edges of the same color. These are specific examples of chromatic Gibbs samplers and significantly improve sampling speed on parallel architectures; a minimal coloring sketch appears after this list. We also initialize the cross-correlation parameters at this step to start the Gibbs sampler. The details are laid out in Algorithm 2.
3. Gibbs sampler: We run our Gibbs sampler to sample the latent spatial processes, predict the test observations, and sample the latent correlations. The detailed steps are laid out in Algorithm 3. Sampling the cross-correlation parameters requires only checking positive-definiteness of the clique-specific cross-correlation parameter matrix (at most $3\times 3$ here). For the likelihood calculation, the largest matrix inversion across all these updates is of order $750\times 750$, corresponding to the largest clique; the largest matrix that needs storing is also of dimension $750\times 750$. These yield an appreciable reduction in computation relative to any multivariate Matérn model, which involves $2500\times 2500$ matrices and positive-definiteness checks of $10\times 10$ matrices at every iteration.
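As promised above, here is a minimal greedy-coloring sketch (our illustration) of the kind used for the chromatic updates: nodes of the same color share no edge and can therefore be updated in parallel, and the same routine applies to the edge-graph ${\mathcal{G}_{E}}(\mathcal{G})$ for the cross-correlation updates:

```python
# Sketch: greedy graph coloring for chromatic Gibbs updates.
def greedy_coloring(nodes, edges):
    adj = {v: set() for v in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    color = {}
    for v in nodes:                       # any fixed order; heuristics can help
        used = {color[u] for u in adj[v] if u in color}
        color[v] = next(c for c in range(len(adj)) if c not in used)
    return color

# e.g., a path graph on 10 nodes needs only two colors, so the latent
# processes split into two batches that are refreshed in parallel
colors = greedy_coloring(list(range(1, 11)), [(i, i + 1) for i in range(1, 10)])
```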
We run the sampler in Algorithm 3 for 1000 iterations and obtain the cross-correlation parameter estimates and test set predictions after a burn-in of 150 samples. Figure 7 shows that we accurately estimate the marginal micro-ergodic parameters (${\sigma _{ii}}\times {\phi _{ii}}$). In Figure 8, we plot the edge-specific estimates of the cross-correlation parameters ${r_{ij}}$. We observe that the GGP accurately estimates the cross-correlation parameters for the edges in the graph $\mathcal{G}$ (the $95\% $ credible intervals of all but one estimate contain the true values). We also create a grid of plots across the 10 variables comparing the true test data values and the predicted values from our algorithm (Figure 9). The points align closely along the $y=x$ line, and the mean squared error varies from 0.403 to 1.064. Figures 10 and 11 depict two instances of the convergence of the cross-correlation parameter chains.
Figure 8
Estimates of the cross-correlation parameters ${r_{ij}}$ with error bars denoting $95\% $ credible intervals; the red line denotes the $y=x$ line.
4 Environmental Data Example
We demonstrate the practical use of the GGP for modeling non-stationary spatial time-series data using daily PM2.5 measurements from monitoring stations in 11 northeastern states of the contiguous US and Washington, DC, for February 2020. The data are publicly available from the United States Environmental Protection Agency (EPA) website. Our current analysis focuses on $n=99$ monitoring stations that recorded at least 20 days of data in both 2019 and 2020. From the National Center for Environmental Prediction’s (NCEP) North American Regional Reanalysis (NARR) database, we obtained daily values of meteorological factors (temperature, barometric pressure, wind-speed, and relative humidity) that are posited to influence PM2.5 levels. We used multilevel B-spline smoothing via the MBA package in R to impute daily values of these variables and merge them with the EPA data. To account for baseline levels and weekly periodicity, we included a 7-day moving average of the PM2.5 data for each station and day in 2020, centered around the same day of 2019, and subtracted the day-of-the-week specific means from the raw PM2.5 values. We then analyzed the resulting highly multivariate (28-dimensional) spatial data set over $n=99$ locations.
After exploratory analysis revealed autocorrelations among the pollutant processes on consecutive days (at both lags 1 and 2), even after adjusting for covariates, we employed a graphical Matérn GP with an $AR(2)$ graph (Figure 12). The marginal parameters for day t are ${\sigma _{tt}}$, ${\phi _{tt}}$, and ${\tau _{t}^{2}}$, and the autoregressive cross-covariances between day t and days $t-1$ and $t-2$ are denoted by ${b_{t-1,t}}$ and ${b_{t-2,t}}$, respectively. The GGP enables us to model non-separability of the auto-covariances across space and time, as well as time-varying marginal spatial parameters and autoregressive coefficients.
The GGP requires only 53 cross-covariance parameters (27 for lag 1 and 26 for lag 2). As the largest clique size in an $AR(2)$ graph is 3, the largest matrix one must handle for the data at 99 stations is only $297\times 297$. We present the estimates and credible intervals for the lag 1 and lag 2 autocorrelation parameters ${r_{t-1,t}}$ and ${r_{t-2,t}}$ (normalized ${b_{t-1,t}}$ and ${b_{t-2,t}}$) obtained from the GGP in Figures 13 and 14, respectively. These estimates display substantial variation across time, with many spikes indicating high positive autocorrelation. Specifically, the $95\% $ Bayesian credible intervals for 6 of the 27 ($22\% $) lag 1 estimates and 4 of the 26 ($15\% $) lag 2 estimates exclude 0, providing strong evidence of non-stationary autocorrelation across time. The presence of significant lag 2 autocorrelations justifies our choice of the $AR(2)$ graph. The graphical Matérn GP also yields impressive predictive performance ($93\% $ coverage) on hold-out data (Figure 15).
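The $AR(2)$ variable graph itself is simple to write down; the following sketch (our illustration) enumerates its edges, cliques, and separators and recovers the parameter count quoted above:

```python
# Sketch: the AR(2) graph on q = 28 days; edges at lags 1 and 2.
q = 28
edges = [(t, t + 1) for t in range(1, q)] + [(t, t + 2) for t in range(1, q - 1)]
cliques = [[t, t + 1, t + 2] for t in range(1, q - 1)]   # largest clique size 3
separators = [[t, t + 1] for t in range(2, q - 1)]
print(len(edges))   # 27 + 26 = 53: one cross-covariance parameter per edge
```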
5 Discussion
This article has developed and elucidated, with examples, the construction of multivariate Gaussian processes in which associations among variables are modeled by a valid spatial cross-covariance function while conditional independence between variables conforms to a posited undirected graph. The sparsity of the posited graph yields substantial computational gains and makes the proposed approach especially attractive for datasets with very large numbers of spatially dependent outcomes. The algorithms implemented here are especially designed to exploit the structure of decomposable graphs. Our examples demonstrate the algorithmic efficiency of the chromatic Gibbs samplers used to update the latent process and the cross-covariance parameters, as well as the inferential efficiency, in terms of estimation and prediction, of the graphical Gaussian process model. Future avenues for research include incorporating scalable spatial processes, such as the Nearest-Neighbor Gaussian Process [15, 13], for spatial and spatiotemporal processes, and extensive comparisons with competing methods for multivariate outcomes, such as factor models, from computational and inferential standpoints.