1 Introduction
Censored data have been analyzed regularly in the statistical literature since the late 1900s, although the earliest recorded statistical analysis of censored data dates back to 1766 [32]. Measurements are often censored due to limitations of measuring instruments, physical inability to acquire data, human error, or similar factors.
While most analyses of censored data focus on right-censoring [20], some applications, such as analyzing the concentration of contaminants like arsenic or per- and polyfluoroalkyl substances (PFAS) in groundwater, call for methods that handle left-censored data. This kind of censoring is prevalent in environmental monitoring and has applications in environmental and public health, epidemiology, hydrology, agriculture, and more. Typically, these applications involve geostatistical data, where measurements are censored because they fall below the minimum detection limit (MDL) of the measuring instrument. Early applications either remove censored observations, replace them with makeshift values such as the MDL or MDL/2, or impute them with the mean or median of the observed responses. Such ad-hoc imputations can result in biased estimates of the overall spatial variability, as demonstrated by [13].
Recent approaches for statistical inference on spatially distributed censored data have overwhelmingly relied on the Expectation-Maximization (EM) algorithm [25]. [28] proposed ‘CensSpatial’, an exact maximum likelihood (ML) estimation procedure for model parameters under censoring based on the Stochastic Approximation of the Expectation-Maximization [SAEM; 11] algorithm. To tackle the computational complexities arising from censored likelihoods in correlated data, Monte Carlo approximations have been employed, both within the classical framework [40, 30] and the Bayesian paradigm [10, 41, 34]. For example, [36] introduced a semi-naive approach that utilizes an iterative algorithm and variogram estimation to determine imputed values at locations where data are censored. Finally, various data augmentation techniques have been proposed to facilitate the analysis of spatially correlated censored data [1, 19, 13, 38]. However, the scalability of the suggested approaches is restricted, rendering them unsuitable for analyzing large spatial datasets featuring censoring, a common occurrence in contemporary scientific research.
Gaussian processes [GPs; 37] are heavily used for modeling continuous spatial data due to several theoretical and computational advantages: the likelihood involves only the first two moments, conditional independence is equivalent to zeros in the underlying precision matrix, and the linear-algebraic results required for computing covariance matrices are well established in the literature [14]. However, once the number of spatial sites is large and data at a large proportion of sites are censored, likelihoods based on the underlying GPs involve an intractable high-dimensional integral of a multivariate Gaussian density. This paper aims to overcome the computational challenges inherent to censored likelihoods in high-dimensional spatial settings through a combined application of two key steps:
-
1. We focus on a fully Bayesian method for censored point-referenced data, where the underlying GP is approximated as a Matérn-like Gaussian Markov random field [GMRF; 33]. The GMRF is obtained as the solution of a stochastic partial differential equation [SPDE; 24] on a fine mesh, which yields a sparse precision matrix of the underlying basis function coefficients. This sparse spatial structure then allows for fast and scalable Bayesian computations.
-
2. We draw inferences regarding the model parameters using an adaptive Markov chain Monte Carlo (MCMC) sampling approach, with random walk Metropolis-Hastings (MH) steps within Gibbs sampling.
Extensive simulations demonstrate the scalability and performance of the proposed methodology in comparison to the ‘CensSpatial’ algorithm and a 2-dimensional B-splines basis function model across varying degrees of censoring and varying grid sizes. While traditional local likelihood methods [44, 35], when applied with Vecchia’s approximation [42], are often considered ideal for handling high-dimensional spatial datasets, their implementation becomes computationally challenging in the presence of censoring. Specifically, they also require the evaluation of high-dimensional integrals, which renders these methods infeasible for large spatial datasets with censoring. The idea of a GMRF-based measurement error model has been explored in the context of spatial extremes [9, 16], where replications of the underlying spatial processes are available and the censoring of a portion of the data is imposed artificially. However, to our knowledge, this modeling strategy has not yet been explored for high-dimensional censored spatial data without replications. Although the lack of temporal replications typically leads to unstable computations, the proposed computational framework is stable, scalable, and tailored explicitly to censored spatial data without requiring temporal replications. Furthermore, unlike previous studies involving the GMRF-based measurement error model, a novel feature of the proposed approach is the inclusion of spatial predictions.
PFAS constitute a substantial group of synthetic compounds absent in natural environments, notable for their resistance to heat, water, and oil. PFAS are persistent in the environment, can accumulate within the human body over time, and are toxic at relatively low concentrations [43]. Exposure to elevated levels of PFAS can lead to various adverse health outcomes, including developmental issues during pregnancy, cancer, liver impairment, immune system dysfunction, thyroid disruption, and alterations in cholesterol levels [12]. Due to their chemical robustness, PFAS endure in the environment and are resistant to degradation. Contamination of drinking water with PFAS occurs through the use or accidental spillage of products containing these substances onto land or into waterbodies [18]. PFAS present a significant public health risk, with elevated concentrations identified at 3,186 locations across the United States as of August 2023. In response, the U.S. Environmental Protection Agency (EPA) introduced new safety standards on April 10, 2024, setting permissible limits between 4.0 and 10.0 parts per trillion (ppt; also expressed as nanograms per litre (ng/L)) for six specific PFAS chemicals (including PFOS) in drinking water [31]. A recent study [2] estimated that PFAS in publicly accessible drinking water could affect as many as 200 million people across the United States. Along similar lines, a robust Bayesian hierarchical approach was proposed [39] to accommodate left-censored PFAS responses; however, the model was implemented on a limited number of sample site locations. As such, the review of existing literature highlights the necessity for further investigation into PFAS occurrences in groundwater, alongside the development of fast and efficient approaches to handle large-scale left-censored spatial data in real-time. Motivated by data on PFAS concentrations collected by the Groundwater Ambient Monitoring and Assessment (GAMA) program [26] across the state of California, we develop our Bayesian scalable model for spatially-referenced left-censored PFAS responses in an attempt to provide a more accurate quantification of the groundwater contamination within the state. These data, collected by GAMA since 2019, allow thorough quality assessments of water sources and establish safety thresholds for select PFAS constituents. Thus, our analysis can identify possible hotspots of higher PFAS concentration, providing insights for further study of impacts on public health.
The subsequent sections of the paper are organized as follows. In Section 2, we provide details regarding the dataset on the groundwater levels of PFAS within California, along with some exploratory analyses. We outline our methodology and related computational details in Section 3 and test its scalability and predictive performance on simulated datasets in Section 4. In Section 5, we apply the proposed methodology to the PFAS dataset and report the findings. We conclude with a brief discussion in Section 6.
2 Motivating PFAS data
The groundwater PFAS data for the state of California are available online from the GAMA Groundwater Information Systems website under the label Statewide PFOS Data. In this paper, we focus specifically on measurements of the chemical substance perfluorooctane sulfonate (PFOS). The dataset contains 24,959 measurements (in ng/L) of PFOS concentration across California, their locations (longitude and latitude), indicators of whether each observation is censored, and the corresponding censoring limits. Almost half of the measurements (46.62%) are censored, with varying censoring limits.
Figure 1 shows the PFOS concentration measurements after transforming the raw values by $g(\text{PFOS})=\log (1+\log (1+\text{PFOS}))$ at the 24,959 irregularly sampled spatial locations across the state of California. The irregular sampling, together with the considerable proportion of censored observations, calls for an approximate spatial inference model that accounts for censoring. Most observation sites are located in densely populated areas on the coast. The censored observations, shown as tiny black dots in Figure 1, are distributed across the entire spatial domain rather than being concentrated in a specific area. While the majority of measurements are below 150 ng/L, some values reach as high as 1,330,000 ng/L, and approximately 47% of the observed concentrations exceed the new safety limit established by the EPA.
Figure 1
Concentrations of (transformed) PFOS, using the transformation $g(\text{PFOS})=\log (1+\log (1+\text{PFOS}))$, measured at 24,959 irregularly-sampled spatial locations across the state of California (in ng/L). The tiny black dots indicate the sites with censored data.
The histogram of the raw non-censored PFOS observations is presented in the left panel of Figure 2. The raw data are highly positively skewed. Thus, a stationary Gaussian process assumption naturally becomes questionable, even after considering a spatially-smooth mean surface with covariates like longitude and latitude, commonly used for estimating spatial trends [15, 34]. After exploring different transformations of the raw data in search of an approximately bell-shaped histogram, we find that the iterated log-transformation $g(\text{PFOS})=\log (1+\log (1+\text{PFOS}))$ performs reasonably well; the histogram of the transformed PFOS data is presented in the middle panel of Figure 2. We further explore the effects of the natural covariates longitude and latitude on the transformed data; following a simple linear regression, we obtain the residuals, whose histogram is presented in the right panel of Figure 2. This histogram is reasonably bell-shaped, and thus, we model the transformed PFOS data using a Gaussian process framework with a regression structure for the mean process, where we allow longitude and latitude as covariates.
Figure 2
Pictorial representation of the raw PFOS responses. Left panel: Histogram of raw concentrations of PFOS across sites where the data are not censored. Middle panel: Histogram of non-censored PFOS concentrations after the transformation $g(\text{PFOS})=\log (1+\log (1+\text{PFOS}))$. Right panel: Histogram of the residuals obtained after regressing non-censored transformed PFOS observations on longitude and latitude via a linear model.
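As a small illustration of the exploratory steps above, the R sketch below applies the iterated log-transformation and examines the residuals from a linear trend in the coordinates; the data are simulated stand-ins, since the actual GAMA file layout is not assumed here.

```r
set.seed(1)
# Toy stand-ins for the raw data (not the actual GAMA records):
pfos <- rlnorm(1000, meanlog = 2, sdlog = 1.5)        # positively skewed "concentrations"
lon  <- runif(1000, -124, -114)
lat  <- runif(1000, 32, 42)
g <- function(x) log(1 + log(1 + x))                  # iterated log-transformation
y <- g(pfos)
op <- par(mfrow = c(1, 3))
hist(pfos, main = "Raw PFOS")                         # highly right-skewed
hist(y, main = "Transformed PFOS")                    # approximately bell-shaped
hist(resid(lm(y ~ lon + lat)), main = "Residuals")    # after a linear trend in the coordinates
par(op)
```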
We further explore spatial correlation using a variogram analysis of the residuals (scaled by their sample standard deviation) discussed above. The sample semivariogram at distance d is defined as
\[ \widehat{\gamma }(d)=\frac{1}{2N(d)}{\sum \limits_{i=1}^{n}}{\sum \limits_{j=1}^{i}}{w_{ij}}(d){(R({\boldsymbol{s}_{i}})-R({\boldsymbol{s}_{j}}))^{2}},\]
where $R({\boldsymbol{s}_{i}})$ and $R({\boldsymbol{s}_{j}})$ are the residuals at spatial sites ${\boldsymbol{s}_{i}}$ and ${\boldsymbol{s}_{j}}$, ${w_{ij}}(d)=1$ if ${d_{ij}}\in (d-h,d+h)$ and ${w_{ij}}(d)=0$ otherwise, with ${d_{ij}}$ denoting the distance between ${\boldsymbol{s}_{i}}$ and ${\boldsymbol{s}_{j}}$, and $N(d)$ is the number of pairs with ${w_{ij}}(d)=1$. The sample semivariogram, presented in Figure 3, indicates the presence of spatial correlation and a possible nugget effect [4]. We fit an isotropic Matérn spatial correlation function, with its smoothness parameter set to one, plus a nugget effect, given by
(2.1)
\[ \rho ({\boldsymbol{s}_{i}},{\boldsymbol{s}_{j}})\equiv \rho (d)=\gamma \frac{d}{\phi }{\kappa _{1}}\left(\frac{d}{\phi }\right)+(1-\gamma )1({\boldsymbol{s}_{i}}={\boldsymbol{s}_{j}}),\]
where d is the Euclidean distance between locations ${\boldsymbol{s}_{i}}$ and ${\boldsymbol{s}_{j}}$, $\phi \gt 0$ is the range parameter, $\gamma \in [0,1]$ is the ratio of spatial to total variation, ${\kappa _{1}}(\cdot )$ is the modified Bessel function of the second kind with degree one, and $1(\cdot )$ is the indicator function. The fitted population semivariance indicates a reasonable fit to the sample semivariogram. While these exploratory analyses are based on non-censored observations only, they indicate a need for proper spatial modeling that accounts for the censored nature of a large proportion of the data. Specifically, most of the observations near the eastern regions of the study domain are censored, and ignoring them would lead to poor spatial prediction for the nearby regions.
Figure 3
Sample semivariogram of residuals obtained after regressing non-censored transformed PFOS observations on longitude and latitude, as a function of distance (dots). The overlaid solid line represents the fitted population semivariance obtained from (2.1).
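To make the variogram construction above concrete, the following R sketch computes the binned sample semivariogram and overlays the semivariance implied by the Matérn-plus-nugget correlation (2.1); the residuals, coordinates, bin width, and the values of $\phi$ and $\gamma$ are illustrative stand-ins rather than the fitted quantities reported here.

```r
set.seed(1)
n <- 300
coords <- cbind(runif(n), runif(n))        # stand-in coordinates
r <- rnorm(n)                              # stand-in standardized residuals
D <- as.matrix(dist(coords))               # pairwise Euclidean distances d_ij
diffs <- outer(r, r, "-")^2                # squared residual differences
h <- 0.025                                 # half-width of the distance bins
d_grid <- seq(h, 0.5, by = 2 * h)          # bin centres
semivar <- sapply(d_grid, function(d) {
  w <- D > d - h & D < d + h & upper.tri(D)   # pairs with d_ij in (d - h, d + h)
  sum(diffs[w]) / (2 * sum(w))                # (1 / (2 N(d))) * sum of squared differences
})
# Matérn (nu = 1) correlation plus nugget, as in (2.1):
rho <- function(d, phi, gamma)
  ifelse(d == 0, 1, gamma * (d / phi) * besselK(d / phi, nu = 1))
plot(d_grid, semivar, xlab = "distance", ylab = "semivariance")
lines(d_grid, 1 - rho(d_grid, phi = 0.1, gamma = 0.8))  # population semivariance: 1 - rho(d)
```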
3 Methodology
Let $Y(\boldsymbol{s})$ represent transformed PFOS concentration at a spatial location $\boldsymbol{s}\in \mathcal{D}\subset {\mathbb{R}^{2}}$, where $\mathcal{D}$ represents the spatial domain of interest, i.e., the entire state of California in our case. We model $Y(\boldsymbol{s})$ as
\[ Y(\boldsymbol{s})=\boldsymbol{X}{(\boldsymbol{s})^{T}}\boldsymbol{\beta }+{\tau ^{-1/2}}Z(\boldsymbol{s}),\]
where $\boldsymbol{X}(\boldsymbol{s})={[{X_{1}}(\boldsymbol{s}),\dots ,{X_{p}}(\boldsymbol{s})]^{T}}$ denotes the vector of p covariates at location $\boldsymbol{s}$, $\boldsymbol{\beta }={[{\beta _{1}},\dots ,{\beta _{p}}]^{T}}$ is a vector of unknown regression parameters, and $\tau \gt 0$ is a spatially-constant precision parameter. Given the absence of meaningful covariates in our data, we choose $\boldsymbol{X}(\boldsymbol{s})={[1,\text{longitude}(\boldsymbol{s}),\text{latitude}(\boldsymbol{s})]^{T}}$ for our analysis. We assume that $Z(\cdot )$ is a standard (mean zero and variance one at each site) GP with an isotropic Matérn spatial correlation plus a nugget effect, given by (2.1). While the incorporation of the nugget effect is justified by our exploratory analysis (nonzero semivariance at the origin), it also effectively addresses the issue of censoring in the response, thereby circumventing the computational burden arising from censored likelihoods [17, 45, 46]. Here, we fix the smoothness parameter of the Matérn correlation of the purely spatial component of (2.1) to one. In practice, it is difficult to estimate the smoothness parameter from the data; hence, it is generally fixed. Moreover, we later build a stochastic partial differential equation-based approximation of the Matérn correlation structure, where fixing the smoothness parameter to one is a standard choice [9, 16]. Suppose the data are observed (either censored or non-censored) at the set of sites $\mathcal{S}=\{{\boldsymbol{s}_{1}},\dots ,{\boldsymbol{s}_{n}}\}$. In matrix notation, the spatial linear model can be written as
(3.1)
\[ \boldsymbol{Y}=\boldsymbol{X}\boldsymbol{\beta }+{\tau ^{-1/2}}\boldsymbol{Z},\]
where ${\boldsymbol{Y}_{(n\times 1)}}$ is the response vector, ${\boldsymbol{X}_{(n\times p)}}$ is the matrix of covariates, ${\boldsymbol{\beta }_{(p\times 1)}}$ is the vector of regression coefficients, and ${\boldsymbol{Z}_{(n\times 1)}}\sim \text{MVN}(\mathbf{0},\gamma \boldsymbol{\Sigma }+(1-\gamma ){\boldsymbol{I}_{n}})$, where $\boldsymbol{\Sigma }$ is the Matérn correlation matrix and ${\boldsymbol{I}_{n}}$ denotes the identity matrix of order n. By construction, and for the PFAS dataset considered, $\boldsymbol{\Sigma }$ is non-singular and $\boldsymbol{X}$ has full rank.
In a spatial censored linear (SCL) model, it is further assumed that $Y(\boldsymbol{s})$ is not fully observed at all spatial locations. Motivated by the dataset considered, we assume $Y(\cdot )$ to be left-censored at sites ${\mathcal{S}^{(c)}}=\{{\boldsymbol{s}_{1}^{(c)}},\dots ,{\boldsymbol{s}_{{n_{c}}}^{(c)}}\}\subset \mathcal{S}$ with corresponding censoring levels $\mathcal{U}=\{{u_{1}},\dots ,{u_{{n_{c}}}}\}$. However, a similar approach can be applied if the response is right- or interval-censored. We define the censoring indicator $\delta (\boldsymbol{s})$ as
\[ \delta (\boldsymbol{s})=\left\{\begin{array}{l@{\hskip10.0pt}l}1,\hspace{1em}& \text{if}\hspace{2.5pt}Y(\boldsymbol{s})\hspace{2.5pt}\text{is censored at site}\hspace{2.5pt}\boldsymbol{s},\\ {} 0,\hspace{1em}& \text{otherwise},\end{array}\right.\]
and the vector of censored observations as $\boldsymbol{v}={[Y({\boldsymbol{s}_{i}}):\delta ({\boldsymbol{s}_{i}})=1]^{T}}\equiv {[Y({\boldsymbol{s}_{1}^{(c)}}),\dots ,Y({\boldsymbol{s}_{{n_{c}}}^{(c)}})]^{T}}$. Then, for censored spatial data, the likelihood is given by
(3.2)
\[ \mathcal{L}(\boldsymbol{\theta })={\int _{\boldsymbol{v}\le \boldsymbol{u}}}{f_{\text{MVN}}}(\boldsymbol{y};\boldsymbol{X}\boldsymbol{\beta },{\tau ^{-1}}[\gamma \boldsymbol{\Sigma }+(1-\gamma ){\boldsymbol{I}_{n}}])\hspace{1.42271pt}d\boldsymbol{v},\]
where the integral is over the censored responses $\{\boldsymbol{y}:y({\boldsymbol{s}_{i}})\le {u_{i}}\hspace{2.5pt}\text{if}\hspace{2.5pt}{\boldsymbol{s}_{i}}\in {\mathcal{S}^{(c)}}\}$ and ${f_{\text{MVN}}}(\cdot ;\boldsymbol{\mu },\boldsymbol{\Sigma })$ denotes a multivariate normal density with mean $\boldsymbol{\mu }$ and covariance matrix $\boldsymbol{\Sigma }$. A version of this likelihood has been studied in [34].
We can rewrite the model (3.1) as $\boldsymbol{Y}=\boldsymbol{X}\boldsymbol{\beta }+{\tau ^{-1/2}}\boldsymbol{W}+{\tau ^{-1/2}}\boldsymbol{\varepsilon }$, where $\boldsymbol{W}\sim \text{MVN}(\mathbf{0},\gamma \boldsymbol{\Sigma })$ and $\boldsymbol{\varepsilon }\sim \text{MVN}(\mathbf{0},(1-\gamma ){\boldsymbol{I}_{n}})$, i.e., the components of $\boldsymbol{\varepsilon }$ are independently and identically distributed as $\text{N}(0,(1-\gamma ))$. Hence, given $\boldsymbol{W}$, the components of $\boldsymbol{Y}$ are independent and follow univariate normal distributions. Thus, (3.2) simplifies to
(3.3)
\[\begin{array}{r}\displaystyle \mathcal{L}(\boldsymbol{\theta })=\int \prod \limits_{i:{\boldsymbol{s}_{i}}\notin {\mathcal{S}^{(c)}}}\frac{1}{{\tau ^{-1/2}}{(1-\gamma )^{1/2}}}\phi \left(\frac{{y_{i}}-{\boldsymbol{x}_{i}^{\mathsf{T}}}\boldsymbol{\beta }-{w_{i}}}{{\tau ^{-1/2}}{(1-\gamma )^{1/2}}}\right)\\ {} \displaystyle \times \prod \limits_{i:{\boldsymbol{s}_{i}}\in {\mathcal{S}^{(c)}}}\Phi \left(\frac{{u_{i}}-{\boldsymbol{x}_{i}^{\mathsf{T}}}\boldsymbol{\beta }-{w_{i}}}{{\tau ^{-1/2}}{(1-\gamma )^{1/2}}}\right){f_{\text{MVN}}}(\boldsymbol{w};\mathbf{0},{\tau ^{-1}}\gamma \boldsymbol{\Sigma })d\boldsymbol{w},\end{array}\]
where $\Phi (\cdot )$ and $\phi (\cdot )$ are the standard normal distribution and density functions, respectively. Exploiting the conditional independence structure and the univariate normal distribution structure, the censored components can be easily imputed by sampling from truncated univariate normal distributions, bypassing computationally expensive multivariate imputations. The nugget parameter plays a critical role in this framework by allowing a hierarchical representation of the proposed model, where the data layer becomes conditionally independent across spatial locations. When $\gamma =1$, the model loses this hierarchical representation, as the data, conditioned on the latent process $W(\cdot )$, are no longer independent across space. Hence, the simplification of (3.2) as in (3.3) is not possible, and the multivariate imputation cannot be replaced with univariate imputations. For real datasets where γ is unknown, restricting the parameter space to $\gamma \in [0,1)$ leads to equivalent Bayesian inferences in the case of a continuous prior; hence, we assume this restriction throughout the rest of the paper.
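The following R sketch illustrates the univariate truncated-normal imputation described above: conditional on the latent field, each censored response is drawn from a normal distribution truncated above at its censoring limit via inverse-CDF sampling. The function and all inputs are illustrative and not the internal implementation used in our software.

```r
# Draw left-censored responses given the latent field: conditional on w, each
# censored Y(s_i) ~ N(x_i' beta + w_i, tau^{-1} (1 - gamma)), truncated above at u_i.
impute_censored <- function(u, mu, sd) {
  p_upper <- pnorm(u, mean = mu, sd = sd)                 # P(Y_i <= u_i) under the conditional normal
  qnorm(runif(length(u)) * p_upper, mean = mu, sd = sd)   # inverse-CDF draw below that probability
}
set.seed(1)
mu <- rnorm(5)                     # toy conditional means x_i' beta + w_i
u  <- mu - 0.5                     # toy censoring limits
impute_censored(u, mu, sd = 0.3)   # every draw lies below its limit u_i
```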
3.1 Approximation of the Matérn Gaussian Process
Although we bypass the problem of evaluating high-dimensional integrals in (3.2) by introducing a latent variable and exploiting conditional independence, evaluating the likelihood in (3.3) remains computationally taxing since $\boldsymbol{W}$ is a vector of large dimension (the same as $\boldsymbol{Y}$). For that purpose, we take cues from [16] for an approximation strategy for the Matérn GP. To ensure computational efficiency, we approximate the Gaussian process $W(\cdot )$ with a GP $\tilde{W}(\cdot )$ constructed from a Gaussian Markov random field (GMRF) defined on a finite mesh, thereby circumventing the computational overhead associated with the dense correlation matrix inherent in the exact Matérn GP defined by (2.1). This strategy capitalizes on the direct correspondence between continuous-space Matérn GPs with dense covariance matrices and GMRFs with sparse precision matrices [24], and yields the approximate data process described in Section 3.2.
For $\gamma =1$, the Gaussian Matérn process $Z(\cdot )$ can be obtained as the solution to the linear stochastic differential equation (SDE)
(3.4)
\[ ({\phi ^{-2}}-\Delta )W(\boldsymbol{s})=4\pi {\phi ^{-2}}\mathcal{W}(\boldsymbol{s}),\hspace{14.22636pt}\boldsymbol{s}\in {\mathbb{R}^{2}},\]
where $\mathcal{W}(\cdot )$ is a Gaussian white noise process and Δ is the Laplacian. The solution $W(\boldsymbol{s})$ to the SPDE can be effectively approximated through finite-element methods [8] applied to a triangulated mesh defined within a bounded region of ${\mathbb{R}^{2}}$, where the triangulation is formed through a refined Delaunay triangulation process [7]. In practical applications, the mesh can be easily constructed using the (currently deprecated) inla.mesh.2d function, implemented in the R package INLA (www.r-inla.org), or the fm_mesh_2d_inla function in the R package fmesher (https://cran.r-project.org/package=fmesher); see [22] for more details. The left panel of Figure 4 depicts the mesh utilized in the data application discussed in Section 5.
We choose an inner extension distance of ${0.15^{\circ }}$ and an outer extension distance of ${2.5^{\circ }}$; these choices are set using the argument offset. We set the minimum allowed distance between points (cutoff) to ${0.25^{\circ }}$. The largest permitted triangle edge lengths in the inner and outer extensions (max.edge) are set to ${0.25^{\circ }}$ and ${1^{\circ }}$, respectively. While no specific choices are prescribed in the current literature, the choices of these arguments are problem-specific; some discussion is given in [21]. In our case, these choices reasonably approximate the true Matérn correlation function, as seen in the right panel of Figure 4. We set the smoothness parameter to one as it is difficult to estimate this parameter from purely spatial data (where no temporal replications are available), as in our case. Setting the smoothness parameter to one reduces the stochastic partial differential equation (SPDE) introduced in [24] to a stochastic differential equation (SDE), as shown in (3.4). However, since the method is commonly referred to as the SPDE approach, we continue to use this terminology for consistency, even in this simplified case.
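A minimal R sketch of this mesh construction with the fmesher package is given below; the offset, cutoff, and max.edge values mirror the choices stated above, while the observation coordinates are simulated stand-ins for the GAMA locations.

```r
library(fmesher)
set.seed(1)
coords <- cbind(lon = runif(200, -124, -114),      # stand-in observation locations
                lat = runif(200, 33, 42))
mesh <- fm_mesh_2d_inla(
  loc      = coords,
  offset   = c(0.15, 2.5),    # inner and outer extension distances (degrees)
  cutoff   = 0.25,            # minimum allowed distance between mesh points
  max.edge = c(0.25, 1)       # largest triangle edges in the inner and outer extensions
)
nrow(mesh$loc)                # number of mesh nodes N defining the GMRF
```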
Figure 4
Triangulated mesh over California that approximates the spatial SDE process $Z(\cdot )$, where the black lines represent the edges, along with their corners representing the mesh nodes, and the red dots mark the observation locations (left panel); Comparison of the true Matérn correlation (solid line) and the pairwise covariances between two spatial locations obtained by the SDE approximation (points) as a function of distance (right panel). The parameters are set to $\phi =0.1{\Delta _{\mathcal{S}}}$, where ${\Delta _{\mathcal{S}}}$ is the maximum spatial distance between two locations in the domain, and $\gamma =0.8$.
Let ${\mathcal{S}^{\ast }}=\{{\boldsymbol{s}_{1}^{\ast }},\dots ,{\boldsymbol{s}_{N}^{\ast }}\}$ denote the set of mesh nodes. We construct a finite-element solution by writing $\tilde{W}(\boldsymbol{s})={\textstyle\sum _{j=1}^{N}}{\zeta _{j}}(\boldsymbol{s}){W_{j}^{\ast }}$, and plugging it in (3.4) in place of $W(\cdot )$. Here, $\{{\zeta _{j}}(\cdot )\}$ are piecewise linear and compactly-supported “hat” basis functions defined over the mesh, and $\{{W_{j}^{\ast }}\}$ are normally distributed weights defined for each basis function (that is, one for each mesh node in ${\mathcal{S}^{\ast }}$). Then, ${\boldsymbol{W}^{\ast }}={[{W_{1}^{\ast }},\dots ,{W_{N}^{\ast }}]^{T}}\sim \text{MVN}(\mathbf{0},{\boldsymbol{Q}_{\phi }^{-1}})$, where the $(N\times N)$-dimensional precision matrix ${\boldsymbol{Q}_{\phi }}$ can be written as
(3.5)
\[ {\boldsymbol{Q}_{\phi }}=\frac{{\phi ^{2}}}{4\pi }\left[\frac{1}{{\phi ^{4}}}\boldsymbol{D}+\frac{2}{{\phi ^{2}}}{\boldsymbol{G}_{1}}+{\boldsymbol{G}_{2}}\right],\]
where $\boldsymbol{D}$, ${\boldsymbol{G}_{1}}$, and ${\boldsymbol{G}_{2}}$ are sparse ($N\times N$)-dimensional finite-element matrices obtained as follows. The matrix $\boldsymbol{D}$ is diagonal with its jth diagonal entry ${D_{j,j}}=\langle {\zeta _{j}}(\cdot ),1\rangle $, where $\langle f,g\rangle =\textstyle\int f(\boldsymbol{s})g(\boldsymbol{s})d\boldsymbol{s}$ denotes an inner product. Similarly, ${\boldsymbol{G}_{1}}$ has elements ${G_{1;j,k}}=\langle \nabla {\zeta _{j}}(\cdot ),\nabla {\zeta _{k}}(\cdot )\rangle $, and ${\boldsymbol{G}_{2}}={\boldsymbol{G}_{1}}{\boldsymbol{D}^{-1}}{\boldsymbol{G}_{1}}$. Efficient computation of these sparse matrices is implemented in the function inla.mesh.fem from the R package INLA. For further theoretical details, see [3] and [23].
In order to map the spatial random effects ${\boldsymbol{W}^{\ast }}$ (defined across mesh nodes) back to the observation locations $\mathcal{S}$, we use an $(n\times N)$-dimensional projection matrix $\boldsymbol{A}$. The ${(i,j)^{th}}$ element of this matrix is ${\zeta _{j}}({\boldsymbol{s}_{i}})$ for every spatial location ${\boldsymbol{s}_{i}}\in \mathcal{S}$ and mesh node ${\boldsymbol{s}_{j}^{\ast }}\in {\mathcal{S}^{\ast }}$, allowing us to compute $\boldsymbol{A}{\boldsymbol{W}^{\ast }}$, the projection of ${\boldsymbol{W}^{\ast }}$ onto the data locations. The matrix $\boldsymbol{A}$ is generated using the function inla.spde.make.A within the R package INLA.
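As a sketch under the assumption that the mesh and coordinates from the previous example are available (recent INLA releases accept fmesher meshes; older versions may require inla.mesh.2d), the following R code assembles the finite-element matrices, the precision matrix in (3.5), and the projector A using the INLA helpers named above.

```r
library(INLA)
fem <- inla.mesh.fem(mesh, order = 2)     # sparse finite-element matrices for the mesh
D  <- fem$c0                              # diagonal matrix with entries <zeta_j, 1>
G1 <- fem$g1                              # entries <grad zeta_j, grad zeta_k>
G2 <- fem$g2                              # equals G1 D^{-1} G1
phi <- 0.5                                # an example value of the range parameter
Q_phi <- (phi^2 / (4 * pi)) * (D / phi^4 + 2 * G1 / phi^2 + G2)   # equation (3.5)
A <- inla.spde.make.A(mesh, loc = coords) # n x N projector with entries zeta_j(s_i)
Matrix::nnzero(Q_phi) / prod(dim(Q_phi))  # small fraction of non-zeros: Q_phi is sparse
```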
We now approximate $\boldsymbol{W}$, incorporating the nugget effect through $\gamma \in [0,1]$, as
\[ \tilde{\boldsymbol{W}}=\sqrt{\gamma }\boldsymbol{A}{\boldsymbol{W}^{\ast }},\]
which has the covariance matrix
\[ {\boldsymbol{\Sigma }_{\tilde{\boldsymbol{W}}}}=\gamma \boldsymbol{A}{\boldsymbol{Q}_{\phi }^{-1}}{\boldsymbol{A}^{\mathsf{T}}}.\]
This leads to the following approximation of (3.3):
(3.6)
\[ \begin{aligned}{}\mathcal{L}(\boldsymbol{\theta })& =\int \prod \limits_{i:{\boldsymbol{s}_{i}}\notin {\mathcal{S}^{(c)}}}\frac{1}{{\tau ^{-1/2}}{(1-\gamma )^{1/2}}}\phi \left(\frac{{y_{i}}-{\boldsymbol{x}_{i}^{\mathsf{T}}}\boldsymbol{\beta }-{w_{i}}}{{\tau ^{-1/2}}{(1-\gamma )^{1/2}}}\right)\\ {} & \times \hspace{-0.1667em}\prod \limits_{i:{\boldsymbol{s}_{i}}\in {\mathcal{S}^{(c)}}}\hspace{-0.1667em}\Phi \left(\frac{{u_{i}}-{\boldsymbol{x}_{i}^{\mathsf{T}}}\boldsymbol{\beta }-{w_{i}}}{{\tau ^{-1/2}}{(1-\gamma )^{1/2}}}\right){f_{\text{MVN}}}(\boldsymbol{w};\mathbf{0},{\tau ^{-1}}{\boldsymbol{\Sigma }_{\tilde{\boldsymbol{W}}}})d\boldsymbol{w}.\end{aligned}\]
The SPDE approach yields an accurate approximation of the true correlation structure; see the right panel of Figure 4. Leveraging the sparsity of the precision matrix ${\boldsymbol{Q}_{\phi }}$, we can carry out rapid Bayesian computations.
3.2 Final Hierarchical Model
We write the vector of the final approximate data process, $\tilde{Y}(\cdot )$, evaluated at $\mathcal{S}$ as $\tilde{\boldsymbol{Y}}={[\tilde{Y}({\boldsymbol{s}_{1}}),\dots ,\tilde{Y}({\boldsymbol{s}_{n}})]^{T}}$ and define a rescaled random-effects vector at the mesh nodes as ${\tilde{\boldsymbol{W}}^{\ast }}=\sqrt{\gamma /\tau }{\boldsymbol{W}^{\ast }}$. By introducing a latent process $W(\boldsymbol{s})$ (Section 3) and approximating it by $\tilde{\boldsymbol{W}}=\sqrt{\gamma }\boldsymbol{A}{\boldsymbol{W}^{\ast }}$ (Section 3.1), we avoid the need for multivariate imputations, and the hierarchical model for $\tilde{\boldsymbol{Y}}$ can then be written as
(3.7)
\[\begin{aligned}{}\tilde{\boldsymbol{Y}}|{\tilde{\boldsymbol{W}}^{\ast }}& \sim \text{MVN}\left(\boldsymbol{X}\boldsymbol{\beta }+\boldsymbol{A}{\tilde{\boldsymbol{W}}^{\ast }},{\tau ^{-1}}(1-\gamma ){\boldsymbol{I}_{n}}\right),\\ {} {\tilde{\boldsymbol{W}}^{\ast }}& \sim \text{MVN}(\mathbf{0},\gamma {\tau ^{-1}}{\boldsymbol{Q}_{\phi }^{-1}}),\\ {} \{\boldsymbol{\beta },\tau ,\phi ,\gamma \}& \sim \pi (\boldsymbol{\beta }|\tau )\times \pi (\tau )\times \pi (\phi )\times \pi (\gamma ).\end{aligned}\]
The first layer of (3.7) models the “true” data, which, when observed, coincide with the observed data. When an observation is censored, the only information we have about the “true” datum is that it is smaller than the censoring limit. These two situations are represented by the first two terms within the integral in (3.6): the first term (the product of normal densities) conveys the contribution of the “true” data that are observed, and the second term (the product of normal distribution functions) represents the contribution of the censored observations to the data likelihood. The second layer of (3.7) corresponds to the latent spatial random effect, whose contribution is represented by the third term in (3.6). The integral in (3.6) arises from integrating out the spatial random effect to obtain the observed-data likelihood, which is not necessary under the hierarchical formulation. The last layer of the model indicates the prior choices for the model parameters, which we discuss in Section 3.4. Instead of the likelihood based on (3.1), we fit the approximate data process (3.7) to the actual observation process, with the likelihood function in (3.2) approximated by the one in (3.6).
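To illustrate how the sparse precision matrix makes the hierarchical model (3.7) cheap to work with, the R sketch below simulates one realization of the latent weights and the data layer using a sparse Cholesky factorization; it assumes the objects Q_phi, A, and coords from the earlier sketches, and the parameter values are arbitrary illustrative choices.

```r
library(Matrix)
set.seed(1)
beta <- c(1, 0.02, -0.03); tau <- 2; gamma <- 0.8
N <- nrow(Q_phi); n <- nrow(A)
# Draw W* ~ MVN(0, Q_phi^{-1}) via the sparse Cholesky factor (P Q P' = L L'):
ch <- Cholesky(forceSymmetric(Q_phi), LDL = FALSE)
z  <- rnorm(N)
w_star <- as.vector(solve(ch, solve(ch, z, system = "Lt"), system = "Pt"))
w_star_tilde <- sqrt(gamma / tau) * w_star              # rescaled weights sqrt(gamma / tau) W*
X <- cbind(1, coords)                                   # intercept, longitude, latitude
y_tilde <- as.vector(X %*% beta + A %*% w_star_tilde) +
  rnorm(n, sd = sqrt((1 - gamma) / tau))                # data layer of (3.7) with nugget noise
```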
3.3 Prediction
Let ${\mathcal{S}^{(0)}}=\{{\boldsymbol{s}_{1}^{(0)}},\dots ,{\boldsymbol{s}_{{n_{0}}}^{(0)}}\}\subset \mathcal{D}$ denote a set of ${n_{0}}$ prediction sites, and define ${\tilde{\boldsymbol{Y}}^{(0)}}={[\tilde{Y}({\boldsymbol{s}_{1}^{(0)}}),\dots ,\tilde{Y}({\boldsymbol{s}_{{n_{0}}}^{(0)}})]^{T}}$. Also, let ${\boldsymbol{X}^{(0)}}$ denote the $({n_{0}}\times p)$-dimensional design matrix, with its ith row $\boldsymbol{X}({\boldsymbol{s}_{i}^{(0)}}),i=1,\dots ,{n_{0}}$, denoting the vector of covariates at prediction location ${\boldsymbol{s}_{i}^{(0)}}$. For mapping the (scaled) spatial random effects ${\tilde{\boldsymbol{W}}^{\ast }}$ (defined across mesh nodes) to the prediction locations ${\mathcal{S}^{(0)}}$, we use an $({n_{0}}\times N)$-dimensional projection matrix ${\boldsymbol{A}^{(0)}}$. The ${(i,j)^{th}}$ element of this matrix is ${\zeta _{j}}({\boldsymbol{s}_{i}^{(0)}})$ for every spatial location ${\boldsymbol{s}_{i}^{(0)}}\in {\mathcal{S}^{(0)}}$ and mesh node ${\boldsymbol{s}_{j}^{\ast }}\in {\mathcal{S}^{\ast }}$, allowing us to compute ${\boldsymbol{A}^{(0)}}{\tilde{\boldsymbol{W}}^{\ast }}$, the projection of ${\tilde{\boldsymbol{W}}^{\ast }}$ onto ${\mathcal{S}^{(0)}}$. Then, given ${\tilde{\boldsymbol{W}}^{\ast }}$, the conditional distribution of ${\tilde{\boldsymbol{Y}}^{(0)}}$ is
\[ {\tilde{\boldsymbol{Y}}^{(0)}}|{\tilde{\boldsymbol{W}}^{\ast }}\sim \text{MVN}\left({\boldsymbol{X}^{(0)}}\boldsymbol{\beta }+{\boldsymbol{A}^{(0)}}{\tilde{\boldsymbol{W}}^{\ast }},{\tau ^{-1}}(1-\gamma ){\boldsymbol{I}_{{n_{0}}}}\right),\]
and posterior predictive draws are obtained by evaluating this distribution at each retained posterior sample of $(\boldsymbol{\beta },\tau ,\phi ,\gamma ,{\tilde{\boldsymbol{W}}^{\ast }})$.
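A sketch of this prediction step is given below, assuming the mesh and projector machinery from the earlier sketches; beta_samp, wstar_samp, tau_samp, and gamma_samp are hypothetical names for stored posterior draws, filled here with toy values in place of the sampler's actual output.

```r
set.seed(1)
pred_coords <- as.matrix(expand.grid(lon = seq(-124, -114, by = 0.5),
                                     lat = seq(33, 42, by = 0.5)))   # toy prediction grid
A0 <- INLA::inla.spde.make.A(mesh, loc = pred_coords)                # n0 x N projector
X0 <- cbind(1, pred_coords)                                          # intercept + coordinates
M  <- 200; N <- nrow(mesh$loc)                                       # number of draws, mesh nodes
beta_samp  <- matrix(rnorm(M * 3, sd = 0.1), M, 3)                   # stand-ins for stored draws
wstar_samp <- matrix(rnorm(M * N, sd = 0.1), M, N)
tau_samp   <- rgamma(M, 10, 10)
gamma_samp <- rbeta(M, 9, 1)
pred_draws <- sapply(seq_len(M), function(m) {
  mu <- as.vector(X0 %*% beta_samp[m, ] + A0 %*% wstar_samp[m, ])    # conditional mean
  mu + rnorm(nrow(X0), sd = sqrt((1 - gamma_samp[m]) / tau_samp[m])) # add nugget noise
})
pred_mean <- rowMeans(pred_draws)                                    # posterior predictive mean
pred_sd   <- apply(pred_draws, 1, sd)                                # predictive uncertainty
```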
3.4 Computational Details
Inference concerning the model parameters is conducted through Markov chain Monte Carlo (MCMC) sampling, implemented in R. Given the computational dependence on prior selections for the model parameters, we first specify these priors. Whenever feasible, we opt for conjugate priors and employ Gibbs sampling to update them iteratively. When prior conjugacy is unavailable, we resort to random walk Metropolis-Hastings (MH) steps for parameter updates. During the burn-in period, we adjust the candidate distributions within the MH steps to ensure that the acceptance rate throughout the post-burn-in period remains within the range of 0.3 to 0.5.
Here we draw samples from the full posterior
\[ \pi (\boldsymbol{\beta },\tau ,\phi ,\gamma ,{\tilde{\boldsymbol{W}}^{\ast }},{\tilde{\boldsymbol{Y}}^{(c)}}|{\tilde{\boldsymbol{Y}}^{(nc)}}),\]
where ${\tilde{\boldsymbol{Y}}^{(c)}}$ is the vector of censored data and ${\tilde{\boldsymbol{Y}}^{(nc)}}$ is the vector of non-censored data. For the vector of regression coefficients $\boldsymbol{\beta }$, we consider the weakly-informative conjugate prior $\boldsymbol{\beta }|\tau \sim \text{MVN}(\mathbf{0},{100^{2}}{\tau ^{-1}}{\boldsymbol{I}_{p}})$. The full conditional posterior of $\boldsymbol{\beta }$ is then multivariate normal and updated using direct sampling within Gibbs steps. Due to the strong posterior correlation between $\boldsymbol{\beta }$ and ${\tilde{\boldsymbol{W}}^{\ast }}$, they are updated jointly within each Gibbs sampling step. We also consider non-informative priors for the hyperparameters involved in the correlation function in (2.1). Specifically, we choose $\phi \sim \text{Uniform}(0,0.5{\Delta _{\mathcal{S}}})$ for the spatial range parameter, where ${\Delta _{\mathcal{S}}}$ is the largest Euclidean distance between two data locations, and $\gamma \sim \text{Uniform}(0,1)$ for the ratio γ of spatial to total variation. We further designate a non-informative conjugate prior for the spatially-constant precision parameter τ in the process model, namely $\tau \sim \text{Gamma}(0.1,0.1)$. The full conditional posterior distribution of ${\tilde{\boldsymbol{Y}}^{(c)}}$ is $\text{MVN}\left({\boldsymbol{X}^{(c)}}\boldsymbol{\beta }+{\boldsymbol{A}^{(c)}}{\tilde{\boldsymbol{W}}^{\ast }},{\tau ^{-1}}(1-\gamma ){\boldsymbol{I}_{{n_{c}}}}\right)$, truncated componentwise above at the censoring limits in $\mathcal{U}$, where ${\boldsymbol{X}^{(c)}}$ and ${\boldsymbol{A}^{(c)}}$ are the design matrix and the SPDE projection matrix (from mesh nodes), respectively, corresponding to the locations with censored data, i.e., they comprise the rows of $\boldsymbol{X}$ and $\boldsymbol{A}$ that correspond to the censored entries of $\boldsymbol{Y}$.
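The adaptive random-walk MH tuning described above can be sketched generically as follows; the target density is a simple placeholder, and the adaptation rule (shrink or grow the proposal standard deviation during burn-in until the acceptance rate sits between 0.3 and 0.5) mirrors the scheme stated in the text rather than reproducing our exact implementation.

```r
set.seed(1)
log_post <- function(phi) dgamma(phi, 2, 4, log = TRUE)   # placeholder log-posterior for phi
n_iter <- 5000; burn <- 2500
phi <- 0.5; prop_sd <- 0.1; acc <- 0
draws <- numeric(n_iter)
for (t in seq_len(n_iter)) {
  phi_new <- phi + rnorm(1, 0, prop_sd)                   # random-walk proposal
  if (is.finite(log_post(phi_new)) &&
      log(runif(1)) < log_post(phi_new) - log_post(phi)) {
    phi <- phi_new; acc <- acc + 1                        # accept the move
  }
  if (t <= burn && t %% 100 == 0) {                       # adapt only during burn-in
    rate <- acc / 100; acc <- 0
    if (rate < 0.3) prop_sd <- prop_sd * 0.8              # too few acceptances: smaller steps
    if (rate > 0.5) prop_sd <- prop_sd * 1.2              # too many acceptances: larger steps
  }
  draws[t] <- phi
}
mean(draws[(burn + 1):n_iter])                            # posterior mean from post-burn-in draws
```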
3.5 Software
We have developed an open-source R package, called CensSpBayes, which implements the proposed approximate Matérn GP model for large left-censored spatial data. Implementation code, along with details of execution using simulated data, is made available at https://github.com/SumanM47/CensSpBayes.
4 Simulation Study
In this section, we conduct simulation studies using synthetic data to assess the efficacy of our proposed scalable modeling framework in terms of spatial prediction while imputing censored values. We simulate 100 datasets over grids ${\mathcal{D}^{\ast }}=\{(i,j):i,j\in \{1/K,2/K,\dots ,1\}\}$ of varying sizes within a spatial domain ${[0,1]^{2}}$. We consider $K\times K$ grids with $K=20,50,100$, and 200 to demonstrate the computational power and scalability inherent to our proposed methodology.
For simulating the datasets, we consider a model with two covariates and an intercept term. The values of the covariates are randomly generated from $N(0,1)$ and $N(5,0.49)$ respectively, and we assume the true value of the regression coefficient to be ${\boldsymbol{\beta }^{\text{true}}}={(3,1.2,0.5)^{\mathsf{T}}}$. The true value of the range parameter of the spatial Matérn correlation is chosen to be ${\phi ^{\text{true}}}=0.15\times {\Delta ^{\ast }}$, where ${\Delta ^{\ast }}=\sqrt{2}$, the maximum spatial distance between two locations in ${[0,1]^{2}}$. The smoothness parameter is set to one and not estimated while fitting our proposed model. The true ratio of partial sill to total variation is chosen to be ${\gamma ^{\text{true}}}=0.9$, and the true precision parameter is chosen to be ${\tau ^{\text{true}}}=1/5$. Exact simulations are conducted to generate datasets from a GP with Matérn correlation, as given in (2.1).
Once the datasets are generated, we divide each dataset randomly into 80% training and 20% test sets. Within each training set, we consider two different levels of censoring (denoted by L1 and L2) for the response by setting different values of the minimum detection limit (MDL): under L1 (low censoring), the MDL is set so that approximately 15% of the training responses fall below it, while under L2 (high censoring), approximately 45% of the training responses fall below the MDL.
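For concreteness, the R sketch below generates one synthetic dataset on a 20 x 20 grid under the settings described above (exact simulation via a dense Cholesky factor, feasible for the smaller grids) and applies an L1-style censoring rule; the 15% cut-off mirrors the low-censoring level.

```r
set.seed(1)
K <- 20
coords <- as.matrix(expand.grid((1:K) / K, (1:K) / K))
n <- nrow(coords)
X <- cbind(1, rnorm(n), rnorm(n, mean = 5, sd = 0.7))   # intercept + N(0,1) and N(5, 0.49) covariates
beta <- c(3, 1.2, 0.5); phi <- 0.15 * sqrt(2); gamma <- 0.9; tau <- 1 / 5
D <- as.matrix(dist(coords))
R <- gamma * (D / phi) * besselK(D / phi, nu = 1)       # gamma * Matern(nu = 1) correlation
R[D == 0] <- 1                                          # diagonal: gamma + (1 - gamma) nugget = 1
y <- as.vector(X %*% beta + sqrt(1 / tau) * t(chol(R)) %*% rnorm(n))
train <- sample(n, size = round(0.8 * n))               # 80% training / 20% test split
mdl   <- quantile(y[train], 0.15)                       # MDL censoring ~15% of training responses (L1)
cens  <- y[train] < mdl                                 # censoring indicators delta(s)
```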
For each of the two levels of censoring, we implement our proposed approximate Matérn GP model under three different settings, denoted by S1, S2, and S3:
-
S1 We selectively exclude spatial locations where observations are censored and apply the spatial model approximated via SPDE, as elaborated in Section 3.1, exclusively to the observed locations. This does not require any imputation of the censored observations.
-
S2 We impute the censored observations using the mean of the observed data and employ the SPDE-approximated Matérn GP model, as detailed in Section 3.1. This once again circumvents the need for model-based imputation of the censored observations within the MCMC.
-
S3 We fit the full proposed model, treating the observations below MDL as censored observations, and implement the SPDE-approximated Matérn GP model, as in Section 3.1, while simultaneously performing imputations for the censored observations.
In each of the three settings, the SPDE-approximated spatial process is fitted, allowing us to assess, in terms of mean squared prediction error (MSPE), the effect of removing censored observations or imputing them in an ad hoc way, in contrast to treating them as genuinely censored. The prior distributions for $\boldsymbol{\beta }$ and γ described in Section 3.4 remain unchanged in the simulation study. However, for the range parameter, we assume $\phi \sim \text{Uniform}(0,0.25{\Delta ^{\ast }})$.
For comparison, we consider two additional models, S4 and S5:
-
S4 We use a 2-dimensional B-spline basis, along with the covariates, to model the mean while ignoring the covariance structure and excluding the censored values (a basis-construction sketch follows this list), i.e.,\[ Y(\boldsymbol{s})={\boldsymbol{x}\boldsymbol{(}\boldsymbol{s}\boldsymbol{)}^{\mathsf{T}}}\boldsymbol{\beta }+{\sum \limits_{i=1}^{{K_{1}}}}{\sum \limits_{j=1}^{{K_{2}}}}{\alpha _{ij}}B({\boldsymbol{s}_{1}};{\boldsymbol{s}_{ij,1}^{0}})B({\boldsymbol{s}_{2}};{\boldsymbol{s}_{ij,2}^{0}})+\varepsilon (\boldsymbol{s}),\]where $\varepsilon (\boldsymbol{s})$ are i.i.d. with zero mean and variance ${\tau ^{2}}$, and $B({\boldsymbol{s}_{k}};{\boldsymbol{s}_{ij,k}^{0}})$ is the evaluation of the one-dimensional B-spline at the k-th coordinate of $\boldsymbol{s}$ and the k-th coordinate of the $(i,j)$th knot, $k=1,2$. Specifying a model for the mean structure only enables us to demonstrate the impact of disregarding both the covariance structure and the presence of censoring in the data. We use an ML estimation scheme here, with the number of knots for the 2-dimensional B-splines set at $20\% $ of the grid size.
-
S5 We use the ‘CensSpatial’ algorithm to perform an exact ML estimation of the model parameters, which implements the SAEM algorithm of [28] with the package defaults for the optimization procedures and standard settings for initial values and search limits. The covariance model is set to a Matérn covariance with smoothness parameter 1.
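Referenced from S4 above, a minimal sketch of the tensor-product (2-D) B-spline mean model is shown below; the basis sizes and the toy response are illustrative and do not follow the 20%-of-grid-size knot rule used in the study, and any covariates would simply be added to the regression formula.

```r
library(splines)
set.seed(1)
n <- 500
s1 <- runif(n); s2 <- runif(n)
y  <- 3 + sin(4 * s1) + cos(4 * s2) + rnorm(n, sd = 0.3)   # toy mean surface plus i.i.d. error
B1 <- bs(s1, df = 8)                                       # one-dimensional B-spline basis in s1
B2 <- bs(s2, df = 8)                                       # one-dimensional B-spline basis in s2
# Row-wise tensor product: column (j, k) holds B1[, j] * B2[, k]
BB <- B1[, rep(seq_len(ncol(B1)), times = ncol(B2))] *
      B2[, rep(seq_len(ncol(B2)), each  = ncol(B1))]
fit <- lm(y ~ BB)                                          # least-squares fit of the mean model
summary(fit)$sigma                                         # estimated residual standard deviation
```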
Table 1
Median mean squared prediction error (MSPE) corresponding to model fitting under the five different settings (S1-S5) based on 100 simulated data sets from a Matérn GP that varies with censoring levels L1 (low-censoring; 15%) and L2 (high-censoring; 45%), and grid sizes. The values in parenthesis represent the corresponding median prediction standard errors. The lowest median MSPE in each row is in bold. Since model S5 (‘CensSpatial’) was infeasible for larger grids, table entries (median MSPE and standard errors) appear as ‘-’.
Censoring level | Grid size | S1 | S2 | S3 | S4 | S5
L1 | $20\times 20$ | 0.88(0.83) | 2.13(1.17) | 0.76(0.84) | 1.54(1.12) | 0.81(0.94)
L1 | $50\times 50$ | 0.67(0.75) | 2.16(1.10) | 0.60(0.78) | 0.87(0.85) | -
L1 | $100\times 100$ | 0.61(0.72) | 2.07(1.07) | 0.56(0.74) | 0.70(0.76) | -
L1 | $200\times 200$ | 0.58(0.71) | 2.01(1.05) | 0.54(0.73) | 0.63(0.73) | -
L2 | $20\times 20$ | 1.89(0.85) | 5.77(0.92) | 0.88(0.89) | 2.50(1.04) | 1.43(1.08)
L2 | $50\times 50$ | 1.28(0.75) | 6.34(0.85) | 0.64(0.79) | 1.63(0.90) | -
L2 | $100\times 100$ | 1.03(0.70) | 6.00(0.83) | 0.58(0.76) | 2.67(0.76) | -
L2 | $200\times 200$ | 0.95(0.68) | 6.26(0.82) | 0.54(0.74) | 1.20(0.08) | -
Figure 5
Boxplots of log(MSPE) values for S1 (censored data removed, GP approximated), S2 (censored data mean-imputed, GP approximated), S3 (proposed methodology), S4 (censored data removed, 2D B-spline for mean modeling and no covariance modeling), and S5 (CensSpatial, only for $20\times 20$) on the simulated datasets, grouped by grid size, for the 15% (top) and 45% (bottom) censoring scenarios. Outliers are not shown to make the boxplots easier to read.
Table 1 presents the median (across 100 simulated datasets) mean squared prediction errors (MSPE), along with the corresponding median standard errors, obtained by fitting models S1-S5 and evaluating them on test sets that vary according to censoring level and grid size. Notably, under low levels of censoring (L1), the proposed model in scenario S3 yields better results than the alternatives in which censored observations are either excluded from the analysis or imputed using the mean of observed values. However, ignoring spatial locations with censored observations entirely leads to unreliable estimates, particularly for the covariance parameters. Similarly, under high censoring (L2), the proposed model with imputation of censored observations (S3) outperforms all other models. It is noteworthy that the ‘CensSpatial’ method also demonstrates relatively favorable performance for a grid size of $20\times 20$ when the data-generating model is a Matérn GP; however, its computational inefficiency and inadequate scaling impeded our ability to apply the method to the higher-dimensional simulated datasets. In fact, [28] showcased the efficacy of the ‘CensSpatial’ algorithm through simulations involving only 50 and 200 spatial locations, clearly indicating its limited scalability. Ignoring the covariance structure also affects performance, as the MSPE for the B-spline-based method (S4) is consistently higher than that for S1. Moreover, S4 often fails to produce a respectable MSPE, likely because it ignores both the censoring and the covariance structure. These conclusions are further supported by the boxplots of log(MSPE) values shown in Figure 5. Specifically, under the high-censoring scenario ($100\times 100$ in row 2), method S4 displays a notably tall boxplot, suggesting that it produces more unstable results compared to the other methods.
Table 2
Median computation time (in minutes) corresponding to model fitting under the five different settings (S1-S5) based on 100 simulated data sets from a Matérn GP that varies with censoring levels L1 (low-censoring; 15%) and L2 (high-censoring; 45%), and grid sizes. The values in parenthesis represent the median absolute deviation for the corresponding computing times. Since model S5 (‘CensSpatial’) was infeasible for larger grids, table entries appear as ‘-’.
Censoring level | Grid size | S1 | S2 | S3 | S4 | S5
L1 | $20\times 20$ | 22.60(0.43) | 23.41(0.38) | 23.36(0.34) | $\lt 0.01$ ($\lt 0.01$) | 42.65(1.01)
L1 | $50\times 50$ | 19.43(0.33) | 19.21(0.36) | 19.32(0.60) | $\lt 0.01$ ($\lt 0.01$) | -
L1 | $100\times 100$ | 25.45(0.41) | 25.31(0.50) | 25.76(0.32) | 0.05($\lt 0.01$) | -
L1 | $200\times 200$ | 26.06(0.59) | 26.30(0.49) | 27.10(0.41) | 2.20(0.08) | -
L2 | $20\times 20$ | 21.06(0.38) | 23.38(0.32) | 23.69(0.29) | $\lt 0.01$ ($\lt 0.01$) | 80.27(1.67)
L2 | $50\times 50$ | 19.03(0.54) | 19.29(0.36) | 19.25(0.51) | $\lt 0.01$ ($\lt 0.01$) | -
L2 | $100\times 100$ | 25.37(0.45) | 25.57(0.39) | 25.75(0.61) | 0.04($\lt 0.01$) | -
L2 | $200\times 200$ | 26.35(0.81) | 26.18(1.03) | 27.04(0.60) | 0.02($\lt 0.01$) | -
Table 2 presents the median computation time for fitting models S1-S5 under the different censoring levels and grid sizes. The computation times for S1, S2, and S3 are comparable, as they all employ the SPDE-approximated Matérn GP, with runtime approximately proportional to the size of the INLA mesh used for the process approximation (between 557 and 673 nodes across datasets of different sizes). As anticipated, the runtime for the B-spline-based method (S4) increases with larger grid sizes. As discussed earlier, the ‘CensSpatial’ algorithm was infeasible for larger grids. The first four methods, S1–S4, were executed on SLURM clusters with one core per job and an 8 GB RAM allocation. However, because the current version of ‘CensSpatial’ on CRAN is incompatible with UNIX systems, the algorithm was implemented on a personal Dell 7210 computer featuring 16 GB RAM, an Intel Core i5 dual-core processor, and a Windows 11 Enterprise 64-bit operating system.
5 Application: California PFAS Data
5.1 Analysis Plan and Hyperparameters
We use the iterated log-transformed data (as described in Section 2) as the input to the proposed method. Since we have no other covariates in this dataset, we use the coordinates of the locations (longitude, latitude) as covariates. In California, longitude reflects the distance from the Pacific Ocean, and the southern regions tend to have more desert-like areas compared to the northern regions. Consequently, incorporating geographic coordinates (longitude and latitude) can capture potential spatial trends in the data. Furthermore, recent news reports have suggested elevated PFAS concentrations in urban areas of Southern and Central California [27, 29]. Including longitude and latitude as covariates in our study enables us to corroborate these claims within a rigorous statistical framework. The hyperparameters for the priors are the same as those mentioned in Section 3.4. We fit a variogram model to the non-censored observations to obtain an initial set of parameter estimates for $\boldsymbol{\beta }$, τ, ϕ, and γ. We run three chains with different starting values, all close to the initial estimates obtained by variogram fitting, to allow convergence checking and to increase the reliability of the model output. Each chain was run for 25,000 iterations, with the first 15,000 samples discarded as burn-in. We thinned the post-burn-in samples by 5 to obtain 2,000 samples from the posterior distribution of the parameters.
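A sketch of the convergence check across the three chains using the coda package is shown below; chain1, chain2, and chain3 are placeholders for the matrices of retained, thinned posterior draws (2,000 rows each) that the sampler would produce, filled here with toy values.

```r
library(coda)
set.seed(1)
# Toy stand-ins for the retained draws of, e.g., (beta0, beta1, beta2, phi), one matrix per chain:
chain1 <- matrix(rnorm(2000 * 4), 2000, 4)
chain2 <- matrix(rnorm(2000 * 4), 2000, 4)
chain3 <- matrix(rnorm(2000 * 4), 2000, 4)
chains <- mcmc.list(mcmc(chain1), mcmc(chain2), mcmc(chain3))
gelman.diag(chains)     # potential scale reduction factors near 1 indicate adequate mixing
effectiveSize(chains)   # effective sample sizes of the pooled, thinned draws
```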
5.2 Results
Figure 6
Left: The predicted surface map for $g(\text{PFOS})=\log (1+\log (1+\text{PFOS}))$ concentration in the state of California. Right: The corresponding uncertainty estimates associated with the predictions for $g(\text{PFOS})$ across the pixels.
For the observed data comprising 24,959 locations and the prediction grid of $0.1^\circ \times 0.1^\circ $, yielding 405,893 prediction locations across California, each of the three MCMC chains completed in approximately 62 minutes. These computations were performed on a SLURM cluster with an 8 GB RAM allocation. We observed reasonably good mixing of the three chains. The parameter estimates (posterior means) and the corresponding standard errors (posterior standard deviations) are presented in Table 3. Both estimated covariate effects are negative and significant in our study. The estimated spatial range is relatively small (approximately 7 km). The posterior distribution of γ exhibits left skewness, suggesting that the majority of the variance in the data is attributable to the spatial structure rather than local noise, highlighting the importance of modeling the spatial covariance. The prediction surface is smooth in places (left panel of Figure 6), with greater detail around the regions with observed data. The prediction standard deviation is low towards the western parts, where we have more observed data, but shows an intriguing pattern in the east-southeastern parts (right panel of Figure 6). Our analysis confirms the findings of previous studies and news reports, as we observe significant negative effects of longitude and latitude. We predict elevated PFOS concentrations in urban areas along the western coast of Central and Southern California, with values exceeding the new EPA safety threshold of approximately 0.95 on the transformed scale. Notably high PFOS levels are detected in and around Sonoma, Napa, Solano, Contra Costa, Alameda, San Francisco, and Santa Clara counties in central-western California, as well as in Los Angeles, Orange, and San Diego counties in southwestern California.
Table 3
Table of estimates (posterior mean) and posterior standard deviations corresponding to the parameters ${\beta _{0}},{\beta _{1}}$ and ${\beta _{2}}$ denoting the intercept and the 2 covariates, respectively, ϕ (the spatial range), τ (the precision) and γ (ratio of partial sill to total variance).
Parameter | Estimate | Standard Deviation
${\beta _{0}}$ | $-22.04$ | 7.60
${\beta _{1}}$ | $-0.23$ | 0.08
${\beta _{2}}$ | $-0.14$ | 0.07
ϕ | 0.07 | 0.02
τ | 0.22 | 0.08
γ | 0.95 | 0.02
6 Discussion
We present a novel method to address the problem of modeling censored, spatially correlated outcomes in big data settings. The proposed model scales well with an increasing number of observation locations and performs better than all competing methods considered, even when nearly half of the observed data are censored. Despite being a fully Bayesian model, its runtime is moderate and at least comparable to, if not better than, that of the competing methods. This scalability, combined with its demonstrated accuracy, makes the proposed approach an efficient approximate method for modeling large spatial data in the presence of (left-) censoring. The real data analysis demonstrated this further, as the model achieved satisfactory mixing of three chains for a large dataset in only about an hour while producing sensible prediction surfaces and uncertainty quantification.
However, the data present specific challenges during modeling, which may also be considered limitations of the proposed method. The predicted surface in the left panel of Figure 6 is very smooth towards the east-southeast end of California. This is expected, as we have very few observations in that area to inform our spatial process. While undesirable, this is in line with what one would expect for such a dataset: we cannot manufacture information in the absence of observations, and the analysis does not attempt to do so. The map of prediction uncertainty (right panel of Figure 6) also reflects this: uncertainty is nearly zero over most of the region, where observations are plentiful, and increases as we move farther from the observations. Interestingly, we also notice a quilting pattern in the right panel of Figure 6. This is a byproduct of the mesh object and the lack of observations in the east-southeast region.
Further consideration is therefore needed when choosing the mesh and smoothness parameters to fit the model. We explain our choice of mesh in Section 3.1. However, this process is ad hoc, and a concrete workflow for selecting a mesh would greatly benefit users; we consider this a plausible future research direction. Another development, on both the software and methodological fronts, would be to include fractional smoothness parameters in the model, which is currently restricted to integer smoothness (we use $\nu =1$ for all our analyses). One possible approach would be to use rational approximations to the fractional SPDE model [6, 5]. Combining the theory for fractional approximations with the software should provide additional model flexibility and make the approach better suited to real data applications. Further development of multivariate extensions of the model, handling multiple spatial processes that simultaneously contain a mix of censored and uncensored observations, is underway.
Another important component to consider here is the assumption of stationarity in the spatial covariance model, which may not be realistic in many real-world applications. To address potential non-stationarity, we incorporate geographical locations as covariates, aiming to ensure that the residuals are stationary and can be effectively modeled using the proposed approach, while accounting for data censoring. If the residuals exhibit non-stationary spatial dependence, it would be necessary to develop a model that accommodates non-stationary spatial structures. Given the large size of the dataset, a suitable approximation method for non-stationary spatial data would be required. Although such methods are relatively rare, some efforts have been made to address this challenge. In particular, we could extend our approach by utilizing the non-stationary version of the SPDE framework, as described by [24], to account for non-stationary spatial covariance in future work.
Beyond proposing a novel, scalable spatial model in its own right, this study has implications that extend beyond academic interest. By elucidating the spatial distribution of PFAS/PFOS contamination and its associated factors, we can inform targeted interventions, policy recommendations, and resource allocation to mitigate the impact of PFAS exposure on public health. Additionally, our research provides a framework that can be adapted to analyze censored data in other environmental contexts, fostering a deeper understanding of complex contamination scenarios and enabling evidence-based decision-making.