1 Introduction
Given a set of latent variables in a model, do we fit the model by plugging in a particular set of latent variables, or do we integrate out the latent variables when making predictions? Marginalization of latent variables is an iconic feature of Bayesian analysis. The art of marginalization in statistics can be traced back at least to de Finetti’s theorem [12], which states that an infinite sequence ${\{{X_{i}}\}_{i=1}^{\infty }}$ is exchangeable if and only if there exists a random variable $\theta \in \Theta $ with probability distribution $\pi (\cdot )$, and a conditional distribution $p(\cdot \mid \theta )$, such that

(1.1)
\[ p({x_{1}},\dots ,{x_{N}})=\int \left\{{\prod \limits_{i=1}^{N}}p({x_{i}}\mid \theta )\right\}\pi (\theta )d\theta .\]

Marginalization of nuisance parameters for models with independent observations has been comprehensively reviewed in [8]. Bayesian model selection [6, 4] and Bayesian model averaging [42], as two other examples, both rely on the marginalization of the parameters in each model.

For spatially correlated data, the Jeffreys prior of the covariance parameters in a Gaussian process (GP), which is proportional to the square root of the determinant of the Fisher information matrix of the likelihood, often leads to improper posteriors [7]. The posterior of the covariance parameters becomes proper if the prior is derived from the Fisher information matrix of the marginal likelihood, after marginalizing out the mean and variance parameters. The resulting prior, after marginalization, is a reference prior, which has been studied for modeling spatially correlated data and computer model emulation [39, 46, 28, 20, 37].
Marginalization of latent variables has lately drawn attention in the machine learning community as well, for purposes of uncertainty quantification and propagation. In [29], for instance, deep ensembles of models with a scoring function were proposed to assess the uncertainty of deep neural networks, and this approach is closely related to Bayesian model averaging with a uniform prior on the parameters. The approach was further studied in [64], where the importance of marginalization is highlighted. Neural networks with infinite width were shown to be equivalent to a GP with a particular kernel function in [38], and it was lately shown in [32] that the results of deep neural networks can be reproduced by GPs, where the latent nodes are marginalized out.
In this work, we study the marginalization of latent variables for correlated data, with a particular focus on scalable computation. Gaussian processes have been ubiquitously used for modeling spatially correlated data [3] and for emulating computer experiments [50]. Computing the likelihood of a GP and making predictions, however, cost $\mathcal{O}({N^{3}})$ operations, where N is the number of observations, due to the inversion and determinant computation of the covariance matrix. To overcome this computational bottleneck, various approximation approaches have been proposed, such as inducing point approaches [54], fixed rank approximation [10], integrated nested Laplace approximation [48], stochastic partial differential equation representations [33], local Gaussian process approximation [15], hierarchical nearest-neighbor Gaussian process models [11], and circulant embedding [55], many of which can be placed within the framework of the Vecchia approximation [59, 27]. Scalable computation for GP models with a multi-dimensional input space and a smooth covariance function has been of great interest in recent years.
The exact computation of GP models with smaller computational complexity has been less studied. To fill this knowledge gap, we first review the stochastic differential equation representation of a GP with the Matérn covariance and a one-dimensional input variable [63, 22], where the solution can be written as a dynamic linear model [62]. The Kalman filter and Rauch–Tung–Striebel smoother [26, 45] can then be implemented to compute the likelihood function and predictive distribution exactly, reducing the computational complexity of a GP with a Matérn kernel having a half-integer roughness parameter and 1D input from $\mathcal{O}({N^{3}})$ to $\mathcal{O}(N)$ operations. Interestingly, the latent states of the GP model are marginalized out iteratively in the Kalman filter; thus the Kalman filter can be considered an example of marginalization of latent variables that leads to efficient computation. Note that the Kalman filter is not directly applicable to GPs with multivariate inputs, yet GPs with some of the widely used covariance structures, such as the product or separable kernel [5] and the linear model of coregionalization [3], can be written as state space models on an augmented lattice [18, 17]. Based on this connection, we introduce a few extensions of scalable marginalization for modeling incomplete matrices of correlated data.
The contributions of this work are twofold. First, the computational scalability and efficiency of marginalizing latent variables in models of correlated data and functional data have been less studied. Here we discuss the marginalization of latent states in the Kalman filter for computing the likelihood and making predictions with only $\mathcal{O}(N)$ computational operations, and we discuss recent extensions to structured data with multi-dimensional inputs. Second, we develop new marginalization techniques to estimate interaction kernels of particles and to forecast trajectories of particles, which have wide applications in agent-based models [9], cellular migration [23], and molecular dynamics simulation [43]. The computational gain comes from a sparse representation of the inverse covariance of the interaction kernel, together with the conjugate gradient algorithm [24] for iterative computation. Specifically, we reduce the computational order from $\mathcal{O}({(nMDL)^{3}})+\mathcal{O}({n^{4}}{L^{2}}{M^{2}}D)$ operations in recent studies [34, 13] to $\mathcal{O}(T{n^{2}}MDL)+\mathcal{O}({n^{2}}MDL\log (nMDL))$ operations, based on training data of M simulation runs, each containing n particles in a D-dimensional space at L time points, with T being the number of iterations of the sparse conjugate gradient algorithm. This allows us to estimate interaction kernels of dynamic systems with many more observations. Here the sparsity comes from the use of the Matérn kernel, which is distinct from approximation methods based on sparse covariance structures.
The rest of the paper is organized as follows. We first introduce the GP as a surrogate model for approximating computationally expensive simulations in Section 2. The state space model representation of a GP with Matérn covariance and temporal input is introduced in Section 3.1. We then review the Kalman filter as a computationally scalable technique to marginalize out latent states for computing the likelihood of a GP model and making predictions in Section 3.2. In Section 3.3, we discuss the extension of latent state marginalization in linear models of coregionalization for multivariate functional data, and for spatial and spatio-temporal data on incomplete lattices. The new computationally scalable algorithm for estimating interaction kernels and forecasting particle trajectories is introduced in Section 4. We conclude this study and discuss a few potential research directions in Section 5. The code and data used in this paper are publicly available: https://github.com/UncertaintyQuantification/scalable_marginalization.
2 Background: Gaussian Process
We briefly introduce the GP model in this section. We focus on computer model emulation, where the GP emulator is often used as a surrogate model to approximate computer experiments [52]. Consider a real-valued unknown function $z(\cdot )$, modeled by a Gaussian stochastic process (GaSP) or Gaussian process (GP), $z(\cdot )\sim \mathcal{GP}(\mu (\cdot ),{\sigma ^{2}}K(\cdot ,\cdot ))$, meaning that, for any inputs $\{{\mathbf{x}_{1}},\dots ,{\mathbf{x}_{N}}\}$ (with ${\mathbf{x}_{i}}$ being a $p\times 1$ vector), the marginal distribution of $\mathbf{z}={(z({\mathbf{x}_{1}}),\dots ,z({\mathbf{x}_{N}}))^{T}}$ follows a multivariate normal distribution,
(2.1)
\[ \mathbf{z}\mid \boldsymbol{\beta },\hspace{0.1667em}{\sigma ^{2}},\hspace{0.1667em}\boldsymbol{\gamma }\sim \mathcal{MN}(\boldsymbol{\mu },{\sigma ^{2}}\mathbf{R})\hspace{0.1667em},\]
where $\boldsymbol{\mu }={(\mu ({\mathbf{x}_{1}}),\dots ,\mu ({\mathbf{x}_{N}}))^{T}}$ is a vector of mean or trend parameters, ${\sigma ^{2}}$ is the unknown variance, and R is the correlation matrix whose $(i,j)$ element is modeled by a kernel $K\hspace{0.1667em}({\mathbf{x}_{i}},{\mathbf{x}_{j}})$ with parameters $\boldsymbol{\gamma }$. It is common to model the mean by $\mu (\mathbf{x})=\mathbf{h}(\mathbf{x})\boldsymbol{\beta }\hspace{0.1667em}$, where $\mathbf{h}(\mathbf{x})$ is a $1\times q$ row vector of basis functions and $\boldsymbol{\beta }$ is a $q\times 1$ vector of mean parameters.

When modeling spatially correlated data, an isotropic kernel is often used, where the kernel depends only on the Euclidean distance, $K({\mathbf{x}_{a}},{\mathbf{x}_{b}})=K(||{\mathbf{x}_{a}}-{\mathbf{x}_{b}}||)$. In comparison, each coordinate of the latent function in computer experiments can have a different physical meaning and unit. Thus a product kernel is often used in constructing a GP emulator, such that the correlation lengths can differ across coordinates. For any ${\mathbf{x}_{a}}=({x_{a1}},\dots ,{x_{ap}})$ and ${\mathbf{x}_{b}}=({x_{b1}},\dots ,{x_{bp}})$, the kernel function can be written as $K({\mathbf{x}_{a}},{\mathbf{x}_{b}})={K_{1}}({x_{a1}},{x_{b1}})\times \cdots \times {K_{p}}({x_{ap}},{x_{bp}})$, where ${K_{l}}$ is a kernel for the lth coordinate with a distinct range parameter ${\gamma _{l}}$, for $l=1,\dots ,p$. Some frequently used kernels ${K_{l}}$ include the power exponential and Matérn kernel functions [44]. The Matérn kernel, for instance, follows
(2.2)
\[ {K_{l}}({d_{l}})=\frac{1}{{2^{{\nu _{l}}-1}}\Gamma ({\nu _{l}})}{\left(\frac{\sqrt{2{\nu _{l}}}{d_{l}}}{{\gamma _{l}}}\right)^{{\nu _{l}}}}{\mathcal{K}_{{\nu _{l}}}}\left(\frac{\sqrt{2{\nu _{l}}}{d_{l}}}{{\gamma _{l}}}\right),\]
where ${d_{l}}=|{x_{al}}-{x_{bl}}|$, $\Gamma (\cdot )$ is the gamma function, and ${\mathcal{K}_{{\nu _{l}}}}(\cdot /{\gamma _{l}})$ is the modified Bessel function of the second kind, with the range parameter and roughness parameter being ${\gamma _{l}}$ and ${\nu _{l}}$, respectively. The Matérn correlation has a closed-form expression when the roughness parameter is a half-integer, i.e. ${\nu _{l}}=2{k_{l}}+1/2$ with ${k_{l}}\in \mathbb{N}$. It becomes the exponential correlation when ${k_{l}}=0$ and the Gaussian correlation when ${k_{l}}\to \infty $. A GP with the Matérn kernel is $\lceil {\nu _{l}}-1\rceil $ times mean square differentiable along the lth coordinate. This is an appealing property, as the differentiability of the process is directly controlled by the roughness parameter.

Denote the mean basis of the observations $\mathbf{H}={({\mathbf{h}^{T}}({\mathbf{x}_{1}}),\dots ,{\mathbf{h}^{T}}({\mathbf{x}_{N}}))^{T}}$. The parameters of the GP contain the mean parameters $\boldsymbol{\beta }$, the variance parameter ${\sigma ^{2}}$, and the range parameters $\boldsymbol{\gamma }=({\gamma _{1}},\dots ,{\gamma _{p}})$. Integrating out the mean and variance parameters with respect to the reference prior $\pi (\boldsymbol{\beta },{\sigma ^{2}})\propto 1/{\sigma ^{2}}$, the predictive distribution at any input ${\mathbf{x}^{\ast }}$ follows a student t distribution [20]:
(2.3)
\[ z({\mathbf{x}^{\ast }})\mid \mathbf{z},\hspace{0.1667em}\boldsymbol{\gamma }\sim \mathcal{T}(\hat{z}({\mathbf{x}^{\ast }}),{\hat{\sigma }^{2}}{K^{\ast \ast }},N-q)\hspace{0.1667em},\]
with $N-q$ degrees of freedom, where
(2.4)
\[\begin{aligned}{}\hat{z}({\mathbf{x}^{\ast }})=& \mathbf{h}({\mathbf{x}^{\ast }})\hat{\boldsymbol{\beta }}+{\mathbf{r}^{T}}({\mathbf{x}^{\ast }}){\mathbf{R}^{-1}}\left(\mathbf{z}-\mathbf{H}\hat{\boldsymbol{\beta }}\right),\end{aligned}\]
(2.5)
\[\begin{aligned}{}{\hat{\sigma }^{2}}=& {(N-q)^{-1}}{\left(\mathbf{z}-\mathbf{H}\hat{\boldsymbol{\beta }}\right)^{T}}{\mathbf{R}^{-1}}\left(\mathbf{z}-\mathbf{H}\hat{\boldsymbol{\beta }}\right),\end{aligned}\]
(2.6)
\[\begin{aligned}{}{K^{\ast \ast }}=& K({\mathbf{x}^{\ast }},{\mathbf{x}^{\ast }})-{\mathbf{r}^{T}}({\mathbf{x}^{\ast }}){\mathbf{R}^{-1}}\mathbf{r}({\mathbf{x}^{\ast }})+{\mathbf{h}^{\ast }}{({\mathbf{x}^{\ast }})^{T}}\\ {} & \times {\left({\mathbf{H}^{T}}{\mathbf{R}^{-1}}\mathbf{H}\right)^{-1}}{\mathbf{h}^{\ast }}({\mathbf{x}^{\ast }}),\end{aligned}\]
with $\hat{\boldsymbol{\beta }}={\left({\mathbf{H}^{T}}{\mathbf{R}^{-1}}\hspace{2.5pt}\mathbf{H}\right)^{-1}}{\mathbf{H}^{T}}{\mathbf{R}^{-1}}\mathbf{z}$ being the generalized least squares estimator of $\boldsymbol{\beta }$, $\mathbf{r}({\mathbf{x}^{\ast }})=(K({\mathbf{x}^{\ast }},{\mathbf{x}_{1}}),\dots ,K{({\mathbf{x}^{\ast }},{\mathbf{x}_{N}}))^{T}}$, and ${\mathbf{h}^{\ast }}({\mathbf{x}^{\ast }})={\mathbf{h}^{T}}({\mathbf{x}^{\ast }})-{\mathbf{H}^{T}}{\mathbf{R}^{-1}}\mathbf{r}({\mathbf{x}^{\ast }})$.

Direct computation of the likelihood requires $\mathcal{O}({N^{3}})$ operations, due to the Cholesky decomposition of the covariance matrix for matrix inversion and the computation of its determinant. Thus a posterior sampling algorithm, such as a Markov chain Monte Carlo (MCMC) algorithm, can be slow, as it requires a large number of posterior samples. Plug-in estimators, such as the maximum likelihood estimator (MLE), are often used to estimate the range parameters $\boldsymbol{\gamma }$ in the covariance. In [20], the maximum marginal posterior estimator (MMPE) with robust parameterizations was proposed to overcome the instability of the MLE. The MLE and MMPE of the parameters in a GP emulator, with both the product kernel and the isotropic kernel, are implemented in the RobustGaSP package [19].
In some applications, we may not directly observe the latent function but a noisy realization:

(2.7)
\[ y(\mathbf{x})=z(\mathbf{x})+\epsilon (\mathbf{x}),\]
where $z(\cdot )$ is modeled as a zero-mean GP with covariance ${\sigma ^{2}}K(\cdot ,\cdot )$, and $\epsilon (\mathbf{x})\sim \mathcal{N}(0,{\sigma _{0}^{2}})$ is an independent Gaussian noise. This model is typically referred to as Gaussian process regression (GPR) [44], which is suitable for scenarios containing noisy observations, such as experimental or field observations, numerical solutions of differential equations with non-negligible error, and stochastic simulations. Denote the noisy observations $\mathbf{y}={(y({\mathbf{x}_{1}}),y({\mathbf{x}_{2}}),\dots ,y({\mathbf{x}_{N}}))^{T}}$ at the design input set $\{{\mathbf{x}_{1}},{\mathbf{x}_{2}},\dots ,{\mathbf{x}_{N}}\}$ and the nugget parameter $\eta ={\sigma _{0}^{2}}/{\sigma ^{2}}$. Both the range and nugget parameters in GPR can be estimated by plug-in estimators [19]. The predictive distribution of $z({\mathbf{x}^{\ast }})$ at any input ${\mathbf{x}^{\ast }}$ can be obtained by replacing R with $\mathbf{\tilde{R}}=\mathbf{R}+\eta {\mathbf{I}_{N}}$ in Equation (2.3).
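For concreteness, the following Python sketch (a minimal illustration we provide, not the RobustGaSP implementation; the constant mean basis $\mathbf{h}(\mathbf{x})=1$ and fixed range and nugget parameters are simplifying assumptions) computes the predictive mean and variance in (2.4)–(2.6) by direct $\mathcal{O}({N^{3}})$ matrix operations, with R replaced by $\mathbf{R}+\eta \mathbf{I}$ as described above.

```python
import numpy as np

def matern52(d, gamma):
    """Matern correlation with roughness 5/2, the closed form of Equation (2.2)."""
    a = np.sqrt(5.0) * d / gamma
    return (1.0 + a + a**2 / 3.0) * np.exp(-a)

def product_kernel(X1, X2, gammas):
    """Product Matern-5/2 kernel with a separate range parameter per coordinate."""
    K = np.ones((X1.shape[0], X2.shape[0]))
    for l in range(X1.shape[1]):
        d = np.abs(X1[:, l][:, None] - X2[:, l][None, :])
        K *= matern52(d, gammas[l])
    return K

def gp_predict(X, y, X_star, gammas, eta):
    """Predictive mean and variance of the latent function, following (2.4)-(2.6)
    with R replaced by R + eta * I (constant mean basis h(x) = 1)."""
    N = X.shape[0]
    H = np.ones((N, 1))
    R = product_kernel(X, X, gammas) + eta * np.eye(N)
    Ri = np.linalg.inv(R)                          # direct O(N^3) computation
    A = H.T @ Ri @ H
    beta_hat = np.linalg.solve(A, H.T @ Ri @ y)    # generalized least squares estimate
    resid = y - H @ beta_hat
    sigma2_hat = (resid @ Ri @ resid) / (N - H.shape[1])
    r = product_kernel(X, X_star, gammas)          # N x N_star cross-correlations
    mean = beta_hat[0] + r.T @ Ri @ resid
    h_star = 1.0 - H.T @ Ri @ r                    # h(x*) - H^T R^{-1} r(x*)
    K_star = (1.0 - np.sum(r * (Ri @ r), axis=0)
              + np.sum(h_star * np.linalg.solve(A, h_star), axis=0))
    return mean, sigma2_hat * K_star

# Example usage on a synthetic 2D function
rng = np.random.default_rng(0)
X = rng.uniform(size=(30, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2
X_star = rng.uniform(size=(5, 2))
pred_mean, pred_var = gp_predict(X, y, X_star, gammas=[0.3, 0.3], eta=1e-6)
```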
Constructing a GP emulator to approximate a computer simulation typically starts with a “space-filling” design, such as Latin hypercube sampling (LHS), to fill the input space. Numerical solutions of the computer model are then obtained at these design points, and the set ${\{({\mathbf{x}_{i}},{y_{i}})\}_{i=1}^{N}}$ is used for training a GP emulator. For any new input ${\mathbf{x}^{\ast }}$, the predictive mean in (2.3) is often used for prediction, and the uncertainty can be quantified through the predictive distribution. In Figure 1, we plot the predictive mean of a GP emulator approximating the Branin function [56] with N training inputs sampled by LHS, using the default setting of the RobustGaSP package [19]. When the number of observations increases from $N=12$ (left panel) to $N=24$ (right panel), the predictive error becomes smaller.
Figure 1
Predictions by the GP emulator of a function on 2D inputs with $N=12$ and $N=24$ observations (black circles) are shown in the left and right panels, respectively.
The computational complexity of GP models increases at the order of $\mathcal{O}({N^{3}})$, which prohibits applications to emulating complex computer simulations when a relatively large number of simulation runs is required. In the next section, we introduce the state space representation of a GP with Matérn covariance and one-dimensional input, where the computational cost scales as $\mathcal{O}(N)$ without approximation. This technique is also used in the problems with multi-dimensional inputs discussed in Section 4.
3 Marginalization in Kalman Filter
3.1 State Space Representation of GP with the Matérn Kernel
Suppose we model the observations by Equation (2.7), where the latent process $z(\cdot )$ is assumed to follow a GP with 1D input. For simplicity, here we assume a zero mean ($\mu =0$); a mean function can be easily included in the analysis. It has been realized that a GP on a 1D input space with a Matérn covariance having a half-integer roughness parameter can be written as a stochastic differential equation (SDE) [63, 22], which reduces the cost of computing the likelihood and making predictions from $\mathcal{O}({N^{3}})$ to $\mathcal{O}(N)$ operations with the use of the Kalman filter. Here we first review the SDE representation, and then we discuss the marginalization of latent variables in the Kalman filter algorithm for scalable computation.
When the roughness parameter is $\nu =5/2$, for instance, the Matérn kernel has the expression below,

(3.1)
\[ K(d)=\left(1+\frac{\sqrt{5}d}{\gamma }+\frac{5{d^{2}}}{3{\gamma ^{2}}}\right)\exp \left(-\frac{\sqrt{5}d}{\gamma }\right),\hspace{0.1667em}\]
where $d=|{x_{a}}-{x_{b}}|$ is the distance between any ${x_{a}},{x_{b}}\in \mathbb{R}$ and γ is a range parameter typically estimated from data. The output and two derivatives of the GP with the Matérn kernel in (3.1) can be written as the SDE below,
(3.2)
\[ \frac{d\boldsymbol{\theta }(x)}{dx}=\mathbf{J}\boldsymbol{\theta }(x)+\mathbf{L}\epsilon (x),\]
or in the matrix form,
\[\begin{aligned}{}\frac{d}{dx}\left(\begin{array}{c}z(x)\\ {} {z^{(1)}}(x)\\ {} {z^{(2)}}(x)\end{array}\right)=& \left(\begin{array}{c@{\hskip10.0pt}c@{\hskip10.0pt}c}0& 1& 0\\ {} 0& 0& 1\\ {} -{\lambda ^{3}}& -3{\lambda ^{2}}& -3\lambda \end{array}\right)\left(\begin{array}{c}z(x)\\ {} {z^{(1)}}(x)\\ {} {z^{(2)}}(x)\end{array}\right)\\ {} & +\left(\begin{array}{c}0\\ {} 0\\ {} 1\end{array}\right)\epsilon (x),\end{aligned}\]
where $\epsilon (x)\sim N(0,{\sigma ^{2}})$, with $\lambda =\frac{\sqrt{2\nu }}{\gamma }=\frac{\sqrt{5}}{\gamma }$, and ${z^{(l)}}(\cdot )$ is the lth derivative of the process $z(\cdot )$. Denote $c=\frac{16}{3}{\sigma ^{2}}{\lambda ^{5}}$ and $\mathbf{F}=(1,0,0)$. Assume the 1D input is ordered, i.e. ${x_{1}}\le {x_{2}}\le \cdots \le {x_{N}}$. The solution of the SDE in (3.2) can be expressed as a continuous-time dynamic linear model [61],

(3.3)
\[ \begin{aligned}{}y({x_{i}})& =\mathbf{F}\boldsymbol{\theta }({x_{i}})+\epsilon ,\\ {} \boldsymbol{\theta }({x_{i}})& =\mathbf{G}({x_{i}})\boldsymbol{\theta }({x_{i-1}})+\mathbf{w}({x_{i}}),\hspace{0.1667em}\end{aligned}\]
where $\mathbf{w}({x_{i}})\sim \mathcal{MN}(0,\mathbf{W}({x_{i}}))$ for $i=2,\dots ,N$, and the initial state follows the stationary distribution $\boldsymbol{\theta }({x_{1}})\sim \mathcal{MN}(\mathbf{0},\mathbf{W}({x_{1}}))$, with $\mathbf{W}({x_{1}})={\textstyle\int _{0}^{\infty }}{e^{\mathbf{J}t}}\mathbf{L}c{\mathbf{L}^{T}}{e^{{\mathbf{J}^{T}}t}}dt$. Here $\mathbf{G}({x_{i}})={e^{\mathbf{J}({x_{i}}-{x_{i-1}})}}$ and $\mathbf{W}({x_{i}})={\textstyle\int _{0}^{{x_{i}}-{x_{i-1}}}}{e^{\mathbf{J}t}}\mathbf{L}c{\mathbf{L}^{T}}{e^{{\mathbf{J}^{T}}t}}dt$ for $i=2,\dots ,N$; marginally, each $\boldsymbol{\theta }({x_{i}})$ follows the stationary distribution $\mathcal{MN}(\mathbf{0},\mathbf{W}({x_{1}}))$. Both $\mathbf{G}({x_{i}})$ and $\mathbf{W}({x_{i}})$ have closed-form expressions given in Appendix A.1. The joint distribution of the states follows ${\left({\boldsymbol{\theta }^{T}}({x_{1}}),\dots ,{\boldsymbol{\theta }^{T}}({x_{N}})\right)^{T}}\sim \mathcal{MN}(\mathbf{0},{\boldsymbol{\Lambda }^{-1}})$, where the inverse covariance Λ is a block tri-diagonal matrix discussed in Appendix A.1.

3.2 Kalman Filter as a Scalable Marginalization Technique
For the dynamic linear model in (3.3), the Kalman filter and Rauch–Tung–Striebel (RTS) smoother can be used as an exact and scalable approach to compute the likelihood and predictive distributions. The Kalman filter and RTS smoother are sometimes called the forward filtering and backward smoothing/sampling algorithms, widely used in dynamic linear models of time series. We refer the readers to [61, 41] for a discussion of dynamic linear models.
Write $\mathbf{G}({x_{i}})={\mathbf{G}_{i}}$, $\mathbf{W}({x_{i}})={\mathbf{W}_{i}}$, $\boldsymbol{\theta }({x_{i}})={\boldsymbol{\theta }_{i}}$ and $y({x_{i}})={y_{i}}$ for $i=1,\dots ,N$. In Lemma 1, we summarize the Kalman filter and RTS smoother for the dynamic linear model in (3.3). Compared with the $\mathcal{O}({N^{3}})$ computational operations and $\mathcal{O}({N^{2}})$ storage cost of direct GP computation, the outcomes of the Kalman filter and RTS smoother can be used to compute the likelihood and predictive distribution with $\mathcal{O}(N)$ operations and $\mathcal{O}(N)$ storage, as summarized in Lemmas 1 and 2. All the distributions in Lemma 1 and Lemma 2 are conditional on the parameters $(\gamma ,{\sigma ^{2}},{\sigma _{0}^{2}})$, which are omitted for notational simplicity.
Lemma 1 (Kalman Filter and RTS Smoother [26, 45]).
1. (Kalman Filter). Let ${\boldsymbol{\theta }_{i-1}}|{\mathbf{y}_{1:i-1}}\sim \mathcal{MN}({\mathbf{m}_{i-1}},{\mathbf{C}_{i-1}})$. For $i=2,\dots ,N$, iteratively we have

(i) the one-step-ahead predictive distribution of ${\boldsymbol{\theta }_{i}}$ given ${\mathbf{y}_{1:i-1}}$,
(3.4)
\[ {\boldsymbol{\theta }_{i}}|{\mathbf{y}_{1:i-1}}\sim \mathcal{MN}({\mathbf{b}_{i}},{\mathbf{B}_{i}}),\]
where ${\mathbf{b}_{i}}={\mathbf{G}_{i}}{\mathbf{m}_{i-1}}$ and ${\mathbf{B}_{i}}={\mathbf{G}_{i}}{\mathbf{C}_{i-1}}{\mathbf{G}_{i}^{T}}+{\mathbf{W}_{i}}$;

(ii) the one-step-ahead predictive distribution of ${y_{i}}$ given ${\mathbf{y}_{1:i-1}}$,
(3.5)
\[ {y_{i}}|{\mathbf{y}_{1:i-1}}\sim \mathcal{N}({f_{i}},{Q_{i}}),\]
where ${f_{i}}=\mathbf{F}{\mathbf{b}_{i}}$ and ${Q_{i}}=\mathbf{F}{\mathbf{B}_{i}}{\mathbf{F}^{T}}+{\sigma _{0}^{2}}$;

(iii) the filtering distribution of ${\boldsymbol{\theta }_{i}}$ given ${\mathbf{y}_{1:i}}$,
(3.6)
\[ {\boldsymbol{\theta }_{i}}|{\mathbf{y}_{1:i}}\sim \mathcal{MN}({\mathbf{m}_{i}},{\mathbf{C}_{i}}),\]
where ${\mathbf{m}_{i}}={\mathbf{b}_{i}}+{\mathbf{B}_{i}}{\mathbf{F}^{T}}{Q_{i}^{-1}}({y_{i}}-{f_{i}})$ and ${\mathbf{C}_{i}}={\mathbf{B}_{i}}-{\mathbf{B}_{i}}{\mathbf{F}^{T}}{Q_{i}^{-1}}\mathbf{F}{\mathbf{B}_{i}}$.

2. (RTS Smoother). Denote ${\boldsymbol{\theta }_{i+1}}|{\mathbf{y}_{1:N}}\sim \mathcal{MN}({\mathbf{s}_{i+1}},{\mathbf{S}_{i+1}})$; then recursively for $i=N-1,\dots ,1$,
(3.7)
\[ {\boldsymbol{\theta }_{i}}|{\mathbf{y}_{1:N}}\sim \mathcal{MN}({\mathbf{s}_{i}},{\mathbf{S}_{i}}),\]
where ${\mathbf{s}_{i}}={\mathbf{m}_{i}}+{\mathbf{C}_{i}}{\mathbf{G}_{i+1}^{T}}{\mathbf{B}_{i+1}^{-1}}({\mathbf{s}_{i+1}}-{\mathbf{b}_{i+1}})$ and ${\mathbf{S}_{i}}={\mathbf{C}_{i}}-{\mathbf{C}_{i}}{\mathbf{G}_{i+1}^{T}}{\mathbf{B}_{i+1}^{-1}}({\mathbf{B}_{i+1}}-{\mathbf{S}_{i+1}}){\mathbf{B}_{i+1}^{-1}}{\mathbf{G}_{i+1}}{\mathbf{C}_{i}}$.

Lemma 2 (Likelihood and predictive distribution).
1. (Likelihood). The likelihood follows
\[\begin{aligned}{}p({\mathbf{y}_{1:N}}\mid {\sigma ^{2}},{\sigma _{0}^{2}},\gamma )=& {\prod \limits_{i=1}^{N}}{(2\pi {Q_{i}})^{-\frac{1}{2}}}\exp \left\{-{\sum \limits_{i=1}^{N}}\frac{{({y_{i}}-{f_{i}})^{2}}}{2{Q_{i}}}\right\},\end{aligned}\]
where ${f_{i}}$ and ${Q_{i}}$ are given by the Kalman filter in Lemma 1. The likelihood can be used to obtain the MLE of the parameters $({\sigma ^{2}},{\sigma _{0}^{2}},\gamma )$.
2. (Predictive distribution).
(i) By the last step of the Kalman filter, one has ${\boldsymbol{\theta }_{N}}|{\mathbf{y}_{1:N}}$; applying the RTS smoother recursively, the predictive distribution of ${\boldsymbol{\theta }_{i}}$ for $i=N-1,\dots ,1$ follows the smoothing distribution in (3.7).
(ii) For any ${x^{\ast }}$ (W.l.o.g. let ${x_{i}}\lt {x^{\ast }}\lt {x_{i+1}}$)\[ \boldsymbol{\theta }({x^{\ast }})\mid {\mathbf{y}_{1:N}}\sim \mathcal{MN}\left(\hat{\boldsymbol{\theta }}({x^{\ast }}),\hat{\boldsymbol{\Sigma }}({x^{\ast }})\right)\]where $\hat{\boldsymbol{\theta }}({x^{\ast }})={\mathbf{G}_{i}^{\ast }}{\mathbf{s}_{i}}+{\mathbf{W}_{i}^{\ast }}{({\mathbf{G}_{i+1}^{\ast }})^{T}}{({\tilde{\mathbf{W}}_{i+1}^{\ast }})^{-1}}({\mathbf{s}_{i+1}}-{\mathbf{G}_{i+1}^{\ast }}{\mathbf{G}_{i}^{\ast }}{\mathbf{s}_{i}})$ and $\hat{\boldsymbol{\Sigma }}({x^{\ast }})={({({\mathbf{W}_{i}^{\ast }})^{-1}}+{({\mathbf{G}_{i+1}^{\ast }})^{T}}{({\mathbf{W}_{i+1}^{\ast }})^{-1}}{\mathbf{G}_{i+1}^{\ast }})^{-1}}$ with terms denoted with ‘*’ given in the Appendix A.1.
Although we introduce the Matérn kernel with $\nu =5/2$ as an example, the likelihood and predictive distribution of GPs with a Matérn kernel of any small half-integer roughness parameter can be computed efficiently, for both equally spaced and unequally spaced 1D inputs. For the Matérn kernel with a very large roughness parameter, the dimension of the latent states becomes large, which increases the computational cost. In practice, the Matérn kernel with a relatively large roughness parameter (e.g. $\nu =5/2$) is found to be accurate for estimating a smooth latent function in computer experiments [20, 2]. For this reason, the Matérn kernel with $\nu =5/2$ is the default choice of the kernel function in some packages of GP emulators [47, 19].
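As a concrete illustration of Lemmas 1 and 2, the following Python sketch (our own minimal implementation, with illustrative function and variable names) computes the log-likelihood of the model (3.3) with the Matérn kernel in (3.1) in $\mathcal{O}(N)$ operations. The matrices ${\mathbf{G}_{i}}$ and ${\mathbf{W}_{i}}$, whose closed-form expressions are given in Appendix A.1, are obtained here numerically from the matrix exponential and the stationary Lyapunov equation.

```python
import numpy as np
from scipy.linalg import expm, solve_continuous_lyapunov

def matern52_ss_loglik(x, y, gamma, sigma2, sigma0_2):
    """O(N) log-likelihood of model (3.3): sorted 1D inputs x, noisy observations y,
    range gamma, GP variance sigma2, noise variance sigma0_2."""
    lam = np.sqrt(5.0) / gamma
    J = np.array([[0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [-lam**3, -3 * lam**2, -3 * lam]])
    L = np.array([[0.0], [0.0], [1.0]])
    c = 16.0 / 3.0 * sigma2 * lam**5
    F = np.array([[1.0, 0.0, 0.0]])
    P_inf = solve_continuous_lyapunov(J, -c * (L @ L.T))   # stationary covariance W(x_1)

    m, C, loglik = np.zeros(3), P_inf.copy(), 0.0
    for i in range(len(x)):
        if i == 0:
            b, B = np.zeros(3), P_inf.copy()               # prior of the first state
        else:
            G = expm(J * (x[i] - x[i - 1]))                # G_i
            W = P_inf - G @ P_inf @ G.T                    # W_i for a stationary process
            b, B = G @ m, G @ C @ G.T + W                  # one-step-ahead moments (3.4)
        f = (F @ b).item()                                 # f_i
        Q = (F @ B @ F.T).item() + sigma0_2                # Q_i
        loglik += -0.5 * (np.log(2 * np.pi * Q) + (y[i] - f) ** 2 / Q)
        K = (B @ F.T) / Q                                  # Kalman gain
        m = b + (K * (y[i] - f)).ravel()                   # filtering mean m_i (3.6)
        C = B - K @ F @ B                                  # filtering covariance C_i
    return loglik

# Example: 500 noisy observations of a 1D function
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 500))
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(500)
print(matern52_ss_loglik(x, y, gamma=0.3, sigma2=1.0, sigma0_2=0.01))
```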
For a model containing latent variables, one may instead proceed with two usual approaches: (i) sampling the latent variables from their posterior distribution, for instance by MCMC, or (ii) estimating the latent variables by an optimization algorithm and plugging the point estimates into the model.

For approach (i), the MCMC algorithm is usually much slower than the Kalman filter, as the number of latent states is large, requiring a large number of posterior samples [17]. On the other hand, the prior correlation between states may not be taken into account directly in approach (ii), making the estimation less efficient than the Kalman filter if the data contain correlation across latent states. In comparison, the latent states of the dynamic linear model in (3.3) are iteratively marginalized out in the Kalman filter, and a closed-form expression is derived in each step, which only takes $\mathcal{O}(N)$ operations and storage cost, with N being the number of observations.
In practice, when a sensible probability model or prior of the latent variables is available, the principle is to integrate out the latent variables when making predictions. Posterior sampling and optimization algorithms, on the other hand, can be very useful for approximating the marginal likelihood when closed-form expressions are not available. As an example, in Section 4 we introduce applications that combine the sparse covariance structure with conjugate gradient optimization to estimate particle interaction kernels and to forecast particle trajectories, integrating both marginalization and optimization to tackle a computationally challenging scenario.
Figure 2
Comparison between the direct computation of GP, and that by the Kalman filter (KF) and RTS smoother, both having the Matérn kernel in (3.1). The left panel shows the computational cost of computing the predictive mean over different numbers of observations. When $N=5000$, direct computation takes around 20 seconds, whereas the computation by KF and RTS smoother takes 0.029 seconds. The predictive mean computed in both ways is plotted in the right panel when $N=1000$, where the root mean squared difference between the two approaches is $5.98\times {10^{-12}}$.
In Figure 2, we compare the cost of computing the predictive mean for a nonlinear function with 1D inputs [16]. The input is uniformly sampled from $[0.5,0.25]$, and an independent Gaussian white noise with a standard deviation of 0.1 is added in simulating the observations. We compare two ways of computing the predictive mean. The first approach is the direct computation of the predictive mean by Equation (2.4). The second approach uses the likelihood function and predictive distribution from Lemma 2, based on the Kalman filter and RTS smoother. The range and nugget parameters are fixed at 0.5 and ${10^{-4}}$, respectively, for demonstration purposes. The computational time of this simulated experiment is shown in the left panel of Figure 2. The approach based on the Kalman filter and RTS smoother is much faster, as computing the likelihood and making predictions by the Kalman filter and RTS smoother only require $\mathcal{O}(N)$ operations, whereas the direct computation costs $\mathcal{O}({N^{3}})$ operations. The right panel gives the predictive mean, latent truth, and observations when $N=1000$. The difference between the two approaches is very small, as both methods are exact.
3.3 Marginalization of Correlated Matrix Observations with Multi-dimensional Inputs
The Kalman filter is widely applied in signal processing, system control, and time series modeling. Here we introduce a few recent studies that apply the Kalman filter to GP models with Matérn covariance to model spatial, spatio-temporal, and functional observations.
Let $\mathbf{y}(\mathbf{x})={({y_{1}}(\mathbf{x}),\dots ,{y_{{n_{1}}}}(\mathbf{x}))^{T}}$ be an ${n_{1}}$-dimensional real-valued output vector at a p-dimensional input vector x. For simplicity, assume the mean is zero. Consider the latent factor model:
(3.9)
\[ \mathbf{y}(\mathbf{x})=\mathbf{A}\mathbf{z}(\mathbf{x})+\boldsymbol{\epsilon }(\mathbf{x}),\]
where $\mathbf{A}=[{\mathbf{a}_{1}},\dots ,{\mathbf{a}_{d}}]$ is an ${n_{1}}\times d$ factor loading matrix and $\mathbf{z}(\mathbf{x})={({z_{1}}(\mathbf{x}),\dots ,{z_{d}}(\mathbf{x}))^{T}}$ is a vector of d factor processes, with $d\le {n_{1}}$. The noise process follows $\boldsymbol{\epsilon }(\mathbf{x})\sim \mathcal{N}(\mathbf{0},\hspace{0.1667em}{\sigma _{0}^{2}}{\mathbf{I}_{{n_{1}}}})$. Each factor is modeled by a zero-mean Gaussian process (GP), meaning that ${\mathbf{Z}_{l}}=({z_{l}}({\mathbf{x}_{1}}),\dots ,{z_{l}}({\mathbf{x}_{{n_{2}}}}))$ follows a multivariate normal distribution ${\mathbf{Z}_{l}^{T}}\sim \mathcal{MN}(\mathbf{0},{\boldsymbol{\Sigma }_{l}})$, where the $(i,\hspace{0.1667em}j)$ entry of ${\boldsymbol{\Sigma }_{l}}$ is parameterized by a covariance function ${\sigma _{l}^{2}}{K_{l}}({\mathbf{x}_{i}},{\mathbf{x}_{j}})$, for $l=1,\dots ,d$. The model (3.9) is often known as the semiparametric latent factor model in the machine learning community [53], and it belongs to the class of linear models of coregionalization [3]. It has a wide range of applications in modeling multivariate spatially correlated data and functional observations from computer experiments [14, 25, 40].

We make the following two assumptions for model (3.9).
Assumption 1.
The latent processes ${\mathbf{z}_{i}}(\cdot )$ and ${\mathbf{z}_{j}}(\cdot )$ are independent a priori, for any $i\ne j$.
Assumption 2.
The factor loadings are orthogonal, i.e. ${\mathbf{A}^{T}}\mathbf{A}={\mathbf{I}_{d}}$.
The first assumption is typically made for modeling multivariate spatially correlated data or computer experiments [3, 25]. For the second assumption, note that the model in (3.9) is unchanged if we replace $(\mathbf{A},\mathbf{z}(\mathbf{x}))$ by $(\mathbf{A}\mathbf{E},{\mathbf{E}^{-1}}\mathbf{z}(\mathbf{x}))$ for an invertible matrix E, meaning that only the linear subspace spanned by A is identifiable if no further constraint is imposed. Furthermore, as the variance ${\sigma _{i}^{2}}$ of each latent process is estimated from the data, imposing the unit-norm constraint on each column of A reduces identifiability issues. The second assumption was also made in other recent studies [31, 30].
Given Assumption 1 and Assumption 2, we review recent results that alleviate the computational cost. Suppose the observations form an ${n_{1}}\times {n_{2}}$ matrix $\mathbf{Y}=[\mathbf{y}({\mathbf{x}_{1}}),\dots ,\mathbf{y}({\mathbf{x}_{{n_{2}}}})]$, containing $N={n_{1}}\times {n_{2}}$ observations in total.
Lemma 3 (Posterior independence and orthogonal projection [17]).
1. (Posterior Independence). For any $l\ne m$, ${\mathbf{Z}_{l}^{T}}$ and ${\mathbf{Z}_{m}^{T}}$ are independent conditional on Y, and for each $l=1,\dots ,d$,
\[ {\mathbf{Z}_{l}^{T}}|\mathbf{Y}\sim \mathcal{MN}({\boldsymbol{\mu }_{{Z_{l}}}},{\boldsymbol{\Sigma }_{{z_{l}}}}),\]
where ${\boldsymbol{\mu }_{{z_{l}}}}={\boldsymbol{\Sigma }_{l}}{\tilde{\boldsymbol{\Sigma }}_{l}^{-1}}{\mathbf{\tilde{y}}_{l}}$, ${\mathbf{\tilde{y}}_{l}}={\mathbf{Y}^{T}}{\mathbf{a}_{l}}$ and ${\boldsymbol{\Sigma }_{{Z_{l}}}}={\boldsymbol{\Sigma }_{l}}-{\boldsymbol{\Sigma }_{l}}{\tilde{\boldsymbol{\Sigma }}_{l}^{-1}}{\boldsymbol{\Sigma }_{l}}$ with ${\tilde{\boldsymbol{\Sigma }}_{l}}={\boldsymbol{\Sigma }_{l}}+{\sigma _{0}^{2}}{\mathbf{I}_{{n_{2}}}}$.
2. (Orthogonal Projection). After integrating out $z(\cdot )$, the marginal likelihood is a product of multivariate normal densities of projected observations:
(3.10)
\[ p(\mathbf{Y})={\prod \limits_{l=1}^{d}}\mathcal{PN}({\tilde{\mathbf{y}}_{l}};\mathbf{0},{\tilde{\boldsymbol{\Sigma }}_{l}}){\prod \limits_{l=d+1}^{{n_{1}}}}\mathcal{PN}({\tilde{\mathbf{y}}_{c,l}};\mathbf{0},{\sigma _{0}^{2}}{\mathbf{I}_{{n_{2}}}}),\]
where ${\tilde{\mathbf{y}}_{c,l}}={\mathbf{Y}^{T}}{\mathbf{a}_{c,l}}$, with ${\mathbf{a}_{c,l}}$ being the lth column of ${\mathbf{A}_{c}}$, the orthogonal complement of A, and $\mathcal{PN}$ denoting the density of a multivariate normal distribution.

The properties in Lemma 3 lead to computationally scalable expressions of the maximum marginal likelihood estimator (MMLE) of the factor loadings.
Theorem 1 (Generalized probabilistic principal component analysis [18]).
Assume ${\mathbf{A}^{T}}\mathbf{A}={\mathbf{I}_{d}}$. After marginalizing out ${\mathbf{Z}_{l}^{T}}\sim \mathcal{MN}(\mathbf{0},{\boldsymbol{\Sigma }_{l}})$ for $l=1,2,\dots ,d$, we have the results below.
• If ${\boldsymbol{\Sigma }_{1}}=\cdots ={\boldsymbol{\Sigma }_{d}}=\boldsymbol{\Sigma }$, the marginal likelihood is maximized at
(3.11)
\[ \hat{\mathbf{A}}=\mathbf{U}\mathbf{S},\]
where U is an ${n_{1}}\times d$ matrix of the first d principal eigenvectors of $\mathbf{G}=\mathbf{Y}{({\sigma _{0}^{2}}{\boldsymbol{\Sigma }^{-1}}+{\mathbf{I}_{{n_{2}}}})^{-1}}{\mathbf{Y}^{T}}$ and S is a $d\times d$ orthogonal rotation matrix;

• If the covariances of the factor processes differ, denote ${\mathbf{G}_{l}}=\mathbf{Y}{({\sigma _{0}^{2}}{\boldsymbol{\Sigma }_{l}^{-1}}+{\mathbf{I}_{{n_{2}}}})^{-1}}{\mathbf{Y}^{T}}$; the MMLE of the factor loading matrix is
(3.12)
\[ \hat{\mathbf{A}}=\underset{\mathbf{A}:\hspace{0.1667em}{\mathbf{A}^{T}}\mathbf{A}={\mathbf{I}_{d}}}{\operatorname{argmax}}{\sum \limits_{l=1}^{d}}{\mathbf{a}_{l}^{T}}{\mathbf{G}_{l}}{\mathbf{a}_{l}}.\]
The estimator $\hat{\mathbf{A}}$ in Theorem 1 is called the generalized probabilistic principal component analysis (GPPCA) estimator. An optimization algorithm that preserves the orthogonal constraint in (3.12) is available in [60].
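A minimal Python sketch of the shared-covariance case of Theorem 1 is given below (our own illustration with assumed function and variable names; the factor variance is absorbed into Σ for simplicity). It computes $\mathbf{G}=\mathbf{Y}({\sigma _{0}^{2}}{\boldsymbol{\Sigma }^{-1}}+\mathbf{I})^{-1}{\mathbf{Y}^{T}}$, takes its first d principal eigenvectors as $\hat{\mathbf{A}}$, and then estimates the latent factors by their posterior mean from Lemma 3.

```python
import numpy as np

def matern52(d, gamma):
    a = np.sqrt(5.0) * d / gamma
    return (1.0 + a + a**2 / 3.0) * np.exp(-a)

def gppca_shared(Y, x, d, gamma, sigma0_2):
    """Shared-covariance GPPCA: Y is n1 x n2, x holds the n2 inputs (e.g. time points),
    d is the number of factors, gamma the kernel range, sigma0_2 the noise variance."""
    n1, n2 = Y.shape
    Sigma = matern52(np.abs(x[:, None] - x[None, :]), gamma)   # factor covariance
    G = Y @ np.linalg.inv(sigma0_2 * np.linalg.inv(Sigma) + np.eye(n2)) @ Y.T
    eigval, eigvec = np.linalg.eigh(G)
    A_hat = eigvec[:, -d:]                                     # first d principal eigenvectors
    # Posterior mean of each factor given Y, following Lemma 3
    Z_hat = (Sigma @ np.linalg.solve(Sigma + sigma0_2 * np.eye(n2), Y.T @ A_hat)).T
    return A_hat, Z_hat

# Example with synthetic data of 16 outputs observed at 200 time points
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)
A_true = np.linalg.qr(rng.standard_normal((16, 2)))[0]
S = matern52(np.abs(x[:, None] - x[None, :]), 0.3) + 1e-8 * np.eye(200)
Z_true = np.linalg.cholesky(S) @ rng.standard_normal((200, 2))
Y = A_true @ Z_true.T + 0.1 * rng.standard_normal((16, 200))
A_hat, Z_hat = gppca_shared(Y, x, d=2, gamma=0.3, sigma0_2=0.01)
```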
In [58], the latent factors are assumed to follow independent standard normal distributions, and the authors derived the MMLE of the factor loading matrix A, which was termed probabilistic principal component analysis (PPCA). GPPCA extends PPCA to correlated latent factors modeled by GPs, which incorporates the prior correlation of the outputs across inputs, and the latent factor processes are marginalized out when estimating the factor loading matrix and other parameters. When the input is 1D and a Matérn covariance is used for modeling the latent factors, the computational order of GPPCA is $\mathcal{O}(Nd)$, the same as that of PCA. For correlated data, such as spatio-temporal observations and multivariate functional data, GPPCA provides a flexible and scalable approach to estimate the factor loadings by marginalizing out the latent factors [18].
Spatial and spatio-temporal models with a separable covariance can be written as a special case of the model in (3.9). For instance, suppose $d={n_{1}}$ and the ${n_{1}}\times {n_{2}}$ latent factor matrix follows $\mathbf{Z}\sim \mathcal{MN}(0,\hspace{0.1667em}{\sigma ^{2}}{\mathbf{R}_{1}}\otimes {\mathbf{R}_{2}})$, where ${\mathbf{R}_{1}}$ and ${\mathbf{R}_{2}}$ are ${n_{1}}\times {n_{1}}$ and ${n_{2}}\times {n_{2}}$ subcovariances, respectively. Denote the eigendecomposition ${\mathbf{R}_{1}}={\mathbf{V}_{1}}{\boldsymbol{\Lambda }_{1}}{\mathbf{V}_{1}^{T}}$, with ${\boldsymbol{\Lambda }_{1}}$ a diagonal matrix of eigenvalues ${\lambda _{l}}$, for $l=1,\dots ,{n_{1}}$. Then this separable model can be written in the form of (3.9), with $\mathbf{A}={\mathbf{V}_{1}}$ and ${\boldsymbol{\Sigma }_{l}}={\sigma ^{2}}{\lambda _{l}}{\mathbf{R}_{2}}$. The connection suggests that the factor loading matrix can be specified as the eigenvector matrix of a covariance matrix parameterized by a kernel function. This approach is studied in [17] for modeling incomplete lattices with irregular missing patterns, and the Kalman filter is integrated to accelerate computation for massive spatial and spatio-temporal observations.
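The stated connection can be checked numerically. The short Python sketch below (illustrative sizes and names) verifies that the covariance implied by the separable model coincides with that of the factor model (3.9) with $\mathbf{A}={\mathbf{V}_{1}}$ and ${\boldsymbol{\Sigma }_{l}}={\sigma ^{2}}{\lambda _{l}}{\mathbf{R}_{2}}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2, sigma2 = 4, 3, 1.7
B1, B2 = rng.standard_normal((n1, n1)), rng.standard_normal((n2, n2))
R1, R2 = B1 @ B1.T + n1 * np.eye(n1), B2 @ B2.T + n2 * np.eye(n2)   # two SPD subcovariances
lam, V1 = np.linalg.eigh(R1)                                        # R1 = V1 diag(lam) V1^T

# Covariance of the column-stacked latent matrix under the separable model, and under the
# factor model with A = V1, Sigma_l = sigma^2 * lam_l * R2; the Kronecker ordering used
# here follows the column-stacking (vec) convention.
cov_separable = sigma2 * np.kron(R2, R1)
cov_factor = sum(sigma2 * lam[l] * np.kron(R2, np.outer(V1[:, l], V1[:, l]))
                 for l in range(n1))
assert np.allclose(cov_separable, cov_factor)
```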
4 Scalable marginalization for learning particle interaction kernels from trajectory data
Collective motion with particle interactions is common in both microscopic and macroscopic systems [35, 36]. Learning the interaction kernels between particles is important for two purposes. First, physical laws are less understood for many complex systems, such as cell migration or non-equilibrium thermodynamic processes, and estimating the interaction kernels between particles from field or experimental data is essential for learning these processes. Second, simulations of particle interactions, such as ab initio molecular dynamics simulations, can be computationally expensive, and statistical learning approaches can be used to reduce their cost.
For demonstration purposes, we consider a simple first-order system. In [34], for a system of n interacting particles, the velocity of the ith particle at time t, ${\mathbf{v}_{i}}(t)=d{\mathbf{x}_{i}}(t)/dt$, is modeled through the positions of all particles,

(4.1)
\[ {\mathbf{v}_{i}}(t)={\sum \limits_{j=1}^{n}}\phi (||{\mathbf{x}_{j}}(t)-{\mathbf{x}_{i}}(t)||){\mathbf{u}_{i,j}}(t),\]
where $\phi (\cdot )$ is a latent interaction kernel function between particle i and all other particles, with $||\cdot ||$ being the Euclidean distance, and ${\mathbf{u}_{i,j}}(t)={\mathbf{x}_{j}}(t)-{\mathbf{x}_{i}}(t)$ is the vector of differences between the positions of particles i and j, for $i,j=1,\dots ,n$. Here $\phi (\cdot )$ is a two-body interaction kernel. In molecular dynamics simulation, for instance, the Lennard-Jones potential can be related to molecular forces and the acceleration of molecules in a second-order system. The statistical learning approach can be extended to a second-order system that involves acceleration and external force terms, and the first-order system in (4.1) can be considered an approximation of the second-order system. Furthermore, the interaction between particles is global, as any particle is affected by all other particles. Learning global interactions is more computationally challenging than learning local interactions [51], and approximating the global interaction by a local interaction is of interest in future studies.

One important goal is to efficiently estimate the unobservable interaction function from the particle trajectory data, without specifying a parametric form. This goal is key for estimating the behaviors of dynamic systems in experiments and in observational studies, as the physical law governing a new system may be unknown. In [13], $\phi (\cdot )$ is modeled by a zero-mean Gaussian process with a Matérn covariance:

(4.2)
\[ \phi (\cdot )\sim \mathcal{GP}(0,{\sigma ^{2}}K(\cdot ,\cdot )).\]
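To make the setup concrete, the following Python sketch (our own illustration; the toy interaction kernel, the forward-Euler integrator, and the step size are arbitrary choices, not those used in the paper) simulates the first-order system (4.1) by evaluating the pairwise interactions and advancing the positions in time.

```python
import numpy as np

def velocities(pos, phi):
    """pos: (n, D) particle positions; phi: interaction kernel applied to distances."""
    diff = pos[None, :, :] - pos[:, None, :]      # u_{i,j} = x_j - x_i, shape (n, n, D)
    dist = np.linalg.norm(diff, axis=2)           # pairwise distances ||x_j - x_i||
    safe = np.where(dist > 0, dist, 1.0)          # avoid evaluating phi at zero distance
    w = np.where(dist > 0, phi(safe), 0.0)        # no self-interaction term
    return (w[:, :, None] * diff).sum(axis=1)     # v_i = sum_j phi(||x_j - x_i||) u_{i,j}

def simulate(pos0, phi, dt=1e-3, L=10):
    """Forward-Euler integration of the first-order system (an assumption of this sketch)."""
    traj = [pos0]
    for _ in range(L):
        traj.append(traj[-1] + dt * velocities(traj[-1], phi))
    return np.stack(traj)                         # shape (L + 1, n, D)

# Example: n = 50 particles in D = 2 with a toy attractive-repulsive kernel
rng = np.random.default_rng(0)
pos0 = rng.uniform(0.0, 5.0, size=(50, 2))
traj = simulate(pos0, phi=lambda d: 1.0 / d - 1.0 / d**2, dt=1e-3, L=10)
```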
Estimating interactions in large-scale systems, or from many simulation runs, is computationally prohibitive, however, as the direct inversion of the covariance matrix of the velocity observations requires $\mathcal{O}({(nMDL)^{3}})$ operations, where M is the number of simulations or experiments, n is the number of particles, D is the dimension of each particle, and L is the number of time points in each trial. Furthermore, constructing this covariance involves computing an ${n^{2}}LM\times {n^{2}}LM$ matrix of correlations of ϕ over a D-dimensional input space, which takes $\mathcal{O}({n^{4}}{L^{2}}{M^{2}}D)$ operations. Thus, directly estimating the interaction kernel with the GP model in (4.2) is only applicable to systems with a small number of observations [34, 13].
This work contributes to estimating dynamic systems of interacting particles in two ways. First, we show that the covariance of the velocity observations can be obtained by operations on a few sparse matrices, after marginalizing out the latent interaction function. The sparsity of the inverse covariance of the latent interaction kernel allows us to modify the Kalman filter to efficiently compute the matrix products in this problem, and then apply a conjugate gradient (CG) algorithm [24, 21, 49] to solve the linear system iteratively. The computational complexity of computing the predictive mean and variance at a test point is of order $\mathcal{O}(TnN)+\mathcal{O}(nN\log (nN))$ for a system of n particles with $N=nMDL$ observations, where T is the number of iterations required in the CG algorithm. We found that typically a few hundred CG iterations achieve high predictive accuracy for a moderately large number of observations. The algorithm leads to a substantial reduction of computational cost compared to direct computation.
Second, we study the effect of experimental designs on estimating the interaction kernel function. In previous studies, it was unclear how the initial positions, the time steps of the trajectory, and the number of particles affect the accuracy of estimating interaction kernels. Compared to conventional problems in computer model emulation, where a “space-filling” design is often used, here we cannot directly observe realizations of the latent function; instead, the output velocity is a weighted average of the interaction kernel evaluated at pairwise distances. Besides, we cannot directly control the distances between particles once they move away from their initial positions, in either simulation or experimental studies. When the distance between two particles i and j is small, the contribution $\phi (||{\mathbf{x}_{i}}(t)-{\mathbf{x}_{j}}(t)||){\mathbf{u}_{i,j}}(t)$ can be small, if the repulsive force encoded by $\phi (\cdot )$ does not increase as fast as the distance decreases. Thus the estimation of the interaction kernel can be less accurate in the input domain of small distances. This problem can be alleviated by placing the initial positions of more particles close to each other, providing more small-distance pairs that improve the estimation accuracy.
4.1 Scalable Computation by Sparse Representation of Inverse Covariance
For illustration purposes, let us first consider a simple scenario with $M=1$ simulation and $L=1$ time point of a system of n interacting particles in a D-dimensional space. Since we only have one time point, we omit the notation t when there is no confusion. The algorithm for the general scenario with $L\gt 1$ and $M\gt 1$ is discussed in Appendix A.2. In practice, experimental observations of velocity from multiple particle tracking algorithms or particle image velocimetry typically contain noise [1]. Even for simulation data, the numerical error can be non-negligible for large and complex systems. In these scenarios, the observed velocity ${\mathbf{\tilde{v}}_{i}}={({v_{i,1}},\dots ,{v_{i,D}})^{T}}$ is a noisy realization: ${\mathbf{\tilde{v}}_{i}}={\mathbf{v}_{i}}+{\boldsymbol{\epsilon }_{i}}$, where ${\boldsymbol{\epsilon }_{i}}\sim \mathcal{MN}(0,{\sigma _{0}^{2}}{\mathbf{I}_{D}})$ is a vector of Gaussian noise with variance ${\sigma _{0}^{2}}$.
Assume the $nD$ observations of velocity are $\mathbf{\tilde{v}}={({\tilde{v}_{1,1}},\dots ,{\tilde{v}_{n,1}},{\tilde{v}_{1,2}},\dots ,{\tilde{v}_{n,2}},\dots ,{\tilde{v}_{n-1,D}},{\tilde{v}_{n,D}})^{T}}$. After integrating out the latent function modeled in Equation (4.2), the marginal distribution of the observations follows

(4.3)
\[ (\mathbf{\tilde{v}}\mid {\mathbf{R}_{\phi }},{\sigma ^{2}},{\sigma _{0}^{2}})\sim \mathcal{MN}\left(\mathbf{0},{\sigma ^{2}}\mathbf{U}{\mathbf{R}_{\phi }}{\mathbf{U}^{T}}+{\sigma _{0}^{2}}{\mathbf{I}_{nD}}\right),\]
where U is an $nD\times {n^{2}}$ block diagonal matrix, with the ith $D\times n$ block on the diagonal being $({\mathbf{u}_{i,1}},\dots ,{\mathbf{u}_{i,n}})$, and ${\mathbf{R}_{\phi }}$ is an ${n^{2}}\times {n^{2}}$ matrix, whose $({i^{\prime }},{j^{\prime }})$ term in the $(i,j)$th $n\times n$ block is $K({d_{i,{i^{\prime }}}},{d_{j,{j^{\prime }}}})$, with ${d_{i,{i^{\prime }}}}=||{\mathbf{x}_{i}}-{\mathbf{x}_{{i^{\prime }}}}||$ and ${d_{j,{j^{\prime }}}}=||{\mathbf{x}_{j}}-{\mathbf{x}_{{j^{\prime }}}}||$ for $i,{i^{\prime }},j,{j^{\prime }}=1,\dots ,n$. Direct computation of the likelihood involves the inverse of an $nD\times nD$ covariance matrix and the construction of the ${n^{2}}\times {n^{2}}$ matrix ${\mathbf{R}_{\phi }}$, which costs $\mathcal{O}({(nD)^{3}})+\mathcal{O}({n^{4}}D)$ operations. This is computationally expensive even for small systems.

Here we use an exponential kernel function, $K(d)=\exp (-d/\gamma )$ with range parameter γ, for any nonnegative distance input d for illustration; the method can be extended to Matérn kernels with half-integer roughness parameters. Denote the distance pairs ${d_{i,j}}=||{\mathbf{x}_{i}}-{\mathbf{x}_{j}}||$; there are $(n-1)n/2$ unique positive distance pairs. Denote these $(n-1)n/2$ distance pairs by ${\mathbf{d}_{s}}={({d_{s,1}},\dots {d_{s,(n-1)n/2}})^{T}}$, in increasing order, with the subscript s meaning ‘sorted’. We do not need to consider the case ${d_{i,j}}=0$, as ${\mathbf{u}_{i,j}}=\mathbf{0}$ then contributes nothing to the velocity, so the model in (4.1) can be reduced to exclude interactions between particles at zero distance. In reality, two particles rarely occupy the same position, as there typically exists a repulsive force when two particles get very close. Hence, the ${n^{2}}$ distance pairs ${d_{i,j}}$, for $i=1,\dots ,n$ and $j=1,\dots ,n$, reduce to the $(n-1)n/2$ unique positive terms ${d_{s,i}}$ in increasing order, for $i=1,\dots ,(n-1)n/2$.
Denote by ${\mathbf{R}_{s}}$ the $(n-1)n/2\times (n-1)n/2$ correlation matrix of the kernel outputs $\boldsymbol{\phi }={(\phi ({d_{s,1}}),\dots ,\phi ({d_{s,(n-1)n/2}}))^{T}}$, and let ${\mathbf{U}_{s}}$ be the $nD\times (n-1)n/2$ sparse matrix with $n-1$ nonzero terms in each row, where the nonzero entries for the ith particle correspond to the distance pairs in ${\mathbf{R}_{s}}$. Denote the nugget parameter $\eta ={\sigma _{0}^{2}}/{\sigma ^{2}}$. After marginalizing out ϕ, the covariance of the velocity observations follows

(4.4)
\[ (\mathbf{\tilde{v}}\mid \gamma ,{\sigma ^{2}},\eta )\sim \mathcal{MN}\left(\mathbf{0},{\sigma ^{2}}{\mathbf{\tilde{R}}_{v}}\right),\]
with
(4.5)
\[ {\mathbf{\tilde{R}}_{v}}={\mathbf{U}_{s}}{\mathbf{R}_{s}}{\mathbf{U}_{s}^{T}}+\eta {\mathbf{I}_{nD}}.\]
The conditional distribution of the interaction kernel $\phi ({d^{\ast }})$ at any distance ${d^{\ast }}$ follows
(4.6)
\[ (\phi ({d^{\ast }})\mid \mathbf{\tilde{v}},\gamma ,{\sigma ^{2}},\eta )\sim \mathcal{N}(\hat{\phi }({d^{\ast }}),{\sigma ^{2}}{K^{\ast }}),\]
where the predictive mean and variance follow
(4.7)
\[\begin{aligned}{}\hat{\phi }({d^{\ast }})=& {\mathbf{r}^{T}}({d^{\ast }}){\mathbf{U}_{s}^{T}}{\mathbf{\tilde{R}}_{v}^{-1}}\mathbf{\tilde{v}},\\ {} {K^{\ast }}=& K({d^{\ast }},{d^{\ast }})-{\mathbf{r}^{T}}({d^{\ast }}){\mathbf{U}_{s}^{T}}{\mathbf{\tilde{R}}_{v}^{-1}}{\mathbf{U}_{s}}\mathbf{r}({d^{\ast }}),\end{aligned}\]
with $\mathbf{r}({d^{\ast }})={(K({d^{\ast }},{d_{s,1}}),\dots ,K({d^{\ast }},{d_{s,n(n-1)/2}}))^{T}}$. After obtaining the estimated interaction kernel, one can use it to forecast trajectories of particles and to understand the physical mechanism behind flocking behaviors.

Our primary task is to efficiently compute the predictive distribution of the interaction kernel in (4.6), where the most computationally expensive terms in the predictive mean and variance are ${\mathbf{\tilde{R}}_{v}^{-1}}\mathbf{\tilde{v}}$ and ${\mathbf{\tilde{R}}_{v}^{-1}}{\mathbf{U}_{s}}\mathbf{r}({d^{\ast }})$. Note that ${\mathbf{U}_{s}}$ is a sparse matrix with $n(n-1)D$ nonzero terms and that the inverse correlation matrix ${\mathbf{R}_{s}^{-1}}$ is tri-diagonal. Directly applying the CG algorithm is still computationally challenging, however, as neither ${\mathbf{\tilde{R}}_{v}}$ nor ${\mathbf{\tilde{R}}_{v}^{-1}}$ is sparse. To solve this problem, we extend a step in the Kalman filter to efficiently compute the matrix-vector multiplication, using the sparsity induced by the choice of covariance matrix. Each CG iteration in the new algorithm then costs only $\mathcal{O}({n^{2}}D)$ operations for a system of n particles in D dimensions, and for most systems we explored a few hundred CG iterations achieve high accuracy. The substantial reduction of the computational cost allows us to use more observations to improve the predictive accuracy. We term this approach the sparse conjugate gradient algorithm for Gaussian processes (sparse CG-GP). The algorithm for the scenario with M simulations, each containing L time frames of n particles in a D-dimensional space, is discussed in Appendix A.2.
Figure 3
Estimation of the particle interaction kernel, chosen to be the truncated Lennard-Jones potential in [34], with $D=2$. The left panel shows the true interaction kernel (black curve) and the estimated kernel functions (colored curves), based on three different designs of the initial positions of $n=200$ particles. The computational time of the full GP in [13] and of the sparse CG-GP approach is given in the right panel. Here the most computationally intensive part of the full GP model is constructing the correlation matrix ${\mathbf{R}_{s}}$ of ϕ for $n(n-1)/2$ distance pairs in Equation (4.5), which scales as $\mathcal{O}({n^{4}}D)$.
The comparison of the computational cost between the full GP model and the proposed sparse CG-GP method is shown in the right panel of Figure 3. The most computationally expensive part of the full GP model is constructing the $n(n-1)/2\times n(n-1)/2$ correlation matrix ${\mathbf{R}_{s}}$ of ϕ for the $n(n-1)/2$ distance pairs. The sparse CG-GP algorithm is much faster, as we do not need to construct this covariance matrix; instead, we only need to compute the matrix multiplications efficiently by utilizing the sparse structure of the inverse of ${\mathbf{R}_{s}}$ (Appendix A.2). Note that a GP model with an exponential covariance naturally induces a sparse inverse covariance matrix that can be used for faster computation, which is different from imposing a sparse covariance structure for approximation.
In the left panel of Figure 3, we show the predictive mean and uncertainty assessment by the sparse CG-GP method for three different designs of the initial positions of the particles. From the first to the third design, each coordinate of the initial position of a particle is sampled independently from a uniform distribution $\mathcal{U}[{a_{1}},{b_{1}}]$, a normal distribution $\mathcal{N}({a_{2}},{b_{2}})$, and a log-uniform (reciprocal) distribution $\mathcal{LU}[\log ({a_{3}}),\log ({b_{3}})]$, respectively.
For the experiments with the interaction kernel being the truncated Lennard-Jones potential given in Appendix A.2, we use ${a_{1}}=0$, ${b_{1}}=5$, ${a_{2}}=0$, ${b_{2}}=5$, ${a_{3}}={10^{-3}}$ and ${b_{3}}=5$ for the three designs of initial positions. Compared with the first design, the second design of initial positions, which was assumed in [34], places a larger probability mass near 0. In the third design, the distribution of the distances between particle pairs is monotonically decreasing, with more probability mass near 0 than in the first two designs. In all cases shown in Figure 3, we set $M=1$, $L=1$, and the noise standard deviation ${\sigma _{0}}={10^{-3}}$ in the simulation. For demonstration purposes, the range and nugget parameters are fixed at $\gamma =5$ and $\eta ={10^{-5}}$, respectively, when computing the predictive distribution of ϕ. The estimation of the interaction kernel at large distances is accurate for all designs, whereas the estimation at small distances is not satisfactory for the first two designs. When particles are initialized from the third (log-uniform) design, the accuracy is better, as more particles are near each other, providing more information about the kernel at small distances. This result is intuitive, as the small-distance pairs have relatively small contributions to the velocity in Equation (4.1), and we need more particles close to each other to estimate the interaction kernel at small distances.
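A small Python sketch of the three initial-position designs is given below (using the parameter values quoted above; treating ${b_{2}}$ as the standard deviation of the normal design is our assumption). The log-uniform design is sampled by exponentiating a uniform draw on the log scale.

```python
import numpy as np

rng = np.random.default_rng(1)
n, D = 200, 2
a1, b1 = 0.0, 5.0                  # uniform design
a2, b2 = 0.0, 5.0                  # normal design (b2 treated as the standard deviation)
a3, b3 = 1e-3, 5.0                 # log-uniform (reciprocal) design

x_uniform = rng.uniform(a1, b1, size=(n, D))
x_normal = rng.normal(a2, b2, size=(n, D))
x_loguniform = np.exp(rng.uniform(np.log(a3), np.log(b3), size=(n, D)))
```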
The numerical comparison between different designs allows us to better understand the learning efficiency in different scenarios, which can be used to guide the design of experiments. Because of the large improvement in computational scalability compared to previous studies [13, 34], we can accurately estimate interaction kernels based on more particles and longer trajectories.
4.2 Numerical Results
Figure 4
Estimation of the interaction function based on the sparse CG-GP method, and trajectory forecasts for the truncated LJ and OD simulations. In panel (a), the colored curves are the estimated interactions for different initial positions and particle sizes, all based on trajectories using only $L=1$ time step, whereas the black curve is the truncated LJ kernel used in the simulation. The colored curves in panel (b) are the same as those in panel (a), but based on trajectories of $L=10$ time steps. Panels (d) and (e) are the same as panels (a) and (b), respectively, but the simulated data are generated by the OD interaction kernel. The shaded areas are the $95\% $ predictive intervals. In panel (c), we graph the simulated trajectories of 10 out of 50 particles over $L=200$ time steps, and the trajectory forecast based on the estimated interaction function and the initial positions. The arrows indicate the direction of the velocities of the particles at the last time step. Panel (f) is the same as panel (c), but for the OD interaction kernel.
Here we discuss two scenarios, where the interactions between particles follow the truncated Lennard-Jones (LJ) and opinion dynamics (OD) kernel functions, respectively. The LJ potential is widely used in MD simulations of interacting molecules [43]. First-order systems of the form (4.1) have also been successfully applied to modeling opinion dynamics in social networks (see the survey [36] and references therein). The interaction function ϕ models how the opinions of pairs of agents influence each other. In our numerical example, we consider heterophilious opinion interactions: each agent is more influenced by neighbors slightly further away than by its closest neighbors. As time evolves, the opinions of the agents merge into clusters, with the number of clusters significantly smaller than the number of agents. It was shown in [36] that heterophilious dynamics enhance consensus, contradicting the intuition that the tendency to bond more with those who are different, rather than with those who are similar, would break connections and prevent clusters of consensus.
The details of the interaction functions are given in Appendix A.3. For each interaction, we test our method on 12 configurations of 2 particle sizes ($n=50$ and $n=200$), 2 trajectory lengths ($L=1$ and $L=10$), and 3 designs of initial positions (uniform, normal and log-uniform). The computational scalability of the sparse CG-GP algorithm allows us to compute the predictions for most of these experimental settings within a few seconds. For each configuration, we repeat the experiment 10 times to average out the randomness in the initial positions of the particles. The root mean squared error of the estimated interaction kernels, averaged over the 10 experiments for each configuration, is given in Appendix A.4.
In Figure 4, we show the estimation of the interaction kernels and the forecasts of particle trajectories with different designs, particle sizes, and numbers of time points. The sparse CG-GP method is relatively accurate in almost all scenarios. Among the different initial positions, the estimation for the LJ interaction is most accurate when the initial positions of the particles are sampled from the log-uniform distribution, because there are then more small distances between particles, providing more data to estimate the interaction kernel at small distances. Furthermore, when we have more particles or observations over longer time horizons, the estimation of the interaction kernel from all designs becomes more accurate in terms of the normalized root mean squared error, with a detailed comparison given in Appendix A.4.
In panels (c) and (f) of Figure 4, we plot the trajectory forecasts of 10 particles over 200 time points for the truncated LJ kernel and the OD kernel, respectively. In both simulation scenarios, the interaction kernels are estimated based on trajectories of $n=50$ particles across $L=20$ time steps, with initial positions sampled from the log-uniform design. The trajectories of only 10 out of the 50 particles are shown for better visualization. For trajectories simulated by the truncated LJ kernel, some particles can move very close to each other, since the contribution of a pair to the velocity is proportional to the distance in Equation (4.1), and the truncation of the kernel substantially reduces the repulsive force when particles move close. For the OD simulation, the particles move toward a cluster, as expected, since the particles always attract each other. The forecast trajectories are close to the held-out truth, indicating the high accuracy of our approach.
Compared with the results shown in previous studies [34, 13], both the estimated interaction kernels and the forecast trajectories appear more accurate. The large computational reduction by the sparse CG-GP algorithm shown in Figure 3 permits the use of longer trajectories from more particles to estimate the interaction kernel, which improves the predictive accuracy. Here each particle interacts with all other particles in our simulation, making the number of distance pairs large. Yet we are able to estimate the interaction kernel and forecast the trajectories of the particles within tens of seconds on a desktop for the most time-consuming scenario we considered. Since particles typically have very small or no interaction when the distances between them are large, further computational reduction can be achieved by restricting interactions to particles within a specified radius.
5 Concluding Remarks
We have introduced scalable marginalization of latent variables for correlated data. We first introduced GP models and reviewed the SDE representation of GPs with Matérn covariance and one-dimensional input. The Kalman filter and RTS smoother were introduced as a scalable marginalization technique to compute the likelihood function and predictive distribution, which reduces the computational complexity of a GP with Matérn covariance and 1D input from $\mathcal{O}({N^{3}})$ to $\mathcal{O}(N)$ operations without approximation, where N is the number of observations. Recent efforts on extending scalable computation from 1D inputs to multi-dimensional inputs were discussed. In particular, we developed a new scalable algorithm for estimating particle interaction kernels and forecasting trajectories of particles. This is achieved through the sparse representation of the GP modeling the interaction kernel, together with efficient matrix multiplication obtained by modifying the Kalman filter algorithm; an iterative algorithm based on CG can then be applied, which reduces the computational complexity.
There is a wide range of future topics relevant to this study. First, various models of spatio-temporal data can be written as random factor models in (3.9), with latent factors modeled as Gaussian processes over temporal inputs. It is of interest to utilize the computational advantage of dynamic linear models of factor processes, extending the computational tools by relaxing the independence between prior factor processes in Assumption 1 or by incorporating the Toeplitz covariance structure of stationary temporal processes. Second, for estimating systems of particle interactions, we can further reduce the computation by only considering interactions within a radius between particles. Third, a comprehensive study of experimental design, initialization, and parameter estimation will be helpful for estimating latent interaction functions that can be unidentifiable or weakly identifiable in certain scenarios. Furthermore, velocity directions and angle order parameters are essential for understanding the mechanism of active nematics and cell migration, which can motivate more complex models of interactions. Finally, the sparse CG algorithm developed in this work is of interest for reducing the computational complexity of GP models with multi-dimensional inputs and general designs.