1 Introduction and Motivation
Using machine learning models in real world applications, for instance for industrial optimisation and testing [9, 14], banking [31, 2, 24], or as tools in the context of social services [15], imposes stringent requirements on their validation. The same happens in the framework of computer experiments, see for instance [25, 27], where numerically efficient machine learning models are used as controlled-error approximations of mathematical models with prohibitive computational complexity, [6, 39, 10].
Model validation ideally resorts to a reserved test set, i.e. to evaluations of the modelled function on data points that have not been used neither to select nor to train the machine learning model [5, 46, 18]. Using the errors of the model on this test set enables the assessment of the model quality, using for instance estimates of the mean-squared error in the context of regression problems, or the rate of labeling errors of classifiers. This is the setting addressed in this paper. When such a test-set cannot be made available, model validation is most commonly done by cross-validation [23, 17, 7], relying on the errors of models learnt only on a subset of the learning set to infer the error of the model that integrates the entire dataset.
This paper proposes a methodology to estimate the quality of an interpolator learned on a given experimental design. More precisely, we suppose that data gathered on the points of an experimental design ${\mathbf{X}_{n}}=\{{\mathbf{x}_{1}},\dots ,{\mathbf{x}_{n}}\}$ with n points in a compact set2 $\mathcal{X}$ has been used to build a predictor of the value of the function $f\hspace{0.1667em}:\hspace{0.1667em}\mathcal{X}\to \mathbb{R}$ that produced the collected samples.
We denote by ${\mathbf{y}_{n}}={(f({\mathbf{x}_{1}}),\dots ,f({\mathbf{x}_{n}}))^{\top }}$ the vector collecting the n evaluations of f at the design points ${\mathbf{x}_{i}}$, by ${\mathcal{F}_{n}}=({\mathbf{X}_{n}},{\mathbf{y}_{n}})$ the learning dataset, and by ${\eta _{{\mathcal{F}_{n}}}}(\mathbf{x})$ the resulting predictor of $f(\mathbf{x})$. The quality of ${\eta _{{\mathcal{F}_{n}}}}$ is assessed through a widely used measure of the precision of interpolators, the Integrated Squared Error (ISE):
see, e.g., [38] for an early reference. In the definition above the (user-defined) measure μ enables penalisation of the interpolation errors over regions of $\mathcal{X}$ which are considered to be of particular importance. We stress that we consider that the experimental design ${\mathbf{X}_{n}}$ – also referred to as the “learning design” – is given, making no assumptions on how it is has been chosen.
(1.1)
\[ \mathsf{ISE}({\eta _{{\mathcal{F}_{n}}}})={\int _{\mathcal{X}}}{\left[{\eta _{{\mathcal{F}_{n}}}}(\mathbf{x})-f(\mathbf{x})\right]^{2}}\hspace{0.1667em}\mu (\mathrm{d}\mathbf{x})\hspace{0.1667em},\]Estimation of the integral (1.1) must necessarily resort to the evaluation of the prediction error $\varepsilon (\mathbf{x})=f(\mathbf{x})-{\eta _{{\mathcal{F}_{n}}}}(\mathbf{x})$ over only a finite set of points ${\mathbf{Z}_{m}}=\{{\mathbf{z}_{1}},\dots {\mathbf{z}_{m}}\}\subset \mathcal{X}$, which we designate by “validation design”. The integral is then approximated by replacing μ by a point mass measure $\zeta =\zeta (\mathbf{w},{\mathbf{Z}_{m}})={\textstyle\sum _{i}}{\mathbf{w}_{i}}{\delta _{{\mathbf{z}_{i}}}}$ supported on ${\mathbf{Z}_{m}}$ only.3 We generically refer to ζ as the validation measure, using the notation ${\zeta _{m}}$ to make explicit the size of the validation set. Although ζ is not necessarily the uniform distribution supported on ${\mathbf{Z}_{m}}$, with a slight abuse of terminology we refer to the corresponding ISE estimators
as empirical ISE estimators.
We address the choice of the validation measure ζ – both of the validation design ${\mathbf{Z}_{m}}$ and of the validation weights w – and investigate the properties of the resulting estimates $\widehat{\mathsf{ISE}}({\eta _{{\mathcal{F}_{n}}}};\zeta )$ given by (1.2). The algorithms presented are iterative, defining increasing sequences of nested validation designs ${\mathbf{Z}_{m}}\subset {\mathbf{Z}_{m+1}}\subset {\mathbf{Z}_{m+2}}\subset \cdots \hspace{0.1667em}$ such that the performance of $\widehat{\mathsf{ISE}}({\eta _{{\mathcal{F}_{n}}}};\zeta )$ improves as m increases. A preliminary version of this work has been presented in [12], in the context of a comprehensive comparison of validation methodologies.
The paper is organised as follows. Section 2.1 first relates the ISE estimators (1.2) to other ISE estimators. Then, assuming that the interpolated function f is a realisation of a Gaussian Process (GP) with known moments, we present in Section 2.2 a computable criterion $\mathcal{R}(\zeta ,{\mathcal{F}_{n}})$ that measures the precision of empirical estimators of the form (1.2). In Section 3 we discuss optimisation of $\mathcal{R}(\zeta ,{\mathcal{F}_{n}})$, detailing application of related existing algorithms to the specific conditions of the validation problem of interest here, and revealing an instrumental interpretation of the corresponding “optimal” empirical $\mathsf{ISE}$ estimators. Since the “optimal” validation measure depends on the assumed GP model, the robustness and performance of the validation methodology presented are investigated numerically in Section 4, leading to two major conclusions. One concerns the validation weights w, stating that to avoid overestimation of $\mathsf{ISE}({\eta _{{\mathcal{F}_{n}}}})$ the contributions of the individual errors $\varepsilon ({\mathbf{z}_{i}})$ to $\widehat{\mathsf{ISE}}({\eta _{{\mathcal{F}_{n}}}};\zeta )$ must be down-weighted – with respect to taking ζ as the uniform distribution over ${\mathbf{Z}_{m}}$. The second concerns the geometry of the validation design ${\mathbf{Z}_{m}}$, whose optimality is seen to be much less important that correct choice of the weights w. Based on these numerical studies we propose a default choice for the covariance kernel of the GP model used, including its scale parameter. The numerical studies of Section 4 resort to simulation from selected Gaussian processes, and consider only optimal kriging interpolators. In Section 5 we present results on two “real case” models and for more general interpolators, confirming the robustness and performance of the proposed estimator. Finally, Section 6 summarises our findings and proposes some directions for future work.
Throughout the manuscript we frequently resort to the notion of space-filling designs, i.e., designs whose points are evenly spread over $\mathcal{X}$. This notion has been extensively studied in the experimental design literature, in particular in the context of identification of surrogate models of computer experiments, and several mathematical criteria – e.g. discrepancy, or the classical minimax- and maximin-distance criteria, also called covering and packing radii, see [19, 32, 34] – have been proposed to measure how much a given design is space filling. In this paper, we use the term in a rather informal manner, meaning that the points of the design are well spread over $\mathcal{X}$, no design point being too close to the remaining points, so that for the majority of the usual space-filling criteria mentioned above it should be considered as a good design.
2 A Criterion for Validation Measures
Since f is unknown, we can at best expect to find an ISE estimator that performs well for most functions f consistent with dataset ${\mathcal{F}_{n}}$. To characterise this set of functions we adopt the Gaussian process framework – briefly recalled below – enabling us to subsequently derive a criterion to choose the validation measure ζ.
Before doing that, the next section puts our approach in perspective in relation to other (non-parametric) model validation methods.
2.1 Empirical ISE Estimation
Non-parametric estimation of the ISE of a computational model learned on a dataset ${\mathcal{F}_{n}}$ is most commonly done using ${\mathcal{F}_{n}}$ itself. In cross-validation (CV), see e.g. [8, 4], the residuals ${\varepsilon _{i}^{cv}}={\mathbf{y}_{i}}-{\eta _{{\mathcal{F}_{n}}\setminus ({\mathbf{x}_{i}},{\mathbf{y}_{i}})}}({\mathbf{x}_{i}})$ at each data point $({\mathbf{x}_{i}},{\mathbf{y}_{i}})$ of a predictor fit to all other $n-1$ points of ${\mathcal{F}_{n}}$ are computed, and $\mathsf{ISE}$ is estimated by their average:
The setup considered in this paper is in some sense dual of CV. On the one hand, CV requires more information about η, assuming the ability to build the n new predictors ${\eta _{{\mathcal{F}_{n}}\setminus ({\mathbf{x}_{i}},{\mathbf{y}_{i}})}}$ (one for each point that is “left out”) and assumes thus knowledge of how ${\eta _{{\mathcal{F}_{n}}}}$ is learned, while we consider ${\eta _{{\mathcal{F}_{n}}}}$ as a black-box model delivered by a third party, using an undisclosed modelling approach. On the other hand, CV requires no any additional observations of f, while $\widehat{\mathsf{ISE}}({\eta _{{\mathcal{F}_{n}}}};\zeta )$ requires m new evaluations, one at each point of ${\mathbf{Z}_{m}}$.
(2.1)
\[ {\widehat{\mathsf{ISE}}_{cv}}=\frac{1}{n}{\sum \limits_{i=1}^{n}}{\left({\varepsilon _{i}^{cv}}\right)^{2}}\hspace{0.1667em}.\]Given the observations of f over a validation set ${\mathbf{Z}_{m}}$, a straightforward estimate of the ISE is the simple arithmetic mean of the squared values of the m residuals ${\varepsilon _{i}}=f({\mathbf{z}_{i}})-{\eta _{{\mathcal{F}_{n}}}}({\mathbf{z}_{i}})$ observed over the ${\mathbf{z}_{i}}\in {\mathbf{Z}_{m}}$:
a special case of (1.2), obtained by letting ζ be the uniform distribution over ${\mathbf{Z}_{m}}$: $\zeta =(1/m){\textstyle\sum _{i}}{\delta _{{\mathbf{z}_{i}}}}$.
(2.2)
\[ {\widehat{\mathsf{ISE}}_{un}}=\frac{1}{m}{\sum \limits_{i=1}^{m}}{\varepsilon _{i}^{2}}\hspace{2.5pt}\hspace{0.1667em},\]We argue below that there is no rationale for uniform weighting of the observed residuals. Let ${p_{\eta }}$ denote the (unknown) probability density of the residuals $\varepsilon (\mathbf{x})$ when $\mathbf{x}\sim \mu $, and consider situations where ${\mathbf{Z}_{m}}$ is a space-filling continuation of ${\mathbf{X}_{n}}$, sampling the regions of $\mathcal{X}$ the most distant from ${\mathbf{X}_{n}}$. We can then expect $\varepsilon ({\mathbf{Z}_{m}})=\{\varepsilon (\mathbf{z}),\mathbf{z}\in {\mathbf{Z}_{m}}\}$ to be biased towards the upper limit of the support of ${p_{\eta }}$, and thus ${\widehat{\mathsf{ISE}}_{un}}$ to over-estimate $\mathsf{ISE}$. To correct from this biased sampling of the errors, the contribution of each observed residual to $\widehat{\mathsf{ISE}}$ should be adjusted, counterbalancing the anticipated poor sampling of the smallest residual values. The validation measures ζ proposed in this paper automatically implement this variable residual weighting, relying on a prior stochastic model for f to infer how much $\varepsilon ({\mathbf{Z}_{m}})$ is expected to be representative of the errors over the entire $\mathcal{X}$. Moreover, nothing justifies enforcing ζ to be a proper probability distribution. If $\varepsilon ({\mathbf{Z}_{m}})$ is not a plausible i.i.d.4 sample from ${p_{\eta }}$, expression (2.2) cannot be assimilated to a Monte Carlo estimate of the ISE integral unless appropriate importance sampling weights are used.
This means that there is no reason to impose that ${\textstyle\sum _{i}}{\mathbf{w}_{i}}=1$, and we thus drop this common constraint, letting ζ be an un-normalised measure dictated by the geometry of ${\mathbf{Z}_{m}}$ relative to ${\mathbf{X}_{n}}$. To substantiate this choice, note that when ${\eta _{{\mathcal{F}_{n}}}}$ is an interpolator, so that $\varepsilon ({\mathbf{x}_{i}})=0$ for all ${\mathbf{x}_{i}}\in {\mathbf{X}_{n}}$, incorporation of these n zero residuals in (2.2), which should lead to a better estimator of $\mathsf{ISE}({\eta _{{\mathcal{F}_{n}}}})$, yields
\[ {\widehat{\mathsf{ISE}}_{un}^{\mathrm{\star }}}=\frac{1}{m+n}{\sum \limits_{i=1}^{m}}{\varepsilon _{i}^{2}}\lt {\widehat{\mathsf{ISE}}_{un}}\hspace{2.5pt}\hspace{0.1667em},\]
for which ${\textstyle\sum _{i}}{\mathbf{w}_{i}}=m/(n+m)\lt 1$. Analysis of the biases of estimators ${\widehat{\mathsf{ISE}}_{un}}$ and ${\widehat{\mathsf{ISE}}_{un}^{\mathrm{\star }}}$ is difficult, since, as discussed above, the residuals observed over the validation design are not an i.i.d. sample from ${p_{\eta }}$. Figure 1 illustrates numerically the performance of the two estimators on a simple example, showing histograms of the errors of estimators ${\widehat{\mathsf{ISE}}_{un}}$ (in blue) and ${\widehat{\mathsf{ISE}}_{un}^{\mathrm{\star }}}$ (in red) over 500 realisations of a Gaussian process.5 On the top panel, $n=m=15$ while the bottom panel corresponds to a larger learning design, with $n=25$. We can see that in both cases the estimations errors of ${\widehat{\mathsf{ISE}}_{un}^{\mathrm{\star }}}$ are smaller than those of ${\widehat{\mathsf{ISE}}_{un}}$. The two estimators are affected by biases of opposite signs, the (positive) bias of ${\widehat{\mathsf{ISE}}_{un}}$ being larger (even when $n\gt m$) than the (negative) bias of ${\widehat{\mathsf{ISE}}_{un}^{\mathrm{\star }}}$. Larger values of n produce qualitatively similar comparison results (not shown). When n and m are very different, more sophisticated corrections, depending on the distinct effective sampling rates of the learning and validation designs, could be defined and should yield better results. The bottom line is that even this simplistic correction (uniform down-weighting) is able to reduce the error in the estimate of the $\mathsf{ISE}$.2.2 Choosing the Validation Measure: a GP-Based Criterion
The estimation error $|\widehat{\mathsf{ISE}}({\eta _{{\mathcal{F}_{n}}}};\zeta )-\mathsf{ISE}({\eta _{{\mathcal{F}_{n}}}})|$ is not a computable criterion that we can optimise to choose ζ. A possible approach would be to consider that f belongs to some class of functions $\mathcal{C}$ and optimise the worst estimation performance over all $f\in \mathcal{C}$. Here we follow an alternative and simpler route, assuming that f is a realisation of a Gaussian Process (GP), or Gaussian Random Field, and minimising the second-order moment of the ISE estimation error under the assumed model.
Assume thus that the function f is a sample ${\mathcal{F}_{x}}$ from a GP indexed by $\mathcal{X}$, with known second-order characteristics $\mathsf{E}\{{\mathcal{F}_{x}}{\mathcal{F}_{{x^{\prime }}}}\}={\sigma ^{2}}K(\mathbf{x},{\mathbf{x}^{\prime }})$: $f\sim {\mathcal{GP}^{f}}(m(\mathbf{x}),{\sigma ^{2}}K(\mathbf{x},{\mathbf{x}^{\prime }}))$. The kernel K is supposed to be Strictly Positive Definite (SPD), and, for the sake of simplicity, we consider that $m(\mathbf{x})=\mathsf{E}\{{\mathcal{F}_{x}}\}=0$ for all $\mathbf{x}\in \mathcal{X}$. Extension of the material presented below to the case of a linearly parameterised mean, with $\mathsf{E}\{{\mathcal{F}_{x}}\}={\boldsymbol{\beta }^{\top }}\mathbf{h}(\mathbf{x})$ for a vector $\boldsymbol{\beta }$ of unknown parameters and a vector $\mathbf{h}(\mathbf{x})={({h_{1}}(\mathbf{x}),\dots ,{h_{p}}(\mathbf{x}))^{\top }}$ of p known functions of x is possible via some adaptation.
Under the assumption above $\mathsf{ISE}({\eta _{{\mathcal{F}_{n}}}})$ given by (1.1) is a random variable. The statistical moments of $\widehat{\mathsf{ISE}}({\eta _{{\mathcal{F}_{n}}}};\zeta )$ under this stochastic model for f provide computable and pertinent criteria to chose ζ. We use the Mean Squared Error (MSE) of $\widehat{\mathsf{ISE}}({\eta _{{\mathcal{F}_{n}}}};\zeta )$ given ${\mathcal{F}_{n}}$,
\[\begin{array}{r@{\hskip10.0pt}c@{\hskip10.0pt}l}\displaystyle \mathcal{R}({\zeta _{m}},{\mathcal{F}_{n}})& \displaystyle =& \displaystyle \mathsf{E}\left\{\left.{\left[\mathsf{ISE}({\eta _{{\mathcal{F}_{n}}}})-\widehat{\mathsf{ISE}}({\eta _{{\mathcal{F}_{n}}}};{\zeta _{m}})\right]^{2}}\right|{\mathcal{F}_{n}}\right\}\\ {} & & \displaystyle \hspace{-36.98866pt}=\mathsf{E}\left\{\left.{\left[{\int _{\mathcal{X}}}{[{\mathcal{F}_{x}}-{\eta _{{\mathcal{F}_{n}}}}(\mathbf{x})]^{2}}\hspace{0.1667em}({\zeta _{m}}-\mu )(\mathrm{d}\mathbf{x})\right]^{2}}\right|{\mathcal{F}_{n}}\right\}\hspace{0.1667em},\end{array}\]
as a criterion to choose the validation design: ${\zeta _{m}^{\mathrm{\star }}}({\mathcal{F}_{n}})\in {\operatorname{arg\,min}_{{\zeta _{m}}}}\mathcal{R}({\zeta _{m}},{\mathcal{F}_{n}})$.The GP assumption defines a prior distribution for f, which given ${\mathcal{F}_{n}}$ can be updated into the posterior distribution of its values over the unobserved points, with mean $\mathsf{E}\{{\mathcal{F}_{x}}|{\mathcal{F}_{n}}\}={\mathbf{k}_{n}^{\top }}(\mathbf{x}){\mathbf{K}_{n}^{-1}}{\mathbf{y}_{n}}$ and covariance $\mathsf{E}\{{\mathcal{F}_{x}}{\mathcal{F}_{{x^{\prime }}}}|{\mathcal{F}_{n}}\}={\sigma ^{2}}{K_{|n}}(\mathbf{x},{\mathbf{x}^{\prime }})$, with ${K_{|n}}$ defined by
for any x, ${\mathbf{x}^{\prime }}$ in $\mathcal{X}$, where
(2.3)
\[ {K_{|n}}(\mathbf{x},{\mathbf{x}^{\prime }})=K(\mathbf{x},{\mathbf{x}^{\prime }})-{\mathbf{k}_{n}^{\top }}(\mathbf{x}){\mathbf{K}_{n}^{-1}}{\mathbf{k}_{n}}({\mathbf{x}^{\prime }})\]
\[\begin{array}{r@{\hskip10.0pt}c@{\hskip10.0pt}l}\displaystyle {\mathbf{k}_{n}}(\mathbf{x})& \displaystyle =& \displaystyle {(K(\mathbf{x},{\mathbf{x}_{1}})\hspace{0.1667em}\dots ,K(\mathbf{x},{\mathbf{x}_{n}}))^{\top }}\\ {} \displaystyle {\{{\mathbf{K}_{n}}\}_{i,j}}& \displaystyle =& \displaystyle K({\mathbf{x}_{i}},{\mathbf{x}_{j}})\hspace{0.1667em},\hspace{2.5pt}i,j=1,\dots ,n\hspace{0.1667em}.\end{array}\]
The $n\times n$ matrix ${\mathbf{K}_{n}}$ is SPD as K is SPD (we assume that the ${\mathbf{x}_{i}}$ in ${\mathbf{X}_{n}}$ are pairwise distinct). Note that ${K_{|n}}({\mathbf{x}_{i}},\mathbf{x})=0$ for all $\mathbf{x}\in \mathcal{X}$ and all ${\mathbf{x}_{i}}\in {\mathbf{X}_{n}}$. The Integrated Mean Squared Error (IMSE) is thus
\[\begin{array}{r@{\hskip10.0pt}c@{\hskip10.0pt}l}\displaystyle \mathsf{IMSE}({\mathcal{F}_{n}})& \displaystyle =& \displaystyle {\int _{\mathcal{X}}}\mathsf{E}\left\{{\left[{\mathcal{F}_{x}}-{\eta _{{\mathcal{F}_{n}}}}(\mathbf{x})\right]^{2}}|{\mathcal{F}_{n}}\right\}\hspace{0.1667em}\mu (\mathrm{d}\mathbf{x})\\ {} & & \displaystyle \hspace{-50.0pt}={\int _{\mathcal{X}}}\mathsf{E}\left\{{\left[{\eta _{{\mathcal{F}_{n}}}}(\mathbf{x})-{\mathbf{k}_{n}^{\top }}(\mathbf{x}){\mathbf{K}_{n}^{-1}}{\mathbf{y}_{n}}\right]^{2}}|{\mathcal{F}_{n}}\right\}\hspace{0.1667em}\mu (\mathrm{d}\mathbf{x})\\ {} & \hspace{2.5pt}& \displaystyle +{\sigma ^{2}}{\int _{\mathcal{X}}}{K_{|n}}(\mathbf{x},\mathbf{x})\hspace{0.1667em}\mu (\mathrm{d}\mathbf{x})\hspace{0.1667em}.\end{array}\]
$\mathsf{IMSE}({\mathcal{F}_{n}})$ is minimum when ${\eta _{{\mathcal{F}_{n}}}}(\mathbf{x})$ is the posterior mean ${\mathbf{k}_{n}}{(\mathbf{x})^{\top }}{\mathbf{K}_{n}^{-1}}{\mathbf{y}_{n}}$. This minimum value depends only on the learning design ${\mathbf{X}_{n}}$ and is given by
For any kernel K and signed measure ν on $\mathcal{X}$, let ${\mathcal{E}_{K}}(\nu )$ denote the energy of ν for K,
with ${\overline{K}_{|n}}(\mathbf{x},{\mathbf{x}^{\prime }})=(1/{\sigma ^{4}})\mathsf{E}\left\{\left.{\varepsilon ^{2}}(\mathbf{x}){\varepsilon ^{2}}{(\mathbf{x})^{\prime }}\right|{\mathcal{F}_{n}}\right\}$, a scaled version of the second-order moment of the squared residuals; that is, $\mathcal{R}({\zeta _{m}},{\mathbf{X}_{n}})$ is proportional to the squared MMD between the measures ${\zeta _{m}}$ and μ for the kernel ${\overline{K}_{|n}}$. Under the GP model ${\mathcal{GP}^{f}}$,
and we are thus lead to
\[ {\mathcal{E}_{K}}(\nu )={\int _{{\mathcal{X}^{2}}}}K(\mathbf{x},{\mathbf{x}^{\prime }})\hspace{0.1667em}\nu (\mathrm{d}\mathbf{x})\nu (\mathrm{d}{\mathbf{x}^{\prime }})\hspace{0.1667em}.\]
When K defines a Reproducing Kernel Hilbert Space (RKHS) ${\mathcal{H}_{K}}$, for any function f in ${\mathcal{H}_{K}}$ and any probability measures ξ and μ on $\mathcal{X}$, the integration error ${\Delta _{\xi ,\mu }}(f)=\left|{\textstyle\int _{\mathcal{X}}}f(\mathbf{x})\hspace{0.1667em}\xi (\mathrm{d}\mathbf{x})-{\textstyle\int _{\mathcal{X}}}f(\mathbf{x})\hspace{0.1667em}\mu (\mathrm{d}\mathbf{x})\right|$ can be bounded by the product of two terms, one depending of f only, the other on the signed measure $\xi -\mu $ but not on f. Indeed, application of (i) the reproducing property $f(\mathbf{x})={\langle f,{K_{\mathbf{x}}}\rangle _{{\mathcal{H}_{K}}}}$, where ${\langle \cdot ,\cdot \rangle _{{\mathcal{H}_{K}}}}$ denotes the scalar product in ${\mathcal{H}_{K}}$ and where, for any ${\mathbf{x}^{\prime }}\in \mathcal{X}$, ${K_{\mathbf{x}}}({\mathbf{x}^{\prime }})=K(\mathbf{x},{\mathbf{x}^{\prime }})$, and (ii) of the Cauchy-Schwarz inequality, gives ${\Delta _{\xi ,\mu }}(f)\le \| f{\| _{{\mathcal{H}_{K}}}}{\mathcal{E}_{K}^{1/2}}(\xi -\mu )$, where the quantity ${\mathcal{E}_{K}^{1/2}}(\xi -\mu )$ is called the Maximum-Mean Discrepancy (MMD) between ξ and μ; see, e.g., [41, 40, 36]. Direct calculation yields $\mathcal{R}({\zeta _{m}},{\mathcal{F}_{n}})=\mathcal{R}({\zeta _{m}},{\mathbf{X}_{n}})$, with
(2.5)
\[\begin{array}{r@{\hskip10.0pt}c@{\hskip10.0pt}l}\displaystyle \mathcal{R}({\zeta _{m}},{\mathbf{X}_{n}})\hspace{-0.1667em}\hspace{-0.1667em}\hspace{-0.1667em}& \displaystyle =& \displaystyle \hspace{-0.1667em}\hspace{-0.1667em}\hspace{-0.1667em}{\sigma ^{4}}{\int _{{\mathcal{X}^{2}}}}{\overline{K}_{|n}}(\mathbf{x},{\mathbf{x}^{\prime }})({\zeta _{m}}-\mu )(\mathrm{d}\mathbf{x})({\zeta _{m}}-\mu )(\mathrm{d}{\mathbf{x}^{\prime }})\\ {} \hspace{-0.1667em}\hspace{-0.1667em}\hspace{-0.1667em}& \displaystyle =& \displaystyle \hspace{-0.1667em}\hspace{-0.1667em}\hspace{-0.1667em}{\sigma ^{4}}{\mathcal{E}_{{\overline{K}_{|n}}}}({\zeta _{m}}-\mu )\hspace{0.1667em},\end{array}\](2.6)
\[ {\overline{K}_{|n}}(\mathbf{x},{\mathbf{x}^{\prime }})=2\hspace{0.1667em}{K_{|n}^{2}}(\mathbf{x},{\mathbf{x}^{\prime }})+{K_{|n}}(\mathbf{x},\mathbf{x}){K_{|n}}({\mathbf{x}^{\prime }},{\mathbf{x}^{\prime }})\hspace{0.1667em},\]
\[ {\zeta _{m}^{\mathrm{\star }}}({\mathcal{F}_{n}})\in \underset{{\zeta _{m}}}{\operatorname{arg\,min}}{\mathcal{E}_{{\overline{K}_{|n}}}}({\zeta _{m}}-\mu )\hspace{0.1667em},\]
with ${\overline{K}_{|n}}(\mathbf{x},{\mathbf{x}^{\prime }})$ given by (2.6).When ${\eta _{{\mathcal{F}_{n}}}}$ does not interpolate ${\mathbf{y}_{n}}$, and under the same GP model for f, similar developments still give $\mathcal{R}({\zeta _{m}},{\mathcal{F}_{n}})={\sigma ^{4}}{\mathcal{E}_{{\overline{K}_{|n}}}}({\zeta _{m}}-\mu )$, with now
where ${\widehat{\delta }_{n}}(\mathbf{x})={\mathbf{k}_{n}^{\top }}(\mathbf{x}){\mathbf{K}_{n}^{-1}}{\mathbf{y}_{n}}-{\boldsymbol{\eta }_{{\mathcal{F}_{n}}}}(\mathbf{x})$. Although in the following we will always consider that ${\eta _{{\mathcal{F}_{n}}}}$ is the optimal interpolator ${\mathbf{k}_{n}}{(\mathbf{x})^{\top }}{\mathbf{K}_{n}^{-1}}{\mathbf{y}_{n}}$, and thus that (2.6) holds, note that our approach covers generic machine learning predictors by considering ${\overline{K}_{|n}}$ defined by (2.7).
(2.7)
\[\begin{array}{r@{\hskip10.0pt}c@{\hskip10.0pt}l}\displaystyle {\overline{K}_{|n}}(\mathbf{x},{\mathbf{x}^{\prime }})& \displaystyle =& \displaystyle 2\hspace{0.1667em}\left[{K_{|n}}(\mathbf{x},{\mathbf{x}^{\prime }})+2\hspace{0.1667em}{\widehat{\delta }_{n}}(\mathbf{x}){\widehat{\delta }_{n}}({\mathbf{x}^{\prime }})\right]{K_{|n}}(\mathbf{x},{\mathbf{x}^{\prime }})\\ {} & & \displaystyle \hspace{-56.9055pt}+\hspace{0.1667em}\left[{\widehat{\delta }_{n}^{2}}(\mathbf{x})+{K_{|n}}(\mathbf{x},\mathbf{x})\right]\left[{\widehat{\delta }_{n}^{2}}({\mathbf{x}^{\prime }})+{K_{|n}}({\mathbf{x}^{\prime }},{\mathbf{x}^{\prime }})\right]\hspace{0.1667em},\end{array}\]Kernels ${\overline{K}_{|n}}$ present a number of features which are not shared by the most commonly used GP kernels. The assumption that ${\eta _{{\mathcal{F}_{n}}}}$ is an interpolator, i.e. $\varepsilon ({\mathbf{x}_{i}})=0$, implies that ${\overline{K}_{|n}}({\mathbf{x}_{i}},\mathbf{x})=0$ for all $\mathbf{x}\in \mathcal{X}$ and all ${\mathbf{x}_{i}}\in {\mathbf{X}_{n}}$. The squared error process is thus non-stationary, with a spatial coherency structure that is strongly dictated by the geometry of ${\mathbf{X}_{n}}$. Adapting the validation weights ${\mathbf{w}_{i}}$ to this correlation structure dictates the performance of $\widehat{\mathsf{ISE}}({\eta _{{\mathcal{F}_{n}}}};{\zeta _{m}})$ in a critical manner. Yet, as the numerical studies of Section 4 show, exploiting the particular shape of ${\overline{K}_{|n}}$ when choosing the validation points ${\mathbf{Z}_{m}}$ is less critical (as long as they do not fall in the vicinity of ${\mathbf{X}_{n}}$).
Finally, notice that ${\overline{K}_{|n}}$ is PD. Indeed, the Hadamard product ${\mathbf{C}_{n}^{\circ 2}}$ with elements ${\{{\mathbf{C}_{n}^{\circ 2}}\}_{i,j}}={C^{2}}({\mathbf{x}_{i}},{\mathbf{x}_{j}})$, $i,j=1,\dots ,n$, is PD when the matrix ${\mathbf{C}_{n}}$ with elements ${\{{\mathbf{C}_{n}}\}_{i,j}}=C({\mathbf{x}_{i}},{\mathbf{x}_{j}})$ is PD. Hence, the positive definiteness of ${K_{|n}}$ implies that ${K_{|n}^{2}}$ is PD, which in turn implies that ${\overline{K}_{|n}}$ is also PD.
3 Minimisation of ${\mathcal{E}_{{\overline{K}_{|n}}}}$
We address now the minimisation of ${\mathcal{E}_{{\overline{K}_{|n}}}}({\zeta _{m}}-\mu )$ with respect to ${\zeta _{m}}$.
We drop the two constraints usually imposed on weights: besides the sum-of weights-equals-one constraint (see Section 2.1), we also do not impose ${\mathbf{w}_{i}}\ge 0$. Imposing positivity would be natural if the observations were noisy independent random samples of the interpolation error, but here the ${\varepsilon _{i}}$ are noise-free and, more importantly, strongly linked by a coherency structure dictated by both the regularity characteristics of f and the quality of ${\eta _{{\mathcal{F}_{n}}}}$ as an interpolator. Nonetheless, our numerical experiments show that the ${\mathbf{w}_{i}}$ are almost always positive; see for example Figures 13 and 14. One may refer to [21] for an investigation of conditions that ensure positivity of quadrature weights, which shows that positivity can be guaranteed only under rather specific circumstances.
Since for a given f and ${\eta _{{\mathcal{F}_{n}}}}$ the validation residuals are deterministic, repeating validation points or choosing ${\mathbf{z}_{i}}\in {\mathbf{X}_{n}}$ brings no additional information. We thus restrict ${\mathbf{Z}_{m}}$ to configurations of m distinct points in $\mathcal{X}\setminus {\mathbf{X}_{n}}$. The minimisation of ${\mathcal{E}_{{\overline{K}_{|n}}}}({\zeta _{m}}-\mu )$ with respect to the parameters of ${\zeta _{m}}$ is a non-linear optimisation problem over a large dimensional space ($m(d+1)$ scalar parameters when $\mathcal{X}\subset {\mathbb{R}^{d}}$). As briefly evoked in the introduction, rather than fixing upfront the size m of the validation design, we are interested in finding nested sequences of validation designs, generated by a sequence of identical steps, each one increasing the design size by one:
where ${\mathbf{z}_{m+1}}$ is restricted to ${\mathcal{X}_{m}}=\mathcal{X}\setminus \{{\mathbf{Z}_{m}}\cup {\mathbf{X}_{n}}\}$.
Before we present in Section 3.2 the sequential Bayesian quadrature algorithm that performs this iterative construction, greedily decreasing ${\mathcal{E}_{{\overline{K}_{|n}}}}({\zeta _{m}}-\mu )$ at each step, we present background on relevant literature on iterative energy (or, equivalently, MMD) minimisation.
3.1 Background
Kernel Herding (KH) [44] can be seen to correspond to the Frank-Wolfe conditional gradient algorithm [3] applied to MMD minimisation, that is, to the vertex-direction method with predefined step-length, commonly used in optimal experimental design since the pioneering work of H.P. Wynn [45] and V.V. Fedorov [11]. It is an accretive method,6 generating a sequence ${\mathbf{z}_{1}},{\mathbf{z}_{2}},\dots $ which can be incrementally grown to any target size m.
In Bayesian quadrature (BQ) [29, 37] the goal is to choose samples that best approximate an integral by exploiting the assumption that the integrated function is the realisation of a GP. Sequential BQ (SBQ) sequentially expands the set of sampled points by adding a new sample at the point that decreases the variance of the integral estimate the most. This variance is shown to be the MMD between the target integral measure and the discrete measure that implements the quadrature rule for the kernel of the assumed GP model.
KH and SBQ are closely related, see e.g. [16], both attempting to minimise the same MMD. The two techniques embed the problem in consideration in the RKHS of a positive definite kernel that is chosen to reflect the characteristics of the underlying data distribution (in the original formulation of KH) or of the integrated functions (in SBQ). As stressed in [16], a major distinction between the two techniques concerns the weights assigned to each sample, which are uniform for standard KH, while they are optimally selected in SBQ. The two methods differ both in complexity and performance: SBQ is superior to standard (uniform weight) KH, this improvement coming at the cost of an increased complexity, $O(n)$ for KH and $O({n^{2}})$ for SBQ when constructing an n-point design among a finite set of candidates; see [33].
Experiments combining the two methodologies, by using the optimal BQ weights for a design found by standard KH, show that correct weighting is more critical than sample placement [16, 33], affecting in particular the algorithm’s convergence rate: KH has performance similar to SBQ for small design sizes, but displays worse performance as design size grows.
The validation setup of this paper coincides with the framework assumed by BQ, our final goal being to estimate an integral from a small number of samples, and we also resort to a GP assumption. As in BQ, the weights of our empirical estimator do not need to sum to 1 and are not necessarily positive (see [21]), and the optimal solution minimises an MMD. Placing the GP assumption not directly on the function we wish to integrate – in our case ${\varepsilon ^{2}}(\mathbf{x})$ – but on the interpolated f, leads to the identification of the pertinent MMD kernel under our validation framework as the non-stationary kernel ${\overline{K}_{|n}}$, whose structure encodes the geometry of the learning design ${\mathbf{X}_{n}}$.
Both KH and BQ assume that the RKHS kernel is characteristic, meaning that the corresponding MMD between two probability measures is zero if and only if these two measures coincide. Kernel ${\overline{K}_{|n}}$ is not characteristic, and in particular it cannot differentiate between measures that differ only over the finite set ${\mathbf{X}_{n}}$, where ${\overline{K}_{|n}}$ is zero. However, as we stressed before, since we know that $\varepsilon (\mathbf{x})=0$ for $\mathbf{x}\in {\mathbf{X}_{n}}$, the set of target measures over which we minimise ${\mathcal{E}_{{\overline{K}_{|n}}}}({\zeta _{m}}-\mu )$ all put zero mass on ${\mathbf{X}_{n}}$, and thus this minimisation still makes sense.
3.2 Greedy Minimisation of ${\mathcal{E}_{{\overline{K}_{|n}}}}({\zeta _{m}}-\mu )$
In this section we briefly present the SBQ method, reinterpreting it in the validation setup of interest to us.
By noting that ${\mathcal{E}_{{\overline{K}_{|n}}}}({\zeta _{m}}-\mu )$ is quadratic in the ${\{{\mathbf{w}_{i}}\}_{i=1}^{m}}$, the optimal weights $\tilde{\mathbf{w}}({\mathbf{Z}_{m}})$ for a given ${\mathbf{Z}_{m}}$ are obtained explicitly as
where the $m\times m$ matrix ${\overline{K}_{|n}}({\mathbf{Z}_{m}},{\mathbf{Z}_{m}})$ has generic element ${\overline{K}_{|n}}({\mathbf{z}_{i}},{\mathbf{z}_{j}})$ and the i-th entry of the m-dimensional column vector ${P_{{\overline{K}_{|n}}}}({\mathbf{Z}_{m}})$ is the potential of μ associated with kernel ${\overline{K}_{|n}}$ at validation point ${\mathbf{z}_{i}}$:
Under the posterior model, i.e., given ${\mathcal{F}_{n}}$, ${\widehat{{\varepsilon ^{2}}}_{{\mathcal{F}_{n}}}}(\mathbf{x}|{\mathbf{Z}_{m}})$ is the Minimum MSE (MMSE) linear estimate of ${\varepsilon ^{2}}(\mathbf{x})$ given the residuals observed over ${\mathbf{Z}_{m}}$. When the weights ${\mathbf{w}_{i}}$ of the validation measure are given by (3.2), $\widehat{\mathsf{ISE}}$ has thus the following simple and enlightening expression:
Note that the weights ${\tilde{\mathbf{w}}_{i}}({\mathbf{Z}_{m}})$, and thus the estimator $\widehat{\mathsf{ISE}}({\eta _{{\mathcal{F}_{n}}}},{\mathbf{Z}_{m}})$ itself, are independent of ${\sigma ^{2}}$. The estimators ${\widehat{{\varepsilon ^{2}}}_{{\mathcal{F}_{n}}}}(\mathbf{x}|{\mathbf{Z}_{m}})$ and $\widehat{\mathsf{ISE}}({\eta _{{\mathcal{F}_{n}}}},{\mathbf{Z}_{m}})$ rely on the assumed GP model ${\mathcal{GP}^{f}}$ for f, but as explained in [42, Sect. 3.2], model misspecification has a much smaller effect on estimated function values than on predictions of their MSE. One important strength of our approach is thus that our estimator of $\mathsf{ISE}({\eta _{{\mathcal{F}_{n}}}})$ does not involve the predicted MSE associated with the reconstructed residuals. As shown in Appendix A, this is no longer the case when one attempts at removing the bias of $\widehat{\mathsf{ISE}}({\eta _{{\mathcal{F}_{n}}}},{\mathbf{Z}_{m}})$, which leads to estimators that are no longer robust to model misspecification.
(3.2)
\[ \tilde{\mathbf{w}}({\mathbf{Z}_{m}})={\overline{K}_{|n}}{({\mathbf{Z}_{m}},{\mathbf{Z}_{m}})^{-1}}{P_{{\overline{K}_{|n}}}}({\mathbf{Z}_{m}})\hspace{0.1667em},\]
\[ {\left[{P_{{\overline{K}_{|n}}}}({\mathbf{Z}_{m}})\right]_{i}}={P_{{\overline{K}_{|n}}}}({\mathbf{z}_{i}})={\int _{\mathcal{X}}}{\overline{K}_{|n}}({\mathbf{z}_{i}},\mathbf{x})\mu (d\mathbf{x})\hspace{0.1667em}.\]
Remembering that ${\sigma ^{4}}{\overline{K}_{|n}}(\mathbf{x},{\mathbf{x}^{\prime }})=\mathsf{E}\left\{{\varepsilon ^{2}}(\mathbf{x}){\varepsilon ^{2}}({\mathbf{x}^{\prime }})|{\mathcal{F}_{n}}\right\}$, ${P_{{\overline{K}_{|n}}}}(\mathbf{z})$ can be recognised as
\[ {P_{{\overline{K}_{|n}}}}(\mathbf{z})=\frac{1}{{\sigma ^{4}}}\mathsf{E}\left\{\left.{\varepsilon ^{2}}(\mathbf{z})\hspace{0.1667em}\mathsf{ISE}({\eta _{{\mathcal{F}_{n}}}})\right|{\mathcal{F}_{n}}\right\}\hspace{0.1667em}.\]
Define
(3.3)
\[ {\widehat{{\varepsilon ^{2}}}_{{\mathcal{F}_{n}}}}(\mathbf{x}|{\mathbf{Z}_{m}})={\overline{K}_{|n}}(\mathbf{x},{\mathbf{Z}_{m}}){\overline{K}_{|n}}{({\mathbf{Z}_{m}},{\mathbf{Z}_{m}})^{-1}}{\varepsilon ^{2}}({\mathbf{Z}_{m}})\hspace{0.1667em}.\](3.4)
\[ \widehat{\mathsf{ISE}}({\eta _{{\mathcal{F}_{n}}}},{\mathbf{Z}_{m}})=\sum \limits_{i}{\tilde{\mathbf{w}}_{i}}({\mathbf{Z}_{m}}){\varepsilon ^{2}}({\mathbf{z}_{i}})\hspace{-0.1667em}=\hspace{-0.1667em}{\int _{\mathcal{X}}}{\widehat{{\varepsilon ^{2}}}_{{\mathcal{F}_{n}}}}(x|{\mathbf{Z}_{m}})\hspace{0.1667em}\mu (dx)\hspace{0.1667em}.\]For a given ${\zeta _{m}}$ define ${\mathcal{E}_{m}}\left(\mathbf{x}\right)={\mathcal{E}_{{\overline{K}_{|n}}}}({\zeta _{m+1}^{\mathrm{\star }}}-\mu )$, the energy for measure ${\zeta _{m+1}^{\mathrm{\star }}}$ having support ${\mathbf{Z}_{m+1}}(\mathbf{x})={\mathbf{Z}_{m}}\cup \{\mathbf{x}\}$ and optimal weights $\tilde{\mathbf{w}}({\mathbf{Z}_{m+1}}(\mathbf{x}))$ given by (3.2). If ${\zeta _{m}}=\zeta (\tilde{\mathbf{w}}({\mathbf{Z}_{m}}),{\mathbf{Z}_{m}})$, for $\mathbf{x}\in {\mathcal{X}_{m}}$ we have
\[\begin{array}{r@{\hskip10.0pt}c@{\hskip10.0pt}l}\displaystyle {\mathcal{E}_{m}}\left(\mathbf{x}\right)& \displaystyle =& \displaystyle {\mathcal{E}_{{\overline{K}_{|n}}}}({\zeta _{m}}-\mu )\\ {} & & \displaystyle \hspace{-56.9055pt}-\hspace{0.1667em}\frac{{\left({P_{{\overline{K}_{|n}}}}(\mathbf{x})-{\overline{K}_{|n}}(\mathbf{x},{\mathbf{Z}_{m}}){\overline{K}_{|n}}{({\mathbf{Z}_{m}},{\mathbf{Z}_{m}})^{-1}}{P_{{\overline{K}_{|n}}}}({\mathbf{Z}_{m}})\right)^{2}}}{{s^{2}}(\mathbf{x})},\end{array}\]
where
\[\begin{array}{r@{\hskip10.0pt}c@{\hskip10.0pt}l}\displaystyle {s^{2}}(\mathbf{x})& \displaystyle =& \displaystyle {\overline{K}_{|n}}(\mathbf{x},\mathbf{x})\\ {} & & \displaystyle \hspace{-20.0pt}-\hspace{0.1667em}{\overline{K}_{|n}}(\mathbf{x},{\mathbf{Z}_{m}}){\overline{K}_{|n}}{({\mathbf{Z}_{m}},{\mathbf{Z}_{m}})^{-1}}{\overline{K}_{|n}}({\mathbf{Z}_{m}},\mathbf{x})\\ {} & & \displaystyle =\frac{1}{{\sigma ^{4}}}\mathsf{E}\left\{\left.{\left({\varepsilon ^{2}}(\mathbf{x})-{\widehat{{\varepsilon ^{2}}}_{{\mathcal{F}_{n}}}}(\mathbf{x}|{\mathbf{Z}_{m}})\right)^{2}}\right|{\mathcal{F}_{n}}\right\}\hspace{0.1667em}.\end{array}\]
The next validation point is thus a maximiser of the second term in ${\mathcal{E}_{m}}\left(\mathbf{x}\right)$, which can equivalently be written as
(3.5)
\[\begin{aligned}{}{\mathbf{z}_{m+1}}\in \underset{\mathbf{x}\in {\mathcal{X}_{m}}}{\operatorname{arg\,max}}& \frac{\mathsf{E}\left\{\left.\mathsf{ISE}({\eta _{{\mathcal{F}_{n}}}})\left({\varepsilon ^{2}}(\mathbf{x})-{\widehat{{\varepsilon ^{2}}}_{{\mathcal{F}_{n}}}}(\mathbf{x}|{\mathbf{Z}_{m}})\right)\right|{\mathcal{F}_{n}}\right\}}{\mathsf{E}\left\{\left.{\left({\varepsilon ^{2}}(\mathbf{x})-{\widehat{{\varepsilon ^{2}}}_{{\mathcal{F}_{n}}}}(\mathbf{x}|{\mathbf{Z}_{m}})\right)^{2}}\right|{\mathcal{F}_{n}}\right\}}.\end{aligned}\]The numerator measures how much $\mathsf{ISE}({\eta _{{\mathcal{F}_{n}}}})$ and the error of ${\widehat{{\varepsilon ^{2}}}_{{\mathcal{F}_{n}}}}(\mathbf{z}|{\mathbf{Z}_{m}})$ as an estimate of ${\varepsilon ^{2}}(\mathbf{x})$ are statistically associated. Points where this term is large are good candidates to extend the current design. The denominator penalises points x where ${\varepsilon ^{2}}(\mathbf{x})$ is estimated with a large MSE, tending in particular to keep ${\mathbf{z}_{m+1}}$ away from the boundaries of $\mathcal{X}$ (where the uncertainty is in general large), as the numerical studies presented later will show.
The recursive extension of the validation measure is initiated with ${\mathbf{Z}_{1}}=\{{\mathbf{z}_{1}}\}$ solution of
In practice, a finite set ${\mathcal{X}_{L}}\subset \mathcal{X}$, for instance the L first elements of a low-discrepancy sequence in $\mathcal{X}$, or a regular grid in $\mathcal{X}$ if d is not too large, is substituted for $\mathcal{X}$ in (3.5) and (3.6). The determination of ${\mathbf{z}_{m+1}}$, $m\ge 0$, then requires the evaluation of ${P_{{\overline{K}_{|n}}}}(\mathbf{x})$ for all $\mathbf{x}\in {\mathcal{X}_{L}}\setminus {\mathbf{X}_{n}}$. This calculation is done once for all, at the initialisation of the algorithm. In the numerical examples of Sections 4 and 5, ${P_{{\overline{K}_{|n}}}}={P_{{\overline{K}_{|n}},\mu }}$ is replaced by ${P_{{\overline{K}_{|n}},{\mu _{L}}}}$, with ${\mu _{L}}$ the uniform (discrete) measure uniform on ${\mathcal{X}_{L}}$, see Appendix C for details. When K is a tensor-product kernel and μ is uniform on $\mathcal{X}={[0,1]^{d}}$, ${P_{{\overline{K}_{|n}}}}(\mathbf{x})$ can often be calculated explicitly; see Appendix B. The same approach (substitution of the discrete measure ${\mu _{L}}$ for μ, or tensorisation) can be used to evaluate (3.4).
With the aid of a one-dimensional example we formulate now a number of comments about the expected behaviour and properties of the estimators $\widehat{\mathsf{ISE}}$ obtained by repeated application of (3.5) – to extend ${\mathbf{Z}_{m}}$ to ${\mathbf{Z}_{m+1}}$ – and (3.2) – fixing the weights of ${\zeta _{m+1}}$, and thus ${\widehat{{\varepsilon ^{2}}}_{{\mathcal{F}_{n}}}}(\mathbf{x}|{\mathbf{Z}_{m+1}})$ for the subsequent design extension. The red bold curve in the top panel of Figure 2 plots the squared residuals ${\varepsilon ^{2}}(\mathbf{x})$ of the interpolator ${\eta _{{\mathcal{F}_{n}}}}$ for the function f plotted in the bottom panel (where ${\eta _{{\mathcal{F}_{n}}}}$ and f are in red and green, respectively), trained on the learning design of size 10 indicated by the red stars. The blue and green curves on the top panel are the squared residuals ${\widehat{{\varepsilon ^{2}}}_{{\mathcal{F}_{n}}}}(\mathbf{x}|{\mathbf{Z}_{m}})$ predicted by two distinct ${\zeta _{m}}$ ($m=10$), both generated using (3.5) and (3.2), but assuming distinct kernels $K(\mathbf{x},{\mathbf{x}^{\prime }})$: Cauchy (in green) and Matérn 3/2 (in blue), with range parameters θ as indicated in the legend.7 The (nearly coincident) validation designs ${\mathbf{Z}_{m}}$ are indicated by the squares and circles filled with the corresponding colours.
Figure 2
Top: $\varepsilon (\mathbf{x})$ (red) and ${\widehat{{\varepsilon ^{2}}}_{{\mathcal{F}_{n}}}}(\mathbf{x}|{\mathbf{Z}_{m}})$ (blue and green) for two distinct GP models. Bottom: f, η and ${\mathbf{X}_{n}}$.
Figure 3
${\overline{K}_{|n}}({\mathbf{z}_{1}},\mathbf{x})/{\overline{K}_{|n}}({\mathbf{z}_{1}},{\mathbf{z}_{1}})$ (bold lines) and $K({\mathbf{z}_{1}},\mathbf{x})/K({\mathbf{z}_{1}},{\mathbf{z}_{1}})$ (thin lines) for the Cauchy and Matérn kernels used in Figure 2 (same colour code) and ${z_{1}}\simeq 0.1$.
Remark first that, as anticipated, both designs ${\mathbf{Z}_{m}}$ have no points in the boundaries of $\mathcal{X}$, even if the uncertainty affecting ${\varepsilon ^{2}}(\mathbf{x})$ is large in those regions. Those familiar with optimal interpolation using monotonically decreasing stationary covariance kernels may be surprised by the fact that in intervals between learning points containing no validation points (e.g. around $\mathbf{x}\simeq 0.3$) the interpolated squared residual is non-zero, i.e., ${\widehat{{\varepsilon ^{2}}}_{{\mathcal{F}_{n}}}}(\mathbf{x}|{\mathbf{Z}_{m}})\gt 0$. This is a consequence of the particular shape of kernel ${\overline{K}_{|n}}$, strongly dictated by the geometry of ${\mathbf{X}_{n}}$, which has larger values at pairs of points at large distance than the original K, as shown in Figure 3. For ${\mathbf{z}_{1}}\simeq 0.1\in {\mathbf{Z}_{m}}$, the figure plots normalised versions of both the assumed (stationary) signal correlation $K({\mathbf{z}_{1}}-\mathbf{x})$ (in thin coloured lines) as well as kernel ${\overline{K}_{|n}}({\mathbf{z}_{1}},\mathbf{x})$ (bold lines), with the same colour code as in Figure 2. The similarity of the two ${\overline{K}_{|n}}$ allows us to expect that the estimator will have some robustness with respect to the assumed GP model. The numerical studies presented in Section 4 confirm this expectation.
Above, we recognised ${\widehat{{\varepsilon ^{2}}}_{{\mathcal{F}_{n}}}}(\mathbf{x}|{\mathbf{Z}_{m}})$, given by (3.3), as the MMSE linear estimator of ${\varepsilon ^{2}}(\mathbf{x})$ given ${\varepsilon ^{2}}({\mathbf{Z}_{m}})$. Being agnostic with respect to the expected values of the involved random variables, estimators ${\widehat{{\varepsilon ^{2}}}_{{\mathcal{F}_{n}}}}(\mathbf{x}|{\mathbf{Z}_{m}})$, and thus $\widehat{\mathsf{ISE}}$, are biased. We investigate in Appendix A the possibility of exploiting knowledge of the first moments, namely $\mathsf{E}\left\{\left.{\varepsilon ^{2}}(\mathbf{x})\right|{\mathcal{F}_{n}}\right\}={\sigma ^{2}}{K_{|n}}(\mathbf{x},\mathbf{x})$ and $\mathsf{E}\left\{\left.\mathsf{ISE}({\eta _{{\mathcal{F}_{n}}}})\right|{\mathcal{F}_{n}}\right\}=\mathsf{IMSE}({\mathcal{F}_{n}})$, to replace ${\widehat{{\varepsilon ^{2}}}_{{\mathcal{F}_{n}}}}(\mathbf{x}|{\mathbf{Z}_{m}})$ in (3.4) with an unbiased estimator. Unfortunately, bias correction comes at the price of loosing robustness with respect to the assumed GP model for f, as we might expect given the explicit dependency on ${\sigma ^{2}}$ of both expected values. Thus, the unbiased estimators in Appendix A cannot be considered as instrumental alternatives to $\widehat{\mathsf{ISE}}$, and we do not consider them in the numerical study of Section 4.
4 Numerical Experiments
Section 4.1 presents numerical studies that demonstrate the robustness of $\widehat{\mathsf{ISE}}$ with respect to the assumed GP model, with ${\zeta _{m}}$ found by SBQ. Section 4.2 confirms the importance of using ${\overline{K}_{|n}}$ to define the energy minimised by SBQ. We then study, in Section 4.3, the possibility of using KH, which has slightly smaller computational complexity, rather than SBQ, to find the validation support of ${\zeta _{m}}$. Our conclusion is that doing so not only leads to worse performance but is also prone to numerical instability. Finally, Section 4.4 illustrates via some examples the properties of the validation measures, in particular their space-filling properties and the fact that they down-weight the observed squared residuals. In all examples $\mathcal{X}={[0,1]^{d}}$, with $d=1$, 2 or 3. Use of larger values of d leads to similar conclusions; see [35].
Our analysis resorts to simulations from several (zero mean) GP models, and the MSE of the $\mathsf{ISE}$ estimates is approximated by averaging the squared errors of ${\widehat{\mathsf{ISE}}^{(i)}}$ on $M=500$ realisations ${\{{f^{(i)}}\}_{i=1}^{M}}$ of the assumed GP model. We reserve the notation $Q(\cdot ,\cdot ;{\theta _{0}})$ for the kernel of the GP model from which is f sampled, ${\theta _{0}}$ being thus “the true” scale parameter. The scale parameter is adapted to the size of the learning design, ${\theta _{0}}={n^{1/d}}$, such that good interpolation performance over $\mathcal{X}$ can be attained with n points. Designs ${\mathbf{X}_{n}}$ are always space filling, and ${\eta _{{\mathcal{F}_{n}}}}$ is the optimal Bayesian interpolator for the simulated GP model. See Appendix C for details.
$K(\cdot ,\cdot ;\theta )$ denotes the kernel of the GP model assumed by the design algorithm that produces ${\zeta _{m}}$, with θ its scale parameter. In all numerical examples we always consider ${\sigma ^{2}}=1$. The influence of θ is studied for $\theta \in \left[{n^{1/d}}/4,\max ({n^{1/d}},2\hspace{0.1667em}{(n+m)^{1/d}})\right]$, an interval that always contains
(as well as ${\theta _{0}}={n^{1/d}}$). All plots consider the normalisation $\theta /{\theta _{c}}$, such that ${\theta _{c}}\leftrightarrow 1$ in the plots shown. In all plots of this section the special symbols in the plotted curves indicate $\theta ={\theta _{0}}$, the scale parameter of the simulated GP model.
Figure 4
MSE of $\widehat{\mathsf{ISE}}$. Statistics over 500 realisations. Q is a Matérn 3/2 kernel; K is a Matérn kernel with $\nu =1/2$ (left), $\nu =3/2$ (middle) and $\nu =5/2$ (right); $d=1$.
Figure 5
MSE of $\widehat{\mathsf{ISE}}$. Statistics over 500 realisations. Q is a Cauchy kernel, K is as in Figure 4; $d=1$.
4.1 Robustness with Respect to Assumed GP Model
We address robustness by studying how much the MSE of $\widehat{\mathsf{ISE}}$ is affected by model mismatch, i.e., by estimating the ISE assuming that the kernel is $K(\cdot ,\cdot ;\theta )$ when in fact the data generating model uses $Q(\cdot ,\cdot ;{\theta _{0}})$. Figure 4 plots empirical estimates of $\mathcal{R}({\zeta _{m}},{\mathcal{F}_{n}})$. Kernel Q is the Matérn 3/2 kernel, ${\mathbf{X}_{n}}$ has $n=10$ points and $d=1$, ${\theta _{0}}=n$. The panels correspond to different values of the regularity parameter – $\nu =1/2,3/2,5/2$, from left to right – of the Matérn kernel K.
The three curves in each plot correspond to different sizes of ${\mathbf{Z}_{m}}$ (with ${\mathbf{Z}_{m}}$ depending on K, and thus on ν): $m\in \{5,10,20\}$ (in blue, red and yellow, respectively), plotting $\mathcal{R}$ as a function of $\theta /(n+m)$. The black stars indicate $\theta ={\theta _{0}}$. Comparison of the three panels confirms the anticipated robustness of the estimator. When K has higher regularity than Q, as in the rightmost panel ($\nu =5/2$), the curves are almost identical to the central panel, where the correct model is used. However, the assumption of a less regular model, as in the leftmost panel, may significantly degrade performance. The estimators are reasonably robust with respect to precise choice of the scale parameter if values $\theta \simeq {\theta _{c}}$ are used.
Figure 5 reproduces the same study for simulations from a process with a Cauchy kernel and for a larger ${\mathbf{Z}_{m}}$: $m\in \{10,20,30\}$ (left to right). As in previous figure, K is a Matérn kernel and the three panels correspond to different smoothness parameters $\nu \in \{1/2,3/3,5/2\}$. Here the simulated model has a weaker regularity than the models assumed, and a noticeable performance degradation is now observed for the smaller designs and the more regular Matérn kernel with $\nu =5/2$. Similar results were obtained when simulating from other models and for higher values of d.
Finally, Figure 6 shows, for the same validation designs ${\mathbf{Z}_{m}}$ as in Figure 4, the MSE of ${\widehat{\mathsf{ISE}}_{un}}$ given by equation (2.2), estimated over 500 realisations of a GP with the same Matérn 3/2 model. We can see that proper residual weighting leads to a significant decrease of the estimation error, which is nearly one order of magnitude larger in Figure 6 than for the optimal BQ weighting used in Figure 4.
The experiments in this section suggest a rule-of-thumb to choose the kernel K used by the design algorithm: K should model functions with a reasonably large degree of smoothness (Matérn 3/2 was found to be a good compromise), with a scale parameter θ dependent on the sizes of the learning and validation sets. For the Matérn family used in our experiments a good choice is $\theta \simeq {(n+m)^{1/d}}$, automatically adjusting to the actual total number of residual samples.
Figure 6
MSE of ${\widehat{\mathsf{ISE}}_{un}}$. Statistics over 500 realisations for the example in Figure 4.
4.2 Impact of ${\overline{K}_{|n}}$
Our main novel contribution is the identification of ${\overline{K}_{|n}}$ as the kernel that appears in the MMD that the validation measure ${\zeta _{m}}$, both its weights and its support, must minimise. One may question the importance of using the non-stationary conditional kernel ${\overline{K}_{|n}}$ to find ${\mathbf{Z}_{m}}$, instead of directly using kernel K. We now compare the performance of the empirical estimator $\widehat{\mathsf{ISE}}$ with ${\mathbf{Z}_{m}}$ determined by SBQ for kernel ${\overline{K}_{|n}}$, as in Section 3.2, which from now on we denote by ${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$, with the estimates produced by a validation measure ${\zeta _{K}^{\mathsf{BQ}}}$ whose support ${\mathbf{Z}_{m}}$ is incrementally found by SBQ for kernel K, the continuation of ${\mathbf{X}_{n}}$ that is optimal to integrate the function f. Independently of how ${\mathbf{Z}_{m}}$ was found, the validation measures ${\zeta _{m}}$ used by the estimators $\widehat{\mathsf{ISE}}$ always have optimal weights given by (3.2).
Figures 7 ($d=1$) and 8 ($d=3$) show the empirical MSE of $\widehat{\mathsf{ISE}}$ for ${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$ (black lines) and ${\zeta _{K}^{\mathsf{BQ}}}$ (red lines) observed when Q is the Matérn 3/2 kernel (top) and the Cauchy kernel (bottom), for a learning design of size $n=10\hspace{0.1667em}d$. From left to right, K is a Matérn kernel with $\nu =1/2$, $3/2$ and $5/2$. The size of the validation designs, $m\in \{10\hspace{0.1667em}d,20\hspace{0.1667em}d,30\hspace{0.1667em}d\}$, is indicated by the line symbols (+, ⋆ and ∘, respectively). We can see that the two estimators display similar performance and robustness with respect to mis-modelling. When m is small ${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$ often yields smaller MSE, see top curves, but the red and black curves are almost coincident for the larger values of m. These results, which are representative of those obtained for other choices of Q and d, indicate that correct residual weighting is more important than the detailed placement of the validation points ${\mathbf{Z}_{m}}$.
Note that, in the configurations tested, the default rule-of-thumb for the choice of K and θ presented in Section 4.1 leads indeed to good and stable performance.
Figure 7
MSE of $\widehat{\mathsf{ISE}}$ for ${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$ (black) and ${\zeta _{K}^{\mathsf{BQ}}}$ (red) for $m\in \{10,20,30\}$. From left to right $\nu =1/2,3/2,5/2$. Statistics over 500 realisations. Top: Q is the Matérn 3/2 kernel, bottom: Q is the Cauchy kernel; $d=1$.
Figure 8
MSE of $\widehat{\mathsf{ISE}}$ for ${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$ (black) and ${\zeta _{K}^{\mathsf{BQ}}}$ (red) for $m\in \{20,40,60\}$. Top: Q is a Matérn 3/2 kernel; bottom: Q is the Cauchy kernel. K is always a Matérn kernel, from left to right $\nu =1/2,3/2,5/2$; $d=3$.
Figure 9
MSE of $\widehat{\mathsf{ISE}}$ for ${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$ (black solid lines), ${\zeta _{K}^{\mathsf{KH}}}$ (red dashed lines) and ${\zeta ^{\mathsf{KH}\mathrm{\star }}}$ (dotted green lines), for $m=10$ (+), $m=20$ (⋆) and $m=30$ (∘). From left to right $\nu =1/2,3/2,5/2$. Q is a Matérn 3/2 kernel; $d=1$.
Figure 10
MSE of $\widehat{\mathsf{ISE}}$ for ${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$ (black solid lines), ${\zeta _{K}^{\mathsf{KH}}}$ (red dashed lines) and ${\zeta ^{\mathsf{KH}\mathrm{\star }}}$ (dotted green lines), for $m=20$ (+), $m=30$ (⋆) and $m=60$ (∘). From left to right $\nu =1/2,3/2,5/2$. Q is a Matérn 3/2 kernel; $d=2$.
4.3 Comparison with Kernel Herding
Considering only validation measures ζ with uniform weights $1/m$, standard KH also minimises an MMD, incrementally extending ${\mathbf{Z}_{m}}$ with
\[ {\mathbf{z}_{m+1}}\in \underset{\mathbf{x}\in {\mathcal{X}_{L}}\setminus \{{\mathbf{Z}_{m}}\cup {\mathbf{X}_{n}}\}}{\operatorname{arg\,max}}{P_{{\overline{K}_{|n}}}}(\mathbf{x})-{\overline{K}_{|n}}(\mathbf{x},{\mathbf{Z}_{m}}){\mathbf{1}_{m}}/m\hspace{0.1667em},\]
where ${\mathbf{1}_{m}}$ denotes the m-dimensional vector with all components equal to one. Since KH has smaller complexity than SBQ, and the results of the previous section suggest that optimal choice of ${\mathbf{Z}_{m}}$ is less important than correct determination of the weights ${\mathbf{w}_{i}}$, we compare now ${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$ to two other validation measures, whose designs ${\mathbf{Z}_{m}}$ are found by extending ${\mathbf{X}_{n}}$ by KH: ${\zeta _{K}^{\mathsf{KH}}}$, that performs KH for kernel K, and ${\zeta ^{\mathsf{KH}\mathrm{\star }}}$ that uses ${\overline{K}_{|n}}$. As we will see, the SBQ design is a superior alternative, both in terms of performance and numerical stability, to the KH designs.Since ${\zeta _{K}^{\mathsf{KH}}}$ considers only, at each step, measures with uniform weights, and ${\zeta ^{\mathsf{KH}\mathrm{\star }}}$ does not take into account the optimal weights that will be applied when ${\mathbf{Z}_{m}}$ is extended to ${\mathbf{Z}_{m+1}}$, we can expect the following ranking of these estimators:
Figures 9 and 10 plot, for $d=1$ and $d=2$, respectively, the MSE of estimators $\widehat{\mathsf{ISE}}$ that use ${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$ (black solid lines), ${\zeta _{K}^{\mathsf{KH}}}$ (red dashed lines) and ${\zeta ^{\mathsf{KH}\mathrm{\star }}}$ (dotted green lines). Kernels (Q and K) and designs sizes m are as in the previous examples, see the figures’ captions. We can see that ${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$ has virtually always smaller MSE than the validation measures using validation designs ${\mathbf{Z}_{m}}$ found by KH, in particular for small design sizes m and the more regular models. It also appears to be more robust with respect to the choice of the GP kernel. We remark that the design found by KH for kernel ${\overline{K}_{|n}}$, i.e., the validation measure ${\zeta ^{\mathsf{KH}\mathrm{\star }}}$ (in green), often leads to the poorest performance. That use of ${\overline{K}_{|n}}$ may lead to worse performance than simply using K has already been noticed in [12], where only validation sets generated with KH were considered.
Moreover, our experiments reveal that the designs ${\zeta ^{\mathsf{KH}\mathrm{\star }}}$ can sometimes lead to ISE estimates with very large errors. This happens when KH places design points close to ${\mathbf{X}_{n}}$. In fact, the implementation of standard KH for kernel ${\overline{K}_{|n}}$ needs careful handling of possible repetition of design points, as already noted in [35] where an algorithm is proposed to accommodate this eventuality. Since the implementation used in Figures 9 and 10 simply imposes ${\mathbf{z}_{m+1}}\notin ({\mathbf{X}_{n}}\cup {\mathbf{Z}_{m}})$, a grid point very close to ${\mathbf{X}_{n}}\cup {\mathbf{Z}_{m}}$ can chosen, as shown below.
Figure 11 shows the designs ${\mathbf{Z}_{m}}$, $m=30$, for Matérn kernels with $\theta ={\theta _{0}}$, $d=1$, and regularity parameter (top to bottom panels) $\nu =1/2,3/2$ and $5/2$. The vertical red lines indicate ${\mathbf{X}_{n}}$ and the black stars, green circles and red squares the position of points of ${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$, ${\zeta ^{\mathsf{KH}\mathrm{\star }}}$ and ${\zeta _{K}^{\mathsf{KH}}}$, respectively. A vertical offset is used to facilitate the visualisation of each design (from top to bottom, ${\zeta _{K}^{\mathsf{KH}}}$, ${\zeta ^{\mathsf{KH}\mathrm{\star }}}$ and ${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$). Remark first that the SBQ designs are always space-filling continuations of ${\mathbf{X}_{n}}$, presenting a good stability with respect to ν, mainly moving points closer to the boundaries of $\mathcal{X}$ when ν increases. The other two designs place a few points in the vicinity of ${\mathbf{X}_{n}}$.
Figure 11
Designs for $\theta ={\theta _{0}}$ in Figure 9. From top to bottom: $\nu =1/2,3/2,5/2$. ${\mathbf{X}_{n}}$: red ∗; ${\mathbf{Z}_{m}^{\mathsf{BQ}\mathrm{\star }}}$: black ∗; ${\mathbf{Z}_{m}^{\mathsf{KH}}}$: red □ and ${\mathbf{Z}_{m}^{\mathsf{KH}\mathrm{\star }}}$: green ∘.
4.4 Properties of the Design Measures ${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$
For the same set of kernels K and design sizes considered in Figure 4 (with $d=1$), we plot in Figure 12 the sum of the design weights, $S(\theta )={\textstyle\sum _{i}}{\mathbf{w}_{i}}(\theta )$, as a function of the (normalised) scale parameter of K. K is always a Matérn kernel, with regularity parameter $\nu =1/2,\hspace{0.1667em}3/2,5/2$ (top to bottom), as indicated in the legends. The learning design ${\mathbf{X}_{n}}$ ($n=10$) is the same for all cases.
Figure 12
$S={\textstyle\sum _{i}}{\mathbf{w}_{i}}$ for measure ${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$, $n=10$.
Three values of m are considered, $m=10,\hspace{0.1667em}20$ and 30 (blue, red and cyan curves, respectively). In each curve the black squares indicate the value ${\theta _{0}}={n^{1/d}}$. We can see that $S(\theta )$ increases with m. For θ larger than a certain value, S becomes nearly constant, with a value smaller than one (note that the value ${\theta _{c}}$ prescribed by our rule of thumb for the scale parameter, which corresponds to the normalised value of θ equal to one, is always inside this range) while for $\theta ={n^{1/d}}$ (indicated by a square), under the more regular model with a Matérn 5/2 kernel, S may be larger than 1.
Figure 13
${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$ in Figure 12, $\theta ={n^{1/d}}$. Validation design sizes: $m=10$ (blue), $m=20$ (red) and $m=30$ (cyan).
Figure 14
${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$ in Figure 12, $\theta ={(n+m)^{1/d}}$. Validation design sizes: $m=10$ (blue), $m=20$ (red) and $m=30$ (cyan).
Figure 15
${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$ in Figure 12, $\theta =2\hspace{0.1667em}{(n+m)^{1/d}}$. Validation design sizes: $m=10$ (blue), $m=20$ (red) and $m=30$ (cyan).
Figures 13, 14 and 15 present the designs for three values of θ: $\theta ={n^{1/d}}$ (the value used in the simulations of Figure 4, and indicated by the squares in Figure 12), for the value prescribed by our rule of thumb, $\theta ={(n+m)^{1/d}}$, and for $\theta =2\hspace{0.1667em}{(n+m)^{1/d}}$, the upper limit considered in Figure 4. In the figures, the weights of ${\zeta _{m}}$ are shown multiplied by m, to enable comparison. The distinct kernels K correspond to the three row panels, as indicated in the figure (with regularity increasing from top to bottom). The dotted black vertical lines (the same in the three panels) indicate the learning design ${\mathbf{X}_{n}}$. The colours code the validation design size: $m=10$ in blue, $m=20$ in red and $m=30$ in cyan. Remark the striking similarity of the validation measures obtained for the different kernels in Figures 14 and 15, supporting our observations concerning the robustness of the estimator. The figures also show that the validation designs are, as expected, space-filling continuations of ${\mathbf{X}_{n}}$, and that as m grows (remember ${\mathbf{Z}_{10}}\subset {\mathbf{Z}_{20}}\subset {\mathbf{Z}_{30}}$) the holes of ${\mathbf{X}_{n}}\cup {\mathbf{Z}_{m}}$ are refined. Note, however, the slow rate of population of the immediate neighborhood of ${\mathbf{X}_{n}}$, ${\mathbf{Z}_{m}}$ tending first, as m grows, to refine the interior of the wider holes of ${\mathbf{X}_{n}}$. For the Matérn 5/2 kernel and $\theta ={n^{1/d}}$ a few weights, corresponding to validation points close to ${\mathbf{X}_{n}}$, become very large, see Figure 13, explaining that $S(\theta )$ may be larger than one on the bottom panel of Figure 12. Analysis of the validation measures obtained assuming the larger value of θ in Figure 15 shows that, as the assumed correlation length decreases, ${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$ tends to a uniform measure, all weights having now a similar value. Note that even in this situation, use of the BQ measure, which down-weights the squared residuals, leads to a smaller error than use of the simple uniform measure over ${\mathbf{Z}_{m}}$, as the comparison of Figures 4 and 6 in Section 4.1 has shown.
5 “Real” Models
We study in this section the behaviour of the validation method proposed considering deterministic functions. More precisely, we consider the following multidimensional functions:
-
• The 2-dimensional drag model that describes the quasi-steady drag coefficient ${C_{D}}$ of a spherical particle in a compressible flow, see [30]:\[\begin{array}{r@{\hskip10.0pt}c@{\hskip10.0pt}l}& & \displaystyle {C_{D}}(M,{R_{e}})=\left(\alpha ({R_{e}})-\beta ({R_{e}})\right)\xi (M,{R_{e}})+\beta (Re)\\ {} & & \displaystyle \hspace{-22.76228pt}\alpha ({R_{e}})=\frac{24}{{R_{e}}}\left(1+0.107{R_{e}^{0.867}}\right)+0.646{\left(1+\frac{861}{{R_{e}^{0.634}}}\right)^{-1}}\\ {} & & \displaystyle \hspace{-22.76228pt}\beta ({R_{e}})=24\left(1+0.118{R_{e}^{0.813}}\right){R_{e}}+0.69{\left(1+\frac{3550}{{R_{e}^{0.793}}}\right)^{-1}}\\ {} & & \displaystyle \xi (M,{R_{e}})={\sum \limits_{i=1}^{3}}{f_{i}}(M)\prod \limits_{j\ne i,j\in \{1,2,3\}}\frac{\log {R_{e}}-{C_{j}}}{{C_{i}}-{C_{j}}}\end{array}\]where\[\begin{aligned}{}{C_{1}}& =6.48,\hspace{2.5pt}{C_{2}}=8.93,\hspace{2.5pt}{C_{3}}=12.21\\ {} f(M)& =\mathbf{a}+\mathbf{b}M+\mathbf{c}{M^{2}}+\mathbf{d}{M^{2}}-\mathbf{g}(M)\\ {} \mathbf{a}& =-{\left[2.963\hspace{2.5pt}6.617\hspace{2.5pt}5.866\right]^{T}}\\ {} \mathbf{b}& ={\left[4.392\hspace{2.5pt}12.11\hspace{2.5pt}11.57\right]^{T}}\\ {} \mathbf{c}& =-{\left[1.169\hspace{2.5pt}6.501\hspace{2.5pt}6.665\right]^{T}}\\ {} \mathbf{d}& ={\left[-0.027\hspace{2.5pt}1.182\hspace{2.5pt}1.312\right]^{T}}\\ {} \mathbf{g}(M)& =\left[\begin{array}{c}0.233{e^{(1-M)/0.11}}\\ {} 0.174{e^{(1-M)/0.01}}\\ {} 0.350{e^{(1-M)/0.012}}\end{array}\right]\end{aligned}\]
-
• The 7-dimensional piston model, that describes the circular motion of a piston within a cylinder, see [1]:\[\begin{aligned}{}C(\mathbf{x})& =2\pi \sqrt{\frac{M}{k+{S^{2}}\frac{{P_{0}}{V_{0}}}{{T_{0}}}\frac{{T_{a}}}{{V^{2}}}}}\\ {} V& =\frac{S}{2k}\left(\sqrt{{A^{2}}+4k{T_{A}}\frac{{P_{0}}{V_{0}}}{{T_{0}}}}-A\right)\\ {} A& ={P_{0}}S-+19.62M-k\frac{{V_{0}}}{S}\end{aligned}\]with ${x_{1}}=M\in [30,60]$, ${x_{2}}=S\in [0.005,0.020]$, ${x_{3}}={V_{0}}\in [0.002,0.010]$, ${x_{4}}=k\in [1000,5000]$, ${x_{5}}={P_{0}}\in [90000,110000]$, ${x_{6}}={T_{a}}\in [290,296]$ and ${x_{7}}={T_{0}}\in [340,360]$. In [28], a screening study is presented, indicating that only variables ${x_{i}}$ for $i\le 4$ are relevant. For this reason we consider $C(\mathbf{x})$ only as a 4-dimensional function, with the remaining three input variables (${x_{5}}$, ${x_{6}}$ and ${x_{7}}$) being fixed to the mid-point of the corresponding intervals.
The functions generated by the models above cannot be well interpolated using simple kriging unless the design size is very large, having a smooth tendency which, when not taken into account, leads to a residual signal $\epsilon (\mathbf{x})$ that strongly departs from the stationarity hypothesis assumed in this work. For that reason, we consider the estimation of ISE for interpolators of f of the form
Above, $P(\mathbf{x})$ is the complete q-degree polynomial obtained by least squares fit to the n observations ${\mathbf{y}_{n}}$ over the learning design ${\mathbf{X}_{n}}$. The term $g(\mathbf{x})$ is the simple kriging interpolator of
In all cases, ${\mathbf{X}_{n}}$ is a space filling design determined by Kernel herding for a spherical Matérn 3/2 kernel with $\theta ={n^{1/d}}$, and all validation designs ${\mathbf{Z}_{m}}$ assume a spherical Matérn kernel (several settings of its correlation length are studied). All experiments reported in this section consider $m=n/2$.
We compare the performance of the ISE estimator proposed in the paper, using the validation measure ${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$, with the simple empirical estimator ${\widehat{\mathsf{ISE}}_{un}}$ that uses a uniform validation distribution with support the validation design ${\mathbf{Z}_{m}}$, see (2.2). All validation measures ${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$ assume a Matérn 3/2 kernel. Learning designs ${\mathbf{X}_{n}}$ are always space filling, found by Kernel Herding for a Matérn kernel with parameter $\theta ={n^{1/d}}$.
The robustness of the estimator with respect to the assumed value of the range parameter of the covariance of the GP model is studied by showing three panels, corresponding to $\theta \in \{{m^{1/d}},{n^{1/d}},{(n+m)^{1/d}}\}$ (increasing from left to right). The figures display grouped bar plots, corresponding to increasing sizes of ${\mathbf{X}_{n}}$ (and thus of ${\mathbf{Z}_{m}}$): $n\in \{10,20,30,40,50\}d$. The blue bars correspond to the true value, the red bars to ${\widehat{\mathsf{ISE}}^{\mathsf{BQ}\mathrm{\star }}}$, and the yellow bars to ${\widehat{\mathsf{ISE}}_{un}}$.
5.1 Drag Model
For the drag model ${P_{q,n}}$ has degree $q=1$.
Figure 16
Drag model. True ISE (blue) SBQ estimate (red) and ${\widehat{\mathsf{ISE}}_{un}}$ (yellow), $\eta (\mathbf{x})={P_{q,n}}(\mathbf{x})+g(\mathbf{x})$. From left to right: $\theta ={m^{1/d}},{n^{1/d}},{(n+m)^{1/d}}$.
Figure 17
Drag model. True ISE (blue) SBQ estimate (red) and ${\widehat{\mathsf{ISE}}_{un}}$ (yellow), $\eta (\mathbf{x})={P_{q,n}}(\mathbf{x})$. From left to right: $\theta ={m^{1/d}},{n^{1/d}},{(n+m)^{1/d}}$.
Figure 18
Piston model. True ISE (blue) BQ estimate (red) and ${\widehat{\mathsf{ISE}}_{un}}$ (yellow). Interpolator, $\eta (\mathbf{x})=P(\mathbf{x})+g(\mathbf{x})$. From left to right: $\theta ={m^{1/d}},{n^{1/d}},{(n+m)^{1/d}}$; $p=3$.
Figure 16 confirms the robustness of ${\widehat{\mathsf{ISE}}^{\mathsf{BQ}\mathrm{\star }}}$ with respect to the choice of θ. Also, the anticipated overestimation of the empirical estimator, as well as the negative bias of the BQ estimator, are apparent in the figure. Note that different values of the assumed θ – corresponding to the three panels of the figure – lead both to distinct validation weights and to different validation designs ${\mathbf{Z}_{m}}$. As we had anticipated, the validation weights of ${\widehat{\mathsf{ISE}}^{\mathsf{BQ}\mathrm{\star }}}$ compensate for the precise location of the points of ${\mathbf{Z}_{m}}$, and the variation of the estimates across the three panels in Figure 16 is minor. On the contrary, the empirical estimator ${\widehat{\mathsf{ISE}}_{un}}$ displays an important sensitivity with respect to the exact placement of ${\mathbf{Z}_{m}}$, changing significantly across the three panels.
Figure 17 considers $\eta (\mathbf{x})={P_{q,n}}(\mathbf{x})$, violating the interpolating assumption. As we might expect, the SBQ estimate of $\mathsf{ISE}$ (red bars) has now an important (negative) bias. Somewhat surprisingly, when the value of n is small the positive bias of ${\widehat{\mathsf{ISE}}_{un}}$ (yellow bars) partially compensates, in this example, for the non-zero residuals of η over ${\mathbf{X}_{n}}$.
5.2 Piston Model
We switch now to the higher dimensional piston model ($d=4$), for which ${P_{q,n}}$ is a polynomial of degree $q=2$. Figure 18 is the equivalent of Figure 16. It confirms the superior performance and robustness of ${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$ over ${\widehat{\mathsf{ISE}}_{un}}$.
When η is not an interpolator a behaviour similar to the one observed for the drag model has been observed.
6 Conclusions
The paper presents an estimator for the ISE of an interpolator based on knowledge of the design on which it has been learned, defined as the ISE for a finitely supported validation measure. The estimator proposed is the optimal MSE linear estimator under the assumption that the interpolated function is a realisation from a Gaussian process with known statistical moments. The support and weights of the validation measure are found by minimising an MMD for a non-stationary kernel that is adapted to the learning design, and a nested sequence of validation designs is greedily determined by SBQ. A default rule is proposed to select the covariance kernel of the assumed model.
Numerical experiments on both simulations from nominal Gaussian processes and on two real models of small dimension confirm the superior performance of the proposed estimator when compared to common estimation by the simple empirical average of the observed squared residuals.
The interpretation of the ISE estimator in terms of an interpolation of the squared residuals explains the utmost importance of accounting for the correct shape of their second-order moment. Moreover, it unriddles the observed robustness of the estimator with respect to the covariance of the assumed GP model.
The work presented suggests several directions for future developments. One concerns the determination of indicators of the quality of the ISE estimate itself, ideally given by the risk function that is optimised. These could both be used to define stopping rules, indicating that incorporation of further residual observations should not yield a significant improvement on the confidence of the current ISE estimate, or to flag poor performance of the current interpolator, and trigger its update including some of the residuals observed over ${\mathbf{Z}_{m}}$ in the learning dataset ${\mathcal{F}_{n}}$. A major difficulty is related to the dependency of the MSE of the interpolator on the assumed process variance, which is known to be difficult to estimate [22]. A possible source of suboptimality of the estimator presented concerns the restriction to a linear estimator. The extension to more general estimators while preserving at the same time the robustness property of the method forms a challenging objective. Finally, we believe that the analysis presented here suggests possible approaches to defining (down-)weighted CV estimators that perform better than standard ones; this is the subject of ongoing work.