Validation of Machine Learning Prediction Models

Pronzato, Luc; Rendas, Maria-João

doi:10.51387/23-NEJSDS50

Volume 1, Issue 3 (2023), pp. 394–414

Pub. online: 7 November 2023 Type: Methodology Article

Open Access

Area: Machine Learning and Data Mining

Accepted
9 June 2023

Published
7 November 2023

Abstract

We address the estimation of the Integrated Squared Error (ISE) of a predictor $\eta (x)$ of an unknown function f learned using data acquired on a given design ${\mathbf{X}_{n}}$. We consider ISE estimators that are weighted averages of the residuals of the predictor $\eta (x)$ on a set of selected points ${\mathbf{Z}_{m}}$. We show that, under a stochastic model for f, minimisation of the mean squared error of these ISE estimators is equivalent to minimisation of a Maximum Mean Discrepancy (MMD) for a non-stationary kernel that is adapted to the geometry of ${\mathbf{X}_{n}}$. Sequential Bayesian quadrature then yields sequences of nested validation designs that minimise, at each step of the construction, the relevant MMD. The optimal ISE estimate can be written in terms of the integral of a linear reconstruction, for the assumed model, of the square of the interpolator residuals over the domain of f. We present an extensive set of numerical experiments which demonstrate the good performance and robustness of the proposed solution. Moreover, we show that the validation designs obtained are space-filling continuations of ${\mathbf{X}_{n}}$, and that correct weighting of the observed interpolator residuals is more important than the precise configuration ${\mathbf{Z}_{m}}$ of the points at which they are observed.

1 Introduction and Motivation

Using machine learning models in real-world applications, for instance for industrial optimisation and testing [9, 14], banking [31, 2, 24], or as tools in the context of social services [15], imposes stringent requirements on their validation. The same holds in the framework of computer experiments (see for instance [25, 27]), where numerically efficient machine learning models are used as controlled-error approximations of mathematical models with prohibitive computational complexity [6, 39, 10].

Model validation ideally resorts to a reserved test set, i.e. to evaluations of the modelled function on data points that have been used neither to select nor to train the machine learning model [5, 46, 18]. Using the errors of the model on this test set enables the assessment of the model quality, using for instance estimates of the mean-squared error in the context of regression problems, or the rate of labeling errors of classifiers. This is the setting addressed in this paper. When such a test set cannot be made available, model validation is most commonly done by cross-validation [23, 17, 7], relying on the errors of models learnt only on a subset of the learning set to infer the error of the model that integrates the entire dataset.

This paper proposes a methodology to estimate the quality of an interpolator learned on a given experimental design. More precisely, we suppose that data gathered on the points of an experimental design ${\mathbf{X}_{n}}=\{{\mathbf{x}_{1}},\dots ,{\mathbf{x}_{n}}\}$ with n points in a compact set $\mathcal{X}$ has been used to build a predictor of the value of the function $f\hspace{0.1667em}:\hspace{0.1667em}\mathcal{X}\to \mathbb{R}$ that produced the collected samples.

We denote by ${\mathbf{y}_{n}}={(f({\mathbf{x}_{1}}),\dots ,f({\mathbf{x}_{n}}))^{\top }}$ the vector collecting the n evaluations of f at the design points ${\mathbf{x}_{i}}$, by ${\mathcal{F}_{n}}=({\mathbf{X}_{n}},{\mathbf{y}_{n}})$ the learning dataset, and by ${\eta _{{\mathcal{F}_{n}}}}(\mathbf{x})$ the resulting predictor of $f(\mathbf{x})$. The quality of ${\eta _{{\mathcal{F}_{n}}}}$ is assessed through a widely used measure of the precision of interpolators, the Integrated Squared Error (ISE):

(1.1)

\[ \mathsf{ISE}({\eta _{{\mathcal{F}_{n}}}})={\int _{\mathcal{X}}}{\left[{\eta _{{\mathcal{F}_{n}}}}(\mathbf{x})-f(\mathbf{x})\right]^{2}}\hspace{0.1667em}\mu (\mathrm{d}\mathbf{x})\hspace{0.1667em},\]

see, e.g., [38] for an early reference. In the definition above the (user-defined) measure μ enables penalisation of the interpolation errors over regions of $\mathcal{X}$ which are considered to be of particular importance. We stress that we consider that the experimental design ${\mathbf{X}_{n}}$ – also referred to as the “learning design” – is given, making no assumptions on how it has been chosen.

Estimation of the integral (1.1) must necessarily resort to the evaluation of the prediction error $\varepsilon (\mathbf{x})=f(\mathbf{x})-{\eta _{{\mathcal{F}_{n}}}}(\mathbf{x})$ over only a finite set of points ${\mathbf{Z}_{m}}=\{{\mathbf{z}_{1}},\dots ,{\mathbf{z}_{m}}\}\subset \mathcal{X}$, which we designate by “validation design”. The integral is then approximated by replacing μ by a point mass measure $\zeta =\zeta (\mathbf{w},{\mathbf{Z}_{m}})={\textstyle\sum _{i}}{\mathbf{w}_{i}}{\delta _{{\mathbf{z}_{i}}}}$ supported on ${\mathbf{Z}_{m}}$ only. We generically refer to ζ as the validation measure, using the notation ${\zeta _{m}}$ to make explicit the size of the validation set. Although ζ is not necessarily the uniform distribution supported on ${\mathbf{Z}_{m}}$, with a slight abuse of terminology we refer to the corresponding ISE estimators

(1.2)

\[ \widehat{\mathsf{ISE}}({\eta _{{\mathcal{F}_{n}}}};\zeta )={\sum \limits_{i=1}^{m}}{\mathbf{w}_{i}}{\left[{\eta _{{\mathcal{F}_{n}}}}({\mathbf{z}_{i}})-f({\mathbf{z}_{i}})\right]^{2}}\]

as empirical ISE estimators.
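As a minimal illustration of (1.2), the estimator is just a weighted sum of squared residuals at the validation points. The sketch below assumes NumPy; the function name `empirical_ise`, the toy function `f` and the deliberately poor predictor `eta` are hypothetical choices made for illustration only and are not taken from the paper.

```python
import numpy as np

def empirical_ise(predictor, f, Z, w):
    """Weighted sum of squared residuals at the validation points, as in (1.2)."""
    Z = np.asarray(Z, dtype=float)
    w = np.asarray(w, dtype=float)
    # Residuals of the predictor at each validation point z_i.
    residuals = np.array([predictor(z) - f(z) for z in Z])
    return float(np.sum(w * residuals ** 2))

# Toy setting on [0, 1] (assumed for illustration).
f = lambda x: np.sin(2.0 * np.pi * x)   # "unknown" function
eta = lambda x: 0.0 * x                 # a deliberately poor predictor
Z = np.array([0.25, 0.75])              # validation design Z_m, m = 2
w = np.array([0.5, 0.5])                # uniform weights summing to one

ise_hat = empirical_ise(eta, f, Z, w)
```

In this toy case both validation points sit at the extrema of `f`, so the uniformly weighted estimate is larger than the exact value 1/2 of the integral of $\sin^{2}(2\pi x)$ over $[0,1]$ for the uniform μ.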

We address the choice of the validation measure ζ – both of the validation design ${\mathbf{Z}_{m}}$ and of the validation weights w – and investigate the properties of the resulting estimates $\widehat{\mathsf{ISE}}({\eta _{{\mathcal{F}_{n}}}};\zeta )$ given by (1.2). The algorithms presented are iterative, defining increasing sequences of nested validation designs ${\mathbf{Z}_{m}}\subset {\mathbf{Z}_{m+1}}\subset {\mathbf{Z}_{m+2}}\subset \cdots \hspace{0.1667em}$ such that the performance of $\widehat{\mathsf{ISE}}({\eta _{{\mathcal{F}_{n}}}};\zeta )$ improves as m increases. A preliminary version of this work has been presented in [12], in the context of a comprehensive comparison of validation methodologies.

The paper is organised as follows. Section 2.1 first relates the ISE estimators (1.2) to other ISE estimators. Then, assuming that the interpolated function f is a realisation of a Gaussian Process (GP) with known moments, we present in Section 2.2 a computable criterion $\mathcal{R}(\zeta ,{\mathcal{F}_{n}})$ that measures the precision of empirical estimators of the form (1.2). In Section 3 we discuss optimisation of $\mathcal{R}(\zeta ,{\mathcal{F}_{n}})$, detailing application of related existing algorithms to the specific conditions of the validation problem of interest here, and revealing an instrumental interpretation of the corresponding “optimal” empirical $\mathsf{ISE}$ estimators. Since the “optimal” validation measure depends on the assumed GP model, the robustness and performance of the validation methodology presented are investigated numerically in Section 4, leading to two major conclusions. One concerns the validation weights w, stating that to avoid overestimation of $\mathsf{ISE}({\eta _{{\mathcal{F}_{n}}}})$ the contributions of the individual errors $\varepsilon ({\mathbf{z}_{i}})$ to $\widehat{\mathsf{ISE}}({\eta _{{\mathcal{F}_{n}}}};\zeta )$ must be down-weighted – with respect to taking ζ as the uniform distribution over ${\mathbf{Z}_{m}}$. The second concerns the geometry of the validation design ${\mathbf{Z}_{m}}$, whose optimality is seen to be much less important than the correct choice of the weights w. Based on these numerical studies we propose a default choice for the covariance kernel of the GP model used, including its scale parameter. The numerical studies of Section 4 resort to simulation from selected Gaussian processes, and consider only optimal kriging interpolators. In Section 5 we present results on two “real case” models and for more general interpolators, confirming the robustness and performance of the proposed estimator. Finally, Section 6 summarises our findings and proposes some directions for future work.

Throughout the manuscript we frequently resort to the notion of space-filling designs, i.e., designs whose points are evenly spread over $\mathcal{X}$. This notion has been extensively studied in the experimental design literature, in particular in the context of identification of surrogate models of computer experiments, and several mathematical criteria – e.g. discrepancy, or the classical minimax- and maximin-distance criteria, also called covering and packing radii, see [19, 32, 34] – have been proposed to measure how much a given design is space filling. In this paper, we use the term in a rather informal manner, meaning that the points of the design are well spread over $\mathcal{X}$, no design point being too close to the remaining points, so that for the majority of the usual space-filling criteria mentioned above it should be considered as a good design.

2 A Criterion for Validation Measures

Since f is unknown, we can at best expect to find an ISE estimator that performs well for most functions f consistent with dataset ${\mathcal{F}_{n}}$. To characterise this set of functions we adopt the Gaussian process framework – briefly recalled below – enabling us to subsequently derive a criterion to choose the validation measure ζ.

Before doing that, the next section puts our approach in perspective in relation to other (non-parametric) model validation methods.

2.1 Empirical ISE Estimation

Non-parametric estimation of the ISE of a computational model learned on a dataset ${\mathcal{F}_{n}}$ is most commonly done using ${\mathcal{F}_{n}}$ itself. In cross-validation (CV), see e.g. [8, 4], the residuals ${\varepsilon _{i}^{cv}}={\mathbf{y}_{i}}-{\eta _{{\mathcal{F}_{n}}\setminus ({\mathbf{x}_{i}},{\mathbf{y}_{i}})}}({\mathbf{x}_{i}})$ at each data point $({\mathbf{x}_{i}},{\mathbf{y}_{i}})$ of a predictor fit to all other $n-1$ points of ${\mathcal{F}_{n}}$ are computed, and $\mathsf{ISE}$ is estimated by their average:

(2.1)

\[ {\widehat{\mathsf{ISE}}_{cv}}=\frac{1}{n}{\sum \limits_{i=1}^{n}}{\left({\varepsilon _{i}^{cv}}\right)^{2}}\hspace{0.1667em}.\]

The setup considered in this paper is in some sense dual to CV. On the one hand, CV requires more information about η, assuming the ability to build the n new predictors ${\eta _{{\mathcal{F}_{n}}\setminus ({\mathbf{x}_{i}},{\mathbf{y}_{i}})}}$ (one for each point that is “left out”) and thus assumes knowledge of how ${\eta _{{\mathcal{F}_{n}}}}$ is learned, while we consider ${\eta _{{\mathcal{F}_{n}}}}$ as a black-box model delivered by a third party, using an undisclosed modelling approach. On the other hand, CV requires no additional observations of f, while $\widehat{\mathsf{ISE}}({\eta _{{\mathcal{F}_{n}}}};\zeta )$ requires m new evaluations, one at each point of ${\mathbf{Z}_{m}}$.

Given the observations of f over a validation set ${\mathbf{Z}_{m}}$, a straightforward estimate of the ISE is the simple arithmetic mean of the squared values of the m residuals ${\varepsilon _{i}}=f({\mathbf{z}_{i}})-{\eta _{{\mathcal{F}_{n}}}}({\mathbf{z}_{i}})$ observed over the ${\mathbf{z}_{i}}\in {\mathbf{Z}_{m}}$:

(2.2)

\[ {\widehat{\mathsf{ISE}}_{un}}=\frac{1}{m}{\sum \limits_{i=1}^{m}}{\varepsilon _{i}^{2}}\hspace{2.5pt}\hspace{0.1667em},\]

a special case of (1.2), obtained by letting ζ be the uniform distribution over ${\mathbf{Z}_{m}}$: $\zeta =(1/m){\textstyle\sum _{i}}{\delta _{{\mathbf{z}_{i}}}}$.

We argue below that there is no rationale for uniform weighting of the observed residuals. Let ${p_{\eta }}$ denote the (unknown) probability density of the residuals $\varepsilon (\mathbf{x})$ when $\mathbf{x}\sim \mu $, and consider situations where ${\mathbf{Z}_{m}}$ is a space-filling continuation of ${\mathbf{X}_{n}}$, sampling the regions of $\mathcal{X}$ most distant from ${\mathbf{X}_{n}}$. We can then expect $\varepsilon ({\mathbf{Z}_{m}})=\{\varepsilon (\mathbf{z}),\mathbf{z}\in {\mathbf{Z}_{m}}\}$ to be biased towards the upper limit of the support of ${p_{\eta }}$, and thus ${\widehat{\mathsf{ISE}}_{un}}$ to over-estimate $\mathsf{ISE}$. To correct for this biased sampling of the errors, the contribution of each observed residual to $\widehat{\mathsf{ISE}}$ should be adjusted, counterbalancing the anticipated poor sampling of the smallest residual values. The validation measures ζ proposed in this paper automatically implement this variable residual weighting, relying on a prior stochastic model for f to infer how much $\varepsilon ({\mathbf{Z}_{m}})$ is expected to be representative of the errors over the entire $\mathcal{X}$. Moreover, nothing justifies requiring ζ to be a proper probability distribution. If $\varepsilon ({\mathbf{Z}_{m}})$ is not a plausible i.i.d. sample from ${p_{\eta }}$, expression (2.2) cannot be regarded as a Monte Carlo estimate of the ISE integral unless appropriate importance sampling weights are used.

This means that there is no reason to impose that ${\textstyle\sum _{i}}{\mathbf{w}_{i}}=1$, and we thus drop this common constraint, letting ζ be an un-normalised measure dictated by the geometry of ${\mathbf{Z}_{m}}$ relative to ${\mathbf{X}_{n}}$. To substantiate this choice, note that when ${\eta _{{\mathcal{F}_{n}}}}$ is an interpolator, so that $\varepsilon ({\mathbf{x}_{i}})=0$ for all ${\mathbf{x}_{i}}\in {\mathbf{X}_{n}}$, incorporation of these n zero residuals in (2.2), which should lead to a better estimator of $\mathsf{ISE}({\eta _{{\mathcal{F}_{n}}}})$, yields

\[ {\widehat{\mathsf{ISE}}_{un}^{\mathrm{\star }}}=\frac{1}{m+n}{\sum \limits_{i=1}^{m}}{\varepsilon _{i}^{2}}\lt {\widehat{\mathsf{ISE}}_{un}}\hspace{2.5pt}\hspace{0.1667em},\]

for which ${\textstyle\sum _{i}}{\mathbf{w}_{i}}=m/(n+m)\lt 1$. Analysis of the biases of estimators ${\widehat{\mathsf{ISE}}_{un}}$ and ${\widehat{\mathsf{ISE}}_{un}^{\mathrm{\star }}}$ is difficult, since, as discussed above, the residuals observed over the validation design are not an i.i.d. sample from ${p_{\eta }}$. Figure 1 illustrates numerically the performance of the two estimators on a simple example, showing histograms of the errors of estimators ${\widehat{\mathsf{ISE}}_{un}}$ (in blue) and ${\widehat{\mathsf{ISE}}_{un}^{\mathrm{\star }}}$ (in red) over 500 realisations of a Gaussian process. In the top panel, $n=m=15$ while the bottom panel corresponds to a larger learning design, with $n=25$. We can see that in both cases the estimation errors of ${\widehat{\mathsf{ISE}}_{un}^{\mathrm{\star }}}$ are smaller than those of ${\widehat{\mathsf{ISE}}_{un}}$. The two estimators are affected by biases of opposite signs, the (positive) bias of ${\widehat{\mathsf{ISE}}_{un}}$ being larger (even when $n\gt m$) than the (negative) bias of ${\widehat{\mathsf{ISE}}_{un}^{\mathrm{\star }}}$. Larger values of n produce qualitatively similar comparison results (not shown). When n and m are very different, more sophisticated corrections, depending on the distinct effective sampling rates of the learning and validation designs, could be defined and should yield better results. The bottom line is that even this simplistic correction (uniform down-weighting) is able to reduce the error in the estimate of the $\mathsf{ISE}$.

Figure 1

Histograms of the errors of estimators ${\widehat{\mathsf{ISE}}_{un}}$ and ${\widehat{\mathsf{ISE}}_{un}^{\mathrm{\star }}}$ over 500 realisations of a Gaussian process. Top: $n=m=15$. Bottom: $n=25$, $m=15$.
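The two uniform-weight estimators compared in Figure 1 differ only in their normalisation. A minimal sketch, assuming NumPy and an interpolating predictor (so that the n residuals at the learning design are zero); the function names and the residual values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def ise_uniform(residuals):
    """Estimator (2.2): arithmetic mean of the m squared validation residuals."""
    r = np.asarray(residuals, dtype=float)
    return float(np.mean(r ** 2))

def ise_uniform_star(residuals, n):
    """Down-weighted variant: the n zero residuals at the learning design
    are included, so the sum is divided by m + n instead of m."""
    r = np.asarray(residuals, dtype=float)
    return float(np.sum(r ** 2) / (r.size + n))

# Hypothetical residuals at m = 3 validation points, with n = 3 design points.
eps = [1.0, 2.0, 3.0]
est_un = ise_uniform(eps)               # 14 / 3
est_star = ise_uniform_star(eps, n=3)   # 14 / 6
total_weight = len(eps) / (len(eps) + 3)  # sum of weights m / (n + m)
```

Whenever at least one residual is non-zero and n is positive, the second estimate is strictly smaller than the first, which is the inequality displayed above.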

2.2 Choosing the Validation Measure: a GP-Based Criterion

The estimation error $|\widehat{\mathsf{ISE}}({\eta _{{\mathcal{F}_{n}}}};\zeta )-\mathsf{ISE}({\eta _{{\mathcal{F}_{n}}}})|$ is not a computable criterion that we can optimise to choose ζ. A possible approach would be to consider that f belongs to some class of functions $\mathcal{C}$ and optimise the worst estimation performance over all $f\in \mathcal{C}$. Here we follow an alternative and simpler route, assuming that f is a realisation of a Gaussian Process (GP), or Gaussian Random Field, and minimising the second-order moment of the ISE estimation error under the assumed model.

Assume thus that the function f is a sample ${\mathcal{F}_{x}}$ from a GP indexed by $\mathcal{X}$, with known second-order characteristics $\mathsf{E}\{{\mathcal{F}_{x}}{\mathcal{F}_{{x^{\prime }}}}\}={\sigma ^{2}}K(\mathbf{x},{\mathbf{x}^{\prime }})$: $f\sim {\mathcal{GP}^{f}}(m(\mathbf{x}),{\sigma ^{2}}K(\mathbf{x},{\mathbf{x}^{\prime }}))$. The kernel K is supposed to be Strictly Positive Definite (SPD), and, for the sake of simplicity, we consider that $m(\mathbf{x})=\mathsf{E}\{{\mathcal{F}_{x}}\}=0$ for all $\mathbf{x}\in \mathcal{X}$. Extension of the material presented below to the case of a linearly parameterised mean, with $\mathsf{E}\{{\mathcal{F}_{x}}\}={\boldsymbol{\beta }^{\top }}\mathbf{h}(\mathbf{x})$ for a vector $\boldsymbol{\beta }$ of unknown parameters and a vector $\mathbf{h}(\mathbf{x})={({h_{1}}(\mathbf{x}),\dots ,{h_{p}}(\mathbf{x}))^{\top }}$ of p known functions of x is possible via some adaptation.

Under the assumption above $\mathsf{ISE}({\eta _{{\mathcal{F}_{n}}}})$ given by (1.1) is a random variable. The statistical moments of $\widehat{\mathsf{ISE}}({\eta _{{\mathcal{F}_{n}}}};\zeta )$ under this stochastic model for f provide computable and pertinent criteria to choose ζ. We use the Mean Squared Error (MSE) of $\widehat{\mathsf{ISE}}({\eta _{{\mathcal{F}_{n}}}};\zeta )$ given ${\mathcal{F}_{n}}$,

\[\begin{array}{r@{\hskip10.0pt}c@{\hskip10.0pt}l}\displaystyle \mathcal{R}({\zeta _{m}},{\mathcal{F}_{n}})& \displaystyle =& \displaystyle \mathsf{E}\left\{\left.{\left[\mathsf{ISE}({\eta _{{\mathcal{F}_{n}}}})-\widehat{\mathsf{ISE}}({\eta _{{\mathcal{F}_{n}}}};{\zeta _{m}})\right]^{2}}\right|{\mathcal{F}_{n}}\right\}\\ {} & & \displaystyle \hspace{-36.98866pt}=\mathsf{E}\left\{\left.{\left[{\int _{\mathcal{X}}}{[{\mathcal{F}_{x}}-{\eta _{{\mathcal{F}_{n}}}}(\mathbf{x})]^{2}}\hspace{0.1667em}({\zeta _{m}}-\mu )(\mathrm{d}\mathbf{x})\right]^{2}}\right|{\mathcal{F}_{n}}\right\}\hspace{0.1667em},\end{array}\]

as a criterion to choose the validation design: ${\zeta _{m}^{\mathrm{\star }}}({\mathcal{F}_{n}})\in {\operatorname{arg\,min}_{{\zeta _{m}}}}\mathcal{R}({\zeta _{m}},{\mathcal{F}_{n}})$.

The GP assumption defines a prior distribution for f, which given ${\mathcal{F}_{n}}$ can be updated into the posterior distribution of its values over the unobserved points, with mean $\mathsf{E}\{{\mathcal{F}_{x}}|{\mathcal{F}_{n}}\}={\mathbf{k}_{n}^{\top }}(\mathbf{x}){\mathbf{K}_{n}^{-1}}{\mathbf{y}_{n}}$ and covariance $\mathsf{E}\{{\mathcal{F}_{x}}{\mathcal{F}_{{x^{\prime }}}}|{\mathcal{F}_{n}}\}={\sigma ^{2}}{K_{|n}}(\mathbf{x},{\mathbf{x}^{\prime }})$, with ${K_{|n}}$ defined by

(2.3)

\[ {K_{|n}}(\mathbf{x},{\mathbf{x}^{\prime }})=K(\mathbf{x},{\mathbf{x}^{\prime }})-{\mathbf{k}_{n}^{\top }}(\mathbf{x}){\mathbf{K}_{n}^{-1}}{\mathbf{k}_{n}}({\mathbf{x}^{\prime }})\]

for any x, ${\mathbf{x}^{\prime }}$ in $\mathcal{X}$, where

\[\begin{array}{r@{\hskip10.0pt}c@{\hskip10.0pt}l}\displaystyle {\mathbf{k}_{n}}(\mathbf{x})& \displaystyle =& \displaystyle {(K(\mathbf{x},{\mathbf{x}_{1}})\hspace{0.1667em}\dots ,K(\mathbf{x},{\mathbf{x}_{n}}))^{\top }}\\ {} \displaystyle {\{{\mathbf{K}_{n}}\}_{i,j}}& \displaystyle =& \displaystyle K({\mathbf{x}_{i}},{\mathbf{x}_{j}})\hspace{0.1667em},\hspace{2.5pt}i,j=1,\dots ,n\hspace{0.1667em}.\end{array}\]
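The conditional kernel (2.3) can be evaluated with a single linear solve. The sketch below assumes NumPy and, purely for illustration, a Gaussian kernel with a hypothetical parameter `theta`; the paper does not prescribe this kernel here, and the names `gauss_kernel` and `conditional_kernel` are not from the source. Point sets are stored as arrays of shape (number of points, d).

```python
import numpy as np

def gauss_kernel(A, B, theta=10.0):
    """Gaussian kernel matrix between point sets A (N, d) and B (M, d).
    The kernel and the value of theta are assumptions for this sketch."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-theta * d2)

def conditional_kernel(A, B, Xn, kernel=gauss_kernel):
    """K_{|n}(A, B) = K(A, B) - k_n(A)^T K_n^{-1} k_n(B), as in (2.3)."""
    Kn = kernel(Xn, Xn)   # n x n matrix K_n
    kA = kernel(A, Xn)    # row i is k_n(a_i)^T
    kB = kernel(B, Xn)    # row j is k_n(b_j)^T
    # Solve K_n X = k_n(B) rather than forming the inverse explicitly.
    return kernel(A, B) - kA @ np.linalg.solve(Kn, kB.T)

# Learning design with n = 3 points in [0, 1] and an evaluation grid.
Xn = np.array([[0.0], [0.5], [1.0]])
grid = np.linspace(0.0, 1.0, 11).reshape(-1, 1)

K_cond = conditional_kernel(Xn, grid, Xn)              # rows indexed by design points
post_var = np.diag(conditional_kernel(grid, grid, Xn))  # K_{|n}(x, x) on the grid
```

Up to rounding, `K_cond` vanishes identically and `post_var` is zero at the design points, reflecting the fact that the process is known exactly on ${\mathbf{X}_{n}}$.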

The $n\times n$ matrix ${\mathbf{K}_{n}}$ is SPD as K is SPD (we assume that the ${\mathbf{x}_{i}}$ in ${\mathbf{X}_{n}}$ are pairwise distinct). Note that ${K_{|n}}({\mathbf{x}_{i}},\mathbf{x})=0$ for all $\mathbf{x}\in \mathcal{X}$ and all ${\mathbf{x}_{i}}\in {\mathbf{X}_{n}}$. The Integrated Mean Squared Error (IMSE) is thus

\[\begin{array}{r@{\hskip10.0pt}c@{\hskip10.0pt}l}\displaystyle \mathsf{IMSE}({\mathcal{F}_{n}})& \displaystyle =& \displaystyle {\int _{\mathcal{X}}}\mathsf{E}\left\{{\left[{\mathcal{F}_{x}}-{\eta _{{\mathcal{F}_{n}}}}(\mathbf{x})\right]^{2}}|{\mathcal{F}_{n}}\right\}\hspace{0.1667em}\mu (\mathrm{d}\mathbf{x})\\ {} & & \displaystyle \hspace{-50.0pt}={\int _{\mathcal{X}}}\mathsf{E}\left\{{\left[{\eta _{{\mathcal{F}_{n}}}}(\mathbf{x})-{\mathbf{k}_{n}^{\top }}(\mathbf{x}){\mathbf{K}_{n}^{-1}}{\mathbf{y}_{n}}\right]^{2}}|{\mathcal{F}_{n}}\right\}\hspace{0.1667em}\mu (\mathrm{d}\mathbf{x})\\ {} & \hspace{2.5pt}& \displaystyle +{\sigma ^{2}}{\int _{\mathcal{X}}}{K_{|n}}(\mathbf{x},\mathbf{x})\hspace{0.1667em}\mu (\mathrm{d}\mathbf{x})\hspace{0.1667em}.\end{array}\]

$\mathsf{IMSE}({\mathcal{F}_{n}})$ is minimum when ${\eta _{{\mathcal{F}_{n}}}}(\mathbf{x})$ is the posterior mean ${\mathbf{k}_{n}}{(\mathbf{x})^{\top }}{\mathbf{K}_{n}^{-1}}{\mathbf{y}_{n}}$. This minimum value depends only on the learning design ${\mathbf{X}_{n}}$ and is given by

(2.4)

\[ {\mathsf{IMSE}^{\mathrm{\star }}}({\mathbf{X}_{n}})={\sigma ^{2}}{\int _{\mathcal{X}}}{K_{|n}}(\mathbf{x},\mathbf{x})\hspace{0.1667em}\mu (\mathrm{d}\mathbf{x})\le \mathsf{IMSE}({\mathcal{F}_{n}})\hspace{0.1667em}.\]

For any kernel K and signed measure ν on $\mathcal{X}$, let ${\mathcal{E}_{K}}(\nu )$ denote the energy of ν for K,

\[ {\mathcal{E}_{K}}(\nu )={\int _{{\mathcal{X}^{2}}}}K(\mathbf{x},{\mathbf{x}^{\prime }})\hspace{0.1667em}\nu (\mathrm{d}\mathbf{x})\nu (\mathrm{d}{\mathbf{x}^{\prime }})\hspace{0.1667em}.\]

When K defines a Reproducing Kernel Hilbert Space (RKHS) ${\mathcal{H}_{K}}$, for any function f in ${\mathcal{H}_{K}}$ and any probability measures ξ and μ on $\mathcal{X}$, the integration error ${\Delta _{\xi ,\mu }}(f)=\left|{\textstyle\int _{\mathcal{X}}}f(\mathbf{x})\hspace{0.1667em}\xi (\mathrm{d}\mathbf{x})-{\textstyle\int _{\mathcal{X}}}f(\mathbf{x})\hspace{0.1667em}\mu (\mathrm{d}\mathbf{x})\right|$ can be bounded by the product of two terms, one depending on f only, the other on the signed measure $\xi -\mu $ but not on f. Indeed, application of (i) the reproducing property $f(\mathbf{x})={\langle f,{K_{\mathbf{x}}}\rangle _{{\mathcal{H}_{K}}}}$, where ${\langle \cdot ,\cdot \rangle _{{\mathcal{H}_{K}}}}$ denotes the scalar product in ${\mathcal{H}_{K}}$ and where, for any ${\mathbf{x}^{\prime }}\in \mathcal{X}$, ${K_{\mathbf{x}}}({\mathbf{x}^{\prime }})=K(\mathbf{x},{\mathbf{x}^{\prime }})$, and (ii) the Cauchy-Schwarz inequality, gives ${\Delta _{\xi ,\mu }}(f)\le \| f{\| _{{\mathcal{H}_{K}}}}{\mathcal{E}_{K}^{1/2}}(\xi -\mu )$, where the quantity ${\mathcal{E}_{K}^{1/2}}(\xi -\mu )$ is called the Maximum-Mean Discrepancy (MMD) between ξ and μ; see, e.g., [41, 40, 36]. Direct calculation yields $\mathcal{R}({\zeta _{m}},{\mathcal{F}_{n}})=\mathcal{R}({\zeta _{m}},{\mathbf{X}_{n}})$, with

(2.5)

\[\begin{array}{r@{\hskip10.0pt}c@{\hskip10.0pt}l}\displaystyle \mathcal{R}({\zeta _{m}},{\mathbf{X}_{n}})\hspace{-0.1667em}\hspace{-0.1667em}\hspace{-0.1667em}& \displaystyle =& \displaystyle \hspace{-0.1667em}\hspace{-0.1667em}\hspace{-0.1667em}{\sigma ^{4}}{\int _{{\mathcal{X}^{2}}}}{\overline{K}_{|n}}(\mathbf{x},{\mathbf{x}^{\prime }})({\zeta _{m}}-\mu )(\mathrm{d}\mathbf{x})({\zeta _{m}}-\mu )(\mathrm{d}{\mathbf{x}^{\prime }})\\ {} \hspace{-0.1667em}\hspace{-0.1667em}\hspace{-0.1667em}& \displaystyle =& \displaystyle \hspace{-0.1667em}\hspace{-0.1667em}\hspace{-0.1667em}{\sigma ^{4}}{\mathcal{E}_{{\overline{K}_{|n}}}}({\zeta _{m}}-\mu )\hspace{0.1667em},\end{array}\]

with ${\overline{K}_{|n}}(\mathbf{x},{\mathbf{x}^{\prime }})=(1/{\sigma ^{4}})\mathsf{E}\left\{\left.{\varepsilon ^{2}}(\mathbf{x}){\varepsilon ^{2}}({\mathbf{x}^{\prime }})\right|{\mathcal{F}_{n}}\right\}$, a scaled version of the second-order moment of the squared residuals; that is, $\mathcal{R}({\zeta _{m}},{\mathbf{X}_{n}})$ is proportional to the squared MMD between the measures ${\zeta _{m}}$ and μ for the kernel ${\overline{K}_{|n}}$. Under the GP model ${\mathcal{GP}^{f}}$,

(2.6)

\[ {\overline{K}_{|n}}(\mathbf{x},{\mathbf{x}^{\prime }})=2\hspace{0.1667em}{K_{|n}^{2}}(\mathbf{x},{\mathbf{x}^{\prime }})+{K_{|n}}(\mathbf{x},\mathbf{x}){K_{|n}}({\mathbf{x}^{\prime }},{\mathbf{x}^{\prime }})\hspace{0.1667em},\]

and we are thus led to

\[ {\zeta _{m}^{\mathrm{\star }}}({\mathcal{F}_{n}})\in \underset{{\zeta _{m}}}{\operatorname{arg\,min}}{\mathcal{E}_{{\overline{K}_{|n}}}}({\zeta _{m}}-\mu )\hspace{0.1667em},\]

with ${\overline{K}_{|n}}(\mathbf{x},{\mathbf{x}^{\prime }})$ given by (2.6).
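Expression (2.6) is what one obtains from the standard identity $\mathsf{E}\{{a^{2}}{b^{2}}\}=\mathsf{E}\{{a^{2}}\}\mathsf{E}\{{b^{2}}\}+2{(\mathsf{E}\{ab\})^{2}}$ for zero-mean jointly Gaussian variables, applied to the residuals at $\mathbf{x}$ and ${\mathbf{x}^{\prime }}$. The sketch below evaluates ${\overline{K}_{|n}}$ and the energy ${\mathcal{E}_{{\overline{K}_{|n}}}}({\zeta _{m}}-\mu )$ numerically. It rests on several assumptions that are not in the source: NumPy, a Gaussian kernel with a hypothetical parameter `theta`, the factor ${\sigma ^{4}}$ omitted, and μ replaced by the uniform measure on a finite grid. All function names are illustrative.

```python
import numpy as np

def gauss_kernel(A, B, theta=10.0):
    """Gaussian kernel between point sets of shape (N, d) and (M, d) (assumed kernel)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-theta * d2)

def conditional_kernel(A, B, Xn, kernel=gauss_kernel):
    """K_{|n}(A, B) as in (2.3)."""
    Kn = kernel(Xn, Xn)
    return kernel(A, B) - kernel(A, Xn) @ np.linalg.solve(Kn, kernel(B, Xn).T)

def kbar(A, B, Xn, kernel=gauss_kernel):
    """Kernel (2.6): 2 K_{|n}(x, x')^2 + K_{|n}(x, x) K_{|n}(x', x')."""
    Kab = conditional_kernel(A, B, Xn, kernel)
    da = np.diag(conditional_kernel(A, A, Xn, kernel))
    db = np.diag(conditional_kernel(B, B, Xn, kernel))
    return 2.0 * Kab ** 2 + np.outer(da, db)

def energy(Z, w, grid, Xn, kernel=gauss_kernel):
    """Energy of the signed measure zeta - mu, with mu approximated by
    the uniform measure on `grid` (a discretisation assumed here)."""
    pts = np.vstack([Z, grid])
    nu = np.concatenate([w, -np.full(len(grid), 1.0 / len(grid))])
    return float(nu @ kbar(pts, pts, Xn, kernel) @ nu)

Xn = np.array([[0.0], [0.5], [1.0]])                 # learning design
grid = np.linspace(0.0, 1.0, 101).reshape(-1, 1)     # discretised mu
Z = np.array([[0.25], [0.75]])                       # validation design

e_unit_mass = energy(Z, np.array([0.5, 0.5]), grid, Xn)    # weights sum to 1
e_half_mass = energy(Z, np.array([0.25, 0.25]), grid, Xn)  # weights sum to 1/2
```

Comparing the two energies for different total masses gives a quick way to explore, for a given kernel and design, how much the criterion favours weights that do not sum to one.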

When ${\eta _{{\mathcal{F}_{n}}}}$ does not interpolate ${\mathbf{y}_{n}}$, and under the same GP model for f, similar developments still give $\mathcal{R}({\zeta _{m}},{\mathcal{F}_{n}})={\sigma ^{4}}{\mathcal{E}_{{\overline{K}_{|n}}}}({\zeta _{m}}-\mu )$, with now

(2.7)

\[\begin{array}{r@{\hskip10.0pt}c@{\hskip10.0pt}l}\displaystyle {\overline{K}_{|n}}(\mathbf{x},{\mathbf{x}^{\prime }})& \displaystyle =& \displaystyle 2\hspace{0.1667em}\left[{K_{|n}}(\mathbf{x},{\mathbf{x}^{\prime }})+2\hspace{0.1667em}{\widehat{\delta }_{n}}(\mathbf{x}){\widehat{\delta }_{n}}({\mathbf{x}^{\prime }})\right]{K_{|n}}(\mathbf{x},{\mathbf{x}^{\prime }})\\ {} & & \displaystyle \hspace{-56.9055pt}+\hspace{0.1667em}\left[{\widehat{\delta }_{n}^{2}}(\mathbf{x})+{K_{|n}}(\mathbf{x},\mathbf{x})\right]\left[{\widehat{\delta }_{n}^{2}}({\mathbf{x}^{\prime }})+{K_{|n}}({\mathbf{x}^{\prime }},{\mathbf{x}^{\prime }})\right]\hspace{0.1667em},\end{array}\]

where ${\widehat{\delta }_{n}}(\mathbf{x})={\mathbf{k}_{n}^{\top }}(\mathbf{x}){\mathbf{K}_{n}^{-1}}{\mathbf{y}_{n}}-{\eta _{{\mathcal{F}_{n}}}}(\mathbf{x})$. Although in the following we will always consider that ${\eta _{{\mathcal{F}_{n}}}}$ is the optimal interpolator ${\mathbf{k}_{n}}{(\mathbf{x})^{\top }}{\mathbf{K}_{n}^{-1}}{\mathbf{y}_{n}}$, and thus that (2.6) holds, note that our approach covers generic machine learning predictors by considering ${\overline{K}_{|n}}$ defined by (2.7).

Kernels ${\overline{K}_{|n}}$ present a number of features which are not shared by the most commonly used GP kernels. The assumption that ${\eta _{{\mathcal{F}_{n}}}}$ is an interpolator, i.e. $\varepsilon ({\mathbf{x}_{i}})=0$, implies that ${\overline{K}_{|n}}({\mathbf{x}_{i}},\mathbf{x})=0$ for all $\mathbf{x}\in \mathcal{X}$ and all ${\mathbf{x}_{i}}\in {\mathbf{X}_{n}}$. The squared error process is thus non-stationary, with a spatial coherency structure that is strongly dictated by the geometry of ${\mathbf{X}_{n}}$. Adapting the validation weights ${\mathbf{w}_{i}}$ to this correlation structure dictates the performance of $\widehat{\mathsf{ISE}}({\eta _{{\mathcal{F}_{n}}}};{\zeta _{m}})$ in a critical manner. Yet, as the numerical studies of Section 4 show, exploiting the particular shape of ${\overline{K}_{|n}}$ when choosing the validation points ${\mathbf{Z}_{m}}$ is less critical (as long as they do not fall in the vicinity of ${\mathbf{X}_{n}}$).

Finally, notice that ${\overline{K}_{|n}}$ is PD. Indeed, the Hadamard product ${\mathbf{C}_{n}^{\circ 2}}$ with elements ${\{{\mathbf{C}_{n}^{\circ 2}}\}_{i,j}}={C^{2}}({\mathbf{x}_{i}},{\mathbf{x}_{j}})$, $i,j=1,\dots ,n$, is PD when the matrix ${\mathbf{C}_{n}}$ with elements ${\{{\mathbf{C}_{n}}\}_{i,j}}=C({\mathbf{x}_{i}},{\mathbf{x}_{j}})$ is PD. Hence, the positive definiteness of ${K_{|n}}$ implies that ${K_{|n}^{2}}$ is PD, which in turn implies that ${\overline{K}_{|n}}$ is also PD.

3 Minimisation of ${\mathcal{E}_{{\overline{K}_{|n}}}}$

We address now the minimisation of ${\mathcal{E}_{{\overline{K}_{|n}}}}({\zeta _{m}}-\mu )$ with respect to ${\zeta _{m}}$.

We drop the two constraints usually imposed on weights: besides the sum-of-weights-equals-one constraint (see Section 2.1), we also do not impose ${\mathbf{w}_{i}}\ge 0$. Imposing positivity would be natural if the observations were noisy independent random samples of the interpolation error, but here the ${\varepsilon _{i}}$ are noise-free and, more importantly, strongly linked by a coherency structure dictated by both the regularity characteristics of f and the quality of ${\eta _{{\mathcal{F}_{n}}}}$ as an interpolator. Nonetheless, our numerical experiments show that the ${\mathbf{w}_{i}}$ are almost always positive; see for example Figures 13 and 14. One may refer to [21] for an investigation of conditions that ensure positivity of quadrature weights, which shows that positivity can be guaranteed only under rather specific circumstances.

Since for a given f and ${\eta _{{\mathcal{F}_{n}}}}$ the validation residuals are deterministic, repeating validation points or choosing ${\mathbf{z}_{i}}\in {\mathbf{X}_{n}}$ brings no additional information. We thus restrict ${\mathbf{Z}_{m}}$ to configurations of m distinct points in $\mathcal{X}\setminus {\mathbf{X}_{n}}$. The minimisation of ${\mathcal{E}_{{\overline{K}_{|n}}}}({\zeta _{m}}-\mu )$ with respect to the parameters of ${\zeta _{m}}$ is a non-linear optimisation problem over a high-dimensional space ($m(d+1)$ scalar parameters when $\mathcal{X}\subset {\mathbb{R}^{d}}$). As briefly mentioned in the introduction, rather than fixing upfront the size m of the validation design, we are interested in finding nested sequences of validation designs, generated by a sequence of identical steps, each one increasing the design size by one:

(3.1)

\[ {\mathbf{Z}_{m+1}}={\mathbf{Z}_{m}}\cup \{{\mathbf{z}_{m+1}}\}\hspace{0.1667em},\]

where ${\mathbf{z}_{m+1}}$ is restricted to ${\mathcal{X}_{m}}=\mathcal{X}\setminus \{{\mathbf{Z}_{m}}\cup {\mathbf{X}_{n}}\}$.

Before we present in Section 3.2 the sequential Bayesian quadrature algorithm that performs this iterative construction, greedily decreasing ${\mathcal{E}_{{\overline{K}_{|n}}}}({\zeta _{m}}-\mu )$ at each step, we present background on relevant literature on iterative energy (or, equivalently, MMD) minimisation.

3.1 Background

Kernel Herding (KH) [44] can be seen to correspond to the Frank-Wolfe conditional gradient algorithm [3] applied to MMD minimisation, that is, to the vertex-direction method with predefined step-length, commonly used in optimal experimental design since the pioneering work of H.P. Wynn [45] and V.V. Fedorov [11]. It is an accretive method, generating a sequence ${\mathbf{z}_{1}},{\mathbf{z}_{2}},\dots $ which can be incrementally grown to any target size m.

In Bayesian quadrature (BQ) [29, 37] the goal is to choose samples that best approximate an integral by exploiting the assumption that the integrated function is the realisation of a GP. Sequential BQ (SBQ) sequentially expands the set of sampled points by adding a new sample at the point that decreases the variance of the integral estimate the most. This variance can be shown to equal the squared MMD between the target integral measure and the discrete measure that implements the quadrature rule, for the kernel of the assumed GP model.
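The greedy principle behind SBQ can be sketched on a finite candidate set. This is a naive illustration under stated assumptions, not the algorithm as implemented by the authors: it uses NumPy, a generic kernel matrix `K` on the candidates (a Gaussian kernel with a hypothetical parameter is used in the example), a target measure discretised as weights `mu` on the same candidates, and a brute-force search that re-solves a linear system for every candidate. The names `sbq_greedy` and `bq_variance` are hypothetical.

```python
import numpy as np

def bq_variance(K, mu, idx, w):
    """Energy of (sum_i w_i delta_{z_i} - mu) for kernel matrix K,
    i.e. the squared MMD between the quadrature rule and mu."""
    return float(mu @ K @ mu - 2.0 * w @ K[idx] @ mu
                 + w @ K[np.ix_(idx, idx)] @ w)

def sbq_greedy(K, mu, m):
    """Select m candidate indices one at a time, each step adding the point
    that most reduces the energy when the weights are re-optimised."""
    p = K @ mu                      # potential of mu at every candidate
    selected = []
    for _ in range(m):
        best, best_gain = None, -np.inf
        for j in range(len(mu)):
            if j in selected:
                continue
            idx = selected + [j]
            # With optimal weights the energy equals mu'K mu - gain.
            gain = p[idx] @ np.linalg.solve(K[np.ix_(idx, idx)], p[idx])
            if gain > best_gain:
                best, best_gain = j, gain
        selected.append(best)
    # Optimal (unconstrained) weights for the final design.
    w = np.linalg.solve(K[np.ix_(selected, selected)], p[selected])
    return selected, w

# Example: 40 candidates on [0, 1], uniform discretised target measure.
x = np.linspace(0.0, 1.0, 40)
K = np.exp(-50.0 * (x[:, None] - x[None, :]) ** 2)   # assumed Gaussian kernel
mu = np.full(x.size, 1.0 / x.size)
selected, w = sbq_greedy(K, mu, m=4)
variance = bq_variance(K, mu, selected, w)
```

Because the weights are neither normalised nor constrained to be positive, the sequence of designs is nested and the energy cannot increase from one step to the next. In the validation setting of this paper the kernel matrix would be built from ${\overline{K}_{|n}}$ and the candidates restricted to points outside ${\mathbf{X}_{n}}$.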

KH and SBQ are closely related, see e.g. [16], both attempting to minimise the same MMD. The two techniques embed the problem in consideration in the RKHS of a positive definite kernel that is chosen to reflect the characteristics of the underlying data distribution (in the original formulation of KH) or of the integrated functions (in SBQ). As stressed in [16], a major distinction between the two techniques concerns the weights assigned to each sample, which are uniform for standard KH, while they are optimally selected in SBQ. The two methods differ both in complexity and performance: SBQ is superior to standard (uniform weight) KH, this improvement coming at the cost of an increased complexity, $O(n)$ for KH and $O({n^{2}})$ for SBQ when constructing an n-point design among a finite set of candidates; see [33].

Experiments combining the two methodologies, by using the optimal BQ weights for a design found by standard KH, show that correct weighting is more critical than sample placement [16, 33], affecting in particular the algorithm’s convergence rate: KH has performance similar to SBQ for small design sizes, but displays worse performance as design size grows.

The validation setup of this paper coincides with the framework assumed by BQ, our final goal being to estimate an integral from a small number of samples, and we also resort to a GP assumption. As in BQ, the weights of our empirical estimator do not need to sum to 1 and are not necessarily positive (see [21]), and the optimal solution minimises an MMD. Placing the GP assumption not directly on the function we wish to integrate – in our case ${\varepsilon ^{2}}(\mathbf{x})$ – but on the interpolated f, leads to the identification of the pertinent MMD kernel under our validation framework as the non-stationary kernel ${\overline{K}_{|n}}$, whose structure encodes the geometry of the learning design ${\mathbf{X}_{n}}$.

Both KH and BQ assume that the RKHS kernel is characteristic, meaning that the corresponding MMD between two probability measures is zero if and only if these two measures coincide. Kernel ${\overline{K}_{|n}}$ is not characteristic, and in particular it cannot differentiate between measures that differ only over the finite set ${\mathbf{X}_{n}}$, where ${\overline{K}_{|n}}$ is zero. However, as we stressed before, since we know that $\varepsilon (\mathbf{x})=0$ for $\mathbf{x}\in {\mathbf{X}_{n}}$, the set of target measures over which we minimise ${\mathcal{E}_{{\overline{K}_{|n}}}}({\zeta _{m}}-\mu )$ all put zero mass on ${\mathbf{X}_{n}}$, and thus this minimisation still makes sense.

3.2 Greedy Minimisation of ${\mathcal{E}_{{\overline{K}_{|n}}}}({\zeta _{m}}-\mu )$

In this section we briefly present the SBQ method, reinterpreting it in the validation setup of interest to us.

By noting that ${\mathcal{E}_{{\overline{K}_{|n}}}}({\zeta _{m}}-\mu )$ is quadratic in the ${\{{\mathbf{w}_{i}}\}_{i=1}^{m}}$, the optimal weights $\tilde{\mathbf{w}}({\mathbf{Z}_{m}})$ for a given ${\mathbf{Z}_{m}}$ are obtained explicitly as

(3.2)

\[ \tilde{\mathbf{w}}({\mathbf{Z}_{m}})={\overline{K}_{|n}}{({\mathbf{Z}_{m}},{\mathbf{Z}_{m}})^{-1}}{P_{{\overline{K}_{|n}}}}({\mathbf{Z}_{m}})\hspace{0.1667em},\]

where the $m\times m$ matrix ${\overline{K}_{|n}}({\mathbf{Z}_{m}},{\mathbf{Z}_{m}})$ has generic element ${\overline{K}_{|n}}({\mathbf{z}_{i}},{\mathbf{z}_{j}})$ and the i-th entry of the m-dimensional column vector ${P_{{\overline{K}_{|n}}}}({\mathbf{Z}_{m}})$ is the potential of μ associated with kernel ${\overline{K}_{|n}}$ at validation point ${\mathbf{z}_{i}}$:

\[ {\left[{P_{{\overline{K}_{|n}}}}({\mathbf{Z}_{m}})\right]_{i}}={P_{{\overline{K}_{|n}}}}({\mathbf{z}_{i}})={\int _{\mathcal{X}}}{\overline{K}_{|n}}({\mathbf{z}_{i}},\mathbf{x})\mu (d\mathbf{x})\hspace{0.1667em}.\]
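As a rough numerical illustration of (3.2), the sketch below computes the potential and the optimal weights in dimension one, with μ replaced by a uniform discrete measure on a grid. Several ingredients are assumptions and are not taken from the text above: the Matérn 3/2 kernel in which θ multiplies distances, the designs `X` and `Z`, the helper names `matern32`, `make_kbar` and `optimal_weights`, and the form ${\overline{K}_{|n}}(\mathbf{x},{\mathbf{x}^{\prime }})=2{K_{|n}^{2}}(\mathbf{x},{\mathbf{x}^{\prime }})+{K_{|n}}(\mathbf{x},\mathbf{x}){K_{|n}}({\mathbf{x}^{\prime }},{\mathbf{x}^{\prime }})$, which is what Isserlis' theorem gives for a zero-mean Gaussian posterior error with covariance ${\sigma ^{2}}{K_{|n}}$.

```python
import numpy as np

def matern32(a, b, theta):
    # Matérn 3/2 kernel on 1-D inputs; theta multiplies distances (assumed convention).
    r = np.sqrt(3.0) * theta * np.abs(a[:, None] - b[None, :])
    return (1.0 + r) * np.exp(-r)

def make_kbar(X, theta):
    # Returns kbar(a, b), an assumed closed form of the conditional kernel Kbar_{|n}.
    KXX = matern32(X, X, theta)

    def k_post(a, b):
        # Posterior covariance K_{|n}(a, b) given observations on X.
        return matern32(a, b, theta) - matern32(a, X, theta) @ np.linalg.solve(
            KXX, matern32(X, b, theta))

    def kbar(a, b):
        va = np.diag(k_post(a, a))
        vb = np.diag(k_post(b, b))
        return 2.0 * k_post(a, b) ** 2 + np.outer(va, vb)

    return kbar

def optimal_weights(kbar, Z, grid):
    # Potential of the uniform discrete measure on `grid`, then the weights of (3.2).
    P = kbar(Z, grid).mean(axis=1)
    KZZ = kbar(Z, Z)
    return np.linalg.solve(KZZ, P), P, KZZ

X = (np.arange(10) + 0.5) / 10.0           # hypothetical learning design, n = 10
Z = np.array([0.1, 0.3, 0.5, 0.7, 0.9])    # hypothetical validation design, m = 5
grid = np.linspace(0.0, 1.0, 201)          # support of the discrete measure
kbar = make_kbar(X, theta=10.0)
w, P, KZZ = optimal_weights(kbar, Z, grid)
print("weights:", w, "sum:", w.sum())
```

Solving the linear system with `np.linalg.solve` rather than forming the inverse explicitly is a numerical convenience only; the weights are the same as in (3.2).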

Remembering that ${\sigma ^{4}}{\overline{K}_{|n}}(\mathbf{x},{\mathbf{x}^{\prime }})=\mathsf{E}\left\{{\varepsilon ^{2}}(\mathbf{x}){\varepsilon ^{2}}({\mathbf{x}^{\prime }})|{\mathcal{F}_{n}}\right\}$, ${P_{{\overline{K}_{|n}}}}(\mathbf{z})$ can be recognised as

\[ {P_{{\overline{K}_{|n}}}}(\mathbf{z})=\frac{1}{{\sigma ^{4}}}\mathsf{E}\left\{\left.{\varepsilon ^{2}}(\mathbf{z})\hspace{0.1667em}\mathsf{ISE}({\eta _{{\mathcal{F}_{n}}}})\right|{\mathcal{F}_{n}}\right\}\hspace{0.1667em}.\]

Define

(3.3)

\[ {\widehat{{\varepsilon ^{2}}}_{{\mathcal{F}_{n}}}}(\mathbf{x}|{\mathbf{Z}_{m}})={\overline{K}_{|n}}(\mathbf{x},{\mathbf{Z}_{m}}){\overline{K}_{|n}}{({\mathbf{Z}_{m}},{\mathbf{Z}_{m}})^{-1}}{\varepsilon ^{2}}({\mathbf{Z}_{m}})\hspace{0.1667em}.\]

Under the posterior model, i.e., given ${\mathcal{F}_{n}}$, ${\widehat{{\varepsilon ^{2}}}_{{\mathcal{F}_{n}}}}(\mathbf{x}|{\mathbf{Z}_{m}})$ is the Minimum MSE (MMSE) linear estimate of ${\varepsilon ^{2}}(\mathbf{x})$ given the residuals observed over ${\mathbf{Z}_{m}}$. When the weights ${\mathbf{w}_{i}}$ of the validation measure are given by (3.2), $\widehat{\mathsf{ISE}}$ has thus the following simple and enlightening expression:

(3.4)

\[ \widehat{\mathsf{ISE}}({\eta _{{\mathcal{F}_{n}}}},{\mathbf{Z}_{m}})=\sum \limits_{i}{\tilde{\mathbf{w}}_{i}}({\mathbf{Z}_{m}}){\varepsilon ^{2}}({\mathbf{z}_{i}})\hspace{-0.1667em}=\hspace{-0.1667em}{\int _{\mathcal{X}}}{\widehat{{\varepsilon ^{2}}}_{{\mathcal{F}_{n}}}}(x|{\mathbf{Z}_{m}})\hspace{0.1667em}\mu (dx)\hspace{0.1667em}.\]

Note that the weights ${\tilde{\mathbf{w}}_{i}}({\mathbf{Z}_{m}})$, and thus the estimator $\widehat{\mathsf{ISE}}({\eta _{{\mathcal{F}_{n}}}},{\mathbf{Z}_{m}})$ itself, are independent of ${\sigma ^{2}}$. The estimators ${\widehat{{\varepsilon ^{2}}}_{{\mathcal{F}_{n}}}}(\mathbf{x}|{\mathbf{Z}_{m}})$ and $\widehat{\mathsf{ISE}}({\eta _{{\mathcal{F}_{n}}}},{\mathbf{Z}_{m}})$ rely on the assumed GP model ${\mathcal{GP}^{f}}$ for f, but as explained in [42, Sect. 3.2], model misspecification has a much smaller effect on estimated function values than on predictions of their MSE. One important strength of our approach is thus that our estimator of $\mathsf{ISE}({\eta _{{\mathcal{F}_{n}}}})$ does not involve the predicted MSE associated with the reconstructed residuals. As shown in Appendix A, this is no longer the case when one attempts to remove the bias of $\widehat{\mathsf{ISE}}({\eta _{{\mathcal{F}_{n}}}},{\mathbf{Z}_{m}})$, which leads to estimators that are no longer robust to model misspecification.
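The two forms of the estimator in (3.4) can be checked numerically. The sketch below is only an illustration under assumptions that do not come from the paper: a hypothetical test function `f`, a Matérn 3/2 kernel in which θ multiplies distances, a simple kriging interpolator, μ replaced by a uniform discrete measure on a grid, and the Gaussian (Isserlis) form of ${\overline{K}_{|n}}$, namely $2{K_{|n}^{2}}(\mathbf{x},{\mathbf{x}^{\prime }})+{K_{|n}}(\mathbf{x},\mathbf{x}){K_{|n}}({\mathbf{x}^{\prime }},{\mathbf{x}^{\prime }})$.

```python
import numpy as np

def matern32(a, b, theta):
    # Matérn 3/2 kernel on 1-D inputs; theta multiplies distances (assumed convention).
    r = np.sqrt(3.0) * theta * np.abs(a[:, None] - b[None, :])
    return (1.0 + r) * np.exp(-r)

def f(x):
    # Hypothetical test function, not one of the paper's examples.
    return np.sin(2.0 * np.pi * x) + x

theta = 10.0
X = (np.arange(10) + 0.5) / 10.0           # hypothetical learning design
Z = np.array([0.1, 0.3, 0.5, 0.7, 0.9])    # hypothetical validation design
grid = np.linspace(0.0, 1.0, 201)          # support of the discrete measure
KXX = matern32(X, X, theta)

def k_post(a, b):
    # Posterior covariance K_{|n}(a, b) given observations on X.
    return matern32(a, b, theta) - matern32(a, X, theta) @ np.linalg.solve(
        KXX, matern32(X, b, theta))

def kbar(a, b):
    # Assumed closed form of Kbar_{|n} (Isserlis' theorem).
    return 2.0 * k_post(a, b) ** 2 + np.outer(np.diag(k_post(a, a)),
                                              np.diag(k_post(b, b)))

def eps2(x):
    # Squared residuals of the kriging interpolator of f built on X.
    eta = matern32(x, X, theta) @ np.linalg.solve(KXX, f(X))
    return (f(x) - eta) ** 2

KZZ = kbar(Z, Z)
w = np.linalg.solve(KZZ, kbar(Z, grid).mean(axis=1))       # weights (3.2)
ise_weighted = w @ eps2(Z)                                  # left-hand form of (3.4)
recon = kbar(grid, Z) @ np.linalg.solve(KZZ, eps2(Z))       # reconstruction (3.3)
ise_integral = recon.mean()                                 # right-hand form of (3.4)
ise_reference = eps2(grid).mean()                           # grid value of the ISE
print(ise_weighted, ise_integral, ise_reference)
```

With the discrete measure, the two forms agree up to rounding, while `ise_reference` is the quantity both are trying to estimate from the five residuals only.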

For a given ${\zeta _{m}}$ define ${\mathcal{E}_{m}}\left(\mathbf{x}\right)={\mathcal{E}_{{\overline{K}_{|n}}}}({\zeta _{m+1}^{\mathrm{\star }}}-\mu )$, the energy for measure ${\zeta _{m+1}^{\mathrm{\star }}}$ having support ${\mathbf{Z}_{m+1}}(\mathbf{x})={\mathbf{Z}_{m}}\cup \{\mathbf{x}\}$ and optimal weights $\tilde{\mathbf{w}}({\mathbf{Z}_{m+1}}(\mathbf{x}))$ given by (3.2). If ${\zeta _{m}}=\zeta (\tilde{\mathbf{w}}({\mathbf{Z}_{m}}),{\mathbf{Z}_{m}})$, for $\mathbf{x}\in {\mathcal{X}_{m}}$ we have

\[\begin{array}{r@{\hskip10.0pt}c@{\hskip10.0pt}l}\displaystyle {\mathcal{E}_{m}}\left(\mathbf{x}\right)& \displaystyle =& \displaystyle {\mathcal{E}_{{\overline{K}_{|n}}}}({\zeta _{m}}-\mu )\\ {} & & \displaystyle \hspace{-56.9055pt}-\hspace{0.1667em}\frac{{\left({P_{{\overline{K}_{|n}}}}(\mathbf{x})-{\overline{K}_{|n}}(\mathbf{x},{\mathbf{Z}_{m}}){\overline{K}_{|n}}{({\mathbf{Z}_{m}},{\mathbf{Z}_{m}})^{-1}}{P_{{\overline{K}_{|n}}}}({\mathbf{Z}_{m}})\right)^{2}}}{{s^{2}}(\mathbf{x})},\end{array}\]

where

\[\begin{array}{r@{\hskip10.0pt}c@{\hskip10.0pt}l}\displaystyle {s^{2}}(\mathbf{x})& \displaystyle =& \displaystyle {\overline{K}_{|n}}(\mathbf{x},\mathbf{x})\\ {} & & \displaystyle \hspace{-20.0pt}-\hspace{0.1667em}{\overline{K}_{|n}}(\mathbf{x},{\mathbf{Z}_{m}}){\overline{K}_{|n}}{({\mathbf{Z}_{m}},{\mathbf{Z}_{m}})^{-1}}{\overline{K}_{|n}}({\mathbf{Z}_{m}},\mathbf{x})\\ {} & & \displaystyle =\frac{1}{{\sigma ^{4}}}\mathsf{E}\left\{\left.{\left({\varepsilon ^{2}}(\mathbf{x})-{\widehat{{\varepsilon ^{2}}}_{{\mathcal{F}_{n}}}}(\mathbf{x}|{\mathbf{Z}_{m}})\right)^{2}}\right|{\mathcal{F}_{n}}\right\}\hspace{0.1667em}.\end{array}\]
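The expression of ${\mathcal{E}_{m}}(\mathbf{x})$ above can be recovered by a standard block-inversion argument. The sketch below assumes that the energy is the usual double integral ${\mathcal{E}_{K}}(\nu )=\iint K(\mathbf{x},{\mathbf{x}^{\prime }})\hspace{0.1667em}\nu (d\mathbf{x})\nu (d{\mathbf{x}^{\prime }})$, and uses the shorthand ${\mathbf{K}_{m}}={\overline{K}_{|n}}({\mathbf{Z}_{m}},{\mathbf{Z}_{m}})$, ${\mathbf{p}_{m}}={P_{{\overline{K}_{|n}}}}({\mathbf{Z}_{m}})$ and ${\mathbf{k}_{m}}(\mathbf{x})={\overline{K}_{|n}}({\mathbf{Z}_{m}},\mathbf{x})$, introduced here for convenience only. For weights $\mathbf{w}$ supported on ${\mathbf{Z}_{m}}$,

\[ {\mathcal{E}_{{\overline{K}_{|n}}}}(\zeta (\mathbf{w},{\mathbf{Z}_{m}})-\mu )={\mathbf{w}^{\top }}{\mathbf{K}_{m}}\mathbf{w}-2\hspace{0.1667em}{\mathbf{w}^{\top }}{\mathbf{p}_{m}}+{\mathcal{E}_{{\overline{K}_{|n}}}}(\mu )\hspace{0.1667em},\]

which is minimised at $\mathbf{w}={\mathbf{K}_{m}^{-1}}{\mathbf{p}_{m}}$, that is (3.2), with minimum value ${\mathcal{E}_{{\overline{K}_{|n}}}}(\mu )-{\mathbf{p}_{m}^{\top }}{\mathbf{K}_{m}^{-1}}{\mathbf{p}_{m}}$. Adding $\mathbf{x}$ to the support borders ${\mathbf{K}_{m}}$ with ${\mathbf{k}_{m}}(\mathbf{x})$ and ${\overline{K}_{|n}}(\mathbf{x},\mathbf{x})$; the Schur complement of ${\mathbf{K}_{m}}$ in the bordered matrix is ${s^{2}}(\mathbf{x})$, and block inversion gives

\[ {\mathbf{p}_{m+1}^{\top }}{\mathbf{K}_{m+1}^{-1}}{\mathbf{p}_{m+1}}={\mathbf{p}_{m}^{\top }}{\mathbf{K}_{m}^{-1}}{\mathbf{p}_{m}}+\frac{{\left({P_{{\overline{K}_{|n}}}}(\mathbf{x})-{\mathbf{k}_{m}}{(\mathbf{x})^{\top }}{\mathbf{K}_{m}^{-1}}{\mathbf{p}_{m}}\right)^{2}}}{{s^{2}}(\mathbf{x})}\hspace{0.1667em}.\]

Subtracting both sides from ${\mathcal{E}_{{\overline{K}_{|n}}}}(\mu )$ yields the decrease of energy stated above.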

The next validation point is thus a maximiser of the second term in ${\mathcal{E}_{m}}\left(\mathbf{x}\right)$, which can equivalently be written as

(3.5)

\[\begin{aligned}{}{\mathbf{z}_{m+1}}\in \underset{\mathbf{x}\in {\mathcal{X}_{m}}}{\operatorname{arg\,max}}& \frac{\mathsf{E}\left\{\left.\mathsf{ISE}({\eta _{{\mathcal{F}_{n}}}})\left({\varepsilon ^{2}}(\mathbf{x})-{\widehat{{\varepsilon ^{2}}}_{{\mathcal{F}_{n}}}}(\mathbf{x}|{\mathbf{Z}_{m}})\right)\right|{\mathcal{F}_{n}}\right\}}{\mathsf{E}\left\{\left.{\left({\varepsilon ^{2}}(\mathbf{x})-{\widehat{{\varepsilon ^{2}}}_{{\mathcal{F}_{n}}}}(\mathbf{x}|{\mathbf{Z}_{m}})\right)^{2}}\right|{\mathcal{F}_{n}}\right\}}.\end{aligned}\]

The numerator measures how much $\mathsf{ISE}({\eta _{{\mathcal{F}_{n}}}})$ and the error of ${\widehat{{\varepsilon ^{2}}}_{{\mathcal{F}_{n}}}}(\mathbf{x}|{\mathbf{Z}_{m}})$ as an estimate of ${\varepsilon ^{2}}(\mathbf{x})$ are statistically associated. Points where this term is large are good candidates to extend the current design. The denominator penalises points x where ${\varepsilon ^{2}}(\mathbf{x})$ is estimated with a large MSE, tending in particular to keep ${\mathbf{z}_{m+1}}$ away from the boundaries of $\mathcal{X}$ (where the uncertainty is in general large), as the numerical studies presented later will show.

The recursive extension of the validation measure is initiated with ${\mathbf{Z}_{1}}=\{{\mathbf{z}_{1}}\}$ solution of

(3.6)

\[ {\mathbf{z}_{1}}\in \underset{\mathbf{x}\in \mathcal{X}\setminus {\mathbf{X}_{n}}}{\operatorname{arg\,max}}\frac{{P_{{\overline{K}_{|n}}}}{(\mathbf{x})^{2}}}{{\overline{K}_{|n}}(\mathbf{x},\mathbf{x})}\hspace{0.1667em}.\]

In practice, a finite set ${\mathcal{X}_{L}}\subset \mathcal{X}$, for instance the first L elements of a low-discrepancy sequence in $\mathcal{X}$, or a regular grid in $\mathcal{X}$ if d is not too large, is substituted for $\mathcal{X}$ in (3.5) and (3.6). The determination of ${\mathbf{z}_{m+1}}$, $m\ge 0$, then requires the evaluation of ${P_{{\overline{K}_{|n}}}}(\mathbf{x})$ for all $\mathbf{x}\in {\mathcal{X}_{L}}\setminus {\mathbf{X}_{n}}$. This calculation is done once and for all, at the initialisation of the algorithm. In the numerical examples of Sections 4 and 5, ${P_{{\overline{K}_{|n}}}}={P_{{\overline{K}_{|n}},\mu }}$ is replaced by ${P_{{\overline{K}_{|n}},{\mu _{L}}}}$, with ${\mu _{L}}$ the uniform (discrete) measure on ${\mathcal{X}_{L}}$; see Appendix C for details. When K is a tensor-product kernel and μ is uniform on $\mathcal{X}={[0,1]^{d}}$, ${P_{{\overline{K}_{|n}}}}(\mathbf{x})$ can often be calculated explicitly; see Appendix B. The same approach (substitution of the discrete measure ${\mu _{L}}$ for μ, or tensorisation) can be used to evaluate (3.4).
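The greedy construction (3.5)–(3.6) over a finite candidate set can be sketched as follows, in dimension one. This is a minimal illustration and not the authors' implementation: the regular grid, the Matérn 3/2 kernel in which θ multiplies distances, the value θ = 20, the helper names, the tolerance `tol`, and the Gaussian (Isserlis) form $2{K_{|n}^{2}}(\mathbf{x},{\mathbf{x}^{\prime }})+{K_{|n}}(\mathbf{x},\mathbf{x}){K_{|n}}({\mathbf{x}^{\prime }},{\mathbf{x}^{\prime }})$ used for ${\overline{K}_{|n}}$ are all assumptions.

```python
import numpy as np

def matern32(a, b, theta):
    # Matérn 3/2 kernel on 1-D inputs; theta multiplies distances (assumed convention).
    r = np.sqrt(3.0) * theta * np.abs(a[:, None] - b[None, :])
    return (1.0 + r) * np.exp(-r)

def kbar_matrix(grid, X, theta):
    # Assumed closed form of Kbar_{|n}, evaluated on all pairs of grid points.
    Kgx = matern32(grid, X, theta)
    Kpost = matern32(grid, grid, theta) - Kgx @ np.linalg.solve(
        matern32(X, X, theta), Kgx.T)
    v = np.diag(Kpost)
    return 2.0 * Kpost ** 2 + np.outer(v, v)

def sbq_design(Kb, forbidden, m, tol=1e-12):
    # Greedy selection: maximise the energy decrease num / s2 at each step.
    P = Kb.mean(axis=1)                       # potential of mu_L, computed once
    chosen, gains = [], []
    for _ in range(m):
        if chosen:
            A = np.linalg.solve(Kb[np.ix_(chosen, chosen)], Kb[chosen, :])
            num = (P - A.T @ P[chosen]) ** 2
            s2 = np.diag(Kb) - np.sum(Kb[chosen, :] * A, axis=0)
        else:
            num, s2 = P ** 2, np.diag(Kb).copy()      # first point, as in (3.6)
        ok = ~forbidden & (s2 > tol)
        ok[np.array(chosen, dtype=int)] = False       # no repeated points
        ratio = np.where(ok, num / np.where(ok, s2, 1.0), -np.inf)
        j = int(np.argmax(ratio))
        chosen.append(j)
        gains.append(float(ratio[j]))
    return chosen, gains

def energy(Kb, chosen):
    # Energy of the optimally weighted measure supported on the chosen points.
    P = Kb.mean(axis=1)
    if not chosen:
        return float(Kb.mean())
    PZ = P[chosen]
    return float(Kb.mean() - PZ @ np.linalg.solve(Kb[np.ix_(chosen, chosen)], PZ))

grid = np.linspace(0.0, 1.0, 201)             # candidate set X_L
idx = np.arange(10, 201, 20)                  # learning design: 0.05, 0.15, ..., 0.95
Kb = kbar_matrix(grid, grid[idx], theta=20.0)
forbidden = np.zeros(grid.size, dtype=bool)
forbidden[idx] = True
chosen, gains = sbq_design(Kb, forbidden, m=10)
print("validation design:", grid[chosen])
```

Because the designs are nested, `chosen[:k]` gives the validation design of size k for every k up to m, and `energy(Kb, chosen[:k])` traces the corresponding decrease of the MMD.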

With the aid of a one-dimensional example we formulate now a number of comments about the expected behaviour and properties of the estimators $\widehat{\mathsf{ISE}}$ obtained by repeated application of (3.5) – to extend ${\mathbf{Z}_{m}}$ to ${\mathbf{Z}_{m+1}}$ – and (3.2) – fixing the weights of ${\zeta _{m+1}}$, and thus ${\widehat{{\varepsilon ^{2}}}_{{\mathcal{F}_{n}}}}(\mathbf{x}|{\mathbf{Z}_{m+1}})$ for the subsequent design extension. The red bold curve in the top panel of Figure 2 plots the squared residuals ${\varepsilon ^{2}}(\mathbf{x})$ of the interpolator ${\eta _{{\mathcal{F}_{n}}}}$ for the function f plotted in the bottom panel (where ${\eta _{{\mathcal{F}_{n}}}}$ and f are in red and green, respectively), trained on the learning design of size 10 indicated by the red stars. The blue and green curves on the top panel are the squared residuals ${\widehat{{\varepsilon ^{2}}}_{{\mathcal{F}_{n}}}}(\mathbf{x}|{\mathbf{Z}_{m}})$ predicted by two distinct ${\zeta _{m}}$ ($m=10$), both generated using (3.5) and (3.2), but assuming distinct kernels $K(\mathbf{x},{\mathbf{x}^{\prime }})$: Cauchy (in green) and Matérn 3/2 (in blue), with range parameters θ as indicated in the legend.7 The (nearly coincident) validation designs ${\mathbf{Z}_{m}}$ are indicated by the squares and circles filled with the corresponding colours.

Figure 2

Top: ${\varepsilon ^{2}}(\mathbf{x})$ (red) and ${\widehat{{\varepsilon ^{2}}}_{{\mathcal{F}_{n}}}}(\mathbf{x}|{\mathbf{Z}_{m}})$ (blue and green) for two distinct GP models. Bottom: f, η and ${\mathbf{X}_{n}}$.

Figure 3

${\overline{K}_{|n}}({\mathbf{z}_{1}},\mathbf{x})/{\overline{K}_{|n}}({\mathbf{z}_{1}},{\mathbf{z}_{1}})$ (bold lines) and $K({\mathbf{z}_{1}},\mathbf{x})/K({\mathbf{z}_{1}},{\mathbf{z}_{1}})$ (thin lines) for the Cauchy and Matérn kernels used in Figure 2 (same colour code) and ${\mathbf{z}_{1}}\simeq 0.1$.

Remark first that, as anticipated, both designs ${\mathbf{Z}_{m}}$ have no points on the boundaries of $\mathcal{X}$, even if the uncertainty affecting ${\varepsilon ^{2}}(\mathbf{x})$ is large in those regions. Those familiar with optimal interpolation using monotonically decreasing stationary covariance kernels may be surprised by the fact that in intervals between learning points containing no validation points (e.g. around $\mathbf{x}\simeq 0.3$) the interpolated squared residual is non-zero, i.e., ${\widehat{{\varepsilon ^{2}}}_{{\mathcal{F}_{n}}}}(\mathbf{x}|{\mathbf{Z}_{m}})\gt 0$. This is a consequence of the particular shape of kernel ${\overline{K}_{|n}}$, strongly dictated by the geometry of ${\mathbf{X}_{n}}$, which has larger values at pairs of points at large distance than the original K, as shown in Figure 3. For ${\mathbf{z}_{1}}\simeq 0.1\in {\mathbf{Z}_{m}}$, the figure plots normalised versions of both the assumed (stationary) signal correlation $K({\mathbf{z}_{1}}-\mathbf{x})$ (thin coloured lines) and the kernel ${\overline{K}_{|n}}({\mathbf{z}_{1}},\mathbf{x})$ (bold lines), with the same colour code as in Figure 2. The similarity of the two ${\overline{K}_{|n}}$ allows us to expect that the estimator will have some robustness with respect to the assumed GP model. The numerical studies presented in Section 4 confirm this expectation.

Above, we recognised ${\widehat{{\varepsilon ^{2}}}_{{\mathcal{F}_{n}}}}(\mathbf{x}|{\mathbf{Z}_{m}})$, given by (3.3), as the MMSE linear estimator of ${\varepsilon ^{2}}(\mathbf{x})$ given ${\varepsilon ^{2}}({\mathbf{Z}_{m}})$. Being agnostic with respect to the expected values of the involved random variables, estimators ${\widehat{{\varepsilon ^{2}}}_{{\mathcal{F}_{n}}}}(\mathbf{x}|{\mathbf{Z}_{m}})$, and thus $\widehat{\mathsf{ISE}}$, are biased. We investigate in Appendix A the possibility of exploiting knowledge of the first moments, namely $\mathsf{E}\left\{\left.{\varepsilon ^{2}}(\mathbf{x})\right|{\mathcal{F}_{n}}\right\}={\sigma ^{2}}{K_{|n}}(\mathbf{x},\mathbf{x})$ and $\mathsf{E}\left\{\left.\mathsf{ISE}({\eta _{{\mathcal{F}_{n}}}})\right|{\mathcal{F}_{n}}\right\}=\mathsf{IMSE}({\mathcal{F}_{n}})$, to replace ${\widehat{{\varepsilon ^{2}}}_{{\mathcal{F}_{n}}}}(\mathbf{x}|{\mathbf{Z}_{m}})$ in (3.4) with an unbiased estimator. Unfortunately, bias correction comes at the price of losing robustness with respect to the assumed GP model for f, as we might expect given the explicit dependency on ${\sigma ^{2}}$ of both expected values. Thus, the unbiased estimators in Appendix A cannot be considered as instrumental alternatives to $\widehat{\mathsf{ISE}}$, and we do not consider them in the numerical study of Section 4.

4 Numerical Experiments

Section 4.1 presents numerical studies that demonstrate the robustness of $\widehat{\mathsf{ISE}}$ with respect to the assumed GP model, with ${\zeta _{m}}$ found by SBQ. Section 4.2 confirms the importance of using ${\overline{K}_{|n}}$ to define the energy minimised by SBQ. We then study, in Section 4.3, the possibility of using KH, which has slightly smaller computational complexity, rather than SBQ, to find the validation support of ${\zeta _{m}}$. Our conclusion is that doing so not only leads to worse performance but is also prone to numerical instability. Finally, Section 4.4 illustrates via some examples the properties of the validation measures, in particular their space-filling properties and the fact that they down-weight the observed squared residuals. In all examples $\mathcal{X}={[0,1]^{d}}$, with $d=1$, 2 or 3. Use of larger values of d leads to similar conclusions; see [35].

Our analysis resorts to simulations from several (zero mean) GP models, and the MSE of the $\mathsf{ISE}$ estimates is approximated by averaging the squared errors of ${\widehat{\mathsf{ISE}}^{(i)}}$ over $M=500$ realisations ${\{{f^{(i)}}\}_{i=1}^{M}}$ of the assumed GP model. We reserve the notation $Q(\cdot ,\cdot ;{\theta _{0}})$ for the kernel of the GP model from which f is sampled, ${\theta _{0}}$ being thus the “true” scale parameter. The scale parameter is adapted to the size of the learning design, ${\theta _{0}}={n^{1/d}}$, such that good interpolation performance over $\mathcal{X}$ can be attained with n points. Designs ${\mathbf{X}_{n}}$ are always space filling, and ${\eta _{{\mathcal{F}_{n}}}}$ is the optimal Bayesian interpolator for the simulated GP model. See Appendix C for details.
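The Monte Carlo protocol just described can be sketched as follows for $d=1$. This is a simplified illustration rather than the setup of Appendix C: the grid, the designs, the random seed, the fixed (non-optimised) validation points, the Matérn 3/2 parameterisation in which θ multiplies distances, the small jitter added before the Cholesky factorisation, and the Gaussian (Isserlis) form assumed for ${\overline{K}_{|n}}$ are all assumptions. There is no model mismatch here: the interpolator and the weights use the simulated kernel.

```python
import numpy as np

def matern32(a, b, theta):
    # Matérn 3/2 kernel on 1-D inputs; theta multiplies distances (assumed convention).
    r = np.sqrt(3.0) * theta * np.abs(a[:, None] - b[None, :])
    return (1.0 + r) * np.exp(-r)

rng = np.random.default_rng(0)
n, m, d, M = 10, 9, 1, 500
grid = np.linspace(0.0, 1.0, 101)      # discretisation of X = [0, 1]
ix = np.arange(5, 101, 10)             # learning points 0.05, 0.15, ..., 0.95
iz = np.arange(10, 100, 10)            # validation points 0.1, ..., 0.9 (fixed)
theta0 = n ** (1.0 / d)

Q = matern32(grid, grid, theta0)
B = np.linalg.solve(Q[np.ix_(ix, ix)], Q[ix, :])     # Q(X,X)^{-1} Q(X, grid)
Kpost = Q - Q[:, ix] @ B                             # posterior covariance K_{|n}
v = np.diag(Kpost)
Kb = 2.0 * Kpost ** 2 + np.outer(v, v)               # assumed form of Kbar_{|n}

# Optimal weights (3.2) with the discrete measure, and uniform weights 1/m.
w_opt = np.linalg.solve(Kb[np.ix_(iz, iz)], Kb[iz, :].mean(axis=1))
w_unif = np.full(m, 1.0 / m)

# M realisations of the GP on the grid (one per column), sigma^2 = 1.
chol = np.linalg.cholesky(Q + 1e-10 * np.eye(grid.size))
F = chol @ rng.standard_normal((grid.size, M))
eps2 = (F - B.T @ F[ix, :]) ** 2                     # squared interpolation residuals
ise = eps2.mean(axis=0)                              # ISE of each realisation

mse_opt = float(np.mean((w_opt @ eps2[iz, :] - ise) ** 2))
mse_unif = float(np.mean((w_unif @ eps2[iz, :] - ise) ** 2))
print("empirical MSE, optimal weights:", mse_opt)
print("empirical MSE, uniform weights:", mse_unif)
```

Studying robustness as in Section 4.1 would amount to building `Kb` and the weights from a different kernel or scale parameter than the one used to generate `F`.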

$K(\cdot ,\cdot ;\theta )$ denotes the kernel of the GP model assumed by the design algorithm that produces ${\zeta _{m}}$, with θ its scale parameter. In all numerical examples we consider ${\sigma ^{2}}=1$. The influence of θ is studied for $\theta \in \left[{n^{1/d}}/4,\max ({n^{1/d}},2\hspace{0.1667em}{(n+m)^{1/d}})\right]$, an interval that always contains

\[ {\theta _{c}}(n,m,d)={(n+m)^{1/d}}\]

(as well as ${\theta _{0}}={n^{1/d}}$). All plots consider the normalisation $\theta /{\theta _{c}}$, such that ${\theta _{c}}\leftrightarrow 1$ in the plots shown. In all plots of this section the special symbols in the plotted curves indicate $\theta ={\theta _{0}}$, the scale parameter of the simulated GP model.
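For concreteness, the quantities above can be computed as in the short sketch below. The function name `theta_settings`, the number of values and their geometric spacing are assumptions; the text only specifies the end points of the interval and the normalisation by ${\theta _{c}}$.

```python
import numpy as np

def theta_settings(n, m, d, num=9):
    # Scale of the simulated model, rule-of-thumb scale, and the studied interval.
    theta0 = n ** (1.0 / d)
    theta_c = (n + m) ** (1.0 / d)
    lo, hi = theta0 / 4.0, max(theta0, 2.0 * theta_c)
    thetas = np.geomspace(lo, hi, num)     # geometric spacing is an assumption
    return theta0, theta_c, thetas, thetas / theta_c

theta0, theta_c, thetas, normalised = theta_settings(n=10, m=10, d=1)
print(theta0, theta_c)           # scale of the simulated model and rule-of-thumb scale
print(normalised)                # abscissa theta / theta_c used in the plots
```

With $n=m=10$ and $d=1$, the interval is $[2.5,40]$, so that the normalised abscissa ranges from 0.125 to 2 and ${\theta _{0}}$ sits at 0.5.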

Figure 4

MSE of $\widehat{\mathsf{ISE}}$. Statistics over 500 realisations. Q is a Matérn 3/2 kernel; K is a Matérn kernel with $\nu =1/2$ (left), $\nu =3/2$ (middle) and $\nu =5/2$ (right); $d=1$.

Figure 5

MSE of $\widehat{\mathsf{ISE}}$. Statistics over 500 realisations. Q is a Cauchy kernel, K is as in Figure 4; $d=1$.

4.1 Robustness with Respect to Assumed GP Model

We address robustness by studying how much the MSE of $\widehat{\mathsf{ISE}}$ is affected by model mismatch, i.e., by estimating the ISE assuming that the kernel is $K(\cdot ,\cdot ;\theta )$ when in fact the data generating model uses $Q(\cdot ,\cdot ;{\theta _{0}})$. Figure 4 plots empirical estimates of $\mathcal{R}({\zeta _{m}},{\mathcal{F}_{n}})$. Kernel Q is the Matérn 3/2 kernel, ${\mathbf{X}_{n}}$ has $n=10$ points and $d=1$, ${\theta _{0}}=n$. The panels correspond to different values of the regularity parameter – $\nu =1/2,3/2,5/2$, from left to right – of the Matérn kernel K.

The three curves in each plot correspond to different sizes of ${\mathbf{Z}_{m}}$ (with ${\mathbf{Z}_{m}}$ depending on K, and thus on ν): $m\in \{5,10,20\}$ (in blue, red and yellow, respectively), plotting $\mathcal{R}$ as a function of $\theta /(n+m)$. The black stars indicate $\theta ={\theta _{0}}$. Comparison of the three panels confirms the anticipated robustness of the estimator. When K has higher regularity than Q, as in the rightmost panel ($\nu =5/2$), the curves are almost identical to the central panel, where the correct model is used. However, the assumption of a less regular model, as in the leftmost panel, may significantly degrade performance. The estimators are reasonably robust with respect to the precise choice of the scale parameter if values $\theta \simeq {\theta _{c}}$ are used.

Figure 5 reproduces the same study for simulations from a process with a Cauchy kernel and for a larger ${\mathbf{Z}_{m}}$: $m\in \{10,20,30\}$ (left to right). As in the previous figure, K is a Matérn kernel and the three panels correspond to different smoothness parameters $\nu \in \{1/2,3/2,5/2\}$. Here the simulated model has a weaker regularity than the models assumed, and a noticeable performance degradation is now observed for the smaller designs and the more regular Matérn kernel with $\nu =5/2$. Similar results were obtained when simulating from other models and for higher values of d.

Finally, Figure 6 shows, for the same validation designs ${\mathbf{Z}_{m}}$ as in Figure 4, the MSE of ${\widehat{\mathsf{ISE}}_{un}}$ given by equation (2.2), estimated over 500 realisations of a GP with the same Matérn 3/2 model. We can see that proper residual weighting leads to a significant decrease of the estimation error, which is nearly one order of magnitude larger in Figure 6 than for the optimal BQ weighting used in Figure 4.

The experiments in this section suggest a rule-of-thumb to choose the kernel K used by the design algorithm: K should model functions with a reasonably large degree of smoothness (Matérn 3/2 was found to be a good compromise), with a scale parameter θ dependent on the sizes of the learning and validation sets. For the Matérn family used in our experiments a good choice is $\theta \simeq {(n+m)^{1/d}}$, automatically adjusting to the actual total number of residual samples.

Figure 6

MSE of ${\widehat{\mathsf{ISE}}_{un}}$. Statistics over 500 realisations for the example in Figure 4.

4.2 Impact of ${\overline{K}_{|n}}$

Our main novel contribution is the identification of ${\overline{K}_{|n}}$ as the kernel that appears in the MMD that the validation measure ${\zeta _{m}}$, both its weights and its support, must minimise. One may question the importance of using the non-stationary conditional kernel ${\overline{K}_{|n}}$ to find ${\mathbf{Z}_{m}}$, instead of directly using kernel K. We now compare the performance of the empirical estimator $\widehat{\mathsf{ISE}}$ with ${\mathbf{Z}_{m}}$ determined by SBQ for kernel ${\overline{K}_{|n}}$, as in Section 3.2, which from now on we denote by ${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$, with the estimates produced by a validation measure ${\zeta _{K}^{\mathsf{BQ}}}$ whose support ${\mathbf{Z}_{m}}$ is incrementally found by SBQ for kernel K, the continuation of ${\mathbf{X}_{n}}$ that is optimal to integrate the function f. Independently of how ${\mathbf{Z}_{m}}$ was found, the validation measures ${\zeta _{m}}$ used by the estimators $\widehat{\mathsf{ISE}}$ always have optimal weights given by (3.2).

Figures 7 ($d=1$) and 8 ($d=3$) show the empirical MSE of $\widehat{\mathsf{ISE}}$ for ${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$ (black lines) and ${\zeta _{K}^{\mathsf{BQ}}}$ (red lines) observed when Q is the Matérn 3/2 kernel (top) and the Cauchy kernel (bottom), for a learning design of size $n=10\hspace{0.1667em}d$. From left to right, K is a Matérn kernel with $\nu =1/2$, $3/2$ and $5/2$. The size of the validation designs, $m\in \{10\hspace{0.1667em}d,20\hspace{0.1667em}d,30\hspace{0.1667em}d\}$, is indicated by the line symbols (+, ⋆ and ∘, respectively). We can see that the two estimators display similar performance and robustness with respect to mis-modelling. When m is small ${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$ often yields smaller MSE, see top curves, but the red and black curves are almost coincident for the larger values of m. These results, which are representative of those obtained for other choices of Q and d, indicate that correct residual weighting is more important than the detailed placement of the validation points ${\mathbf{Z}_{m}}$.

Note that, in the configurations tested, the default rule-of-thumb for the choice of K and θ presented in Section 4.1 leads indeed to good and stable performance.

Figure 7

MSE of $\widehat{\mathsf{ISE}}$ for ${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$ (black) and ${\zeta _{K}^{\mathsf{BQ}}}$ (red) for $m\in \{10,20,30\}$. From left to right $\nu =1/2,3/2,5/2$. Statistics over 500 realisations. Top: Q is the Matérn 3/2 kernel, bottom: Q is the Cauchy kernel; $d=1$.

Figure 8

MSE of $\widehat{\mathsf{ISE}}$ for ${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$ (black) and ${\zeta _{K}^{\mathsf{BQ}}}$ (red) for $m\in \{20,40,60\}$. Top: Q is a Matérn 3/2 kernel; bottom: Q is the Cauchy kernel. K is always a Matérn kernel, from left to right $\nu =1/2,3/2,5/2$; $d=3$.

Figure 9

MSE of $\widehat{\mathsf{ISE}}$ for ${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$ (black solid lines), ${\zeta _{K}^{\mathsf{KH}}}$ (red dashed lines) and ${\zeta ^{\mathsf{KH}\mathrm{\star }}}$ (dotted green lines), for $m=10$ (+), $m=20$ (⋆) and $m=30$ (∘). From left to right $\nu =1/2,3/2,5/2$. Q is a Matérn 3/2 kernel; $d=1$.

Figure 10

MSE of $\widehat{\mathsf{ISE}}$ for ${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$ (black solid lines), ${\zeta _{K}^{\mathsf{KH}}}$ (red dashed lines) and ${\zeta ^{\mathsf{KH}\mathrm{\star }}}$ (dotted green lines), for $m=20$ (+), $m=30$ (⋆) and $m=60$ (∘). From left to right $\nu =1/2,3/2,5/2$. Q is a Matérn 3/2 kernel; $d=2$.

4.3 Comparison with Kernel Herding

Considering only validation measures ζ with uniform weights $1/m$, standard KH also minimises an MMD, incrementally extending ${\mathbf{Z}_{m}}$ with

\[ {\mathbf{z}_{m+1}}\in \underset{\mathbf{x}\in {\mathcal{X}_{L}}\setminus \{{\mathbf{Z}_{m}}\cup {\mathbf{X}_{n}}\}}{\operatorname{arg\,max}}{P_{{\overline{K}_{|n}}}}(\mathbf{x})-{\overline{K}_{|n}}(\mathbf{x},{\mathbf{Z}_{m}}){\mathbf{1}_{m}}/m\hspace{0.1667em},\]

where ${\mathbf{1}_{m}}$ denotes the m-dimensional vector with all components equal to one. Since KH has smaller complexity than SBQ, and the results of the previous section suggest that optimal choice of ${\mathbf{Z}_{m}}$ is less important than correct determination of the weights ${\mathbf{w}_{i}}$, we compare now ${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$ to two other validation measures, whose designs ${\mathbf{Z}_{m}}$ are found by extending ${\mathbf{X}_{n}}$ by KH: ${\zeta _{K}^{\mathsf{KH}}}$, that performs KH for kernel K, and ${\zeta ^{\mathsf{KH}\mathrm{\star }}}$ that uses ${\overline{K}_{|n}}$. As we will see, the SBQ design is a superior alternative, both in terms of performance and numerical stability, to the KH designs.
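The KH extension rule above, followed by optimal re-weighting with (3.2), can be sketched as follows in dimension one. As before, this is an illustration under assumptions that are not part of the text: a regular candidate grid, a Matérn 3/2 kernel in which θ multiplies distances, θ = 20, the helper names, and the Gaussian (Isserlis) form assumed for ${\overline{K}_{|n}}$. Repetitions are handled by simply excluding already selected points.

```python
import numpy as np

def matern32(a, b, theta):
    # Matérn 3/2 kernel on 1-D inputs; theta multiplies distances (assumed convention).
    r = np.sqrt(3.0) * theta * np.abs(a[:, None] - b[None, :])
    return (1.0 + r) * np.exp(-r)

def kbar_matrix(grid, X, theta):
    # Assumed closed form of Kbar_{|n}, evaluated on all pairs of grid points.
    Kgx = matern32(grid, X, theta)
    Kpost = matern32(grid, grid, theta) - Kgx @ np.linalg.solve(
        matern32(X, X, theta), Kgx.T)
    v = np.diag(Kpost)
    return 2.0 * Kpost ** 2 + np.outer(v, v)

def kernel_herding(Kb, forbidden, m):
    # Standard KH: uniform weights 1/m on the current design at each step.
    P = Kb.mean(axis=1)                          # potential of mu_L
    chosen = []
    for _ in range(m):
        crit = P.copy()
        if chosen:
            crit -= Kb[:, chosen].mean(axis=1)   # Kbar(x, Z_m) 1_m / m
        crit[forbidden] = -np.inf                # exclude the learning design
        crit[np.array(chosen, dtype=int)] = -np.inf   # exclude repeated points
        chosen.append(int(np.argmax(crit)))
    return chosen

grid = np.linspace(0.0, 1.0, 201)                # candidate set X_L
idx = np.arange(10, 201, 20)                     # learning design: 0.05, ..., 0.95
Kb = kbar_matrix(grid, grid[idx], theta=20.0)
forbidden = np.zeros(grid.size, dtype=bool)
forbidden[idx] = True
chosen = kernel_herding(Kb, forbidden, m=8)

# Energies (up to rounding) of the KH design with uniform and with BQ weights.
PZ = Kb.mean(axis=1)[chosen]
KZZ = Kb[np.ix_(chosen, chosen)]
w_unif = np.full(len(chosen), 1.0 / len(chosen))
w_opt = np.linalg.solve(KZZ, PZ)
e_unif = float(w_unif @ KZZ @ w_unif - 2.0 * w_unif @ PZ + Kb.mean())
e_opt = float(w_opt @ KZZ @ w_opt - 2.0 * w_opt @ PZ + Kb.mean())
print("KH design:", np.sort(grid[chosen]))
print("energy, uniform weights:", e_unif, " BQ weights:", e_opt)
```

Replacing `Kb` by the matrix of the original kernel K on the grid in `kernel_herding` would give the support of ${\zeta _{K}^{\mathsf{KH}}}$ instead of that of ${\zeta ^{\mathsf{KH}\mathrm{\star }}}$.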

Since ${\zeta _{K}^{\mathsf{KH}}}$ considers only, at each step, measures with uniform weights, and ${\zeta ^{\mathsf{KH}\mathrm{\star }}}$ does not take into account the optimal weights that will be applied when ${\mathbf{Z}_{m}}$ is extended to ${\mathbf{Z}_{m+1}}$, we can expect the following ranking of these estimators:

(4.1)

\[ \mathcal{R}({\zeta _{K}^{\mathsf{KH}}};{\mathcal{F}_{n}})\ge \mathcal{R}({\zeta ^{\mathsf{KH}\mathrm{\star }}};{\mathcal{F}_{n}})\ge \mathcal{R}({\zeta ^{\mathsf{BQ}\mathrm{\star }}};{\mathcal{F}_{n}})\hspace{0.1667em}.\]

Figures 9 and 10 plot, for $d=1$ and $d=2$, respectively, the MSE of estimators $\widehat{\mathsf{ISE}}$ that use ${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$ (black solid lines), ${\zeta _{K}^{\mathsf{KH}}}$ (red dashed lines) and ${\zeta ^{\mathsf{KH}\mathrm{\star }}}$ (dotted green lines). Kernels (Q and K) and design sizes m are as in the previous examples, see the figure captions. We can see that ${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$ has virtually always smaller MSE than the validation measures using validation designs ${\mathbf{Z}_{m}}$ found by KH, in particular for small design sizes m and the more regular models. It also appears to be more robust with respect to the choice of the GP kernel. We remark that the design found by KH for kernel ${\overline{K}_{|n}}$, i.e., the validation measure ${\zeta ^{\mathsf{KH}\mathrm{\star }}}$ (in green), often leads to the poorest performance. That use of ${\overline{K}_{|n}}$ may lead to worse performance than simply using K has already been noticed in [12], where only validation sets generated with KH were considered.

Moreover, our experiments reveal that the designs ${\zeta ^{\mathsf{KH}\mathrm{\star }}}$ can sometimes lead to ISE estimates with very large errors. This happens when KH places design points close to ${\mathbf{X}_{n}}$. In fact, the implementation of standard KH for kernel ${\overline{K}_{|n}}$ needs careful handling of possible repetition of design points, as already noted in [35] where an algorithm is proposed to accommodate this eventuality. Since the implementation used in Figures 9 and 10 simply imposes ${\mathbf{z}_{m+1}}\notin ({\mathbf{X}_{n}}\cup {\mathbf{Z}_{m}})$, a grid point very close to ${\mathbf{X}_{n}}\cup {\mathbf{Z}_{m}}$ can be chosen, as shown below.

Figure 11 shows the designs ${\mathbf{Z}_{m}}$, $m=30$, for Matérn kernels with $\theta ={\theta _{0}}$, $d=1$, and regularity parameter (top to bottom panels) $\nu =1/2,3/2$ and $5/2$. The vertical red lines indicate ${\mathbf{X}_{n}}$ and the black stars, green circles and red squares the position of points of ${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$, ${\zeta ^{\mathsf{KH}\mathrm{\star }}}$ and ${\zeta _{K}^{\mathsf{KH}}}$, respectively. A vertical offset is used to facilitate the visualisation of each design (from top to bottom, ${\zeta _{K}^{\mathsf{KH}}}$, ${\zeta ^{\mathsf{KH}\mathrm{\star }}}$ and ${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$). Remark first that the SBQ designs are always space-filling continuations of ${\mathbf{X}_{n}}$, presenting a good stability with respect to ν, mainly moving points closer to the boundaries of $\mathcal{X}$ when ν increases. The other two designs place a few points in the vicinity of ${\mathbf{X}_{n}}$.

Figure 11

Designs for $\theta ={\theta _{0}}$ in Figure 9. From top to bottom: $\nu =1/2,3/2,5/2$. ${\mathbf{X}_{n}}$: red ∗; ${\mathbf{Z}_{m}^{\mathsf{BQ}\mathrm{\star }}}$: black ∗; ${\mathbf{Z}_{m}^{\mathsf{KH}}}$: red □ and ${\mathbf{Z}_{m}^{\mathsf{KH}\mathrm{\star }}}$: green ∘.

4.4 Properties of the Design Measures ${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$

For the same set of kernels K and design sizes considered in Figure 4 (with $d=1$), we plot in Figure 12 the sum of the design weights, $S(\theta )={\textstyle\sum _{i}}{\mathbf{w}_{i}}(\theta )$, as a function of the (normalised) scale parameter of K. K is always a Matérn kernel, with regularity parameter $\nu =1/2,\hspace{0.1667em}3/2,5/2$ (top to bottom), as indicated in the legends. The learning design ${\mathbf{X}_{n}}$ ($n=10$) is the same for all cases.

Figure 12

$S={\textstyle\sum _{i}}{\mathbf{w}_{i}}$ for measure ${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$, $n=10$.

Three values of m are considered, $m=10,\hspace{0.1667em}20$ and 30 (blue, red and cyan curves, respectively). In each curve the black squares indicate the value ${\theta _{0}}={n^{1/d}}$. We can see that $S(\theta )$ increases with m. For θ larger than a certain value, S becomes nearly constant, with a value smaller than one (note that the value ${\theta _{c}}$ prescribed by our rule of thumb for the scale parameter, which corresponds to the normalised value of θ equal to one, is always inside this range) while for $\theta ={n^{1/d}}$ (indicated by a square), under the more regular model with a Matérn 5/2 kernel, S may be larger than 1.

Figure 13

${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$ in Figure 12, $\theta ={n^{1/d}}$. Validation design sizes: $m=10$ (blue), $m=20$ (red) and $m=30$ (cyan).

Figure 14

${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$ in Figure 12, $\theta ={(n+m)^{1/d}}$. Validation design sizes: $m=10$ (blue), $m=20$ (red) and $m=30$ (cyan).

Figure 15

${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$ in Figure 12, $\theta =2\hspace{0.1667em}{(n+m)^{1/d}}$. Validation design sizes: $m=10$ (blue), $m=20$ (red) and $m=30$ (cyan).

Figures 13, 14 and 15 present the designs for three values of θ: $\theta ={n^{1/d}}$ (the value used in the simulations of Figure 4, and indicated by the squares in Figure 12), for the value prescribed by our rule of thumb, $\theta ={(n+m)^{1/d}}$, and for $\theta =2\hspace{0.1667em}{(n+m)^{1/d}}$, the upper limit considered in Figure 4. In the figures, the weights of ${\zeta _{m}}$ are shown multiplied by m, to enable comparison. The distinct kernels K correspond to the three row panels, as indicated in the figure (with regularity increasing from top to bottom). The dotted black vertical lines (the same in the three panels) indicate the learning design ${\mathbf{X}_{n}}$. The colours code the validation design size: $m=10$ in blue, $m=20$ in red and $m=30$ in cyan. Note the striking similarity of the validation measures obtained for the different kernels in Figures 14 and 15, supporting our observations concerning the robustness of the estimator. The figures also show that the validation designs are, as expected, space-filling continuations of ${\mathbf{X}_{n}}$, and that as m grows (remember ${\mathbf{Z}_{10}}\subset {\mathbf{Z}_{20}}\subset {\mathbf{Z}_{30}}$) the holes of ${\mathbf{X}_{n}}\cup {\mathbf{Z}_{m}}$ are refined. Note, however, the slow rate of population of the immediate neighbourhood of ${\mathbf{X}_{n}}$, ${\mathbf{Z}_{m}}$ tending first, as m grows, to refine the interior of the wider holes of ${\mathbf{X}_{n}}$. For the Matérn 5/2 kernel and $\theta ={n^{1/d}}$ a few weights, corresponding to validation points close to ${\mathbf{X}_{n}}$, become very large, see Figure 13, which explains why $S(\theta )$ may be larger than one on the bottom panel of Figure 12. Analysis of the validation measures obtained assuming the larger value of θ in Figure 15 shows that, as the assumed correlation length decreases, ${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$ tends to a uniform measure, all weights having now a similar value. Note that even in this situation, use of the BQ measure, which down-weights the squared residuals, leads to a smaller error than use of the simple uniform measure over ${\mathbf{Z}_{m}}$, as the comparison of Figures 4 and 6 in Section 4.1 has shown.

5 “Real” Models

In this section we study the behaviour of the proposed validation method on deterministic functions. More precisely, we consider the following multidimensional functions:

• The 2-dimensional drag model that describes the quasi-steady drag coefficient ${C_{D}}$ of a spherical particle in a compressible flow, see [30]:
\[\begin{array}{r@{\hskip10.0pt}c@{\hskip10.0pt}l}& & \displaystyle {C_{D}}(M,{R_{e}})=\left(\alpha ({R_{e}})-\beta ({R_{e}})\right)\xi (M,{R_{e}})+\beta ({R_{e}})\\ {} & & \displaystyle \hspace{-22.76228pt}\alpha ({R_{e}})=\frac{24}{{R_{e}}}\left(1+0.107{R_{e}^{0.867}}\right)+0.646{\left(1+\frac{861}{{R_{e}^{0.634}}}\right)^{-1}}\\ {} & & \displaystyle \hspace{-22.76228pt}\beta ({R_{e}})=\frac{24}{{R_{e}}}\left(1+0.118{R_{e}^{0.813}}\right)+0.69{\left(1+\frac{3550}{{R_{e}^{0.793}}}\right)^{-1}}\\ {} & & \displaystyle \xi (M,{R_{e}})={\sum \limits_{i=1}^{3}}{f_{i}}(M)\prod \limits_{j\ne i,j\in \{1,2,3\}}\frac{\log {R_{e}}-{C_{j}}}{{C_{i}}-{C_{j}}}\end{array}\]
where
\[\begin{aligned}{}{C_{1}}& =6.48,\hspace{2.5pt}{C_{2}}=8.93,\hspace{2.5pt}{C_{3}}=12.21\\ {} f(M)& =\mathbf{a}+\mathbf{b}M+\mathbf{c}{M^{2}}+\mathbf{d}{M^{3}}-\mathbf{g}(M)\\ {} \mathbf{a}& =-{\left[2.963\hspace{2.5pt}6.617\hspace{2.5pt}5.866\right]^{T}}\\ {} \mathbf{b}& ={\left[4.392\hspace{2.5pt}12.11\hspace{2.5pt}11.57\right]^{T}}\\ {} \mathbf{c}& =-{\left[1.169\hspace{2.5pt}6.501\hspace{2.5pt}6.665\right]^{T}}\\ {} \mathbf{d}& ={\left[-0.027\hspace{2.5pt}1.182\hspace{2.5pt}1.312\right]^{T}}\\ {} \mathbf{g}(M)& =\left[\begin{array}{c}0.233{e^{(1-M)/0.11}}\\ {} 0.174{e^{(1-M)/0.01}}\\ {} 0.350{e^{(1-M)/0.012}}\end{array}\right]\end{aligned}\]
• The 7-dimensional piston model, that describes the circular motion of a piston within a cylinder, see [1]:
\[\begin{aligned}{}C(\mathbf{x})& =2\pi \sqrt{\frac{M}{k+{S^{2}}\frac{{P_{0}}{V_{0}}}{{T_{0}}}\frac{{T_{a}}}{{V^{2}}}}}\\ {} V& =\frac{S}{2k}\left(\sqrt{{A^{2}}+4k{T_{a}}\frac{{P_{0}}{V_{0}}}{{T_{0}}}}-A\right)\\ {} A& ={P_{0}}S+19.62M-k\frac{{V_{0}}}{S}\end{aligned}\]
with ${x_{1}}=M\in [30,60]$, ${x_{2}}=S\in [0.005,0.020]$, ${x_{3}}={V_{0}}\in [0.002,0.010]$, ${x_{4}}=k\in [1000,5000]$, ${x_{5}}={P_{0}}\in [90000,110000]$, ${x_{6}}={T_{a}}\in [290,296]$ and ${x_{7}}={T_{0}}\in [340,360]$. In [28], a screening study is presented, indicating that only variables ${x_{i}}$ for $i\le 4$ are relevant. For this reason we consider $C(\mathbf{x})$ only as a 4-dimensional function, with the remaining three input variables (${x_{5}}$, ${x_{6}}$ and ${x_{7}}$) being fixed to the mid-point of the corresponding intervals.
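As an illustration, the reduced 4-dimensional piston function can be sketched as follows. This is a minimal sketch and not the authors' code: the function name `piston_cycle_time` and the constant names are hypothetical, the three inactive inputs are frozen at the mid-points of the ranges quoted above, and the term $19.62M$ enters $A$ with a plus sign, as in the usual statement of this benchmark [1].

```python
import math

# Mid-points of the ranges of x5 = P0, x6 = Ta and x7 = T0 quoted in the text.
P0_MID, TA_MID, T0_MID = 100_000.0, 293.0, 350.0


def piston_cycle_time(M, S, V0, k, P0=P0_MID, Ta=TA_MID, T0=T0_MID):
    """Cycle time C(x) of the piston model, with x5..x7 fixed by default."""
    A = P0 * S + 19.62 * M - k * V0 / S
    gas = P0 * V0 / T0 * Ta  # the recurring factor (P0 V0 / T0) * Ta
    V = S / (2.0 * k) * (math.sqrt(A**2 + 4.0 * k * gas) - A)
    return 2.0 * math.pi * math.sqrt(M / (k + S**2 * gas / V**2))


# Evaluate at the centre of the 4-dimensional input domain.
c_mid = piston_cycle_time(M=45.0, S=0.0125, V0=0.006, k=3000.0)
```

Freezing the inactive inputs through default arguments keeps the full 7-dimensional signature available if the screening assumption needs to be revisited.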

The functions generated by the models above cannot be well interpolated using simple kriging unless the design size is very large: they have a smooth trend which, when not taken into account, leads to a residual signal $\epsilon (\mathbf{x})$ that strongly departs from the stationarity hypothesis assumed in this work. For that reason, we consider the estimation of ISE for interpolators of f of the form

\[ \eta (\mathbf{x})={P_{q,n}}(\mathbf{x})+g(\mathbf{x})\hspace{0.1667em}.\]

Above, ${P_{q,n}}(\mathbf{x})$ is the complete q-degree polynomial obtained by least squares fit to the n observations ${\mathbf{y}_{n}}$ over the learning design ${\mathbf{X}_{n}}$. The term $g(\mathbf{x})$ is the simple kriging interpolator of

\[ {f^{\prime }}(\mathbf{x})=f(\mathbf{x})-{P_{q,n}}(\mathbf{x})\hspace{0.1667em}.\]
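A minimal sketch of such a predictor, a least-squares polynomial trend plus simple kriging of the detrended observations, is given below. It is an illustration under stated assumptions, not the implementation used for the experiments: the helper names (`matern32`, `poly_features`, `fit_predictor`), the random design and the test function are all hypothetical, and an isotropic Matérn 3/2 kernel is assumed.

```python
import itertools
import numpy as np


def matern32(X, Y, theta):
    """Isotropic Matérn 3/2 kernel matrix between the rows of X and Y."""
    r = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    a = np.sqrt(3.0) * theta * r
    return (1.0 + a) * np.exp(-a)


def poly_features(X, q):
    """All monomials of total degree <= q (complete q-degree basis)."""
    n, d = X.shape
    cols = [np.ones(n)]
    for deg in range(1, q + 1):
        for idx in itertools.combinations_with_replacement(range(d), deg):
            cols.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(cols)


def fit_predictor(X, y, q, theta):
    """Return eta(x) = P_{q,n}(x) + g(x), g interpolating y - P_{q,n}(X)."""
    F = poly_features(X, q)
    coef, *_ = np.linalg.lstsq(F, y, rcond=None)   # least-squares trend
    K = matern32(X, X, theta)
    alpha = np.linalg.solve(K, y - F @ coef)       # simple kriging weights

    def eta(Xnew):
        return poly_features(Xnew, q) @ coef + matern32(Xnew, X, theta) @ alpha

    return eta


rng = np.random.default_rng(0)
n, d = 20, 2
X = rng.random((n, d))                     # hypothetical learning design
y = np.sin(3.0 * X[:, 0]) + X[:, 1] ** 2   # hypothetical test function
eta = fit_predictor(X, y, q=1, theta=n ** (1.0 / d))
max_gap = np.max(np.abs(eta(X) - y))       # eta interpolates the data
```

Because $g$ interpolates the detrended observations exactly, $\eta$ reproduces ${\mathbf{y}_{n}}$ on ${\mathbf{X}_{n}}$ whatever the degree q, which is the interpolation property the ISE estimator relies on.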

In all cases, ${\mathbf{X}_{n}}$ is a space filling design determined by Kernel herding for a spherical Matérn 3/2 kernel with $\theta ={n^{1/d}}$, and all validation designs ${\mathbf{Z}_{m}}$ assume a spherical Matérn kernel (several settings of its correlation length are studied). All experiments reported in this section consider $m=n/2$.

We compare the performance of the ISE estimator proposed in the paper, using the validation measure ${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$, with the simple empirical estimator ${\widehat{\mathsf{ISE}}_{un}}$ that uses a uniform validation distribution with support the validation design ${\mathbf{Z}_{m}}$, see (2.2). All validation measures ${\zeta ^{\mathsf{BQ}\mathrm{\star }}}$ assume a Matérn 3/2 kernel. Learning designs ${\mathbf{X}_{n}}$ are always space filling, found by Kernel Herding for a Matérn kernel with parameter $\theta ={n^{1/d}}$.

The robustness of the estimator with respect to the assumed value of the range parameter of the covariance of the GP model is studied by showing three panels, corresponding to $\theta \in \{{m^{1/d}},{n^{1/d}},{(n+m)^{1/d}}\}$ (increasing from left to right). The figures display grouped bar plots, corresponding to increasing sizes of ${\mathbf{X}_{n}}$ (and thus of ${\mathbf{Z}_{m}}$): $n\in \{10,20,30,40,50\}d$. The blue bars correspond to the true value, the red bars to ${\widehat{\mathsf{ISE}}^{\mathsf{BQ}\mathrm{\star }}}$, and the yellow bars to ${\widehat{\mathsf{ISE}}_{un}}$.

5.1 Drag Model

For the drag model ${P_{q,n}}$ has degree $q=1$.

Figure 16

Drag model. True ISE (blue), SBQ estimate (red) and ${\widehat{\mathsf{ISE}}_{un}}$ (yellow), $\eta (\mathbf{x})={P_{q,n}}(\mathbf{x})+g(\mathbf{x})$. From left to right: $\theta ={m^{1/d}},{n^{1/d}},{(n+m)^{1/d}}$.

Figure 17

Drag model. True ISE (blue), SBQ estimate (red) and ${\widehat{\mathsf{ISE}}_{un}}$ (yellow), $\eta (\mathbf{x})={P_{q,n}}(\mathbf{x})$. From left to right: $\theta ={m^{1/d}},{n^{1/d}},{(n+m)^{1/d}}$.

Figure 18

Piston model. True ISE (blue), BQ estimate (red) and ${\widehat{\mathsf{ISE}}_{un}}$ (yellow). Interpolator, $\eta (\mathbf{x})=P(\mathbf{x})+g(\mathbf{x})$. From left to right: $\theta ={m^{1/d}},{n^{1/d}},{(n+m)^{1/d}}$; $p=3$.

Figure 16 confirms the robustness of ${\widehat{\mathsf{ISE}}^{\mathsf{BQ}\mathrm{\star }}}$ with respect to the choice of θ. Also, the anticipated overestimation of the empirical estimator, as well as the negative bias of the BQ estimator, are apparent in the figure. Note that different values of the assumed θ – corresponding to the three panels of the figure – lead both to distinct validation weights and to different validation designs ${\mathbf{Z}_{m}}$. As we had anticipated, the validation weights of ${\widehat{\mathsf{ISE}}^{\mathsf{BQ}\mathrm{\star }}}$ compensate for the precise location of the points of ${\mathbf{Z}_{m}}$, and the variation of the estimates across the three panels in Figure 16 is minor. By contrast, the empirical estimator ${\widehat{\mathsf{ISE}}_{un}}$ is markedly sensitive to the exact placement of ${\mathbf{Z}_{m}}$, changing significantly across the three panels.

Figure 17 considers $\eta (\mathbf{x})={P_{q,n}}(\mathbf{x})$, violating the interpolating assumption. As we might expect, the SBQ estimate of $\mathsf{ISE}$ (red bars) now has a substantial (negative) bias. Somewhat surprisingly, when the value of n is small the positive bias of ${\widehat{\mathsf{ISE}}_{un}}$ (yellow bars) partially compensates, in this example, for the non-zero residuals of η over ${\mathbf{X}_{n}}$.

5.2 Piston Model

We now turn to the higher-dimensional piston model ($d=4$), for which ${P_{q,n}}$ is a polynomial of degree $q=2$. Figure 18 is the equivalent of Figure 16. It confirms the superior performance and robustness of ${\widehat{\mathsf{ISE}}^{\mathsf{BQ}\mathrm{\star }}}$ over ${\widehat{\mathsf{ISE}}_{un}}$.

When η is not an interpolator, the behaviour is similar to that observed for the drag model.

6 Conclusions

The paper presents an estimator for the ISE of an interpolator based on knowledge of the design on which it has been learned, defined as the ISE for a finitely supported validation measure. The estimator proposed is the optimal MSE linear estimator under the assumption that the interpolated function is a realisation from a Gaussian process with known statistical moments. The support and weights of the validation measure are found by minimising an MMD for a non-stationary kernel that is adapted to the learning design, and a nested sequence of validation designs is greedily determined by SBQ. A default rule is proposed to select the covariance kernel of the assumed model.

Numerical experiments on both simulations from nominal Gaussian processes and on two real models of small dimension confirm the superior performance of the proposed estimator when compared to common estimation by the simple empirical average of the observed squared residuals.

The interpretation of the ISE estimator in terms of an interpolation of the squared residuals explains the utmost importance of accounting for the correct shape of their second-order moment. Moreover, it explains the observed robustness of the estimator with respect to the covariance of the assumed GP model.

The work presented suggests several directions for future developments. One concerns the determination of indicators of the quality of the ISE estimate itself, ideally given by the risk function that is optimised. These could be used both to define stopping rules, indicating that incorporation of further residual observations should not yield a significant improvement in the confidence of the current ISE estimate, and to flag poor performance of the current interpolator and trigger its update by including some of the residuals observed over ${\mathbf{Z}_{m}}$ in the learning dataset ${\mathcal{F}_{n}}$. A major difficulty is related to the dependency of the MSE of the interpolator on the assumed process variance, which is known to be difficult to estimate [22]. A possible source of suboptimality of the estimator presented is the restriction to a linear estimator. Extending the method to more general estimators while preserving its robustness is a challenging objective. Finally, we believe that the analysis presented here suggests possible approaches to defining (down-)weighted CV estimators that perform better than standard ones; this is the subject of ongoing work.

Appendix A Bias Correction

Under the assumed GP model for ${f_{|{\mathcal{F}_{n}}}}$, the estimator $\widehat{\mathsf{ISE}}({\eta _{{\mathcal{F}_{n}}}},{\mathbf{Z}_{m}})$ has a non-zero bias:

\[\begin{aligned}{}B({\mathbf{Z}_{m}})& =\mathsf{E}\left\{\left.\widehat{\mathsf{ISE}}({\eta _{{\mathcal{F}_{n}}}},{\mathbf{Z}_{m}})-\mathsf{ISE}({\eta _{{\mathcal{F}_{n}}}})\right|{\mathcal{F}_{n}}\right\}\\ {} & \hspace{-30.0pt}={\sigma ^{2}}\tilde{\mathbf{w}}{({\mathbf{Z}_{m}})^{T}}{\mathbf{k}_{|n}}({\mathbf{Z}_{m}})-{\mathsf{IMSE}^{\mathrm{\star }}}({\mathbf{X}_{n}})\hspace{0.1667em},\end{aligned}\]

with ${\mathbf{k}_{|n}}({\mathbf{Z}_{m}})$ the m-dimensional column vector with components ${[{k_{|n}}]_{i}}={K_{|n}}({\mathbf{z}_{i}},{\mathbf{z}_{i}})$ (see equation (2.3)), and ${\mathsf{IMSE}^{\mathrm{\star }}}({\mathbf{X}_{n}})$ given by (2.4).

By noting that ${\widehat{{\varepsilon ^{2}}}_{{\mathcal{F}_{n}}}}(\mathbf{x}|{\mathbf{Z}_{m}})$ given by (3.3) is also the optimal MMSE estimator under the zero mean model ${\mathcal{GP}^{0}}=\mathcal{GP}(0,{s^{2}}{\overline{K}_{|n}})$ for some (arbitrary) ${s^{2}}$ (necessarily linear, since the model is Gaussian), equation (3.4) suggests that $B({\mathbf{Z}_{m}})$ may be negative: as ${\widehat{{\varepsilon ^{2}}}_{{\mathcal{F}_{n}}}}(\mathbf{x}|{\mathbf{Z}_{m}})$ is optimal for ${\mathcal{GP}^{0}}$, it should tend to have smaller values than estimators that consider the correct first posterior moment, i.e., $\mathcal{GP}({\sigma ^{2}}{K_{|n}}(\mathbf{x},\mathbf{x}),{\sigma ^{4}}{\overline{K}_{|n}})$. Figure 19 displays the empirical bias $(1/M){\textstyle\sum _{i}}\left({({\widehat{{\varepsilon ^{2}}}_{{\mathcal{F}_{n}}}})^{(i)}}(\mathbf{x}|{\mathbf{Z}_{m}})-{({\varepsilon ^{2}})^{(i)}}(\mathbf{x})\right)$ observed over $M=500$ realisations from several GP models, supporting this conjecture (simulations are from the models considered in Figure 2).

Simply subtracting $B({\mathbf{Z}_{m}})$ from the biased linear estimator yields the following unbiased affine estimator

(A.1)

\[\begin{aligned}{}{\widehat{\mathsf{ISE}}_{affine}}({\mathbf{X}_{n}},{\mathbf{Z}_{m}})=& {\sum \limits_{i=1}^{m}}{\tilde{\mathbf{w}}_{i}}{\varepsilon ^{2}}({\mathbf{z}_{i}})-B({\mathbf{Z}_{m}})\\ {} & ={\mathsf{IMSE}^{\mathrm{\star }}}({\mathbf{X}_{n}})+{\tilde{\mathbf{w}}^{T}}{\Delta _{m}}\hspace{0.1667em},\end{aligned}\]

where ${\Delta _{m}}$ collects the mean corrected squared residuals at the validation points: ${\Delta _{m}}({\mathbf{z}_{i}})={\varepsilon ^{2}}({\mathbf{z}_{i}})-{\sigma ^{2}}{K_{|n}}({\mathbf{z}_{i}},{\mathbf{z}_{i}})$, $i=1,\dots ,m$. Note that implementation of the expression above requires specification of the value of ${\sigma ^{2}}$, which is in practice unknown. In the numerical experiments presented in Figures 20 and 21, we considered the ideal situation where the true value (${\sigma ^{2}}=1$) is known.
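The two equivalent forms of (A.1) can be checked on a toy example. The sketch below is only illustrative: the function name `affine_ise` is hypothetical and the numerical values of the weights, squared residuals, posterior variances and ${\mathsf{IMSE}^{\mathrm{\star }}}$ are arbitrary stand-ins, not quantities computed from an actual design.

```python
import numpy as np


def affine_ise(w, sq_res, k_post, sigma2, imse_star):
    """Affine estimator (A.1): weighted squared residuals minus the bias B."""
    bias = sigma2 * (w @ k_post) - imse_star   # B(Z_m)
    return w @ sq_res - bias


# Arbitrary stand-in values (hypothetical, for illustration only).
w = np.array([0.2, 0.3, 0.1])           # weights w~(Z_m)
sq_res = np.array([0.04, 0.01, 0.09])   # squared residuals at z_1..z_m
k_post = np.array([0.05, 0.02, 0.08])   # posterior variances K_{|n}(z_i, z_i)
sigma2, imse_star = 1.0, 0.03

est = affine_ise(w, sq_res, k_post, sigma2, imse_star)
# Second form of (A.1): IMSE* plus weighted mean-corrected squared residuals.
alt = imse_star + w @ (sq_res - sigma2 * k_post)
```

Both expressions give the same value, which makes explicit that the affine estimator only uses the observed residuals through their deviation from the variance predicted by the model.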

Figure 19

Bias of $\widehat{{\varepsilon ^{2}}}$, $\mathbf{x}\in \mathcal{X}$. (simulations from Cauchy and Matérn 3/2 kernels).

Figure 20

${\mathsf{IMSE}^{\mathrm{\star }}}({\mathbf{X}_{n}};\theta )$ (black), $\overline{\mathsf{ISE}}$ (dashed), ${\overline{\widehat{\mathsf{ISE}}}_{biased}}(\theta )$ (blue), ${\overline{\widehat{\mathsf{ISE}}}_{affine}}(\theta )$ (red) and ${\overline{\widehat{\mathsf{ISE}}}_{linear}}(\theta )$ (green). From left to right $m=5,10,20$. Q and K are the Matérn 3/2 kernel.

Figure 21

${\mathsf{IMSE}^{\mathrm{\star }}}({\mathbf{X}_{n}};\theta )$ (black), $\overline{\mathsf{ISE}}$ (dashed), ${\overline{\widehat{\mathsf{ISE}}}_{biased}}(\theta )$ (blue), ${\overline{\widehat{\mathsf{ISE}}}_{affine}}(\theta )$ (red), ${\overline{\widehat{\mathsf{ISE}}}_{linear}}(\theta )$ (green). From left to right $m=5,10,20$. Q is the Matérn 3/2 kernel; K is the Matérn kernel with parameter $\nu =1/2$ (top) and $\nu =5/2$ (bottom).

Alternatively, a linear (instead of affine) unbiased solution can be found by using weights $\mathbf{w}({\mathbf{Z}_{m}})$ that minimise the same quadratic cost function, but under the zero bias constraint. This leads to the following additive correction of the optimal weights of the biased linear estimator $\widehat{\mathsf{ISE}}$:

(A.2)

\[\begin{aligned}{}{\mathbf{w}_{linear}}({\mathbf{Z}_{m}})& =\tilde{\mathbf{w}}({\mathbf{Z}_{m}})\\ {} & \hspace{-20.0pt}-\frac{{\sigma ^{2}}\tilde{\mathbf{w}}{({\mathbf{Z}_{m}})^{T}}{\mathbf{k}_{|n}}({\mathbf{Z}_{m}})-{\mathsf{IMSE}^{\mathrm{\star }}}({\mathbf{X}_{n}})}{{\sigma ^{2}}{\mathbf{k}_{|n}}{({\mathbf{Z}_{m}})^{T}}\mathbf{t}}\mathbf{t}\hspace{0.1667em},\end{aligned}\]

where $\mathbf{t}={\overline{K}_{|n}}{({\mathbf{Z}_{m}},{\mathbf{Z}_{m}})^{-1}}{\mathbf{k}_{|n}}({\mathbf{Z}_{m}})$.
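The correction (A.2) can be sketched numerically to confirm that the corrected weights satisfy the zero-bias constraint ${\sigma ^{2}}{\mathbf{w}^{T}}{\mathbf{k}_{|n}}({\mathbf{Z}_{m}})={\mathsf{IMSE}^{\mathrm{\star }}}({\mathbf{X}_{n}})$. The inputs below are random stand-ins (a generic symmetric positive definite matrix in place of ${\overline{K}_{|n}}({\mathbf{Z}_{m}},{\mathbf{Z}_{m}})$, arbitrary weights and variances), and the function name is hypothetical.

```python
import numpy as np


def linear_unbiased_weights(w, Kbar, k_post, sigma2, imse_star):
    """Weights of (A.2): additive correction of w along t = Kbar^{-1} k."""
    t = np.linalg.solve(Kbar, k_post)
    bias = sigma2 * (w @ k_post) - imse_star   # bias of the original weights
    return w - bias / (sigma2 * (k_post @ t)) * t


rng = np.random.default_rng(1)
m = 5
G = rng.standard_normal((m, m))
Kbar = G @ G.T + m * np.eye(m)    # SPD stand-in for Kbar_{|n}(Z_m, Z_m)
k_post = rng.random(m) + 0.1      # stand-in posterior variances (positive)
w = rng.random(m) / m             # stand-in for the weights w~(Z_m)
sigma2, imse_star = 1.0, 0.2

w_lin = linear_unbiased_weights(w, Kbar, k_post, sigma2, imse_star)
residual_bias = sigma2 * (w_lin @ k_post) - imse_star   # should vanish
```

The residual bias vanishes up to rounding for any choice of inputs, since the correction is exactly the step along $\mathbf{t}$ needed to reach the constraint.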

Denote by ${\widehat{\mathsf{ISE}}_{biased}}$ the empirical $\mathsf{ISE}$ estimator that uses the validation measure presented in Section 3.2, and let ${\widehat{\mathsf{ISE}}_{linear}}$ denote the linear unbiased estimator with weights given by (A.2).

As (A.2) is linear and ${\widehat{{\varepsilon ^{2}}}_{{\mathcal{F}_{n}}}}(\mathbf{x}|{\mathbf{Z}_{m}})$ in (3.3) is the MMSE linear estimate of ${\varepsilon ^{2}}(\mathbf{x})$ given ${\mathcal{F}_{n}}$, ${\widehat{\mathsf{ISE}}_{linear}}$ will necessarily perform worse than ${\widehat{\mathsf{ISE}}_{biased}}$ when using the correct model for f. Conversely, ${\widehat{\mathsf{ISE}}_{affine}}$ performs better than ${\widehat{\mathsf{ISE}}_{biased}}$ for the right model for f. However, the numerical experiments presented below show that ${\widehat{\mathsf{ISE}}_{linear}}$ and ${\widehat{\mathsf{ISE}}_{affine}}$ both perform poorly and lack robustness: as both estimators explicitly incorporate the uncertainty predicted by the posterior distribution, they inherit, as we will see, the well-known sensitivity of modelled prediction uncertainty to the assumed model.

We performed $M=500$ simulations from a GP with kernel $Q(\cdot ,\cdot ;{\theta _{0}})$, the Matérn kernel with regularity parameter $\nu =3/2$ and ${\theta _{0}}=n$, over the domain $\mathcal{X}={[0,1]^{d}}$ ($d=1$). The corresponding optimal Bayesian interpolators ${\eta _{{\mathcal{F}_{n}}}^{(i)}}$ all use the same learning design ${\mathbf{X}_{n}}$ of size $n=10$ (see details in Appendix C).

Let ${\widehat{\mathsf{ISE}}_{c}^{(i)}}({K_{\theta }})$, $c\in \{biased,affine,linear\}$ denote the estimate ${\widehat{\mathsf{ISE}}_{c}}({\eta _{{\mathcal{F}_{n}}}^{(i)}})$ when the validation measure assumes kernel $K(\cdot ,\cdot ;\theta )$. Figures 20 and 21 plot the average of these estimates. In Figure 20 $K=Q$, while in Figure 21 measures ${\zeta _{m}}$ are based on Matérn kernels with $\nu \in \{1/2,\hspace{0.1667em}5/2\}$. In both figures the horizontal dashed black lines indicate $\overline{\mathsf{ISE}}$, the empirical average of ${\mathsf{ISE}^{(i)}}({\eta _{{\mathcal{F}_{n}}}})$ over the M realisations, and the solid black curve is ${\mathsf{IMSE}^{\mathrm{\star }}}({\mathbf{X}_{n}};\theta )$, predicted by kernel $K(\cdot ,\cdot ;\theta )$. The three panels correspond to increasing design sizes $m=5,10,20$ (from left to right).

The correct ${\theta _{0}}$ can be identified in Figure 20 as the value at which the black solid and dashed lines intersect: ${\mathsf{IMSE}^{\mathrm{\star }}}({\mathbf{X}_{n}};{\theta _{0}})=\overline{\mathsf{ISE}}$. We can see that for the correct parameter value the unbiased estimates (red and green curves) both have the correct mean, while the biased estimator (blue line) has, as foreseen, a negative bias. For $\theta \ne {\theta _{0}}$ all three estimators have a non-zero bias, which decreases when m grows as the estimators become less dependent on the prior stochastic model for f. For large design sizes, the two linear estimates (blue and green curves) have nearly the same bias, showing that bias correction is mainly relevant for small validation designs. As anticipated, the unbiased estimates display a larger sensitivity with respect to model mismatch than the original ${\widehat{\mathsf{ISE}}_{biased}}(\theta )$, which shows a remarkably stable behaviour with respect to θ.

In Figure 21 wrong values of ν are assumed by the design algorithm. In the top row $\nu =1/2$, less regular than Q, while in the bottom row $\nu =5/2$, more regular than the simulated model. While a much larger bias is observed for the exponential ($\nu =1/2$) model in the top row, the curves in the bottom panels are similar to those in Figure 20, indicating that the estimator can accommodate a model that assumes a higher regularity. The robustness of BQ with respect to models assuming higher regularity than the true one has been previously noted in [20]. Finally, note that ${\widehat{\mathsf{ISE}}_{biased}}$ behaves in a remarkably stable way, and that its bias is often the smallest amongst all three estimators.

Unless a high confidence can be given to the assumed GP model, including its scale parameter, the lack of robustness of the unbiased estimators prevents their use. For small design sizes, where bias correction could indeed be important, guaranteeing the fidelity of the assumed model is in general impossible, severely limiting the practical interest of the unbiased estimators discussed here.

Appendix B Potential ${P_{{\overline{K}_{|n}}}}(\mathbf{z})$ for Tensor-Product Kernels on ${[0,1]^{d}}$

B.1 Factorisation in the General Case

A key difficulty for the algorithmic construction of a validation design by SBQ (Section 3.2) or KH (Section 4.3) is the calculation of ${P_{{\overline{K}_{|n}}}}(\mathbf{x})={P_{{\overline{K}_{|n}},\mu }}(\mathbf{x})$ for many x in order to choose ${\mathbf{z}_{m+1}}$. However, when K is a tensor-product kernel, ${P_{{\overline{K}_{|n}},\mu }}$ can be calculated explicitly when μ is uniform on $\mathcal{X}={[0,1]^{d}}$.

We can write $\mu (\mathrm{d}\mathbf{x})={\textstyle\prod _{i=1}^{d}}{\mu _{1}}(\mathrm{d}{x_{i}})$ with ${\mu _{1}}$ the uniform measure on $[0,1]$ and $\mathbf{x}={({x_{1}},\dots ,{x_{d}})^{\top }}$. When $K(\mathbf{x},{\mathbf{x}^{\prime }})={\textstyle\prod _{i=1}^{d}}{K_{i}}({x_{i}},{x^{\prime }_{i}})$, with $\mathbf{x}={({x_{1}},\dots ,{x_{d}})^{\top }}$ and ${\mathbf{x}^{\prime }}={({x^{\prime }_{1}},\dots ,{x^{\prime }_{d}})^{\top }}$, we thus have

\[ {P_{K,\mu }}(\mathbf{x})={\prod \limits_{i=1}^{d}}{\int _{{\mathcal{X}_{i}}}}{K_{i}}({x_{i}},{x^{\prime }_{i}})\hspace{0.1667em}{\mu _{1}}(\mathrm{d}{x^{\prime }_{i}})={\prod \limits_{i=1}^{d}}{P_{{K_{i}},{\mu _{1}}}}({x_{i}})\hspace{0.1667em}.\]

One may refer to [43] for connections between positive-definiteness properties of the ${K_{i}}$ and those of K. The expression of ${P_{{K_{i}},{\mu _{1}}}}(\cdot )$ is available for many kernels ${K_{i}}$; see [36] and the references therein.

Before deriving the expression of ${P_{{\overline{K}_{|n}},\mu }}(\mathbf{x})$ we introduce some notation. Denote by ${\overline{\boldsymbol{\Omega }}_{K,n}}$ the $n\times n$ matrix with respective elements

\[ {\{{\overline{\boldsymbol{\Omega }}_{K,n}}\}_{j,k}}={\prod \limits_{i=1}^{d}}{\beta _{{K_{i}}}}({{x_{j}}_{i}},{{x_{k}}_{i}})\hspace{0.1667em},\]

and by ${\overline{\boldsymbol{\omega }}_{K,n}}(\mathbf{x})$ the vector with j-th component

\[ {\{{\overline{\boldsymbol{\omega }}_{K,n}}(\mathbf{x})\}_{j}}={\prod \limits_{i=1}^{d}}{\beta _{{K_{i}}}}({{x_{j}}_{i}},{x_{i}})\hspace{0.1667em},\]

where ${{x_{j}}_{i}}$ (respectively, ${{x_{k}}_{i}}$) is the i-th component of ${\mathbf{x}_{j}}$ (respectively, ${\mathbf{x}_{k}}$), and

\[\begin{array}{r@{\hskip10.0pt}c@{\hskip10.0pt}l}\displaystyle {\beta _{{K_{i}}}}(r,s)& \displaystyle =& \displaystyle {\int _{0}^{1}}{K_{i}}(r,t){K_{i}}(s,t)\hspace{0.1667em}{\mu _{1}}(\mathrm{d}t)\hspace{0.1667em},\hspace{2.5pt}i=1,\dots ,d\hspace{0.1667em}.\end{array}\]

Then, using (2.6), direct calculation gives

\[\begin{array}{r@{\hskip10.0pt}c@{\hskip10.0pt}l}\displaystyle {P_{{\overline{K}_{|n}},\mu }}(\mathbf{x})& \displaystyle =& \displaystyle 2\hspace{0.1667em}{P_{{K^{2}},\mu }}(\mathbf{x})-4\hspace{0.1667em}{\mathbf{k}_{n}^{\top }}(\mathbf{x}){\mathbf{K}_{n}^{-1}}{\overline{\boldsymbol{\omega }}_{K,n}}(\mathbf{x})\\ {} & & \displaystyle +2\hspace{0.2778em}{\mathbf{k}_{n}^{\top }}(\mathbf{x}){\mathbf{K}_{n}^{-1}}{\overline{\boldsymbol{\Omega }}_{K,n}}{\mathbf{K}_{n}^{-1}}{\mathbf{k}_{n}}(\mathbf{x})\\ {} & & \displaystyle \hspace{-28.45274pt}+\left[1-{\mathbf{k}_{n}^{\top }}(\mathbf{x}){\mathbf{K}_{n}^{-1}}{\mathbf{k}_{n}}(\mathbf{x})\right]\left[1-\mathrm{trace}({\mathbf{K}_{n}^{-1}}{\overline{\boldsymbol{\Omega }}_{K,n}})\right]\hspace{0.1667em}.\end{array}\]

The expressions of ${P_{{K^{2}},{\mu _{1}}}}(x)$ and ${\beta _{K}}(u,v)$, $x,u,v\in [0,1]$, for ${\mu _{1}}$ uniform on $[0,1]$ and ${K_{i}}(x,{x^{\prime }})$ a Matérn 3/2 kernel (C.1) are given in Section B.2, making the expression of ${P_{{\overline{K}_{|n}},\mu }}(\mathbf{x})$ available in closed form when $K(\mathbf{x},{\mathbf{x}^{\prime }})$ is the product of uni-dimensional Matérn 3/2 kernels and μ is uniform on $\mathcal{X}={[0,1]^{d}}$. Similar calculations can be conducted for other kernels. The expression of ${\mathcal{E}_{{\overline{K}_{|n}}}}(\mu )$, which appears in the expansion of $\mathcal{R}({\zeta _{m}},{\mathbf{X}_{n}})$, see (2.5), can be obtained in closed form in a similar way; see [35].
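The structure of the expression of ${P_{{\overline{K}_{|n}},\mu }}(\mathbf{x})$ can be checked numerically in dimension $d=1$. The sketch below replaces all integrals by a midpoint rule and compares the four-term formula with a direct integration. It rests on an assumption that should be stated clearly: equation (2.6) is not restated in this section, and the code takes ${\overline{K}_{|n}}(\mathbf{x},{\mathbf{x}^{\prime }})=2{K_{|n}^{2}}(\mathbf{x},{\mathbf{x}^{\prime }})+{K_{|n}}(\mathbf{x},\mathbf{x}){K_{|n}}({\mathbf{x}^{\prime }},{\mathbf{x}^{\prime }})$, which is the form that reproduces the displayed expression when $K(\mathbf{x},\mathbf{x})=1$. The design, the value of γ and all function names are hypothetical.

```python
import numpy as np


def k32(x, y, gamma):
    """Matérn 3/2 kernel (1 + gamma r) exp(-gamma r) between 1-D arrays."""
    r = np.abs(x[:, None] - y[None, :])
    return (1.0 + gamma * r) * np.exp(-gamma * r)


gamma = 5.0
Xn = np.array([0.1, 0.45, 0.8])     # hypothetical learning design
N = 4000
t = (np.arange(N) + 0.5) / N        # midpoint quadrature nodes on [0, 1]
Kn_inv = np.linalg.inv(k32(Xn, Xn, gamma))
Kt = k32(Xn, t, gamma)              # k_n(t) at every node, shape (n, N)
Omega = Kt @ Kt.T / N               # quadrature approximation of Omega_bar


def potential_formula(x):
    """Four-term expression of the potential, with quadrature integrals."""
    x = np.atleast_1d(float(x))
    kx = k32(Xn, x, gamma)[:, 0]
    Kxt = k32(x, t, gamma)[0]
    a = Kn_inv @ kx
    omega = Kt @ Kxt / N            # omega_bar(x)
    return (2.0 * np.mean(Kxt**2) - 4.0 * a @ omega + 2.0 * a @ Omega @ a
            + (1.0 - kx @ a) * (1.0 - np.trace(Kn_inv @ Omega)))


def potential_direct(x):
    """Direct integration of the assumed form of Kbar_{|n}(x, .)."""
    x = np.atleast_1d(float(x))
    kx = k32(Xn, x, gamma)[:, 0]
    a = Kn_inv @ kx
    post_xt = k32(x, t, gamma)[0] - a @ Kt                 # K_{|n}(x, t)
    post_tt = 1.0 - np.sum(Kt * (Kn_inv @ Kt), axis=0)     # K_{|n}(t, t)
    return np.mean(2.0 * post_xt**2 + (1.0 - kx @ a) * post_tt)


gap = max(abs(potential_formula(x) - potential_direct(x))
          for x in (0.05, 0.3, 0.62, 0.95))
```

Since both routines use the same quadrature nodes, they agree up to rounding, which isolates the algebra of the decomposition from the quadrature error.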

B.2 The Matérn 3/2 Case

When ${K_{i}}(x,{x^{\prime }})={K_{\text{Mat\'ern}}^{3/2}}(|x-{x^{\prime }}|)$ given by (C.1) with $\theta =\gamma /\sqrt{3}$, we have [36]

\[ {P_{{K_{i}},{\mu _{1}}}}(x)={S_{\gamma }}(x)+{S_{\gamma }}(1-x)\hspace{0.1667em},\]

with ${S_{\gamma }}(x)=\frac{1}{\gamma }\hspace{0.1667em}[2-(2+\gamma x){\mathsf{e}^{-\gamma x}}]$, $x\in [0,1]$. Straightforward but lengthy calculation gives

\[ {P_{{K_{i}^{2}},{\mu _{1}}}}(x)={T_{\gamma }}(x)+{T_{\gamma }}(1-x)\hspace{0.1667em},\]

with ${T_{\gamma }}(x)=\frac{1}{4\hspace{0.1667em}\gamma }\hspace{0.1667em}[5-(5+6\hspace{0.1667em}\gamma x+2\hspace{0.1667em}{\gamma ^{2}}{x^{2}}){\mathsf{e}^{-2\hspace{0.1667em}\gamma x}}]$, $x\in [0,1]$. Also, the expressions ${\beta _{{K_{i}}}}(u,v)={B_{\gamma }}(u,v)-{C_{\gamma }}(u,v)-{C_{\gamma }}(1-u,1-v)$, $u,v\in [0,1]$, with

\[\begin{array}{r@{\hskip10.0pt}c@{\hskip10.0pt}l}\displaystyle {B_{\gamma }}(u,v)& \displaystyle =& \displaystyle \frac{{\mathsf{e}^{-\gamma |u-v|}}}{6\hspace{0.1667em}\gamma }\hspace{0.1667em}\left[15\hspace{0.1667em}(1+\gamma |u-v|)+6\hspace{0.1667em}{\gamma ^{2}}|u-v{|^{2}}\right.\\ {} & & \displaystyle \left.+{\gamma ^{3}}|u-v{|^{3}}\right]\hspace{0.1667em},\\ {} \displaystyle {C_{\gamma }}(u,v)& \displaystyle =& \displaystyle \frac{{\mathsf{e}^{-\gamma (u+v)}}}{4\hspace{0.1667em}\gamma }\hspace{0.1667em}\left[5+3\hspace{0.1667em}\gamma (u+v)+2\hspace{0.1667em}{\gamma ^{2}}uv\right]\hspace{0.1667em},\end{array}\]

allow ${P_{{\overline{K}_{|n}},\mu }}(\mathbf{x})$ to be calculated explicitly.
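The closed-form expressions of this section are easy to mistype, so a numerical cross-check is useful. The following sketch implements ${S_{\gamma }}$, ${T_{\gamma }}$, ${B_{\gamma }}$ and ${C_{\gamma }}$ as written above and compares the resulting ${P_{{K_{i}},{\mu _{1}}}}$, ${P_{{K_{i}^{2}},{\mu _{1}}}}$ and ${\beta _{{K_{i}}}}$ with a midpoint-rule integration over $[0,1]$. The Python function names, the value of γ and the evaluation points are arbitrary choices made for the illustration.

```python
import numpy as np


def S(x, g):
    return (2.0 - (2.0 + g * x) * np.exp(-g * x)) / g


def T(x, g):
    return (5.0 - (5.0 + 6.0 * g * x + 2.0 * (g * x) ** 2)
            * np.exp(-2.0 * g * x)) / (4.0 * g)


def B(u, v, g):
    h = abs(u - v)
    return np.exp(-g * h) / (6.0 * g) * (
        15.0 * (1.0 + g * h) + 6.0 * (g * h) ** 2 + (g * h) ** 3)


def C(u, v, g):
    return np.exp(-g * (u + v)) / (4.0 * g) * (
        5.0 + 3.0 * g * (u + v) + 2.0 * g**2 * u * v)


def P_K(x, g):
    return S(x, g) + S(1.0 - x, g)


def P_K2(x, g):
    return T(x, g) + T(1.0 - x, g)


def beta(u, v, g):
    return B(u, v, g) - C(u, v, g) - C(1.0 - u, 1.0 - v, g)


def k32(x, t, g):
    """Matérn 3/2 kernel with gamma = sqrt(3) * theta."""
    r = np.abs(x - t)
    return (1.0 + g * r) * np.exp(-g * r)


# Midpoint-rule reference values on [0, 1].
N = 20000
t = (np.arange(N) + 0.5) / N
g, x, u, v = 5.0, 0.3, 0.2, 0.7
err_P = abs(P_K(x, g) - np.mean(k32(x, t, g)))
err_P2 = abs(P_K2(x, g) - np.mean(k32(x, t, g) ** 2))
err_beta = abs(beta(u, v, g) - np.mean(k32(u, t, g) * k32(v, t, g)))
```

In this decomposition ${B_{\gamma }}(u,v)$ is the integral of the kernel product over the whole real line, and the two ${C_{\gamma }}$ terms remove the contributions of $(-\infty ,0)$ and $(1,\infty )$.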

Appendix C Details on Numerical Experiments

C.1 GP Models

Let ${\{f(\mathbf{x})\}_{\mathbf{x}\in \mathcal{X}}}$, $f(\mathbf{x})\in \mathbb{R}$ be a real-valued stochastic process defined over the compact index set $\mathcal{X}\subset {\mathbb{R}^{d}}$. ${\{f(\mathbf{x})\}_{\mathbf{x}\in \mathcal{X}}}$ is a Gaussian process with mean function $\mu (\cdot )$ and covariance kernel $K(\cdot ,\cdot )$, denoted ${\{f(\mathbf{x})\}_{\mathbf{x}\in \mathcal{X}}}\sim \mathcal{GP}(\mu (\cdot ),K(\cdot ,\cdot ))$, if for any finite $n\in \mathbb{N}$, and any $\mathbf{X}=\{{\mathbf{x}_{1}},\dots ,{\mathbf{x}_{n}}\}\subset \mathcal{X}$, the collection of random variables $\{f({\mathbf{x}_{i}}),i=1,\dots ,n\}$ is an n-dimensional normal random vector, i.e.,

\[ \{f({\mathbf{x}_{i}}),i=1,\dots ,n\}\sim \mathcal{N}\left({\mu _{\mathbf{X}}},{\mathbf{K}_{\mathbf{X}}}\right)\hspace{0.5em},\]

where ${\mu _{\mathbf{X}}}\in {\mathbb{R}^{n}}$ has i-th component ${\left[{\mu _{\mathbf{X}}}\right]_{i}}=\mu ({\mathbf{x}_{i}})$, and the $n\times n$ matrix ${\mathbf{K}_{\mathbf{X}}}$ has generic $(i,j)$ element ${\left[{\mathbf{K}_{\mathbf{X}}}\right]_{(i,j)}}=K({\mathbf{x}_{i}},{\mathbf{x}_{j}})$.

All Gaussian models considered in the numerical experiments presented assume a zero mean, i.e., $\mu (\cdot )\equiv 0$, and are defined over $\mathcal{X}={\left[0,1\right]^{d}}$. Besides, only stationary and isotropic processes are considered, i.e., all covariance kernels K satisfy $K(\mathbf{x},{\mathbf{x}^{\prime }})=\Psi (\mathbf{x}-{\mathbf{x}^{\prime }})=\psi \left(\left\| \mathbf{x}-{\mathbf{x}^{\prime }}\right\| \right)$.

The experiments presented resort to several parametric families for the kernel K, namely, the Cauchy kernel ${K_{\text{Cauchy}}}$ as defined in [13], and the Matérn kernels ${K_{\text{Mat\'ern}}^{\nu }}$ with regularity parameter $\nu \in \{1/2,3/2,5/2\}$, all given below. For all kernels $\theta \in {\mathbb{R}^{+}}$ is the scale parameter, and for the Cauchy kernels $(\rho ,\gamma )$ are the long distance dependency and the shape parameters, respectively. Below, $\ell =\left\| \mathbf{x}-{\mathbf{x}^{\prime }}\right\| $.

(C.1)

\[\begin{aligned}{}{\psi _{\text{Cauchy}}}(\ell )& ={\left(1+{(\theta \hspace{0.1667em}\ell )^{\gamma }}\right)^{-\rho /\gamma }}\hspace{2.5pt},\\ {} {\psi _{\text{Mat\'ern}}^{1/2}}(\ell )& ={e^{-\theta \hspace{0.1667em}\ell }}\hspace{2.5pt},\\ {} {\psi _{\text{Mat\'ern}}^{3/2}}(\ell )& =\left(1+\sqrt{3}\theta \hspace{0.1667em}\ell \right)\hspace{0.1667em}{e^{-\sqrt{3}\theta \hspace{0.1667em}\ell }}\hspace{0.1667em},\end{aligned}\]

(C.2)

\[\begin{aligned}{}{\psi _{\text{Mat\'ern}}^{5/2}}(\ell )& =\left(1+\sqrt{5}\theta \hspace{0.1667em}\ell +\frac{5}{3}{(\theta \hspace{0.1667em}\ell )^{2}}\right)\hspace{0.1667em}{e^{-\sqrt{5}\theta \hspace{0.1667em}\ell }}\hspace{0.1667em}.\end{aligned}\]

For the Cauchy kernel, we set $\rho =\gamma =1$, which gives a rational kernel with bandwidth determined by θ.
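For reference, the four radial functions can be written compactly as below. This is a minimal sketch with hypothetical function names; the Matérn 5/2 function uses the decaying exponential ${e^{-\sqrt{5}\theta \ell }}$ that is standard for this kernel, and the Cauchy defaults correspond to $\rho =\gamma =1$.

```python
import numpy as np


def psi_cauchy(ell, theta, rho=1.0, gamma=1.0):
    """Cauchy kernel; with rho = gamma = 1 it reduces to 1 / (1 + theta ell)."""
    return (1.0 + (theta * ell) ** gamma) ** (-rho / gamma)


def psi_matern(ell, theta, nu):
    """Matérn kernels for nu in {1/2, 3/2, 5/2}, with scale parameter theta."""
    if nu == 0.5:
        return np.exp(-theta * ell)
    if nu == 1.5:
        a = np.sqrt(3.0) * theta * ell
        return (1.0 + a) * np.exp(-a)
    if nu == 2.5:
        a = np.sqrt(5.0) * theta * ell
        return (1.0 + a + 5.0 / 3.0 * (theta * ell) ** 2) * np.exp(-a)
    raise ValueError("nu must be 0.5, 1.5 or 2.5")


ell = np.linspace(0.0, 1.0, 101)   # distances ||x - x'||
theta = 10.0                       # e.g. theta = n^{1/d} with n = 10, d = 1
curves = {nu: psi_matern(ell, theta, nu) for nu in (0.5, 1.5, 2.5)}
cauchy = psi_cauchy(ell, theta)
```

All four functions equal one at $\ell =0$ and decrease with distance, so that θ acts as an inverse correlation length in every case.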

The parameter ${\theta _{0}}$ of the simulated GP model depends on the size of the learning design of ${\mathcal{F}_{n}}$: ${\theta _{0}}={n^{1/d}}$. This guarantees the numerical stability of the KH algorithm used to define ${\mathbf{X}_{n}}$ (see below), and that the interpolator ${\eta _{{\mathcal{F}_{n}}}}$ has a moderate error level.

C.2 Sampling from a Gaussian Process

The material in Section 4 presents the average performance of the $\mathsf{ISE}$ estimators over $M=500$ simulations from an assumed GP model. These simulations are supported in a dense finite subset ${\mathcal{X}_{L}}$ of $\mathcal{X}$ of size $L={2^{12}}$; ${\mathcal{X}_{L}}$ is a uniform grid when $d=1$ and a scrambled low-discrepancy Sobol’ sequence when $d\ge 2$.

Generation of realisations from the GP model requires factorisation of the matrix ${\mathbf{K}_{{\mathcal{X}_{L}}}}$ collecting the values of kernel K over the pairs of points of ${\mathcal{X}_{L}}$:

\[ {f^{(i)}}({\mathcal{X}_{L}})={\mathbf{K}_{{\mathcal{X}_{L}}}^{1/2}}\mathbf{u}\hspace{0.1667em},\hspace{2em}\mathbf{u}\sim \mathcal{N}\left(0,{I_{L}}\right)\hspace{0.1667em}.\]

When L is very large this may lead to numerical instabilities for some parameter values, due to near singularity of K. In that case, our simulated signals are the optimal MSE estimate (under the simulated $\mathcal{GP}$) of samples obtained as above over a smaller dense subset ${\mathcal{X}_{M}}$ of $\mathcal{X}$, of size $M={10^{3}}\hspace{0.1667em}d$:

\[\begin{array}{r@{\hskip10.0pt}c@{\hskip10.0pt}l}& & \displaystyle {\{{\mathbf{u}^{(i)}}(\mathbf{x})\}_{\mathbf{x}\in {\mathcal{X}_{M}}}}\sim \mathcal{N}\left(0,{I_{M}}\right)\\ {} & & \displaystyle \hspace{71.13188pt}\longrightarrow {\{{f^{(i)}}(\mathbf{x})\}_{\mathbf{x}\in {\mathcal{X}_{M}}}}\to {\{{\hat{f}^{(i)}}(\mathbf{x})\}_{\mathbf{x}\in \mathcal{X}}}\hspace{0.1667em}.\end{array}\]

The simulated functions are thus slightly smoother than the actual realisations from the assumed GP. We believe, however, that this does not compromise the validity of our conclusions.
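The two-stage procedure described above (sampling on a coarse subset, then extending to the fine grid through the conditional mean) can be sketched as follows in dimension one. Several choices here are assumptions made for the illustration and are not taken from the paper: a Cholesky factor is used as the matrix square root (any factor $\mathbf{L}$ with $\mathbf{L}{\mathbf{L}^{\top }}=\mathbf{K}$ yields the right covariance), a small diagonal jitter is added for numerical safety, and the grid sizes are much smaller than those quoted in the text.

```python
import numpy as np


def matern32(x, y, theta):
    """Matérn 3/2 kernel between two 1-D arrays of points."""
    a = np.sqrt(3.0) * theta * np.abs(x[:, None] - y[None, :])
    return (1.0 + a) * np.exp(-a)


rng = np.random.default_rng(0)
theta = 10.0
x_coarse = np.linspace(0.0, 1.0, 50)    # stands in for the smaller subset
x_fine = np.linspace(0.0, 1.0, 197)     # stands in for the dense grid
n_samples = 3

# Stage 1: exact sampling on the coarse subset, f = L u with L L^T = K.
K = matern32(x_coarse, x_coarse, theta) + 1e-10 * np.eye(x_coarse.size)
chol = np.linalg.cholesky(K)
u = rng.standard_normal((x_coarse.size, n_samples))
f_coarse = chol @ u

# Stage 2: optimal (conditional mean) reconstruction on the fine grid.
f_fine = matern32(x_fine, x_coarse, theta) @ np.linalg.solve(K, f_coarse)
```

Every fourth point of `x_fine` coincides with a point of `x_coarse`, and the reconstruction reproduces the sampled values there. Between those points it follows the conditional mean, which is why the simulated functions are slightly smoother than true realisations.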

C.3 Learning Design ${\mathbf{X}_{n}}$, Interpolator ${\eta _{{\mathcal{F}_{n}}}}$, $\mathsf{ISE}$ Estimates

In Section 4, for each GP kernel K and design size n, ${\mathbf{X}_{n}}$ is always the space-filling design obtained by standard KH for kernel K. For each realisation ${f^{(i)}}$, its interpolator ${\eta _{{\mathcal{F}_{n}^{(i)}}}}$ is the optimal interpolator for the assumed GP model using the learning data ${\mathcal{F}_{n}^{(i)}}=({\mathbf{X}_{n}},{f^{(i)}}({\mathbf{X}_{n}}))$,

\[ {\eta _{{\mathcal{F}_{n}^{(i)}}}}(\mathbf{x})={\mathbf{k}_{n}^{\top }}(\mathbf{x}){\mathbf{K}_{n}^{-1}}{f^{(i)}}({\mathbf{X}_{n}})\hspace{2.5pt}.\]

Simulated residuals are thus ${\varepsilon ^{(i)}}(\mathbf{x})={f^{(i)}}(\mathbf{x})-{\eta _{{\mathcal{F}_{n}^{(i)}}}}(\mathbf{x})$. For a validation measure ${\zeta _{m}}=(\mathbf{w},{\mathbf{Z}_{m}})$ the MSE of the corresponding estimate $\widehat{\mathsf{ISE}}$ is approximated as

\[ \hat{\mathcal{R}}(\zeta )=\frac{1}{M}{\sum \limits_{i=1}^{M}}{\left({\widehat{\mathsf{ISE}}^{(i)}}-{\mathsf{ISE}^{(i)}}\right)^{2}}\hspace{0.1667em},\]

where

\[ {\widehat{\mathsf{ISE}}^{(i)}}={\sum \limits_{j=1}^{m}}{\mathbf{w}_{j}}{\varepsilon _{{\eta _{{\mathcal{F}_{n}^{(i)}}}}}^{2}}({\mathbf{z}_{j}}),\hspace{2em}{\mathsf{ISE}^{(i)}}=\frac{1}{L}\sum \limits_{\mathbf{t}\in {\mathcal{X}_{L}}}{({\varepsilon ^{(i)}}(\mathbf{t}))^{2}}\hspace{0.1667em}.\]
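The Monte Carlo approximation of the risk amounts to a few array operations. The sketch below assumes that the squared residuals of the M realisations have already been evaluated at the validation points and on the grid ${\mathcal{X}_{L}}$; the function name `empirical_risk` and the tiny arrays are hypothetical and serve only to make the computation explicit.

```python
import numpy as np


def empirical_risk(sq_res_val, w, sq_res_grid):
    """Monte Carlo estimate of the MSE of a weighted ISE estimator.

    sq_res_val  : (M, m) squared residuals at the m validation points
    w           : (m,)   weights of the validation measure
    sq_res_grid : (M, L) squared residuals on the dense grid X_L
    """
    ise_hat = sq_res_val @ w              # one ISE estimate per realisation
    ise = sq_res_grid.mean(axis=1)        # grid approximation of the ISE
    return np.mean((ise_hat - ise) ** 2), ise_hat, ise


# Toy data: M = 2 realisations, m = 2 validation points, L = 4 grid points.
sq_res_val = np.array([[1.0, 3.0], [2.0, 2.0]])
sq_res_grid = np.array([[1.0, 1.0, 1.0, 1.0], [4.0, 0.0, 4.0, 0.0]])
w_uniform = np.array([0.5, 0.5])

risk, ise_hat, ise = empirical_risk(sq_res_val, w_uniform, sq_res_grid)
```

With uniform weights $1/m$ the same function returns the risk of the empirical estimator ${\widehat{\mathsf{ISE}}_{un}}$, so that both estimators can be compared on identical realisations.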

Acknowledgements

The authors acknowledge the fruitful collaboration with the other partners of the ANR project INDEX, in particular Bertrand Iooss, Elias Fekhari and Joseph Muré from EDF R&D Chatou, France.

Footnotes

² We will often consider $\mathcal{X}={[0,1]^{d}}$.

³ ${\delta _{\mathbf{a}}}$ denotes the unit point-mass at $\mathbf{x}=\mathbf{a}$.

⁴ independent and identically distributed.

⁵ In Figure 1, the ${\{{f^{(i)}}\}_{i=1}^{500}}$ are realisations of a uni-dimensional GP on the unit interval with a Matérn 5/2 kernel (see (C.2)) with range parameter $\theta =n+m$, and ${\eta _{{\mathcal{F}_{n}}}^{(i)}}$ is the optimal kriging regressor for the simulated model. ${\mathbf{X}_{n}}$ is a space-filling design and ${\mathbf{Z}_{m}}$ is a space-filling continuation of ${\mathbf{X}_{n}}$. The uniform measure is approximated by a uniform grid with ${2^{12}}$ points.

⁶ However, it does not provide the optimal design for a fixed m: the construction of one-shot m-point designs minimising an MMD criterion is considered for instance in [26, 36]; we do not develop this aspect here.

⁷ The exact definition of these kernels is given in Appendix C.

References

[1]

Piston model. https://www.sfu.ca/~ssurjano/piston.html. Accessed: 2023-03-17.

[2]

Anand, M., Velu, A. and Whig, P. Prediction of loan behaviour with machine learning models for secure banking. Journal of Computer Science and Engineering (JCSE) 3(1) 1–13 (2022).

[3]

Bach, F., Lacoste-Julien, S. and Obozinski, G. On the equivalence between herding and conditional gradient algorithms. In Proc. 29th Annual International Conference on Machine Learning 1355–1362 (2012).

[4]

Bachoc, F. Cross validation and maximum likelihood estimations of hyper-parameters of Gaussian processes with model misspecification. Computational Statistics and Data Analysis 66. 55–69 (2013). https://doi.org/10.1016/j.csda.2013.03.016. MR3064023

[5]

Borovicka, T., Jirina, M. Jr., Kordik, P. and Jirina, M. Selecting representative data sets. (A. Karahoca, ed.) In Advances in Data Mining, Knowledge Discovery and Applications 43–70. INTECH (2012).

[6]

Chevalier, C., Bect, J., Ginsbourger, D., Picheny, V., Richet, Y. and Vazquez, E. Fast kriging-based stepwise uncertainty reduction with application to the identification of an excursion set. Technometrics 56. 455–465 (2014). https://doi.org/10.1080/00401706.2013.860918. MR3290615

[7]

Demay, C., Iooss, B., Le Gratiet, L. and Marrel, A. Model selection for Gaussian Process regression: an application with highlights on the model variance validation. Quality and Reliability Engineering International Journal 8. 1482–1500 (2021).

[8]

Dubrule, O. Cross validation of kriging in a unique neighborhood. Journal of the International Association for Mathematical Geology 15(6) 687–699 (1983). https://doi.org/10.1007/BF01033232. MR0720633

[9]

ENIQ. Qualification of an AI / ML NDT system – Technical basis. NUGENIA, ENIQ Technical Report (2019).

[10]

Fang, K-T., Li, R. and Sudjianto, A. Design and Modeling for Computer Experiments. Chapman & Hall/CRC (2006). MR2510302

[11]

Fedorov, V. V. Theory of Optimal Experiments. Academic Press, New York (1972). MR0403103

[12]

Fekhari, E., Iooss, B., Muré, J., Pronzato, L. and Rendas, J. Model predictivity assessment: incremental test-set selection and accuracy evaluation. (N. Salvati, C. Perna, S. Marchetti and R. Chambers, eds.) In Studies in Theoretical and Applied Statistics, SIS 2021, Pisa, Italy, June 21–25. Springer (2022). Preprint hal-03523695. https://doi.org/10.1007/978-3-031-16609-9_20. MR4606592

[13]

Gneiting, T. and Schlather, M. Stochastic models that separate fractal dimension and the Hurst effect. SIAM Review 46(2) 269–282 (2004). https://doi.org/10.1137/S0036144501394387. MR2114455

[14]

Hawkins, R., Paterson, C., Picardi, C., Jia, Y., Calinescu, R. and Habli, I. Guidance on the assurance of machine learning in autonomous systems (AMLAS). Assuring Autonomy International Programme (AAIP), University of York (2021). MR4326507

[15]

Hindman, M. Building better models: Prediction, replication, and machine learning in the social sciences. The Annals of the American Academy of Political and Social Science 659(1) 48–62 (2015).

[16]

Huszár, F. and Duvenaud, D. Optimally-weighted herding is Bayesian quadrature. In Uncertainty in Artificial Intelligence 377–385 (2012).

[17]

Iooss, B., Boussouf, L., Feuillard, V. and Marrel, A. Numerical studies of the metamodel fitting and validation processes. International Journal of Advances in Systems and Measurements 3. 11–21 (2010).

[18]

Iooss, B. Sample selection from a given dataset to validate machine learning models (2021), arXiv preprint arXiv:2104.14401.

[19]

Joseph, V. R. Space-filling designs for computer experiments: A review. Quality Engineering 28(1) 28–35 (2016). MR3528792

[20]

Kanagawa, M., Sriperumbudur, B. K. and Fukumizu, K. Convergence guarantees for kernel-based quadrature rules in misspecified settings. In Advances in Neural Information Processing Systems 3288–3296 (2016).

[21]

Karvonen, T., Kanagawa, M. and Särkkä, S. On the positivity and magnitudes of Bayesian quadrature weights. Statistics and Computing 29(6) 1317–1333 (2019). https://doi.org/10.1007/s11222-019-09901-0. MR4026673

[22]

Karvonen, T., Wynne, G., Tronarp, F., Oates, C. and Särkkä, S. Maximum likelihood estimation and uncertainty quantification for Gaussian process approximation of deterministic functions. SIAM/ASA Journal on Uncertainty Quantification 8(3) 926–958 (2020). https://doi.org/10.1137/20M1315968. MR4130422

[23]

Kleijnen, J. P. C. and Sargent, R. G. A methodology for fitting and validating metamodels in simulation. European Journal of Operational Research 120. 14–29 (2000). https://doi.org/10.1016/j.ejor.2016.06.041. MR3543078

[24]

Kupiec, P. H. On the accuracy of alternative approaches for calibrating bank stress test models. Journal of Financial Stability 38. 132–146 (2018).

[25]

Lorenzo, G., Zanocco, P., Giménez, M., Marquès, M., Iooss, B., Bolado-Lavin, R., Pierro, F., Galassi, G., D’Auria, F. and Burgazzi, L. Assessment of an isolation condenser of an integral reactor in view of uncertainties in engineering parameters. Science and Technology of Nuclear Installations 2011, Article ID 827354 (2011). https://doi.org/10.1155/2011/827354

[26]

Mak, S. and Joseph, V. R. Support points. The Annals of Statistics 46(6A) 2562–2592 (2018). https://doi.org/10.1214/17-AOS1629. MR3851748

[27]

Marrel, A., Iooss, B. and Chabridon, V. The ICSCREAM methodology: Identification of penalizing configurations in computer experiments using screening and metamodel – Applications in thermal-hydraulics. Nuclear Science and Engineering 196. 301–321 (2022).

[28]

Moon, H. Design and analysis of computer experiments for screening input variables. PhD thesis, Ohio State University, USA (2010). MR2794741

[29]

O’Hagan, A. Bayes–Hermite quadrature. Journal of Statistical Planning and Inference 29(3) 245–260 (1991). https://doi.org/10.1016/0378-3758(91)90002-V. MR1144171

[30]

Parmar, M., Haselbacher, A. and Balachandar, S. Improved drag correlation for spheres and application to shock-tube experiments. AIAA Journal 48(6) 1273–1276 (2010).

[31]

Petropoulos, A., Siakoulis, V., Stavroulakis, E. and Vlachogiannakis, N. E. Predicting bank insolvencies using machine learning techniques. International Journal of Forecasting 36(3) 1092–1113 (2020).

[32]

Pronzato, L. Minimax and maximin space-filling designs: some properties and methods for construction. Journal de la Société Française de Statistique 158(1) 7–36 (2017). MR3637639

[33]

Pronzato, L. Performance analysis of greedy algorithms for minimising a maximum mean discrepancy. Statistics and Computing 33. 14 (2023). Preprint hal-03114891. arXiv:2101.07564. https://doi.org/10.1007/s11222-022-10184-1. MR4519641

[34]

Pronzato, L. and Müller, W. G. Design of computer experiments: space filling and beyond. Statistics and Computing 22(3) 681–701 (2012). https://doi.org/10.1007/s11222-011-9242-3. MR2909615

[35]

Pronzato, L. and Rendas, M. -J. Validation design I: construction of validation designs via kernel herding (2021). Preprint hal-03474805. arXiv:2112.05583.

[36]

Pronzato, L. and Zhigljavsky, A. A. Bayesian quadrature, energy minimization and space-filling design. SIAM/ASA J. Uncertainty Quantification 8(3) 959–1011 (2020). https://doi.org/10.1137/18M1210332. MR4133484

[37]

Rasmussen, C. E. and Ghahramani, Z. Bayesian Monte Carlo. In Advances in Neural Information Processing Systems 505–512 (2003).

[38]

Sacks, J., Welch, W. J., Mitchell, T. J. and Wynn, H. P. Design and analysis of computer experiments. Statistical Science 4(4) 409–435 (1989). MR1041765

[39]

Santner, T., Williams, B. and Notz, W. The Design and Analysis of Computer Experiments. Springer (2003). https://doi.org/10.1007/978-1-4757-3799-8. MR2160708

[40]

Sejdinovic, D., Sriperumbudur, B., Gretton, A. and Fukumizu, K. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics 41(5) 2263–2291 (2013). https://doi.org/10.1214/13-AOS1140. MR3127866

[41]

Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Schölkopf, B. and Lanckriet, G. R. G. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research 11. 1517–1561 (2010). MR2645460

[42]

Stein, M. L. Interpolation of Spatial Data. Some Theory for Kriging. Springer, Heidelberg (1999). https://doi.org/10.1007/978-1-4612-1494-6. MR1697409

[43]

Szabó, Z. and Sriperumbudur, B. Characteristic and universal tensor product kernels. Journal of Machine Learning Research 18. 1–29 (2018). MR3845532

[44]

Welling, M. Herding dynamical weights to learn. In Proceedings of the 26th Annual International Conference on Machine Learning 1121–1128 (2009).

[45]

Wynn, H. P. The sequential generation of D-optimum experimental designs. Annals of Math. Stat. 41. 1655–1664 (1970). https://doi.org/10.1214/aoms/1177696809. MR0267704

[46]

Xu, Y. and Goodacre, R. On splitting training and validation set: A comparative study of cross-validation, bootstrap and systematic sampling for estimating the generalization performance of supervised learning. Journal of Analysis and Testing 2. 249–262 (2018).
