Validation of Machine Learning Prediction Models

We address the estimation of the Integrated Squared Error (ISE) of a predictor η ( x ) of an unknown function f learned using data acquired on a given design X n . We consider ISE estimators that are weighted averages of the residuals of the predictor η ( x ) on a set of selected points Z m . We show that, under a stochastic model for f , minimisation of the mean squared error of these ISE estimators is equivalent to minimisation of a Maximum Mean Discrepancy (MMD) for a non-stationary kernel that is adapted to the geometry of X n . Sequential Bayesian quadrature then yields sequences of nested validation designs that minimise, at each step of the construction, the relevant MMD. The optimal ISE estimate can be written in terms of the integral of a linear reconstruction, for the assumed model, of the square of the interpolator residuals over the domain of f . We present an extensive set of numerical experiments which demonstrate the good performance and robustness of the proposed solution. Moreover, we show that the validation designs obtained are space-filling continuations of X n , and that correct weighting of the observed interpolator residuals is more important than the precise configuration Z m of the points at which they are observed.


INTRODUCTION AND MOTIVATION
This paper proposes a methodology to estimate the quality of an interpolator learned on a given experimental design. More precisely, we suppose that data gathered on the points of an experimental design X n = {x 1 , . . . , x n } with n points in a compact set X have been used to build a predictor of the value of the function f : X → R that produced the collected samples.
We denote by y n = (f (x 1 ), . . . , f (x n )) ⊤ the vector collecting the n evaluations of f at the design points x i , by F n = (X n , y n ) the learning dataset, and by η Fn (x) the resulting prediction of f (x). The quality of η Fn is assessed through a widely used measure of the precision of interpolators, the Integrated Squared Error (ISE):

ISE(η Fn ) = ∫ X (η Fn (x) − f (x)) 2 µ(dx) . (1.1)

In the definition above the (user-defined) measure µ enables penalization of the interpolation errors over regions of X which are considered to be of particular importance. We stress that we consider that the experimental design X n, also referred to as the "learning design", is given, making no assumptions on how it has been chosen. Estimation of the integral (1.1) must necessarily resort to the evaluation of the prediction error ε(x) = η Fn (x) − f (x) over only a finite set of points Z m = {z 1 , . . . , z m } ⊂ X , which we designate by "validation design". The integral is then approximated by replacing µ by a point-mass measure ζ = ζ(w, Z m ) = Σ i w i δ zi supported on Z m only, leading to the estimator

ISE(η Fn ; ζ) = Σ i w i ε 2 (z i ) . (1.2)

We generically refer to ζ as the validation measure, using the notation ζ m to make explicit the dependency on the size of the validation set. Although ζ is not necessarily the uniform distribution supported on Z m , and with a slight abuse of terminology, we refer to the corresponding ISE estimators as empirical ISE estimators. We address the choice of the validation measure ζ, both of the validation design Z m and of the validation weights w, and investigate the properties of the resulting estimates ISE(η Fn ; ζ) given by (1.2). The algorithms presented are iterative, defining increasing sequences of nested validation designs Z m ⊂ Z m+1 ⊂ Z m+2 ⊂ · · · such that the performance of ISE(η Fn ; ζ) improves as m increases. A preliminary version of this work has been presented in [5], in the context of a comprehensive comparison of validation methodologies.
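In code, the empirical estimator of the ISE is simply a weighted sum of squared residuals; the following minimal sketch (the function name and toy residuals are ours, purely illustrative) makes the role of the weights explicit:

```python
import numpy as np

def empirical_ise(residuals, weights):
    """Empirical ISE estimator: sum_i w_i * eps(z_i)^2. The weights need
    not be uniform, sum to one, or even be positive."""
    residuals = np.asarray(residuals, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(weights @ residuals**2)

eps = np.array([0.1, -0.2, 0.05])               # hypothetical residuals
print(empirical_ise(eps, np.full(3, 1.0 / 3)))  # mean of squares, about 0.0175
```

With uniform weights 1/m this reduces to the arithmetic mean of the squared residuals; the measures constructed later use non-uniform, un-normalised weights.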
The paper is organised as follows. Section 2.1 first relates the ISE estimators (1.2) to other ISE estimators. Then, assuming that the interpolated function f is a realisation of a Gaussian process with known moments, we present in Section 2.2 a computable criterion R(ζ, F n ) that evaluates the precision of empirical estimators of the form (1.2). In Section 3 we discuss optimisation of R(ζ, F n ), detailing application of related existing algorithms to the specific conditions of the validation problem of interest here, and providing an instrumental interpretation of the corresponding "optimal" empirical ISE estimators. Since the "optimal" validation measure depends on the assumed GP model, the robustness and performance of the validation methodology presented are investigated numerically in Section 4, leading to two major conclusions. One concerns the validation weights w, stating that the contributions of the individual errors ε(z i ) to ISE(η Fn ; ζ) must be down-weighted, with respect to taking ζ as the uniform distribution over Z m , to avoid overestimation of ISE(η Fn ). The second concerns the geometry of the validation design Z m , whose optimality is seen to be much less important than the correct choice of the weights w. Based on these numerical studies we propose a default choice for the covariance kernel of the GP model used, including its scale parameter. Finally, Section 5 summarises our findings and proposes some directions for future studies.

A CRITERION FOR VALIDATION MEASURES
Since f is unknown, we can at best expect to find an ISE estimator that performs well for most functions f consistent with F n . To characterise this set of functions we adopt the Gaussian process framework, briefly recalled below, enabling us to subsequently derive a criterion to choose the validation measure ζ.
Before doing that, the next section puts our approach in perspective in relation to other (non-parametric) model validation methods.

Empirical ISE estimation
Non-parametric estimation of the ISE of a computational model learned on a dataset F n is most commonly done using F n itself. In cross-validation (CV), see e.g. [2, 3], the residuals ε cv i = y i − η Fn\(xi,yi) (x i ) at each data point (x i , y i ) of a predictor fit to all other n − 1 points of F n are computed, and the ISE is estimated by their average (1/n) Σ i (ε cv i ) 2 . The setup considered in this paper is in some sense dual to CV. On the one hand, CV requires more information about η, assuming the ability to build the n new predictors η Fn\(xi,yi) (one for each point that is "left out"), and thus assumes knowledge of how η Fn is learned, while we consider η Fn as a black-box model delivered by a third party, using an undisclosed modelling approach. On the other hand, CV requires no additional observations of f , while our approach does. Given the observations of f over a validation set Z m , a straightforward estimate of the ISE is the simple arithmetic mean of the squared values of the m residuals,

ISE un (η Fn ; Z m ) = (1/m) Σ i=1,...,m ε 2 (z i ) , (2.2)

a special case of (1.2), obtained by letting ζ be the uniform distribution over Z m : ζ = (1/m) Σ i δ zi , with δ a denoting the unit point-mass at x = a. Let p η denote the (unknown) probability density of the residuals ε(x) when x ∼ µ. For expression (2.2) to be a Monte Carlo estimate of the ISE integral, the observed ε(Z m ) = {ε i } m i=1 must be a plausible i.i.d. sample from p η , which is generally not true. Consider for instance that Z m is a space-filling continuation of X n , sampling the regions of X most distant from X n . In this situation we can anticipate that ε(Z m ) will be biased towards the upper limit of the support of p η , and thus that ISE un will over-estimate ISE. The contribution of the observed residuals to ISE must thus be adjusted, counterbalancing this poor sampling of the regions where the weakest residual values are expected. The validation measures ζ proposed in this paper automatically implement this variable residual scaling, relying on a prior stochastic model for f to infer how representative the observed ε(Z m ) can be expected to be of the errors over the entire X . It follows from the above that there is no reason for imposing that the validation measure ζ be a probability distribution, i.e., that Σ i w i = 1. Our methodology drops this common constraint, defining an un-normalised measure ζ adjusted to the geometry of Z m relative to X n . To corroborate this choice, note that when η Fn is an interpolator, so that ε(x i ) = 0 for all x i ∈ X n , incorporation of these n zero residuals in (2.2), which should lead to a better estimator of ISE(η Fn ), yields the uniformly down-weighted estimate ISE un (η Fn ; Z m ∪ X n ) = [m/(n + m)] ISE un (η Fn ; Z m ).
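The leave-one-out CV procedure recalled above can be sketched generically; the 1-nearest-neighbour predictor below is a hypothetical stand-in for η (the point of the sketch is the leave-one-out loop, not the predictor):

```python
import numpy as np

def loo_cv_ise(X, y, fit_predict):
    """Leave-one-out CV estimate of the ISE: mean of squared residuals,
    each residual being y_i minus the prediction of a model refit on the
    remaining n - 1 points and evaluated at x_i."""
    n = len(y)
    res = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i          # leave the i-th point out
        res[i] = y[i] - fit_predict(X[mask], y[mask], X[i])
    return float(np.mean(res**2))

# Hypothetical stand-in predictor: 1-nearest-neighbour interpolation.
def nn_fit_predict(Xtr, ytr, x):
    return ytr[np.argmin(np.abs(Xtr - x))]

X = np.linspace(0.0, 1.0, 6)
y = np.sin(2.0 * np.pi * X)
print(loo_cv_ise(X, y, nn_fit_predict))
```

Note that the loop refits the predictor n times, which is exactly the extra knowledge about η that the black-box setting of this paper does not assume.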

Choosing the validation measure: a GP-based criterion
The estimation error | ISE(η Fn ; ζ) − ISE(η Fn )| is not a computable criterion that we can optimise to choose ζ. A possible approach would be to consider that f belongs to some class of functions S and optimise the worst estimation performance over all f ∈ S. Here we follow an alternative and simpler route, assuming that f is a realisation of a Gaussian Process (GP), or Gaussian Random Field, and minimising a moment of the ISE estimation error under the assumed model. Assume thus that f is a sample function F x from a GP indexed by X , with known second-order characteristics: mean m(x) = E{F x } and covariance E{F x F x ′ } = σ 2 K(x, x ′ ). Kernel K is supposed to be Strictly Positive Definite (SPD), and, for the sake of simplicity, we consider that m(x) = E{F x } = 0 for all x ∈ X . Extension of the material presented below to the case of a linearly parameterised mean, with E{F x } = β ⊤ h(x) for a vector β of unknown parameters and a vector h(x) = (h 1 (x), . . . , h p (x)) ⊤ of p known functions of x, is possible via some adaptation.
Under the assumption above, ISE(η Fn ) given by (1.1) becomes a random variable, and we can use the mean squared error of the estimator (1.2),

R(ζ, F n ) = E{ [ ISE(η Fn ; ζ) − ISE(η Fn ) ] 2 } ,

as a criterion to choose the validation design. The GP assumption defines a prior distribution for f , which given F n can be updated into the posterior distribution of its values over the unobserved points, with mean η Fn (x) = k n (x) ⊤ K n −1 y n and covariance

K |n (x, x ′ ) = K(x, x ′ ) − k n (x) ⊤ K n −1 k n (x ′ ) (2.6)

for any x, x ′ in X , where k n (x) = (K(x, x 1 ), . . . , K(x, x n )) ⊤ and K n is the n × n matrix with elements K(x i , x j ). The n × n matrix K n is SPD as K is SPD (we assume that the x i in X n are pairwise distinct). Note that K |n (x i , x) = 0 for all x ∈ X and all x i ∈ X n . The Integrated Mean Squared Error (IMSE) of η Fn , IMSE(X n ) = σ 2 ∫ X K |n (x, x) µ(dx), depends only on the learning design X n . The criterion R(ζ m , F n ) is proportional to the energy E K |n (ζ m − µ) of the signed measure ζ m − µ for a kernel that is a scaled version of the second-order moment of the squared residuals: under GP f ,

E{ε 2 (x) ε 2 (x ′ )} = σ 4 [ K |n (x, x) K |n (x ′ , x ′ ) + 2 K 2 |n (x, x ′ ) ] ,

with K |n (x, x ′ ) given by (2.6). When η Fn does not interpolate y n , and under the same GP model for f , similar developments still apply. In the following, we always consider that η Fn is the optimal interpolator k n (x) ⊤ K n −1 y n , and thus that (2.6) holds. Kernels K |n present a number of features which are not shared by the most commonly used GP kernels. The assumption that η Fn is an interpolator, i.e. ε(x i ) = 0, implies that K |n (x i , x) = 0 for all x ∈ X and all x i ∈ X n . The squared-error process is thus non-stationary, with a spatial coherency structure that is strongly dictated by the geometry of X n . Adapting the validation weights w i to this correlation structure dictates the performance of ISE(η Fn ; ζ m ) in a critical manner. Yet, as the numerical studies of Section 4 show, exploiting the particular shape of K |n when choosing the validation points Z m is less critical (as long as they do not fall in the vicinity of X n ).
Finally, notice that the kernel defining the energy E K |n is PD. Indeed, by the Schur product theorem the Hadamard product K 2 |n = K |n ⊙ K |n is PD on X \ X n , which in turn implies that the second-moment kernel of the squared residuals is also PD there.
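The conditional kernel K |n of (2.6) is straightforward to form from any SPD kernel K; a minimal sketch (the Matérn 3/2 implementation, the design and the scale value are illustrative choices of ours):

```python
import numpy as np

def matern32(A, B, theta):
    """Matern 3/2 kernel between point sets A (p, d) and B (q, d)."""
    r = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return (1.0 + np.sqrt(3.0) * theta * r) * np.exp(-np.sqrt(3.0) * theta * r)

def conditional_kernel(K, Xn, theta):
    """Return K|n(x, x') = K(x, x') - k_n(x)^T K_n^{-1} k_n(x')."""
    Kn_inv = np.linalg.inv(K(Xn, Xn, theta))
    def K_cond(A, B):
        return K(A, B, theta) - K(A, Xn, theta) @ Kn_inv @ K(Xn, B, theta)
    return K_cond

Xn = np.linspace(0.0, 1.0, 5)[:, None]        # illustrative learning design
Kc = conditional_kernel(matern32, Xn, theta=5.0)
# K|n vanishes whenever one of its arguments lies in the learning design X_n:
print(np.allclose(Kc(Xn, np.array([[0.3], [0.7]])), 0.0, atol=1e-8))  # True
```

The printed check illustrates the property noted above: K |n (x i , x) = 0 whenever x i ∈ X n , which is the source of the non-stationarity exploited in the sequel.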

MINIMISATION OF E K |n (ζ m − µ)
We now address the minimisation of E K |n (ζ m − µ) with respect to the validation measure ζ m = ζ(w, Z m ). We drop the two constraints usually imposed on the weights: besides the sum-of-weights-equals-one constraint (see Section 2.1), we also do not impose w i ≥ 0. Imposing positivity would be natural if the observations were noisy independent random samples of the interpolation error, but here the ε i are noise-free and, more importantly, strongly linked by a coherency structure dictated by both the regularity characteristics of f and the quality of η Fn as an interpolator. Nonetheless, our numerical experiments show that the w i are almost always positive; see for example Figures 12 and 13.
Since for a given f and η Fn the validation residuals are deterministic, repeating validation points or choosing z i ∈ X n brings no additional information. We thus restrict Z m to configurations of m distinct points in X \ X n . The minimisation of E K |n (ζ m − µ) with respect to the parameters of ζ m is a non-linear optimisation problem over a high-dimensional space (m(d + 1) scalar parameters when X ⊂ R d ). As briefly evoked in the introduction, rather than fixing upfront the size m of the validation design, we are interested in finding nested sequences of validation designs, generated by a sequence of identical steps, each one increasing the design size by one: Z m+1 = Z m ∪ {z m+1 }, where z m+1 is restricted to X \ (X n ∪ Z m ). Before we present in Section 3.2 the sequential Bayesian quadrature algorithm that performs this iterative construction, greedily decreasing E K |n (ζ m − µ) at each step, we present background on relevant literature on iterative energy (or, equivalently, MMD) minimisation.
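For a discrete candidate measure, the energy E K (ζ m − µ) has a closed form once µ is replaced by a uniform grid measure µ L ; a sketch with a generic Gaussian kernel standing in for the kernel of the validation problem (kernel, grid and θ value are illustrative assumptions of ours):

```python
import numpy as np

def gauss_kernel(A, B, theta=10.0):
    """Gaussian kernel on point sets A (p, d) and B (q, d)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-theta * d2)

def mmd_energy(K, Z, w, XL):
    """Energy E_K(zeta - mu_L) = w^T K(Z,Z) w - 2 w^T P + c, for the
    discrete measure zeta = sum_i w_i delta_{z_i} and the target measure
    mu approximated by the uniform measure mu_L on the grid XL."""
    P = K(Z, XL).mean(axis=1)      # potentials P_K(z_i) under mu_L
    c = K(XL, XL).mean()           # energy of mu_L itself
    return float(w @ K(Z, Z) @ w - 2.0 * w @ P + c)

XL = np.linspace(0.0, 1.0, 101)[:, None]
Z = np.array([[0.25], [0.75]])
w = np.full(2, 0.5)
print(mmd_energy(gauss_kernel, Z, w, XL) >= 0.0)  # True: energy of a signed
# measure is non-negative for an SPD kernel
```

This quadratic form in w is what makes the optimal weights of the next section available in closed form.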

Background
Kernel herding (KH) [19] can be seen to correspond to the Frank-Wolfe conditional gradient algorithm [1] applied to MMD minimisation, that is, to the vertex-direction method with predefined step-length, commonly used in optimal experimental design since the pioneering work of H.P. Wynn [20] and V.V. Fedorov [4]. It is an accretive method, generating a sequence z 1 , z 2 , . . . which can be incrementally grown to any target size m.
In Bayesian quadrature (BQ) [11, 15] the goal is to choose samples that best approximate an integral by exploiting the assumption that the integrated function is the realisation of a GP. Sequential BQ (SBQ) sequentially expands the set of sampled points by adding a new sample at the point that decreases the variance of the integral estimate the most. This variance is shown to be the MMD between the target integral measure and the discrete measure that implements the quadrature rule, for the kernel of the assumed GP model.
KH and SBQ are closely related, see e.g. [8], both attempting to minimise the same MMD. The two techniques embed the problem under consideration in the RKHS of a positive definite kernel that is chosen to reflect the characteristics of the underlying data distribution (in the original formulation of KH) or of the integrated functions (in SBQ). As stressed in [8], a major distinction between the two techniques concerns the weights assigned to each sample, which are uniform for standard KH, while they are optimally selected in SBQ. The two methods differ both in complexity and performance: SBQ is superior to standard (uniform-weight) KH, this improvement coming at the cost of an increased complexity, O(n) for KH and O(n 2 ) for SBQ when constructing an n-point design; see [12].
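A standard uniform-weight herding loop can be sketched in a few lines; the greedy score below follows the usual KH update, with our own illustrative kernel, grid and parameter values rather than the paper's exact configuration:

```python
import numpy as np

def gauss_kernel(A, B, theta=30.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-theta * d2)

def kernel_herding(K, XL, m):
    """Standard (uniform-weight) kernel herding on a candidate grid XL:
    step t+1 adds the maximiser of P_K(x) - (1/(t+1)) sum_j K(x, z_j)."""
    P = K(XL, XL).mean(axis=1)              # potentials of the uniform mu_L
    chosen = []
    for t in range(m):
        S = K(XL, XL[chosen]).sum(axis=1) if chosen else np.zeros(len(XL))
        score = P - S / (t + 1)
        score[chosen] = -np.inf             # keep the points distinct
        chosen.append(int(np.argmax(score)))
    return chosen

XL = np.linspace(0.0, 1.0, 201)[:, None]
idx = kernel_herding(gauss_kernel, XL, 5)
print(np.sort(XL[idx, 0].ravel()))          # herding spreads points over [0, 1]
```

The uniform 1/(t + 1) weighting inside the score is exactly what SBQ replaces by optimally re-computed weights.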
Experiments combining the two methodologies, by using the optimal BQ weights for a design found by standard KH, show that correct weighting is more critical than sample placement [8], affecting in particular the algorithm's convergence rate: KH has performance similar to SBQ for small design sizes, but displays worse performance as design size grows.
The validation setup of this paper coincides with the framework assumed by BQ, our final goal being to estimate an integral from a small number of samples, and we also resort to a GP assumption. As in BQ, the weights of our empirical estimator do not need to sum to 1 and are not necessarily positive, and the optimal solution minimises an MMD. Placing the GP assumption not directly on the function we wish to integrate, in our case ε 2 (x), but on the interpolated f , leads to the identification of the pertinent MMD kernel under our validation framework as a non-stationary kernel built from K |n , whose structure encodes the geometry of the learning design X n .
Both KH and BQ assume that the RKHS kernel is characteristic, meaning that the corresponding MMD between two probability measures is zero if and only if these two measures coincide. Kernel K |n is not characteristic; in particular it cannot differentiate between measures that differ only over the finite set X n , where K |n is zero. However, as stressed before, since we know that ε(x) = 0 for x ∈ X n , the target measures over which we optimise E K |n (ζ m − µ) all put zero mass on X n , and thus it still makes sense to minimise it.

Greedy optimisation of E K |n (ζ m − µ)
In this section we briefly present the SBQ method, reinterpreting it in the validation setup of interest to us.
By noting that E K |n (ζ m − µ) is a quadratic function of the weights, the weights w(Z m ) that minimise it for a given Z m can be seen to be given by

w(Z m ) = K |n (Z m , Z m ) −1 P K |n (Z m ) , (3.2)

where the m × m matrix K |n (Z m , Z m ) has generic element K |n (z i , z j ) and the i-th entry of the m-dimensional column vector P K |n (Z m ) is the potential of µ associated with kernel K |n at validation point z i :

P K |n (z i ) = ∫ X K |n (z i , x) µ(dx) . (3.6)

Note that the weights w i (Z m ), and thus the estimator ISE(η Fn , Z m ) itself, are independent of σ 2 . The estimators ε 2 Fn (x|Z m ) and ISE(η Fn , Z m ) rely on the assumed GP model GP f for f , but as explained in [17, Sect. 3.2], model misspecification has a much smaller effect on linear predictions than on evaluation of the MSE. One important strength of the approach is thus that our estimator of ISE(η Fn ) does not require estimation of the MSE associated with a prediction. As shown in Appendix A, this is no longer the case when one attempts to remove the bias of ISE(η Fn , Z m ), which leads to estimators that are much less robust to model misspecification.
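The optimal weights amount to solving one m × m linear system; a hedged sketch, with a generic stationary kernel in place of the paper's conditional kernel and the potentials approximated on a uniform grid (design, kernel and residual values are illustrative):

```python
import numpy as np

def gauss_kernel(A, B, theta=20.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-theta * d2)

def bq_weights(K, Z, XL):
    """Optimal quadrature weights, w = K(Z,Z)^{-1} P_K(Z), with the
    potentials P_K(z_i) approximated under the uniform measure on XL."""
    P = K(Z, XL).mean(axis=1)
    return np.linalg.solve(K(Z, Z), P)

XL = np.linspace(0.0, 1.0, 201)[:, None]
Z = np.array([[0.2], [0.5], [0.8]])          # illustrative validation design
w = bq_weights(gauss_kernel, Z, XL)
eps = np.array([0.3, -0.1, 0.2])             # hypothetical observed residuals
ise_hat = float(w @ eps**2)                  # weighted empirical ISE estimate
print(w.sum())                               # the weights need not sum to one
```

Note that the system depends only on kernel evaluations, never on σ 2 , which mirrors the scale-independence of the weights stated above.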
For a given ζ m , let E m (x) denote the energy obtained when a candidate point x is appended to Z m and the weights are re-optimised through (3.2); E m (x) equals E K |n (ζ m − µ) minus a non-negative decrease term. The next validation point is thus a maximiser of the second term in E m (x), which can equivalently be written as the ratio of a squared posterior covariance to the MSE of ε 2 Fn (x|Z m ). The numerator measures how much the error of ISE(η Fn ; ζ m ) and the error of ε 2 Fn (x|Z m ) as an estimate of ε 2 (x) are statistically associated. Points where this term is large are good candidates to extend the current design. The denominator penalises points x where ε 2 (x) is estimated with a large MSE, tending to keep z m+1 away from the boundaries of X (where the uncertainty is in general large), as the numerical studies presented later will show.
The recursive extension of the validation measure is initiated with the empty design, Z 0 = ∅. In practice, a finite set X L ⊂ X , for instance the L first elements of a low-discrepancy sequence in X , or a regular grid in X if d is not too large, is substituted for X in (3.7) and (3.8). The determination of z m+1 , m ≥ 0, then requires the evaluation of P K |n (x) for all x ∈ X L \ X n . This calculation is done once and for all, at the initialisation of the algorithm. In the numerical examples of Section 4, P K |n = P K |n ,µ is replaced by P K |n ,µ L , with µ L the uniform (discrete) measure on X L ; see Appendix C for details. When K is a tensor-product kernel and µ is uniform on X = [0, 1] d , P K |n (x) can often be calculated explicitly; see Appendix B.
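The greedy construction over a candidate grid can be sketched from the fact that, with optimal weights, the residual energy for a design Z is c − P(Z) ⊤ K(Z, Z) −1 P(Z), with c independent of Z. The code below is a naive implementation under our own illustrative assumptions (generic Gaussian kernel in place of the paper's conditional kernel, uniform grid measure, a small jitter for numerical stability):

```python
import numpy as np

def gauss_kernel(A, B, theta=20.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-theta * d2)

def sbq_design(K, XL, m, forbidden=frozenset()):
    """Greedy SBQ on the candidate grid XL: each step adds the candidate
    maximising P(Z)^T K(Z,Z)^{-1} P(Z) over Z = current design + one point,
    i.e. minimising the residual energy attained with optimal weights."""
    P_all = K(XL, XL).mean(axis=1)           # potentials, computed once
    idx = []
    for _ in range(m):
        best, best_val = -1, -np.inf
        for j in range(len(XL)):
            if j in idx or j in forbidden:   # keep points distinct
                continue
            Z = idx + [j]
            KZ = K(XL[Z], XL[Z]) + 1e-10 * np.eye(len(Z))  # jitter
            val = P_all[Z] @ np.linalg.solve(KZ, P_all[Z])
            if val > best_val:
                best, best_val = j, val
        idx.append(best)
    return idx

XL = np.linspace(0.0, 1.0, 101)[:, None]
idx = sbq_design(gauss_kernel, XL, 4)
print(sorted(float(XL[i, 0]) for i in idx))  # a spread-out design in [0, 1]
```

The `forbidden` argument plays the role of the exclusion of X n ∪ Z m from the candidate set; precomputing the potentials once mirrors the initialisation described above.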
With the aid of a one-dimensional example we now formulate a number of comments about the expected behaviour and properties of the estimators ISE obtained by repeated application of (3.7), to extend Z m to Z m+1 , and of (3.2), fixing the weights of ζ m+1 , and thus ε 2 Fn (x|Z m+1 ), for the subsequent design extension. The red bold curve in the top panel of Figure 1 plots the squared residuals ε 2 (x) of the interpolator η Fn for the function f plotted in the bottom panel (where η Fn and f are in red and green, respectively), trained on the learning design of size 10 indicated by the red stars. The blue and green curves in the top panel are the squared residuals ε 2 Fn (x|Z m ) predicted by two distinct ζ m (m = 10), both generated using (3.7) and (3.2), but assuming distinct kernels K(x, x ′ ): Cauchy (in blue) and Matérn 3/2 (in green), with range parameters θ as indicated in the legend; the two validation designs are nearly coincident. Remark first that, as anticipated, both designs Z m have no points on the boundaries of X , even if the uncertainty affecting ε 2 (x) is large in those regions. Those familiar with optimal interpolation using monotonically decreasing stationary covariance kernels may be surprised by the fact that in intervals between learning points containing no validation points (e.g. around x ≃ 0.3) the interpolated squared residual is non-zero, i.e., ε 2 Fn (x|Z m ) > 0. This is a consequence of the particular shape of kernel K |n , strongly dictated by the geometry of X n , which takes larger values than the original K between pairs of points at large distance, as shown in Figure 2. For z 1 ≃ 0.1 ∈ Z m , that figure plots normalised versions of both the assumed (stationary) signal correlation K(z 1 − x) (in thin coloured lines) and of kernel K |n (z 1 , x) (bold lines), with the same colour code as in Figure 1.
The similarity of the two kernels K |n allows us to expect that the estimator will have some robustness with respect to the assumed GP model. The numerical studies presented in Section 4 confirm this expectation.
Above, we recognised ε 2 Fn (x|Z m ) as the MMSE linear estimator of ε 2 (x) given ε 2 (Z m ). Being agnostic with respect to the expected values of the involved random variables, the estimators ε 2 Fn (x|Z m ), and thus ISE, are biased. We investigate in Appendix A the possibility of exploiting knowledge of the first moments, namely E{ε 2 (x)} = σ 2 K |n (x, x), to replace ISE with an unbiased estimator. Unfortunately, bias correction comes at the price of losing robustness with respect to the assumed GP model for f , as we might expect given the explicit dependency of these expected values on σ 2 . Thus, the unbiased estimators in Appendix A cannot be considered as practical alternatives to ISE, and we will not consider them in the numerical study of Section 4.

NUMERICAL EXPERIMENTS
Section 4.1 presents numerical studies that demonstrate the robustness of ISE with respect to the assumed GP model, with ζ m found by SBQ. Section 4.2 confirms the importance of using K |n to define the energy minimised by SBQ. We then study, in Section 4.3, the impact of using KH, which has slightly smaller computational complexity, rather than SBQ, to find the validation support of ζ m , concluding that it leads to worse performance and is subject to numerical instability. Finally, Section 4.4 illustrates via some examples the properties of the validation measures, in particular their space-filling properties and the fact that they down-weight the observed squared residuals. In all examples X = [0, 1] d , with d = 1, 2 or 3. Use of larger values of d leads to similar conclusions, see [13].
Our analysis resorts to simulations from several (zero-mean) GP models, and the MSE of the ISE estimates is approximated by averaging the squared errors of ISE (i) over M = 500 realisations {f (i) } M i=1 of the assumed GP model. We reserve the notation Q(•, •; θ 0 ) for the kernel of the GP model from which f is sampled, θ 0 being thus "the true" scale parameter. The scale parameter is adapted to the size of the learning design, θ 0 = n 1/d , such that good interpolation performance over X can be attained with n points. Designs X n are always space-filling, and η Fn is the optimal Bayesian interpolator for the simulated GP model. See Appendix C for details.
K(•, •; θ) denotes the kernel of the GP model assumed by the design algorithm that produces ζ m , with θ its scale parameter. In all numerical examples we consider σ 2 = 1. The influence of θ is studied for θ ∈ [n 1/d /4, max(n 1/d , 2(n + m) 1/d )], an interval that always contains θ c = (n + m) 1/d (as well as θ 0 = n 1/d ). All plots consider the normalisation θ/θ c , such that θ c corresponds to the value 1 in the plots shown. In all plots of this section the special symbols in the plotted curves indicate that the design algorithm uses θ = θ 0 , the scale parameter of the simulated GP model.

Robustness with respect to assumed GP model
We address robustness by studying how much the MSE of ISE is affected by model mismatch. Figure 3 plots empirical estimates of R(ζ m , F n ). Kernel Q is the Matérn 3/2 kernel, X n has n = 10 points and d = 1, θ 0 = n. The panels correspond to different values of the regularity parameter, ν = 1/2, 3/2, 5/2, from left to right, of the Matérn kernel K.
The three curves in each plot correspond to different sizes of Z m , m ∈ {5, 10, 20} (in blue, red and brown, respectively), plotting R as a function of θ. The black stars indicate θ = θ 0 . Comparison of the three panels confirms the anticipated robustness of the estimator. When K has higher regularity than Q, as in the rightmost panel (ν = 5/2), the curves are almost identical to those of the central panel, where the correct model is used. However, the assumption of a less regular model, as in the leftmost panel, may significantly degrade performance. The estimators are reasonably robust with respect to the precise choice of the scale parameter if values θ ≃ θ c are used.
Figure 4 reproduces the same study for simulations from a process with a Cauchy kernel and larger Z m : m ∈ {10, 20, 30} (left to right). As in the previous figure, K is the Matérn kernel and the three panels correspond to different smoothness parameters ν ∈ {1/2, 3/2, 5/2}. Here the simulated model has a weaker regularity than the models assumed, and a noticeable performance degradation is now observed for the smaller designs and the more regular Matérn kernel with ν = 5/2. Similar results were obtained when simulating from other models and for higher values of d.
Finally, Figure 5 shows, for the same validation designs Z m as in Figure 3, the MSE of ISE un given by equation (2.2), estimated over 500 realisations of a GP with the same Matérn 3/2 model. We can see that proper residual weighting leads to a significant decrease of the estimation error, which is nearly one order of magnitude larger in Figure 5 than for the optimal BQ designs of Figure 3.
The experiments in this section suggest a rule of thumb to choose the kernel used by the design algorithm: K should model functions with a reasonably large degree of smoothness (Matérn 3/2 was found to be a good compromise), with a scale parameter θ dependent on the sizes of the learning and validation sets. For the Matérn family used in our experiments a good choice is θ ≃ (n + m) 1/d , automatically adjusting to the actual total number of residual samples.

Impact of K |n
Our main novel contribution is the identification of the non-stationary conditional kernel K |n as the building block of the MMD that the validation measure ζ m , both its weights and its support, must minimise. One may question the importance of using K |n to find Z m , instead of directly using kernel K. We now compare the performance of the empirical estimator ISE with Z m determined by SBQ for kernel K |n , as in Section 3.2, denoted from now on by ζ BQ⋆ , with that obtained with a validation measure ζ BQ K whose support Z m is incrementally found by SBQ for kernel K, i.e., the continuation of X n that is optimal to integrate the function f . Independently of how Z m was found, the validation measures ζ m used by the estimators ISE always have the optimal weights given by (3.2).
Figures 6 and following, for different values of d, show that the two estimators display similar performance and robustness with respect to mis-modelling. When m is small, ζ BQ⋆ often yields a smaller MSE (see the top curves), but the red and black curves are almost coincident for the larger values of m. These results, which are representative of those obtained for other choices of Q and d, indicate that correct residual weighting is more important than the detailed placement of the validation points Z m .
Note that, in the configurations tested, the default rule of thumb for the choice of K and θ presented in Section 4.1 indeed leads to good and stable performance.

Comparison with Kernel Herding
Considering only validation measures ζ with uniform weights (1/m), standard KH also minimises an MMD, incrementally extending Z m with the point that maximises the numerator of the second term of the BQ criterion; see equation (3.7). Since KH has smaller complexity than SBQ, and the results of the previous section suggest that the optimal choice of Z m is less important than the correct determination of the weights w i , we now compare ζ BQ⋆ to two other validation measures, whose designs Z m are found by extending X n by KH: ζ KH K , which performs KH for kernel K, and ζ KH⋆ , which considers kernel K |n . As we will see, the SBQ design is a superior alternative, both in terms of performance and numerical stability, to the KH designs.
Since ζ KH K considers only, at each step, measures with uniform weights, and ζ KH⋆ does not take into account the optimal weights that will be applied when Z m is extended to Z m+1 , we can expect ζ BQ⋆ to outrank both of them, as has already been remarked in [8]. Our results confirm that ζ BQ⋆ outperforms the measures using validation designs Z m found by KH, in particular for small design sizes m and the more regular models, and appears to be more robust with respect to the choice of the GP kernel. We remark that the design found by KH for kernel K |n , i.e., the validation measure ζ KH⋆ (in blue), often leads to the poorest performance. That use of K |n may lead to worse performance than simply using K has already been noticed in [5], where only validation sets generated with KH were considered.
Our experiments reveal that the designs ζ KH⋆ can sometimes lead to very large errors. This happens when KH places design points close to X n , which are subsequently given very large weights. In fact, the implementation of standard KH for kernel K |n needs careful handling of possible repetition of design points, as already noted in [13], where an algorithm is proposed to accommodate this eventuality. As shown there, this corresponds to situations where, when adding a new point to the current design Z m , the total mass of the optimal uniform measure over Z m+1 must be decreased. Since our implementation simply imposes z m+1 ∉ (X n ∪ Z m ), a grid point very close to X n ∪ Z m is chosen in these situations, as shown below.
Figure 10 shows the designs Z m , m = 10, for Matérn kernels with θ = θ 0 and regularity parameter (top to bottom panels) ν = 1/2, 3/2 and 5/2. The vertical red lines indicate X n , and the black stars, blue circles and red squares the positions of the points of ζ BQ⋆ , ζ KH⋆ and ζ KH K , respectively. A vertical offset is used to facilitate the visualisation of each design (from top to bottom, ζ KH K , ζ KH⋆ and ζ BQ⋆ ). Remark first that the SBQ designs are always space-filling continuations of X n , presenting good stability with respect to ν, mainly moving points closer to the boundaries of X when ν increases. The other two designs place a few points in the vicinity of X n .

Properties of the design measures ζ BQ⋆
For the same set of kernels K and design sizes considered in Figure 3 (thus with d = 1), we plot in Figure 11 the sum of the design weights, S(θ) = Σ i w i (θ), as a function of the (normalised) scale parameter of K. K is always a Matérn kernel, with regularity parameter ν = 1/2, 3/2, 5/2 (top to bottom), as indicated in the legends. The learning design X n (n = 10) is the same in all cases.
Three values of m are considered, m = 10, 20 and 30 (blue, red and cyan curves, respectively). In each curve the black squares indicate the value θ 0 = n 1/d . We can see that S(θ) increases with m. For θ larger than a certain value, S becomes nearly constant and smaller than 1 (note that θ c , the value of the scale parameter prescribed by our rule of thumb, which corresponds to the normalised value of θ equal to one, is always inside this interval), while for θ = n 1/d , under the more regular model with a Matérn 5/2 kernel, S may be larger than 1. The validation points are placed in the interior of the wider holes of X n . For the Matérn 5/2 kernel and θ = n 1/d a few weights, corresponding to validation points close to X n , become very large, see Figure 12, explaining that S(θ) may be larger than one in the bottom panel of Figure 11. Analysis of the validation measures obtained assuming the larger value of θ in Figure 14 shows that, as the assumed correlation length increases, ζ BQ⋆ tends to a uniform measure, all weights having a similar value. We note that even in this situation, use of the BQ measure, which down-weights the squared residuals, leads to a smaller error than use of the simple uniform measure over Z m , as the comparison of Figures 3 and 5 in Section 4.1 has shown.

CONCLUSIONS
The paper presents an estimator for the ISE of an interpolator based on knowledge of the design on which it has been learned, defined as the ISE for a finitely supported validation measure. The estimator proposed is the optimal linear MSE estimator under the assumption that the interpolated function is a realisation of a Gaussian process with known statistical moments. The support and weights of the validation measure are found by minimising an MMD for a non-stationary kernel that is adapted to the learning design, and a nested sequence of validation designs is greedily determined by SBQ. A default rule is proposed to select the covariance kernel of the assumed model.
The interpretation of the ISE estimator in terms of an interpolation of the squared residuals reveals the utmost importance of accounting for the correct shape of their second-order moment. Moreover, it explains the observed robustness of the estimator with respect to the covariance of the assumed GP model.
The work presented suggests several directions for future developments. One concerns the determination of indicators of the quality of the ISE estimate itself, ideally given by the risk function that is optimised. These could be used both to define stopping rules, indicating that incorporation of further residual observations should not yield a significant improvement in the confidence of the current ISE estimate, and to flag poor performance of the current interpolator, triggering its update by including some of the residuals observed over Z_m in the learning dataset F_n. A major difficulty is related to the dependency of the MSE of the interpolator on the assumed process covariance, which is known to be difficult to estimate. A possible source of suboptimality of the estimator presented concerns the restriction to a linear estimator. The extension to more general estimators, while preserving the robustness property of the method, forms a challenging objective. Finally, we believe that the analysis presented here suggests possible approaches to define (down-)weighted CV estimators with better performance than standard ones.

While a much larger bias is observed for the exponential (ν = 1/2) model in the top row, the curves in the bottom panels are similar to those in Figure 16, indicating that the estimator can accommodate a model that assumes a higher regularity. The robustness of BQ with respect to models assuming a higher regularity than the true one has been previously noted in [9]. Finally, remark that ISE_biased has a remarkably stable behaviour, and that its bias is often the smallest amongst all three estimators. Unless high confidence can be given to the assumed GP model, including its scale parameter, the lack of robustness of the unbiased estimators prevents their use. For small design sizes, where bias correction could indeed be important, guaranteeing the fidelity of the assumed model is in general impossible, severely limiting the practical interest of the unbiased estimators discussed here.

Factorisation in the general case
A key difficulty for the algorithmic construction of a validation design by SBQ (see Section 3.2) or KH (see Section 4.3) is the calculation of P_{K|n}(x) = P_{K|n,μ}(x) for many x in order to choose z_{m+1}. However, when K is a tensor-product kernel and μ is a product measure, P_{K|n,μ} can be calculated explicitly.
One may refer to [18] for connections between the positive-definiteness properties of the K_i and those of K. The expression of P_{K_i,μ_1}(·) is available for many kernels K_i; see [14] and the references therein.
Before deriving the expression of P_{K|n,μ}(x) we introduce some notation. Denote by Ω_{K,n} the n × n matrix with respective elements

[Ω_{K,n}]_{j,k} = ∏_{i=1}^d β_{K_i}(x_{j,i}, x_{k,i}),

and by ω_{K,n}(x) the vector with j-th component

[ω_{K,n}(x)]_j = ∏_{i=1}^d β_{K_i}(x_i, x_{j,i}),

where x_{j,i} (respectively, x_{k,i}) is the i-th component of x_j (respectively, x_k), and

β_{K_i}(r, s) = ∫_X K_i(r, t) K_i(s, t) μ_1(dt), i = 1, ..., d.
Then, using (2.6), direct calculation gives P_{K|n,μ}(x) in terms of Ω_{K,n} and ω_{K,n}(x). The expressions of P_{K²,μ_1}(x) and β_K(u, v), x, u, v ∈ [0, 1], for μ_1 uniform on [0, 1] and K_i(x, x′) a Matérn 3/2 kernel (C.1), are given in Section B.2, making the expression of P_{K|n,μ}(x) available in closed form when K(x, x′) is the product of uni-dimensional Matérn 3/2 kernels and μ is uniform on X = [0, 1]^d. Similar calculations can be conducted for other kernels. The expression of E_{K|n}(μ), which appears in the expansion of R(ζ_m, X_n), see (2.5), can be obtained in closed form in a similar way; see [13].
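The factorisation underlying these formulas can be checked numerically. The sketch below (helper names are hypothetical, μ uniform on [0, 1]^d, Matérn 3/2 factors in the standard parameterisation) computes the potential P_{K,μ}(x) = ∫ K(x, t) μ(dt) coordinate-wise as a product of one-dimensional integrals, and compares it with a brute-force Monte Carlo estimate in d = 2:

```python
import numpy as np

def matern32(r, theta):
    """Matérn 3/2 correlation as a function of distance r (assumed form)."""
    a = np.sqrt(3.0) * np.abs(r) / theta
    return (1.0 + a) * np.exp(-a)

def potential_1d(xi, theta, n_quad=2000):
    """P_{K_i,mu_1}(x_i) = \int_0^1 K_i(x_i, t) dt by midpoint quadrature."""
    t = (np.arange(n_quad) + 0.5) / n_quad
    return matern32(xi - t, theta).mean()

def potential_product(x, theta):
    """With K(x, x') = prod_i K_i(x_i, x'_i) and mu the uniform product measure
    on [0, 1]^d, the potential factorises: P_{K,mu}(x) = prod_i P_{K_i,mu_1}(x_i)."""
    return np.prod([potential_1d(xi, theta) for xi in x])

# Sanity check against brute-force Monte Carlo in d = 2.
rng = np.random.default_rng(0)
x, theta = np.array([0.3, 0.7]), 0.5
T = rng.random((200_000, 2))
mc = np.mean(matern32(x[0] - T[:, 0], theta) * matern32(x[1] - T[:, 1], theta))
print(potential_product(x, theta), mc)
```

The same factorisation is what makes the entries of Ω_{K,n} and ω_{K,n}(x) products of one-dimensional quantities, so only the β_{K_i} need closed forms.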

APPENDIX C. DETAILS ON NUMERICAL EXPERIMENTS

C.1 GP models
Let {f(x)}_{x∈X}, f(x) ∈ R, be a real-valued stochastic process defined over the compact d-dimensional index set X ⊂ R^d. {f(x)}_{x∈X} is a Gaussian process with mean function μ(·) and covariance kernel K(·,·), denoted {f(x)}_{x∈X} ∼ GP(μ(·), K(·,·)), if for any finite n ∈ N and any X = {x_1, ..., x_n} ⊂ X, the collection of random variables {f(x_i), i = 1, ..., n} is an n-dimensional normal random vector, i.e. {f(x_i), i = 1, ..., n} ∼ N(μ_X, K_X), where μ_X ∈ R^n has i-th component [μ_X]_i = μ(x_i), and the n × n matrix K_X has generic (i, j) element [K_X]_{i,j} = K(x_i, x_j).
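This finite-dimensional definition translates directly into the standard way of simulating a GP realisation at design points: draw from N(μ_X, K_X) via a Cholesky factor of K_X. A minimal sketch, assuming a zero mean and a Matérn 1/2 kernel in d = 1 (not the paper's simulation code):

```python
import numpy as np

def matern12_cov(X, theta):
    """Matérn 1/2 covariance matrix (assumed parameterisation exp(-|x - x'|/theta))."""
    return np.exp(-np.abs(X[:, None] - X[None, :]) / theta)

# By definition, (f(x_1), ..., f(x_n)) ~ N(mu_X, K_X); a zero-mean realisation
# is obtained as f = L g, with K_X = L L^T and g a standard normal vector.
rng = np.random.default_rng(1)
n = 50
X = np.sort(rng.random(n))
K_X = matern12_cov(X, theta=0.2) + 1e-10 * np.eye(n)  # jitter for stability
L = np.linalg.cholesky(K_X)
f = L @ rng.standard_normal(n)
print(f.shape)   # (50,)
```

Repeating the draw with independent normal vectors g yields the independent realisations over which the empirical MSE statistics of the experiments are computed.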
The experiments presented resort to several parametric families for the process kernel K, namely, the Cauchy kernel K_Cauchy as defined in [7], and the Matérn kernels K^ν_Matérn with regularity parameter ν ∈ {1/2, 3/2, 5/2}, as defined below. For all kernels θ ∈ R^+ is the scale parameter and, for the Cauchy kernel, (ρ, γ) are the long-distance dependency and shape parameters, respectively. Below, ℓ = ∥x − x′∥. For the Cauchy kernel we set ρ = γ = 1, yielding a rational kernel with bandwidth determined by θ.
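For reference, a reconstruction of the standard forms of these kernels (the paper's exact scaling convention for θ and the Cauchy parameterisation of [7] may differ slightly):

```latex
K^{1/2}_{\text{Mat\'ern}}(\ell) = e^{-\ell/\theta}, \qquad
K^{3/2}_{\text{Mat\'ern}}(\ell) = \Big(1 + \tfrac{\sqrt{3}\,\ell}{\theta}\Big) e^{-\sqrt{3}\,\ell/\theta},
\qquad
K^{5/2}_{\text{Mat\'ern}}(\ell) = \Big(1 + \tfrac{\sqrt{5}\,\ell}{\theta} + \tfrac{5\ell^2}{3\theta^2}\Big) e^{-\sqrt{5}\,\ell/\theta},
```
```latex
K_{\text{Cauchy}}(\ell) = \big(1 + (\ell/\theta)^{\gamma}\big)^{-\rho/\gamma},
\qquad \text{so that } \rho = \gamma = 1 \text{ gives } K_{\text{Cauchy}}(\ell) = (1 + \ell/\theta)^{-1}.
```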

(d = 1) and 7 (d = 3) show the empirical MSE of ISE for ζ_BQ⋆ (black lines) and ζ_BQ^K (red lines), observed when Q is the Matérn 3/2 kernel (top) and the Cauchy kernel (bottom), for a learning design of size n = 10d. From left to right, K is a Matérn kernel with ν = 1/2, 3/2 and 5/2. The size of the validation designs, m ∈ {10d, 20d, 30d}, is indicated by the line symbols (+, ⋆ and •, respectively).

Figures 8 and 9 plot, for d = 1 and d = 2, respectively, the MSE of estimators ISE that use ζ_BQ⋆, ζ_KH⋆ and ζ_KH^K. Kernels (Q and K) and design sizes m are as in the previous examples; see the figures' captions. We can see that ζ_BQ⋆ virtually always has smaller MSE than the validation

Figure 4: MSE of ISE. Statistics over 500 realisations. Q is a Cauchy kernel, K is as in Figure 3; d = 1.

Figure 5: MSE of ISE_un. Statistics over 500 realisations for the example in Figure 3.

Figures 12, 13 and 14 present the designs for three values of θ: θ = n^{1/d} (the value used in the simulations of Figure 3, and indicated by the squares in Figure 11), the value prescribed by our rule of thumb, θ = (n + m)^{1/d}, and θ = 2(n + m)^{1/d}, the upper limit considered in Figure 3. In the figures, the weights of ζ_m are shown multiplied by m, to en-