We address the estimation of the Integrated Squared Error (ISE) of a predictor

Using machine learning models in real-world applications, for instance for industrial optimisation and testing [

Model validation ideally resorts to a reserved test set, i.e. to evaluations of the modelled function on data points that have been used neither to select nor to train the machine learning model [

This paper proposes a methodology to estimate the quality of an interpolator learned on a given experimental design. More precisely, we suppose that data gathered on the points of an experimental design

We will often consider

We denote by

Estimation of the integral (

We address the choice of the validation measure

The paper is organised as follows. Section

Throughout the manuscript we frequently resort to the notion of space-filling designs, i.e., designs whose points are evenly spread over
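As an illustrative sketch of a space-filling construction (a hypothetical example, not the designs used in the paper), a Latin hypercube design splits each axis into n strata and places exactly one point per stratum, so the points are evenly spread over the unit hypercube:

```python
import numpy as np

def latin_hypercube(n, d, rng):
    """n-point Latin hypercube on [0, 1]^d: each axis is divided into n
    equal strata and every stratum contains exactly one design point."""
    offsets = rng.random((n, d))                              # position inside each stratum
    strata = np.column_stack([rng.permutation(n) for _ in range(d)])
    return (strata + offsets) / n

rng = np.random.default_rng(0)
design = latin_hypercube(20, 2, rng)    # 20 evenly spread points in [0, 1]^2
```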

Since

Before doing that, the next section puts our approach in perspective relative to other (non-parametric) model validation methods.

Non-parametric estimation of the ISE of a computational model learned on a dataset

Given the observations of

We argue below that there is no rationale for uniform weighting of the observed residuals. Let

independent and identically distributed.

sample from

This means that there is no reason to impose that
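As a minimal illustration of dropping uniform weighting (the function names and weights below are hypothetical, not the paper's estimator), a weighted average of the squared residuals generalises the plain empirical mean, letting the weights reflect how the validation points are spread over the input domain:

```python
import numpy as np

def ise_estimate(residuals, weights=None):
    """Weighted estimate of the ISE from observed residuals.

    With uniform weights this reduces to the usual empirical mean of
    the squared residuals; non-uniform weights can account for the
    placement of the validation points."""
    r2 = np.asarray(residuals, dtype=float) ** 2
    if weights is None:                      # default: uniform weighting
        weights = np.full(r2.shape, 1.0 / r2.size)
    return float(np.dot(weights, r2))

# Toy usage: three observed residuals, uniform vs. non-uniform weights.
res = [0.1, -0.2, 0.3]
uniform = ise_estimate(res)                    # plain empirical mean
weighted = ise_estimate(res, [0.5, 0.3, 0.2])  # weights need not be uniform
```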

In Figure

Histograms of the errors of estimators

The estimation error

Assume thus that the function

Under the assumption above

The GP assumption defines a prior distribution for

For any kernel

When

Kernels

Finally, notice that

We address now the minimisation of

We drop the two constraints usually imposed on weights: besides the sum-of-weights-equals-one constraint (see Section

Since for a given

Before we present in Section

Kernel Herding (KH) [

However, it does not provide the optimal design for a fixed
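A hedged sketch of how standard KH proceeds (with an assumed Gaussian kernel, and the target measure discretised as uniform over a candidate grid — both choices are purely illustrative): each step adds the candidate maximising the herding criterion, the kernel embedding of the target minus the running average of kernel evaluations at the points already chosen.

```python
import numpy as np

def gauss_k(X, Y, h=0.2):
    """Gaussian kernel matrix k(x, y) = exp(-|x - y|^2 / (2 h^2))."""
    d2 = (X[:, None, :] - Y[None, :, :]) ** 2
    return np.exp(-d2.sum(-1) / (2 * h**2))

def kernel_herding(cands, n):
    """Greedy KH with the uniform measure on `cands` as target: each
    step adds the candidate maximising
    mu_k(x) - (1/(t+1)) * sum_i k(x, x_i)."""
    K = gauss_k(cands, cands)
    mu_k = K.mean(axis=1)             # kernel embedding of the target
    chosen = []
    for t in range(n):
        if chosen:
            crit = mu_k - K[:, chosen].sum(axis=1) / (t + 1)
        else:
            crit = mu_k
        chosen.append(int(np.argmax(crit)))
    return chosen

cands = np.linspace(0, 1, 101)[:, None]   # 1-d candidate grid
idx = kernel_herding(cands, 5)            # first pick lands at the centre
```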

In Bayesian quadrature (BQ) [

KH and SBQ are closely related, see e.g. [

Experiments combining the two methodologies, by using the optimal BQ weights for a design found by standard KH, show that correct weighting is more critical than sample placement [

The validation setup of this paper coincides with the framework assumed by BQ, our final goal being to estimate an integral from a small number of samples, and we also resort to a GP assumption. As in BQ, the weights of our empirical estimator do not need to sum to 1 and are not necessarily positive (see [
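A minimal sketch of the optimal BQ weights under a GP prior with kernel k (the Gaussian kernel, the uniform target measure and its grid discretisation below are illustrative assumptions): the weights solve w = K⁻¹z with z_i = ∫ k(x_i, x) dμ(x), and indeed need neither be positive nor sum to one.

```python
import numpy as np

def gauss_k(X, Y, h=0.2):
    d2 = (X[:, None] - Y[None, :]) ** 2
    return np.exp(-d2 / (2 * h**2))

# Design points and a dense grid discretising the target measure mu
# (taken uniform on [0, 1] for this illustration).
design = np.array([0.1, 0.5, 0.9])
grid = np.linspace(0, 1, 2001)

K = gauss_k(design, design)
z = gauss_k(design, grid).mean(axis=1)     # z_i ~ integral of k(x_i, .) d(mu)
w = np.linalg.solve(K, z)                  # optimal BQ weights: w = K^{-1} z
```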

Both KH and BQ assume that the RKHS kernel is characteristic, meaning that the corresponding MMD between two probability measures is zero if and only if these two measures coincide. Kernel
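For reference, the squared MMD between two weighted discrete measures expands into three Gram-matrix terms; the sketch below (Gaussian kernel assumed purely for illustration) computes it directly and checks that a measure is at distance zero from itself.

```python
import numpy as np

def gauss_k(X, Y, h=0.5):
    d2 = (X[:, None] - Y[None, :]) ** 2
    return np.exp(-d2 / (2 * h**2))

def mmd2(x, wx, y, wy, k=gauss_k):
    """Squared MMD between the measures sum_i wx_i delta(x_i) and
    sum_j wy_j delta(y_j):
    wx'K(x,x)wx - 2 wx'K(x,y)wy + wy'K(y,y)wy."""
    return wx @ k(x, x) @ wx - 2 * wx @ k(x, y) @ wy + wy @ k(y, y) @ wy

x = np.array([0.2, 0.8]); wx = np.array([0.5, 0.5])
y = np.array([0.1, 0.5, 0.9]); wy = np.array([1/3, 1/3, 1/3])
d_same = mmd2(x, wx, x, wx)    # zero: a measure coincides with itself
d_diff = mmd2(x, wx, y, wy)    # strictly positive for distinct measures
```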

In this section we briefly present the SBQ method, reinterpreting it in the validation setup of interest to us.

By noting that

For a given

The numerator measures how much

The recursive extension of the validation measure is initiated with

In practice, a finite set
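The greedy selection over a finite candidate set can be sketched as follows (an illustrative implementation under assumed choices — Gaussian kernel, uniform target discretised on the candidate grid): since the squared MMD of the optimally weighted design equals the constant double integral of the kernel minus zᵀK⁻¹z, each SBQ step adds the candidate maximising zᵀK⁻¹z for the augmented design.

```python
import numpy as np

def gauss_k(X, Y, h=0.2):
    d2 = (X[:, None] - Y[None, :]) ** 2
    return np.exp(-d2 / (2 * h**2))

def sbq_design(cands, n):
    """Greedy SBQ on a finite candidate set: each step adds the
    candidate that maximises z' K^{-1} z, i.e. minimises the MMD of
    the optimally weighted design to the (uniform) target."""
    z_all = gauss_k(cands, cands).mean(axis=1)   # embeddings z_i
    chosen = []
    for _ in range(n):
        best, best_val = None, -np.inf
        for j in range(len(cands)):
            if j in chosen:
                continue
            idx = chosen + [j]
            K = gauss_k(cands[idx], cands[idx])
            z = z_all[idx]
            val = z @ np.linalg.solve(K, z)
            if val > best_val:
                best, best_val = j, val
        chosen.append(best)
    return chosen

cands = np.linspace(0, 1, 51)
pts = sbq_design(cands, 4)    # nested: each prefix is the previous design
```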

With the aid of a one-dimensional example we formulate now a number of comments about the expected behaviour and properties of the estimators

The exact definition of these kernels is given in Appendix

Top:

Remark first that, as anticipated, both designs

Above, we recognised

Section

Our analysis resorts to simulations from several (zero mean) GP models, and the MSE of the
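This kind of Monte Carlo study can be sketched as follows (every setting below — the covariance, the grid, the validation points, and the plain empirical-mean estimator whose MSE is measured — is an illustrative assumption, not the paper's configuration): repeatedly draw a zero-mean GP realisation standing in for the residual field, compare the estimated ISE with a grid-discretised reference ISE, and average the squared errors.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 30)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.2**2))
L = np.linalg.cholesky(K + 1e-8 * np.eye(len(x)))  # jittered factorisation

val_idx = [2, 9, 15, 21, 27]      # 5 validation points on the grid
sq_errs = []
for _ in range(500):
    f = L @ rng.standard_normal(len(x))    # one draw of the residual field
    ise_hat = np.mean(f[val_idx] ** 2)     # empirical-mean ISE estimate
    ise_ref = np.mean(f ** 2)              # grid-discretised reference ISE
    sq_errs.append((ise_hat - ise_ref) ** 2)
mse = float(np.mean(sq_errs))              # Monte Carlo MSE of the estimator
```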

MSE of

MSE of

We address robustness by studying how much the MSE of

The three curves in each plot correspond to different sizes of

Figure

Finally, Figure

The experiments in this section suggest a rule-of-thumb to choose the kernel

MSE of

Our main novel contribution is the identification of

Figures

Note that, in the configurations tested, the default rule-of-thumb for the choice of

MSE of

MSE of

MSE of

MSE of

Considering only validation measures

Since

Figures

Moreover, our experiments reveal that the designs

Figure

Designs for

For the same set of kernels

Three values of

Figures

In this section we study the behaviour of the proposed validation method on deterministic functions. More precisely, we consider the following multidimensional functions:

The 2-dimensional

The 7-dimensional

The functions generated by the models above cannot be well interpolated using simple kriging unless the design size is very large: they have a smooth tendency which, when not taken into account, leads to a residual signal

We compare the performance of the ISE estimator proposed in the paper, using the validation measure

The robustness of the estimator with respect to the assumed value of the range parameter of the covariance of the GP model is studied by showing three panels, corresponding to

For the drag model

Drag model. True ISE (blue), SBQ estimate (red), and

Drag model. True ISE (blue), SBQ estimate (red), and

Piston model. True ISE (blue), BQ estimate (red), and

Figure

Figure

We switch now to the higher dimensional piston model (

When

The paper presents an estimator for the ISE of an interpolator based on knowledge of the design on which it has been learned, defined as the ISE for a finitely supported validation measure. The estimator proposed is the optimal MSE linear estimator under the assumption that the interpolated function is a realisation from a Gaussian process with known statistical moments. The support and weights of the validation measure are found by minimising an MMD for a non-stationary kernel that is adapted to the learning design, and a nested sequence of validation designs is greedily determined by SBQ. A default rule is proposed to select the covariance kernel of the assumed model.

Numerical experiments, both on simulations from nominal Gaussian processes and on two real models of small dimension, confirm the superior performance of the proposed estimator when compared to the common estimate given by the simple empirical average of the observed squared residuals.

The interpretation of the ISE estimator in terms of an interpolation of the squared residuals explains the utmost importance of accounting for the correct shape of their second-order moment. Moreover, it accounts for the observed robustness of the estimator with respect to the covariance of the assumed GP model.

The work presented suggests several directions for future developments. One concerns the determination of indicators of the quality of the ISE estimate itself, ideally given by the risk function that is optimised. These could be used either to define stopping rules, indicating that the incorporation of further residual observations should not yield a significant improvement in the confidence of the current ISE estimate, or to flag poor performance of the current interpolator and trigger its update, incorporating some of the residuals observed over

Under the assumed GP model for

By noting that

Simply subtracting

Bias of

Alternatively, a linear (instead of affine) unbiased solution can be found by using weights

Denote by

As (

We performed

Let

The correct

In Figure

Unless high confidence can be placed in the assumed GP model, including its scale parameter, the lack of robustness of the unbiased estimators precludes their use. For small design sizes, where bias correction could indeed be important, guaranteeing the fidelity of the assumed model is in general impossible, severely limiting the practical interest of the unbiased estimators discussed here.

A key difficulty for the algorithmic construction of a validation design by SBQ (Section

We can write

Before deriving the expression of

The expressions of

When

Let

All Gaussian models considered in the numerical experiments presented assume a zero mean, i.e.,

The experiments presented resort to several parametric families for the kernel

The parameter
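As an illustration of such parametric families (the particular kernels and parametrisation below are assumptions for the sketch, not necessarily those used in the experiments), two common correlation kernels are shown, each governed by a range parameter theta that controls how fast correlation decays with distance:

```python
import numpy as np

def sq_exp(r, theta):
    """Squared-exponential correlation with range parameter theta."""
    return np.exp(-(r / theta) ** 2 / 2)

def matern32(r, theta):
    """Matern 3/2 correlation with range parameter theta."""
    a = np.sqrt(3) * np.abs(r) / theta
    return (1 + a) * np.exp(-a)

r = np.linspace(0, 1, 5)      # distances at which to evaluate
c1 = sq_exp(r, 0.3)           # decays faster at large distances
c2 = matern32(r, 0.3)         # rougher sample paths, heavier tail
```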

The material in Section

Generation of realisations from the GP model requires factorisation of the matrix
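A standard way to carry this out (sketched below with an assumed squared-exponential covariance and a small jitter term for numerical stability) is to compute a Cholesky factor L with K = LLᵀ and set the realisation to L·ε for a standard normal vector ε:

```python
import numpy as np

# Sample a zero-mean GP with squared-exponential covariance on a grid,
# via a (jittered) Cholesky factorisation of the covariance matrix.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.2**2))
L = np.linalg.cholesky(K + 1e-8 * np.eye(len(x)))   # jitter for stability
sample = L @ rng.standard_normal(len(x))            # one GP realisation
```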

In Section

The authors acknowledge the fruitful collaboration with the other partners of the ANR project INDEX, in particular Bertrand Iooss, Elias Fekhari and Joseph Muré from EDF R&D Chatou, France.