1 Introduction
Let us place ourselves in a classical scenario where data about an unknown function take the form
(1) 
The values and the evaluations points are available to the learner. The goal is to ‘learn’ the function from the data (1) by producing a surrogate function for
. Supervised Machine Learning methods compute such an
from a hypothesis class selected in advance. The performance of a method then depends on the choice of this hypothesis class: a good class should obviously approximate functions of interest well. This translates into a small approximation error, which is one of the constituents towards the total error of a method. Another constituent is the estimation error. In classical Statistical Learning vapnik1999overview, the latter is often analyzed by adopting a postulate that the ’s are independent realizations of a random variable with an unknown distribution on . While relevant in many applications, this postulate may not hold in general, encouraging the development of learning frameworks that are robust to nonIID data.In this work, we consider the regression problem from an Optimal Recovery perspective, without recourse to IID assumption on data. Indeed, in the absence of randomness, an averagecase analysis is not possible anymore. Instead, the learner aims at minimizing the worstcase (prediction) error by relying on a model assumption comparable to choosing a hypothesis class. We restrict our attention here to Hilbert spaces and provide the following contributions:

We develop a numerical framework for calculating the worstcase error in the case of finitedimensional Hilbert spaces. In particular, we show that this error can be computed via a semidefinite program (Theorem 1).

We show that Optimal Recovery provides a formula which is userfriendly from an algorithmic pointofview when the hypothesis class is a linear subspace (Theorem 2). Interestingly, this formula coincides with kernel ridgeless regression in some cases (Theorem 3), proving that minimizing the average error and worstcase error can yield the same solution.
The theoretical findings, whose proofs are included in the supplementary material, are verified through some numerical experiments presented in Section 5.
Why Optimal Recovery?
The theory of Optimal Recovery was developed in the 70’s80’s as a subfield of Approximation Theory (see the surveys MicRiv; MicRiv2). Its development was shaped by concurrent developments in the theory of spline functions (see e.g. de1963best; duchon1977splines). Splines provided a rare example where the theory integrated computations de1977computational. But, at that time, algorithmic issues were not the high priority that they have become today and theoretical questions such as the existence of linear optimal algorithms prevailed (see e.g. the survey packel1988linear). Arguably, this neglect hindered the development of the topic and this work can be seen as an attempt to promote an algorithmic framework that sheds light on similarities and differences between Optimal Recovery (in Hilbert spaces) and Statistical Learning. Incidentally, what is sometimes called the spline algorithm
in Optimal Recovery has recently made a reappearance in Machine Learning circles as minimumnorm interpolation
belkin2018understand; rakhlin2019consistency; Liang2018JustIK, of course with a different motivation. We also remark that Optimal Recovery is not the only framework dealing with nonIID data. There are indeed other strands of Machine Learning literature (e.g. Online Learning hazan2016introduction and Federated Learning zhao2018federated) that investigate learning from nonIID data.Noisy observations.
A careful reader may wonder about the possibility of incorporating an error in the data
, which is a common consideration in Machine Learning. We do not investigate such a scenario in this work, as our main focus is on drawing interesting connections between Optimal Recovery and some of the common Supervised Learning techniques in the simplest of settings first. Future works will concentrate on this inaccurate scenario, which is already welldefined and for which some results exists, see
plaskota1996noisy; ettehad2020instances.2 The Optimal Recovery Perspective
In this section, we present the general framework of optimal recovery and provide some results, including the computation of worstcase error and the explicit formula of optimal recovery map.
The function space.
Echoing the theory of Optimal Recovery, we consider the function more asbtractly as an element from a normed space . The output data , which are evaluations of at the points ’s, can be generalized to linear functionals ’s applied to , so that the data take the form
(2) 
For convenience, we summarize these data as
(3) 
where the linear map is called the observation operator. Relevant situations include the case where is the space of continuous functions on , which is equipped with the uniform norm, and the case where is a Hilbert space , which is equipped with the norm derived from its inner product. It is the latter case that is the focus of this work. More precisely, after recalling some known results, we concentrate on a reproducing kernel Hilbert space of functions defined on , so that the point evaluations at the ’s are indeed welldefined and continuous linear functionals on .
The model set.
Without further information, data by themselves are not sufficient to say anything meaningful about . For example, one could think of all ways to fit a univariate function through points if no restriction is imposed. Thus, a model assumption for the functions of interest is needed. This assumption takes the form
(4) 
where the model set translates an educated belief about the behavior of realistic functions . In Optimal Recovery, the set is often chosen to be a convex and symmetric subset of . Here, our relevant modeling assumption is the one that occurs implicitly in Machine Learning, namely that the functions of interest are wellapproximated by suitable hypothesis classes. In this work, we only consider hypothesis classes that are linear subspaces of . Thus, given an approximation parameter (the targeted approximation error), our model set has the form
(5) 
where . In the case of a Hilbert space, this model set reads
(6) 
where is the orthogonal projection of onto the subspace . Such an approximability set was put forward by binev2017data, who were motivated by parametric PDEs. When working with this model, it is implicitly assumed that
(7) 
otherwise the existence of a nonzero would imply that each , , is both dataconsistent () and modelconsistent (), leading to infinite worstcase error by letting . By a dimension argument, the assumption (7) forces
(8) 
i.e., we must place ourselves in an underparametrized regime where there are less model parameters than datapoints. To make sense of the overparametrized regime, the model set (5) would need to be refined by adding some boundedness conditions, see foucart2020instances for results in this direction.
Worstcase errors.
We now need to assess the performance of a learning/recovery map, which is just a map taking data as input and returning an element as output. Given a model set , the local worstcase error of such a map at is
(9) 
The global worstcase error is the worst local worstcase error over all that can be obtained by observing some , i.e.,
(10) 
A learning/recovery map is called locally, respectively globally, optimal if it minimizes the local, respectively global, worstcase error. These definitions can be extended to handle not only the full recovery of but also the recovery of a quantity of interest . That is, for a map from into another normed space , one would define e.g. the global worstcase error of the learning/recovery map as
(11) 
Such a framework is pertinent even if we target the full recovery of but with performance evaluated in a norm different from the native norm , as we can consider to be the identity map from equipped with into equipped with .
Perhaps counterintuitively, dealing with the global setting is somewhat easier than dealing with the local setting, in the sense that globally optimal maps have been obtained in situations where locally optimal maps have not, e.g. when . Accordingly, it is the local setting which is the focus of this work.
Computation of local worstcase errors.
When is a Hilbert space and the approximability model (6) is selected, determining the local worstcase error of a given map at some involves solving
(12) 
This is a nonconvex optimization program, and as such does appear hard to solve at first sight. However, it is a quadratically constrained quadratic program, hence it is possible to solve it exactly. Although Gurobi gurobi now features direct capabilities to solve quadratically constrained quadratic programs, we take the route of recasting (12) as a semidefinite program using the Slemma polik2007survey. The solution of the recast program can then be obtained using an offtheshelf semidefinite solver, at least when is a Hilbert space of finite dimension, say . Precisely, with denoting an orthonormal basis for chosen so that is an orthonormal basis for and with denoting the unitary map , local worstcase errors can be computed based on the following observation.
Theorem 1.
The local worstcase error of a learning/recovery map at under the model set (5) can be expressed, with , as
(13) 
where is the unique element from satisfying and is the minimal value of the following program, in which :
(14) 
Optimal learning/recovery map.
Even though it is possible to compute the minimal worstcase error via (13)(14), optimizing over to produce the locally optimal recovery map would still require some work and would in fact be a major overkill. Indeed, for our situation of interest, some crucial work in this direction has been carried out in binev2017data, and we rely on it to derive the announced userfriendly formula for the optimal recovery map . Precisely, when is a (finite or infinitedimensional) Hilbert space and the model set is given by (6), binev2017data showed that, for any input , the output is the solution to the convex minimization program
(15) 
Their argument, based on the original expression (9) of the worstcase error, exploits the fact that is orthogonal not only to but also to . Let us point out that is both dataconsistent and modelconsistent when for some . It is also interesting to note that the optimal recovery map does not depend on the approximation parameter . This peculiarity disappears as soon as observation errors are taken into consideration, see ettehad2020instances.
A computable expression for the minimal local error (9), and in turn for the minimal global error (10), has also been given in binev2017data. Without going into details, we only want to mention that latter decouples as the product of an indicator of compatibility between model and data points, which increases as the space is enlarged, and of the parameter of approximability, which decreases as the space is enlarged. Thus, the choice of a space yielding small minimal worstcase errors involves a tradeoff on . This tradeoff is illustrated numerically in Subsection 5.2.
Although the description (15) of the optimal learning/recovery map is quite informative, it fails to make apparent the fact the map is actually a linear map. This fact can be seen from the theorem below, which states that solving a minimization program for each is not needed to produce . Indeed, one can obtain by some linear algebra computations involving two matrices which are more or less directly available to the learner. To define these matrices, we need the Riesz representers of the linear functionals , which are characterized by
We also need a (not necessarily orthonormal) basis for . The two matrices in question are the Gramian of and the crossGramian of and . Their entries are given, for and , by
(16)  
(17) 
The matrix is positive definite and in particular invertible (linear independence of the ’s is assumed). The matrix has full rank thanks to the assumption . The result below (proved in the supplementary material) shows that the output of the optimal learning/recovery map does not have to lie in the space (the hypothesis class), as opposed to the output of algorithms such as empirical risk minimizations.
Theorem 2.
The locally optimal recovery map is given in closed form for each by
(18) 
where the coefficient vectors
and are computed as(19)  
(20) 
Recalling from (8) that , the time cost of calculating the coefficient vectors and is .
Remark. When the goal is to learn/recover for some linear quantity of interest , the above recipe still produces the locally optimal map, which turns out to be . One advantage of this situation is that the full knowledge of (a basis for) the space is not needed, since only the values of the ’s and ’s are required to form .
3 Relation to Supervised Learning
Supervised learning algorithms take data as input (while also being aware of the ’s) and return functions as outputs, so they can be viewed as learning/recovery maps . We examine below how some of them compare to the map from Theorem 2.
Empirical risk minimizations.
The outputs returned by these algorithms belong to a hypothesis space chosen in advance from the belief that it provides good approximants for reallife functions. Since this implicit belief corresponds to the explicit assumption expressed by the model set (5), our Optimal Recovery algorithm and empirical risk minimization algorithms are directly comparable, in that they both depend on a common approximation space/hypothesis class
. With a loss function chosen as a
th power of an norm for , empirical risk minimization algorithms consist in solving the convex optimization program(21) 
In the case of the square loss, the solution actually reads
(22) 
where the matrix still represents the crossGramian introduced in (17).
Kernel regressions.
Kernel regression algorithms usually operate in the setting of Reproducing Kernel Hilbert Spaces (see next section), but they can be phrased for arbitrary Hilbert spaces, too. For instance, the traditional kernel ridge regression consists in solving the following convex optimization problem
(23) 
for some parameter . In the limit , one obtains kernel ridgeless regression, which consists in solving the convex optimization problem
(24) 
This algorithm fits the training data perfectly and also generalizes well Liang2018JustIK.
The crucial observation we wish to bring forward here is that kernel ridgeless regression, although not designed with this intention, is also an Optimal Recovery method. Indeed, (24) appears as the special case of the convex optimization program (15) with the choice . Using Theorem 2, we can retrieve in particular that kernel ridgeless regression is explicitly given by
(25) 
Incidentally, the latter can also be interpreted as the special case , since is a linear combination of the Riesz representers that satisfy the observation constraint . In fact, there are more choices for that leads to kernel ridgeless regression, as revealed below.
Theorem 3.
If the approximation space is for some subset of , then the locally optimal recovery map (15) reduces to kernel ridgeless regression independently of .
Spline models.
From an Optimal Recovery pointofview, the success of (24) can be surprising because it seems to use only data and no model assumption. In fact, the model assumption occurs in the objective function being minimized. Procedure (24) favors dataconsistent functions which are themselves small. If one preferred to favor dataconsistent functions which have small derivatives, one would instead consider, say, the program
(26) 
with optimization variable in the Sobolev space . As it turns out, this procedure coincides with the Optimal Recovery method that minimizes the worstcase error over the model set given by and its solution is known explicitly de1963best. With (where one tries to minimize the strain energy of a curve constrained to pass through a prescribed set of points), the solution is a cubic spline, see wahba1990spline for details. For multivariate functions, the solutions to problems akin to (26) are also known explicitly: they are thin plate splines duchon1977splines. More generally, minimum(semi)norm interpolation problems are what define the concept of abstract splines de1981convergence.
Remark. When observation error is present, exact interpolation conditions should not be enforced, so it is natural to subsitutute (15) by a regularized problem similar to (23) but with acting as a reguralizer instead of . This has already been proposed in li2007generalized under the name Generalized Regularized LeastSquares, of course with a different motivation than Optimal Recovery.
4 Optimal Recovery in Reproducing Kernel Hilbert Spaces (RKHS)
We consider in this section the case where is a Hilbert space of functions defined on a domain for which point evaluations are continuous linear functionals. In other words, we consider a reproducing kernel Hilbert space , where denotes the kernel characterized, for any , by
(27) 
In this way, the Riesz representers of points evaluations at ’s take the form . Thus, the Gramian of (16) has entries
(28) 
As for the crossGramian of (17), it has entries
(29) 
where represents a basis for the space . Some possible choices of and are discussed below.
Choosing the kernel.
A kernel that is widely used in many learning problems is the Gaussian kernel given, for some parameter , by
(30) 
The associated infinitedimensional Hilbert space, which is explicitly characterized in minh2010gaussiankernel, has orthonormal basis , where
(31) 
Choosing the approximation space.
Since a learning/recovery procedure uses both data and model (maybe implicitly), its performance depends on the interaction between the two. In Optimal Recovery, and subsequently in InformationBased Complexity traub2003information, it is often assumed that the model is fixed and that the user has the ability to choose evaluation points in a favorable way. From another angle, one can view the evaluation points as being fixed but the model could be chosen accordingly. For the applicability of Theorem 2, it is perfectly fine to select an approximation space depending on , so long as it does not depend on . Thus, one possible choice for the approximation space is for some subset . However, we have seen in Theorem 3
that such a choice invariably leads to kernel ridgeless regression. Another choice for the approximation space is inspired by linear regression, which uses the space
. We do not consider this space verbatim, because its elements (or any polynomial function, for that matter, see minh2010gaussiankernel) do not belong to the Reproducing kernel Hilbert space with Gaussian kernel. Instead, we modify it slightly by multiplying with a decreasing exponential and by allowing for degrees higher than one, so as to consider the space(32) 
which has dimension . We ignore the coefficients of in numerical experiments, which has no effects on the test error. These ’s are the socalled ‘Taylor features’ used in approximation of the Gaussian kernel cotter2011explicit.
5 Experimental Validation
5.1 Comparison of worstcase errors
We first compare worstcase errors for the optimal recovery map (OR) described in Theorem 2 and for empirical risk minimizations defined in (21). They are only considered with (ERM1) and (ERM2). The algorithms OR, ERM1, and ERM2 all operate with a specific space (as a hypothesis class), so direct comparisons can be made by selecting the same for all these algorithms. According to Theorem 1, when is a finitedimensional Hilbert space, the computation of their worstcase errors is performed by semidefinite programming. Here, we restrict ourselves to the case where is a dimensional subspace of , with and . The linear observations are randomly generated. Figure 1 confirms that OR yields the smallest worstcase errors and suggests that often, but not always, ERM2 yields smaller worstcase errors than ERM1. It also hints at a quasilinear dependence of the worstcase errors on the approximability parameter .
5.2 Test errors for nonIID data
In this subsection, we the implement optimal recovery map on two realworld regression datasets, namely Years Prediction and Energy Use, both available on UCI Machine Learning Repository.
We focus on the RKHS associated with Gaussian kernel throughout this experiment. The space is spanned by a subset of Taylor features of order , see (32), so that goes up to , where is the number of features in the datasets. To choose the optimal kernel width, we conduct a grid search. Furthermore, to make the data nonIID, we sort both datasets according to their th feature in a descending order and then select the top as the training set and the bottom as the test set. Recall by Theorem 2 that the optimal recovery map depends on the Hilbert space and a subspace . Therefore, it is natural to compare it to kernel ridgeless regression (25) (in ) and Taylor features regression (22) (in ).
The test error comparison is presented in Figure 2. Due to the size of Years Prediction dataset, we do not perform kernel ridgeless regression on the full dataset, so we randomly subsample a subset of the data and repeat the experiment for Monte Carlo simulations to average out the randomness. Therefore, error bars are presented in Figure 2(a) to show the statistical significance. We observe that Optimal Recovery shows promising performance on both datasets. On Years Prediction dataset, Optimal Recovery outperforms kernel ridgeless regression for all . On Energy Use dataset, it outperforms kernel ridgeless regression after . Also, Taylor features regression in the space is consistently inferior to the optimal recovery map. The Ushape Optimal Recovery curve in Figure 2(a) demonstrates the tradeoff between the compatibility indicator and the approximability parameter .
6 Conclusion
Generalization guarantees in Statistical Learning are based on the postulate of IID data, the pertinence of which is not guaranteed in all learning environments. In this work, we considered the regression problem (with nonIID data) in Hilbert spaces from an Optimal Recovery pointofview, where the learner aims at minimizing with the worstcase error. We first formulated a semidefinite program for calculating the worstcase error of any recovery map in finitedimensional Hilbert spaces. Then, we provided a closedform expression for optimal recovery map in the case where the hypothesis class is a linear subspace of any Hilbert space. The formula coincides with kernel ridgeless regression when in a RKHS. Our numerical experiments showed that, when , Optimal Recovery has the potential to outperform kernel ridgeless regression in the test mean squared error.
Our main focus was to provide an algorithmic perspective to Optimal Recovery, whose theory was initiated in the 70’s80’s. Our findings revealed interesting connections with current Machine Learning methods. There are many directions to consider in the future, including:

learning the hypothesis space from the data (instead of incorporating domain knowledge);

developing Optimal Recovery with noise/error in the observations;

studying the overparametrized regime where ;

investigating the case where the hypothesis class is not a linear space.
References
Supplementary Material
Proof of Theorem 1. We first justify the claim that there exists a unique such that . To see this, define the linear map . Since , the map is injective. Then, by the by ranknullity theorem, , so the map is also surjective. Thus, the claim is justified.
Next, the squared local worstcase error (9) at is
(33) 
Decomposing and as and with and , the condition reduces to , i.e., is uniquely determined. The condition then becomes , where . As for the expression to maximize, it separates into
(34) 
Up to the additive constant , the maximum in (33) is now
(35)  
Writing with , this latter constraint reads
(36)  
whenever 
By the Slemma, see e.g. polik2007survey, (36) is equivalent to the existence of such that
(37) 
for all , or in other words, to the existence of such that
(38) 
for all . This constraint can be reformulated as a semidefinite constraint
(39) 
Keeping in mind that , this is the semidefinite constraint appearing in (14). Putting everything together, we arrive at the expression for the local worstcase error announced in (13). ∎
Proof of Theorem 2. Let be the solution to (15). We shall first recall the argument explaining that is orthogonal to before showing that this orthogonality condition, together with the condition , characterizes uniquely as the element given in (18). For the orthogonality condition, consider any and any , and notice that the expression
(40) 
is minimized when . This forces , and since is orthogonal to , we have for all , as required.
In view of , it follows that
(41) 
Taking inner product with leads to . Then, expanding on , we obtain
(42) 
Taking inner product with leads to and in turn to after multiplying by . The latter yields the expression for given in (19), while the former yields the expression for given in (20). ∎
Proof of Theorem 3. Let for some and let be the output of kernel ridgeless regression. According to the previous proof, to prove that is the solution to (15), we have to verify that . Since we already know that (recall that kernel ridgeless regression is (15) with in place of ), it remains to check that . This simply follows from . ∎
Comments
There are no comments yet.