Learning from Non-IID Data in Hilbert Spaces: An Optimal Recovery Perspective

06/05/2020 · by Simon Foucart, et al.

The notion of generalization in classical Statistical Learning is often attached to the postulate that data points are independent and identically distributed (IID) random variables. While relevant in many applications, this postulate may not hold in general, encouraging the development of learning frameworks that are robust to non-IID data. In this work, we consider the regression problem from an Optimal Recovery perspective. Relying on a model assumption comparable to choosing a hypothesis class, a learner aims at minimizing the worst-case (prediction) error, without recourse to any IID assumption on the data. We first develop a semidefinite program for calculating the worst-case error of any recovery map in finite-dimensional Hilbert spaces. Then, for any Hilbert space, we show that Optimal Recovery provides a formula which is user-friendly from an algorithmic point-of-view, as long as the hypothesis class is a linear subspace. Interestingly, this formula coincides with kernel ridgeless regression in some cases, proving that minimizing the average error and the worst-case error can yield the same solution. We provide numerical experiments in support of our theoretical findings.


1 Introduction

Let us place ourselves in a classical scenario where data about an unknown function f : X → R take the form

(1)   y_i = f(x^(i)),   i = 1, …, m.

The values y_1, …, y_m and the evaluation points x^(1), …, x^(m) ∈ X are available to the learner. The goal is to ‘learn’ the function f from the data (1) by producing a surrogate function f̂ for f. Supervised Machine Learning methods compute such an f̂ from a hypothesis class selected in advance. The performance of a method then depends on the choice of this hypothesis class: a good class should obviously approximate functions of interest well. This translates into a small approximation error, which is one of the constituents of the total error of a method. Another constituent is the estimation error. In classical Statistical Learning vapnik1999overview, the latter is often analyzed by adopting the postulate that the x^(i)'s are independent realizations of a random variable with an unknown distribution on X. While relevant in many applications, this postulate may not hold in general, encouraging the development of learning frameworks that are robust to non-IID data.

In this work, we consider the regression problem from an Optimal Recovery perspective, without recourse to any IID assumption on the data. Indeed, in the absence of randomness, an average-case analysis is not possible anymore. Instead, the learner aims at minimizing the worst-case (prediction) error by relying on a model assumption comparable to choosing a hypothesis class. We restrict our attention here to Hilbert spaces and provide the following contributions:

  • We develop a numerical framework for calculating the worst-case error in the case of finite-dimensional Hilbert spaces. In particular, we show that this error can be computed via a semidefinite program (Theorem 1).

  • We show that Optimal Recovery provides a formula which is user-friendly from an algorithmic point-of-view when the hypothesis class is a linear subspace (Theorem 2). Interestingly, this formula coincides with kernel ridgeless regression in some cases (Theorem 3), proving that minimizing the average error and worst-case error can yield the same solution.

The theoretical findings, whose proofs are included in the supplementary material, are verified through some numerical experiments presented in Section 5.

Why Optimal Recovery?

The theory of Optimal Recovery was developed in the 70’s-80’s as a subfield of Approximation Theory (see the surveys MicRiv; MicRiv2). Its development was shaped by concurrent developments in the theory of spline functions (see e.g. de1963best; duchon1977splines). Splines provided a rare example where the theory integrated computations de1977computational. But, at that time, algorithmic issues were not the high priority that they have become today, and theoretical questions such as the existence of linear optimal algorithms prevailed (see e.g. the survey packel1988linear). Arguably, this neglect hindered the development of the topic, and this work can be seen as an attempt to promote an algorithmic framework that sheds light on similarities and differences between Optimal Recovery (in Hilbert spaces) and Statistical Learning. Incidentally, what is sometimes called the spline algorithm in Optimal Recovery has recently made a reappearance in Machine Learning circles as minimum-norm interpolation belkin2018understand; rakhlin2019consistency; Liang2018JustIK, of course with a different motivation. We also remark that Optimal Recovery is not the only framework dealing with non-IID data. There are indeed other strands of Machine Learning literature (e.g. Online Learning hazan2016introduction and Federated Learning zhao2018federated) that investigate learning from non-IID data.

Noisy observations.

A careful reader may wonder about the possibility of incorporating an error in the data (1), which is a common consideration in Machine Learning. We do not investigate such a scenario in this work, as our main focus is on drawing interesting connections between Optimal Recovery and some of the common Supervised Learning techniques in the simplest of settings first. Future works will concentrate on this inaccurate scenario, which is already well-defined and for which some results exist, see plaskota1996noisy; ettehad2020instances.

2 The Optimal Recovery Perspective

In this section, we present the general framework of Optimal Recovery and provide some results, including the computation of the worst-case error and an explicit formula for the optimal recovery map.

The function space.

Echoing the theory of Optimal Recovery, we consider the function f more abstractly as an element of a normed space F. The output data y_i, which are evaluations of f at the points x^(i), can be generalized to linear functionals λ_i applied to f, so that the data take the form

(2)   y_i = λ_i(f),   i = 1, …, m.

For convenience, we summarize these data as

(3)   y = Λf ∈ R^m,

where the linear map Λ : F → R^m is called the observation operator. Relevant situations include the case where F is the space of continuous functions on X, which is equipped with the uniform norm, and the case where F is a Hilbert space H, which is equipped with the norm derived from its inner product. It is the latter case that is the focus of this work. More precisely, after recalling some known results, we concentrate on a reproducing kernel Hilbert space H of functions defined on X, so that the point evaluations at the x^(i)'s are indeed well-defined and continuous linear functionals on H.

The model set.

Without further information, data by themselves are not sufficient to say anything meaningful about f. For example, one could think of all the ways to fit a univariate function through m points if no restriction is imposed. Thus, a model assumption for the functions of interest is needed. This assumption takes the form

(4)   f ∈ K,

where the model set K ⊆ F translates an educated belief about the behavior of realistic functions f. In Optimal Recovery, the set K is often chosen to be a convex and symmetric subset of F. Here, our relevant modeling assumption is the one that occurs implicitly in Machine Learning, namely that the functions of interest are well approximated by suitable hypothesis classes. In this work, we only consider hypothesis classes that are linear subspaces of F. Thus, given an approximation parameter ε > 0 (the targeted approximation error), our model set has the form

(5)   K = { f ∈ F : dist(f, V) ≤ ε },

where V is a linear subspace of F of dimension n. In the case of a Hilbert space, this model set reads

(6)   K = { f ∈ H : ||f − P_V f|| ≤ ε },

where P_V f is the orthogonal projection of f onto the subspace V. Such an approximability set was put forward by binev2017data, who were motivated by parametric PDEs. When working with this model, it is implicitly assumed that

(7)   V ∩ ker(Λ) = {0},

otherwise the existence of a nonzero v ∈ V ∩ ker(Λ) would imply that each f + tv, t ∈ R, is both data-consistent (Λ(f + tv) = y) and model-consistent (f + tv ∈ K), leading to an infinite worst-case error by letting t → ∞. By a dimension argument, the assumption (7) forces

(8)   n ≤ m,

i.e., we must place ourselves in an underparametrized regime where there are fewer model parameters than data points. To make sense of the overparametrized regime, the model set (5) would need to be refined by adding some boundedness conditions, see foucart2020instances for results in this direction.

Worst-case errors.

We now need to assess the performance of a learning/recovery map, which is just a map Δ : R^m → F taking the data y as input and returning an element Δ(y) ∈ F as output. Given a model set K, the local worst-case error of such a map Δ at y ∈ R^m is

(9)   lwce(Δ, y) := sup { ||f − Δ(y)|| : f ∈ K, Λf = y }.

The global worst-case error is the worst local worst-case error over all y that can be obtained by observing some f ∈ K, i.e.,

(10)   gwce(Δ) := sup { ||f − Δ(Λf)|| : f ∈ K }.

A learning/recovery map is called locally, respectively globally, optimal if it minimizes the local, respectively global, worst-case error. These definitions can be extended to handle not only the full recovery of f but also the recovery of a quantity of interest Q(f). That is, for a map Q from F into another normed space Z, one would define e.g. the global worst-case error of a learning/recovery map Δ : R^m → Z as

(11)   gwce(Δ) := sup { ||Q(f) − Δ(Λf)||_Z : f ∈ K }.

Such a framework is pertinent even if we target the full recovery of f but with performance evaluated in a norm different from the native norm of F, as we can consider Q to be the identity map from F equipped with one norm into F equipped with the other.

Perhaps counterintuitively, dealing with the global setting is somewhat easier than dealing with the local setting, in the sense that globally optimal maps have been obtained in situations where locally optimal maps have not, e.g. when F is the space of continuous functions equipped with the uniform norm. Accordingly, it is the local setting which is the focus of this work.

Computation of local worst-case errors.

When F is a Hilbert space H and the approximability model (6) is selected, determining the local worst-case error of a given map Δ at some y ∈ R^m involves solving

(12)   maximize ||f − Δ(y)||^2   subject to   Λf = y  and  ||f − P_V f|| ≤ ε,

with optimization variable f ∈ H. This is a nonconvex optimization program, and as such it appears hard to solve at first sight. However, it is a quadratically constrained quadratic program, hence it is possible to solve it exactly. Although Gurobi gurobi now features direct capabilities to solve quadratically constrained quadratic programs, we take the route of recasting (12) as a semidefinite program using the S-lemma polik2007survey. The solution of the recast program can then be obtained using an off-the-shelf semidefinite solver, at least when H is a Hilbert space of finite dimension, say N. Precisely, choosing an orthonormal basis for H whose first n elements form an orthonormal basis for V, and identifying each element of H with its coordinate vector in R^N via the corresponding unitary map, local worst-case errors can be computed based on the following observation.

Theorem 1.

The local worst-case error of a learning/recovery map at under the model set (5) can be expressed, with , as

(13)

where is the unique element from satisfying and is the minimal value of the following program, in which :

(14)
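Since the exact semidefinite program (14) is not reproduced here, the snippet below is only a minimal sketch of one possible S-lemma recasting of (12), written in coordinates after identifying H with R^N. It assumes numpy and cvxpy, that the observation matrix A has full row rank with m < N, and that the constraint set admits a strictly feasible point so that the S-lemma applies; it illustrates the computation rather than the paper's own program (14).

```python
# Hedged sketch (not the paper's exact program (14)): computing the local worst-case
# error sup{ ||x - z|| : A x = y, ||x - P_V x|| <= eps } via an S-lemma semidefinite
# reformulation in coordinates. Assumes A has full row rank m < N and strict feasibility.
import numpy as np
import cvxpy as cp

def local_worst_case_error(A, y, P_V, z, eps):
    """A: (m,N) observation matrix, y: (m,) data, P_V: (N,N) projector onto V,
    z: (N,) coordinates of Delta(y), eps: approximability parameter."""
    m, N = A.shape
    Q = np.eye(N) - P_V                          # projector onto the orthogonal complement of V
    x0, *_ = np.linalg.lstsq(A, y, rcond=None)   # one solution of A x = y
    B = np.linalg.svd(A)[2].T[:, m:]             # orthonormal basis of ker(A) (full row rank assumed)
    k = B.shape[1]

    t = cp.Variable()                            # candidate upper bound on the squared error
    lam = cp.Variable(nonneg=True)               # S-lemma multiplier
    # Require  t - ||x0 + B w - z||^2 - lam*(eps^2 - ||Q(x0 + B w)||^2) >= 0  for all w,
    # i.e. positive semidefiniteness of the associated quadratic form in (w, 1).
    M2 = lam * (B.T @ Q @ B) - np.eye(k)
    m1 = lam * (B.T @ Q @ x0) - B.T @ (x0 - z)
    m0 = t - float((x0 - z) @ (x0 - z)) - lam * eps**2 + lam * float((Q @ x0) @ (Q @ x0))
    LMI = cp.bmat([[M2, cp.reshape(m1, (k, 1))],
                   [cp.reshape(m1, (1, k)), cp.reshape(m0, (1, 1))]])  # symmetric by construction
    prob = cp.Problem(cp.Minimize(t), [LMI >> 0])
    prob.solve(solver=cp.SCS)
    return float(np.sqrt(max(t.value, 0.0)))     # exact under the S-lemma hypotheses
```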

Optimal learning/recovery map.

Even though it is possible to compute the minimal worst-case error via (13)-(14), optimizing over Δ to produce the locally optimal recovery map would still require some work and would in fact be a major overkill. Indeed, for our situation of interest, some crucial work in this direction has been carried out in binev2017data, and we rely on it to derive the announced user-friendly formula for the optimal recovery map Δ_opt. Precisely, when F is a (finite- or infinite-dimensional) Hilbert space H and the model set is given by (6), binev2017data showed that, for any input y ∈ R^m, the output Δ_opt(y) is the solution to the convex minimization program

(15)   minimize ||f − P_V f||   subject to   Λf = y,

with optimization variable f ∈ H. Their argument, based on the original expression (9) of the worst-case error, exploits the fact that Δ_opt(y) − P_V Δ_opt(y) is orthogonal not only to V but also to ker(Λ). Let us point out that Δ_opt(y) is both data-consistent and model-consistent when y = Λf for some f ∈ K. It is also interesting to note that the optimal recovery map does not depend on the approximation parameter ε. This peculiarity disappears as soon as observation errors are taken into consideration, see ettehad2020instances.

A computable expression for the minimal local error (9), and in turn for the minimal global error (10), has also been given in binev2017data. Without going into details, we only want to mention that the latter decouples as the product of an indicator of compatibility between the model and the data points, which increases as the space V is enlarged, and of the parameter ε of approximability, which decreases as the space V is enlarged. Thus, the choice of a space V yielding small minimal worst-case errors involves a trade-off on the size of V. This trade-off is illustrated numerically in Subsection 5.2.

Although the description (15) of the optimal learning/recovery map is quite informative, it fails to make apparent the fact that the map is actually a linear map. This fact can be seen from the theorem below, which states that solving a minimization program for each y is not needed to produce Δ_opt(y). Indeed, one can obtain Δ_opt(y) by some linear algebra computations involving two matrices which are more or less directly available to the learner. To define these matrices, we need the Riesz representers u_1, …, u_m ∈ H of the linear functionals λ_1, …, λ_m, which are characterized by

λ_i(f) = ⟨u_i, f⟩   for all f ∈ H and all i = 1, …, m.

We also need a (not necessarily orthonormal) basis v_1, …, v_n for V. The two matrices in question are the Gramian G ∈ R^{m×m} of (u_1, …, u_m) and the cross-Gramian C ∈ R^{m×n} of (u_1, …, u_m) and (v_1, …, v_n). Their entries are given, for i, l = 1, …, m and j = 1, …, n, by

(16)   G_{i,l} = ⟨u_i, u_l⟩,
(17)   C_{i,j} = ⟨u_i, v_j⟩ = λ_i(v_j).

The matrix G is positive definite and in particular invertible (linear independence of the u_i's is assumed). The matrix C has full rank thanks to the assumption (7). The result below (proved in the supplementary material) shows that the output of the optimal learning/recovery map does not have to lie in the space V (the hypothesis class), as opposed to the output of algorithms such as empirical risk minimizations.

Theorem 2.

The locally optimal recovery map is given in closed form for each y ∈ R^m by

(18)   Δ_opt(y) = Σ_{i=1}^m b_i u_i + Σ_{j=1}^n a_j v_j,

where the coefficient vectors a ∈ R^n and b ∈ R^m are computed as

(19)   a = (C^⊤ G^{-1} C)^{-1} C^⊤ G^{-1} y,
(20)   b = G^{-1} (y − C a).

Recalling from (8) that n ≤ m, the time cost of calculating the coefficient vectors a and b is dominated by the operations with the m × m Gramian G, hence is O(m^3).
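As an illustration, here is a minimal numpy sketch of the closed-form recipe (18)-(20) as reconstructed above; the function name and interface are ours, and G and C are assumed to be precomputed.

```python
# Hedged sketch of Theorem 2: coefficients of Delta_opt(y) = sum_i b_i u_i + sum_j a_j v_j,
# following (19)-(20). G is the m x m Gramian of the u_i's, C the m x n cross-Gramian <u_i, v_j>.
import numpy as np

def optimal_recovery_coefficients(G, C, y):
    Ginv_y = np.linalg.solve(G, y)                   # G^{-1} y
    Ginv_C = np.linalg.solve(G, C)                   # G^{-1} C
    a = np.linalg.solve(C.T @ Ginv_C, C.T @ Ginv_y)  # a = (C^T G^{-1} C)^{-1} C^T G^{-1} y
    b = np.linalg.solve(G, y - C @ a)                # b = G^{-1} (y - C a)
    return a, b                                      # note: C.T @ b should vanish (orthogonality)
```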

Remark. When the goal is to learn/recover Q(f) for some linear quantity of interest Q, the above recipe still produces the locally optimal map, which turns out to be Q ∘ Δ_opt. One advantage of this situation is that the full knowledge of (a basis for) the space V is not needed, since only the values of the Q(u_i)'s and Q(v_j)'s, together with the entries of G and C, are required to form Q(Δ_opt(y)).

3 Relation to Supervised Learning

Supervised learning algorithms take the data y_1, …, y_m as input (while also being aware of the x^(i)'s) and return functions as outputs, so they can be viewed as learning/recovery maps Δ : R^m → F. We examine below how some of them compare to the map Δ_opt from Theorem 2.

Empirical risk minimizations.

The outputs returned by these algorithms belong to a hypothesis space chosen in advance from the belief that it provides good approximants for real-life functions. Since this implicit belief corresponds to the explicit assumption expressed by the model set (5), our Optimal Recovery algorithm and empirical risk minimization algorithms are directly comparable, in that they both depend on a common approximation space/hypothesis class V. With a loss function chosen as the p-th power of an ℓ_p-norm for p ≥ 1, empirical risk minimization algorithms consist in solving the convex optimization program

(21)   minimize Σ_{i=1}^m |λ_i(f) − y_i|^p   over f ∈ V.

In the case of the square loss (p = 2), the solution actually reads

(22)   Δ_erm(y) = Σ_{j=1}^n a_j v_j   with   a = (C^⊤ C)^{-1} C^⊤ y,

where the matrix C still represents the cross-Gramian introduced in (17).
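For completeness, here is a minimal numpy sketch of the square-loss case (22) as reconstructed above; again the function name is ours and C is assumed precomputed.

```python
# Hedged sketch of empirical risk minimization with the square loss over V:
# Delta_erm(y) = sum_j a_j v_j with a solving the least-squares problem min_a ||C a - y||_2.
import numpy as np

def erm_square_loss_coefficients(C, y):
    a, *_ = np.linalg.lstsq(C, y, rcond=None)  # equals (C^T C)^{-1} C^T y when C has full rank
    return a
```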

Kernel regressions.

Kernel regression algorithms usually operate in the setting of Reproducing Kernel Hilbert Spaces (see next section), but they can be phrased for arbitrary Hilbert spaces, too. For instance, the traditional kernel ridge regression consists in solving the following convex optimization problem

(23)   minimize Σ_{i=1}^m (λ_i(f) − y_i)^2 + γ ||f||^2   over f ∈ H,

for some parameter γ > 0. In the limit γ → 0, one obtains kernel ridgeless regression, which consists in solving the convex optimization problem

(24)   minimize ||f||   subject to   λ_i(f) = y_i,   i = 1, …, m.

This algorithm fits the training data perfectly and also generalizes well Liang2018JustIK.

The crucial observation we wish to bring forward here is that kernel ridgeless regression, although not designed with this intention, is also an Optimal Recovery method. Indeed, (24) appears as the special case of the convex optimization program (15) with the choice V = {0}. Using Theorem 2, we can retrieve in particular that kernel ridgeless regression is explicitly given by

(25)   Δ_rless(y) = Σ_{i=1}^m c_i u_i   with   c = G^{-1} y.

Incidentally, the latter can also be interpreted as the special case V = span{u_1, …, u_m}, since Δ_rless(y) is a linear combination of the Riesz representers that satisfies the observation constraint Λ Δ_rless(y) = y. In fact, there are more choices for V that lead to kernel ridgeless regression, as revealed below.
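In code, (25) and its ridge counterpart only differ by a diagonal shift of the Gramian; the snippet below is a minimal sketch, with the ridge formula c = (G + γI)^{-1} y stated as the standard representer-theorem solution of (23) rather than taken from the paper.

```python
# Hedged sketch: kernel ridgeless regression (25) and, for comparison, kernel ridge
# regression (23), both expressed through the Gramian G of the Riesz representers.
import numpy as np

def ridgeless_coefficients(G, y):
    """c with Delta_rless(y) = sum_i c_i u_i, i.e. c = G^{-1} y."""
    return np.linalg.solve(G, y)

def ridge_coefficients(G, y, gamma):
    """Standard kernel ridge solution c = (G + gamma I)^{-1} y; gamma -> 0 recovers ridgeless."""
    return np.linalg.solve(G + gamma * np.eye(len(y)), y)
```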

Theorem 3.

If the approximation space is V = span{u_i : i ∈ S} for some subset S of {1, …, m}, then the locally optimal recovery map (15) reduces to kernel ridgeless regression, independently of S.

Spline models.

From an Optimal Recovery point-of-view, the success of (24) can be surprising, because it seems to use only data and no model assumption. In fact, the model assumption occurs in the objective function being minimized. Procedure (24) favors data-consistent functions which are themselves small. If one preferred to favor data-consistent functions which have small derivatives, one would instead consider, say, the program

(26)   minimize ||f^(s)||_{L_2}   subject to   f(x^(i)) = y_i,   i = 1, …, m,

with optimization variable f in the Sobolev space of functions possessing s square-integrable derivatives. As it turns out, this procedure coincides with the Optimal Recovery method that minimizes the worst-case error over the model set { f : ||f^(s)||_{L_2} ≤ ε }, and its solution is known explicitly de1963best. With s = 2 (where one tries to minimize the strain energy of a curve constrained to pass through a prescribed set of points), the solution is a cubic spline, see wahba1990spline for details. For multivariate functions, the solutions to problems akin to (26) are also known explicitly: they are thin plate splines duchon1977splines. More generally, minimum-(semi)norm interpolation problems are what define the concept of abstract splines de1981convergence.
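As a concrete instance of the s = 2 case, the sketch below relies on the classical fact that the natural cubic spline interpolant minimizes the integral of |f''|^2 among all interpolants; scipy's CubicSpline with natural boundary conditions returns exactly this interpolant (the data values are illustrative).

```python
# Hedged sketch of (26) with s = 2: minimum-curvature interpolation via the
# natural cubic spline (the minimizer of the integral of |f''|^2 among interpolants).
import numpy as np
from scipy.interpolate import CubicSpline

x = np.array([0.0, 0.7, 1.3, 2.1, 3.0])    # illustrative evaluation points (must be increasing)
y = np.array([0.0, 1.0, 0.5, -0.3, 0.2])   # illustrative data values
f = CubicSpline(x, y, bc_type='natural')   # natural boundary conditions: f'' = 0 at both ends
print(f(1.0), f(2.5))                      # predictions at new points
```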

Remark. When observation error is present, exact interpolation conditions should not be enforced, so it is natural to substitute (15) with a regularized problem similar to (23), but with ||f − P_V f|| acting as a regularizer instead of ||f||. This has already been proposed in li2007generalized under the name Generalized Regularized Least-Squares, of course with a different motivation than Optimal Recovery.

4 Optimal Recovery in Reproducing Kernel Hilbert Spaces (RKHS)

We consider in this section the case where H is a Hilbert space of functions defined on a domain X ⊆ R^d for which point evaluations are continuous linear functionals. In other words, we consider a reproducing kernel Hilbert space H with reproducing kernel k, characterized, for any f ∈ H and x ∈ X, by

(27)   f(x) = ⟨f, k(·, x)⟩.

In this way, the Riesz representers of the point evaluations at the x^(i)'s take the form u_i = k(·, x^(i)). Thus, the Gramian G of (16) has entries

(28)   G_{i,l} = k(x^(i), x^(l)).

As for the cross-Gramian C of (17), it has entries

(29)   C_{i,j} = v_j(x^(i)),

where (v_1, …, v_n) represents a basis for the space V. Some possible choices of the kernel k and of the approximation space V are discussed below.

Choosing the kernel.

A kernel that is widely used in many learning problems is the Gaussian kernel given, for some width parameter σ > 0, by

(30)   k(x, x') = exp( −||x − x'||^2 / (2σ^2) ).

The associated infinite-dimensional Hilbert space, which is explicitly characterized in minh2010gaussiankernel, has an orthonormal basis whose elements are given by

(31)

Choosing the approximation space.

Since a learning/recovery procedure uses both data and model (maybe implicitly), its performance depends on the interaction between the two. In Optimal Recovery, and subsequently in Information-Based Complexity traub2003information, it is often assumed that the model is fixed and that the user has the ability to choose the evaluation points in a favorable way. From another angle, one can view the evaluation points as being fixed, while the model is chosen accordingly. For the applicability of Theorem 2, it is perfectly fine to select an approximation space V depending on the x^(i)'s, so long as it does not depend on the y_i's. Thus, one possible choice for the approximation space is V = span{u_i : i ∈ S} for some subset S of {1, …, m}. However, we have seen in Theorem 3 that such a choice invariably leads to kernel ridgeless regression. Another choice for the approximation space is inspired by linear regression, which uses the space of affine functions. We do not consider this space verbatim, because its elements (or any polynomial function, for that matter, see minh2010gaussiankernel) do not belong to the reproducing kernel Hilbert space with Gaussian kernel. Instead, we modify it slightly by multiplying with a decreasing exponential and by allowing for degrees higher than one, so as to consider the space

(32)   V = span{ x ↦ x^α exp(−||x||^2/(2σ^2)) : α ∈ N_0^d, |α| ≤ r },

which has dimension n = (d+r choose r). We ignore the normalizing coefficients of these basis functions in the numerical experiments, which has no effect on the test error. These functions are the so-called ‘Taylor features’ used in approximation of the Gaussian kernel cotter2011explicit.
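To make the recipe of Theorem 2 concrete in this RKHS setting, here is a minimal, self-contained numpy sketch assembling the Gramian (28) from the Gaussian kernel and a cross-Gramian of unnormalized monomial-times-exponential features in the spirit of (32); the kernel width, the degree, and the synthetic data are illustrative assumptions.

```python
# Hedged sketch: Gaussian-kernel Gramian (28), Taylor-feature cross-Gramian (29),
# and the Theorem 2 coefficients (19)-(20), all on synthetic data.
import numpy as np
from itertools import combinations_with_replacement

def gaussian_gram(X, Z, sigma):
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * X @ Z.T
    return np.exp(-d2 / (2 * sigma**2))

def taylor_features(X, degree, sigma):
    """Columns v_j(x^(i)): monomials of total degree <= degree, times exp(-||x||^2/(2 sigma^2))."""
    envelope = np.exp(-np.sum(X**2, 1) / (2 * sigma**2))
    cols = [envelope]                                       # degree-0 term
    for deg in range(1, degree + 1):
        for idx in combinations_with_replacement(range(X.shape[1]), deg):
            cols.append(envelope * np.prod(X[:, idx], axis=1))
    return np.column_stack(cols)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 3)), rng.normal(size=50)        # synthetic, illustrative data
G = gaussian_gram(X, X, sigma=2.0)
C = taylor_features(X, degree=2, sigma=2.0)
Ginv_C = np.linalg.solve(G, C)
a = np.linalg.solve(C.T @ Ginv_C, C.T @ np.linalg.solve(G, y))   # coefficients a, as in (19)
b = np.linalg.solve(G, y - C @ a)                                # coefficients b, as in (20)
# Prediction at new points Xt: gaussian_gram(Xt, X, 2.0) @ b + taylor_features(Xt, 2, 2.0) @ a
```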

5 Experimental Validation

5.1 Comparison of worst-case errors

We first compare worst-case errors for the optimal recovery map (OR) described in Theorem 2 and for the empirical risk minimizations defined in (21). The latter are only considered with p = 1 (ERM1) and p = 2 (ERM2). The algorithms OR, ERM1, and ERM2 all operate with a specific space V (as a hypothesis class), so direct comparisons can be made by selecting the same V for all these algorithms. According to Theorem 1, when H is a finite-dimensional Hilbert space, the computation of their worst-case errors can be performed by semidefinite programming. Here, we restrict ourselves to the case where V is an n-dimensional subspace of a finite-dimensional Hilbert space H, and the linear observations are randomly generated. Figure 1 confirms that OR yields the smallest worst-case errors and suggests that often, but not always, ERM2 yields smaller worst-case errors than ERM1. It also hints at a quasi-linear dependence of the worst-case errors on the approximability parameter ε.

Figure 1: Worst-case errors of the Optimal Recovery and Empirical Risk Minimization maps. (a) Worst-case errors vs. the approximability parameter ε. (b) Zoomed-in version of the left plot.

5.2 Test errors for non-IID data

In this subsection, we implement the optimal recovery map on two real-world regression datasets, namely Years Prediction and Energy Use, both available on the UCI Machine Learning Repository.

We focus on the RKHS associated with the Gaussian kernel throughout this experiment. The space V is spanned by a subset of the Taylor features of degree r, see (32), so that n goes up to (d+r choose r), where d is the number of features in the datasets. To choose the optimal kernel width, we conduct a grid search. Furthermore, to make the data non-IID, we sort both datasets according to one of their features in descending order and then select the top portion as the training set and the bottom portion as the test set. Recall from Theorem 2 that the optimal recovery map depends on the Hilbert space H and a subspace V. Therefore, it is natural to compare it to kernel ridgeless regression (25) (in H) and Taylor features regression (22) (in V).
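The non-IID split described above can be reproduced with a few lines of numpy; the feature index and the split fraction below are placeholders, since the specific values used in the experiments are not reproduced here.

```python
# Hedged sketch of the non-IID protocol: sort by one feature (descending) and
# split into a top training portion and a bottom test portion.
import numpy as np

def non_iid_split(X, y, feature=0, train_frac=0.5):
    order = np.argsort(-X[:, feature])      # indices sorted by the chosen feature, descending
    cut = int(train_frac * len(y))
    train, test = order[:cut], order[cut:]
    return X[train], y[train], X[test], y[test]
```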

The test error comparison is presented in Figure 2. Due to the size of the Years Prediction dataset, we do not perform kernel ridgeless regression on the full dataset; instead, we randomly subsample a subset of the data and repeat the experiment over several Monte Carlo runs to average out the randomness. Accordingly, error bars are presented in Figure 2(a) to show the statistical significance. We observe that Optimal Recovery shows promising performance on both datasets. On the Years Prediction dataset, Optimal Recovery outperforms kernel ridgeless regression for all the degrees r considered. On the Energy Use dataset, it outperforms kernel ridgeless regression once the degree r is large enough. Also, Taylor features regression in the space V is consistently inferior to the optimal recovery map. The U-shaped Optimal Recovery curve in Figure 2(a) demonstrates the trade-off between the compatibility indicator and the approximability parameter ε.

Figure 2: Test errors of Optimal Recovery and two benchmark regression algorithms on two non-IID datasets. (a) Test error comparison on Years Prediction. (b) Test error comparison on Energy Use.

6 Conclusion

Generalization guarantees in Statistical Learning are based on the postulate of IID data, the pertinence of which is not guaranteed in all learning environments. In this work, we considered the regression problem (with non-IID data) in Hilbert spaces from an Optimal Recovery point-of-view, where the learner aims at minimizing the worst-case error. We first formulated a semidefinite program for calculating the worst-case error of any recovery map in finite-dimensional Hilbert spaces. Then, we provided a closed-form expression for the optimal recovery map in the case where the hypothesis class is a linear subspace of any Hilbert space. The formula coincides with kernel ridgeless regression in an RKHS when the hypothesis class is spanned by kernel functions centered at some of the data points. Our numerical experiments showed that, for other choices of the hypothesis class, Optimal Recovery has the potential to outperform kernel ridgeless regression in terms of test mean squared error.

Our main focus was to provide an algorithmic perspective to Optimal Recovery, whose theory was initiated in the 70’s-80’s. Our findings revealed interesting connections with current Machine Learning methods. There are many directions to consider in the future, including:

  1. learning the hypothesis space from the data (instead of incorporating domain knowledge);

  2. developing Optimal Recovery with noise/error in the observations;

  3. studying the overparametrized regime where n > m;

  4. investigating the case where the hypothesis class is not a linear space.

References

Supplementary Material

Proof of Theorem 1. We first justify the claim that there exists a unique f_y ∈ (ker Λ)^⊥ such that Λf_y = y. To see this, consider the restriction of Λ to (ker Λ)^⊥, viewed as a linear map from (ker Λ)^⊥ into R^m. Since (ker Λ)^⊥ ∩ ker(Λ) = {0}, this map is injective. Then, by the rank-nullity theorem, its range has dimension dim((ker Λ)^⊥) = m (recall that the u_i's are linearly independent), so the map is also surjective. Thus, the claim is justified.

Next, the squared local worst-case error (9) at y is

(33)

Decomposing and as and with and , the condition reduces to , i.e., is uniquely determined. The condition then becomes , where . As for the expression to maximize, it separates into

(34)

Up to the additive constant , the maximum in (33) is now

(35)

Writing with , this latter constraint reads

(36)
whenever

By the S-lemma, see e.g. polik2007survey, (36) is equivalent to the existence of such that

(37)

for all , or in other words, to the existence of such that

(38)

for all . This constraint can be reformulated as a semidefinite constraint

(39)

Keeping in mind that , this is the semidefinite constraint appearing in (14). Putting everything together, we arrive at the expression for the local worst-case error announced in (13). ∎

Proof of Theorem 2. Let f* denote the solution to (15). We shall first recall the argument explaining that f* − P_V f* is orthogonal to ker(Λ) before showing that this orthogonality condition, together with the condition Λf* = y, characterizes f* uniquely as the element given in (18). For the orthogonality condition, consider any h ∈ ker(Λ) and any t ∈ R, and notice that the expression

(40)   ||(f* + t h) − P_V (f* + t h)||^2

is minimized when t = 0. This forces ⟨f* − P_V f*, h − P_V h⟩ = 0, and since f* − P_V f* is orthogonal to V, we have ⟨f* − P_V f*, h⟩ = 0 for all h ∈ ker(Λ), as required.

In view of (ker Λ)^⊥ = span{u_1, …, u_m}, it follows that

(41)   f* − P_V f* = Σ_{i=1}^m b_i u_i   for some b ∈ R^m.

Taking the inner product with each v_j leads to C^⊤ b = 0. Then, expanding P_V f* on the basis (v_1, …, v_n), we obtain

(42)   f* = Σ_{i=1}^m b_i u_i + Σ_{j=1}^n a_j v_j   for some a ∈ R^n.

Taking the inner product with each u_i and using Λf* = y leads to G b + C a = y, and in turn to C^⊤ G^{-1} C a = C^⊤ G^{-1} y after multiplying by C^⊤ G^{-1} and using C^⊤ b = 0. The latter yields the expression for a given in (19), while the former yields the expression for b given in (20). ∎

Proof of Theorem 3. Let V = span{u_i : i ∈ S} for some subset S of {1, …, m} and let g denote the output of kernel ridgeless regression. According to the previous proof, to prove that g is the solution to (15), we have to verify that g − P_V g is orthogonal to ker(Λ). Since we already know that g is orthogonal to ker(Λ) (recall that kernel ridgeless regression is (15) with {0} in place of V), it remains to check that P_V g is orthogonal to ker(Λ). This simply follows from P_V g ∈ V ⊆ span{u_1, …, u_m} = (ker Λ)^⊥. ∎