Can Uniform Convergence Explain Interpolation Learning?

University of British Columbia (UBC) / Alberta Machine Intelligence Institute (Amii)

based on
arXiv:2006.05942
and arXiv:2106.09276,
with Lijia Zhou, Frederic Koehler, and Nathan Srebro

NYU Center for Data Science – November 10, 2021

(Swipe or use arrow keys to move through slides.

Vertical slides are backup slides.)

The HTML version
is the “official” version,
though this PDF is basically the same.

- Given i.i.d. samples $(x_1, y_1), \dots, (x_n, y_n) \sim \mathcal{D}$
- features/covariates $x_i$, labels/targets $y_i$

- Want $\hat f$ such that for new samples $(x, y)$ from $\mathcal{D}$, $L(\hat f) = \mathbb{E}_{(x,y) \sim \mathcal{D}}\, \ell(\hat f(x), y)$ is small
- e.g. squared loss: $\ell(\hat y, y) = (\hat y - y)^2$

- Standard approaches based on empirical risk minimization: $\hat f \in \operatorname{argmin}_{f \in \mathcal{F}} \hat L(f)$, where $\hat L(f) = \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i)$
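As a concrete sketch (my own toy example, not from the slides): for squared loss over linear predictors, ERM is just least squares. The sizes, noise level, and seed below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 200, 3
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = X @ w_true + 0.3 * rng.normal(size=n)

# Empirical risk for squared loss: L_hat(w) = (1/n) sum_i (<w, x_i> - y_i)^2
def empirical_risk(w):
    return np.mean((X @ w - y) ** 2)

# ERM over linear predictors = ordinary least squares
w_erm, *_ = np.linalg.lstsq(X, y, rcond=None)
print(empirical_risk(w_erm))
```

By construction `w_erm` has the smallest empirical risk of any linear predictor, including `w_true` itself.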

We have lots of bounds like: with probability at least $1 - \delta$,

$\sup_{f \in \mathcal{F}} L(f) - \hat L(f) \le \varepsilon_n(\mathcal{F}, \delta)$

$\varepsilon_n$ could be from VC dimension, covering numbers, RKHS norm, Rademacher complexity, fat-shattering dimension, …

Then for large $n$, $\varepsilon_n \to 0$, so $L(\hat f) \le \hat L(\hat f) + \varepsilon_n \approx \hat L(\hat f)$
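Writing the high-probability event as $\sup_{f \in \mathcal{F}} |L(f) - \hat L(f)| \le \varepsilon_n$ and letting $f^* = \operatorname{argmin}_{f \in \mathcal{F}} L(f)$, the standard chain behind this claim is:

```latex
L(\hat f) \;\le\; \hat L(\hat f) + \varepsilon_n
          \;\le\; \hat L(f^*) + \varepsilon_n
          \;\le\; L(f^*) + 2\,\varepsilon_n ,
```

using uniform convergence, then the definition of ERM, then uniform convergence again.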

Classical wisdom: “a model with zero training error is overfit to the training data and will typically generalize poorly”


We'll call a model $f$ with $\hat L(f) = 0$ an *interpolating* predictor

[Figure: test performance of interpolating models with added label noise on MNIST (%); Belkin/Ma/Mandal, ICML 2018]

Lots of recent theoretical work on interpolation.

[Belkin+ NeurIPS 2018],
[Belkin+ AISTATS 2018],
[Belkin+ 2019],
[Hastie+ 2019],

[Muthukumar+ JSAIT 2020],
[Bartlett+ PNAS 2020],
[Liang+ COLT 2020],
[Montanari+ 2019],
many more…

None* bound $\sup_{f \in \mathcal{F}} L(f) - \hat L(f)$.

Is it possible to find such a bound?

Can uniform convergence explain interpolation learning?

*One exception-ish
[Negrea/Dziugaite/Roy, ICML 2020]:

relates the interpolator to a surrogate predictor,

shows uniform convergence for the surrogate.

(Also, a few things since our first paper.)

Today, we're mainly going to worry about *consistency*: $L(\hat f) \to \inf_f L(f)$

…in a *noisy* setting: $\inf_f L(f) > 0$

…for Gaussian linear regression: $x \sim \mathcal{N}(0, \Sigma)$, $y = \langle w^*, x \rangle + \mathcal{N}(0, \sigma^2)$, so $\inf_f L(f) = \sigma^2$

Is it possible to show consistency of an interpolator with uniform convergence?

This requires tight constants! We need a bound converging to exactly $\sigma^2$, not $c\,\sigma^2$ for some $c > 1$.

$x = (x_S, x_J)$: $x_S \sim \mathcal{N}(0, I_{d_S})$ — “signal” | $x_J \sim \mathcal{N}(0, \tfrac{\lambda_n}{d_J} I_{d_J})$ — “junk” | $y = \langle w^*, x_S \rangle + \mathcal{N}(0, \sigma^2)$

$\lambda_n$ controls scale of junk: $\mathbb{E}\|x_J\|^2 = \lambda_n$

Linear regression: $f_w(x) = \langle w, x \rangle$, with $d = d_S + d_J > n$

Min-norm interpolator: $\hat w = \operatorname{argmin}_{w : X w = y} \|w\|_2$

As $d_J \to \infty$, $X_J X_J^\top \to \lambda_n I_n$, so $\hat w$ approaches ridge regression on the signal: $\hat w_S \to (X_S^\top X_S + \lambda_n I)^{-1} X_S^\top y$

$\hat w$ is consistent ($L(\hat w) \to \sigma^2$) when $d_S$ is fixed, $\lambda_n = o(n)$, and $d_J \to \infty$ fast enough
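A quick numerical sketch of the junk-features claim (my own illustration; the sizes $n = 100$, $d_S = 5$, $d_J = 50{,}000$, $\lambda = 10$, $\sigma = 0.5$ are arbitrary): the signal block of the min-norm interpolator approaches ridge regression on the signal features as $d_J$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_S, d_J, lam, sigma = 100, 5, 50_000, 10.0, 0.5

w_star = rng.normal(size=d_S)                         # true signal weights
X_S = rng.normal(size=(n, d_S))                       # "signal" features ~ N(0, I)
X_J = rng.normal(size=(n, d_J)) * np.sqrt(lam / d_J)  # "junk" features ~ N(0, (lam/d_J) I)
y = X_S @ w_star + sigma * rng.normal(size=n)

# Minimum-norm interpolator of the full design [X_S, X_J]
X = np.hstack([X_S, X_J])
w_hat = X.T @ np.linalg.solve(X @ X.T, y)             # = argmin ||w||_2 s.t. Xw = y

# Ridge regression on the signal features alone, with parameter lam
w_ridge = np.linalg.solve(X_S.T @ X_S + lam * np.eye(d_S), X_S.T @ y)

# As d_J grows, X_J X_J^T -> lam * I, so the signal block of w_hat -> w_ridge
print(np.max(np.abs(w_hat[:d_S] - w_ridge)))
```

The printed gap shrinks as `d_J` increases (at fixed `n`), matching the ridge-limit statement.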

Could we have shown that with uniform convergence?

__Theorem:__
In junk features with $\lambda_n = o(n)$, no uniform convergence bound over the norm ball $\{w : \|w\| \le \|\hat w\|\}$ can show consistency

__Theorem:__
In junk features with $\lambda_n = o(n)$, the same holds even restricted to natural, consistent interpolators

Proof idea:

In junk features,
for each sign pattern $\epsilon \in \{\pm 1\}^n$,
let $\hat w_\epsilon$ interpolate the noise-flipped labels $y_i^{(\epsilon)} = \langle w^*, x_{S,i} \rangle + \epsilon_i \xi_i$,

a *natural* consistent interpolator,

and let $\tilde w = \hat w_{-\mathbf{1}}$. Then, almost surely, $L(\tilde w) \to \sigma^2$ but $\hat L(\tilde w) \to 4\sigma^2$

([Negrea/Dziugaite/Roy, ICML 2020] had a very similar result for the min-norm interpolator)

*Natural* interpolators: the learned $\hat w$ doesn't change if any $(x_i, y_i)$ flips to $(-x_i, -y_i)$. Examples:

$\operatorname{argmin}_{Xw = y} \|w\|_2$,
$\operatorname{argmin}_{Xw = y} \|w\|_1$,
$\operatorname{argmin}_{Xw = y} \psi(w)$,

with each $\psi$ convex,
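For instance, the min-norm interpolator is natural in this sense: flipping $(x_i, y_i)$ to $(-x_i, -y_i)$ leaves every constraint $\langle w, x_i \rangle = y_i$ unchanged, so the solution is identical. A quick check (my own toy sizes):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 50
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

def min_norm_interpolator(X, y):
    # argmin ||w||_2 subject to Xw = y (here n < d, so interpolators exist)
    return X.T @ np.linalg.solve(X @ X.T, y)

w = min_norm_interpolator(X, y)

# Flip the sign of one training pair: (x_i, y_i) -> (-x_i, -y_i).
# The constraint <w, x_i> = y_i is unchanged, so the interpolator is too.
i = 7
X2, y2 = X.copy(), y.copy()
X2[i] *= -1
y2[i] *= -1
w2 = min_norm_interpolator(X2, y2)

print(np.allclose(w, w2))  # → True
```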

Proof shows that for most samples $S$,

there's a typical predictor (in the class)

that's good on most inputs ($L \approx \sigma^2$),

but very bad on *specifically* $S$ ($\hat L \approx 4\sigma^2$)

- Convergence of surrogates [Negrea/Dziugaite/Roy, ICML 2020]?
- Nice, but not really the same thing…

- Only do analyses based on e.g. the exact closed form of $\hat w$?
- We'd like to keep good things about uniform convergence:
- Apply to more than just one specific predictor
- Tell us more about “why” things generalize
- Easier to apply without a nice closed form

- Or…

We don't really care about predictors with small $L$, big $\hat L$….

Could we bound $\sup_w \big(L(w) - \hat L(w)\big)$ instead of $\sup_w \big|L(w) - \hat L(w)\big|$?

- Existing uniform convergence proofs are “really” about $\sup_w \big|L(w) - \hat L(w)\big|$ [Nagarajan/Kolter, NeurIPS 2019]
- Strongly expect the failure still holds for norm balls in our testbed
- even with $\sup_w \big(L(w) - \hat L(w)\big)$ instead of $\sup_w \big|L(w) - \hat L(w)\big|$

- Not possible to show $\sup_w \big(L(w) - \hat L(w)\big)$ is big for *all* classes
- If $\hat w$ is consistent and interpolating, use $\mathcal{F} = \{\hat w\}$: then $\sup_{\mathcal{F}} \big(L - \hat L\big) = L(\hat w) \to \sigma^2$

So far, used $\sup_{w \in \mathcal{F}} \big(L(w) - \hat L(w)\big)$

But *we only care about interpolators*. How about $\sup_{w \in \mathcal{F},\, \hat L(w) = 0} L(w)$?

Is this “uniform convergence”?

It's the standard notion for noiseless ($\sigma = 0$, realizable) analyses…

Used at least since [Vapnik 1982] and [Valiant 1984]

From [Devroye/Györfi/Lugosi 1996]:

What does $\{w : \hat L(w) = 0,\ \|w\| \le B\}$ look like?

$\{w : Xw = y\}$ is the plane of interpolators

Intersection of the $B$-ball with the $Xw = y$ hyperplane:

a $\sqrt{B^2 - \|\hat w\|^2}$-ball centered at the min-norm interpolator $\hat w$
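The geometry here is just Pythagoras: the min-norm interpolator $\hat w$ lies in the row space of $X$, any other interpolator differs from it by an element of $\ker X$, and those two parts are orthogonal. A numerical sketch (arbitrary sizes):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 15, 40
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w_hat = X.T @ np.linalg.solve(X @ X.T, y)    # min-norm interpolator

# Any other interpolator differs from w_hat by an element of ker(X);
# w_hat lies in the row space of X, so the two parts are orthogonal.
_, _, Vt = np.linalg.svd(X)
N = Vt[n:].T                                 # orthonormal basis of ker(X), shape (d, d-n)
v = rng.normal(size=d - n)
w = w_hat + N @ v                            # another interpolator

# Pythagoras: ||w||^2 = ||w_hat||^2 + ||w - w_hat||^2
lhs = np.dot(w, w)
rhs = np.dot(w_hat, w_hat) + np.dot(v, v)
print(abs(lhs - rhs))
```

So interpolators of norm at most $B$ are exactly $\hat w + N v$ with $\|v\| \le \sqrt{B^2 - \|\hat w\|^2}$.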

Applying [Srebro/Sridharan/Tewari 2010]: for all $w$ with $\|w\| \le B$, a high-prob bound on $L(w) - \hat L(w)$ of order $\sqrt{\hat L(w)}\,\mathfrak{R}_n + \mathfrak{R}_n^2$ (up to logs), with $\mathfrak{R}_n = B \sqrt{\operatorname{tr}\Sigma / n}$

if $\hat L(w) = 0$, $\|w\| \le B$: $L(w) \lesssim B^2 \operatorname{tr}\Sigma / n$

If this holds with constant $1$, i.e. $L(w) \le (1 + o(1))\, B^2 \operatorname{tr}\Sigma / n$
(and maybe lower-order terms),

it would explain consistency on junk features,

and predict that norm $B \approx \|\hat w\|$ gives $L \to \sigma^2$

__Theorem:__ If [appropriate conditions on $\Sigma$ and $B$ hold], then with high probability $\sup_{\hat L(w) = 0,\ \|w\| \le B} L(w) \le (1 + o(1))\, \frac{B^2 \operatorname{tr}\Sigma}{n}$

- Confirms the speculation based on the constant-$1$ assumption
- Shows consistency with uniform convergence (of interpolators)
- New result for error of not-quite-minimal-norm interpolators
- Norm is asymptotically consistent
- Norm is at worst

What does $\{w : \hat L(w) = 0,\ \|w\| \le B\}$ look like?

$\{w : Xw = y\}$ is the plane of interpolators

Intersection of the $B$-ball with the $Xw = y$ hyperplane:

a $\sqrt{B^2 - \|\hat w\|^2}$-ball centered at the min-norm interpolator $\hat w$

Can write $\{w : Xw = y\}$ as $\{w_0 + N v : v \in \mathbb{R}^{d-n}\}$

where $w_0$ is *any* interpolator,
$N$ is a basis for $\ker X$

Can change variables in $\sup_{\hat L(w) = 0,\ \|w\| \le B} L(w)$ to an optimization over $v$

Quadratic program, one quadratic constraint: strong duality

Exactly equivalent to problem in *one* scalar variable:

Can analyze this for different choices of …
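As a sketch of the strong-duality step (my own illustration, with generic symbols $A$, $b$, $r$ rather than the paper's quantities): after the change of variables, one maximizes a quadratic in $v$ subject to a single norm constraint, and the dual is a problem in one scalar multiplier $\mu$. Here the dual value matches a primal-feasible value exactly.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 6
M = rng.normal(size=(d, d))
A = (M + M.T) / 2          # symmetric quadratic form (may be indefinite)
b = rng.normal(size=d)
r = 1.0

# Primal: maximize f(v) = v'Av + 2 b'v  subject to ||v|| <= r.
# Dual in a single scalar mu > lambda_max(A):
#   g(mu) = mu r^2 + b' (mu I - A)^{-1} b
lam_max = np.linalg.eigvalsh(A)[-1]

def v_of(mu):
    return np.linalg.solve(mu * np.eye(d) - A, b)

# ||v(mu)|| is decreasing in mu; bisect for ||v(mu)|| = r
lo, hi = lam_max + 1e-9, lam_max + 100.0
while np.linalg.norm(v_of(hi)) > r:
    hi *= 2
for _ in range(200):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if np.linalg.norm(v_of(mid)) > r else (lo, mid)
mu_star = (lo + hi) / 2

v_star = v_of(mu_star)
primal = v_star @ A @ v_star + 2 * b @ v_star
dual = mu_star * r**2 + b @ v_star           # b'(mu I - A)^{-1} b = b'v(mu)
print(primal, dual)                           # strong duality: equal
```

Since `v_star` is primal-feasible and achieves the dual value, both are optimal: the sup over the whole ball is captured by one scalar variable.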

In Gaussian least squares generally, we have $L(w) = \sigma^2 + \|w - w^*\|_\Sigma^2$, so $w$ is consistent iff $\|w - w^*\|_\Sigma \to 0$.

Very useful for lower bounds! [Muthukumar+ JSAIT 2020]
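The identity $L(w) = \sigma^2 + \|w - w^*\|_\Sigma^2$ is easy to confirm by simulation (the choices of $\Sigma$, $w^*$, $w$, and $\sigma$ below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
d, sigma = 4, 0.7
Sigma = np.diag([3.0, 2.0, 1.0, 0.5])        # hypothetical covariance
w_star = np.array([1.0, -0.5, 0.25, 2.0])
w = np.array([0.8, 0.1, -0.3, 1.5])           # an arbitrary predictor

# Monte Carlo estimate of L(w) = E[(<w,x> - y)^2],
# with x ~ N(0, Sigma) and y = <w*, x> + N(0, sigma^2)
m = 1_000_000
x = rng.normal(size=(m, d)) * np.sqrt(np.diag(Sigma))
y = x @ w_star + sigma * rng.normal(size=m)
mc = np.mean((x @ w - y) ** 2)

closed_form = sigma**2 + (w - w_star) @ Sigma @ (w - w_star)
print(mc, closed_form)                        # two nearly equal numbers
```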

Roughly, “how much” of the label energy is “missed” by the predictor

Analyzing the dual,

we get, without **any** distributional assumptions, a bound of the form

(amount of missed energy) × (available norm)

If $\hat w$ is consistent, everything of smaller norm is also consistent iff this term vanishes

In our setting:

$\hat w$ is consistent,

Plugging in:

Analyzing the dual more generally, we get: