Universality of empirical risk minimization

02/17/2022
by Andrea Montanari, et al.

Consider supervised learning from i.i.d. samples {(x_i, y_i)}_{i ≤ n}, where x_i ∈ ℝ^p are feature vectors and y_i ∈ ℝ are labels. We study empirical risk minimization over a class of functions that are parameterized by 𝗄 = O(1) vectors θ_1, …, θ_𝗄 ∈ ℝ^p, and prove universality results both for the training and test error. Namely, under the proportional asymptotics n, p → ∞ with n/p = Θ(1), we prove that the training error depends on the feature distribution only through its covariance structure. Further, we prove that the minimum test error over near-empirical risk minimizers enjoys similar universality properties. In particular, the asymptotics of these quantities can be computed, to leading order, under a simpler model in which the feature vectors x_i are replaced by Gaussian vectors g_i with the same covariance. Earlier universality results were limited to strongly convex learning procedures or to feature vectors x_i with independent entries; our results make neither of these assumptions. Our assumptions are general enough to include feature vectors x_i produced by randomized featurization maps. In particular, we explicitly check the assumptions for certain random features models (computing the output of a one-layer neural network with random weights) and neural tangent models (first-order Taylor approximations of two-layer networks).
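The Gaussian equivalence described above can be illustrated numerically. Below is a minimal sketch, not code from the paper: it assumes ReLU random features produced by a one-layer network with frozen random weights, a single parameter vector (𝗄 = 1), ridge-regularized square loss as the empirical risk, and a linear target applied to the features in both models; all dimensions and constants are illustrative.

# Minimal sketch of the Gaussian universality statement (illustrative assumptions only).
import numpy as np

rng = np.random.default_rng(0)
n, d, p, lam = 3000, 400, 600, 0.1            # samples, input dim, feature dim, ridge penalty

# Random features model: output of a one-layer network with random (frozen) weights.
Z = rng.standard_normal((n, d)) / np.sqrt(d)  # raw inputs
W = rng.standard_normal((d, p))               # random first-layer weights
X = np.maximum(Z @ W, 0.0)                    # ReLU features x_i

# Gaussian equivalent: vectors g_i with the same mean and covariance as the x_i.
mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False) + 1e-8 * np.eye(p)   # small jitter for numerical stability
G = rng.multivariate_normal(mu, Sigma, size=n)

def ridge_train_error(F, theta_star, lam, rng):
    """Training error of ridge-regularized least squares with a linear target on F."""
    y = F @ theta_star + 0.5 * rng.standard_normal(F.shape[0])
    theta_hat = np.linalg.solve(F.T @ F / n + lam * np.eye(F.shape[1]), F.T @ y / n)
    return np.mean((y - F @ theta_hat) ** 2)

theta_star = rng.standard_normal(p) / np.sqrt(p)
print("train error, random features   :", ridge_train_error(X, theta_star, lam, rng))
print("train error, Gaussian features :", ridge_train_error(G, theta_star, lam, rng))

With matched covariances, the two printed training errors should agree to leading order as n and p grow proportionally; at the modest sizes used above the match is only approximate.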


research · 06/21/2019 · Limitations of Lazy Training of Two-layers Neural Networks
We study the supervised learning problem under either of the following t...

research · 02/19/2021 · A theory of capacity and sparse neural encoding
Motivated by biological considerations, we study sparse neural maps from...

research · 04/10/2017 · On the Fine-Grained Complexity of Empirical Risk Minimization: Kernel Methods and Neural Networks
Empirical risk minimization (ERM) is ubiquitous in machine learning and ...

research · 06/02/2014 · On Classification with Bags, Groups and Sets
Many classification problems can be difficult to formulate directly in t...

research · 07/21/2023 · What can a Single Attention Layer Learn? A Study Through the Random Features Lens
Attention layers – which map a sequence of inputs to a sequence of outpu...

research · 10/24/2022 · Provably Learning Diverse Features in Multi-View Data with Midpoint Mixup
Mixup is a data augmentation technique that relies on training using ran...

research · 04/13/2017 · Infinite Sparse Structured Factor Analysis
Matrix factorisation methods decompose multivariate observations as line...
