Universality of empirical risk minimization
Consider supervised learning from i.i.d. samples {(x_i, y_i)}_{i ≤ n}, where x_i ∈ ℝ^p are feature vectors and y_i ∈ ℝ are labels. We study empirical risk minimization over a class of functions parameterized by k = O(1) vectors θ_1, …, θ_k ∈ ℝ^p, and prove universality results for both the training and the test error. Namely, under the proportional asymptotics n, p → ∞ with n/p = Θ(1), we prove that the training error depends on the distribution of the feature vectors only through its covariance structure. Further, we prove that the minimum test error over near-empirical-risk minimizers enjoys similar universality properties. In particular, the asymptotics of these quantities can be computed, to leading order, under a simpler model in which the feature vectors x_i are replaced by Gaussian vectors g_i with the same covariance. Earlier universality results were limited to strongly convex learning procedures, or to feature vectors x_i with independent entries; our results make neither of these assumptions. Our assumptions are general enough to include feature vectors x_i produced by randomized featurization maps. In particular, we explicitly check the assumptions for certain random features models (computing the output of a one-layer neural network with random weights) and neural tangent models (first-order Taylor approximation of two-layer networks).
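The following is a minimal numerical sketch of the Gaussian-equivalence idea described above, not the paper's setup or proof technique: it uses a hypothetical ReLU random-features map, a linear teacher for the labels, and ridge-regularized squared loss (the simplest ERM instance, whereas the paper's results cover far more general losses and regularizers). The training error is computed once with the true features and once with Gaussian surrogates matched in mean and covariance; all dimensions, the activation, and the label model are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Proportional regime: n/p and p/d are held at fixed ratios (illustrative sizes).
d, p, n = 200, 400, 600

# Hypothetical random-features map: x = relu(W z), W with i.i.d. N(0, 1/d) entries.
W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(p, d))
X = np.maximum(rng.normal(size=(n, d)) @ W.T, 0.0)

# Gaussian surrogate with the same mean and covariance, here estimated
# from a large independent sample as a stand-in for the population quantities.
X_big = np.maximum(rng.normal(size=(20 * p, d)) @ W.T, 0.0)
mu, Sigma = X_big.mean(axis=0), np.cov(X_big, rowvar=False)
G = rng.multivariate_normal(mu, Sigma, size=n)

# Labels generated the same way from either feature set: y = <x, beta0> + noise.
beta0 = rng.normal(size=p) / np.sqrt(p)
y_x = X @ beta0 + 0.1 * rng.normal(size=n)
y_g = G @ beta0 + 0.1 * rng.normal(size=n)

# Ridge-regularized ERM with squared loss; return the training error.
def ridge_train_error(F, y, lam=1e-2):
    theta = np.linalg.solve(F.T @ F + lam * np.eye(p), F.T @ y)
    return np.mean((y - F @ theta) ** 2)

print("train error, random features   :", ridge_train_error(X, y_x))
print("train error, Gaussian surrogate:", ridge_train_error(G, y_g))
```

Under the universality statement, the two printed training errors should agree to leading order as n and p grow proportionally; at these small sizes the agreement is only approximate.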