Generalization Bounds for Neural Networks via Approximate Description Length

by Amit Daniely, et al.

We investigate the sample complexity of networks with bounds on the magnitude of their weights. In particular, we consider the class H = {W_t∘ρ∘...∘ρ∘W_1 : W_1,...,W_{t-1} ∈ M_{d,d}, W_t ∈ M_{1,d}}, where the spectral norm of each W_i is bounded by O(1), the Frobenius norm is bounded by R, and ρ is the sigmoid function e^x/(1+e^x) or the smoothened ReLU function ln(1+e^x). We show that for any depth t, if the inputs are in [-1,1]^d, the sample complexity of H is Õ(dR^2/ϵ^2). This bound is optimal up to log factors and substantially improves over the previous state of the art of Õ(d^2R^2/ϵ^2). We furthermore show that this bound remains valid if, instead of the magnitude of the W_i's, we consider the magnitude of W_i - W_i^0, where the W_i^0 are reference matrices with spectral norm O(1). By taking the W_i^0 to be the matrices at the onset of the training process, we obtain sample complexity bounds that are sub-linear in the number of parameters in many typical regimes of parameters. To establish our results, we develop a new technique for analyzing the sample complexity of families H of predictors. We start by defining a new notion of a randomized approximate description of functions f:X→R^d. We then show that if there is a way to approximately describe functions in a class H using d bits, then d/ϵ^2 examples suffice to guarantee uniform convergence, namely, that the empirical loss of every function in the class is ϵ-close to the true loss. Finally, we develop a set of tools for calculating the approximate description length of classes of functions that can be presented as a composition of linear function classes and non-linear functions.
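
To make the objects in the abstract concrete, the following is a minimal Python sketch, not the authors' code. It evaluates one member of the class H and compares the two scaling laws dR^2/ϵ^2 and d^2R^2/ϵ^2 against the parameter count. All function names are hypothetical, and the bound functions drop the constants and log factors hidden by the Õ notation, so they illustrate scaling only and do not give usable sample sizes.

```python
import math


def sigmoid(x: float) -> float:
    """Sigmoid e^x / (1 + e^x), evaluated in a numerically stable way."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def softplus(x: float) -> float:
    """Smoothened ReLU ln(1 + e^x), stable for large |x|."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def forward(weights, x, rho=sigmoid):
    """Evaluate W_t ∘ ρ ∘ ... ∘ ρ ∘ W_1 at x.

    `weights` is [W_1, ..., W_t], each a matrix given as a list of rows.
    The activation ρ is applied after every layer except the last.
    """
    h = list(x)
    for i, W in enumerate(weights):
        h = [sum(w * v for w, v in zip(row, h)) for row in W]
        if i < len(weights) - 1:
            h = [rho(v) for v in h]
    return h


def bound_new(d: int, R: float, eps: float) -> float:
    """Scaling d R^2 / eps^2 (constants and log factors omitted)."""
    return d * R**2 / eps**2


def bound_prev(d: int, R: float, eps: float) -> float:
    """Previous scaling d^2 R^2 / eps^2 (constants and log factors omitted)."""
    return d**2 * R**2 / eps**2


def num_parameters(d: int, t: int) -> int:
    """Parameter count of H: (t - 1) matrices in M_{d,d} plus one in M_{1,d}."""
    return (t - 1) * d * d + d


if __name__ == "__main__":
    d, t, R, eps = 100, 3, 2.0, 0.1  # illustrative values, not from the paper
    print("parameters:     ", num_parameters(d, t))
    print("d R^2 / eps^2:  ", bound_new(d, R, eps))
    print("d^2 R^2 / eps^2:", bound_prev(d, R, eps))
```

With R and ϵ held fixed, the ratio of the two scalings is exactly d, while the parameter count grows like t·d^2. This is the sense in which a bound linear in d is sub-linear in the number of parameters.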




On the Sample Complexity of Two-Layer Networks: Lipschitz vs. Element-Wise Lipschitz Activation

We investigate the sample complexity of bounded two-layer neural network...

Initialization-Dependent Sample Complexity of Linear Predictors and Neural Networks

We provide several new results on the sample complexity of vector-valued...

On Size-Independent Sample Complexity of ReLU Networks

We study the sample complexity of learning ReLU neural networks from the...

The Sample Complexity of One-Hidden-Layer Neural Networks

We study norm-based uniform convergence bounds for neural networks, aimi...

Self-Regularity of Non-Negative Output Weights for Overparameterized Two-Layer Neural Networks

We consider the problem of finding a two-layer neural network with sigmo...

Statistically Meaningful Approximation: a Case Study on Approximating Turing Machines with Transformers

A common lens to theoretically study neural net architectures is to anal...

Sample Complexity Result for Multi-category Classifiers of Bounded Variation

We control the probability of the uniform deviation between empirical an...
