# A Priori Estimates of the Generalization Error for Two-layer Neural Networks

New estimates for the generalization error are established for the two-layer neural network model. These new estimates are a priori in nature, in the sense that the bounds depend only on some norms of the underlying function to be fitted, not on the parameters of the model. In contrast, most existing results for neural networks are a posteriori in nature, in the sense that the bounds depend on some norms of the model parameters. The error rates are comparable to those of the Monte Carlo method for integration problems. Moreover, these bounds are equally effective in the over-parametrized regime, when the network size is much larger than the size of the dataset.

03/06/2019

## 1 Introduction

One of the most important theoretical challenges in machine learning comes from the fact that classical learning theory cannot explain the effectiveness of over-parametrized models in which the number of parameters is much larger than the size of the training set. This is especially the case for neural network models, which have achieved remarkable performance for a wide variety of problems [2, 16, 26]. Understanding the mechanism behind these successes requires developing new analytical tools that can work effectively in the over-parametrized regime [30].

Our work is partly motivated by the situation in classical approximation theory and finite element analysis [8]. There are two kinds of error bounds in finite element analysis, depending on whether the target solution (the ground truth) or the numerical solution enters into the bounds. Let $f^*$ and $\hat{f}_n$ be the true solution and the "numerical solution", respectively. In "a priori" error estimates, only norms of the true solution enter into the bounds, namely

$$\|\hat{f}_n - f^*\|_{1} \le C \|f^*\|_{2}.$$

In "a posteriori" error estimates, the norms of the numerical solution enter into the bounds:

$$\|\hat{f}_n - f^*\|_{1} \le C \|\hat{f}_n\|_{3}.$$

Here $\|\cdot\|_{1}$, $\|\cdot\|_{2}$, $\|\cdot\|_{3}$ denote various norms.

In this language, most recent theoretical efforts [24, 4, 10, 21, 22, 23] on estimating the generalization error of neural networks should be viewed as “a posteriori” analysis, since the bounds depend on various norms of the solutions. Unfortunately, as observed in [1] and [23], the numerical values of these norms are usually quite large for real situations, yielding vacuous bounds.

In this paper we pursue a different line of attack by providing an "a priori" analysis. For this purpose, a suitably regularized two-layer network is considered. It is proved that the generalization error of the regularized solutions is asymptotically sharp, with constants depending only on the properties of the target function. Numerical experiments show that these a priori bounds are non-vacuous [9] for datasets of practical interest, such as MNIST and CIFAR-10. In addition, our experimental results also suggest that such regularization terms are necessary in order for the model to be "well-posed" (see Section 6 for the precise meaning).

### 1.1 Setup

We will focus on the regression problem. Let $f^*:\Omega \to \mathbb{R}$ be the target function, with $\Omega = [-1,1]^d$, and let $S = \{(x_i, y_i)\}_{i=1}^n$ be the training set. Here the $x_i$ are i.i.d. samples drawn from an underlying distribution $\pi$ with $\mathrm{supp}(\pi) \subset \Omega$, and $y_i = f^*(x_i) + \varepsilon_i$, with $\varepsilon_i$ being the noise. Our aim is to recover $f^*$ by fitting the data using a two-layer fully connected neural network with ReLU (rectified linear units) activation:

$$f(x;\theta) = \sum_{k=1}^m a_k\, \sigma(b_k \cdot x + c_k), \qquad (1)$$

where $\sigma(t) = \max(t, 0)$ is the ReLU function, $a_k, c_k \in \mathbb{R}$, $b_k \in \mathbb{R}^d$, and $\theta = \{(a_k, b_k, c_k)\}_{k=1}^m$ represents all the parameters to be learned from the training data; $m$ denotes the network width. To control the complexity of networks, we use the following scale-invariant norm.

###### Definition 1 (Path norm [24]).

For a two-layer ReLU network (1), the path norm is defined as

$$\|\theta\|_{\mathcal{P}} = \sum_{k=1}^m |a_k| \left( \|b_k\|_1 + |c_k| \right).$$
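As a concrete illustration, the path norm and its scale invariance can be checked numerically. The sketch below is our own (function names and array shapes are arbitrary choices, not from the paper); it uses the fact that ReLU is positively homogeneous, so the rescaling $(a_k, b_k, c_k) \mapsto (t\,a_k, b_k/t, c_k/t)$ with $t > 0$ leaves both the network function and the path norm unchanged.

```python
import numpy as np

# Hypothetical two-layer ReLU network f(x; theta) = sum_k a_k * relu(b_k . x + c_k),
# with path norm ||theta||_P = sum_k |a_k| * (||b_k||_1 + |c_k|), as in Definition 1.

def two_layer_relu(x, a, b, c):
    """Evaluate f(x; theta). Shapes: x (d,), a (m,), b (m, d), c (m,)."""
    return np.sum(a * np.maximum(b @ x + c, 0.0))

def path_norm(a, b, c):
    """Scale-invariant path norm of Definition 1."""
    return np.sum(np.abs(a) * (np.abs(b).sum(axis=1) + np.abs(c)))

a = np.array([1.0, -2.0])
b = np.array([[0.5, -1.0], [2.0, 0.0]])
c = np.array([0.1, -0.3])
x = np.array([0.2, -0.7])

# Rescaling by t > 0 changes neither the function (ReLU is positively
# homogeneous) nor the path norm -- this is the scale invariance in the text.
t = 3.0
assert np.isclose(two_layer_relu(x, t * a, b / t, c / t), two_layer_relu(x, a, b, c))
assert np.isclose(path_norm(t * a, b / t, c / t), path_norm(a, b, c))
```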
###### Definition 2 (Spectral norm).

Given $f \in L^2(\Omega)$, denote by $F \in L^2(\mathbb{R}^d)$ an extension of $f$ to $\mathbb{R}^d$, and let $\hat{F}$ be the Fourier transform of $F$. We define the spectral norm of $f$ by

$$\gamma(f) = \inf_{F \in L^2(\mathbb{R}^d),\, F|_{\Omega} = f|_{\Omega}} \int_{\mathbb{R}^d} \|\omega\|_1^2\, |\hat{F}(\omega)|\, d\omega. \qquad (2)$$

We also define $\hat{\gamma}(f) = \max\{\gamma(f), 1\}$.

###### Assumption 1.

Following [6, 14], we consider target functions that are bounded and have finite spectral norm:

$$\mathcal{F} := \left\{ f : \Omega \to \mathbb{R} \;\middle|\; \gamma(f) < \infty,\ \|f\|_\infty \le 1 \right\}. \qquad (3)$$

We assume that $f^* \in \mathcal{F}$.

As a consequence of the assumption that $\|f^*\|_\infty \le 1$, we can truncate the network output to $[-1,1]$. By an abuse of notation, in the following we still use $f(x;\theta)$ to denote the truncated network. The ultimate goal is to minimize the generalization error (expected risk)

$$L(\theta) = \mathbb{E}_{x,y}\left[ \ell(f(x;\theta), y) \right].$$

In practice, we only have at our disposal the empirical risk

$$\hat{L}_n(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(f(x_i;\theta), y_i).$$

The generalization gap is defined as the difference between the expected and empirical risks. Here the loss function $\ell$ is taken to be the square loss, unless specified otherwise.

### 1.2 Our Results

We propose a regularized estimator (defined in Section 3) and prove a priori estimates for its generalization error, shown in Table 1. As a comparison, we also list the result of [14], which analyzed a similar problem. It is worth mentioning that their result requires the network width $m$ to be of a specific order relative to the sample size, whereas we allow arbitrary network width. See Theorems 8 and 9 for more details about the results.

## 2 Preliminary Results

In this section, we summarize some results on the approximation error and generalization bound for two-layer ReLU networks. These results are required by our subsequent a priori analysis.

### 2.1 Approximation Properties

Most of the content in this part is adapted from [3, 6, 14].

###### Proposition 1.

For any $f \in \mathcal{F}$, one has the integral representation

$$f(x) - f(0) - x \cdot \nabla f(0) = v \int_{\{-1,1\} \times [0,1] \times \mathbb{R}^d} h(x; z, t, \omega)\, dp(z, t, \omega),$$

where

$$\begin{aligned} p(z, t, \omega) &= |\hat{f}(\omega)|\, \|\omega\|_1^2\, |\cos(\|\omega\|_1 t - z b(\omega))| / v, \\ s(z, t, \omega) &= -\operatorname{sign}\left(\cos(\|\omega\|_1 t - z b(\omega))\right), \\ h(x; z, t, \omega) &= s(z, t, \omega)\, \left( z\, x \cdot \omega / \|\omega\|_1 - t \right)_+, \end{aligned}$$

and $v$ is the normalization constant such that $\int dp = 1$, which satisfies $v \le 2\gamma(f)$.

###### Proof.

By an abuse of notation, let $f$ denote its own extension to $\mathbb{R}^d$. Since $\gamma(f) < \infty$, the Fourier inversion formula gives that $f(x) - x \cdot \nabla f(0) - f(0)$ can be written as

$$\int_{\mathbb{R}^d} \left( e^{i\omega \cdot x} - i\omega \cdot x - 1 \right) \hat{f}(\omega)\, d\omega. \qquad (4)$$

Note that the identity

$$-\int_0^c \left[ (z - s)_+ e^{is} + (-z - s)_+ e^{-is} \right] ds = e^{iz} - iz - 1$$

holds when $|z| \le c$. Choosing $z = \omega \cdot x$ and $c = \|\omega\|_1$, we have

$$|z| \le \|\omega\|_1 \|x\|_\infty \le c.$$

Substituting $s = \|\omega\|_1 t$ and letting $\hat{\omega} = \omega / \|\omega\|_1$, we have

$$-\|\omega\|_1^2 \int_0^1 \left[ (\hat{\omega} \cdot x - t)_+ e^{i\|\omega\|_1 t} + (-\hat{\omega} \cdot x - t)_+ e^{-i\|\omega\|_1 t} \right] dt = e^{i\omega \cdot x} - i\omega \cdot x - 1. \qquad (5)$$

Writing $\hat{f}(\omega) = |\hat{f}(\omega)| e^{i b(\omega)}$, inserting (5) into (4) and taking the real part yields

$$f(x) - x \cdot \nabla f(0) - f(0) = \int_{\mathbb{R}^d} \int_0^1 g(t, \omega)\, dt\, d\omega,$$

where

$$g(t, \omega) = -\|\omega\|_1^2\, |\hat{f}(\omega)| \left[ (\hat{\omega} \cdot x - t)_+ \cos(\|\omega\|_1 t + b(\omega)) + (-\hat{\omega} \cdot x - t)_+ \cos(\|\omega\|_1 t - b(\omega)) \right].$$

Consider a density on $\{-1,1\} \times [0,1] \times \mathbb{R}^d$ defined by

$$p(z, t, \omega) = |\hat{f}(\omega)|\, \|\omega\|_1^2\, |\cos(\|\omega\|_1 t - z b(\omega))| / v, \qquad (6)$$

where the normalization constant $v$ is given by

$$v = \int_{\mathbb{R}^d} \int_0^1 |\hat{f}(\omega)|\, \|\omega\|_1^2 \left( |\cos(\|\omega\|_1 t + b(\omega))| + |\cos(\|\omega\|_1 t - b(\omega))| \right) dt\, d\omega.$$

Since $|\cos(\cdot)| \le 1$ and $\gamma(f) < \infty$, we have

$$v \le 2\gamma(f) < +\infty, \qquad (7)$$

so the density is well-defined. To simplify the notation, we let

$$\begin{aligned} s(z, t, \omega) &= -\operatorname{sign}\left(\cos(\|\omega\|_1 t - z b(\omega))\right), \\ h(x; z, t, \omega) &= s(z, t, \omega)\, (z \hat{\omega} \cdot x - t)_+. \end{aligned}$$

Then we have

$$f(x) - x \cdot \nabla f(0) - f(0) = v \int_{\{-1,1\} \times [0,1] \times \mathbb{R}^d} h(x; z, t, \omega)\, dp(z, t, \omega).$$

Since $x \cdot \nabla f(0) = (x \cdot \nabla f(0))_+ - (-x \cdot \nabla f(0))_+$, we obtain

$$f(x) = f(0) + (x \cdot \nabla f(0))_+ - (-x \cdot \nabla f(0))_+ + v \int_{\{-1,1\} \times [0,1] \times \mathbb{R}^d} h(x; z, t, \omega)\, dp(z, t, \omega). \qquad \blacksquare$$

For simplicity, in the rest of this paper we assume $f(0) = 0$ and $\nabla f(0) = 0$. We take samples $T_m = \{(z_k, t_k, \omega_k)\}_{k=1}^m$ randomly drawn from $p(z, t, \omega)$, and consider the empirical average $\hat{f}_m(x) = \frac{v}{m} \sum_{k=1}^m h(x; z_k, t_k, \omega_k)$, which is exactly a two-layer ReLU network of width $m$. The central limit theorem (CLT) tells us that the approximation error is roughly

$$\mathbb{E}_{(z,t,\omega)}[h(x; z, t, \omega)] - \frac{1}{m} \sum_{k=1}^m h(x; z_k, t_k, \omega_k) \approx \sqrt{\frac{\operatorname{Var}_{(z,t,\omega)}[h(x; z, t, \omega)]}{m}}.$$

So as long as we can bound the variance at the right-hand side, we will have an estimate of the approximation error. The following result formalizes this intuition.

###### Theorem 2.

For any distribution $\pi$ with $\mathrm{supp}(\pi) \subset \Omega$ and any $m \ge 1$, there exists a two-layer network $f(x;\tilde{\theta})$ of width $m$ such that

$$\mathbb{E}_{x \sim \pi} \left| f(x) - f(x; \tilde{\theta}) \right|^2 \le \frac{16\gamma^2(f)}{m}.$$

Furthermore, $\|\tilde{\theta}\|_{\mathcal{P}} \le 4\gamma(f)$, i.e. the path norm of $\tilde{\theta}$ can be bounded by the spectral norm of the target function.

###### Proof.

Let $\hat{f}_m$ be the Monte Carlo estimator defined above. We have

$$\begin{aligned} \mathbb{E}_{T_m} \mathbb{E}_x \left| f(x) - \hat{f}_m(x) \right|^2 &= \mathbb{E}_x \mathbb{E}_{T_m} \left| f(x) - \hat{f}_m(x) \right|^2 \\ &= \frac{1}{m} \mathbb{E}_x \left( v^2\, \mathbb{E}_{(z,t,\omega)}[h^2(x; z, t, \omega)] - f^2(x) \right) \\ &\le \frac{v^2}{m}\, \mathbb{E}_x \mathbb{E}_{(z,t,\omega)}[h^2(x; z, t, \omega)]. \end{aligned}$$

Furthermore, for any fixed $x \in \Omega$, the variance can be upper bounded since

$$\mathbb{E}_{(z,t,\omega)}[h^2(x; z, t, \omega)] \le \mathbb{E}_{(z,t,\omega)}\left[ (z \hat{\omega} \cdot x - t)_+^2 \right] \le \mathbb{E}_{(z,t,\omega)}\left[ (|\hat{\omega} \cdot x| + t)^2 \right] \le 4.$$

Hence we have

$$\mathbb{E}_{T_m} \mathbb{E}_x \left| f(x) - \hat{f}_m(x) \right|^2 \le \frac{4v^2}{m} \le \frac{16\gamma^2(f)}{m}.$$

Therefore there must exist a set of samples $T_m$ such that the corresponding empirical average $\hat{f}_m$ satisfies

$$\mathbb{E}_x \left| f(x) - \hat{f}_m(x) \right|^2 \le \frac{16\gamma^2(f)}{m}.$$

Due to the special structure of the Monte Carlo estimator, we have $\|\tilde{\theta}\|_{\mathcal{P}} \le 2v$: each unit has $|a_k| = v/m$, $\|b_k\|_1 = \|z_k \hat{\omega}_k\|_1 = 1$ and $|c_k| = t_k \le 1$. It follows from Equation (7) that $\|\tilde{\theta}\|_{\mathcal{P}} \le 4\gamma(f)$. ∎
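The $O(1/m)$ rate in Theorem 2 is exactly the Monte Carlo rate for numerical integration. As a quick numerical sanity check of that rate, the snippet below uses a toy ReLU-style integrand of our own choosing as a stand-in for $h$ (it is not the paper's construction): the mean-squared error of an $m$-sample average should shrink roughly like $\mathrm{Var}/m$.

```python
import numpy as np

# Toy check of the Monte Carlo rate behind Theorem 2: the MSE of an m-sample
# average of a bounded integrand decays like Var/m. The integrand g is an
# arbitrary ReLU-style stand-in for h(x; z, t, omega).
rng = np.random.default_rng(0)

def mc_mse(m, trials=2000):
    """Average squared error of the m-sample Monte Carlo estimate of E[g(U)]."""
    u = rng.random((trials, m))               # U ~ Unif[0, 1]
    g = np.maximum(u - 0.5, 0.0)              # E[g(U)] = 1/8
    est = g.mean(axis=1)
    return np.mean((est - 0.125) ** 2)

# Growing m by 4x should shrink the MSE by roughly 4x.
ratio = mc_mse(100) / mc_mse(400)
assert 2.0 < ratio < 8.0
```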

### 2.2 Estimating the Generalization Gap

Let $\mathcal{H}$ be a hypothesis space, i.e. a set of functions. The Rademacher complexity of $\mathcal{H}$ with respect to samples $S = \{z_1, \dots, z_n\}$ is defined as

$$\hat{R}(\mathcal{H}) = \frac{1}{n}\, \mathbb{E}_\xi \left[ \sup_{h \in \mathcal{H}} \sum_{i=1}^n h(z_i)\, \xi_i \right],$$

where the $\xi_i$ are independent random variables with $\mathbb{P}(\xi_i = +1) = \mathbb{P}(\xi_i = -1) = 1/2$.
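For a small finite hypothesis class, the empirical Rademacher complexity can be estimated directly by sampling the sign variables $\xi_i$. The sketch below is our own construction for illustration (the matrix encoding of $\mathcal{H}$ is an arbitrary choice); it exhibits two standard sanity checks: a singleton class has complexity close to zero since $\mathbb{E}[\xi_i] = 0$, while a sign-symmetric class does not, because the supremum becomes an absolute value.

```python
import numpy as np

# Monte Carlo estimate of the empirical Rademacher complexity of a finite class.
# H is a matrix whose rows are the evaluation vectors (h(z_1), ..., h(z_n)).
rng = np.random.default_rng(1)

def empirical_rademacher(H, num_xi=20000):
    """Estimate (1/n) E_xi [ sup_h sum_i h(z_i) xi_i ] by sampling xi."""
    n = H.shape[1]
    xi = rng.choice([-1.0, 1.0], size=(num_xi, n))
    sup_vals = (xi @ H.T).max(axis=1)   # sup over rows h of sum_i h(z_i) xi_i
    return sup_vals.mean() / n

h = np.array([[1.0, 1.0, 1.0, 1.0]])
# Singleton class: complexity ~ 0. Adding the negated predictor makes the
# sup equal to |sum_i xi_i|, whose expectation is strictly positive.
assert abs(empirical_rademacher(h)) < 0.05
assert empirical_rademacher(np.vstack([h, -h])) > 0.2
```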

The generalization gap can be estimated via the Rademacher complexity by the following theorem (see [5, 25]).

###### Theorem 3.

Fix a hypothesis space $\mathcal{H}$. Assume that for any $h \in \mathcal{H}$ and any $z$, $|h(z)| \le c$. Then for any $\delta \in (0,1)$, with probability at least $1 - \delta$ over the choice of $S = \{z_1, \dots, z_n\}$, we have

$$\sup_{h \in \mathcal{H}} \left| \frac{1}{n} \sum_{i=1}^n h(z_i) - \mathbb{E}_z[h(z)] \right| \le 2\, \mathbb{E}_S\left[ \hat{R}(\mathcal{H}) \right] + c \sqrt{\frac{2\log(2/\delta)}{n}}.$$

Before providing the upper bound for the Rademacher complexity of two-layer networks, we first need the following two lemmas.

###### Lemma 1 (Lemma 26.11 of [25]).

Let $x_1, \dots, x_n$ be vectors in $\mathbb{R}^d$, and let $\mathcal{H}_1 = \{x \mapsto u \cdot x : \|u\|_1 \le 1\}$. Then the Rademacher complexity of $\mathcal{H}_1$ has the following upper bound:

$$\hat{R}(\mathcal{H}_1) \le \max_i \|x_i\|_\infty \sqrt{\frac{2\log(2d)}{n}}.$$

The above lemma characterizes the Rademacher complexity of linear predictors with $\ell_1$ norm bounded by $1$. To handle the influence of the nonlinear activation function, we need the following contraction lemma.

###### Lemma 2 (Lemma 26.9 of [25]).

Let $\phi$ be a $\rho$-Lipschitz function, i.e. $|\phi(a) - \phi(b)| \le \rho |a - b|$ for all $a, b$. For any hypothesis space $\mathcal{H}$, let $\phi \circ \mathcal{H} = \{\phi \circ h : h \in \mathcal{H}\}$. Then we have

$$\hat{R}(\phi \circ \mathcal{H}) \le \rho\, \hat{R}(\mathcal{H}).$$

We are now ready to characterize the Rademacher complexity of two-layer networks. Specifically, the path norm is used to control the complexity.

###### Lemma 3.

Let $\mathcal{F}_Q = \{f(x;\theta) : \|\theta\|_{\mathcal{P}} \le Q\}$ be the set of two-layer networks with path norm bounded by $Q$. Then we have

$$\hat{R}(\mathcal{F}_Q) \le 2Q \sqrt{\frac{2\log(2d)}{n}}.$$
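Lemma 3 can be probed numerically: sample random two-layer networks rescaled to have path norm exactly $Q$, and compare the sampled Rademacher supremum with the bound $2Q\sqrt{2\log(2d)/n}$. Since we only search over finitely many networks and sign vectors, the sampled quantity underestimates the true complexity, so this is a consistency check rather than a verification; all sizes below are arbitrary choices of ours.

```python
import numpy as np

# Sampled lower bound on the Rademacher sup for path-norm-ball networks,
# compared against the Lemma 3 upper bound 2*Q*sqrt(2*log(2d)/n).
rng = np.random.default_rng(3)
n, d, m, Q = 50, 5, 8, 1.0
X = rng.uniform(-1.0, 1.0, (n, d))

def sampled_complexity(num_xi=200, num_nets=200):
    best = 0.0
    for _ in range(num_nets):
        a = rng.normal(size=m)
        b = rng.normal(size=(m, d))
        c = rng.normal(size=m)
        # Rescale the outer weights so that ||theta||_P is exactly Q.
        a *= Q / np.sum(np.abs(a) * (np.abs(b).sum(axis=1) + np.abs(c)))
        preds = np.maximum(X @ b.T + c, 0.0) @ a        # (n,)
        xi = rng.choice([-1.0, 1.0], (num_xi, n))
        best = max(best, np.abs(xi @ preds).max() / n)
    return best

bound = 2 * Q * np.sqrt(2 * np.log(2 * d) / n)
assert sampled_complexity() <= bound
```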
###### Proof.

To simplify the proof, we assume $c_k = 0$; otherwise we can absorb $c_k$ into $b_k$ by defining $\tilde{x} = (x^T, 1)^T$ and $\tilde{b}_k = (b_k^T, c_k)^T$. Writing $\hat{b}_k = b_k / \|b_k\|_1$, we have

$$\begin{aligned} n \hat{R}(\mathcal{F}_Q) &= \mathbb{E}_\xi \left[ \sup_{\|\theta\|_{\mathcal{P}} \le Q} \sum_{i=1}^n \xi_i \sum_{k=1}^m a_k \|b_k\|_1\, \sigma(\hat{b}_k^T x_i) \right] \\ &\le \mathbb{E}_\xi \left[ \sup_{\|\theta\|_{\mathcal{P}} \le Q,\, \|u_k\|_1 = 1} \sum_{i=1}^n \xi_i \sum_{k=1}^m a_k \|b_k\|_1\, \sigma(u_k^T x_i) \right] \\ &= \mathbb{E}_\xi \left[ \sup_{\|\theta\|_{\mathcal{P}} \le Q,\, \|u_k\|_1 = 1} \sum_{k=1}^m a_k \|b_k\|_1 \sum_{i=1}^n \xi_i\, \sigma(u_k^T x_i) \right] \\ &\le \mathbb{E}_\xi \left[ \sup_{\|\theta\|_{\mathcal{P}} \le Q} \sum_{k=1}^m \left| a_k \|b_k\|_1 \right| \sup_{\|u\|_1 = 1} \left| \sum_{i=1}^n \xi_i\, \sigma(u^T x_i) \right| \right] \\ &\le Q\, \mathbb{E}_\xi \left[ \sup_{\|u\|_1 = 1} \left| \sum_{i=1}^n \xi_i\, \sigma(u^T x_i) \right| \right] \le Q\, \mathbb{E}_\xi \left[ \sup_{\|u\|_1 \le 1} \left| \sum_{i=1}^n \xi_i\, \sigma(u^T x_i) \right| \right]. \end{aligned}$$

By symmetry, we have

$$\mathbb{E}_\xi \left[ \sup_{\|u\|_1 \le 1} \left| \sum_{i=1}^n \xi_i\, \sigma(u^T x_i) \right| \right] \le \mathbb{E}_\xi \left[ \sup_{\|u\|_1 \le 1} \sum_{i=1}^n \xi_i\, \sigma(u^T x_i) + \sup_{\|u\|_1 \le 1} \sum_{i=1}^n (-\xi_i)\, \sigma(u^T x_i) \right] = 2\, \mathbb{E}_\xi \left[ \sup_{\|u\|_1 \le 1} \sum_{i=1}^n \xi_i\, \sigma(u^T x_i) \right].$$

Since $\sigma$ is Lipschitz continuous with Lipschitz constant $1$, applying Lemma 2 and Lemma 1 gives

$$\hat{R}(\mathcal{F}_Q) \le 2Q \sqrt{\frac{2\log(2d)}{n}}.$$

###### Proposition 4.

Assume that the loss function $\ell(\cdot, y)$ is $\rho$-Lipschitz continuous and bounded by $B$. Then with probability at least $1 - \delta$ we have

$$\sup_{\|\theta\|_{\mathcal{P}} \le Q} \left| L(\theta) - \hat{L}_n(\theta) \right| \le 4\rho Q \sqrt{\frac{2\log(2d)}{n}} + B \sqrt{\frac{2\log(2/\delta)}{n}}. \qquad (8)$$
###### Proof.

Define $\mathcal{H}_Q = \{\ell(f(x;\theta), y) : \|\theta\|_{\mathcal{P}} \le Q\}$. Then $\hat{R}(\mathcal{H}_Q) \le 2\rho Q \sqrt{2\log(2d)/n}$, which follows from Lemmas 2 and 3. Directly applying Theorem 3 then yields the result. ∎

###### Theorem 5 (A posteriori generalization bound).

Assume that the loss function $\ell(\cdot, y)$ is $\rho$-Lipschitz continuous and bounded by $B$. Then for any $\delta \in (0,1)$, with probability at least $1 - \delta$ over the choice of the training set $S$, we have, for any two-layer network $f(x;\theta)$,

$$\left| L(\theta) - \hat{L}_n(\theta) \right| \le 4\rho \left( \|\theta\|_{\mathcal{P}} + 1 \right) \sqrt{\frac{2\log(2d)}{n}} + B \sqrt{\frac{2\log\left( 2c(1 + \|\theta\|_{\mathcal{P}})^2/\delta \right)}{n}}, \qquad (9)$$

where $c = \sum_{l=1}^\infty l^{-2} = \pi^2/6$.

We can see that the generalization gap is roughly bounded by $\|\theta\|_{\mathcal{P}} / \sqrt{n}$ up to some logarithmic terms.

###### Proof.

Consider the decomposition $\Theta = \bigcup_{l=1}^\infty \Theta_l$, where $\Theta_l = \{\theta : \|\theta\|_{\mathcal{P}} \le l\}$. Let $\delta_l = \delta / (c\, l^2)$, where $c = \sum_{l=1}^\infty l^{-2}$. According to Proposition 4, if we fix $l$ in advance, then with probability at least $1 - \delta_l$ over the choice of $S$,

$$\sup_{\|\theta\|_{\mathcal{P}} \le l} \left| L(\theta) - \hat{L}_n(\theta) \right| \le 4\rho l \sqrt{\frac{2\log(2d)}{n}} + B \sqrt{\frac{2\log(2/\delta_l)}{n}}.$$

So the probability that there exists at least one $l$ for which this inequality fails is at most $\sum_{l=1}^\infty \delta_l = \delta$. In other words, with probability at least $1 - \delta$, the inequality holds for all $l$.

Given an arbitrary set of parameters $\theta$, let $l_0 = \min\{l \in \mathbb{N} : \|\theta\|_{\mathcal{P}} \le l\}$; then $l_0 \le \|\theta\|_{\mathcal{P}} + 1$. The inequality above with $l = l_0$ implies that

$$\begin{aligned} \left| L(\theta) - \hat{L}_n(\theta) \right| &\le 4\rho l_0 \sqrt{\frac{2\log(2d)}{n}} + B \sqrt{\frac{2\log(2c\, l_0^2/\delta)}{n}} \\ &\le 4\rho \left( \|\theta\|_{\mathcal{P}} + 1 \right) \sqrt{\frac{2\log(2d)}{n}} + B \sqrt{\frac{2\log\left( 2c(1 + \|\theta\|_{\mathcal{P}})^2/\delta \right)}{n}}. \qquad \blacksquare \end{aligned}$$
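The union-bound bookkeeping in the proof relies on the failure probabilities $\delta_l = \delta/(c\,l^2)$ summing to exactly $\delta$ when $c = \sum_{l \ge 1} l^{-2} = \pi^2/6$ (the value of $c$ here follows our reconstruction of the proof). A quick numerical check:

```python
import math

# The per-level failure probabilities delta_l = delta / (c * l^2), with
# c = pi^2 / 6, sum to delta: the union bound spends exactly delta in total.
delta = 0.05
c = math.pi ** 2 / 6
total = sum(delta / (c * l ** 2) for l in range(1, 200_000))
assert abs(total - delta) < 1e-4   # finite truncation of the series
```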

## 3 A Priori Estimates

For simplicity, we first consider the noiseless case, i.e. $\varepsilon_i = 0$. The noisy case is dealt with in the next section.

We see that the path norm of the special solution which achieves the optimal approximation error is independent of the network size, and this norm can also be used to bound the generalization gap (Theorem 5). Therefore, if the path norm is suitably penalized during training, we should be able to control the generalization gap without harming the approximation accuracy. One possible implementation of this idea is through the structural empirical risk minimization [29, 25].

###### Definition 4 (Path-norm regularized estimator).

Define the regularized risk by

$$J_\lambda(\theta) := \hat{L}_n(\theta) + \lambda \sqrt{\frac{2\log(2d)}{n}} \left( 1 + \|\theta\|_{\mathcal{P}} \right), \qquad (10)$$

where $\lambda$ is a positive constant. The path-norm regularized estimator is defined as

$$\hat{\theta}_n \in \operatorname*{argmin}_\theta J_\lambda(\theta). \qquad (11)$$

It is worth noting that the minimizer is not necessarily unique, and $\hat{\theta}_n$ should be understood as any of the minimizers. We also assume $\lambda \ge 4$. In the following, we provide a detailed analysis of the generalization error of the regularized estimator.
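To make Definition 4 concrete, here is a minimal sketch of minimizing $J_\lambda$ by subgradient descent on a 1-d toy problem. Everything below (width, step size, the value of $\lambda$, the target $|x|$) is an arbitrary choice of ours for illustration; the paper does not prescribe a particular training algorithm.

```python
import numpy as np

# Subgradient descent on J_lambda(theta) = empirical risk + reg * (1 + path norm)
# for a width-m two-layer ReLU network on toy 1-d data.
rng = np.random.default_rng(2)
n, d, m, lam, lr, steps = 64, 1, 32, 0.1, 0.02, 300
X = rng.uniform(-1.0, 1.0, size=(n, d))
y = np.abs(X[:, 0])                      # toy target f*(x) = |x|

a = rng.normal(0.0, 0.5, m)
b = rng.normal(0.0, 0.5, (m, d))
c = np.zeros(m)
reg = lam * np.sqrt(2.0 * np.log(2 * d) / n)

def J():
    pred = np.maximum(X @ b.T + c, 0.0) @ a
    path = np.sum(np.abs(a) * (np.abs(b).sum(axis=1) + np.abs(c)))
    return 0.5 * np.mean((pred - y) ** 2) + reg * (1.0 + path)

J0 = J()
for _ in range(steps):
    z = X @ b.T + c                      # pre-activations, shape (n, m)
    h = np.maximum(z, 0.0)
    err = (h @ a - y) / n                # gradient of the risk w.r.t. predictions
    mask = (z > 0.0) * a                 # backprop through ReLU, shape (n, m)
    ga = h.T @ err + reg * np.sign(a) * (np.abs(b).sum(axis=1) + np.abs(c))
    gb = (X.T @ (err[:, None] * mask)).T + reg * np.abs(a)[:, None] * np.sign(b)
    gc = err @ mask + reg * np.abs(a) * np.sign(c)
    a -= lr * ga
    b -= lr * gb
    c -= lr * gc

assert J() < J0                          # the regularized risk decreased
```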

Since the path norm of the network $\tilde{\theta}$ constructed in Theorem 2 is bounded, we have the following estimate of its regularized risk.

###### Proposition 6.

Let $\tilde{\theta}$ be the network constructed in Theorem 2 with $f = f^*$. Then with probability at least $1 - \delta$, we have

$$J_\lambda(\tilde{\theta}) \le L(\tilde{\theta}) + \frac{1}{\sqrt{n}} \left( \hat{\gamma}(f^*) \left( 4 + 5(\lambda + 4) \sqrt{2\log(2d)} \right) + \sqrt{2\log(2c/\delta)} \right). \qquad (12)$$
###### Proof.

According to Definition 4 and Theorem 5, the regularized cost of $\tilde{\theta}$ satisfies

$$\begin{aligned} J_\lambda(\tilde{\theta}) &= \hat{L}_n(\tilde{\theta}) + \lambda \sqrt{\frac{2\log(2d)}{n}} \left( \|\tilde{\theta}\|_{\mathcal{P}} + 1 \right) \\ &\overset{(1)}{\le} L(\tilde{\theta}) + (4 + \lambda) \sqrt{\frac{2\log(2d)}{n}} \left( \|\tilde{\theta}\|_{\mathcal{P}} + 1 \right) + \sqrt{\frac{2\log\left( 2c(1 + \|\tilde{\theta}\|_{\mathcal{P}})^2/\delta \right)}{n}} \\ &\le L(\tilde{\theta}) + (4 + \lambda) \sqrt{\frac{2\log(2d)}{n}} \left( 4\gamma(f^*) + 1 \right) + \sqrt{\frac{2\log\left( 2c(1 + 4\gamma(f^*))^2/\delta \right)}{n}}, \end{aligned} \qquad (13)$$

where $(1)$ follows from the generalization bound in Theorem 5 and the last step from the fact that $\|\tilde{\theta}\|_{\mathcal{P}} \le 4\gamma(f^*)$. The last term can be simplified by using $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ and $\log(1 + t) \le t$ for $t \ge 0$:

$$\sqrt{2\log\left( 2c(1 + 4\gamma(f^*))^2/\delta \right)} \le \sqrt{2\log(2c/\delta)} + \sqrt{4\log(1 + 4\gamma(f^*))} \le \sqrt{2\log(2c/\delta)} + 4\sqrt{\gamma(f^*)} \le \sqrt{2\log(2c/\delta)} + 4\hat{\gamma}(f^*).$$

Plugging this into Equation (13) and using $4\gamma(f^*) + 1 \le 5\hat{\gamma}(f^*)$, we obtain

$$J_\lambda(\tilde{\theta}) \le L(\tilde{\theta}) + \frac{1}{\sqrt{n}} \left( \hat{\gamma}(f^*) \left( 4 + 5(\lambda + 4) \sqrt{2\log(2d)} \right) + \sqrt{2\log(2c/\delta)} \right). \qquad \blacksquare$$

###### Proposition 7 (Properties of the regularized estimator).

The path-norm regularized estimator satisfies:

$$\begin{aligned} J_\lambda(\hat{\theta}_n) &\le J_\lambda(\tilde{\theta}), \\ \frac{\|\hat{\theta}_n\|_{\mathcal{P}}}{\sqrt{n}} &\le \lambda^{-1} L(\tilde{\theta}) + n^{-1/2} \left( \hat{\gamma}(f^*) \left( 5 + 24\lambda^{-1} \right) + \lambda^{-1} \sqrt{2\log(2c/\delta)} \right). \end{aligned}$$
###### Proof.

The first claim follows from the definition of $\hat{\theta}_n$ as a minimizer of $J_\lambda$. For the second claim, we have $\lambda \sqrt{2\log(2d)/n}\, \|\hat{\theta}_n\|_{\mathcal{P}} \le J_\lambda(\hat{\theta}_n)$, so $\|\hat{\theta}_n\|_{\mathcal{P}} / \sqrt{n} \le J_\lambda(\tilde{\theta}) / (\lambda \sqrt{2\log(2d)})$. By using Proposition 6 and $\sqrt{2\log(2d)} \ge 1$, we have

$$\begin{aligned} \frac{\|\hat{\theta}_n\|_{\mathcal{P}}}{\sqrt{n}} &\le \frac{\lambda^{-1}}{\sqrt{2\log(2d)}} L(\tilde{\theta}) + \frac{\lambda^{-1}}{\sqrt{n}} \left( \hat{\gamma}(f^*) \left( \frac{4}{\sqrt{2\log(2d)}} + 5(\lambda + 4) \right) + \sqrt{\frac{2\log(2c/\delta)}{2\log(2d)}} \right) \\ &\le \lambda^{-1} L(\tilde{\theta}) + n^{-1/2} \left( \hat{\gamma}(f^*) \left( 5 + 24\lambda^{-1} \right) + \lambda^{-1} \sqrt{2\log(2c/\delta)} \right). \qquad \blacksquare \end{aligned}$$

###### Remark 1.

The above proposition establishes the connection between the regularized solution and the special solution constructed in Theorem 2. In particular, since $L(\tilde{\theta}) \le 16\gamma^2(f^*)/m$, the upper bound on $\|\hat{\theta}_n\|_{\mathcal{P}}/\sqrt{n}$, and hence on the generalization gap of the regularized solution, remains of order $\lambda^{-1} \gamma^2(f^*)/m$ as $n \to \infty$. This suggests that the regularization term is added appropriately: it forces the generalization gap to be roughly of the same order as the approximation error.

###### Theorem 8 (Main result, noiseless case).

Under Assumption 1, there exists an absolute constant $C$ such that for any $\delta \in (0,1)$ and $\lambda \ge 4$, with probability at least $1 - \delta$ over the choice of the training set $S$, the generalization error of estimator (11) satisfies

$$\mathbb{E} \left| f(x; \hat{\theta}_n) - f^*(x) \right|^2 \le \frac{C\gamma^2(f^*)}{m} + \frac{C}{\sqrt{n}} \left( \lambda \sqrt{\log(2d)}\, \hat{\gamma}(f^*) + \frac{\gamma(f^*)}{\sqrt{m\lambda}} + \sqrt{\log(n/\delta)} \right) + \sqrt{\frac{C \left( \hat{\gamma}(f^*) + \log^{1/2}(1/\delta) \right)}{n}}. \qquad (14)$$
###### Remark 2.

It should be noted that the terms on the right-hand side of the above result have a Monte Carlo nature (for numerical integration): the first term matches the $O(1/m)$ approximation rate, and the remaining terms are of order $O(1/\sqrt{n})$. From this viewpoint, the result is quite sharp.

###### Proof.

We first have

$$L(\hat{\theta}_n) \overset{(1)}{\le} \hat{L}_n(\hat{\theta}_n) + 4 \left( \|\hat{\theta}_n\|_{\mathcal{P}} + 1 \right) \sqrt{\frac{2\log(2d)}{n}} + \sqrt{\frac{\log\left( 2c(1 + \|\hat{\theta}_n\|_{\mathcal{P}})^2/\delta \right)}{n}} \overset{(2)}{\le} J_\lambda(\hat{\theta}_n) + \sqrt{\frac{\log\left( 2c(1 + \|\hat{\theta}_n\|_{\mathcal{P}})^2/\delta \right)}{n}},$$

where $(1)$ follows from the a posteriori generalization bound in Theorem 5, and $(2)$ is due to $\lambda \ge 4$. Furthermore,

$$\sqrt{\log\left( 2c(1 + \|\hat{\theta}_n\|_{\mathcal{P}})^2/\delta \right)} \le \sqrt{\log(2nc/\delta)} + \sqrt{2\log\left( 1 + n^{-1/2}\|\hat{\theta}_n\|_{\mathcal{P}} \right)} \le \sqrt{\log(2nc/\delta)} + \sqrt{2 n^{-1/2} \|\hat{\theta}_n\|_{\mathcal{P}}}.$$

By Proposition 7 and the condition $\lambda \ge 4$, we have

$$\frac{\|\hat{\theta}_n\|_{\mathcal{P}}}{\sqrt{n}} \le \lambda^{-1} L(\tilde{\theta}) + \frac{20}{\sqrt{n}} \left( \hat{\gamma}(f^*) + \sqrt{\log(1/\delta)} \right).$$

Thus we obtain

$$\sqrt{\frac{\log\left( 2c(1 + \|\hat{\theta}_n\|_{\mathcal{P}})^2/\delta \right)}{n}} \le \sqrt{\frac{\log(2nc/\delta)}{n}} + \sqrt{\frac{2L(\tilde{\theta})}{\lambda n}} + \sqrt{\frac{40 \left( \hat{\gamma}(f^*) + \log^{1/2}(1/\delta) \right)}{n}}. \qquad (15)$$

On the other hand, for $\lambda \ge 4$ we have

$$J_\lambda(\hat{\theta}_n) \le J_\lambda(\tilde{\theta}) \le L(\tilde{\theta}) + \frac{1}{\sqrt{n}} \left( 11\lambda \hat{\gamma}(f^*) \sqrt{2\log(2d)} + \sqrt{2\log(2c/\delta)} \right). \qquad (16)$$

By combining Equations (15) and (16), we obtain

 L(^θn)≤L(~θ)+ C√n(√log(2d)