One of the major challenges involving neural networks is explaining their ability to generalize well, even if they are very large and have the potential to overfit the training data (Neyshabur et al., 2014; Zhang et al., 2016). Learning theory teaches us that this must be due to some inductive bias, which constrains one to learn networks of specific configurations (either explicitly, e.g., via regularization, or implicitly, via the algorithm used to train them). However, understanding the nature of this inductive bias is still largely an open problem.
A useful starting point is to consider the much more restricted class of linear predictors ($\mathbf{x} \mapsto \mathbf{w}^\top\mathbf{x}$). For this class, we have a very good understanding of how its generalization behavior is dictated by the norm of $\mathbf{w}$. In particular, assuming that $\|\mathbf{w}\| \le B$ (where $\|\cdot\|$ signifies Euclidean norm), and the distribution is such that $\|\mathbf{x}\| \le 1$ almost surely, it is well-known that the generalization error (w.r.t. Lipschitz losses) given $m$ training examples scales as $\mathcal{O}(B/\sqrt{m})$, completely independent of the dimension of $\mathbf{w}$. Thus, it is very natural to ask whether in the more general case of neural networks, one can obtain similar “size-independent” results (independent of the networks’ depth and width), under appropriate norm constraints on the parameters. This is also a natural question, considering that the size of modern neural networks is often much larger than the number of training examples.
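For completeness, the standard one-line argument behind this dimension-free rate can be sketched as follows (via the Rademacher complexity of the class $\{\mathbf{x}\mapsto\mathbf{w}^\top\mathbf{x} : \|\mathbf{w}\|\le B\}$, using Cauchy–Schwarz and Jensen's inequality):

```latex
\hat{\mathcal{R}}_m
\;=\; \frac{1}{m}\,\mathbb{E}_{\boldsymbol{\epsilon}} \sup_{\|\mathbf{w}\|\le B} \sum_{i=1}^m \epsilon_i\,\mathbf{w}^\top\mathbf{x}_i
\;=\; \frac{B}{m}\,\mathbb{E}_{\boldsymbol{\epsilon}}\Big\|\sum_{i=1}^m \epsilon_i\,\mathbf{x}_i\Big\|
\;\le\; \frac{B}{m}\sqrt{\sum_{i=1}^m \|\mathbf{x}_i\|^2}
\;\le\; \frac{B}{\sqrt{m}},
```

using $\mathbb{E}[\epsilon_i\epsilon_j] = \mathbb{1}\{i=j\}$ and $\|\mathbf{x}_i\| \le 1$; no dimension-dependent quantity appears anywhere in the chain.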
Classical results on the sample complexity of neural networks do not satisfy this desideratum, and have a strong explicit dependence on the network size. For example, bounds relying on the VC dimension (see Anthony and Bartlett (2009)) strongly depend on both the depth and the width of the network, and can be trivial once the number of parameters exceeds the number of training examples. Scale-sensitive bounds, which rely on the magnitude of the network parameters, can alleviate the dependence on the width (see Bartlett (1998); Anthony and Bartlett (2009)). However, most such bounds in the literature have a strong (often exponential) dependence on the network depth. To give one recent example, Neyshabur et al. (2015) use Rademacher complexity tools to show that if the parameter matrices $W_1,\dots,W_d$ in each of the $d$ layers have Frobenius norms upper-bounded by $M_F(1),\dots,M_F(d)$ respectively, then the generalization error scales as

$$\mathcal{O}\left(\frac{B\,2^{d}\,\prod_{j=1}^{d} M_F(j)}{\sqrt{m}}\right). \tag{1}$$

Although this bound has no explicit dependence on the network width (that is, the dimensions of $W_1,\dots,W_d$), it has a very strong, exponential dependence on the network depth $d$, even if $M_F(j) = 1$ for all $j$. Neyshabur et al. (2015)
also showed that this dependence can sometimes be avoided for anti-symmetric activations, but unfortunately this is a non-trivial assumption, which is not satisfied for common activations such as the ReLU. Bartlett et al. (2017) use a covering numbers argument to show a bound scaling as

$$\tilde{\mathcal{O}}\left(\frac{B\,\prod_{j=1}^{d}\|W_j\|\cdot\left(\sum_{j=1}^{d}\left(\frac{\|W_j^\top\|_{2,1}}{\|W_j\|}\right)^{2/3}\right)^{3/2}}{\sqrt{m}}\right), \tag{2}$$
where $\|W_j\|$ denotes the spectral norm of $W_j$, $\|W_j^\top\|_{2,1}$ denotes the $1$-norm of the $2$-norms of the rows of $W_j$, and where we ignore factors logarithmic in $m$ and the network width. Unlike Eq. (1), here there is no explicit exponential dependence on the depth. However, there is still a strong and unavoidable polynomial dependence: To see this, note that for any $W_j$, $\|W_j^\top\|_{2,1} \ge \|W_j\|$, so the bound above can never be smaller than

$$\tilde{\mathcal{O}}\left(B\,\prod_{j=1}^{d}\|W_j\|\cdot\sqrt{\frac{d^{3}}{m}}\right).$$
In particular, even if we assume that $B\prod_{j=1}^{d}\|W_j\|$ is a constant, the bound becomes trivial once $d \ge m^{1/3}$. Finally, and using the same notation, Neyshabur et al. (2017) utilize a PAC-Bayesian analysis to prove a bound scaling as

$$\tilde{\mathcal{O}}\left(\frac{B\,\prod_{j=1}^{d}\|W_j\|\cdot\sqrt{d^{2}h\sum_{j=1}^{d}\frac{\|W_j\|_F^{2}}{\|W_j\|^{2}}}}{\sqrt{m}}\right), \tag{3}$$
where $h$ denotes the network width (Bartlett et al. (2017) note that this PAC-Bayesian bound is never better than the bound in Eq. (2) derived from a covering numbers argument). Again, since $\|W_j\|_F/\|W_j\|$ (the ratio of the Frobenius and spectral norms) is always at least $1$ for any matrix, this bound can never be smaller than $B\prod_{j=1}^{d}\|W_j\|\cdot\sqrt{d^{3}h/m}$, and becomes trivial once $d^{3}h \ge m$. To summarize, although some of the bounds above have logarithmic or no dependence on the network width, we are not aware of a bound in the literature which avoids a strong dependence on the depth, even if various norms are controlled.
Can this depth dependence be avoided, assuming the norms are sufficiently constrained? We argue that, at least in some cases, the answer is yes. To see this, let us return to the well-understood case of linear predictors, and consider generalized linear predictors of the form

$$\mathbf{x} \;\mapsto\; \sigma(\mathbf{w}^\top\mathbf{x}),$$
where $\sigma(z) = \max\{z, 0\}$ is the ReLU function. Like plain-vanilla linear predictors, the generalization error of this class is well-known to be $\mathcal{O}(B/\sqrt{m})$, assuming $\|\mathbf{w}\| \le B$ and that the inputs satisfy $\|\mathbf{x}\| \le 1$ almost surely. However, it is not difficult to show that this class can be equivalently written as a class of “ultra-thin” ReLU networks of the form

$$\mathbf{x} \;\mapsto\; w_d\,\sigma\big(w_{d-1}\,\sigma(\cdots w_2\,\sigma(\mathbf{w}_1^\top\mathbf{x}))\big) \tag{4}$$

(where $\mathbf{w}_1$ is a vector and $w_2,\dots,w_d$ are non-negative scalars), where the depth $d$ is arbitrary. Therefore, the sample complexity of this class must also scale as $\mathcal{O}(B/\sqrt{m})$: This depends on the norm product $\|\mathbf{w}_1\|\prod_{j=2}^{d} w_j \le B$, but is completely independent of the network depth $d$ as well as the dimension of $\mathbf{x}$. We argue that a “satisfactory” sample complexity analysis should have similar independence properties when applied on this class.
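The equivalence is just positive-homogeneity at work: for $a \ge 0$, $\sigma(a z) = a\,\sigma(z)$, so the scalar layers collapse into a single rescaling of the bottom layer. A small numerical sketch of this collapse (illustrative code of our own; the function names are not from any reference implementation):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def ultra_thin_net(x, w1, scalars):
    """Depth-d 'ultra-thin' net: x -> w_d * relu(w_{d-1} * relu(... relu(w1^T x)))."""
    z = relu(np.dot(w1, x))
    for w in scalars[:-1]:          # scalar layers w_2, ..., w_{d-1}
        z = relu(w * z)
    return scalars[-1] * z          # top scalar w_d (no activation on the output)

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
w1 = rng.standard_normal(5)
scalars = [0.5, 2.0, 1.5, 0.8]      # non-negative scalars: depth is 5, yet immaterial

deep = ultra_thin_net(x, w1, scalars)
# Positive-homogeneity collapses the whole tower to one generalized linear predictor:
shallow = np.prod(scalars) * relu(np.dot(w1, x))
assert np.isclose(deep, shallow)
```

The norm budget $B$ can also be split arbitrarily across the layers (e.g. replacing $w_2$ by $c\,w_2$ and $w_d$ by $w_d/c$) without changing the computed function, which is another way to see why the depth cannot matter for this class.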
In more general neural networks, the vector $\mathbf{w}_1$ and scalars $w_2,\dots,w_d$ become matrices $W_1,\dots,W_d$, and the simple observation above no longer applies. However, using the same intuition, it is natural to try and derive generalization bounds by controlling $\prod_{j=1}^{d}\|W_j\|$, where $\|\cdot\|$ is a suitable matrix norm. Perhaps the simplest choice is the spectral norm (and indeed, a product of spectral norms was utilized in some of the previous results mentioned earlier). However, as we formally show in Sec. 5, the spectral norm alone is too weak to get size-independent bounds, even if the network depth is small. Instead, we show that controlling other suitable norms can indeed lead to better depth dependence, or even fully size-independent bounds, improving on earlier works. Specifically, we make the following contributions:
In Sec. 3, we show that the exponential depth dependence in Rademacher complexity-based analysis (e.g. Neyshabur et al. (2015)) can be avoided by applying contraction to a slightly different object than what has become standard since the work of Bartlett and Mendelson (2002). For example, for networks with parameter matrices of Frobenius norm at most $M_F(1),\dots,M_F(d)$, the bound in Eq. (1) can be improved to

$$\mathcal{O}\left(\frac{B\,\sqrt{d}\,\prod_{j=1}^{d} M_F(j)}{\sqrt{m}}\right). \tag{5}$$
The technique can also be applied to other types of norm constraints. For example, if we consider an $\ell_1$ setup, corresponding to the class of depth-$d$ networks where the $1$-norm of each row of $W_j$ is at most $M(j)$, we attain a bound of

$$\mathcal{O}\left(\frac{B\,\prod_{j=1}^{d} M(j)\,\sqrt{d + \log n}}{\sqrt{m}}\right),$$

where $n$ is the input dimension. Again, the dependence on the depth $d$ is polynomial and quite mild. In contrast, Neyshabur et al. (2015) studied a similar setup and only managed to obtain an exponential dependence on $d$.
In Sec. 4, we develop a generic technique to convert depth-dependent bounds to depth-independent bounds, assuming some control over any Schatten norm of the parameter matrices (which includes, for instance, the Frobenius norm and the trace norm as special cases). The key observation we utilize is that the prediction function computed by such networks can be approximated by the composition of a shallow network and univariate Lipschitz functions. For example, again assuming that the Frobenius norms of the layers are bounded by $M_F(1),\dots,M_F(d)$, we can further improve Eq. (5) to

$$\tilde{\mathcal{O}}\left(B\,\prod_{j=1}^{d} M_F(j)\cdot\min\left\{\frac{\sqrt{\log\left(\frac{1}{\Gamma}\prod_{j=1}^{d} M_F(j)\right)}}{m^{1/4}},\;\sqrt{\frac{d}{m}}\right\}\right), \tag{6}$$

where $\Gamma$ is a lower bound on the product of the spectral norms of the parameter matrices (note that $\prod_{j=1}^{d} M_F(j) \ge \Gamma$ always). Assuming that $\frac{1}{\Gamma}\prod_{j=1}^{d} M_F(j)$ is at most some constant $M$, this can be upper bounded by $\tilde{\mathcal{O}}\big(B\prod_{j=1}^{d} M_F(j)\sqrt{\log M}/m^{1/4}\big)$, which to the best of our knowledge, is the first explicit bound for standard neural networks which is fully size-independent, assuming only suitable norm constraints. Moreover, it captures the depth-independent sample complexity behavior of the network class in Eq. (4) discussed earlier. We also apply this technique to get a depth-independent version of the bound in (Bartlett et al., 2017): Specifically, if we assume that the spectral norms satisfy $\|W_j\| \le M(j)$ for all $j$, and $\max_j \|W_j^\top\|_{2,1}/\|W_j\| \le L$, then the bound in Eq. (2) provided by Bartlett et al. (2017) becomes

$$\tilde{\mathcal{O}}\left(\frac{B\,L\,d^{3/2}\,\prod_{j=1}^{d} M(j)}{\sqrt{m}}\right).$$
In contrast, we show the following bound for any $p \ge 1$ (ignoring some lower-order logarithmic factors):

where $M_p(j)$ is an upper bound on the Schatten $p$-norm of $W_j$, and $\Gamma$ is a lower bound on $\prod_{j=1}^{d}\|W_j\|$. Again, by upper bounding the $\min$ by its first argument, we get a bound independent of the depth $d$, assuming the norms are suitably constrained.
In Sec. 5, we provide a lower bound, showing that for any $p \in [1,\infty]$, the class of depth-$d$, width-$h$ neural networks, where each parameter matrix $W_j$ has Schatten $p$-norm at most $M(j)$, can have Rademacher complexity of at least

$$\Omega\left(\frac{B\,h^{\max\left\{0,\,\frac{1}{2}-\frac{1}{p}\right\}}\,\prod_{j=1}^{d} M(j)}{\sqrt{m}}\right).$$

This somewhat improves on Bartlett et al. (2017, Theorem 3.6), which only showed such a result for $p = \infty$ (i.e. with spectral norm control), and without the $h^{\max\{0,\frac{1}{2}-\frac{1}{p}\}}$ term. For $p = 2$, it matches the upper bound in Eq. (6) in terms of the norm dependencies $B\prod_{j=1}^{d} M_F(j)$. Moreover, it establishes that controlling the spectral norm alone (and indeed, any Schatten $p$-norm control with $p > 2$) cannot lead to bounds independent of the size of the network. Finally, the bound shows (similar to Bartlett et al. (2017)) that a dependence on products of norms across layers is generally inevitable.
Finally, we emphasize that the bounds of Sec. 4 are independent of the network size only under the assumption that products of norms (or at least ratios of norms) across all layers are bounded by a constant, which is quite restrictive in practice. For example, it is enough that the Frobenius norm of each layer matrix is at least some constant $c > 1$ to get that Eq. (6) scales as $c^{\Omega(d)}$, where $d$ is the number of layers. However, our focus here is to show that at least some norm-based assumptions lead to size-independent bounds, and we hope these can be weakened in future work.
Notation. We use bold-faced letters to denote vectors, and capital letters to denote matrices or fixed parameters (which should be clear from context). Given a vector $\mathbf{w}$, $\|\mathbf{w}\|$ will refer to the Euclidean norm, and for $p \ge 1$, $\|\mathbf{w}\|_p$ will refer to the $\ell_p$ norm. For a matrix $W$, we use $\|W\|_p$, where $p \in [1,\infty]$, to denote the Schatten $p$-norm (that is, the $p$-norm of the spectrum of $W$, written as a vector). For example, $p = \infty$ refers to the spectral norm, $p = 2$ refers to the Frobenius norm, and $p = 1$ refers to the trace norm. For the case of the spectral norm, we will drop the subscript, and use just $\|W\|$. Also, in order to follow standard convention, we use $\|W\|_F$ to denote the Frobenius norm. Finally, given a matrix $W$ and reals $p, q \ge 1$, we let $\|W\|_{p,q}$ denote the $q$-norm of the $p$-norms of the columns of $W$.
Neural Networks. Given the domain $\mathcal{X}$ in Euclidean space, we consider (scalar or vector-valued) standard neural networks, of the form

$$\mathbf{x} \;\mapsto\; W_d\,\sigma_{d-1}\big(W_{d-1}\,\sigma_{d-2}(\cdots\sigma_1(W_1\mathbf{x}))\big),$$

where each $W_j$ is a parameter matrix, and each $\sigma_j$ is some fixed Lipschitz continuous function between Euclidean spaces, satisfying $\sigma_j(\mathbf{0}) = \mathbf{0}$. In the above, we denote $d$ as the depth of the network, and its width $h$ is defined as the maximal row or column dimension of $W_1,\dots,W_d$. Without loss of generality, we will assume that each $\sigma_j$ has a Lipschitz constant of at most $1$ (otherwise, the Lipschitz constant can be absorbed into the norm constraint of the neighboring parameter matrix). We say that $\sigma_j$ is element-wise if it can be written as an application of the same univariate function over each coordinate of its input (in which case, somewhat abusing notation, we will also use $\sigma_j$ to denote that univariate function). We say that $\sigma_j$ is positive-homogeneous if it is element-wise and satisfies $\sigma_j(\alpha z) = \alpha\,\sigma_j(z)$ for all $\alpha \ge 0$ and $z \in \mathbb{R}$. An important example of the above are ReLU networks, where every $\sigma_j$ corresponds to applying the (positive-homogeneous) ReLU function $z \mapsto \max\{z, 0\}$ on each element. To simplify notation, we let $W_a^b$ be shorthand for the matrix tuple $(W_a,\dots,W_b)$, and denote by $N_a^b$ the function computed by the sub-network composed of layers $a$ through $b$, that is

$$N_a^b(\mathbf{x}) \;=\; W_b\,\sigma_{b-1}\big(W_{b-1}\,\sigma_{b-2}(\cdots\sigma_a(W_a\mathbf{x}))\big).$$
Rademacher Complexity. The results in this paper focus on Rademacher complexity, which is a standard tool to control the uniform convergence (and hence the sample complexity) of given classes of predictors (see Bartlett and Mendelson (2002); Shalev-Shwartz and Ben-David (2014) for more details). Formally, given a real-valued function class $\mathcal{H}$ and some set of data points $\mathbf{x}_1,\dots,\mathbf{x}_m \in \mathcal{X}$, we define the (empirical) Rademacher complexity $\hat{\mathcal{R}}_m(\mathcal{H})$ as

$$\hat{\mathcal{R}}_m(\mathcal{H}) \;=\; \mathbb{E}_{\boldsymbol{\epsilon}}\left[\sup_{h\in\mathcal{H}}\frac{1}{m}\sum_{i=1}^{m}\epsilon_i\,h(\mathbf{x}_i)\right],$$

where $\boldsymbol{\epsilon} = (\epsilon_1,\dots,\epsilon_m)$ is a vector uniformly distributed in $\{-1,+1\}^m$. Our main results provide bounds on the Rademacher complexity (sometimes independent of the data points $\mathbf{x}_1,\dots,\mathbf{x}_m$, as long as they are assumed to have norm at most $B$), with respect to classes of neural networks with various norm constraints. Using standard arguments, such bounds can be converted to bounds on the generalization error, assuming access to a sample of $m$ i.i.d. training examples.
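As a concrete illustration (our own sketch, not code from the paper), the expectation over $\boldsymbol{\epsilon}$ can be estimated by Monte Carlo; for the norm-bounded linear class the supremum even has the closed form $\sup_{\|\mathbf{w}\|\le B}\sum_i \epsilon_i\,\mathbf{w}^\top\mathbf{x}_i = B\,\|\sum_i \epsilon_i\,\mathbf{x}_i\|$, which makes the dimension-independence of the $B/\sqrt{m}$ rate easy to check numerically:

```python
import numpy as np

# Monte Carlo estimate of the empirical Rademacher complexity of the
# norm-bounded linear class {x -> w^T x : ||w|| <= B}.  For this class the
# supremum has the closed form  sup_w sum_i eps_i w^T x_i = B * ||sum_i eps_i x_i||.
def rademacher_linear(X, B, n_trials=2000, seed=0):
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    vals = [B * np.linalg.norm(rng.choice([-1.0, 1.0], size=m) @ X)
            for _ in range(n_trials)]
    return np.mean(vals) / m

rng = np.random.default_rng(1)
m, B = 200, 1.0
estimates = {}
for dim in (5, 500):                                  # vary the dimension of x
    X = rng.standard_normal((m, dim))
    X /= np.linalg.norm(X, axis=1, keepdims=True)     # enforce ||x_i|| = 1
    estimates[dim] = rademacher_linear(X, B)

# Both estimates respect the dimension-free B/sqrt(m) bound (up to Monte Carlo error).
assert all(v <= B / np.sqrt(m) + 1e-3 for v in estimates.values())
```

The same estimator can in principle be applied to small neural network classes, except that the supremum must then be approximated numerically rather than computed in closed form.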
3 From Exponential to Polynomial Depth Dependence
To get bounds on the Rademacher complexity of deep neural networks, a reasonable approach (employed in Neyshabur et al. (2015)) is to use a “peeling” argument, where the complexity bound for depth-$d$ networks is reduced to a complexity bound for depth-$(d-1)$ networks, and the reduction is then applied $d$ times. For example, consider the class $\mathcal{H}_d$ of depth-$d$ real-valued ReLU neural networks, where each layer's parameter matrix $W_j$ has Frobenius norm at most $M_F(j)$. Using some straightforward manipulations, it is possible to show that $m\,\hat{\mathcal{R}}_m(\mathcal{H}_d)$, which by definition equals

$$\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{W_1^d}\sum_{i=1}^{m}\epsilon_i\,N_1^d(\mathbf{x}_i),$$

can be upper bounded by

$$2\,M_F(d)\cdot\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{W_1^{d-1}}\left\|\sum_{i=1}^{m}\epsilon_i\,N_1^{d-1}(\mathbf{x}_i)\right\|. \tag{8}$$

Iterating this inequality $d$ times, one ends up with a bound scaling as $2^{d}\prod_{j=1}^{d} M_F(j)\cdot B\sqrt{m}$ (as in Neyshabur et al. (2015), see also Eq. (1)). The exponential factor follows from the factor $2$ in Eq. (8), which in turn follows from applying the Rademacher contraction principle to get rid of the ReLU function. Unfortunately, this factor is generally unavoidable (see the discussion in Ledoux and Talagrand (1991) following Theorem 4.12).
In this section, we point out a simple trick, which can be used to reduce such exponential depth dependencies to polynomial ones. In a nutshell, using Jensen's inequality, we can upper bound the (scaled) Rademacher complexity $m\,\hat{\mathcal{R}}_m(\mathcal{H}_d)$ by

$$\frac{1}{\lambda}\log\,\mathbb{E}_{\boldsymbol{\epsilon}}\exp\left(\lambda\,\sup_{W_1^d}\sum_{i=1}^{m}\epsilon_i\,N_1^d(\mathbf{x}_i)\right),$$

where $\lambda > 0$ is an arbitrary parameter. We then perform a “peeling” argument similar to before, resulting in a multiplicative factor of $2$ after every peeling step. Crucially, these factors accumulate inside the log factor, so that the end result contains only a $\log(2^{d}) = d\log 2$ term, which by appropriate tuning of $\lambda$, can be further reduced to $\mathcal{O}(\sqrt{d})$.
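Schematically, writing $Z$ for the (random) quantity left after all peeling steps and $v$ for its sub-Gaussian variance factor, the tuning step is the elementary optimization

```latex
\frac{1}{\lambda}\log\!\left(2^{d}\,\mathbb{E}\,e^{\lambda Z}\right)
\;\le\; \frac{d\log 2}{\lambda} + \frac{\lambda v}{2} + \mathbb{E}Z,
\qquad
\min_{\lambda>0}\left\{\frac{d\log 2}{\lambda}+\frac{\lambda v}{2}\right\}
\;=\;\sqrt{2\,v\,d\log 2},
```

attained at $\lambda = \sqrt{2d\log 2/v}$; the additive $d\log 2$ inside the logarithm replaces the multiplicative $2^{d}$, and the square root in the optimized value is where the $\sqrt{d}$ dependence comes from.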
The formalization of this argument depends on the matrix norm we are using, and we will begin with the case of the Frobenius norm. A key technical condition for the argument to work is that we can perform the “peeling” inside the $\exp$ function. This is captured by the following lemma:
Let $\sigma$ be a $1$-Lipschitz, positive-homogeneous activation function which is applied element-wise (such as the ReLU). Then for any class of vector-valued functions $\mathcal{F}$, and any convex and monotonically increasing function $g : \mathbb{R} \to [0,\infty)$,

$$\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{f\in\mathcal{F},\,W:\|W\|_F\le R} g\left(\left\|\sum_{i=1}^{m}\epsilon_i\,\sigma\big(Wf(\mathbf{x}_i)\big)\right\|\right) \;\le\; 2\,\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{f\in\mathcal{F}} g\left(R\left\|\sum_{i=1}^{m}\epsilon_i\,f(\mathbf{x}_i)\right\|\right).$$
Letting $\mathbf{w}_1,\dots,\mathbf{w}_h$ be the rows of the matrix $W$, we have

$$\left\|\sum_{i=1}^{m}\epsilon_i\,\sigma\big(Wf(\mathbf{x}_i)\big)\right\|^2 \;=\; \sum_{j=1}^{h}\left(\sum_{i=1}^{m}\epsilon_i\,\sigma\big(\mathbf{w}_j^\top f(\mathbf{x}_i)\big)\right)^2 \;=\; \sum_{j=1}^{h}\|\mathbf{w}_j\|^2\left(\sum_{i=1}^{m}\epsilon_i\,\sigma\left(\frac{\mathbf{w}_j^\top}{\|\mathbf{w}_j\|}f(\mathbf{x}_i)\right)\right)^2,$$

using the positive-homogeneity of $\sigma$ in the last step. The supremum of this over all $\mathbf{w}_1,\dots,\mathbf{w}_h$ such that $\sum_{j=1}^{h}\|\mathbf{w}_j\|^2 \le R^2$ must be attained when $\|\mathbf{w}_j\| = R$ for some $j$, and $\mathbf{w}_{j'} = \mathbf{0}$ for all $j' \ne j$. Therefore,

$$\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{f,\,W:\|W\|_F\le R} g\left(\left\|\sum_{i=1}^{m}\epsilon_i\,\sigma\big(Wf(\mathbf{x}_i)\big)\right\|\right) \;=\; \mathbb{E}_{\boldsymbol{\epsilon}}\sup_{f,\,\mathbf{w}:\|\mathbf{w}\|\le R} g\left(\left|\sum_{i=1}^{m}\epsilon_i\,\sigma\big(\mathbf{w}^\top f(\mathbf{x}_i)\big)\right|\right).$$

Since $g(|z|) \le g(z) + g(-z)$ for any $z$, this can be upper bounded by

$$\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{f,\mathbf{w}} g\left(\sum_{i=1}^{m}\epsilon_i\,\sigma\big(\mathbf{w}^\top f(\mathbf{x}_i)\big)\right) + \mathbb{E}_{\boldsymbol{\epsilon}}\sup_{f,\mathbf{w}} g\left(-\sum_{i=1}^{m}\epsilon_i\,\sigma\big(\mathbf{w}^\top f(\mathbf{x}_i)\big)\right) \;=\; 2\,\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{f,\mathbf{w}} g\left(\sum_{i=1}^{m}\epsilon_i\,\sigma\big(\mathbf{w}^\top f(\mathbf{x}_i)\big)\right),$$

where the equality follows from the symmetry in the distribution of the $\epsilon_i$ random variables. The right hand side in turn can be upper bounded by

$$2\,\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{f,\,\mathbf{w}:\|\mathbf{w}\|\le R} g\left(\sum_{i=1}^{m}\epsilon_i\,\mathbf{w}^\top f(\mathbf{x}_i)\right) \;\le\; 2\,\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{f} g\left(R\left\|\sum_{i=1}^{m}\epsilon_i\,f(\mathbf{x}_i)\right\|\right)$$

by the contraction principle for convex increasing functions (see equation 4.20 in Ledoux and Talagrand (1991)), followed by Cauchy–Schwarz. ∎
With this lemma in hand, we can provide a bound on the Rademacher complexity of Frobenius-norm-bounded neural networks, which is as clean as Eq. (1), but where the $2^{d}$ factor is replaced by $\sqrt{d}$:
Let $\mathcal{H}$ be the class of real-valued networks of depth $d$ over the domain $\mathcal{X} = \{\mathbf{x} : \|\mathbf{x}\| \le B\}$, where each parameter matrix $W_j$ has Frobenius norm at most $M_F(j)$, and with activation functions satisfying the conditions of Lemma 1. Then

$$\hat{\mathcal{R}}_m(\mathcal{H}) \;\le\; \frac{B\left(\sqrt{2\log(2)\,d}+1\right)\prod_{j=1}^{d} M_F(j)}{\sqrt{m}}.$$
Fix $\lambda > 0$, to be chosen later. The scaled Rademacher complexity $m\,\hat{\mathcal{R}}_m(\mathcal{H})$ can be upper bounded as

$$\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{W_1^d}\sum_{i=1}^{m}\epsilon_i\,N_1^d(\mathbf{x}_i) \;\le\; \frac{1}{\lambda}\log\,\mathbb{E}_{\boldsymbol{\epsilon}}\exp\left(\lambda\,\sup_{W_1^d}\sum_{i=1}^{m}\epsilon_i\,N_1^d(\mathbf{x}_i)\right) \;\le\; \frac{1}{\lambda}\log\,\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{W_1^{d}}\exp\left(\lambda\,M_F(d)\left\|\sum_{i=1}^{m}\epsilon_i\,\sigma_{d-1}\big(N_1^{d-1}(\mathbf{x}_i)\big)\right\|\right).$$

We write this last expression as

$$\frac{1}{\lambda}\log\,\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{f,\,W_{d-1}:\|W_{d-1}\|_F\le M_F(d-1)}\exp\left(\lambda\,M_F(d)\left\|\sum_{i=1}^{m}\epsilon_i\,\sigma_{d-1}\big(W_{d-1}f(\mathbf{x}_i)\big)\right\|\right) \;\le\; \frac{1}{\lambda}\log\left(2\,\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{f}\exp\left(\lambda\,M_F(d)\,M_F(d-1)\left\|\sum_{i=1}^{m}\epsilon_i\,f(\mathbf{x}_i)\right\|\right)\right),$$

where $f$ ranges over all possible functions computed by the first $d-2$ layers of the network. Here we applied Lemma 1 with $g(z) = \exp(\lambda\,M_F(d)\,z)$. Repeating the process, we arrive at

$$\frac{1}{\lambda}\log\left(2^{d}\,\mathbb{E}_{\boldsymbol{\epsilon}}\exp\left(\lambda\,\tilde{M}\left\|\sum_{i=1}^{m}\epsilon_i\,\mathbf{x}_i\right\|\right)\right),$$

where $\tilde{M} := \prod_{j=1}^{d} M_F(j)$. Define a random variable

$$Z \;:=\; \tilde{M}\left\|\sum_{i=1}^{m}\epsilon_i\,\mathbf{x}_i\right\|$$

(random as a function of the random variables $\epsilon_1,\dots,\epsilon_m$). Then

$$m\,\hat{\mathcal{R}}_m(\mathcal{H}) \;\le\; \frac{1}{\lambda}\log\left(2^{d}\,\mathbb{E}\exp(\lambda Z)\right) \;=\; \frac{d\log 2}{\lambda} + \frac{1}{\lambda}\log\,\mathbb{E}\exp(\lambda Z). \tag{9}$$
By Jensen's inequality, $\mathbb{E}Z$ can be upper bounded by

$$\tilde{M}\,\sqrt{\mathbb{E}\left\|\sum_{i=1}^{m}\epsilon_i\,\mathbf{x}_i\right\|^2} \;=\; \tilde{M}\sqrt{\sum_{i=1}^{m}\|\mathbf{x}_i\|^2} \;\le\; \tilde{M}B\sqrt{m}, \qquad\text{so}\qquad \frac{1}{\lambda}\log\,\mathbb{E}\exp(\lambda Z) \;\le\; \frac{1}{\lambda}\log\,\mathbb{E}\exp\big(\lambda(Z-\mathbb{E}Z)\big) + \tilde{M}B\sqrt{m}. \tag{10}$$

To handle the $\log\mathbb{E}\exp(\lambda(Z-\mathbb{E}Z))$ term in Eq. (10), note that $Z$ is a deterministic function of the i.i.d. random variables $\epsilon_1,\dots,\epsilon_m$, and satisfies

$$\big|Z(\epsilon_1,\dots,\epsilon_i,\dots,\epsilon_m) - Z(\epsilon_1,\dots,-\epsilon_i,\dots,\epsilon_m)\big| \;\le\; 2\,\tilde{M}\,\|\mathbf{x}_i\| \;\le\; 2\,\tilde{M}B.$$

This means that $Z$ satisfies a bounded-difference condition, which by the proof of Theorem 6.2 in (Boucheron et al., 2013), implies that $Z - \mathbb{E}Z$ is sub-Gaussian, with variance factor

$$v \;=\; \frac{1}{4}\sum_{i=1}^{m}\big(2\,\tilde{M}B\big)^2 \;=\; \tilde{M}^2 B^2 m, \qquad\text{hence}\qquad \frac{1}{\lambda}\log\,\mathbb{E}\exp\big(\lambda(Z-\mathbb{E}Z)\big) \;\le\; \frac{\lambda\,\tilde{M}^2 B^2 m}{2}.$$

Choosing $\lambda = \frac{\sqrt{2\log(2)\,d}}{\tilde{M}B\sqrt{m}}$ and using the above, we get that Eq. (9) can be upper bounded as follows:

$$\frac{d\log 2}{\lambda} + \frac{\lambda\,\tilde{M}^2 B^2 m}{2} + \tilde{M}B\sqrt{m} \;=\; \left(\sqrt{2\log(2)\,d}+1\right)\tilde{M}B\sqrt{m},$$

from which the result follows after dividing by $m$. ∎
We note that for simplicity, the bound in Thm. 1 is stated for real-valued networks, but the argument easily carries over to vector-valued networks, composed with some real-valued Lipschitz loss function. In that case, one uses a variant of Lemma 1 to peel off the losses, and then proceeds in the same manner as in the proof of Thm. 1. We omit the precise details for brevity.
A result similar to the above can also be derived for other matrix norms. For example, given a matrix $W$, let $\|W^\top\|_{1,\infty}$ denote the maximal $1$-norm of its rows, and consider the class of depth-$d$ networks, where each parameter matrix $W_j$ satisfies $\|W_j^\top\|_{1,\infty} \le M(j)$ for all $j$ (this corresponds to a setting, also studied in Neyshabur et al. (2015), where the $\ell_1$-norm of the weights of each neuron in the network is bounded). In this case, we can derive a variant of Lemma 1, which in fact does not require positive-homogeneity of the activation function:
Let $\sigma$ be a $1$-Lipschitz activation function with $\sigma(0) = 0$, applied element-wise. Then for any vector-valued class $\mathcal{F}$, and any convex and monotonically increasing function $g$,

$$\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{f\in\mathcal{F},\,W:\|W^\top\|_{1,\infty}\le R} g\left(\left\|\sum_{i=1}^{m}\epsilon_i\,\sigma\big(Wf(\mathbf{x}_i)\big)\right\|_\infty\right) \;\le\; 2\,\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{f\in\mathcal{F}} g\left(R\left\|\sum_{i=1}^{m}\epsilon_i\,f(\mathbf{x}_i)\right\|_\infty\right),$$

where $\|\cdot\|_\infty$ denotes the vector infinity norm.
Using the same technique as before, we can use this lemma to get a bound on the Rademacher complexity of this class:
Let $\mathcal{H}$ be the class of real-valued networks of depth $d$ over the domain $\mathcal{X}$, where each parameter matrix $W_j$ satisfies $\|W_j^\top\|_{1,\infty} \le M(j)$ for all $j$, and with activation functions satisfying the condition of Lemma 2. Then

$$\hat{\mathcal{R}}_m(\mathcal{H}) \;\le\; \frac{2\left(\prod_{j=1}^{d} M(j)\right)\sqrt{(d + 1 + \log n)\cdot\max_{k}\sum_{i=1}^{m} x_{i,k}^2}}{m},$$

where $x_{i,k}$ is the $k$-th coordinate of the vector $\mathbf{x}_i$.
The constructions used in the results of this section use the $\exp$ function together with its inverse $\log$, to get depth dependencies scaling as $\sqrt{d}$. Thus, it might be tempting to try and further improve the depth dependence, by using other functions $g$ for which $g^{-1}(2^{d})$ increases sub-logarithmically in $d$. Unfortunately, the argument still requires us to control the moment term $\mathbb{E}_{\boldsymbol{\epsilon}}\,g\big(\tilde{M}\|\sum_{i=1}^{m}\epsilon_i\,\mathbf{x}_i\|\big)$, which is difficult if $g$ increases more than exponentially. In the next section, we introduce a different idea, which under suitable assumptions, allows us to get rid of the depth dependence altogether.
4 From Depth Dependence to Depth Independence
In this section, we develop a general result, which allows one to convert any depth-dependent bound on the Rademacher complexity of neural networks to a depth-independent one, assuming that the Schatten $p$-norms of the parameter matrices (for any $p < \infty$) are sufficiently controlled. We develop and formalize the main result in Subsection 4.1, and provide applications in Subsection 4.2. The proofs of the results in this section appear in Sec. 7.
4.1 A General Result
To motivate our approach, let us consider a special case of depth-$d$ networks, where:
Each parameter matrix $W_j$ is constrained to be diagonal and of size $n \times n$.
The Frobenius norm of every $W_j$ is at most $1$.
All activation functions are the identity (so the network computes a linear function).
Letting $\mathbf{w}_j$ be the diagonal of $W_j$, such networks are equivalent to

$$\mathbf{x} \;\mapsto\; \big(\mathbf{w}_1\circ\mathbf{w}_2\circ\cdots\circ\mathbf{w}_d\big)^\top\mathbf{x},$$

where $\circ$ denotes element-wise product. Therefore, if we would like the network to compute a non-trivial function, we clearly need that $\|\mathbf{w}_1\circ\cdots\circ\mathbf{w}_d\|$ be bounded away from zero (e.g., not exponentially small in $d$), while still satisfying the constraint $\|\mathbf{w}_j\| \le 1$ for all $j$. In fact, the only way to satisfy both requirements simultaneously is if $\mathbf{w}_1,\dots,\mathbf{w}_d$ are all close to some $1$-sparse unit vector, which implies that the matrices $W_1,\dots,W_d$ must be close to being rank-$1$.
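A tiny numerical illustration of this effect (our own sketch, not from the paper): with unit-Frobenius-norm diagonal layers, a spread-out diagonal shrinks the element-wise product exponentially with the depth, while a $1$-sparse diagonal preserves it:

```python
import numpy as np

def diag_product_norm(diags):
    """Norm of w_1 o w_2 o ... o w_d, i.e. of the product of the diagonal matrices."""
    prod = np.ones_like(diags[0])
    for w in diags:
        prod = prod * w             # element-wise product
    return np.linalg.norm(prod)

n, d = 10, 20
# Both choices satisfy the constraint ||w_j|| = 1 for every layer.
spread = [np.full(n, 1.0 / np.sqrt(n)) for _ in range(d)]   # mass spread over all coordinates
sparse = [np.eye(n)[0] for _ in range(d)]                   # 1-sparse unit vector

assert diag_product_norm(sparse) == 1.0       # non-trivial function survives
assert diag_product_norm(spread) < 1e-9       # ~ n^{(1-d)/2}: exponentially small in d
```

Only diagonals that are (close to) a common $1$-sparse unit vector keep the product bounded away from zero, matching the rank-$1$ intuition above.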
It turns out that this intuition holds much more generally, even if we do not restrict ourselves to identity activations and diagonal parameter matrices as above. Essentially, what we can show is that if some network computes a non-trivial function, and the product of the Schatten $p$-norms of its parameter matrices (for any $p < \infty$) is bounded, then there must be at least one parameter matrix which is not far from being rank-$1$. Therefore, if we replace that parameter matrix by an appropriate rank-$1$ matrix, the function computed by the network does not change much. This is captured by the following theorem:
For any $p \in [1,\infty)$, any network $N_1^d$ whose parameter matrices satisfy $\prod_{j=1}^{d}\|W_j\|_p \le M$ and $\prod_{j=1}^{d}\|W_j\| \ge \Gamma$, and for any $\epsilon > 0$, there exists another network $\tilde{N}_1^d$ (of the same depth and layer dimensions) with the following properties:

$\tilde{N}_1^d$ is identical to $N_1^d$, except for the parameter matrix in the $j$-th layer, for some $j \in \{1,\dots,d\}$. That matrix is of rank at most $1$, and equals $s\,\mathbf{u}\mathbf{v}^\top$, where $(s, \mathbf{u}, \mathbf{v})$ is a leading singular value and corresponding pair of singular vectors of the original matrix $W_j$.
We now make the following crucial observation: A real-valued network whose $j$-th parameter matrix is the rank-$1$ matrix $s\,\mathbf{u}\mathbf{v}^\top$ computes the function

$$\mathbf{x} \;\mapsto\; W_d\,\sigma_{d-1}\Big(W_{d-1}\cdots\sigma_j\Big(s\,\mathbf{u}\cdot\big(\mathbf{v}^\top\sigma_{j-1}\big(N_1^{j-1}(\mathbf{x})\big)\big)\Big)\cdots\Big).$$

This can be seen as the composition of the depth-$j$ real-valued network

$$\mathbf{x} \;\mapsto\; \mathbf{v}^\top\sigma_{j-1}\big(N_1^{j-1}(\mathbf{x})\big)$$

and the univariate function

$$z \;\mapsto\; W_d\,\sigma_{d-1}\big(W_{d-1}\cdots\sigma_j(s\,z\,\mathbf{u})\big).$$

Moreover, the norm constraints on $W_j,\dots,W_d$ imply that the latter function is Lipschitz. Therefore, the class of networks we are considering is a subset of the class of depth-$j$ networks composed with univariate Lipschitz functions. Fortunately, given any class with bounded complexity, one can effectively bound the Rademacher complexity of its composition with univariate Lipschitz functions, as formalized in the following theorem.
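A small numerical sketch of this decomposition (our own illustration; the matrices and dimensions are arbitrary): a depth-$3$ ReLU network whose middle matrix is $W_2 = s\,\mathbf{u}\mathbf{v}^\top$ factors exactly into the real-valued shallow part $\mathbf{x}\mapsto\mathbf{v}^\top\sigma(W_1\mathbf{x})$ followed by a univariate function:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)
rng = np.random.default_rng(2)

h = 6
W1 = rng.standard_normal((h, 4))
W3 = rng.standard_normal((1, h))
s = 1.7                                   # rank-1 middle layer: W2 = s * u v^T
u = rng.standard_normal(h); u /= np.linalg.norm(u)
v = rng.standard_normal(h); v /= np.linalg.norm(v)
W2 = s * np.outer(u, v)

x = rng.standard_normal(4)
full = (W3 @ relu(W2 @ relu(W1 @ x)))[0]            # the depth-3 network

shallow = v @ relu(W1 @ x)                          # real-valued shallow network
phi = lambda z: (W3 @ relu(s * z * u))[0]           # univariate (Lipschitz) top part
assert np.isclose(full, phi(shallow))
```

The Lipschitz constant of `phi` is controlled by $|s|$ and the norms of the remaining layers, which is what lets Thm. 4 below bound the complexity of the composed class.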
Let $\mathcal{F}$ be a class of functions from Euclidean space to $[-R, R]$. Let $\mathcal{G}$ be the class of $L$-Lipschitz functions from $[-R, R]$ to $\mathbb{R}$, such that $g(z_0) = 0$ for some fixed $z_0 \in [-R, R]$. Letting $\mathcal{G}\circ\mathcal{F} := \{g\circ f : g\in\mathcal{G},\,f\in\mathcal{F}\}$, its Rademacher complexity satisfies

$$\hat{\mathcal{R}}_m(\mathcal{G}\circ\mathcal{F}) \;\le\; c\,L\left(\frac{R}{\sqrt{m}} + \log^{3/2}(m)\;\hat{\mathcal{R}}_m(\mathcal{F})\right),$$

where $c$ is a universal constant.

The $\log^{3/2}(m)\,\hat{\mathcal{R}}_m(\mathcal{F})$ term can be replaced by $\log(m)\,\hat{\mathcal{G}}_m(\mathcal{F})$, where $\hat{\mathcal{G}}_m(\mathcal{F})$ is the empirical Gaussian complexity of $\mathcal{F}$ – see the proof in Sec. 7 for details.
Combining these ideas, our plan of attack is the following: Given some class of depth-$d$ networks and an arbitrary parameter $r \in \{1,\dots,d\}$, we use Thm. 3 to relate their Rademacher complexity to the complexity of similar networks, but where for some $j \le r$, the $j$-th parameter matrix is of rank at most $1$. We then use Thm. 4 to bound that complexity in turn using the Rademacher complexity of depth-$j$ networks. Crucially, the resulting bound has no explicit dependence on the original depth $d$, only on the new parameter $r$. Formally, we have the following theorem, which is the main result of this section:
Consider the following hypothesis class of networks on $\{\mathbf{x} : \|\mathbf{x}\| \le B\}$:
for some parameters $p$, $\Gamma$, and $M(1),\dots,M(d)$. Also, for any $r \in \{1,\dots,d\}$, define
Finally, for , let , where are real-valued loss functions which are -Lipschitz and satisfy , for some . Assume that .
Then the Rademacher complexity is upper bounded by
where $c$ is a universal constant.
In particular, one can upper bound this result by any choice of $r \in \{1,\dots,d\}$. By tuning $r$ appropriately, we can get bounds independent of the depth $d$. In the next subsection, we provide some concrete applications for specific choices of $p$.
The margin-type parameters which divide the norm terms in Thm. 5 are both closely related to the notion of a margin. Indeed, if we consider binary or multi-class classification, then bounds as above w.r.t. $\frac{1}{\gamma}$-Lipschitz losses can be converted to a bound on the misclassification error rate in terms of the average $\gamma$-margin error on the training data (see Bartlett et al. (2017, Section 3.1) for a discussion). Also, $B\prod_{j=1}^{d}\|W_j\|$ can be viewed as the “maximal” margin attainable over the input domain, since $\sup_{\mathbf{x}:\|\mathbf{x}\|\le B}|N_1^d(\mathbf{x})| \le B\prod_{j=1}^{d}\|W_j\|$.
4.2 Applications of Thm. 5
In this section we exemplify how Thm. 5 can be used to obtain depth-independent bounds on the sample complexity of various classes of neural networks. The general technique is as follows: First, we prove a bound on the Rademacher complexity of the class, which generally depends on the depth $d$, and scales as $\sqrt{d^{a}/m}$ for some $a > 0$. Then, we plug it into Thm. 5, and utilize the following lemma to tune $r$ appropriately:
For any , and , it holds that
We begin with proving a depth-independent version of Thm. 1. That theorem implies that for the class of depth-$d$ neural networks with Frobenius norm bounds $M_F(1),\dots,M_F(d)$ on the layers (up to and including layer $d$),
Let $\mathcal{H}$ be the class of depth-$d$ neural networks, where each parameter matrix $W_j$ satisfies $\|W_j\|_F \le M_F(j)$, and with $1$-Lipschitz, positive-homogeneous, element-wise activation functions. Assuming the loss function and the input domain satisfy the conditions of Thm. 5 (with the sets being unconstrained), it holds that
Ignoring logarithmic factors and replacing the $\min$ by its first argument, the bound in the corollary is at most