Neural networks are powerful tools in machine learning, especially deep learning. It has been observed that performance normally improves as the number of parameters increases, for example through the width (the dimensions of the parameter matrices in each layer) and the depth (the number of layers). Intuitively, as the number of parameters grows, a neural network has more potential to overfit the training data, resulting in a larger generalization error. Surprisingly, however, neural networks as deep as layers can still generalize remarkably well Li et al. (2021). This has inspired researchers to study the representation power of neural networks.
It is well known that generalization error is related to the sample complexity of the function class of neural networks Bousquet et al. (2003). Therefore, most works start from an analysis of sample complexity. Neyshabur et al. (2015) inductively apply the contraction principle to each layer of a neural network and obtain a bound on the generalization error that depends exponentially on the network depth. With further investigation, Bartlett et al. (2017) show that this bound can be tightened to a polynomial dependence on the depth. However, in practice, the number of training samples required for convergence does not necessarily go up when a network becomes deeper. Golowich et al. (2018) prove that, under certain conditions on the norms of the weight matrices and assumptions on the activation functions, a depth-independent bound on the sample complexity can be achieved.
In this paper, we focus on developing a similar depth-independent bound on the sample complexity of a more general class of polynomial neural networks (PNNs), which differ considerably from the architectures considered in previous works. Specifically, we use as our activation function. There are two main reasons why PNNs are of particular interest. First, PNNs are easy to analyze because of their simple structure. Chrysos et al. (2020); Choraria et al. (2022) study the representation power of one instance of PNNs called Π-Nets. Livni et al. (2014); Soltani and Hegde (2018); Du and Lee (2018) show that PNNs can efficiently approximate other networks. Specifically, only polynomial activation layers are required to approximate a 2-layer sigmoid network with regularization, where indicates the approximation error. In addition, Livni et al. (2014) show that the training technique for such networks is very similar to SGD and can be computed efficiently in each iteration. Second, the generalization ability of PNNs is still not well understood. Therefore, our contribution is an in-depth study of the sample complexity of PNNs: we show that the complexity can be depth-independent under certain conditions, which may shed some light on the recent success of deep PNNs in image generation Karras et al. (2019) and face recognition Chrysos et al. (2020). Note that most previous works on PNNs focus on only one specific choice of (i.e., ), whereas our conclusion applies to any .
This section introduces the notation used in the rest of the paper. We use bold-faced lower-case letters for vectors, non-bold lower-case letters for scalars, and capital letters for matrices or fixed parameters.
For a vector , its norm is denoted as , and . For a matrix , , where is the -th row of matrix . Note that our definition of differs slightly from the usual norm of a matrix; we adopt it to simplify the analysis. refers to the Frobenius norm and refers to the spectral norm. denotes the Schatten -norm of the spectrum of with , which is defined by , where ’s are the singular values of .
A cascade neural network with depth is defined as follows:
To simplify the notation, we write as shorthand for the matrix tuple , and let denote the function computed by the sub-network composed of layers through :
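To make the cascade structure and the sub-network shorthand concrete, here is a minimal sketch. It assumes that each layer is a matrix-vector product, that an element-wise power activation t -> t**k is applied between consecutive layers, and that no activation follows the last layer of the (sub-)network. These choices, and the names `matvec`, `forward` and `subnetwork`, are illustrative assumptions rather than the definitions used in this paper.

```python
def matvec(W, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * t for w, t in zip(row, x)) for row in W]


def forward(weights, x, k=2):
    """Apply the layers in order, with t -> t**k between consecutive layers.

    Assumption: no activation is applied after the last layer in `weights`.
    """
    out = list(x)
    for idx, W in enumerate(weights):
        out = matvec(W, out)
        if idx < len(weights) - 1:
            out = [t ** k for t in out]
    return out


def subnetwork(weights, x, b, r, k=2):
    """Function computed by layers b through r (1-indexed, inclusive)."""
    return forward(weights[b - 1:r], x, k)


# Hypothetical depth-2 example with a quadratic activation.
W1 = [[1, 0], [0, 2]]
W2 = [[1, 1]]
x = [1, 1]

full_output = forward([W1, W2], x)               # W2 * (W1 x)**2
first_layer = subnetwork([W1, W2], x, b=1, r=1)  # W1 x only
```

In this toy example the first layer maps x to [1, 2], squaring gives [1, 4], and the second layer sums the two entries.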
Given a real-valued function class and some set of data points , the (empirical) Rademacher complexity is
where is a Rademacher random variable, which takes the values +1 and -1 with equal probability. Throughout this paper, is a shorthand for .
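For intuition, the empirical Rademacher complexity of a small finite function class can be computed exactly by enumerating all sign vectors. The sketch below assumes the common normalization in which the supremum of the sign-weighted sum is divided by the number of data points; the function name `empirical_rademacher` and the toy classes are hypothetical and only serve as an illustration.

```python
from itertools import product


def empirical_rademacher(outputs):
    """Exact empirical Rademacher complexity of a finite function class.

    outputs[j][i] is the value of the j-th function on the i-th data point.
    Assumption: the sign-weighted sum is normalized by the sample size m.
    """
    m = len(outputs[0])
    total = 0.0
    # Enumerate all 2**m sign vectors, so this is only feasible for small m.
    for eps in product((-1, 1), repeat=m):
        # Supremum over the class of the sign-weighted average.
        total += max(
            sum(e * h_i for e, h_i in zip(eps, h)) / m for h in outputs
        )
    return total / 2 ** m


# A class with a single function cannot correlate with random signs.
single = empirical_rademacher([[1.0, 1.0]])

# Adding the negated function lets the class match the sign of the sum.
symmetric = empirical_rademacher([[1.0, 1.0], [-1.0, -1.0]])
```

The single-function class has complexity zero, while the symmetric two-function class has a strictly positive value, which reflects how richer classes fit random signs better.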
Our main results provide upper bounds on the Rademacher complexity with respect to PNNs under mild assumptions on input data.
3 Related Works
Classical results on the sample complexity of neural networks normally depend strongly on the dimensions of the parameter matrices. The upper bound shown in Anthony et al. (1999) is based on the VC dimension and relies strongly on both the depth and the width of the network. Neyshabur et al. (2015) make use of the contraction principle and show an upper bound that has an exponential dependence on the network depth , even if the norm of each parameter matrix is strictly upper bounded. Specifically, if the input data satisfies and each parameter matrix satisfies , then under some assumptions on the activations, the results in Neyshabur et al. (2015) show that
where is a matrix consisting of all input data points. Although this bound has no explicit dependence on the network width (the dimensions of ), it has an exponential dependence on the network depth , even if , which is undesirable. Bartlett et al. (2017) improve this exponential dependence to a polynomial one using a covering number argument. The authors show that
where some lower-order logarithmic factors are ignored. Although the dependency of becomes polynomial (i.e., ) in the new bound, the bound becomes trivial when , since the bound will not decrease with the growth of sample size in this case, indicating that it is not possible to reduce generalization error by increasing the size of training data. This is obviously not consistent with our observations during the training process of deep neural networks Li et al. (2021).
In Golowich et al. (2018) the authors show another line of proof techniques that gives rise to the first depth-independent upper bound on sample complexity of neural networks. Their main idea follows these two observations:
We can also view the classifier stated above as a class of “ultra-thin” networks of the following form:
Note that here is a vector and are scalars. Therefore, for this ultra-thin network, we should still have . The only difference between real neural networks and this “ultra-thin” network is that are matrices in real neural networks. Therefore, one could imagine that the change would take place in the constraint , but the upper bound should remain independent of the depth .
With these two observations, Golowich et al. (2018) present the following two theorems:
Theorem 3.1 (Golowich et al. (2018)).
For any and any net such that and , for , there exists an alternative net such that they are identical except for weights of the -th layer: . , where are the most significant singular value and corresponding left and right singular vectors of .
Theorem 3.2 (Golowich et al. (2018)).
Let , are -Lipschitz functions and for some fixed . Denote . The Rademacher complexity satisfies
where is some universal constant.
With Theorem 3.1, we can replace the original network with a -layer net followed by a univariate net. Theorem 3.2 then bounds the sample complexity of the neural network by the complexity of the net formed by the first layers. Note that, after applying Theorem 3.2, the complexity is independent of the depth of the original network. Combining the two, we arrive at the depth-independent bound.
Theorem 3.3 (Golowich et al. (2018)).
Consider the following hypothesis class of networks on :
for some parameters . Also for any , define
Finally, for , let , where are real-valued loss functions which are -Lipschitz and satisfy for some . Assume that . Then the Rademacher complexity is upper bounded by
where is a universal constant.
It is worth pointing out that the results shown in Golowich et al. (2018) are not directly applicable to PNNs because of their assumptions on the activations. Therefore, we extend this line of analysis to PNNs in this paper.
4 Improved Sample Complexity for Polynomial Neural Networks
Both depth-dependent and depth-independent results are provided in this section. Some proofs are relegated to a longer version of this paper for brevity.
4.1 Depth-Dependent Sample Complexity for Polynomial Neural Networks
Lemma 4.1.
Let the activation be , applied element-wise. Then for , function class , where , and any convex and monotonically increasing function , we have, under the condition ,
Proof. Let denote the rows of the matrix . Since the activation is , we have for any ,
Then for the whole matrix we have
According to definition of , we have
here we use the fact that if . So the supremum over such that must be attained when for some , and for all , as we can consider as a unit vector in the sense of norm. Therefore, we have
Since , the above quantity can be upper bounded by
where we use the symmetry property of Rademacher variables . By Eq. (4.20) in Ledoux and Talagrand (1991), we can further bound this term by
where represents the Lipschitz constant for . Since is continuous, it is equivalent to make sure that . Let us consider in the rest of this paper for simplicity,
So if and , we have
Lemma 4.1 shows that even if we have a very wide layer (i.e., dimension is large for ), the sample complexity analysis is the same as in the case where . This explains why the complexity can be independent of the width of neural networks. Moreover, the constraint in Eq. (2) asymptotically converges to as increases, as shown in Figure 1. In the supplement, we consider one special case where .
Lemma 4.2.
Let be the class of real-valued networks of depth over the domain , and , where , and with activations satisfying Lemma 4.1 for each layer. Then,
Proof. For some , we can upper bound the empirical Rademacher complexity by
where . Here, follows from Jensen’s inequality, follows from Lemma 4.1 with , and follows from repeating this process times.
For the second term , we have that
This implies that is a sub-Gaussian random variable and we can have the concentration inequality
Also by Jensen’s inequality, we have
Here follows from the fact that is a concave function for . Note that is a rather loose general bound for all . A much tighter bound can be obtained if we know the specific choice of (i.e., ).
Combining the results we have that Eq. (3) can be further upper bounded as follows
where we choose to optimize the upper bound in . ∎
Combining Lemma 4.1 and Lemma 4.2, we obtain a depth-dependent upper bound on the sample complexity. However, we need to take care of the norm constraints on all to make sure that the condition in Lemma 4.1 holds for each layer.
Theorem 4.3.
Let be the class of real-valued networks of depth over the domain . For , has , ; , ; and activation function which is applied element-wise. Then,
Proof. First, the function is monotonically increasing on . So if , the condition in Lemma 4.1 becomes , meaning that . Let us consider the first layer. Since , we have
Then for second layer, the input becomes . By definition we have
where , which is independent of . Here follows from Hölder’s inequality, follows from , and follows from . This requires . The conclusion holds for all when . It shows that as long as the weight matrix of the first layer satisfies certain conditions related to the original input, the constraints on the other layers depend only on the activation of each layer. So we get the overall condition: for ; , . This allows us to use Lemma 4.1 and Lemma 4.2, which completes the proof. ∎
For simplicity, we consider the activations of each layer to be the same in Theorem 4.3 as . We can easily extend it to the case where activations are different in each layer. It will change nothing but the norm restriction of each weight matrix with respect to the value of . And a similar conclusion would also hold for all activations satisfying property .
4.2 Depth-Independent Sample Complexity for Polynomial Neural Networks
We next show how to obtain a depth-independent bound for polynomial networks, following an argument similar to that of Golowich et al. (2018).
For any matrix , and , there exists a rank-1 matrix of the same size as such that
Proof. Let be a rank-1 matrix of the same size as containing only one non-zero row , which is the row with the largest norm in , meaning that . Then by definition, we know that
So we can have . And obviously, we also have . ∎
The original proof in Golowich et al. (2018) uses the SVD to construct the rank-1 matrix. In our case, we instead base the analysis on the row with the largest norm.
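The row-based construction can be sketched in a few lines. The example below assumes that the relevant matrix norm is the largest vector p-norm over the rows, which is our reading of the notation section; the helper names `p_norm` and `largest_row_rank1` are hypothetical.

```python
def p_norm(v, p):
    """Vector p-norm for finite p >= 1."""
    return sum(abs(t) ** p for t in v) ** (1.0 / p)


def largest_row_rank1(W, p):
    """Zero out every row of W except the one with the largest p-norm.

    Returns the resulting matrix (rank at most 1) and the kept row index.
    """
    norms = [p_norm(row, p) for row in W]
    j = max(range(len(W)), key=lambda i: norms[i])
    W_tilde = [
        list(row) if i == j else [0.0] * len(row)
        for i, row in enumerate(W)
    ]
    return W_tilde, j


# Hypothetical example: the second row has the largest 2-norm.
W = [[1.0, 2.0], [3.0, 4.0], [0.0, 1.0]]
W_tilde, j = largest_row_rank1(W, p=2)

# Under the assumed max-row norm, both matrices have the same norm.
max_row_norm = max(p_norm(row, 2) for row in W)
max_row_norm_tilde = max(p_norm(row, 2) for row in W_tilde)
```

Unlike a truncated SVD, this construction needs no decomposition: it only compares row norms and copies one row.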
Let and be two polynomial networks that differ only in the -th layer. If all conditions in Theorem 4.3 hold for both and , we have
Since input and weight matrices satisfy the constraints, every activation function is 1-Lipschitz in norm sense (), thus the Lipschitz constant of function is at most .
If satisfy , . Then for any , we have
Proof. The argument is similar to the proof of Lemma 6 in Golowich et al. (2018). Note that for . So we have
which completes the proof. ∎
Theorem 4.7.
For any polynomial network such that and , and for , ; , , there exists an alternative net that is identical except for the weights of the -th layer for : , where , where is the -th row of with the largest norm, and is a one-hot vector . Moreover, we have
Theorem 4.7 shows that the original neural network can be approximated by a composition of two networks,
where is the -th row of with the largest norm, and is a one-hot vector .
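The composition can be checked numerically on a toy example. The sketch below assumes a quadratic activation applied between consecutive layers and a middle layer that has already been replaced by a one-hot vector times a single row. All weights and the names `forward`, `head_output` and `tail` are hypothetical and chosen only for illustration.

```python
def matvec(W, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * t for w, t in zip(row, x)) for row in W]


def forward(weights, x, k):
    """Apply the layers in order, with t -> t**k between consecutive layers."""
    out = list(x)
    for idx, W in enumerate(weights):
        out = matvec(W, out)
        if idx < len(weights) - 1:
            out = [t ** k for t in out]
    return out


k = 2
W1 = [[1, 0], [1, 1]]
w = [1, 2]  # the single row kept in the middle layer
j = 0       # position of that row, i.e. the one-hot index
W2_tilde = [w if i == j else [0, 0] for i in range(2)]  # one-hot times row
W3 = [[1, 1]]
x = [1, 1]

# Full network with the rank-1 middle layer.
full = forward([W1, W2_tilde, W3], x, k)

# Head: the layers before the rank-1 layer, followed by the row w.
head_output = forward([W1, [w]], x, k)[0]


def tail(t):
    """Univariate network: place t at coordinate j, activate, apply W3."""
    scaled_one_hot = [t if i == j else 0 for i in range(2)]
    return forward([W3], [u ** k for u in scaled_one_hot], k)
```

Here `tail(head_output)` reproduces `full`, so everything after the rank-1 layer depends on the input only through one scalar.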