1 Introduction
Understanding why large neural networks are able to generalize is one of the most important puzzles in the theory of deep learning. Since sufficiently large neural networks can approximate any function, their success must be due to a strong inductive bias in the learned network weights, which is still not fully understood.
A useful approach to understand such biases is studying what types of constraints on the network weights can lead to uniform convergence bounds, which ensure that empirical risk minimization will not lead to overfitting. Notwithstanding the ongoing debate on whether uniform convergence can fully explain the learning performance of neural networks (Nagarajan and Kolter, 2019; Negrea et al., 2020; Koehler et al., 2021), these bounds provide us with important insights on what normbased biases can aid in generalization. For example, for linear predictors, it is wellunderstood that constraints on the Euclidean norm of the weights imply uniform convergence guarantees independent of the number of parameters, and so striving to minimize the Euclidean norm is a useful inductive bias, whether used explicitly or implicitly. However, neural networks have a more complicated structure than linear predictors, and we still lack a good understanding of what normbased constraints imply a good inductive bias.
In this paper, we study this question in the simple case of scalarvalued onehiddenlayer neural networks, which generally compute functions from to of the form
(1) 
with weight matrix
, weight vector
, and a fixed (generally nonlinear) activation function . We focus on an Euclidean setting, where the inputs and output weight vector are assumed to have bounded Euclidean norm. Our goal is to understand what kind of norm control on the matrix is required to achieve uniform convergence guarantees, independent of the underlying distribution and the network width(i.e., the number of neurons). Previous work clearly indicates that a bound on the spectral norm is generally necessary, but (as we discuss below) does not conclusively imply whether it is also sufficient.
Our first contribution (in Subsection 3.1) is formally establishing that spectral norm control is generally insufficient to get widthindependent sample complexity bounds in high dimensions, by directly lower bounding the fatshattering number of the predictor class. On the flip side, if we assume that the Frobenius norm of is bounded, then we can prove uniform convergence guarantees, independent of the network width or input dimension. The result is based on Rademacher complexity, and extends previous results in this setting from homogeneous activations to general Lipschitz activations (Neyshabur et al., 2015; Golowich et al., 2018). In Subsection 3.2, we also prove a variant of our lower bound in the case where the input dimension is fixed, pointing at a possibly interesting regime for which good upper bounds are currently lacking.
The constructions used in our lower bounds crucially require activation functions which are nonsmooth around . Moreover, they employ networks where the matrix can be arbitrary (as long as its norm is bounded). Motivated by this, we identify and analyze two important settings where these lower bounds can be circumvented, and where a mere spectral norm control is sufficient to obtain widthindependent guarantees:

The first case (studied in Sec. 4) is for networks where the activation function
is sufficiently smooth: Specifically, when it is analytic and the coefficients of its Taylor expansion decay sufficiently rapidly. Some examples include polynomial activations, sigmoidal functions such as the error function, and appropriate smoothings of the ReLU function. Perhaps surprisingly, the smoothness of the activation allows us to prove uniform convergence guarantees that depend only on the spectral norm of
and the structure of the activation function, independent of the network width. Moreover, we can extend our results for deeper networks when the activations is a power function (e.g., quadratic activations). 
A second important case (studied in Sec. 5) is when the network employs weightsharing on
, as in convolutional networks. Specifically, we consider two variants of onehiddenlayer convolutional networks, one with a linear output layer, and another employing maxpooling. In both cases, we present bounds on the sample complexity that depend only on the spectral norm, and study how they depend on the convolutional architecture of the network (such as the number of patches or their amount of overlap).
Our work leaves open quite a few questions and possible avenues for future research, which we discuss in Sec. 6. All proofs of our results appear in Appendix A.
Related Work
The literature on the sample complexity of neural networks has rapidly expanded in recent years, and cannot be reasonably surveyed here. In what follows, we discuss only works which deal specifically with the issues we focus on in this paper.
Frobenius vs. spectral norm Control, lower bounds. Bartlett et al. (2017) proved a lower bound on the Rademacher complexity of neural networks, implying that a dependence on the spectral norm is generally necessary. Golowich et al. (2018)
extended this to show that a dependence on the network width is also necessary, if only the spectral norm is controlled. However, their construction requires a vectorvalued (rather than scalarvalued) outputs. More importantly, the lower bound is on the Rademacher complexity of the predictor class rather than the fatshattering dimension, and thus (as we further discuss below) does not necessarily imply that the actual sample complexity with some bounded loss function indeed suffers from such a width dependence.
Daniely and Granot (2019) do provide a fatshattering lower bound, which implies that neural networks on with bounded spectral norm and width at most can shatter points with constant margin, assuming that the inputs have norm at most . However, this lower bound does not separate between the input norm bound and the width of the hidden layer (which both scale with ), and thus does not clarify the contribution of the network width to the bound. Moreover, their proof technique appears to crucially rely on the input’s norm scaling with the dimension, rather than being an independent parameter.Frobenius vs. spectral norm control, upper bounds. A widthindependent uniform convergence guarantee, depending on the Frobenius norm, has been established in Neyshabur et al. (2015) for constantdepth networks, and in Golowich et al. (2018) for arbitrarydepth networks. However, these bounds are specific to homogeneous activation functions, whereas we tackle general Lipschitz activations (at least for onehidden layer networks). Other types of normbased bounds appear, for example, in Anthony and Bartlett (1999); Bartlett et al. (2017).
Sample complexity with smooth activations. The Rademacher complexity for networks with quadratic activations has been studied in Du and Lee (2018), but assuming Frobenius norm constraints, whereas we show that mere spectral norm constraint is sufficient to bound the Rademacher complexity independent of the network width. The strong influence of the activation function on the sample complexity has been observed in the context of VCdimension bounds for binary classification (see Anthony and Bartlett (1999, Section 7.2)). However, we are not aware of previous results showing how the smoothness of the activation functions provably affects scalesensitive bounds such as the Rademacher complexity in our setting.
Sample complexity of convolutional networks. Normbased uniform convergence bounds for convolutional networks (including more general ones than the one we study) have been provided in Du et al. (2018); Long and Sedghi (2019). However, these bounds either depend on the overall number of parameters, or apply only to averagepooling. For convolutional networks with maxpooling, Ledent et al. (2021) provide a normbased analysis which we build on (see Sec. 5 for details). Works studying the generalization performance of convolutional networks in settings different than ours include Li et al. (2018); Arora et al. (2018); Wei and Ma (2019); Brutzkus and Globerson (2021).
2 Preliminaries
Notation. We use boldface letters to denote vectors, and let be shorthand for . Given a matrix , is the entry in row and column . Given a function on , we somewhat abuse notation and let (for a vector ) or (for a matrix ) denote applying elementwise. A special case is when is the ReLU function. We use standard bigOh notation, with hiding constants and hiding constants and factors polylogarithmic in the problem parameters.
Norms. denotes the operator norm: For vectors, it is the Euclidean norm, and for matrices, the spectral norm (i.e., ). denotes the Frobenius norm (i.e., ). It is wellknown that for any matrix , , so the class of matrices whose Frobenius norm is bounded by some is a subset of the class of matrices whose spectral norm is bounded by the same . Moreover, if is an matrix, then .
Network Architecture. Most of our results pertain to scalarvalued onehiddenlayer networks, of the form , where , , is a vector and is some fixed nonlinear function. The width of the network is , the number of rows of (or equivalently, the number of neurons in the hidden layer of the network).
FatShattering and Rademacher Complexity. When studying lower bounds on the sample complexity of a given function class, we use the following version of its fatshattering dimension:
Definition 1.
A class of functions on an input domain shatters points with margin , if there exist a number , such that for all we can find some such that
The fatshattering dimension of (at scale ) is the cardinality of the largest set of points in for which the above holds.
It is wellknown that the fatshattering dimension lower bounds the number of samples needed to learn in a distributionfree learning setting, up to accuracy (see for example Anthony and Bartlett (1999, Part III)). Thus, by proving the existence of a large set of points shattered by the function class, we get lower bounds on the fatshattering dimension, which translate to lower bounds on the sample complexity.
As to upper bounds on the sample complexity, our results utilize the Rademacher complexity of a function class , which for our purposes can be defined as
where is a vector of
independent random variables
uniformly distributed on . Upper bounds on the Rademacher complexity directly translate to upper bounds on the sample complexity required for learning : Specifically, the number of inputs required to make smaller than some is generally an upper bound on the number of samples required to learn up to accuracy , using any Lipschitz loss (see Bartlett and Mendelson (2002); ShalevShwartz and BenDavid (2014); Mohri et al. (2018)).We note that although the fatshattering dimension and Rademacher complexity are closely related, they do no always behave the same: For example, the class of normbounded linear predictors has Rademacher complexity , implying samples to make it less than . In contrast, the fatshattering dimension of the class is smaller, (Anthony and Bartlett, 1999; Bartlett and Mendelson, 2002). The reason for this discrepancy is that the Rademacher complexity of the predictor class necessarily scales with the range of the function outputs, which is not necessarily relevant if we use bounded losses (that is, if we are actually interested in the function class of linear predictors composed with a bounded loss). Such bounded losses are common, for example, when we are interested in bounding the expected misclassification error (see for example Bartlett and Mendelson (2002); Bartlett et al. (2017)). For this reason, we focus on fatshattering dimension in our lower bounds, and Rademacher complexity in our upper bounds.
3 Frobenius Norm Control is Necessary for General Networks
We begin by considering onehiddenlayer networks , where is a function on applied elementwise (such as the ReLU activation function). In Subsection 3.1, we consider the dimensionfree case (where we are interested in upper and lower bounds that do not depend on the input dimension ). In Subsection 3.2, we consider the case where the dimension is a fixed parameter.
3.1 DimensionFree Bounds
We focus on the following hypothesis class of scalarvalued, onehiddenlayer neural networks of width on inputs in , where is a function on applied elementwise, and where we only bound the operator norms:
The following theorem shows that if the input dimension is large enough, then under a mild condition on the nonsmoothness of around , the fatshattering dimension of this class necessarily scales with the network width :
Theorem 1.
Suppose that the activation function (as a function on ) is Lipschitz on , and satisfies as well as
(2) 
for some .
Then there exist universal constants such that the following hold: For any , there is some such that for any input dimension , can shatter
points from with margin , provided the expression above is larger than .
To understand the condition in Eq. (2), suppose that has a lefthand derivative and righthand derivative at . Recalling that , the condition stated in the theorem implies that
for all . In particular, as , we get . Thus, is necessarily nondifferentiable at . For example, the ReLU activation function satisfies the assumption in the theorem with , and the leaky ReLU function (with parameter ) satisfies the assumption with .
Remark 1.
The assumption is without much loss of generality: If , then let be a centering of , and note that our predictors can be rewritten in the form . Thus, our hypothesis class is contained in the hypothesis class of predictors of the form for some bounded bias parameter . This bias term does not change the fatshattering dimension, and thus is not of much interest.
The theorem implies that with only spectral norm control (i.e. where is bounded), it is impossible to get bounds independent of the width of the network . Initially, the lower bound might appear surprising, since if the activation function is the identity, simply contains linear predictors of norm , for which the sample complexity / fatshattering dimension is well known to be in high input dimensions, completely independent of (see discussion in the previous section). Intuitively, the extra term in the lower bound comes from the fact that for random matrices , can typically be much larger than , even when is a Lipschitz function satisfying . To give a concrete example, if is an matrix with i.i.d. entries uniform on , then standard concentration results imply that is upperbounded by a universal constant independent of , yet (since is just the constant matrix with value at every entry). The formal proof (in the appendix) relies on constructing a network involving random weights, so that the spectral norm is small yet the network can return sufficiently large values due to the nonlinearity.
Remark 2.
Thm. 1 has an interesting connection to the recent work of Bubeck et al. (2021), which implies that in order to fit points with bounded norm using a width onehiddenlayer neural network , the Lipschitz constant of the network (and hence ) must be generally at least . The lower bound in Thm. 1 implies a related statement in the opposite direction: If we allow to be sufficiently larger than , then there exist points that can be shattered with constant margin. Thus, we seem to get a good characterization of the expressiveness of onehidden layer neural networks on finite datasets, as a function of their width and the magnitude of the weights.
Considering the lower bound, and noting that is an upper bound on , the bound suggests that a control over the Frobenius norm would be sufficient to get widthindependent bounds. Indeed, such results were previously known when is the ReLU function, or more generally, a positivehomogeneous function of degree (Neyshabur et al., 2015; Golowich et al., 2018). In what follows, we will prove such a result for general Lipschitz functions (at least for onehidden layer networks).
Specifically, consider the following hypothesis class, where the previous spectral norm constraint on is replaced by a Frobenius norm constraint:
Theorem 2.
Suppose (as a function on ) is Lipschitz and . Then for any , the Rademacher complexity of on inputs from is at most , if
for some universal constant . Thus, it suffices to have .
The bound is indeed independent of the network width . Also, the result (as an upper bound on the Rademacher complexity) is clearly tight up to logfactors, since in the special case where and we fix , then reduces to the class of linear predictors with Euclidean norm at most (on data of norm at most ), whose Rademacher complexity matches the bound above up to logfactors.
Remark 3 (Connection to Implicit Regularization).
It was recently proved that training neural networks employing homogeneous activations on losses such as the logistic loss, without any explicit regularization, gradient methods are implicitly biased towards models which minimize the squared Euclidean norm of their parameters (Lyu and Li, 2019; Ji and Telgarsky, 2020). In our setting of onehiddenlayer networks , this reduces to . For homogeneous activations, multiplying by some scalar and dividing by the same scalar leaves the network unchanged. Based on this observation, and the fact that , it follows that minimizing (under any constraints on the network’s outputs) is equivalent to minimizing (under the same constraints). Thus, gradient methods are biased towards models which minimize our bound from Thm. 2 in terms of the norms of .
3.2 DimensionDependent Lower Bound
The bounds presented above are dimensionfree, in the sense that the upper bound holds for any input dimension , and the lower bound applies once is sufficiently large. However, for neural networks the case of being a fixed parameter is also of interest, since we often wish to apply large neural networks on inputs whose dimensionality is reasonably bounded (e.g., the number of pixels in an image).
For fixed , it is wellknown that there can be a discrepancy between the fatshattering dimension and the Rademacher complexity, even for linear predictors (see discussion in Sec. 2). Thus, although Thm. 2 is tight as a bound on the Rademacher complexity, one may conjecture that the fatshattering dimension (and true sample complexity for bounded losses) is actually smaller for fixed .
In what follows, we focus on the case of the Frobenius norm, and provide a dimensiondependent lower bound on the fatshattering dimension. We first state the result for a ReLU activation with a bias term (Thm. 3), and then extend it to the standard ReLU activation under a slightly more stringent condition (Corollary 1).
Theorem 3.
For any , and any larger than some universal constant, there exists a choice of such that the following hold: If , then can shatter
points from with margin , assuming the expression above is larger than for some universal constant , and where hides factors polylogarithmic in .
Corollary 1.
The lower bound of Thm. 3 also holds for the standard ReLU activation , if (which happens if the input dimension is larger than a factor polylogarithmic in the problem parameters).
The lower bound is the minimum of two terms: The first is , which is the order of the number of parameters in the network. This term is to be expected, since the fatshattering dimension of is at most the pseudodimension of , which indeed scales with the number of parameters (see Anthony and Bartlett (1999); Bartlett et al. (2019)). Hence, we cannot expect to be able to shatter many more than points. The second term is norm and dimensiondependent, and dominates the overall lower bound if the network width is large enough. Comparing the theorem with the upper bound from Thm. 2, it seems to suggest that having a bounded dimension
may improve the sample complexity compared to the dimensionfree case, with a smaller dependence on the norm bounds. However, at the moment we do not have upper bounds which match this lower bound, or even establish that bounds better than Thm.
2 are possible when the dimension is small. We leave the question of understanding the sample complexity in the fixeddimension regime as an interesting problem for future research.Remark 4 (No contradiction to upper bound in Thm. 2, due to implicit bound on ).
The theorem requires that the displayed expression is at least order of . In particular, it requires that , or equivalently . Thus, the theorem only applies when the dimension is not too large with respect to the other parameters. We note that this is to be expected: If is larger than , then the lower bound in the theorem may be larger than (at least when is large enough), and this would violate the upper bound implied by Thm. 2.
4 Spectral Norm Control Suffices for Sufficiently Smooth Activations
The lower bounds in the previous section crucially rely on the nonsmoothness of the activation functions. Thus, one may wonder whether smoothness can lead to better upper bounds. In this section, we show that perhaps surprisingly, this is indeed the case: For sufficiently smooth activations (e.g., polynomials), one can provide widthindependent Rademacher complexity bounds, using only the spectral norm. Formally, we return to the class of onehiddenlayer neural networks with spectral norm constraints,
and state the following theorem:
Theorem 4.
Fix some . Suppose for some , , simultaneously for all . Then the Rademacher complexity of on inputs from is at most , if
(assuming the sum converges).
We emphasize that this bound depends only on spectral norms of the network and properties of the activation . In particular, it is independent of the network width as well as the Frobenius norm of . The proof of the theorem (in the appendix) depends on algebraic manipulations, which involve ‘unrolling’ the Rademacher complexity as a polynomial function of the network inputs, and employing a certain technical trick to simplify the resulting expression, given a bound on the spectral norm of the weight matrix.
We now turn to provide some specific examples of and the resulting expression in the theorem (see also Figure 1):
Example 1.
If is a polynomial of degree , then for large enough .
In the example above, the output values of predictors in the class are at most , so it is not surprising that the resulting Rademacher complexity scales in the same manner.
The theorem also extends to nonpolynomial activations, as long as they are sufficiently smooth (although the dependence on in generally becomes exponential). The following is an example for a sigmoidal activation based on the error function:
Example 2.
If (where erf is the error function, and is a scaling parameter), then .
Proof.
We have that equals
and therefore
∎
A sigmoidal activation also allows us to define a smooth approximation of the ReLU function, to which the theorem can be applied:
Example 3.
If , then .
We note that as , converges uniformly to the ReLU function.
Proof.
Using a computation similar to the previous example, equals
and therefore
∎
Although the last two examples imply an exponential dependence on the spectral norm bound in the theorem, they still imply that for any fixed , we can get a finite sizeindependent sample complexity (regardless of the network’s width or input dimension) while controlling only the spectral norm of the weight matrices.
4.1 Extension to Higher Depths for Power Activations
When the activation functions are powers of the form for some , then the previous theorem can be extended to deeper networks. To formalize this, fix integers and , and consider a depth network (parameterized by weight matrices of some arbitrary fixed dimensions, and a weight vector ) defined recursively as follows:
where denotes applying the th power elementwise on a vector .
Theorem 5.
For any integers and choice of matrix dimensions at each layer, consider the class of neural networks as above, over all weight matrices such that for all , and all such that . Then the Rademacher complexity of this class on inputs from is at most , if
For constant and constantdepth networks, the bound in the theorem is , where bounds merely the (relatively weak) spectral norm. We also note that the exponential/doublyexponential dependence on is to be expected: Even if we consider networks where each matrix is a scalar , and the input is exactly , then multiplying by and taking the th power times over leads to the exact same factor. Since the Rademacher complexity depends on the scale of the outputs, such a factor is generally unavoidable. The proof of the theorem (in the appendix) builds on the proof ideas of Thm. 4, which can be extended to deeper networks at least when the activations are power functions.
5 Convolutional Networks
In this section, we study another important example of neural networks which circumvent our lower bounds from Sec. 3, this time by adding additional constraints on the weight matrix. Specifically, we consider onehiddenlayer convolutional neural networks. These networks are defined via a set of patches , where for each , the patch projects the input vector into some subset of its coordinates, namely for some . The hidden layer is parameterized by a convolutional filter vector , and given an input , outputs the vector , where is some activation function (e.g., ReLU). Note that this can be equivalently written as , where each row of embeds the vector in the coordinates corresponding to . In what follows, we say that a matrix conforms to a set of patches , if there exists a vector such that for all . Thus, our convolutional hidden layer corresponds to a standard hidden layer (same as in previous sections), but with the additional constraint on that it must conform to a certain set of patches.
In the first subsection below, we study networks where the convolutional hidden layer is combined with a linear output layer. In the following section, we study the case where the hidden layer is combined with a fixed pooling operation. In both cases, we will get bounds that depend on the spectral norm of and the architecture of the patches.
5.1 Convolutional Hidden Layer + Linear Output Layer
We begin by considering convolutional networks consisting of a convolutional hidden layer (with spectral norm control and with respect to some set of patches), followed by a linear output layer:
The following theorem shows that we can indeed obtain a Rademacher complexity bound depending only on the spectral norm of , and independent of the network width , under a mild assumption about the architecture of the patches:
Theorem 6.
Suppose is Lipschitz and . Fix some set of patches , and let be the maximal number of patches that any single input coordinate (in ) appears in. Then for any , the Rademacher complexity of on inputs from is at most , if
The proof of the theorem (in the appendix) is based on an algebraic analysis of the Rademacher complexity, and the observation that the spectral norm of necessarily upper bounds the Euclidean norm of the convolutional filter vector .
Other than the usual parameters, the bound in the theorem also depends on the architectural parameter , which quantifies the amount of “overlap” between the patches (e.g., it equals if the patches are disjoint). We note that even with overlapping patches, tends to be quite small: For example, if each patch corresponds to a small square patch in an input image, then generally scales with the size of the patch, not the input dimension nor the total number of patches. Nevertheless, an interesting open question is whether the factor in the bound can be reduced or avoided altogether.
5.2 Convolutional Hidden Layer + Pooling Layer
We now turn to consider a slightly different onehiddenlayer convolutional networks, where the linear output layer is replaced by a fixed pooling layer. Specifically, we consider networks of the form
where is an activation function as before, and is Lipschitz with respect to the norm. For example, may correspond to a maxpooling layer , or to an averagepooling layer . We define the following class of networks:
This class is very closely related to a class of convolutional networks recently studied in Ledent et al. (2021) using an elegant covering number argument. Employing their proof technique, we can provide a Rademacher complexity upper bound (Thm. 7 below), which depends merely on the spectral norm of , as well as a logarithmic dependence on the network width . Although a logarithmic dependence is relatively mild, one may wonder if we can remove it and get a fully widthindependent bound, same as our previous results. Our main novel contribution in this section (Thm. 8) is to show that this is not the case: The fatshattering dimension of the class necessarily has a factor, so the upper bound is tight up to factors polylogarithmic in the sample size .
Theorem 7.
Suppose that is Lipschitz and , and that is Lipschitz w.r.t. and satisfies . Fix some set of patches . Then, for any , the Rademacher complexity of on inputs from is at most if
for some universal constant . Thus, it suffices to have .
For the lower bound, we focus for simplicity on the case where is a maxpooling layer, and where is the ReLU function (which satisfies the conditions of Thm. 7 with ). However, we emphasize that unlike the lower bound we proved in Sec. 3, the construction does not rely on the nonsmoothness of , and in fact can easily be verified to apply (up to constants) for any satisfying and (where is a constant).
Theorem 8.
For any , there is such that the following hold: The class , with being the ReLU function and being the max function, can shatter
points from with margin .
The proof of the theorem is rather technical, but can be informally described as follows (where we focus just on where the dependence on comes from): We consider each input as a tensor of size (with entries indexed by a vector in , and the patches are all subtensors of dimensions . We construct inputs , where each contains zeros, and a single value at coordinate (with a at position , and elsewhere). Given a vector of target values, we construct the convolutional filter (a th order tensor of dimensions ) to have zeros, and a single value at coordinate . Thus, we “encode” the full set of target values in , and a simple calculation reveals that given , the network will output if , and otherwise. Thus, we can shatter points. An extension of this idea allows us to also incorporate the right dependence on the other problem parameters.
6 Conclusions and Open Questions
In this paper, we studied sample complexity upper and lower bounds for onehidden layer neural networks, based on bounding the norms of the weight matrices. We showed that in general, bounding the spectral norm cannot lead to sizeindependent guarantees, whereas bounding the Frobenius norm does. However, the constructions also pointed out where the lower bounds can be circumvented, and where a spectral norm control suffices for widthindependent guarantees: First, when the activations are sufficiently smooth, and second, for certain types of convolutional networks.
Our work raises many open questions for future research. For example, how does having a fixed input dimension affect the sample complexity of neural networks? Our lower bound in Thm. 3 indicates small might reduce the sample complexity, but currently we do not have good upper bounds that actually establish that (at least without depending on the network width as well). Alternatively, it could also be that Thm. 3 can be strengthened. In a related vein, our lower bound for convolutional networks (Thm. 8) requires a relatively high dimension, at least on the order of the network width. Can we get smaller bounds if the dimension is constrained?
In a different direction, we showed that spectral norm control does not lead to widthfree guarantees with nonsmooth activations, whereas such guarantees are possible with very smooth activations. What about other activations? Can we characterize when can we get such guarantees for a given activation function? Or at least, can we improve the dependence on the norm bound for sufficiently smooth nonpolynomial activations?
As to convolutional networks, we studied two particular architectures employing weightsharing: One with a linear output layer, and one with a fixed Lipschitz pooling layer mapping to . Even for onehiddenlayer networks, this leaves open the question of characterizing the widthindependent sample complexity of networks , where implements weightsharing and is a pooling operator mapping to with (Ledent et al. (2021) provide upper bounds in this setting, but they are not sizeindependent and we conjecture that they can be improved). Moreover, we still do not know whether parameters such as the amount of overlap in the patches (see Thm. 6) are indeed necessary.
All our bounds are in terms of the parameter matrix norm, or . Some existing bounds, such as in Bartlett et al. (2017), depend instead on the distance from some fixed dataindependent matrix (e.g., the initialization point), a quantity which can be much smaller. We chose to ignore this issue in our paper for simplicity, but it would be useful to generalize our bounds to incorporate this.
Beyond these, perhaps the most tantalizing open question is whether our results can be extended to deeper networks, and what types of bounds we might expect. Even if we treat the depth as a constant, existing bounds for deeper networks are either not widthindependent (e.g., Neyshabur et al. (2018); Daniely and Granot (2019)), utilize norms much stronger than even the Frobenius norm (e.g., Anthony and Bartlett (1999); Bartlett et al. (2017)), or involve products of Frobenius norms, which is quite restrictive (Neyshabur et al., 2015; Golowich et al., 2018). Based on our results, we know that a bound depending purely on the spectral norms is impossible in general, but conjecture that the existing upper bounds are all relatively loose. A similar question can be asked for more specific architectures such as convolutional networks.
Acknowledgements
This research is supported in part by European Research Council (ERC) grant 754705, and NSFBSF award 1718970.
References
 Anthony and Bartlett [1999] Martin Anthony and Peter L Bartlett. Neural network learning: Theoretical foundations. Cambridge University Press, 1999.

Arora et al. [2018]
Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang.
Stronger generalization bounds for deep nets via a compression
approach.
In
International Conference on Machine Learning
, pages 254–263. PMLR, 2018.  Bartlett and Mendelson [2002] Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
 Bartlett et al. [2017] Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrallynormalized margin bounds for neural networks. Advances in Neural Information Processing Systems, 30:6240–6249, 2017.
 Bartlett et al. [2019] Peter L Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearlytight vcdimension and pseudodimension bounds for piecewise linear neural networks. The Journal of Machine Learning Research, 20(1):2285–2301, 2019.

Brutzkus and Globerson [2021]
Alon Brutzkus and Amir Globerson.
An optimization and generalization analysis for maxpooling networks.
In
Proceedings of the ThirtySeventh Conference on Uncertainty in Artificial Intelligence
, 2021.  Bubeck et al. [2021] Sébastien Bubeck, Yuanzhi Li, and Dheeraj M Nagaraj. A law of robustness for twolayers neural networks. In Conference on Learning Theory, pages 804–820. PMLR, 2021.
 Daniely and Granot [2019] Amit Daniely and Elad Granot. Generalization bounds for neural networks via approximate description length. Advances in Neural Information Processing Systems, 32:13008–13016, 2019.
 Du and Lee [2018] Simon Du and Jason Lee. On the power of overparametrization in neural networks with quadratic activation. In International Conference on Machine Learning, pages 1329–1338. PMLR, 2018.

Du et al. [2018]
Simon S Du, Yining Wang, Xiyu Zhai, Sivaraman Balakrishnan, Russ R
Salakhutdinov, and Aarti Singh.
How many samples are needed to estimate a convolutional neural network?
Advances in Neural Information Processing Systems, 31, 2018.  Golowich et al. [2018] Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Sizeindependent sample complexity of neural networks. In Conference On Learning Theory, pages 297–299. PMLR, 2018.
 Ji and Telgarsky [2020] Ziwei Ji and Matus Telgarsky. Directional convergence and alignment in deep learning. Advances in Neural Information Processing Systems, 33, 2020.
 Koehler et al. [2021] Frederic Koehler, Lijia Zhou, Danica J Sutherland, and Nathan Srebro. Uniform convergence of interpolators: Gaussian width, norm bounds, and benign overfitting. arXiv preprint arXiv:2106.09276, 2021.
 Ledent et al. [2021] Antoine Ledent, Waleed Mustafa, Yunwen Lei, and Marius Kloft. Normbased generalisation bounds for deep multiclass convolutional neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 8279–8287, 2021.
 Ledoux and Talagrand [1991] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: isoperimetry and processes, volume 23. Springer Science & Business Media, 1991.
 Li et al. [2018] Xingguo Li, Junwei Lu, Zhaoran Wang, Jarvis Haupt, and Tuo Zhao. On tighter generalization bound for deep neural networks: Cnns, resnets, and beyond. arXiv preprint arXiv:1806.05159, 2018.
 Long and Sedghi [2019] Philip M Long and Hanie Sedghi. Generalization bounds for deep convolutional neural networks. In International Conference on Learning Representations, 2019.
 Lyu and Li [2019] Kaifeng Lyu and Jian Li. Gradient descent maximizes the margin of homogeneous neural networks. In International Conference on Learning Representations, 2019.
 Mohri et al. [2018] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2018.
 Nagarajan and Kolter [2019] Vaishnavh Nagarajan and J Zico Kolter. Uniform convergence may be unable to explain generalization in deep learning. In Advances in Neural Information Processing Systems, volume 32, 2019.

Negrea et al. [2020]
Jeffrey Negrea, Gintare Karolina Dziugaite, and Daniel Roy.
In defense of uniform convergence: Generalization via derandomization with an application to interpolating predictors.
In International Conference on Machine Learning, pages 7263–7272. PMLR, 2020.  Neyshabur et al. [2015] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Normbased capacity control in neural networks. In Conference on Learning Theory, pages 1376–1401. PMLR, 2015.
 Neyshabur et al. [2018] Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A pacbayesian approach to spectrallynormalized margin bounds for neural networks. In International Conference on Learning Representations, 2018.
 Seginer [2000] Yoav Seginer. The expected norm of random matrices. Combinatorics, Probability and Computing, 9(2):149–166, 2000.
 ShalevShwartz and BenDavid [2014] Shai ShalevShwartz and Shai BenDavid. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
 Wei and Ma [2019] Colin Wei and Tengyu Ma. Improved sample complexities for deep networks and robust classification via an alllayer margin. arXiv preprint arXiv:1910.04284, 2019.
 Zhang [2002] Tong Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2(Mar):527–550, 2002.
Appendix A Proofs
a.1 Proof of Thm. 1
We will assume without loss of generality that the condition stated in the theorem holds without an absolute value, namely
(3) 
To see why, note that if , then can never change sign as a function of (otherwise it will be for some ). Hence, the condition implies that either for all , or that for all . We simply choose to treat the first case, as the second case can be treated with a completely identical analysis, only flipping some of the signs.
Fix some sufficiently large dimension and integer to be chosen later. Choose to be some orthogonal vectors of norm in . Let be the matrix whose th column is . Given this input set, it is enough to show that there is some number , such that for any , we can find a predictor (namely, depending on ) in our class, such that , , and
(4) 
We will do so as follows: We let
Where is a certain scaling factor and is a valued matrix of size , both to be chosen later. In particular, we will assume that is approximately balanced, in the sense that for any column of , if is the portion of entries in the column, then
(5) 
For any , since are orthogonal and of norm , we have
where is the th column of , and is the entry of in the th row and th column. Then we have the following:

If , this equals .
Recalling Eq. (4), we get that by fixing , we can shatter the dataset as long as
(6) 
Leaving this condition for a moment, we now turn to specify how is chosen, so as to satisfy the condition . To that end, we let be any valued matrix which satisfies Eq. (5) as well as , where is some universal constant. Such a matrix necessarily exists assuming is sufficiently larger than ^{1}^{1}1This follows from the probabilistic method: If we pick the entries of uniformly at random, then both conditions will hold with some arbitrarily large probability (assuming is sufficiently larger than , see for example Seginer [2000]), hence the required matrix will result with some positive probability.. It then follows that . Therefore, by assuming
we ensure that .
Collecting the conditions on (namely, that it is in , satisfies Eq. (6), as well as the displayed equation above), we get that there is an appropriate choice of and we can shatter our points, as long as is sufficiently larger than and that
The first inequality is satisfied if (say) we can choose (which we will indeed do in the sequel). As to the second inequality, it is certainly satisfied if , as well as
Thus, we can shatter any number of points up to this upper bound. Picking on this order (assuming it is sufficiently larger than , or ), assuming that the dimension is larger than , and renaming the universal constants, the result follows.
a.2 Proof of Thm. 2
To simplify notation, we rewrite as simply (with the set over which the supremum is taken to be implicit). Also, we let denote the th row of the matrix , and let for any nonzero (or for ).
Fix some set of inputs with norm at most . The Rademacher complexity equals
Each matrix in the set