## 1 Introduction

Neural networks have proved extremely useful across a variety of applications, including speech recognition (Hinton et al., 2012; Graves et al., 2013; Chorowski et al., 2015; Jozefowicz et al., 2016), object categorization (Girshick et al., 2014; Szegedy et al., 2015), and image segmentation (Long et al., 2015; Badrinarayanan et al., 2017). But our mathematical understanding of neural networks and deep learning has not developed at the same speed.

A central objective is to equip methods for learning neural networks with statistical guarantees. Some guarantees are available for unconstrained estimators (Anthony and Bartlett, 2009), but these bounds are linear in the number of parameters, which conflicts with the large sizes of typical networks. The focus has thus shifted to estimators that involve constraints or regularizers. Estimators with ℓ1-regularizers (Bartlett, 1998; Bartlett and Mendelson, 2002; Anthony and Bartlett, 2009; Barron and Klusowski, 2018, 2019; Liu and Ye, 2019) have recently surged in popularity, motivated by the success of this type of regularization in linear regression (Tibshirani, 1996), compressed sensing (Candès et al., 2006; Donoho, 2006), and many other parts of data science. A key feature of ℓ1-regularization is that it is easy to incorporate into optimization schemes and, at the same time, induces sparsity, which has a number of favorable effects in deep learning (Glorot et al., 2011). There has been some progress on guarantees for least-squares estimators with constraints based on the sparsity of the networks (Schmidt-Hieber, 2017) or on group-type norms of the weights (Neyshabur et al., 2015b). These developments have provided valuable intuition, for example, about the role of network widths and depths, but important problems remain: the combinatorial constraints in the first paper render the corresponding estimators infeasible in practice, the exponential dependence of the bounds in the second paper is contrary to the trend toward very deep networks, and practitioners usually prefer regularized rather than constrained formulations of the estimators. More generally, many questions about the statistical properties of constrained and regularized estimation of neural networks remain open.

In this paper,
we introduce a general class of regularized least-squares estimators.
Our strategy is to disentangle the parameters into a “scale” and a “direction”—similarly to introducing polar coordinates—which allows us to focus the regularization on a one-dimensional parameter.
We call our approach *scale regularization*.
We then equip the scale regularized least-squares estimators with a general statistical guarantee for prediction.
A main feature of this guarantee is that it connects neural networks to standard empirical process theory through a quantity that we call the *effective noise*.
This connection facilitates specializing the bound to different types of regularization.

In a second step, we exemplify the general bound for ℓ1-regularization. We find a guarantee for the squared prediction error that decreases essentially as 1/√n in the number of samples n, is sub-linear in the number of hidden layers, and is logarithmic in the total number of parameters. This result suggests that ℓ1-regularization can ensure accurate prediction even for very wide and deep networks.

The organization of the paper is as follows: In Section 2, we introduce our regularization scheme and establish a general prediction bound. In Section 3, we specialize this bound to ℓ1-regularization. In Section 4, we compare our results to the related literature. In Section 5, we establish further mathematical properties of neural networks. In Section 6, we state all proofs. In Section 7, we conclude the paper.

## 2 Scale regularization for neural networks

We first establish an alternative parametrization of neural networks and use this parametrization to define our regularization strategy. We then provide a prediction guarantee for the corresponding estimators.

### 2.1 Alternative Parametrization

Consider data that follow a regression model

(1)

for some function . We are interested in estimating an approximation of

based on neural networks. Following first standard approaches, we consider feed-forward neural networks of the form

(2)

indexed by the network parameter that summarizes the weight matrices. The inputs and outputs of the network correspond to the data of model (1), and the noise variables are as specified there. For ease of notation, we assume that the inputs are fixed; generalizations to random inputs are straightforward. The network's architecture is specified by the number of hidden layers, or depth, and by the number of neurons in each layer, or width. The first layer is the input layer, and the last layer is the output layer. The functions applied between the layers are called activation functions. We omit shifts in the activation functions for notational simplicity, but such shifts can often be incorporated as additional neurons (Barron and Klusowski, 2018).

The parameter space in the above formulation is

In the following, however, we propose an alternative parametrization. We say that a function is non-negative homogeneous of degree if

and we say that a function is positive definite if

The corresponding properties for functions on are defined accordingly. For example, every norm on or is non-negative homogeneous of degree 1 and positive definite. We then find the following:

###### Proposition 1 (Equivalence between neural networks).

Assume that the activation functions are non-negative homogeneous of degree . Consider a function that is non-negative homogeneous of degree and positive definite, and denote the corresponding unit ball by

Then, for every , there exists a pair of and such that

and vice versa, for every pair of and , there exists a such that the above equality holds.

The proposition shows that the standard parametrization of neural networks over the set can be replaced by a parametrization over

. The equivalence requires the activation functions to be non-negative homogeneous (ReLU activations are popular examples), but it motivates a different parametrization of neural networks more generally. We thus change the parameter space for estimating the true data generating function

to the alternative parametrization and the corresponding space of networks accordingly (which, of course, equals the original space of networks under the conditions of the proposition). In other words, we study neural networks

(3)

indexed by the network parameters. We can interpret the one-dimensional parameter as the network’s “scale” and the constrained parameter as the network’s “direction.”
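As a minimal numeric sketch of this scale/direction split (our own toy illustration; the concrete choice of the function Q as the product of Frobenius norms of the weight matrices, which is non-negative homogeneous of the required degree and positive definite, is ours and not necessarily the paper's), one can verify the factorization for a small ReLU network:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

def net(x, Ws):
    """Feed-forward ReLU network with weight matrices Ws applied to input x."""
    for W in Ws[:-1]:
        x = relu(W @ x)
    return Ws[-1] @ x

# a small network: 3 inputs, two hidden layers, scalar output
Ws = [rng.standard_normal((4, 3)), rng.standard_normal((4, 4)), rng.standard_normal((1, 4))]
M = len(Ws)

# Q: product of Frobenius norms -- non-negative homogeneous of degree M and positive definite
Q = np.prod([np.linalg.norm(W) for W in Ws])

lam = Q                                  # scale
theta = [W / Q ** (1 / M) for W in Ws]   # direction, normalized so that Q(theta) = 1

x = rng.standard_normal(3)
assert np.isclose(np.prod([np.linalg.norm(W) for W in theta]), 1.0)  # direction has unit Q
assert np.allclose(net(x, Ws), lam * net(x, theta))                  # scale times direction
```

Here, the ReLU's non-negative homogeneity of degree 1 is what lets the scale factor pass through every layer, exactly as in the proof of Proposition 1.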

### 2.2 Estimation

The most basic approach to fit the model parameters of the network (2) to the model (1) is the least-squares estimator

But to account for the high-dimensionality of the parameter space, the least-squares estimator is often complemented with a regularizer; popular choices are the ℓ1-norm (Zhang et al., 2016) or group versions of it (Scardapane et al., 2017). A straightforward way to incorporate such regularizers is

where is a tuning parameter. But in neural network frameworks, it turns out to be difficult to analyze such estimators statistically.

We introduce, therefore, a different way to incorporate regularizers. The approach is based on our new parametrization. The equivalent of the above least-squares estimator in the framework (3) is

Under the conditions of Proposition 1, this estimator is equivalent to the least-squares estimator above, but we can take it as a starting point more generally. The new parametrization allows us to focus the regularization on the scale parameter; in other words, we propose the estimators

(4)

where is a tuning parameter.
The fixed constraint captures the type of regularization (such as ℓ1),
while the actual regularization concerns only the scale.
We thus call our approach *scale regularization*.
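In code, the objective in (4) is a plain least-squares criterion in the pair of scale and direction, plus a penalty on the scale alone. The sketch below (all names hypothetical; the choice of Q as a product of Frobenius norms is our own stand-in for the constraint) evaluates such an objective:

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda z: np.maximum(z, 0.0)

def net(x, Ws):
    """Feed-forward ReLU network with scalar output."""
    for W in Ws[:-1]:
        x = relu(W @ x)
    return (Ws[-1] @ x).item()

def normalize(Ws):
    """Project a parameter onto the unit ball {Q = 1}, here with Q = product of Frobenius norms."""
    Q = np.prod([np.linalg.norm(W) for W in Ws])
    return [W / Q ** (1 / len(Ws)) for W in Ws]

def scale_regularized_objective(lam, theta, X, y, r):
    """Least squares in scale lam and direction theta, plus a penalty on lam only, cf. (4)."""
    preds = np.array([lam * net(x, theta) for x in X])
    return np.mean((y - preds) ** 2) + r * lam

X = rng.standard_normal((20, 3))
y = rng.standard_normal(20)
theta = normalize([rng.standard_normal((4, 3)), rng.standard_normal((1, 4))])
value = scale_regularized_objective(2.0, theta, X, y, r=0.1)
assert np.isfinite(value) and value > 0
```

Minimizing this objective over the scale and the normalized direction jointly yields an estimator of the form (4); only the one-dimensional scale carries the tuning parameter.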

The concentration of the regularization on a one-dimensional parameter greatly facilitates the statistical analysis. Specifically, it will allow us to focus our attention on

(5)

which can be thought of as the neural network equivalent of what high-dimensional linear regression refers to as the *effective noise* (Lederer and Vogt, 2020). We need to ensure—just as in high-dimensional linear regression—that the effective noise is controlled by the tuning parameter with high probability. In this spirit, we define quantiles of the effective noise for a given level through

(6)

In other words, the quantile is the smallest tuning parameter that controls the effective noise at the given level.
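The quantile in (6) is typically not available in closed form, but it can be approximated by simulation. The sketch below is our own crude illustration: it takes the effective noise to be the supremum, over unit-scale networks, of the normalized inner product between the noise and the network outputs (in analogy to the linear-regression case), approximates the supremum by random search over directions, and reads off an empirical (1 − α)-quantile:

```python
import numpy as np

rng = np.random.default_rng(2)
relu = lambda z: np.maximum(z, 0.0)

def net(x, Ws):
    """Feed-forward ReLU network with scalar output."""
    for W in Ws[:-1]:
        x = relu(W @ x)
    return (Ws[-1] @ x).item()

def random_direction(widths):
    """Draw a random parameter and normalize it to unit scale (Q = product of Frobenius norms)."""
    Ws = [rng.standard_normal((widths[j + 1], widths[j])) for j in range(len(widths) - 1)]
    Q = np.prod([np.linalg.norm(W) for W in Ws])
    return [W / Q ** (1 / len(Ws)) for W in Ws]

n, widths, alpha = 40, (3, 4, 1), 0.05
X = rng.standard_normal((n, 3))

directions = [random_direction(widths) for _ in range(100)]     # crude proxy for the supremum
G = np.array([[net(x, Ws) for x in X] for Ws in directions])    # outputs, shape (directions, n)

sups = []
for _ in range(200):                                            # Monte Carlo over noise draws
    u = rng.standard_normal(n)                                  # sub-Gaussian (here Gaussian) noise
    sups.append(np.max(np.abs(2.0 / n * G @ u)))                # approximate effective noise

lam_alpha = float(np.quantile(sups, 1 - alpha))                 # empirical (1 - alpha)-quantile
assert lam_alpha > 0
```

Any tuning parameter at least as large as this quantile controls the (approximated) effective noise with probability about 1 − α over the noise draws.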

To measure the accuracy of the regularized estimators, we consider the in-sample prediction error with respect to the data generating function :

(7)

This is a standard measure for how well the data generating function is learned. We find the following guarantee in this measure:

###### Theorem 1 (Prediction guarantee).

Assume that for a . Then,

with probability at least .

The bound is an analog of what has been called a sparsity bound in high-dimensional linear regression (Lederer et al., 2019). For neural networks, however, it is the first such bound. It states that the squared prediction error of the regularized estimator is governed by an approximation error, or squared bias, and an excess error, or variance. In other words, the estimator is guaranteed to have a small prediction error if (i) the quantile is small and (ii) the data generating function can be represented well by a neural network with reasonably small scale. A typical example for (i) is provided in the following section; recent results on approximation theory support (ii) especially for wide and deep networks (Yarotsky, 2017).

Since the effective noise is a supremum over an empirical process, it allows us to connect our statistical theory with the theory of empirical processes. Deviation inequalities that bound such suprema have been established even for noise that has very heavy tails (Lederer and van de Geer, 2014). In Section 3, we derive an explicit bound for the quantile for ℓ1-regularization and sub-Gaussian noise. Crucial in this derivation, and in controlling the effective noise in general, is that the index set of the empirical process is the constrained parameter space rather than the entire parameter space. This key feature is due to our novel way of regularizing.

The standard parametrization of neural networks is ambiguous:
there are typically uncountably many parameters that yield the same network .
This ambiguity remains in our new framework with .
But importantly,
our guarantees hold for *every* solution .

## 3 An example: ℓ1-regularization

Theorem 1 can be specialized readily to different types of regularization. Indeed, concrete bounds for the prediction error follow directly from concrete bounds for the quantiles. We highlight this feature in the case of ℓ1-regularization; that is, we define the regularizing function as

To fix ideas, we impose two assumptions on the activation functions and the noise: First, we assume that the activation functions satisfy and are -Lipschitz continuous for a constant and with respect to the Euclidean norms on their input and output spaces:

This assumption is satisfied by many popular activation functions: for example, the coordinates of the activation functions could be ReLU functions (Nair and Hinton, 2010), “leaky” versions of ReLU, ELU functions (Clevert et al., 2015), hyperbolic tangent functions, or SiL/Swish functions (Ramachandran et al., 2017; Elfwing et al., 2018). Feasible Lipschitz constants for these examples are about 1.1 for SiL/Swish and 1 for all other functions.
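These Lipschitz constants can be sanity-checked numerically; the sketch below (our own illustration) estimates the maximal slope of each activation by finite differences on a fine grid:

```python
import numpy as np

z = np.linspace(-10.0, 10.0, 200001)   # fine grid; step 1e-4
h = z[1] - z[0]

def max_slope(f):
    """Finite-difference estimate of the Lipschitz constant of f on the grid."""
    return float(np.max(np.abs(np.diff(f(z)) / h)))

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
slopes = {
    "relu": max_slope(lambda t: np.maximum(t, 0.0)),
    "tanh": max_slope(np.tanh),
    "swish": max_slope(lambda t: t * sigmoid(t)),   # SiL/Swish: z times sigmoid(z)
}

assert slopes["relu"] <= 1.0 + 1e-8
assert slopes["tanh"] <= 1.0 + 1e-8
assert 1.0 < slopes["swish"] < 1.1      # Swish is Lipschitz with constant just below 1.1
```

The estimate confirms that ReLU and the hyperbolic tangent are 1-Lipschitz, while the maximal slope of Swish slightly exceeds 1 but stays below 1.1.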

Second, we assume that the noise variables are independent, centered, and uniformly sub-Gaussian for constants [van de Geer, 2000, Page 126; Vershynin, 2018, Section 2.5]:

We then find the following prediction guarantee for the estimator in (4):

###### Theorem 2 (Prediction guarantee for ℓ1-regularization).

Assume that , where is a constant that depends only on the sub-Gaussian parameters and of the noise. Then, for large enough,

with probability at least .

The bound establishes essentially a 1/√n-dependence on the sample size, a sub-linear dependence on the number of hidden layers (note that the Lipschitz constants are close to one for typical activation functions), and a logarithmic dependence on the number of parameters. The dependencies on the sample size and the number of parameters match those of standard bounds in ℓ1-regularized linear regression (Hebiri and Lederer, 2013). But one can argue that the logarithmic dependence on the number of parameters is even more crucial for neural networks: already a small network involves more than a million parameters, which highlights that neural networks typically have very large parameter counts.
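To see how quickly the parameter count grows, the snippet below counts the weights of a hypothetical small architecture (the widths are our own example, not taken from the text):

```python
def num_parameters(widths):
    """Total number of weights in a fully connected network with the given layer widths."""
    return sum(widths[j] * widths[j + 1] for j in range(len(widths) - 1))

# a hypothetical 'small' network: 100 inputs, two hidden layers of width 1000, one output
P = num_parameters([100, 1000, 1000, 1])
assert P == 1_101_000   # already more than a million parameters
```

A logarithmic dependence turns such counts into manageable factors in the bound, whereas a linear dependence would dominate it.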

As an illustration, we can simplify Theorem 2 further in a parametric setting:

###### Corollary 1 (Parametric setting).

Assume that and that there exist parameters such that for all . Then, for large enough,

with probability at least .

The above choice of the regularizing function is not the only way to formulate ℓ1-constraints; other, essentially equivalent formulations are possible. The proofs and results remain virtually the same, and one may choose in practice whatever regularizer is more appropriate or easier to compute. More broadly, our theory provides a general scheme for deriving prediction guarantees that could account for different regularizers (such as grouped versions of ℓ1), activation functions (such as non-Lipschitz functions), and noise (such as heavy-tailed noise) through corresponding bounds for the effective-noise quantiles.

## 4 Related literature

In this section, we relate our results to the literature. An immediate difference between most papers mentioned below and ours is that their estimators are regularized through constraints, while we add regularizers to the objective functions. Adding regularization terms is more common in practice than adding constraints. More importantly, while the constraints always involve model parameters that are unknown in practice (such as good bounds on the sparsity level or the Frobenius norms of the weights), we detail how suitable tuning parameters for the regularization term relate to known quantities (such as the sample size or the number of parameters)—see, for example, Theorem 2.

Another difference is that most papers bound misclassification probabilities (the probability that a new input vector is mislabeled) or generalization errors (the square root of the expected squared difference between the estimator and the true data generating function evaluated at a new input vector), while we bound in-sample prediction errors (the square root of the averaged squared differences between the estimator and the true data generating function—*not* the outcome—evaluated on the available data).

The papers by Bartlett (1998), Bartlett and Mendelson (2002), Ledent et al. (2019), and Neyshabur et al. (2015b) derive bounds by using fat-shattering dimensions [Kearns and Schapire, 1994; Anthony and Bartlett, 2009, Section 11.3] or Rademacher/Gaussian complexities (Shalev-Shwartz and Ben-David, 2014, Chapter 26) of sparse or ℓ1-related classes of neural networks. Such bounds translate into misclassification bounds or risk bounds for empirical risk minimizers over those classes—see, for example, Bartlett (1998, Section 2) and Bartlett and Mendelson (2002, Theorem 8), respectively.

A bound for the misclassification probabilities of empirical risk minimization over ℓ1-balls is Bartlett (1998, Theorem 28). The measure of complexity used for deriving this bound is the fat-shattering dimension. A common denominator of their theory and our Section 3 is the ℓ1-regularization. But besides considering classification rather than prediction, their bounds differ in their dependence on the network architecture: for example, they allow for infinite widths, have a slightly different dependence on the input, and have an exponential rather than sub-linear dependence on the depth.

Other bounds for the misclassification probabilities and prediction errors of risk minimizers can be derived from Bartlett and Mendelson (2002) and Ledent et al. (2019). For example, Bartlett and Mendelson (2002, Theorem 18) entails bounds for ℓ1-regularized empirical risk minimization over two-layer neural networks; the bounds are similar to the ones in our Corollary 1 in that case. Ledent et al. (2019) derive guarantees that cater to classification with many classes.

Bounds for Rademacher complexities of network classes with general group-norm constraints are provided in Neyshabur et al. (2015b); a specification to ℓ1-constraints is their Corollary 2. A feature of their bounds is that they allow for infinite widths. But besides having a slightly different dependence on the input than our bounds in Theorem 2 and Corollary 2, their bounds are restricted to ReLU activation and have an exponential dependence on the number of hidden layers.

An approach for deriving guarantees that differs from the fat-shattering/Rademacher approaches relates to nonparametric statistics (Schmidt-Hieber, 2017). That paper provides upper and lower bounds for empirical risk minimization over classes of sparse networks. A feature of these bounds is that they apply beyond the exact empirical risk minimizer; in particular, they provide insights into how inaccuracies in computing empirical risk estimators affect the estimators’ risks. A connection to our results is that ℓ1-regularization typically induces sparsity. But the estimators, bounds, proofs, and the framework more generally in Schmidt-Hieber (2017) differ largely from ours; for example:

1. Empirical risk minimization over sparse networks requires prior knowledge about the necessary sparsity level.
2. In contrast to our setup, theirs assumes that the parameters are bounded by one, that the network functions are bounded, and that the noise is standard normal.
3. The assumptions on the setup and the restriction to sparse networks ensure small covering numbers (while we leverage our novel parametrization to bound covering numbers) and change the range of potential proof techniques more generally.
4. The results in Schmidt-Hieber (2017) are limited to ReLU activation, while our results hold for arbitrary activations (Section 2.2) or a wide variety of activations (Section 3).

Empirical risk minimizers over sparse networks such as in Schmidt-Hieber (2017) are very different from practical estimators, especially because they (i) are based on non-convex objective functions and (ii) are computationally intractable given the combinatorial constraints. Our objective functions are still non-convex, but in view of our general and practical regularization strategy, we make considerable progress toward closing this gap. In particular, since ℓ1-regularization is well known to induce sparsity, one could see our ℓ1-approach in Section 3 as a more practical version of the sparsity constraint in Schmidt-Hieber (2017).

Although from very different angles, both Schmidt-Hieber (2017) and our paper highlight that estimating neural networks can benefit from sparsity. Kohler and Langer (2019), in contrast, argue that sparsity might not be necessary for classes of extremely deep networks (in particular, they consider depths that increase polynomially in the number of samples), but such architectures are currently not used in practice.

The rates of the bounds in our Section 3 as well as in most of the other papers mentioned are essentially 1/√n in the number of samples. In contrast, the results in Schmidt-Hieber (2017) indicate the possibility of rates as fast as 1/n. But one can verify that these faster rates follow only under very restrictive assumptions, and we believe that the 1/√n-rate cannot be improved in general: while a formal proof still needs to be established, a corresponding statement has already been proved for ℓ1-regularized linear regression (Dalalyan et al., 2017, Proposition 4). In this sense, we might claim some optimality of our results.

Feng and Simon (2017) provide theoretical and practical insights into the effect of group regularization at the input level.

Recent insights into the roles of widths and depths in Rademacher analyses of neural networks with Frobenius norm constraints are provided in Golowich et al. (2018). In particular, the bounds in Golowich et al. (2018, Theorem 1) share our Theorem 2’s sub-linear dependence on the number of layers. Hence, both Golowich et al. (2018) and our paper highlight that even very deep networks can be learned effectively.

## 5 Further technical results

We now establish Lipschitz and complexity properties of neural networks. These results are used in our proofs but might also be of interest in their own right. To start, we define operator norms of the parameters and of the weight matrices, where the operator norm of a matrix is its largest singular value. We also define Frobenius norms of the parameter and of the weight matrices, and we equip vectors with the Euclidean norm. And finally, the prediction distance of any two networks is

and similarly,

The Lipschitz property of neural networks is then as follows.

###### Proposition 2 (Lipschitz property of NNs).

Assume that the *activation functions*
are -Lipschitz with respect to the Euclidean norms on their input and output spaces. Then, it holds for every and that

with .

And similarly, it holds that

with .

This property is extremely helpful in bounding the quantiles of the empirical processes. The above inequalities do not guarantee that the networks are Lipschitz in the parameter in general (because the Lipschitz factors depend on the parameters themselves), but they guarantee that the networks are Lipschitz and bounded over the typical sets that originate from our regularization scheme. Hence, Proposition 2 entails the following:
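As a quick numerical companion to Proposition 2, the sketch below (our own toy check of a closely related fact: Lipschitz continuity in the input, rather than in the parameters) verifies that a ReLU network is Lipschitz with constant at most the product of the operator norms of its weight matrices:

```python
import numpy as np

rng = np.random.default_rng(3)
relu = lambda z: np.maximum(z, 0.0)

def net(x, Ws):
    """Feed-forward ReLU network."""
    for W in Ws[:-1]:
        x = relu(W @ x)
    return Ws[-1] @ x

Ws = [rng.standard_normal((6, 4)), rng.standard_normal((6, 6)), rng.standard_normal((1, 6))]
lip = float(np.prod([np.linalg.norm(W, 2) for W in Ws]))   # product of largest singular values

for _ in range(100):   # the bound must hold for every pair of inputs
    x, y = rng.standard_normal(4), rng.standard_normal(4)
    gap = np.linalg.norm(net(x, Ws) - net(y, Ws))
    assert gap <= lip * np.linalg.norm(x - y) + 1e-9
```

The argument is the same as in the proposition: each linear map contracts by at most its operator norm, and 1-Lipschitz activations preserve the bound layer by layer.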

###### Corollary 2 (Lipschitz and boundedness on ).

To derive the complexity properties, we use covering and entropy numbers, where the entropy is the logarithm of the covering number of a set with respect to a (pseudo-)norm on an ambient space (van der Vaart and Wellner, 1996, Page 98). We use these numbers to define a complexity measure for a collection of networks by

(8)

for (van de Geer, 2000, Section 3.3).
Almost in line with standard terminology, we call this complexity measure the *Dudley integral* (Vershynin, 2018, Section 8.1).
We can bound the complexity of the class of neural networks that have parameters in the constraint set as follows:

###### Proposition 3 (Complexity properties of NNs).

Assume that the *activation functions*
are -Lipschitz continuous with respect to the Euclidean norms on their input and output spaces.
Then, it holds for every and that satisfy that

and

where we recall that .

## 6 Additional materials and proofs

We now state some auxiliary results and then prove our claims.

### 6.1 Additional materials

We first provide four auxiliary results that we use in our proofs. We start with a slightly adapted version of van de Geer (2000, Corollary 8.3):

###### Lemma 1 (Suprema over Gaussian processes).

Consider a set and a constant such that

. Assume that the noise random variables

are independent, centered, and uniformly sub-Gaussian as specified on Page 3. Then, there is a constant that depends only on the sub-Gaussian parameters such that, for all that satisfy the required conditions, it holds that

This result is used to bound the effective noise.

We then turn to a Lipschitz property of metric entropy:

###### Lemma 2 (Entropy transformation for Lipschitz functions).

Consider sets and and a metric . Assume that for every and and a fixed function . Then,

where .

We use the convention for . The result allows us to bound entropies on the parameter spaces instead of the network spaces. We prove the lemma in the following section.

We continue with a standard bound on entropies [van der Vaart and Wellner, 1996, Page 94; van de Geer, 2000, Page 20]:

###### Lemma 3 (Entropy of a ball).

Let be a ball in d-dimensional Euclidean space with radius . Then,

This result is used for bounding the entropies over the parameter spaces.
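Combining a ball-entropy bound of this type with the Dudley integral in (8) gives concrete numbers. The sketch below assumes a standard form of the bound, H(u) ≤ d·log(1 + 2r/u), which may differ from the lemma's exact constants, and evaluates the entropy integral numerically:

```python
import numpy as np

def entropy_ball(u, d, r):
    """Assumed entropy bound for a Euclidean ball of radius r in d dimensions."""
    return d * np.log1p(2.0 * r / u)

def dudley_integral(delta, d, r, grid=20000):
    """Numerically approximate the integral of sqrt(H(u)) over (0, delta], cf. (8)."""
    u = np.linspace(delta / grid, delta, grid)
    f = np.sqrt(entropy_ball(u, d, r))
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(u)))  # trapezoidal rule

J_half = dudley_integral(0.5, d=50, r=1.0)
J_full = dudley_integral(1.0, d=50, r=1.0)
assert 0.0 < J_half < J_full   # the integral grows with the radius delta
```

The logarithmic singularity of the entropy at u = 0 is integrable, which is exactly why Dudley-type integrals over parameter balls stay finite and enter the bounds only through mild factors.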

We conclude with a deviation inequality for the noise.

###### Lemma 4 (Deviation of sub-Gaussian noise).

Assume that the noise variables are independent, centered, and uniformly sub-Gaussian as stipulated on Page 3. Then,

This deviation inequality is tailored to our needs in the proof of Theorem 2.

### 6.2 Proofs

We provide here the proofs of our claims.

#### 6.2.1 Proof of Proposition 1

###### Proof.

We prove the two directions in order.

*Direction 1:*
Fix a .
Assume first that for an .
In view of the definition of neural networks in (2) and the assumed non-negative homogeneity of the activation functions, it then holds that

for all . Therefore, and all satisfy , as desired.

Assume now that for all . Define and if . We need to show that 1. and and 2. .

Since is assumed positive definite, it holds that and, therefore, . The fact that also ensures that the parameter is well-defined, and we can invoke the assumed non-negative homogeneity of degree of to derive

This verifies 1.

We can then invoke the assumed non-negative homogeneity of degree 1 of the activation functions to derive for all that

This verifies 2.

*Direction 2:* Fix a and a ,
and
define .
We then invoke the assumed non-negative homogeneity of degree 1 of the activation functions to derive for all that

as desired. ∎
