Statistical Guarantees for Regularized Neural Networks

05/30/2020 · Mahsa Taheri et al. · Ruhr University Bochum

Neural networks have become standard tools in the analysis of data, but they lack comprehensive mathematical theories. For example, there are very few statistical guarantees for learning neural networks from data, especially for classes of estimators that are used in practice or at least similar to those used in practice. In this paper, we develop a general statistical guarantee for estimators that consist of a least-squares term and a regularizer. We then exemplify this guarantee with ℓ_1-regularization, showing that the corresponding prediction error increases at most sub-linearly in the number of layers and at most logarithmically in the total number of parameters. Our results establish a mathematical basis for regularized estimation of neural networks, and they deepen our mathematical understanding of neural networks and deep learning more generally.


1 Introduction

Neural networks have proved extremely useful across a variety of applications, including speech recognition (Hinton et al., 2012; Graves et al., 2013; Chorowski et al., 2015), natural language processing (Jozefowicz et al., 2016), object categorization (Girshick et al., 2014; Szegedy et al., 2015), and image segmentation (Long et al., 2015; Badrinarayanan et al., 2017). But our mathematical understanding of neural networks and deep learning has not developed at the same speed.

A central objective is to equip methods for learning neural networks with statistical guarantees. Some guarantees are available for unconstrained estimators (Anthony and Bartlett, 2009), but these bounds are linear in the number of parameters, which conflicts with the large sizes of typical networks. The focus has thus shifted to estimators that involve constraints or regularizers. Estimators with ℓ_1-regularizers have recently surged in popularity (Bartlett, 1998; Bartlett and Mendelson, 2002; Anthony and Bartlett, 2009; Barron and Klusowski, 2018, 2019; Liu and Ye, 2019), motivated by the success of this type of regularization in linear regression (Tibshirani, 1996), compressed sensing (Candès et al., 2006; Donoho, 2006), and many other parts of data science. A key feature of ℓ_1-regularization is that it is easy to include in optimization schemes and, at the same time, induces sparsity, which has a number of favorable effects in deep learning (Glorot et al., 2011). There has been some progress on guarantees for least-squares estimation with constraints based on the sparsity of the networks (Schmidt-Hieber, 2017) or on group-type norms of the weights (Neyshabur et al., 2015b). These developments have provided valuable intuition, for example, about the role of network widths and depths, but important problems remain: the combinatorial constraints in the first paper render the corresponding estimators infeasible in practice, the exponential dependence of the bounds in the second paper is contrary to the trend toward very deep networks, and practitioners usually prefer regularized rather than constrained formulations of the estimators. More generally, many questions about the statistical properties of constrained and regularized estimation of neural networks remain open.

In this paper, we introduce a general class of regularized least-squares estimators. Our strategy is to disentangle the parameters into a “scale” and a “direction”—similarly to introducing polar coordinates—which allows us to focus the regularization on a one-dimensional parameter. We call our approach scale regularization. We then equip the scale-regularized least-squares estimators with a general statistical guarantee for prediction. A main feature of this guarantee is that it connects neural networks to standard empirical process theory through a quantity that we call the effective noise. This connection makes it straightforward to specialize the bound to different types of regularization.

In a second step, we exemplify the general bound for ℓ_1-regularization. We obtain a guarantee for the squared prediction error that decreases essentially as the inverse square root of the number of samples, is sub-linear in the number of hidden layers, and is logarithmic in the total number of parameters. This result suggests that ℓ_1-regularization can ensure accurate prediction even for very wide and deep networks.

The organization of the paper is as follows: In Section 2, we introduce our regularization scheme and establish a general prediction bound. In Section 3, we specialize this bound to ℓ_1-regularization. In Section 4, we compare our results to the related literature. In Section 5, we establish further mathematical properties of neural networks. In Section 6, we state all proofs. In Section 7, we conclude the paper.

2 Scale regularization for neural networks

We first establish an alternative parametrization of neural networks and use this parametrization to define our regularization strategy. We then provide a prediction guarantee for the corresponding estimators.

2.1 Alternative parametrization

Consider data that follow a regression model

(1)

for an unknown data generating function. We are interested in estimating an approximation of this function based on neural networks. Following standard approaches, we consider feed-forward neural networks of the form

(2)

indexed by a network parameter that collects the weight matrices of the individual layers. The inputs and outputs in the model (1) serve as the network's inputs and outputs, respectively, and the remaining terms in (1) are the noise variables. For ease of notation, we assume that the inputs are fixed; generalizations to random inputs are straightforward. The network's architecture is specified by the number of hidden layers, or depth, and by the number of neurons in each layer, or widths. The zeroth layer is the input layer, and the final layer is the output layer. The functions applied between the layers are called activation functions. We omit shifts in the activation functions for notational simplicity, but such shifts can often be incorporated as additional neurons (Barron and Klusowski, 2018).
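For illustration, the following Python sketch evaluates a feed-forward network of this form; the ReLU activation, the widths, and the random weights are hypothetical choices made only for this example, and shifts are omitted as in the text.

    import numpy as np

    def feed_forward(weights, x, activation=lambda z: np.maximum(z, 0.0)):
        # Evaluate W_L s(W_{L-1} s( ... s(W_1 x))) with a coordinate-wise activation s
        # applied after every layer except the last.
        z = x
        for W in weights[:-1]:
            z = activation(W @ z)
        return weights[-1] @ z

    # Hypothetical architecture: 4 inputs, two hidden layers of width 5, one output.
    rng = np.random.default_rng(0)
    widths = [4, 5, 5, 1]
    weights = [rng.normal(size=(widths[l + 1], widths[l])) for l in range(len(widths) - 1)]
    print(feed_forward(weights, rng.normal(size=4)))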

The parameter space in the above formulation is the collection of all such tuples of weight matrices. In the following, however, we propose an alternative parametrization. We say that a function on this parameter space is non-negative homogeneous of degree k if rescaling its argument by a factor t ≥ 0 rescales its value by t^k, and we say that it is positive definite if it is non-negative and vanishes only at the origin. The corresponding properties for functions on Euclidean spaces are defined accordingly. For example, every norm is non-negative homogeneous of degree 1 and positive definite. We then find the following:

Proposition 1 (Equivalence between neural networks).

Assume that the activation functions are non-negative homogeneous of degree 1. Consider a function on the parameter space that is non-negative homogeneous of some positive degree and positive definite, and consider the corresponding unit ball, that is, the set of parameters at which this function takes a value of at most one. Then, for every network parameter, there exist a non-negative scale and a parameter in the unit ball such that the corresponding networks coincide, and vice versa: for every non-negative scale and every parameter in the unit ball, there exists a network parameter for which the same equality holds.

The proposition shows that the standard parametrization of neural networks over the original parameter space can be replaced by a parametrization through a scale and a direction in the unit ball. The equivalence requires the activation functions to be non-negative homogeneous (ReLU activations are popular examples), but it motivates a different parametrization of neural networks more generally. We thus change the parameter space for estimating the true data generating function to the product of the non-negative scales and the unit ball, and the corresponding space of networks accordingly (which, of course, equals the original space of networks under the conditions of the proposition). In other words, we study neural networks

(3)

indexed by a scale parameter and a direction parameter. We can interpret the former as the network's “scale” and the latter as the network's “orientation.”
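For intuition, here is a small numerical check of this equivalence in the case of ReLU activations and the ℓ_1-norm as the normalizing function (both are convenient examples rather than the paper's specific choices): dividing every weight by the total ℓ_1-norm of the parameter yields a direction in the unit ball, and a single non-negative scale recovers the original network.

    import numpy as np

    def relu_net(weights, x):
        z = x
        for W in weights[:-1]:
            z = np.maximum(W @ z, 0.0)
        return weights[-1] @ z

    rng = np.random.default_rng(1)
    widths = [3, 4, 4, 1]
    weights = [rng.normal(size=(widths[l + 1], widths[l])) for l in range(len(widths) - 1)]
    x = rng.normal(size=3)

    omega = sum(np.abs(W).sum() for W in weights)   # l1-norm of the (nonzero) parameter
    gamma = [W / omega for W in weights]            # direction: total l1-norm equal to one
    lam = omega ** len(weights)                     # scale: the network is 1-homogeneous in each weight matrix

    assert np.allclose(relu_net(weights, x), lam * relu_net(gamma, x))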

2.2 Estimation

The most basic approach to fit the model parameters of the network (2) to the model (1) is the least-squares estimator

But to account for the high dimensionality of the parameter space, the least-squares estimator is often complemented with a regularizer; popular choices are the ℓ_1-norm (Zhang et al., 2016) or group versions of it (Scardapane et al., 2017). A straightforward way to incorporate such regularizers is

where  is a tuning parameter. But in neural network frameworks, it turns out to be difficult to analyze such estimators statistically.

We introduce, therefore, a different way to incorporate regularizers. The approach is based on our new parametrization. The equivalent of the above least-squares estimator in the framework (3) is

The resulting estimator coincides with the least-squares estimator above under the conditions of Proposition 1, but we can take it as a starting point more generally. This allows us to focus the regularization on the scale parameter; in other words, we propose the estimators

(4)

where  is a tuning parameter. The fixed constraint on the direction captures the type of regularization (such as ℓ_1), while the actual regularization concerns only the scale. We thus call our approach scale regularization.
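The following Python snippet sketches one natural reading of (4): a least-squares term plus the tuning parameter times the scale, with the direction constrained to the ℓ_1 unit ball. It also exploits that, for a fixed direction, such an objective is convex in the scale and can be minimized in closed form. The data, architecture, and constants are hypothetical.

    import numpy as np

    def relu_net(weights, x):
        z = x
        for W in weights[:-1]:
            z = np.maximum(W @ z, 0.0)
        return weights[-1] @ z                                 # shape (1,) for a scalar output

    def objective(lam, gamma, X, y, r):
        # Least-squares term plus scale regularization: sum_i (y_i - lam * f_gamma(x_i))^2 + r * lam.
        preds = np.array([relu_net(gamma, x) for x in X]).ravel()
        return np.sum((y - lam * preds) ** 2) + r * lam

    def best_scale(gamma, X, y, r):
        # For a fixed direction, the objective is a convex parabola in lam; minimize over lam >= 0.
        g = np.array([relu_net(gamma, x) for x in X]).ravel()
        return max(0.0, float(g @ y - r / 2.0) / (float(g @ g) + 1e-12))

    rng = np.random.default_rng(2)
    n, widths, r = 50, [3, 4, 1], 1.0
    X = rng.normal(size=(n, 3))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)             # synthetic responses

    gamma = [rng.normal(size=(widths[l + 1], widths[l])) for l in range(len(widths) - 1)]
    total = sum(np.abs(W).sum() for W in gamma)
    gamma = [W / total for W in gamma]                         # candidate direction in the l1 unit ball

    lam = best_scale(gamma, X, y, r)
    print(lam, objective(lam, gamma, X, y, r))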

The concentration of the regularization on a one-dimensional parameter greatly facilitates the statistical analysis. Specifically, it will allow us to focus our attention on

(5)

which can be thought of as the neural network equivalent of what high-dimensional linear regression refers to as the effective noise (Lederer and Vogt, 2020). We need to ensure—just as in high-dimensional linear regression—that the effective noise is controlled by the tuning parameter with high probability. In this spirit, we define quantiles of the effective noise for a given level through

(6)

In other words, this quantile is the smallest tuning parameter that controls the effective noise at the given level.
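To make the role of the quantile concrete, the following rough Monte Carlo sketch approximates such a quantile under strong simplifying assumptions: the effective noise is taken to be the supremum of normalized inner products between the noise and the network outputs over directions in the ℓ_1 unit ball, the supremum is replaced by a maximum over a few hundred randomly sampled directions, and the noise is simulated as standard normal. None of these choices is prescribed by the theory; the sketch only illustrates the idea.

    import numpy as np

    def relu_net(weights, x):
        z = x
        for W in weights[:-1]:
            z = np.maximum(W @ z, 0.0)
        return (weights[-1] @ z).item()

    def random_direction(rng, widths):
        # A random parameter rescaled onto the l1 unit sphere (crude stand-in for the sup over the ball).
        W = [rng.normal(size=(widths[l + 1], widths[l])) for l in range(len(widths) - 1)]
        total = sum(np.abs(M).sum() for M in W)
        return [M / total for M in W]

    rng = np.random.default_rng(3)
    n, widths, level = 100, [3, 4, 1], 0.05
    X = rng.normal(size=(n, 3))
    directions = [random_direction(rng, widths) for _ in range(200)]
    outputs = np.array([[relu_net(g, x) for x in X] for g in directions])    # 200 x n

    draws = []
    for _ in range(500):
        u = rng.normal(size=n)                                               # simulated noise
        draws.append(2.0 / n * np.max(np.abs(outputs @ u)))                  # assumed form of the effective noise
    print("approximate quantile at level", level, ":", np.quantile(draws, 1 - level))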

To measure the accuracy of the regularized estimators, we consider the in-sample prediction error with respect to the data generating function :

(7)

This is a standard measure for how well the data generating function is learned. We find the following guarantee in this measure:

Theorem 1 (Prediction guarantee).

Assume that  for a . Then,

with probability at least .

The bound is an analog of what has been called a sparsity bound in high-dimensional linear regression (Lederer et al., 2019). For neural networks, however, it is the first such bound. It states that the squared prediction error of the regularized estimator is governed by an approximation error (a squared bias) and an excess error (a variance). In other words, the estimator is guaranteed to have a small prediction error if (i) the quantile of the effective noise is small and (ii) the data generating function can be represented well by a neural network with a reasonably small scale. A typical example for (i) is provided in the following section; recent results in approximation theory support (ii) especially for wide and deep networks (Yarotsky, 2017).

Since the effective noise is the supremum of an empirical process, it allows us to connect our statistical theories with theories on empirical processes. Deviation inequalities that bound such suprema have been established even for noise that has very heavy tails (Lederer and van de Geer, 2014). In Section 3, we derive an explicit bound on the quantiles for ℓ_1-regularization and sub-Gaussian noise. Crucial in this derivation, and in controlling the effective noise in general, is that the index set of the empirical process is the constrained parameter space (the unit ball) rather than the entire parameter space. This key feature is due to our novel way of regularizing.

The standard parametrization of neural networks is ambiguous: there are typically uncountably many parameters that yield the same network. This ambiguity remains in our new framework. Importantly, however, our guarantees hold for every solution of (4).
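A simple instance of this ambiguity for ReLU networks: permuting the hidden neurons, that is, the rows of one weight matrix together with the corresponding columns of the next, changes the parameter but not the network function. A minimal check with hypothetical weights:

    import numpy as np

    def relu_net(weights, x):
        z = x
        for W in weights[:-1]:
            z = np.maximum(W @ z, 0.0)
        return weights[-1] @ z

    rng = np.random.default_rng(5)
    W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(1, 4))
    perm = rng.permutation(4)
    x = rng.normal(size=3)

    # Two different parameters, one and the same network.
    assert np.allclose(relu_net([W1, W2], x), relu_net([W1[perm], W2[:, perm]], x))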

3 An example: ℓ_1-regularization

Theorem 1 can be specialized readily to different types of regularization. Indeed, concrete bounds for the prediction error follow directly from concrete bounds for the quantiles of the effective noise. We highlight this feature in the case of ℓ_1-regularization, that is, we take the constraint function to be the ℓ_1-norm of the network parameter (the sum of the absolute values of all weights).

To fix ideas, we impose two assumptions on the activation functions and the noise: First, we assume that the activation functions vanish at zero and are Lipschitz continuous, for some constant, with respect to the Euclidean norms on their input and output spaces.

This assumption is satisfied by many popular activation functions: for example, the coordinates of the activation functions could be ReLU functions x ↦ max{x, 0} (Nair and Hinton, 2010), "leaky" versions of ReLU x ↦ max{x, cx} for a constant c ∈ (0, 1), ELU functions x ↦ x for x ≥ 0 and x ↦ c(e^x − 1) for x < 0 with a constant c ∈ (0, 1] (Clevert et al., 2015), hyperbolic tangent functions x ↦ tanh(x), or SiL/Swish functions x ↦ x/(1 + e^{−x}) (Ramachandran et al., 2017; Elfwing et al., 2018). Feasible Lipschitz constants for these examples are about 1.1 for SiL/Swish and 1 for all other functions.
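As a quick numerical illustration, the following snippet implements the listed activations (with illustrative parameter choices: slope 0.1 for the leaky ReLU and constant 1 for the ELU) and estimates their Lipschitz constants as the largest slope over a fine grid; the estimates match the constants quoted above.

    import numpy as np

    # All of these activations vanish at zero.
    activations = {
        "relu":  lambda x: np.maximum(x, 0.0),
        "leaky": lambda x: np.maximum(x, 0.1 * x),
        "elu":   lambda x: np.where(x >= 0, x, np.exp(x) - 1.0),
        "tanh":  np.tanh,
        "swish": lambda x: x / (1.0 + np.exp(-x)),             # SiL/Swish: x times the sigmoid of x
    }

    grid = np.linspace(-10.0, 10.0, 200001)
    for name, f in activations.items():
        value_at_zero = float(f(np.array([0.0]))[0])
        max_slope = np.max(np.abs(np.diff(f(grid)) / np.diff(grid)))
        print(f"{name:5s}  f(0) = {value_at_zero:+.1f}  max slope ~ {max_slope:.3f}")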

Second, we assume that the noise variables  are independent, centered, and uniformly sub-Gaussian for constants  [van de Geer, 2000, Page 126; Vershynin, 2018, Section 2.5]:
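One standard formulation of uniform sub-Gaussianity (van de Geer, 2000) requires constants K and σ_0 with K^2 (E[exp(ε_i^2/K^2)] − 1) ≤ σ_0^2 for all i. As a quick sanity check, standard normal noise satisfies such a condition, for example with K = 3:

    import numpy as np

    # Monte Carlo check of K^2 * (E[exp(eps^2 / K^2)] - 1) for standard normal noise and K = 3.
    rng = np.random.default_rng(6)
    eps = rng.normal(size=1_000_000)
    K = 3.0
    print(K ** 2 * (np.mean(np.exp(eps ** 2 / K ** 2)) - 1.0))   # about 1.2, so sigma_0^2 = 2 suffices here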

Using the shorthands , and , we then find the following prediction guarantee for the estimator in (4):

Theorem 2 (Prediction guarantee for -regularization).

Assume that , where  is a constant that depends only on the sub-Gaussian parameters  and  of the noise. Then, for  large enough,

with probability at least .

The bound establishes essentially an inverse-square-root dependence on the sample size, a sub-linear dependence on the number of hidden layers (note that the Lipschitz constants equal one for typical activation functions), and a logarithmic dependence on the number of parameters. The dependencies on the sample size and the number of parameters match those of standard bounds in ℓ_1-regularized linear regression (Hebiri and Lederer, 2013). But one can argue that the logarithmic dependence on the number of parameters is even more crucial for neural networks: already a small network with moderate width and depth involves many thousands of parameters, which highlights that neural networks typically involve very many parameters.
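To get a sense of the size of the total number of parameters, consider the following toy computation with a hypothetical fully connected architecture (the widths are illustrative, not taken from the results above): the number of weights quickly reaches the tens of thousands, whereas its logarithm stays around ten.

    import math

    # Weights of a fully connected network with the given layer widths (no shifts).
    widths = [100, 100, 100, 100, 1]          # hypothetical: 100 inputs, three hidden layers of width 100
    P = sum(widths[l] * widths[l + 1] for l in range(len(widths) - 1))
    print(P, math.log(P))                     # 30100 parameters, log(P) is only about 10.3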

As an illustration, we can simplify Theorem 2 further in a parametric setting:

Corollary 1 (Parametric setting).

Assume that  and that there exist parameters  such that  for all . Then, for  large enough,

with probability at least .

The above choice is not the only way to formulate ℓ_1-constraints; other formulations lead to virtually the same proofs and results, and one may choose in practice whichever regularizer is more appropriate or easier to compute. More broadly, our theories provide a general scheme for deriving prediction guarantees that can account for different regularizers (such as grouped versions of ℓ_1), activation functions (such as non-Lipschitz functions), and noise (such as heavy-tailed noise) through corresponding bounds for the quantiles of the effective noise.

4 Related literature

In this section, we relate our results to the literature. An immediate difference between most of the papers mentioned below and ours is that their estimators are regularized through constraints, while we add regularizers to the objective functions. Adding regularization terms is more common in practice than adding constraints. More importantly, while the constraints always involve model parameters that are unknown in practice (such as good bounds on the sparsity level or on the Frobenius norms of the weights), we detail how suitable tuning parameters for the regularization term relate to known quantities (such as the sample size or the number of parameters)—see, for example, Theorem 2.

Another difference is that most papers bound misclassification probabilities (the probability that a new input vector is mislabeled) or generalization errors (the square root of the expected squared difference between the estimator and the true data generating function evaluated at a new input vector), while we bound in-sample prediction errors (the square root of the averaged squared differences between the estimator and the true data generating function—not the outcomes—evaluated on the available data).

The papers Bartlett (1998), Bartlett and Mendelson (2002), Ledent et al. (2019), and Neyshabur et al. (2015b) derive bounds by using fat-shattering dimensions [Kearns and Schapire, 1994; Anthony and Bartlett, 2009, Section 11.3] or Rademacher/Gaussian complexities (Shalev-Shwartz and Ben-David, 2014, Chapter 26) of sparse or ℓ_1-related classes of neural networks. Such bounds translate into misclassification bounds or risk bounds for empirical risk minimizers over those classes—see, for example, (Bartlett, 1998, Section 2) and (Bartlett and Mendelson, 2002, Theorem 8), respectively.

A bound for the misclassification probabilities of empirical risk minimization over ℓ_1-balls is Bartlett (1998, Theorem 28). The measure of complexity used for deriving these bounds is the fat-shattering dimension. A common denominator of their theories and our Section 3 is the ℓ_1-regularization. But besides considering classification rather than prediction, their bounds differ in their dependence on the network architecture: for example, they allow for infinite widths, have a slightly different dependence on the input, and have an exponential rather than sub-linear dependence on the depth.

Other bounds for the misclassification probabilities and prediction errors of risk minimizers can be derived from Bartlett and Mendelson (2002) and Ledent et al. (2019). For example, Bartlett and Mendelson (2002, Theorem 18) entails bounds for ℓ_1-regularized empirical risk minimization over two-layer neural networks; the bounds are similar to the ones in our Corollary 1 in the case of a single hidden layer. Ledent et al. (2019) derive guarantees that cater to classification with many classes.

Bounds for Rademacher complexities of network classes with general group-norm constraints are provided in Neyshabur et al. (2015b). A specification to ℓ_1-constraints is their Corollary 2. A feature of their bounds is that they allow for infinite widths. But besides having a slightly different dependence on the input than our bounds in Theorem 2 and Corollary 2, their bounds are restricted to ReLU activation and have an exponential dependence on the number of hidden layers.

An approach for deriving guarantees that is different from the fat-shattering/Rademacher approaches relates to nonparametric statistics (Schmidt-Hieber, 2017). That paper provides upper and lower bounds for empirical risk minimization over classes of sparse networks. A feature of these bounds is that they apply beyond the empirical risk minimizer; in particular, they provide insights into how inaccuracies in computing empirical risk estimators affect the estimators' risks. A connection to our results is that ℓ_1-regularization typically induces sparsity. But the estimators, bounds, proofs, and the framework more generally in Schmidt-Hieber (2017) differ largely from ours; for example: 1. Empirical risk minimization over sparse networks requires prior knowledge about the necessary sparsity level. 2. In contrast to our setup, theirs assumes that the parameters are bounded by one, that the network functions are bounded, and that the noise is standard normal. 3. The assumptions on the setup and the restriction to sparse networks ensure small covering numbers (while we leverage our novel parametrization to bound covering numbers) and change the range of potential proof techniques more generally. 4. The results in Schmidt-Hieber (2017) are limited to ReLU activation, while our results hold for arbitrary (Section 2.2) or a wide variety of activations (Section 3).

Empirical risk minimizers over sparse networks such as in Schmidt-Hieber (2017) are very different from practical estimators, especially because they (i) are based on non-convex objective functions and (ii) are computationally intractable given the combinatorial constraints. Our objective functions are still non-convex, but in view of our general and practical regularization strategy, we make considerable progress in closing this gap. In particular, since ℓ_1-regularization is well known to induce sparsity, one could see our ℓ_1-approach in Section 3 as a more practical version of the sparsity constraint in Schmidt-Hieber (2017).

Although from very different angles, both Schmidt-Hieber (2017) and our paper highlight that estimating neural networks can benefit from sparsity. Kohler and Langer (2019), in contrast, argue that sparsity might not be necessary for classes of extremely deep networks (in particular, they consider depths that increase polynomially in the number of samples), but such architectures are currently not used in practice.

The rates of the bounds in our Section 3, as well as in most of the other mentioned papers, are essentially one over the square root of the number of samples. In contrast, the results in Schmidt-Hieber (2017) indicate the possibility of rates as fast as one over the number of samples. But one can verify that these faster rates follow only under very restrictive assumptions, and we believe that the square-root rate cannot be improved in general: while a formal proof still needs to be established, a corresponding statement has already been proved for ℓ_1-regularized linear regression (Dalalyan et al., 2017, Proposition 4). In this sense, we might claim some optimality of our results.

Feng and Simon (2017) provide theoretical and practical insights into the effect of group regularization at the input level.

Recent insights into the roles of widths and depths in Rademacher analyses of neural networks with Frobenius norm constraints are provided in Golowich et al. (2018). In particular, the bounds in Golowich et al. (2018, Theorem 1) share our Theorem 2’s sub-linear dependence on the number of layers. Hence, both Golowich et al. (2018) and our paper highlight that even very deep networks can be learned effectively.

5 Further technical results

We now establish Lipschitz and complexity properties of neural networks. These results are used in our proofs but might also be of interest by themselves. To start, we work with operator norms of the parameters and of the individual weight matrices, where the operator norm of a matrix is its largest singular value. We also use Frobenius norms of the parameters and of the weight matrices, as well as the usual Euclidean norm for vectors. Finally, we measure the distance between two networks by their in-sample prediction distance, that is, the root-mean-square difference of their outputs over the observed inputs, and similarly for pairs of rescaled networks.
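The following sketch spells out one reading of these quantities for concreteness: the operator norm as the largest singular value, the Frobenius norm, and the in-sample prediction distance as the root-mean-square difference of two networks' outputs over the observed inputs. The architecture and weights are hypothetical, and the exact normalizations used in the proofs may differ.

    import numpy as np

    def relu_net(weights, x):
        z = x
        for W in weights[:-1]:
            z = np.maximum(W @ z, 0.0)
        return (weights[-1] @ z).item()

    rng = np.random.default_rng(4)
    W = rng.normal(size=(4, 3))
    operator_norm = np.linalg.norm(W, ord=2)        # largest singular value
    frobenius_norm = np.linalg.norm(W, ord="fro")   # square root of the sum of squared entries

    widths = [3, 4, 1]
    theta1 = [rng.normal(size=(widths[l + 1], widths[l])) for l in range(len(widths) - 1)]
    theta2 = [rng.normal(size=(widths[l + 1], widths[l])) for l in range(len(widths) - 1)]
    X = rng.normal(size=(50, 3))
    diffs = np.array([relu_net(theta1, x) - relu_net(theta2, x) for x in X])
    prediction_distance = np.sqrt(np.mean(diffs ** 2))     # in-sample distance between the two networks
    print(operator_norm, frobenius_norm, prediction_distance)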

The Lipschitz property of neural networks is then as follows.

Proposition 2 (Lipschitz property of NNs).

Assume that the activation functions  are -Lipschitz with respect to the Euclidean norms on their input and output spaces. Then, it holds for every  and  that

with .

And similarly, it holds that

with .

This property is extremely helpful in bounding the quantiles of the empirical processes. The above inequalities do not guarantee that the networks are Lipschitz in the parameter in general (because  and  depend on the parameters), but they guarantee that the networks are Lipschitz and bounded over typical sets that originate from our regularization scheme: For every , it holds that . Moreover, for , we get that . Hence, Proposition 2 entails the following:

Corollary 2 (Lipschitz and boundedness on ).

Under the conditions of Proposition 2, it holds for every  that

and that

with .

To derive the complexity properties, we use covering and entropy numbers of a set with respect to a (pseudo-)norm on an ambient space of that set (van der Vaart and Wellner, 1996, Page 98). We use these numbers to define a complexity measure for a collection of networks by

(8)

for positive radii (van de Geer, 2000, Section 3.3). Almost in line with standard terminology, we call this complexity measure the Dudley integral (Vershynin, 2018, Section 8.1). We can bound the complexity of the class of neural networks that have parameters in the constrained set as follows:

Proposition 3 (Complexity properties of NNs).

Assume that the activation functions  are -Lipschitz continuous with respect to the Euclidean norms on their input and output spaces. Then, it holds for every  and  that satisfy  that

and

where we recall that .

6 Additional materials and proofs

We now state some auxiliary results and then prove our claims.

6.1 Additional materials

We first provide four auxiliary results that we use in our proofs. We start with a slightly adapted version of van de Geer (2000, Corollary 8.3):

Lemma 1 (Suprema over Gaussian processes).

Consider a set  and a constant  such that . Assume that the noise random variables  are independent, centered, and uniformly sub-Gaussian as specified in Section 3. Then, there is a constant  that depends only on  and  such that for all  that satisfy  and

it holds that

This result is used to bound .

We then turn to a Lipschitz property of metric entropy:

Lemma 2 (Entropy transformation for Lipschitz functions).

Consider sets and  and a metric . Assume that  for every  and  and a fixed function . Then,

where .

We use the convention  for . The result allows us to bound entropies on the parameter spaces instead of the network spaces. We prove the lemma in the following section.

We continue with a standard bound on entropies [van der Vaart and Wellner, 1996, Page 94; van de Geer, 2000, Page 20]:

Lemma 3 (Entropy of a ball).

Let  be a ball in d-dimensional Euclidean space with radius . Then,

This result is used for bounding the entropies over the parameter spaces.
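As a rough illustration of how such an entropy bound enters a Dudley-type integral as in (8), the following snippet numerically integrates the square root of the generic covering bound N(u) ≤ (1 + 2r/u)^d for a d-dimensional ball of radius r; the constants are textbook ones and not necessarily those of the lemma. The computation illustrates that the integral grows only like the square root of the dimension.

    import numpy as np

    def entropy_integral(delta, d, r, num_points=100_000):
        # Riemann-sum approximation of the integral over (0, delta] of sqrt(d * log(1 + 2 r / u)) du.
        u = np.linspace(delta / num_points, delta, num_points)
        return float(np.sum(np.sqrt(d * np.log1p(2.0 * r / u))) * (u[1] - u[0]))

    for d in (10, 100, 1000):
        print(d, entropy_integral(delta=1.0, d=d, r=1.0))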

We conclude with a deviation inequality for the noise.

Lemma 4 (Deviation of sub-Gaussian noise).

Assume that the noise variables  are independent, centered, and uniformly sub-Gaussian as stipulated in Section 3. Then,

This deviation inequality is tailored to our needs in the proof of Theorem 2.

6.2 Proofs

We provide here the proofs of our claims.

6.2.1 Proof of Proposition 1

Proof.

We prove the two directions in order.

Direction 1: Fix a . Assume first that  for an . In view of the definition of neural networks in (2) and the assumed non-negative homogeneity of the activation functions, it then holds that

for all . Therefore,  and all  satisfy , as desired.

Assume now that  for all . Define  and  if . We need to show that 1.  and  and 2. .

Since  is assumed positive definite, it holds that  and, therefore, . The fact that  also ensures that the parameter  is well-defined, and we can invoke the assumed non-negative homogeneity of degree  of  to derive

This verifies 1.

We can then invoke the assumed non-negative homogeneity of degree 1 of the activation functions to derive for all  that

This verifies 2.

Direction 2: Fix a  and a , and define . We then invoke the assumed non-negative homogeneity of degree 1 of the activation functions to derive for all  that

as desired. ∎

6.2.2 Proof of Theorem 1

Proof.

Since  is a minimizer of the objective function in (4), we find for every  and  that

Replacing the ’s via the model in (1) then yields

Expanding the squares and rearranging terms, we get

We can then bound the sums on the second line to obtain