## 1. Introduction

We consider supervised learning using deep neural networks. Let

be i.i.d. observed pairs in with input space and output space . Write for a function generated by a neural network with layers and parameter matrices The total number of parameters is denoted by We assume that the set of possible network parameters is constrained to lie in a known parameter space Let be the loss function and denote by the output of the stochastic gradient descent (SGD) algorithm based on the empirical loss The expected loss (generalization error) is then given by . Throughout this manuscript, we assume that the loss is sufficiently regular such that the gradient exists for all . To achieve small generalization error, we are interested in bounds for the generalization gap

with the SGD output. In this work, we derive a new bound for the generalization gap by exploiting the tendency of SGD iterates to stay in neighborhoods of critical points of the loss surface. We then argue that for complex loss surfaces, this constrains the SGD iterates considerably and leads to implicit regularization. Throughout this article, we refer to this type of implicit regularization as loss surface implicit regularization.

### 1.1. Background

Deep learning is known to achieve outstanding performance in various complicated tasks, such as image recognition, text analysis, and reinforcement learning [1]. For some tasks, super-human performance has been reported [2, 3, 4]. Despite the impressive practical performance, there is still a gap regarding the theoretical understanding of deep learning. Obstacles in the theoretical foundation include the higher-order nonlinear structures due to the stacking of multiple layers and the excessive number of network parameters in state-of-the-art networks. For some recent surveys, see [5, 6].

Understanding the benefit of fitting large numbers of parameters requires rethinking generalization [7]. Indeed, according to standard statistical learning theory, large models overfit and thus increase the generalization error. Based on the classical analysis (e.g., the textbook [8]), the generalization gap for deep neural networks with layers and parameters is of the order

where stands for the Big O notation ignoring logarithmic factors and the index indicates that the rates are in probability. While bounds of this type can be used to prove optimal statistical convergence rates for sparsely connected neural networks [9, 10, 11], the bound is clearly not sharp enough to explain the success of fully connected, highly overparametrized models. Hence, a new theoretical framework is needed.

One possibility to derive sharper bounds on the generalization gap is to take the implicit regularization into account. Learning algorithms, such as SGD, implicitly regularize and constrain the degrees of freedom. In several specific models, including linear regression, logistic regression and linear neural networks, it has been shown that gradient descent methods converge to interpolants with minimum norm constraints [12, 13, 14, 15, 16, 17, 18, 19]. Motivated by this fact, several articles investigate the generalization gap under the assumption that a norm on the network parameters is bounded by a threshold. For example, [20] assumes that the network parameters lie in a norm-bounded subset with given and the Frobenius norm. It is shown that the generalization gap is then Interestingly, the bound only depends on the number of layers , the number of training data and the radii but not on the number of network parameters. This shows that norm control can avoid overfitting even in the case of overparametrization. [21] derives the generalization gap bound for the set of network parameters where denotes the sum of the Euclidean norms of the rows of the matrix and is the spectral norm. This bound has been extended and improved in subsequent work [22, 23]. In particular, [21] bounds the generalization gap by a product of the spectral norms of the parameter matrices. Alternatively, [24] derived a bound on the generalization gap involving the parameter distance between the initial value and a global minimizer. Section 1.3 provides a more comprehensive overview of related work.

The imposed constraints on the parameter norms might be violated in practice. [25] shows that during network training, the parameter matrices move far away from the origin and the initialization. Empirically, it is argued that the distance to the initialization increases polynomially with the number of training data. The claim is that only for simple models, implicit regularization favors small norms.

### 1.2. Summary of results

We first define suitable neighborhoods around the local minima of the loss surface. Under a number of conditions, we then prove that, with positive probability, SGD with Gaussian gradient noise enters such a neighborhood and does not escape it anymore. For this reason, we also call the neighborhoods stagnation sets. In a second step, we derive bounds for the complexity of these neighborhoods. Conditionally on the SGD iterates lying in one of these neighborhoods, we finally derive a generalization gap bound. Based on these results, we then argue that the loss surface itself constrains the SGD, resulting in the loss surface implicit regularization.

To define a suitable notion of a stagnation set, we consider a sequence of parameters generated by Gaussian SGD with an iteration index and a stopping time The initial value is and the iterates are defined via the update equation . Here, is a given learning rate and is a perturbed loss with Gaussian noise such that the updates resemble SGD (a formal definition is provided in the next section). The output of the method is . We define the notion of a stagnation set as follows:

###### Definition 1 (stagnation set).

For , is a -stagnation set, if the following holds with some :

For convenience, we omit the dependence on and only indicate the dependence on in the notation of a -stagnation set. Indeed, the dependence on is of little importance, as is fixed and considered to be small compared to the total number of iterations

Thus with probability at least , SGD will not leave the set anymore. Conditionally on this event, it is sufficient to control the Rademacher complexity of to bound the generalization gap. Conversely, a set is a -stagnation set if the output of the Gaussian SGD algorithm does not stagnate in the set. In this case, it is impossible to derive a generalization gap bound based on the Rademacher complexity of such a set. According to the empirical study in [25], SGD does not stagnate in the sets or defined in Section 1.1, implying that those are -stagnation sets for some small or even .

Figure 1: (Left) … layers trained using CIFAR-10 data. For the training, pruning of shortcuts is applied [2]. The two-dimensional projection is generated by dimension reduction of the dimensional parameter space using the random-direction method [26]. (Right) Population minima (red dots) and their neighbourhoods (blue balls) on the loss surface. We compute the population minima by utilizing the test data. For the visualization, the radius of all balls is set to in the dimension-reduced space.

As illustrated in Figure 1, the loss landscape of neural networks is highly non-convex and possesses multiple local minima whose neighborhoods form attractive basins. Note that in the definition above, a stagnation set is non-random. As candidates for good stagnation sets, we take the union over local neighborhoods of the minima of the expected loss surface.

###### Definition 2 (Population Minimum).

We call a subset a population minimum, if is a maximally connected set consisting of local minima of .

Let us emphasize again that those are minima of the expected loss surface and not the empirical loss surface. Consequently, the set

is deterministic. To see why a minimum does not simply consist of one parameter as in the case of a strictly convex loss surface, consider a ReLU network with activation function

Multiplying all parameters in one hidden layer of a ReLU network by and dividing all parameters in another hidden layer by gives the same network function. Thus, if then also For a general activation function, if a minimum corresponds to a parameter such that one of the weight matrices has a zero column vector, then some of the parameters of the previous layer do not affect the loss function. Hence, the population minimum can in principle consist of several parameters, regardless of the choice of activation function.

In Definition 3, we introduce a suitable notion of -neighborhood around denoted by A key object in our approach is the union of -neighbourhoods over several minima defined as

(1) |

The main idea underlying this definition is that the Gaussian SGD parameter updates are attracted into these local neighborhoods and stagnate there. To prove a result in this direction, we impose local essential convexity on the expected loss surface within the neighbourhoods. This assumption requires the loss function to be convex on a path between a parameter in the neighborhood and its projection on the local minimum , as illustrated in Figure 2 (details are provided in Assumption 1 below). This condition can be used when the minimum consists of more than one parameter vector. It should be noted that the essential convexity is weaker than essential strong convexity [27], and similar to the Polyak-Łojasiewicz condition [28]. Compared to these conditions, the local essential convexity seems better suited to derive bounds for the generalisation error in our framework. When the SGD output stagnates in , we can use the Rademacher complexity of to bound the generalization gap. Figure 1 illustrates the non-convex loss landscape of a deep neural network by using dimension reduction. The displayed population minima and their neighborhoods were found using numerical methods.

In the described setting, we state two main results. The first provides a lower bound for the probability that the SGD iterations stagnate in under several assumptions including local essential convexity.

###### Theorem 1 (informal statement of Theorem 4).

Consider Gaussian SGD with arbitrary initialization (not necessarily in ) and a learning rate for some . Under a number of regularity assumptions and for sufficiently large and , there exist constants such that for any if the Lebesgue measure of the set scales like and , then is an -stagnation set.

This result shows that under appropriate conditions, the union of population minima neighborhoods is a valid stagnation set. This result can be applied to generic loss surfaces and does not exploit the specific structure of neural networks.

The second main result is a generalization gap bound for deep neural networks if stagnation occurs. The statement can be summarized informally as

###### Theorem 2 (informal statement of Theorem 6).

If Gaussian SGD stagnates in , then,

where denotes the maximum dimension of the population minima , is the maximum width of the deep neural network, and with spectral norm .

This generalization gap bound depends on the network depth, the spectral norm of the weight matrices, and the radius of the minima neighbourhoods. While the number of network parameters does not explicitly appear in the bound, in practice, all quantities might depend in a highly non-trivial way on For the case of linear regression or neural networks with few layers, the radius of the neighbourhood around a minimum increases as the number of parameters increases. Because of the more complex loss surface, we expect that for deep neural networks the overall dependence of the generalization gap bound on is much slower than This then indicates that, even in the overparametrized case, loss surface implicit regularization can make the generalisation gap small.

On the technical side, we have developed three techniques to achieve the results: (I) evaluating the reaching probability of Gaussian SGD to the population minima neighborhood, (II) studying the probability of Gaussian SGD to stay in the neighbourhood of the population minima, and (III) metric entropy evaluation on neural networks within the population minima neighborhood. For (I) reaching probability, we control the transition probability of the Gaussian SGD parameter update. For (II) the staying power of SGD, we apply the local essential convexity and locally uniform convergence of loss surfaces to show that the SGD stays in the neighborhood of population minima with high probability as the learning rate decreases. For (III) the entropy evaluation of neural networks, we combine the recent entropy analysis for deep neural networks in [21] with the uniform convergence tools for loss landscapes in [29].

### 1.3. Related Work and Comparison

This article is closely related to the work on SGD-induced implicit regularization. We already briefly mentioned several articles in the previous section, and provide here a more in-depth overview. That learning algorithms lead to implicit regularization has been shown in various settings such as matrix factorization, linear convolutional networks, logistic regression, and others [30, 12, 13, 14, 15, 16, 17]. Although a precise understanding of implicit regularization in deep neural networks is still lacking, several articles assume that a norm of the parameters is bounded and give refined upper bounds on the generalization gap via uniform convergence [20, 22, 21, 23, 31]. A similar approach has been used in [24, 32] based on the notion of algorithmic stability. The recent empirical work [25] shows, however, that the bounded norm assumption is violated.

Another way to evaluate the generalisation error is to introduce compressibility [33, 34, 35]. These papers consider settings where the original neural network can be compressed into a neural network with smaller capacity. By then studying uniform convergence for the compressed network class, tighter bounds for the generalisation gap are derived. [33] compresses the parameter matrices of a neural network by random projection and obtains an upper bound that can be evaluated in terms of the number of parameters of the compressed network.

For non-convex optimization, bounds for the generalization error for SGD and its variations have been derived in [36, 37, 38, 39, 40, 41]. [36, 41] investigate the invariant distribution of stochastic differential equations. A particularly convenient approach is to employ Langevin dynamics in continuous time, in order to investigate global parameter search and to bound the generalization error [36, 42, 43]. Although the analysis using invariant distributions is useful, it remains unclear whether those invariant distributions exist in general deep neural networks, since their existence requires specific assumptions. Another direction is to study the local behaviour of SGD around local minima associated with loss shape [38, 39, 40, 44]. A limitation of this method is that only local properties can be investigated.

Another line of research is to study the generalization analysis of neural networks and related models in the overparametrized regime. When the number of parameters in the models is excessively large, there are multiple techniques to precisely measure generalization errors. Examples include spectrum-based analysis [45, 46, 47, 48, 49, 50, 51, 52] and the use of loss functions whose shapes are almost convex or approach zero due to the excess parameters [53, 54, 55]. A disadvantage of this approach is that until now it can only deal with linear or two-layer neural network models.

In contrast to earlier work, our approach aims to shed some light on the implicit regularization of Gaussian SGD induced by the loss surface and the geometry of its local minima. This source of implicit regularization arises from the structure of the deep networks and is not due to overparametrization of the model. Moreover, we consider the global behavior of SGD and allow the initial value of the algorithm to be far from the learned parameters.

### 1.4. Notation

For real numbers , is the maximum. The Euclidean norm of a vector is denoted by For a matrix , we define the matrix norms and write for the spectral norm (the largest singular value of ). Moreover, for the weight matrices in a deep network, we introduce the norms and . We define where denotes the Frobenius inner product. For , let be the projection of onto the set with respect to the norm . In the following, let be a positive finite constant depending on . For a set , denotes the Lebesgue measure of . For sequences and , means that there exists such that holds for every . Moreover, iff . Finally, we write if both and hold. We write for . denotes the indicator function. It is if the event holds and otherwise.

## 2. Setting

### 2.1. Deep Neural Network

Let denote the number of layers, and let be the weight matrices for each , with intermediate dimensions for . For convenience, we consider neural networks with one output unit, that is, . Moreover, denotes the maximum width and is the number of network parameters. For an activation function that is -Lipschitz continuous, we define an -layer neural network for as the function

with parameter tuples .
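As a minimal sketch of such a network function, the following plain-Python code evaluates a bias-free feed-forward network and checks the ReLU rescaling invariance discussed in Section 1.2. The bias-free form with the activation applied after every layer except the last is an assumption here, and the widths, the random weights, and the helper names (`relu`, `matvec`, `network`, `scale`) are hypothetical choices for illustration only.

```python
import random


def relu(v):
    """Apply the ReLU activation max(x, 0) entry-wise to a vector."""
    return [max(x, 0.0) for x in v]


def matvec(W, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def network(weights, x):
    """Bias-free network: activation after every layer except the last (assumed form)."""
    h = x
    for W in weights[:-1]:
        h = relu(matvec(W, h))
    return matvec(weights[-1], h)


def scale(W, c):
    """Multiply every entry of W by the constant c."""
    return [[c * w for w in row] for row in W]


rng = random.Random(0)
widths = [3, 4, 4, 1]  # input dimension 3, two hidden layers, one output unit
weights = [
    [[rng.gauss(0.0, 1.0) for _ in range(widths[i])] for _ in range(widths[i + 1])]
    for i in range(len(widths) - 1)
]

# Rescale one layer by c > 0 and another by 1/c: the parameters change,
# but the network function does not, because relu(c * z) = c * relu(z).
c = 2.5
rescaled = [scale(weights[0], c), scale(weights[1], 1.0 / c), weights[2]]

x = [0.3, -1.2, 0.7]
out_original = network(weights, x)
out_rescaled = network(rescaled, x)
print(out_original, out_rescaled)
```

The two printed outputs agree up to floating-point rounding although the two parameter tuples differ, which is why a population minimum is defined as a set rather than as a single parameter.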

### 2.2. Supervised Learning Problem

Given observations with sample spaces and , we define the loss function as To apply gradient descent methods, we need to assume that is differentiable with respect to the parameters. For such a loss and the sample , we define the corresponding empirical risk function and the expected risk function . To calculate derivatives of and with non-differentiable , we can take a sub-derivative of instead, e.g., in the ReLU activation case . We set . This ensures .

### 2.3. Gaussian Stochastic Gradient Descent

We study stochastic gradient descent (SGD) with Gaussian noise based on the empirical loss to learn the parameters Let be the initial parameter values and denote by for the parameter values after the -th SGD iteration. We assume that is compact and constrain the iterates of the algorithm to parameter tuples . We assume the existence of a filtration and define an -dependent -measurable Gaussian vector such that and for all . Here, is a matrix-valued map and all eigenvalues of are assumed to be bounded from above by and from below by for all . Given an initial parameter , we define a sequence of parameters by the following SGD update:

(2) |

with the learning rate. Here is a pre-determined deterministic stopping time and the output of the algorithm is If falls outside the parameter space , we project the vector onto with respect to the -norm and then substitute it into . Furthermore, we assume that the initial point and the set of observations are -measurable.

Gaussian SGD is an approximation of the widely used minibatch SGD. Theoretical and experimental similarities and differences are discussed in Section 4.3.
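The following is a minimal sketch of a projected gradient iteration with Gaussian gradient noise, in the spirit of the update described above. It is not the exact algorithm of this paper: the quadratic toy loss, the isotropic noise added directly to the gradient, the box-shaped parameter space, the learning-rate schedule, and all function names are assumptions made for illustration.

```python
import math
import random


def project_onto_box(theta, radius):
    """Euclidean projection onto the box [-radius, radius]^d (coordinate-wise clipping)."""
    return [min(max(t, -radius), radius) for t in theta]


def gaussian_sgd(grad, theta0, learning_rates, noise_std, radius, rng):
    """Noisy gradient steps with projection; returns the list of all iterates."""
    theta = list(theta0)
    path = [theta]
    for eta in learning_rates:
        # Gradient perturbed by isotropic Gaussian noise (assumed noise model).
        noisy_grad = [g + rng.gauss(0.0, noise_std) for g in grad(theta)]
        step = [t - eta * g for t, g in zip(theta, noisy_grad)]
        # Iterates that leave the parameter space are projected back onto it.
        theta = project_onto_box(step, radius)
        path.append(theta)
    return path


# Hypothetical toy loss 0.5 * ||theta - minimum||^2 with gradient theta - minimum.
minimum = [1.0, -0.5]


def toy_grad(theta):
    return [t - m for t, m in zip(theta, minimum)]


n_steps = 2000
rates = [0.2 / (t + 1) ** 0.6 for t in range(n_steps)]  # decreasing learning rate
path = gaussian_sgd(toy_grad, [4.0, 4.0], rates, noise_std=0.5, radius=5.0,
                    rng=random.Random(1))
final_distance = math.dist(path[-1], minimum)
print(f"distance of the output to the minimum: {final_distance:.4f}")
```

The initial value is deliberately chosen far from the minimum, and the decreasing learning rate damps the noise over time so that the late iterates remain close to the minimizer.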

## 3. Key Notion and Assumption

### 3.1. Population Minima and their Neighborhoods

We introduce key concepts and assumptions related to the notion of population minima introduced in Definition 2. As we are considering the expected loss, everything is deterministic.

For a positive integer and a constant , we pick all population minima that satisfy for any

(3) |

and that are separated from the boundary of the parameter space with respect to the distance induced by . Let be the number of population minima satisfying the conditions. In the previous formula, can be viewed as a constraint on the dimension of the minima. If are isolated points, the inequality holds with . It should be noted that is not as large as the number of parameters (e.g. [56]). For each , we consider a suitable notion of neighbourhood.

###### Definition 3 (-neighborhoods of ).

For , we define the -neighborhood of as

This neighborhood is defined via the -norm, which is suitable for the generalization analysis of deep neural networks as developed in [21]. Since the minima are assumed to be in the interior of the parameter space, we can find a sufficiently small such that holds for all A key object in the analysis is the union of the -neighborhoods

### 3.2. Loss Surface and Gradient Noise

We first discuss the local shape of the expected loss . Denote by the projection of onto a set with respect to the norm .

###### Assumption 1 (Local Essential Convexity).

For any and any two parameters with , we have

This is a convexity property on the path between a parameter and its projected version in as displayed in Figure 2. The condition is weaker than local essential strong-convexity [27]. Since this version of convexity is defined for each projection path, Assumption 1 does not depend on the shape of the population minima : it is valid even if the minimum is not isolated or is a non-convex set.

There is no clear connection between local essential convexity and the Polyak-Łojasiewicz (PL) condition [28]. While both conditions have a similar flavour, PL imposes a lower bound on the Frobenius-norm of by a difference up to constants.

Existing work indicates that local essential convexity holds for neural networks. For example, [57, 58, 59] have shown that for specific neural network architectures the Hessian matrix around minima is positive definite. This implies local essential convexity. The theory of neural tangent kernels [55] with over-parametrized neural networks yields the same result. We also mention that another notion of convexity has been shown by [60, 61] in connection with regularization and initial values.

Moreover, we need to impose some smoothness on the expected loss surface . Let denote the partial derivative with respect to the -th entry of the weight matrix .

###### Assumption 2 (Local smoothness).

For any and any parameters , is differentiable at and we have

for all and all .

Assumptions of this type are common, see e.g. [36]. It should be noted that the condition only has to hold locally in the neighborhoods of the minima Since the assumption is about the expected loss, it holds for non-smooth activation functions such as the ReLU activation function. To describe this fact in more detail, we provide a simple example with few parameters and layers.

###### Lemma 3.

Consider a neural network with two layers, widths , ReLU activation , and square loss function . Suppose that the maximum absolute value of each element of is bounded by . Then, for some distribution of , Assumption 2 holds with .

The constant will affect the behavior of the SGD stagnation probabilities but does not influence the generalization gap in Theorem 6.

We also impose an assumption on the effect of data-oriented uncertainty on gradient noise. Let be a pair of the input and output variables.

###### Assumption 3 (Gradient noise).

There exists such that for any and any parameter , the following inequality holds:

This condition is the main assumption in [29] and allows us to establish uniform convergence of the generalization gap within the set .

## 4. Main Result

### 4.1. Stagnation Probability

We first introduce the lower bound on the stagnation probability for Gaussian SGD. Explicit expressions for all constants are given in the proofs. Recall that denotes the Lebesgue measure of the set , and denotes a finite positive constant depending on .

###### Theorem 4 (Stagnation probability of Gaussian SGD).

Consider the Gaussian SGD algorithm in (2). Suppose Assumptions 1, 2, and 3 hold. For positive let the learning rates be a monotonically decreasing sequence in , and define . Then, for and all sufficiently large , the union of neighborhoods around local minima is a -stagnation set with

and suitable constants , .

Thinking of as a fixed quantity, the stagnation probability for is strictly positive. Interestingly, the lower bound is a product of the Lebesgue measure and an exponential term. These two terms describe the probability of entering the set and that of not exiting during the Gaussian SGD iterations. Indeed, if denotes the first hitting time, we can lower bound the staying probability as follows

(4) |

where describes the reaching probability to and is the non-exiting probability from . Theorem 4 follows by carefully bounding these terms from below. It should be noted that the distance between the initial value and could potentially be large.
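The split into a reaching probability and a non-exiting probability can be illustrated by a small Monte Carlo experiment. The sketch below is a toy version only: the one-dimensional loss, the interval playing the role of the stagnation set, the time `t_bar` after which stagnation is checked, the learning-rate schedule, and the noise level are all hypothetical choices and are not taken from the theorem.

```python
import random


def run_once(rng, theta0, n_steps, t_bar, delta, noise_std):
    """One run of noisy gradient descent on the toy loss 0.5 * theta**2.

    Returns (hit, stayed): whether the interval |theta| <= delta was entered
    by step t_bar, and whether the iterate was inside it at every step from
    t_bar to n_steps.
    """
    theta = theta0
    hit = False
    stayed = True
    for t in range(1, n_steps + 1):
        eta = 0.3 / t ** 0.5  # decreasing learning rate (assumed schedule)
        theta -= eta * (theta + rng.gauss(0.0, noise_std))
        inside = abs(theta) <= delta
        if t <= t_bar and inside:
            hit = True
        if t >= t_bar and not inside:
            stayed = False
    return hit, stayed


rng = random.Random(0)
n_runs = 500
runs = [run_once(rng, theta0=3.0, n_steps=300, t_bar=100, delta=0.5, noise_std=1.0)
        for _ in range(n_runs)]

n_hit = sum(1 for hit, _ in runs if hit)
n_stay = sum(1 for _, stayed in runs if stayed)  # staying implies being inside at t_bar

reach_prob = n_hit / n_runs
stay_given_hit = n_stay / n_hit if n_hit else 0.0
stagnation_prob = n_stay / n_runs
print(reach_prob, stay_given_hit, stagnation_prob)
```

In this toy setup the estimated stagnation frequency equals the product of the two empirical factors by construction, whereas the theorem only needs the product as a lower bound.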

Moreover, the lower bound in the previous theorem increases in both and . For example, if the number of minima is large, then the union of their neighborhoods increases. Alternatively, if the expected loss satisfies the convexity property above in a large neighborhood around each minimum, we can take large and obtain a stronger lower bound on the staying probability. Recall that for a function for refers to any such that With this notation, we obtain the following simplified result:

###### Corollary 5.

Suppose and consider otherwise the same setting as in Theorem 4. If and hold as , then we obtain

### 4.2. Generalization Gap Bound

We provide an upper bound on the generalization error for deep neural networks, provided the SGD iterations remain in the stagnation set. As before, consider a deep neural network with layers, maximum width , and network parameters. We also define the largest possible product of its spectra within by , and set for an -upper bound over all neural network functions. The following result provides a bound on the generalization gap:

###### Theorem 6 (Bound on generalization gap).

If the same conditions as for Theorem 4 hold, then, for any , with probability at least , we obtain

Observe that the bound holds with probability at least , where is the lower bound on the stagnation probability in Theorem 4. In terms of network quantities, the number of network parameters does not affect the derived bound on the generalization gap directly. Instead, the depth and the product of spectral norms appear in the generalization gap, similarly to [21] without stagnation. In practice, is typically in the range up to a few hundred and can be viewed as bounded. is not affected by the number of parameters and does not necessarily increase even with large models. It is also of interest to notice that the increase in the number of minima neighborhoods does not have a significant effect on the bound. This implies that if is constituted from a larger number of neighborhood sets , the generalization gap increases only moderately.
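For a single parameter tuple, the product of spectral norms of the weight matrices can be computed numerically. The sketch below uses power iteration in plain Python. It evaluates the product at one fixed set of weights only, not the largest product over the whole stagnation set that enters the theorem, and the function names and example matrices are hypothetical.

```python
import math
import random


def spectral_norm(W, n_iter=200, seed=0):
    """Largest singular value of W (list of rows), via power iteration on W^T W."""
    rng = random.Random(seed)
    n_rows, n_cols = len(W), len(W[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
    for _ in range(n_iter):
        u = [sum(w * x for w, x in zip(row, v)) for row in W]                  # u = W v
        v = [sum(W[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]  # v = W^T u
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in v]
    # For a unit vector v aligned with the top right singular vector, ||W v|| is the norm.
    u = [sum(w * x for w, x in zip(row, v)) for row in W]
    return math.sqrt(sum(x * x for x in u))


def spectral_product(weights):
    """Product of the spectral norms of all layers."""
    product = 1.0
    for W in weights:
        product *= spectral_norm(W)
    return product


# Hypothetical two-layer example with known spectral norms 3 and 0.5.
layers = [
    [[3.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.0]],
]
norm_first = spectral_norm(layers[0])
product = spectral_product(layers)
print(norm_first, product)
```

Power iteration is used here only to keep the example dependency-free; a singular value decomposition from a numerical library would serve the same purpose.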

### 4.3. Relation between Gaussian SGD and minibatch SGD

Minibatch SGD is defined as follows. For each , we randomly pick a minibatch of observations from the full sample and set . The minibatch SGD algorithm generates the sequence by the following equation

(5) |

For the filtration introduced before, batch gradient noise is defined as the -dependent -measurable random vector

This random variable measures the effect of the subsampling on the gradient and allows us to rewrite (5) as

(6) |

We note that holds for all and, conditional on the observations, is an -dependent -valued random variable with zero mean and finite variance.

For large batch size, mini-batch SGD and Gaussian SGD as defined in (2) behave very similarly. Indeed, for large , follows asymptotically a Gaussian distribution by the conditional multiplier central limit theorem. More precisely, for a given training dataset and the minibatch sampling regarded as independent multipliers, weakly converges to a Gaussian law almost surely (e.g. Lemma 2.9.5 in [62]). Empirically, several studies [63, 64, 65] investigate the tail behavior of minibatch SGD, and some of them report that gradient noise has Gaussian-like tail probabilities.

We suspect that the lower bound on the stagnation probability in Theorem 4 also holds for mini-batch SGD. This is because if the noise of minibatch SGD and the noise of the Gaussian SGD are sufficiently close, e.g. in the sense of the chi-square divergence, then by a Girsanov-type change of measure transformation [66] one can show that the parameter updates are nearly the same.
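The zero-mean property of the batch gradient noise and its shrinking variance for larger batches can be checked empirically. The sketch below uses a one-parameter least-squares problem with synthetic data. The data-generating model, the sign convention (minibatch gradient minus full gradient), sampling without replacement, and all names are assumptions for illustration and may differ from the definitions in this section.

```python
import random
import statistics


def per_sample_grad(theta, x, y):
    """Gradient in theta of the squared loss 0.5 * (theta * x - y)**2."""
    return (theta * x - y) * x


rng = random.Random(0)
# Hypothetical data: y = 2 x + small Gaussian noise.
xs = [rng.gauss(0.0, 1.0) for _ in range(200)]
data = [(x, 2.0 * x + rng.gauss(0.0, 0.5)) for x in xs]

theta = 0.5
grads = [per_sample_grad(theta, x, y) for x, y in data]
full_gradient = statistics.fmean(grads)


def batch_gradient_noise(batch_size):
    """Minibatch gradient minus full-sample gradient for one random minibatch."""
    batch = rng.sample(grads, batch_size)  # drawn without replacement
    return statistics.fmean(batch) - full_gradient


noise_small = [batch_gradient_noise(5) for _ in range(2000)]
noise_large = [batch_gradient_noise(50) for _ in range(2000)]
mean_small = statistics.fmean(noise_small)
var_small = statistics.pvariance(noise_small)
var_large = statistics.pvariance(noise_large)
print(mean_small, var_small, var_large)
```

Conditional on the data, the empirical mean of the noise is close to zero and the variance for the larger batch is markedly smaller, which is consistent with the Gaussian approximation becoming more accurate as the batch size grows.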

### 4.4. Application to Optimization Error Bound

We now apply the stagnation results obtained above to evaluate the generalization gap using Gaussian SGD with respect to the number of iterations. We pick a local minimum and define the minimum value of the expected loss in as . Assuming that for fixed stagnation in occurs, we obtain a new bound on the generalization gap.

Let be the expectation conditionally on the event . Recall that is as defined in (2).

###### Theorem 7 (Optimization Error Bound).

We find that the error linearly converges in , which is the optimal rate for SGD under the Polyak-Łojasiewicz condition [67]. Moreover, the bound increases with the depth and the size of the neighborhoods but only depends logarithmically on the maximum width and the number of network parameters

### 4.5. Proof Outline

We provide an overview of the proofs for Theorem 4 and Theorem 6. Full proofs are given in a later section.

#### 4.5.1. Stagnation Probability (Theorem 4)

For time , constants , and a parameter , we define

(7) |

Obviously, and increases in and , and decreases in and . Now, we show that is a stagnation set by deriving a lower bound of the stagnation probability.

###### Proposition 8.

To prove this result, we decompose the stagnation probability into the reaching probability where and the non-escaping probability as in (4).

(i) Reaching probability : We evaluate the probability in terms of the step-wise reaching probability , that is, the probability of entering the set in step given that the previous iterate was outside . Using that depends on the past only through the previous iterate a standard calculation shows

(8) |

We now bound the step-wise reaching probability . Since is a Gaussian vector, the support of can cover the whole parameter space . Invoking the Gaussian density function, we derive the bound

We substitute this result into (8) and obtain a lower bound on the reaching probability . Obviously, the last inequality is highly suboptimal.

(ii) Non-escaping probability: To derive a lower bound on , we decompose the expression into a step-wise non-escaping probability. Picking one local minimum , we find

(9) |

We now evaluate the step-wise conditional staying probability To this end, we define the updated parameter without gradient noise as and observe that holds, implying

(10) |

Lemma 12 and Assumption 1 yield the lower bound with sufficiently small . Then, for , we obtain

Since is Gaussian, we can control its tail behavior. Combining this result with (10) and (9) yields the lower bound on the non-escaping probability.

#### 4.5.2. Generalization Gap Bound (Theorem 6)

This upper bound is based on uniform convergence over the union of neighborhoods around local minima . Suppose that the event holds, whose probability is guaranteed to be no less than by Theorem 4. For a function class , we consider the standard definition of the empirical Rademacher complexity with independent Rademacher variables , that is, both and have probability Then, for sufficiently small , we obtain