Convergence Analysis of the Dynamics of a Special Kind of Two-Layered Neural Networks with ℓ_1 and ℓ_2 Regularization

11/19/2017 · by Zhifeng Kong, et al.

In this paper, we extend the convergence analysis of the dynamics of two-layered bias-free networks with one ReLU output. We consider two popular regularization terms, the ℓ_1 and ℓ_2 norms of the parameter vector w, added to the square loss function with coefficient λ/2. We prove that when λ is small, the weight vector w converges to the optimal solution ŵ (with respect to the new loss function) with probability ≥ (1-ε)(1-A_d)/2 under random initialization in a sphere centered at the origin, where ε is a small value and A_d is a constant. Numerical experiments, including phase diagrams and repeated simulations, verify our theory.




1 Introduction

A substantial issue in deep learning is the theoretical analysis of complex systems. Unlike multi-layer perceptrons, deep neural networks have various structures, which mainly come from intuition, and they sometimes yield good results. On the other hand, the optimization problem usually turns out to be non-convex, so it is difficult to analyze whether the system will converge to the optimal solution under simple methods such as stochastic gradient descent.

In Theorem 4 in [2], convergence for a system with square loss with regularization is analyzed. However, Assumption 6 in [2] requires the activation function to be three times differentiable on its domain. Thus the analysis cannot be applied to some popular activation functions such as those in [4] and [5], which are not differentiable everywhere.

Theorem 3.3 in [1] provides another point of view, analyzing the situation by the Lyapunov method [6]. The conclusion is weaker in that convergence is guaranteed only with a certain probability. However, this method successfully deals with the ReLU activation function. In this paper, we take ℓ_1 and ℓ_2 regularization into consideration and analyze the convergence of these two regularized systems with an analogous method. A similar conclusion is drawn in the end.

The ℓ_1 norm and the squared ℓ_2 norm of a vector w are

‖w‖_1 = Σ_i |w_i|,   ‖w‖_2² = Σ_i w_i².

These two regularization terms are popular because they control the scale of w. Because there is an important difference between ℓ_1 and ℓ_2 regularization (it is usually possible to acquire an explicit solution of a system with ℓ_2 regularization, but hard for a system with ℓ_1 regularization), we need different tools to deal with the two problems.
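As a quick concrete check, the two regularizers are straightforward to compute. A minimal sketch (NumPy is used throughout our illustrative examples; the function names are ours, not from the paper):

```python
import numpy as np

def l1_reg(w):
    # ||w||_1 = sum_i |w_i|
    return np.sum(np.abs(w))

def l2_sq_reg(w):
    # ||w||_2^2 = sum_i w_i^2
    return np.sum(w ** 2)

w = np.array([3.0, -4.0])
print(l1_reg(w))     # 7.0
print(l2_sq_reg(w))  # 25.0
```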

2 Preliminary

In this paper a two-layered neural network with one ReLU output is considered. Let X = [x_1, …, x_N], a d × N matrix (N > d), be the input data. Assume that the columns of X are independent, identically distributed Gaussian d-dimensional random vectors: the x_i's are N(0, I_d). Let w, a vector with length d, be the vector of weights (parameters) to be learned by the model. Let w* be the optimal (teacher) weight vector. Let σ(z) = max(z, 0) be the ReLU activation function. Then, the output with input vector x and weight w is σ(x^T w). For convenience, define an N-dimensional vector σ(X^T w) with i-th element σ(x_i^T w). Now, we are able to write down the loss function with the regularization term R(w):

L(w) = (1/2) ‖σ(X^T w) − σ(X^T w*)‖² + (λ/2) R(w),

where λ ≥ 0 is a parameter. When λ = 0, there is no regularization. In this paper, we focus on the situation where R(w) = ‖w‖_1 or R(w) = ‖w‖_2² and λ is very small.

We have an easy way to represent σ(X^T w) by introducing a new matrix function D(w) = diag(d_1, …, d_N), where d_i = 1 if x_i^T w > 0 and d_i = 0 if x_i^T w ≤ 0. Then, L(w) can be written in matrix form:

L(w) = (1/2) ‖D(w) X^T w − D(w*) X^T w*‖² + (λ/2) R(w).

Additionally, let D* = D(w*) for convenience.
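The matrix form above can be checked numerically. The sketch below is an illustration, not the paper's code; for convenience the samples are stored as the rows of X, so σ(Xw) = D(w)Xw:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 8, 3
X = rng.standard_normal((N, d))   # rows x_i ~ N(0, I_d)
w = rng.standard_normal(d)

def D(X, w):
    # diagonal 0/1 matrix: D_ii = 1 iff x_i^T w > 0
    return np.diag((X @ w > 0).astype(float))

relu = lambda z: np.maximum(z, 0.0)

# matrix form of the ReLU output: sigma(Xw) = D(w) X w
assert np.allclose(relu(X @ w), D(X, w) @ X @ w)
```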

Now we introduce the gradient descent algorithm for the model. The iteration has the form

w_{t+1} = w_t + η g(w_t),

where η is the learning rate (usually small) and g(w) is the negative gradient of the loss function. According to [1], g(w) has the closed form

g(w) = X D(w) (D* X^T w* − D(w) X^T w) − (λ/2) ∇R(w).

Its expectation (corresponding to the Gaussian input distribution) is given explicitly by

E[g(w)] = (N/2) [ ((π − θ)/π) w* − w + (‖w*‖ sin θ / (π ‖w‖)) w ] − (λ/2) ∇R(w),

where θ ∈ [0, π] is the angle between w and w*.
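The population-gradient formula can be sanity-checked by Monte Carlo. The following sketch (our own illustration, taking λ = 0 and normalizing per sample, i.e. the N = 1 scaling) compares an empirical average of the sample gradient with the closed form from [1]:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_samples = 4, 200000
w_star = rng.standard_normal(d)
w = rng.standard_normal(d)
relu = lambda z: np.maximum(z, 0.0)

# Monte Carlo estimate of the negative gradient of 0.5*E[(relu(x.w) - relu(x.w*))^2]
X = rng.standard_normal((n_samples, d))
err = relu(X @ w) - relu(X @ w_star)
mask = (X @ w > 0).astype(float)
g_emp = -(X * (err * mask)[:, None]).mean(axis=0)

# closed form from [1] (lambda = 0, per-sample scaling)
theta = np.arccos(np.clip(
    w @ w_star / (np.linalg.norm(w) * np.linalg.norm(w_star)), -1.0, 1.0))
g_pop = 0.5 * ((np.pi - theta) / np.pi * w_star - w
               + np.linalg.norm(w_star) * np.sin(theta)
               / (np.pi * np.linalg.norm(w)) * w)

print(np.linalg.norm(g_emp - g_pop))  # small; shrinks as n_samples grows
```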

3 Theoretical Analysis

Usually, w does not converge to w* because of the regularization term. Let ŵ be the optimal weight vector that minimizes the regularized loss, i.e. ŵ = argmin_w L(w). First, we'll solve for ŵ when λ is small, and then we prove that w converges to ŵ in the region Ω \ ℓ using the Lyapunov method [6], where Ω is the convergence region and ℓ is a line of measure zero on which the dynamics degenerate.

We first provide three lemmas that support the analysis in Sections 3.2 and 3.3. The lemmas show that extreme situations happen only with small probability, and they provide some mathematical tricks that are useful in the theoretical analysis.

3.1 Preparation

Lemma 1: for .

Proof: Let , then . Thus,




When , is a small value bounded by .

Lemma 2: X D* X^T is positive definite with probability 1 − A_d.

Proof: First we show that X D* X^T is positive definite whenever D* contains at least d 1's. Since the x_i's are Gaussian, any d rows of X^T are linearly independent with probability 1, so with probability 1 the matrix X D* X^T has full rank whenever D* contains at least d 1's. Each diagonal entry of D* equals 1 independently with probability 1/2 (since x_i^T w* > 0 with probability 1/2), so D* contains at least d 1's with probability 1 − A_d. Then, since X D* X^T is always positive semidefinite, the probability that it is positive definite also equals this amount.

Lemma 3: For a positive definite matrix M and a small value λ,

(M + λ I)^{-1} = M^{-1} − λ M^{-2} + O(λ²),

where O(λ²) refers to a matrix with every element of order O(λ²).

Proof: Since M is positive definite, M^{-1} exists. Then,

(M + λ I)^{-1} = (I + λ M^{-1})^{-1} M^{-1} = (I − λ M^{-1} + O(λ²)) M^{-1} = M^{-1} − λ M^{-2} + O(λ²).

This shows that (M + λ I)^{-1} and M^{-1} are close to each other.

3.2 Convergence Area for the ℓ_2 Regularization Case

In this case, we have R(w) = ‖w‖_2² and ∇R(w) = 2w. Then, the loss function is given in the following equation:

L(w) = (1/2) ‖D(w) X^T w − D* X^T w*‖² + (λ/2) ‖w‖_2².
Theorem 1: When λ is small, ŵ can be solved explicitly with probability 1 − A_d.

Proof: Let g(ŵ) = 0; according to equation (5), we have

X D(ŵ) (D* X^T w* − D(ŵ) X^T ŵ) = λ ŵ.

Let's first assume that D(ŵ) = D*. Then the equation can be simplified as

X D* X^T (w* − ŵ) = λ ŵ.

Thus, we have

ŵ = (X D* X^T + λ I)^{-1} X D* X^T w*.
The inverse exists with probability 1 − A_d according to Lemmas 2 and 3.

We now show that when λ is small enough, the assumption D(ŵ) = D* indeed holds. According to Lemmas 2 and 3, we have

ŵ = w* − λ (X D* X^T)^{-1} w* + O(λ²).

It is sufficient to show that X^T ŵ and X^T w*, two vectors in R^N, share the same signs in all positions with probability 1. These two vectors are related by the equation

X^T ŵ = X^T w* − λ X^T (X D* X^T)^{-1} w* + O(λ²).
Since X^T w* doesn't contain 0 with probability 1, we can exclude these cases. Then, all the terms after X^T w* above don't influence the signs of X^T ŵ when

λ ≤ min_i |x_i^T w*| / (2 max_i |x_i^T (X D* X^T)^{-1} w*|).

The "2" in the denominator is used to eliminate the effect of the higher-order terms.
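As an illustration of Theorem 1's conclusion that ŵ approaches w* as λ → 0, one can minimize the finite-sample ℓ_2-regularized loss for decreasing λ. A hypothetical sketch (all parameters are chosen for illustration only; samples are the rows of X):

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 200, 5
X = rng.standard_normal((N, d))   # rows x_i ~ N(0, I_d)
w_star = rng.standard_normal(d)
relu = lambda z: np.maximum(z, 0.0)

def minimize(lam, eta=1e-3, steps=5000):
    # plain gradient descent on the l2-regularized square loss
    w = w_star + 0.1 * rng.standard_normal(d)   # start near w*
    for _ in range(steps):
        err = relu(X @ w) - relu(X @ w_star)
        grad = X.T @ (err * (X @ w > 0)) + lam * w
        w -= eta * grad
    return w

gaps = [np.linalg.norm(minimize(lam) - w_star) for lam in (1.0, 0.1, 0.01)]
print(gaps)  # the gap ||w_hat - w*|| shrinks as lambda decreases
```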

Now, we have shown that ŵ is close to w* when λ is small. The next step is to show that w converges to ŵ in a certain area, which is exactly what the Lyapunov method [6] is good at. In order to apply the Lyapunov method, we regard the iteration index t as continuous.

Theorem 2: With probability 1 − A_d, the following statement holds. When λ is small, the Lyapunov function V(w) = (1/2) ‖w − ŵ‖² satisfies dV/dt < 0 in Ω \ ℓ, and thus the system is asymptotically stable there. That is, w → ŵ as t → ∞.

Proof: We can write as:


In order to simplify, let , where is given by


Note ; can be written as where


According to Lemma 7.3 in [1], it is given by the following:


can also be divided into two parts: , where


and the remainder term is bounded. Since the leading matrix is positive definite in the region according to Lemma 7.3 in [1], when λ is small the perturbed matrix is also positive definite there. As a result, dV/dt < 0, which leads to the result that the system is asymptotically stable in Ω \ ℓ.
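The Lyapunov argument can be visualized numerically: along an Euler discretization of the flow, V(w) = ½‖w − ŵ‖² should decrease. A sketch under our own illustrative parameters (ŵ is approximated by running the flow for a long time from w*):

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 300, 4
X = rng.standard_normal((N, d))   # rows x_i ~ N(0, I_d)
w_star = rng.standard_normal(d)
lam = 0.01
relu = lambda z: np.maximum(z, 0.0)

def neg_grad(w):
    # negative gradient of the l2-regularized sample loss
    err = relu(X @ w) - relu(X @ w_star)
    return -(X.T @ (err * (X @ w > 0)) + lam * w)

# approximate w_hat by running the flow for a long time from w*
w_hat = w_star.copy()
for _ in range(20000):
    w_hat += 1e-4 * neg_grad(w_hat)

# Lyapunov check: V = 0.5 ||w - w_hat||^2 along the discretized flow
w = w_hat + 0.3 * rng.standard_normal(d)
V0 = 0.5 * np.sum((w - w_hat) ** 2)
for _ in range(2000):
    w += 1e-4 * neg_grad(w)
V_end = 0.5 * np.sum((w - w_hat) ** 2)
print(V0, V_end)  # V decreases along the trajectory
```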

3.3 Convergence Area for the ℓ_1 Regularization Case

In this case, we have R(w) = ‖w‖_1 and ∇R(w) = sgn(w), where sgn(w) is the vector of signs of the elements of w. Then, the loss function is given in the following equation:

L(w) = (1/2) ‖D(w) X^T w − D* X^T w*‖² + (λ/2) ‖w‖_1.
Theorem 3: When λ is small, ŵ can be solved (though not explicitly) with probability 1 − A_d.

Proof: Let g(ŵ) = 0; according to equation (5), we have

X D(ŵ) (D* X^T w* − D(ŵ) X^T ŵ) = (λ/2) sgn(ŵ).

We still assume that D(ŵ) = D* to simplify the problem. Then, the equation becomes

X D* X^T (w* − ŵ) = (λ/2) sgn(ŵ).

This problem is hard to solve explicitly, so we use the Implicit Function Theorem [7] here. The key is to examine whether the Jacobian matrix ∂F/∂ŵ is invertible, where F(λ, ŵ) = X D* X^T (w* − ŵ) − (λ/2) sgn(ŵ). The result is

∂F/∂ŵ = −X D* X^T, away from the coordinate hyperplanes, on which sgn(ŵ) is locally constant.
Since X D* X^T is positive definite with probability 1 − A_d according to Lemma 2, and when λ is small the sgn term doesn't influence invertibility, we know that ∂F/∂ŵ is then invertible. Thus, there exists a unique continuously differentiable function ŵ(λ) in a neighborhood of λ = 0 such that F(λ, ŵ(λ)) = 0. Notice that when λ = 0, ŵ = w* is the solution. As a result, ŵ can be extended as ŵ = w* − λ v + o(λ) for some vector v. Additionally, v might be very large because of the (X D* X^T)^{-1} factor in equations (24)-(26).

Then, we show that for small λ, we have D(ŵ) = D*. The analysis is quite similar to Theorem 1. When

λ ≤ min_i |x_i^T w*| / (2 max_i |x_i^T v|),

we have that D(ŵ) = D*.

Remark: In Theorem 3 the bound of λ is given in equation (27), where the unknown vector v appears in the denominator. In fact, we are able to estimate its value from known quantities. When we apply the extension ŵ = w* − λ v + o(λ) to equation (25), we have

X D* X^T (λ v + o(λ)) = (λ/2) sgn(ŵ),

which is equivalent to the following equation

X D* X^T v = (1/2) sgn(ŵ) + o(1).
As assumed in Theorem 3, D(ŵ) = D*, and assume that for . Let be the matrix consisting of the first rows of . Then, . Thus, we have


Then we have


which indicates that


for small λ such that the higher-order term in equation (31) is eliminated by a small margin. Finally, we are able to modify the bound in equation (27) by substituting the upper bound of ‖v‖ from equation (32) for this unknown quantity. The explicit bound is then given by the following equation:


Although the explicit solution of ŵ can't be found, we can still conclude that ŵ is close to w* for small λ. This is enough for the Lyapunov method, because we are able to control the corresponding term in equation (20) in a similar way.
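For the ℓ_1 case, although ŵ has no closed form, it can be approximated by subgradient descent, and for small λ one can check numerically that ŵ stays close to w*. This is an illustrative sketch of ours, not the paper's procedure:

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 200, 5
X = rng.standard_normal((N, d))   # rows x_i ~ N(0, I_d)
w_star = rng.standard_normal(d)
relu = lambda z: np.maximum(z, 0.0)

def minimize_l1(lam, eta=1e-3, steps=8000):
    # subgradient descent on 0.5*||relu(Xw) - relu(Xw*)||^2 + (lam/2)*||w||_1
    w = w_star + 0.05 * rng.standard_normal(d)
    for _ in range(steps):
        err = relu(X @ w) - relu(X @ w_star)
        grad = X.T @ (err * (X @ w > 0)) + 0.5 * lam * np.sign(w)
        w -= eta * grad
    return w

w_hat = minimize_l1(lam=0.01)
print(np.linalg.norm(w_hat - w_star))      # small for small lambda
print(np.sign(w_hat) == np.sign(w_star))   # sign pattern typically preserved
```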

Theorem 4: The statement in Theorem 2 still holds for ℓ_1 regularization.

Proof: Similar to the analysis in Theorem 2, we still have equation (20) in this case, with a different correction term. Thus, when λ is small enough the corresponding matrix is positive definite, and the conclusion remains correct here.

3.4 The Final Result

Since it's hard to draw samples from Ω \ ℓ, we consider a small sphere centered at the origin. The analysis is in Theorem 7.4 (the proof of Theorem 3.3) in [1].

Theorem 5: For both ℓ_1 and ℓ_2 regularization, if the initial weight vector w(0) is sampled uniformly in a ball B(0, r) with small radius r, then w converges to ŵ with probability ≥ (1 − ε)(1 − A_d)/2.

Proof: The proof is almost exactly the same as the proof of Theorem 7.4 in [1]. The only thing to notice is that we exclude the line ℓ, on which the required condition fails. However, the line has measure zero and thus doesn't change the conclusion.

Now, we have proved that Theorem 3.3 in [1] still applies to ℓ_1 and ℓ_2 regularization with small λ. This result is consistent with the argument that initial weights should be small rather than large [8].

4 Experiment Results and Analysis

First, in Figure 1 we demonstrate all four possibilities: the dynamics converge or do not converge, under ℓ_1 or ℓ_2 regularization. In Figure 2 we show two phase diagrams (vector fields, after normalization) of the dynamics with the two regularization terms, for a randomly selected w*. The big black point is ŵ, the small black points are grid points selected uniformly in the plane, and the green segments indicate the orientation of dw/dt (from the end with a black point to the end without one). In particular, when w equals 0 the dynamics are meaningless because the angle θ does not exist.

Figure 1: Four possible dynamics. From left to right: a dynamic that converges to ŵ and one that does not, under each of the two regularization terms.
Figure 2: Phase diagrams (vector fields, after normalization) in the plane, for the dynamics under the two regularization terms.

Then, in order to examine the prediction given by Theorem 5, we made the following simulation. Under different values of d, N, and λ, we simulated the dynamics 500 times and compared the empirical ratio of convergence with the theoretical ratio (that is, the probability) of convergence in Theorem 5. Specifically, for both the ℓ_1 and ℓ_2 situations, λ was selected in {0.001, 0.01, 0.1}, d was selected in {2, 3, 5}, and N was selected in {10, 20, 100}. The learning rate was set to 0.05 and the initialization radius r was set to 0.1. Each time X was sampled according to the normal distribution and w(0) was sampled uniformly in B(0, r). The results for the ℓ_2 and ℓ_1 regularization cases are demonstrated in Table 1.

d    N     Theoretical   ℓ_2, λ=0.001   ℓ_2, λ=0.01   ℓ_2, λ=0.1   ℓ_1, λ=0.001   ℓ_1, λ=0.01   ℓ_1, λ=0.1
2    10    0.425         0.912          0.832         0.436        0.940          0.912         0.700
2    20    0.450         0.992          0.976         0.578        0.970          0.956         0.840
2    100   0.450         0.996          1             0.950        0.996          0.986         0.880
3    10    0.373         0.852          0.712         0.170        0.966          0.972         0.736
3    20    0.449         0.994          0.966         0.342        0.998          0.996         0.940
3    100   0.450         1              1             0.856        1              1             0.962
5    10    0.170         0.452          0.304         0.016        1              1             0.612
5    20    0.441         0.97           0.820         0.112        1              1             0.960
5    100   0.450         1              1             0.706        1              1             1

Table 1: The comparison between the theoretical ratio of convergence (the 3rd column) and the empirical ratio of convergence (the 4th-9th columns) under different parameters.

According to Table 1, we make the following observations. There are four bold numbers in the table (the entries below the theoretical ratio), all of which lie in the ℓ_2 regularization case with λ = 0.1, indicating that λ = 0.1 is beyond the upper bound on λ required by Theorem 1 or Theorem 2. In most situations, the empirical ratio of convergence decreases as λ increases, and the gap between λ = 0.01 and λ = 0.1 is much larger than the gap between λ = 0.001 and λ = 0.01, which implies that λ also plays an important role in the convergence probability in Theorem 5. In most cases the empirical ratio is much larger than the theoretical ratio. This indicates that outside the sphere in Theorems 2 and 4 there is still a large area in which the initial weights converge to ŵ. Under the same parameters, the empirical ratio of convergence in the ℓ_1 case is greater than that in the ℓ_2 case in most settings, which suggests that ℓ_1 regularization makes the dynamics easier to converge than ℓ_2 regularization does.
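The simulation just described can be sketched as follows (our own illustrative parameters and trial counts; the convergence test ‖w − w*‖ < 0.1 is an arbitrary threshold, and the gradient is averaged over samples for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(5)
d, N = 2, 50
lam, eta, r = 0.01, 0.05, 0.1
trials, steps = 100, 1500
relu = lambda z: np.maximum(z, 0.0)

def sample_ball(radius):
    # uniform sample from the d-dimensional ball of the given radius
    v = rng.standard_normal(d)
    return radius * rng.uniform() ** (1.0 / d) * v / np.linalg.norm(v)

converged = 0
for _ in range(trials):
    X = rng.standard_normal((N, d))   # rows x_i ~ N(0, I_d)
    w_star = rng.standard_normal(d)
    w = sample_ball(r)                # w(0) uniform in B(0, r)
    for _ in range(steps):
        err = relu(X @ w) - relu(X @ w_star)
        grad = X.T @ (err * (X @ w > 0)) / N + lam * w   # sample-averaged
        w -= eta * grad
    converged += float(np.linalg.norm(w - w_star) < 0.1)

ratio = converged / trials
print(ratio)  # empirical ratio of convergence
```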

5 Conclusion and Future Work

In this paper, we presented our convergence analysis of the dynamics of two-layered bias-free networks with one ReLU output, where the loss function consists of the square error loss and ℓ_1 or ℓ_2 regularization on the weight vector. This is an extension of Theorem 3.3 in [1]. We first solved for the optimal weight vector ŵ with a small regularization coefficient in both cases, and then used the Lyapunov method [6] to show that the system is asymptotically stable in a certain area. In the final step, we showed that Theorem 3.3 in [1] remains correct in these two situations. We also verified our theory through numerical experiments, including phase diagrams and repeated simulations.

Our work provides a theoretical justification of convergence for two popular models. We started from the intuition that small regularization doesn't change the system too much, and our conclusion is compatible with this intuition. In the future, we plan to analyze the system with larger regularization, since in practice λ may be fixed at, for example, 0.5, which can be larger than the bounds in equations (17) and (27). This is more difficult because we can no longer expect D(ŵ) = D*, and other advanced techniques may be required. We also plan to consider other popular regularization terms and to provide a more general theory on this topic.