ReLU_Convergence
In this paper, we extend the convergence analysis of the dynamics of two-layered bias-free networks with one ReLU output. We take into consideration two popular regularization terms: the ℓ_1 and ℓ_2 norms of the parameter vector w, each added to the square loss function with coefficient λ/2. We prove that when λ is small, the weight vector w converges to the optimal solution w̃ (with respect to the new loss function) with probability ≥ (1 − ε)(1 − A_d)/2 under random initializations in a sphere centered at the origin, where ε is a small value and A_d is a constant. Numerical experiments, including phase diagrams and repeated simulations, verify our theory.
A substantial issue in deep learning is the theoretical analysis of complex systems. Unlike multilayer perceptrons, deep neural networks come in various structures
[3], which mainly arise from intuition and sometimes yield good results. On the other hand, the optimization problem usually turns out to be nonconvex, so it is difficult to analyze whether the system will converge to the optimal solution under simple methods such as stochastic gradient descent.
In Theorem 4 in [2], convergence for a system with square loss with regularization is analyzed. However, Assumption 6 in [2] requires the activation function to be three times differentiable on its domain. Thus the analysis cannot be applied to some popular activation functions such as the ReLU [4] and its variants [5], which are not differentiable at the origin. Theorem 3.3 in [1] provides another point of view for analyzing the situation by using the Lyapunov method [6]. The conclusion is weaker: only a lower bound of (1 − ε)/2 on the probability of convergence is obtained. However, this method successfully deals with the ReLU activation function. In this paper, we take ℓ_1 and ℓ_2 regularization into consideration and analyze the convergence of these two systems with an analogous method. A similar conclusion is drawn in the end.
The squared ℓ_2 norm and the ℓ_1 norm of a vector w = (w_1, …, w_d)^T are
(1)  ‖w‖_2^2 = Σ_{i=1}^d w_i^2,  ‖w‖_1 = Σ_{i=1}^d |w_i|.
These two regularization terms are popular because they control the scale of w. Because there is an important difference between ℓ_1 and ℓ_2 regularization (usually it is possible to acquire an explicit solution of a system with ℓ_2 regularization, but hard for a system with ℓ_1 regularization), we need different tools to deal with the two problems.
In this paper a two-layered bias-free neural network with one ReLU output is considered. Let X, a d × n matrix (d < n), be the input data. Assume that the columns of X are independent, identically distributed d-dimensional Gaussian random vectors: x_i ~ N(0, I_d). Let w, a vector of length d, be the vector of weights (parameters) to be learned by the model. Let w* be the optimal weight vector with respect to the unregularized loss. Let σ(x) = max(x, 0) be the ReLU activation function. Then, the output with input vector x and weight w is σ(w^T x). For convenience, define an n-vector g(w) with i-th element σ(w^T x_i). Now, we are able to write down the loss function with the regularization term R(w):
(2)  L(w) = (1/2n) ‖g(w) − g(w*)‖_2^2 + (λ/2) R(w),
where λ ≥ 0 is a parameter. When λ = 0, there is no regularization. In this paper, we focus on the situation where R(w) = ‖w‖_2^2 or R(w) = ‖w‖_1 and λ is very small.
We have an easy way to represent g(w) by introducing a matrix function D(w) given by D(w) = diag(d_1, …, d_n), where d_i = 1 if x_i^T w > 0 and d_i = 0 if x_i^T w ≤ 0. Then, L(w) can be written in matrix form:
(3)  L(w) = (1/2n) ‖D(w) X^T w − D(w*) X^T w*‖_2^2 + (λ/2) R(w).
Additionally, let D* = D(w*) for convenience.
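The matrix form above is easy to check numerically. The following sketch (the dimensions, seed, and λ are illustrative choices, not values from the paper) builds D(w) from the sign pattern of X^T w and evaluates the loss (3) for both regularizers:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lam = 3, 50, 0.01
X = rng.standard_normal((d, n))      # columns x_i ~ N(0, I_d)
w_star = rng.standard_normal(d)      # optimal weight w*

def D(w):
    # diagonal 0/1 matrix selecting the samples that activate the ReLU
    return np.diag((X.T @ w > 0).astype(float))

def loss(w, reg="l2"):
    # square loss in matrix form (3) plus (lambda/2) times the regularizer
    residual = D(w) @ X.T @ w - D(w_star) @ X.T @ w_star
    R = w @ w if reg == "l2" else np.abs(w).sum()
    return residual @ residual / (2 * n) + lam / 2 * R

w = rng.standard_normal(d)
print(loss(w, "l2"), loss(w, "l1"))
```

At w = w* the square-loss part vanishes and only the regularization term (λ/2)R(w*) remains, which gives a quick consistency check of the matrix form.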
Now we introduce the gradient descent algorithm for the model. The iteration has the form
(4)  w_{t+1} = w_t + η Δw_t,
where η is the learning rate (usually small) and Δw = −∇L(w) is the negative gradient of the loss function. According to [1], Δw has the closed form
(5)  Δw = (1/n) X D(w) (D* X^T w* − D(w) X^T w) − (λ/2) ∇R(w).
Its expectation (corresponding to n → ∞) is given explicitly by
(6)  E[Δw] = (1/2π) ((π − θ) w* + ‖w*‖ sin θ · w̄ − π w) − (λ/2) ∇R(w),
where w̄ = w/‖w‖ and θ is the angle between w and w*.
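The iteration (4) with the closed-form update (5) can be simulated directly. Below is a minimal sketch for the ℓ_2 case; the sizes, learning rate, and initialization near w* (so the iterates stay in the convergence region discussed later) are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, lam, eta = 3, 200, 0.01, 0.05
X = rng.standard_normal((d, n))
w_star = rng.standard_normal(d)

def delta_w(w):
    # closed-form negative gradient (5); the l2 regularizer contributes -lam * w
    act = (X.T @ w > 0).astype(float)             # diagonal of D(w)
    act_star = (X.T @ w_star > 0).astype(float)   # diagonal of D*
    resid = act_star * (X.T @ w_star) - act * (X.T @ w)
    return X @ (act * resid) / n - lam * w

w = w_star + 0.1 * rng.standard_normal(d)  # initialize near w*
for _ in range(2000):
    w = w + eta * delta_w(w)               # iteration (4)
print(np.linalg.norm(w - w_star))          # small but nonzero: w stops near w*, not at it
```

The gap between the limit and w* is of order λ, previewing the result w̃ = w* + O(λ) proved below.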
Usually, w does not converge to w* because of the regularization term. Let w̃ be the optimal weight vector that minimizes L(w), i.e. w̃ = argmin_w L(w). First, we'll solve for w̃ when λ is small, and then we prove that w converges to w̃ in B(w̃, r) ∖ ℓ using the Lyapunov method [6], where B(w̃, r) is a ball centered at w̃ and ℓ is a line of measure zero on which the Lyapunov argument degenerates (see Theorem 5).
We first provide three lemmas that support the analysis in Sections 3.2 and 3.3. The lemmas show that extreme situations happen only with small probability, and provide some mathematical tools that are useful in the theoretical analysis.
Lemma 1: for .
Proof: Let , then . Thus,
(7) 
Finally,
(8) 
When , is a small value bounded by .
Lemma 2: P(X D* X^T is positive definite) = P(D* contains at least d ones) = 1 − A_d, where A_d = 2^{−n} Σ_{k=0}^{d−1} C(n, k).
Proof: First we show that when D* contains at least d ones, X D* X^T is positive definite. Note that X D* X^T = Σ_{i : x_i^T w* > 0} x_i x_i^T. Since the x_i's are i.i.d. Gaussian, any d columns of X are linearly independent with probability 1. Thus, when at least d diagonal entries of D* equal 1, the corresponding columns of X span R^d with probability 1 and X D* X^T is positive definite; with fewer than d ones the matrix is singular. Each diagonal entry of D* equals 1 independently with probability 1/2 by the symmetry of the Gaussian distribution, so the number of ones follows a Binomial(n, 1/2) distribution, and the probability that X D* X^T is positive definite equals P(Bin(n, 1/2) ≥ d) = 1 − A_d.
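Lemma 2, as read above, admits a quick Monte Carlo check (the sizes, trial count, and eigenvalue threshold below are illustrative choices): positive definiteness of X D* X^T should occur exactly as often as a Binomial(n, 1/2) variable reaches d.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(2)
d, n, trials = 4, 6, 20000   # n barely above d, so both outcomes occur often
hits = 0
for _ in range(trials):
    X = rng.standard_normal((d, n))
    w_star = rng.standard_normal(d)
    act = X.T @ w_star > 0                     # diagonal of D* as a boolean mask
    M = X[:, act] @ X[:, act].T                # X D* X^T, restricted to active columns
    hits += np.linalg.eigvalsh(M)[0] > 1e-10   # smallest eigenvalue > 0 means pd
binom = sum(comb(n, k) for k in range(d, n + 1)) / 2 ** n  # P(Bin(n,1/2) >= d)
print(hits / trials, binom)   # the two frequencies should agree closely
```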
Lemma 3: For a positive definite matrix A and a small value λ,
(9)  (A + λI)^{−1} = A^{−1} + E_λ,
where E_λ refers to a matrix with every element O(λ).
Proof: Since A is positive definite, A^{−1} exists. Then,
(10)  (A + λI)^{−1} = (I + λA^{−1})^{−1} A^{−1} = (I − λA^{−1} + λ^2 A^{−2} − ⋯) A^{−1} = A^{−1} − λA^{−2} + O(λ^2).
This shows that (A + λI)^{−1} and A^{−1} are close to each other.
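A quick numerical illustration of Lemma 3 (the matrix A below is an arbitrary positive definite example): the elementwise gap between (A + λI)^{−1} and A^{−1} shrinks roughly linearly with λ.

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((4, 4))
A = B @ B.T + np.eye(4)              # positive definite by construction
errs = []
for lam in (1e-2, 1e-3, 1e-4):
    gap = np.abs(np.linalg.inv(A + lam * np.eye(4)) - np.linalg.inv(A)).max()
    errs.append(gap)
    print(lam, gap)                  # gap decreases roughly like lam
```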
In this case, we have R(w) = ‖w‖_2^2 and ∇R(w) = 2w, so the regularization contributes −λw to Δw. Then, the loss function is given in the following equation:
(11)  L(w) = (1/2n) ‖D(w) X^T w − D* X^T w*‖_2^2 + (λ/2) ‖w‖_2^2.
Theorem 1: When λ is small, w̃ can be solved explicitly with probability 1 − A_d.
Proof: Let Δw = 0, and according to equation (5), we have
(12)  (1/n) X D(w̃) (D* X^T w* − D(w̃) X^T w̃) − λ w̃ = 0.
Let's first assume that D(w̃) = D*. Then the equation can be simplified as
(13)  (1/n) X D* X^T (w* − w̃) − λ w̃ = 0.
Thus, we have
(14)  w̃ = ((1/n) X D* X^T + λI)^{−1} (1/n) X D* X^T w*.
The inverse exists with probability 1 − A_d according to Lemmas 2 and 3.
We now show that when λ is small enough, the assumption D(w̃) = D* indeed holds. According to Lemmas 2 and 3, we have
(15)  w̃ = w* + E_λ · (1/n) X D* X^T w* = w* + O(λ).
It is sufficient to show that X^T w̃ and X^T w*, two vectors in R^n, share the same signs in all n positions with probability 1. These two vectors are related by the equation
(16)  X^T w̃ = X^T w* + X^T E_λ · (1/n) X D* X^T w*.
Since X^T w* does not contain 0 with probability 1, we can exclude these cases. Then, the terms after X^T w* above do not influence the signs when
(17) 
The "2" in the denominator is used for eliminating the effects of the higher-order terms.
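The closed form (14) can be verified numerically. In the sketch below (the sizes and λ are illustrative assumptions), w̃ is obtained by a linear solve; by construction it satisfies the stationarity condition (13) exactly, and its distance to w* is of order λ:

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, lam = 3, 100, 1e-4
X = rng.standard_normal((d, n))
w_star = rng.standard_normal(d)
act_star = (X.T @ w_star > 0).astype(float)
M = (X * act_star) @ X.T / n                                 # M = (1/n) X D* X^T
w_tilde = np.linalg.solve(M + lam * np.eye(d), M @ w_star)   # formula (14)
# for lambda this small the sign pattern of X^T w_tilde matches that of X^T w*,
# so the assumption D(w_tilde) = D* behind (13)-(14) is self-consistent
print(np.linalg.norm(w_tilde - w_star))                      # O(lambda)
```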
Now, we have shown that w̃ is close to w* when λ is small. The next step is to show that w converges to w̃ in a certain region, which is exactly what the Lyapunov method [6] is good at. In order to apply the Lyapunov method, we regard t as a continuous time index.
Theorem 2: With probability 1 − A_d, the following statement holds. When n is large and λ is small, consider the Lyapunov function V(w) = (1/2) ‖w − w̃‖_2^2. We have dV/dt < 0 in B(w̃, r) ∖ {w̃}, and thus the system is asymptotically stable. That is, w → w̃ as t → ∞.
Proof: We can write dV/dt as:
(18)  dV/dt = (w − w̃)^T (dw/dt) = (w − w̃)^T E[Δw].
In order to simplify, let E[Δw] = −M(w)(w − w̃), where M(w) is given by
(19) 
Note ; can be written as where
(20) 
According to Lemma 7.3 in [1], this matrix is given by the following:
(21) 
It can also be divided into two parts, a dominant part and a perturbation part coming from the regularization: M(w) = M_1(w) + M_2(w), where
(22) 
and M_2(w) satisfies a bound of order λ. From this, we see that M_2(w) is bounded. Since M_1(w) is positive definite in B(w̃, r) ∖ ℓ according to Lemma 7.3 in [1], when λ is small, M(w) is also positive definite there. As a result, dV/dt = −(w − w̃)^T M(w)(w − w̃) < 0 for w ≠ w̃, which leads to the result that the system is asymptotically stable in B(w̃, r) ∖ ℓ.
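The Lyapunov argument can also be observed empirically: along a simulated trajectory of the ℓ_2 dynamics, V = (1/2)‖w − w̃‖^2 decreases monotonically toward 0. A sketch with illustrative constants (sizes, λ, learning rate, and perturbation radius are assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
d, n, lam, eta = 3, 500, 1e-3, 0.05
X = rng.standard_normal((d, n))
w_star = rng.standard_normal(d)
act_star = (X.T @ w_star > 0).astype(float)
M = (X * act_star) @ X.T / n
w_tilde = np.linalg.solve(M + lam * np.eye(d), M @ w_star)  # formula (14)

def delta_w(w):
    # closed-form negative gradient (5) for the l2 case
    act = (X.T @ w > 0).astype(float)
    resid = act_star * (X.T @ w_star) - act * (X.T @ w)
    return X @ (act * resid) / n - lam * w

w = w_tilde + 0.3 * rng.standard_normal(d)     # start inside the ball B(w_tilde, r)
V = [0.5 * (w - w_tilde) @ (w - w_tilde)]
for _ in range(200):
    w = w + eta * delta_w(w)
    V.append(0.5 * (w - w_tilde) @ (w - w_tilde))
print(V[0], V[-1])    # the Lyapunov function shrinks along the trajectory
```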
In this case, we have R(w) = ‖w‖_1 and ∇R(w) = s, where s is the vector of signs of the elements of w, so the regularization contributes −(λ/2)s to Δw. Then, the loss function is given in the following equation:
(23)  L(w) = (1/2n) ‖D(w) X^T w − D* X^T w*‖_2^2 + (λ/2) ‖w‖_1.
Theorem 3: When λ is small, w̃ can be solved (not explicitly) with probability 1 − A_d.
Proof: Let Δw = 0, and according to equation (5), we have
(24)  (1/n) X D(w̃) (D* X^T w* − D(w̃) X^T w̃) − (λ/2) s̃ = 0,
where s̃ is the sign vector of w̃. We still assume that D(w̃) = D* to simplify the problem. Then, the equation becomes
(25)  (1/n) X D* X^T (w* − w̃) − (λ/2) s̃ = 0.
This problem is hard to solve directly, so we use the Implicit Function Theorem [7] here. The key is to examine whether the Jacobian matrix J = ∂F/∂w̃ is invertible, where F(λ, w̃) denotes the left-hand side of equation (25). The result is
(26) 
Since X D* X^T is positive definite with probability 1 − A_d according to Lemma 2, and when λ is small the second term does not influence invertibility, we know that J is then invertible. Thus, there exists a unique continuously differentiable function w̃(λ) such that w̃(λ) is the solution. Notice that when λ = 0, w̃ = w* is the solution. As a result, w̃ can be expanded as w̃ = w* + λv + o(λ) for some vector v. Additionally, v might be very large because an n appears after λ in equations (24)-(26).
Then, we show that for small λ, we have D(w̃) = D*. The analysis is quite similar to Theorem 1. When
(27) 
we have that D(w̃) = D*.
Remark: In Theorem 3 the bound on λ is given in equation (27), where there is an unknown vector v in the denominator. In fact, we are able to estimate its value from known quantities. When we apply the expansion w̃ = w* + λv + o(λ) to equation (25), we have
(28)  (1/n) X D* X^T (−λv + o(λ)) − (λ/2) s(w* + λv + o(λ)) = 0,
which is equivalent to the following equation
(29)  v = −(n/2) (X D* X^T)^{−1} s* + o(1),
where s* is the sign vector of w* (for small λ the signs of w̃ and w* coincide).
As assumed in Theorem 3, D(w̃) = D*, and assume in addition that D* contains m ones with m ≥ d; without loss of generality let these be the first m diagonal entries. Let X_1 be the matrix consisting of the first m columns of X. Then, X D* X^T = X_1 X_1^T. Thus, we have
(30) 
Then we have
(31) 
which indicates that
(32) 
for λ small enough, with a small value added to eliminate the effect of the o(1) term in equation (31). Finally, we are able to modify the bound in equation (27) by using the upper bound of ‖v‖ in equation (32) to substitute for this unknown amount. The explicit bound is then given by the following equation:
(33) 
Although the explicit solution of w̃ can't be found, we still draw the conclusion that w̃ is close to w* for small λ. This is enough for the Lyapunov method, because we are able to control the corresponding term in equation (20) in a similar way.
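Even without a closed form, the ℓ_1 stationarity condition (25) is easy to solve numerically under the assumption D(w̃) = D*: rewrite it as w̃ = w* − (λ/2) M^{−1} s̃ with M = (1/n) X D* X^T and iterate on the sign vector. A sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(5)
d, n, lam = 3, 100, 1e-4
X = rng.standard_normal((d, n))
w_star = rng.standard_normal(d)
act_star = (X.T @ w_star > 0).astype(float)
M = (X * act_star) @ X.T / n                 # M = (1/n) X D* X^T

w = w_star.copy()
for _ in range(20):                          # the sign vector stabilizes quickly
    w = w_star - (lam / 2) * np.linalg.solve(M, np.sign(w))
print(np.linalg.norm(w - w_star))            # O(lambda), matching Theorem 3
```

For small λ the signs of w̃ and w* coincide, so the iteration converges after a single step; this mirrors the expansion w̃ = w* + λv + o(λ).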
Theorem 4: The statement in Theorem 2 still holds for ℓ_1 regularization.
Proof: Similar to the analysis in Theorem 2, we still have equation (20) in this case, with a different perturbation part. Thus, when λ is small enough the matrix is positive definite, and the conclusion remains correct here.
Since it's hard to draw samples directly from B(w̃, r), we consider a small sphere centered at the origin instead. The analysis is in Theorem 7.4 (the proof of Theorem 3.3) in [1].
Theorem 5: For both ℓ_1 and ℓ_2 regularization, if the initial weight vector is sampled uniformly in a small sphere B(0, r) centered at the origin, w converges to w̃ with probability at least (1 − ε)(1 − A_d)/2.
Proof: The proof is almost exactly the same as the proof of Theorem 7.4 in [1]. The only thing to notice is that we exclude the line ℓ because the Lyapunov argument requires it. However, the line has measure zero and thus doesn't change the conclusion.
First, in Figure 1 we demonstrate all four possibilities: the dynamics converge or do not converge, with ℓ_1 or ℓ_2 regularization. In Figure 2 we show two phase diagrams (or vector fields, after normalization) of the dynamics with ℓ_1 and ℓ_2 regularization, with randomly selected X, w* and parameters. The big black point is w̃, the small black points are the grid points uniformly selected in the plane, and the green lines refer to the orientations of E[Δw] (from the end with a black point to the end without one). In particular, when w equals 0 in the ℓ_1 case the dynamic is meaningless because ∇‖w‖_1 does not exist.
Then, in order to examine the prediction given by Theorem 5, we made the following simulation. Under different values of d, n and λ, we simulated the dynamics 500 times and compared the experimental ratio of convergence to the theoretical ratio (that is, the probability) of convergence in Theorem 5. Specifically, for both the ℓ_1 and ℓ_2 situations, d was selected in {2, 3, 5}, n was selected in {10, 20, 100}, and λ was selected in {0.001, 0.01, 0.1}. The learning rate η was set to 0.05 and the radius r was set to 0.1. Each time, X was sampled according to the normal distribution and the initial weight vector was sampled uniformly in the sphere B(0, r). The results for the ℓ_1 and ℓ_2 regularization cases are demonstrated in Table 1.

Table 1: Experimental ratio of convergence under different d, n and λ, compared with the theoretical ratio in Theorem 5. Bold entries fall below the theoretical ratio.

| d | n   | Theoretical | ℓ_1, λ=0.001 | ℓ_1, λ=0.01 | ℓ_1, λ=0.1 | ℓ_2, λ=0.001 | ℓ_2, λ=0.01 | ℓ_2, λ=0.1 |
|---|-----|-------------|--------------|-------------|------------|--------------|-------------|------------|
| 2 | 10  | 0.425       | 0.912        | 0.832       | 0.436      | 0.940        | 0.912       | 0.700      |
| 2 | 20  | 0.450       | 0.992        | 0.976       | 0.578      | 0.970        | 0.956       | 0.840      |
| 2 | 100 | 0.450       | 0.996        | 1           | 0.950      | 0.996        | 0.986       | 0.880      |
| 3 | 10  | 0.373       | 0.852        | 0.712       | **0.170**  | 0.966        | 0.972       | 0.736      |
| 3 | 20  | 0.449       | 0.994        | 0.966       | **0.342**  | 0.998        | 0.996       | 0.940      |
| 3 | 100 | 0.450       | 1            | 1           | 0.856      | 1            | 1           | 0.962      |
| 5 | 10  | 0.170       | 0.452        | 0.304       | **0.016**  | 1            | 1           | 0.612      |
| 5 | 20  | 0.441       | 0.97         | 0.820       | **0.112**  | 1            | 1           | 0.960      |
| 5 | 100 | 0.450       | 1            | 1           | 0.706      | 1            | 1           | 1          |

According to Table 1, we are able to make the following observations. There are four bold numbers in the table, all of which lie in the ℓ_1 regularization case with λ = 0.1, indicating that λ = 0.1 is beyond the admissible upper bound on λ in our analysis. In most situations, the experimental ratio of convergence decreases as λ increases, and the gap between λ = 0.01 and λ = 0.1 is much larger than the gap between λ = 0.001 and λ = 0.01, which implies that λ also plays an important role in the convergence probability in Theorem 5. In most cases the experimental ratio is much larger than the theoretical ratio. This indicates that outside the sphere in Theorem 2 and Theorem 4 there is still a large region in which the initial weights converge to w̃. Under the same parameters, the experimental ratio of convergence in the ℓ_2 case is almost always greater than that in the ℓ_1 case. This suggests that ℓ_2 regularization makes the dynamics easier to converge than ℓ_1 regularization does.
In this paper, we presented our convergence analysis of the dynamics of two-layered bias-free networks with one ReLU output, where the loss function includes the square loss and ℓ_1 or ℓ_2 regularization on the weight vector. This is an extension of Theorem 3.3 in [1]. We first solved for the optimal weight vector w̃ with a small regularization coefficient λ in both cases, and then used the Lyapunov method [6] to show that the system is asymptotically stable in a certain region. In the final step, we showed that Theorem 3.3 in [1] remains correct in these two situations. We also verified our theory through numerical experiments, including plotting the phase diagrams and running repeated simulations.
Our work gives a theoretical justification of convergence for two popular models. We started from the intuition that small regularization doesn't change the system too much, and our conclusion is compatible with this intuition. In the future, we plan to analyze the system with larger regularization, since in real situations λ is often fixed to be, for example, 0.5, which may be larger than the bounds in equations (17) and (27). This is more difficult since we won't expect w̃ to be close to w*, and other advanced techniques may be needed. We also plan to consider other popular regularization terms, and to provide a more general theory on this topic.