ADMMiRNN: Training RNN with Stable Convergence via An Efficient ADMM Approach

06/10/2020, by Yu Tang, et al.

It is hard to train Recurrent Neural Networks (RNNs) with stable convergence and to avoid gradient vanishing and exploding, since the weights in the recurrent unit are repeated from iteration to iteration. Moreover, RNNs are sensitive to the initialization of weights and bias, which brings difficulty to the training phase. With its gradient-free nature and immunity to poor conditioning, the Alternating Direction Method of Multipliers (ADMM) has become a promising algorithm for training neural networks beyond traditional stochastic gradient algorithms. However, ADMM cannot be applied to train RNNs directly, since the state in the recurrent unit is repetitively updated over timesteps. Therefore, this work builds a new framework named ADMMiRNN upon the unfolded form of RNN to address the above challenges simultaneously and provides novel update rules and a theoretical convergence analysis. We explicitly specify the key update rules in the iterations of ADMMiRNN with deliberately constructed approximation techniques and solutions to each subproblem instead of vanilla ADMM. Numerical experiments are conducted on MNIST and text classification tasks, where ADMMiRNN achieves convergent results and outperforms the compared baselines. Furthermore, ADMMiRNN trains RNNs in a more stable way, without gradient vanishing or exploding, compared to stochastic gradient algorithms. Source code is available at https://github.com/TonyTangYu/ADMMiRNN.


1 Introduction

Recurrent Neural Networks (RNNs) [4] have made great progress in various fields, such as language modelling [18], text classification [15], event extraction [21], and many other real-world applications [11]. Although RNN models have been widely used, they remain difficult to train because of the vanishing and exploding gradients problems (see [1] for more details). Moreover, RNN models are sensitive to their weights and biases [27] and may fail to converge under poor initialization. A method that addresses these problems simultaneously is still lacking.

Nowadays, gradient-based training algorithms are widely used in deep learning [16], such as Stochastic Gradient Descent (SGD) [24], Adam [13], and RMSProp [29]. However, they still suffer from vanishing or exploding gradients. Compared with traditional gradient-based optimization algorithms, the Alternating Direction Method of Multipliers (ADMM) is a much more robust method for training neural networks. It has been recognized as a promising way to alleviate the vanishing and exploding gradients problems and has attracted considerable attention from researchers. In addition, being a gradient-free technique, ADMM is also immune to poor conditioning [28].

In light of these properties of ADMM, and to alleviate the aforementioned problems in RNNs simultaneously, we are motivated to train RNN models with ADMM. However, it is not easy to apply ADMM to RNNs directly because of the recurrent state, which has no counterpart in MLPs or CNNs [14]: the recurrent states are updated over timesteps instead of iterations, which is not compatible with ADMM. In this paper, we tackle this problem by proposing ADMMiRNN together with a theoretical analysis. Experimental comparisons between ADMMiRNN and typical stochastic gradient algorithms, such as SGD and Adam, illustrate that ADMMiRNN avoids the vanishing and exploding gradients problems and surpasses traditional stochastic gradient algorithms in terms of stability and efficiency. The main contributions of this work are four-fold:

  • We propose a new framework named ADMMiRNN for training RNN models via ADMM. ADMMiRNN is built upon the unfolded RNN unit, a distinctive feature of RNNs, and settles the problems of vanishing or exploding gradients and of sensitivity to parameter initialization at the same time. Beyond vanilla ADMM, several practical techniques in our solution further aid convergence.

  • To the best of our knowledge, we are the first to handle RNN training with ADMM, a gradient-free approach that offers substantial stability advantages over traditional stochastic gradient algorithms.

  • Theoretical analysis of ADMMiRNN is presented, including update rules and convergence analysis. Our analysis ensures that ADMMiRNN is efficient and stable. Moreover, the proposed framework can be applied to various RNN-based models.

  • Based on our theoretical analysis, numerical experiments are conducted on several real-world datasets. The results demonstrate the efficiency of the proposed ADMMiRNN over several typical optimizers and further verify the stability of our approach.

Figure 1: Two different forms of an RNN. (a) The typical RNN cell. (b) The unfolded form of (a), which is functionally identical to the original form [10].

2 Related Work

The fundamental research on Recurrent Neural Networks was published in the 1980s. RNNs, together with variants such as Long Short-Term Memory (LSTM) [12], are powerful at modelling problems that have a defined order but no clear notion of time. The authors of [1] argued that RNN models are difficult to train due to vanishing and exploding gradients. Moreover, since RNNs are sensitive to the initialization of weights and biases, these parameters should be initialized according to the input data [27]. Further difficulties in training RNNs are discussed in [22]. To date, there is still no method that solves all of these problems in RNNs simultaneously.

ADMM was first introduced in [6], and its convergence was established in [5, 8]. Since ADMM can decompose a large constrained problem into several smaller ones, it has become one of the most powerful optimization frameworks and exhibits excellent performance in many fields, such as machine learning [2], signal processing [26], and tensor decomposition [9].

In general, ADMM seeks to tackle the following problem:

(1) \min_{x, z}\; f(x) + g(z) \quad \text{s.t.}\; Ax + Bz = c

Here, f and g are usually assumed to be convex functions. In Eq. (1), x \in \mathbb{R}^{n_1}, z \in \mathbb{R}^{n_2}, A \in \mathbb{R}^{m \times n_1}, B \in \mathbb{R}^{m \times n_2}, c \in \mathbb{R}^{m}, and Ax + Bz = c is a linear constraint, where n_1 and n_2 are the dimensions of x and z, respectively. The problem is solved by the Augmented Lagrangian Method, which is formalized as:

(2) L_\rho(x, z, \lambda) = f(x) + g(z) + \langle \lambda, Ax + Bz - c \rangle + \frac{\rho}{2}\,\| Ax + Bz - c \|_2^2

where \rho > 0 is the penalty parameter and \lambda is the Lagrangian multiplier.
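
To make the generic iteration concrete, the following minimal numpy sketch runs ADMM (in its scaled form) on a toy two-block consensus problem. The quadratic objectives, penalty value, and variable names are illustrative choices of ours, not taken from the paper.

```python
import numpy as np

# Minimal ADMM sketch for a toy consensus problem:
#   min_{x,z}  0.5*||x - p||^2 + 0.5*||z - q||^2   s.t.  x - z = 0
# Both subproblems are quadratic, so the x- and z-updates have closed forms.
p, q = np.array([1.0, 2.0]), np.array([3.0, 0.0])
rho = 1.0                      # penalty parameter
x = z = u = np.zeros(2)        # u is the scaled Lagrangian multiplier

for k in range(100):
    x = (p + rho * (z - u)) / (1.0 + rho)   # argmin over x of the augmented Lagrangian
    z = (q + rho * (x + u)) / (1.0 + rho)   # argmin over z
    u = u + (x - z)                         # dual ascent on the constraint residual

print(x, z)  # both converge to (p + q) / 2
```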

Notation        Description
t               the timestep
x_t             the input of an RNN cell
o_t             the output of an RNN cell
s_t             the state at timestep t
u               the weight corresponding to the input
w               the weight corresponding to the state
\hat{y}         the prediction
N               the number of cells after unfolding
R(\cdot)        the loss function
\Omega(\cdot)   the regularization term
k               the iteration count
Table 1: Important notations and corresponding descriptions.

Since ADMM was first proposed, plenty of theoretical and practical works have been developed [19]. In 2016, [28] proposed a new method to train neural networks with ADMM: traditional optimizers are abandoned in favor of ADMM, which trains neural networks in a robust and parallel fashion. ADMM was later applied to deep learning with remarkable results [30], providing a gradient-free training method with convergent and strong performance. Both works demonstrate that ADMM is a powerful optimization method for neural networks thanks to its gradient-free property. However, RNNs are not as simple as a multilayer perceptron: the recurrent state brings many challenges to solving RNNs with ADMM.

3 ADMM for RNNs

3.1 Notation

Before we dive into the ADMM method for RNNs, we first establish the notation used in this work. Consider a simple RNN cell as shown in Fig. 1: at timestep t, x_t is the input of the RNN cell and o_t is the corresponding output. The RNN can be expressed as:

(3) s_t = f(u x_t + w s_{t-1} + b), \quad o_t = v s_t + c

where f(\cdot) is an activation function and u, w, v, b, c are learnable parameters, shared across all timesteps. The recurrent state s_t varies over timesteps as well as over iterations, which makes it difficult to apply ADMM to RNNs directly. We therefore adopt the unfolded form of the RNN unit shown in Fig. 1 and decouple the computation above into three sub-steps. At timestep t, the updates are:

(4) a_t = u x_t + w s_{t-1} + b, \quad s_t = f(a_t), \quad o_t = v s_t + c

where f(\cdot) is the activation function, such as ReLU [20] or tanh (usually tanh in RNNs). Important notations are summarized in Table 1. In this paper, we consider the RNN in its unfolded form and present a theoretical analysis based on it.
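
For concreteness, the numpy sketch below performs the forward pass of an RNN unfolded over N timesteps with the decoupled intermediate variables a_t, s_t, and o_t of Eq. (4). The dimensions, random initialization, and variable names are illustrative assumptions of ours.

```python
import numpy as np

# Forward pass of an unfolded RNN with the decoupled variables of Eq. (4):
#   a_t = u x_t + w s_{t-1} + b,   s_t = tanh(a_t),   o_t = v s_t + c
rng = np.random.default_rng(0)
N, d_in, d_h, d_out = 4, 8, 16, 10            # unfolded cells and layer sizes (illustrative)
u = rng.normal(scale=0.1, size=(d_h, d_in))   # weight corresponding to the input
w = rng.normal(scale=0.1, size=(d_h, d_h))    # weight corresponding to the state
v = rng.normal(scale=0.1, size=(d_out, d_h))  # weight mapping the state to the output
b, c = np.zeros(d_h), np.zeros(d_out)         # biases

x = rng.normal(size=(N, d_in))                # input sequence x_1, ..., x_N
s_prev = np.zeros(d_h)                        # initial state s_0
a, s, o = [], [], []
for t in range(N):
    a_t = u @ x[t] + w @ s_prev + b           # pre-activation
    s_t = np.tanh(a_t)                        # state
    o_t = v @ s_t + c                         # output
    a.append(a_t); s.append(s_t); o.append(o_t)
    s_prev = s_t
```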

For the sake of convenience, we collect the variables defined above into a single set in the sequel. To apply ADMM to RNNs, assuming the RNN cell is unfolded into N continuous cells, we aim to solve the following problem:

Problem 1
(5)

In Problem 1, R(\cdot) is the loss function, which is convex and continuous, and \Omega(\cdot) is the regularization term on the weights, which is also convex and continuous. Rather than solving Problem 1 directly, we relax it by adding a penalty term and transform Eq. (5) into

Problem 2
(6)

where the penalty coefficient is a tuning parameter. Compared with Problem 1, Problem 2 is much easier to solve. According to [30], the solution of Problem 2 tends to the solution of Problem 1 as this coefficient tends to infinity. For simplicity and clarity, we use \langle x, y \rangle to denote the inner product x^\top y. For a positive semidefinite matrix M, we define the M-norm of a vector x as \|x\|_M = \sqrt{x^\top M x}.
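
To make the relaxation step concrete, the schematic below shows the generic pattern of replacing an equality constraint by a quadratic penalty; the placeholder constraint g(\theta) = 0 stands in for the constraints induced by Eq. (4), and \nu is our symbol for the tuning coefficient, neither of which is the paper's notation.

```latex
% Generic penalty-relaxation pattern behind the step from Problem 1 to Problem 2
% (g and nu are placeholder notation, not the paper's symbols).
\min_{\theta}\; R(\theta) + \Omega(\theta) \quad \text{s.t. } g(\theta) = 0
\qquad\Longrightarrow\qquad
\min_{\theta}\; R(\theta) + \Omega(\theta) + \frac{\nu}{2}\,\| g(\theta) \|_2^2 .
```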

3.2 ADMM Solver for RNN

As mentioned in Section 2, ADMM utilizes the Augmented Lagrangian Method of Eq. (2) to solve problems of the form of Eq. (1). Similarly, we adopt the same approach and present the corresponding augmented Lagrangian function of Eq. (6), namely Eq. (7):

(7)

where the coupling term is defined in Eq. (8):

(8)

Problem 2 is thus separated into eight subproblems and is solved through the updates of the corresponding variables. Note that the weights and biases are not changed over the timestep t; consequently, these parameters are updated over iterations rather than over timesteps. For clarity, we only describe the typical update rules of a few representative variables in the following subsections, since they involve some useful and typical techniques; the analysis of the other parameters, detailed in Appendix 0.A, is similar.

3.2.1 Update

We begin with the update of the first parameter in Eq. (7) at iteration k. In Eq. (8), this parameter is coupled with another variable, so solving the subproblem exactly requires the pseudo-inverse of a (rectangular) matrix, which slows down the training process. To avoid this, we introduce an auxiliary definition and replace the subproblem with Eq. (9):

(9)

which is equivalent to the linearized proximal point method inspired by [25]:

(10)

In this way, the update is greatly sped up compared with vanilla ADMM. It is worth noting that the proximal coefficient needs to be set properly, since it also affects the performance of ADMMiRNN.
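
To illustrate the linearized proximal idea in a self-contained way, the numpy sketch below replaces an exact quadratic subproblem solve with a single damped gradient step. The residual, the penalty rho, and the proximal coefficient tau are illustrative placeholders of ours, not the exact terms of Eqs. (9)-(10).

```python
import numpy as np

# Hedged sketch of a linearized proximal (prox-linear) step of the kind used in Eq. (10).
# Solving  argmin_W (rho/2)*||W @ x - target||^2 + (proximal term)  exactly would require a
# pseudo-inverse of x; linearizing the quadratic coupling at the current iterate W instead
# yields a single damped step  W <- W - grad / tau, with proximal coefficient tau.
def linearized_prox_step(W, x, target, rho=1.0, tau=10.0):
    residual = W @ x - target                 # coupling residual at the current iterate
    grad = rho * np.outer(residual, x)        # gradient of (rho/2)*||W x - target||^2 w.r.t. W
    return W - grad / tau                     # one damped step replaces the exact solve
```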

3.2.2 Update

Adding a proximal term similar to that in Section 3.2.1, the update splits into two cases depending on the timestep. In the first case, it is given by

(11)

and in the other,

(12)

Here we use a trick: if the pre-activation is small enough, then tanh(a) \approx a by the property of the tanh function. In this way, we can simplify the calculation.
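
As a quick sanity check of this approximation (the interval below is an arbitrary illustrative choice), the following numpy snippet confirms that tanh(a) and a differ only marginally for pre-activations of small magnitude.

```python
import numpy as np

# The linearisation trick behind the simplified update: for small pre-activations a,
# tanh(a) is well approximated by a itself.
a = np.linspace(-0.1, 0.1, 5)
print(np.max(np.abs(np.tanh(a) - a)))   # about 3e-4, so tanh(a) ~= a is a tight approximation
```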

3.2.3 Update

The parameter s_t represents the hidden state of the RNN cell shown in Fig. 1. With regard to its update, both s_t and s_{t-1} appear in Eq. (8). However, we only consider s_t in the RNN model, because s_{t-1} has already been updated in the previous unfolded cell, and updating it again would introduce redundant computation. This is another trick in our solution. Besides, s_t also needs to be decoupled from the weight it is multiplied by. Again there are two cases: in the first, s_t is updated through

(13)

and in the other,

(14)
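
For intuition, the sketch below solves a small least-squares subproblem with the same shape as a state update: the state is pulled toward a target value while being coupled to the output equation o = v s + c. The penalty weights, names, and exact terms are illustrative assumptions of ours, not the paper's Eq. (13) or (14).

```python
import numpy as np

# Illustrative closed-form update for a state-like variable s appearing in two quadratic terms:
#   minimize over s:  (rho_s/2)*||s - target_s||^2 + (rho_o/2)*||v @ s + c - target_o||^2
def quadratic_state_update(target_s, v, c, target_o, rho_s=1.0, rho_o=1.0):
    d = target_s.shape[0]
    A = rho_s * np.eye(d) + rho_o * v.T @ v               # normal-equation matrix
    rhs = rho_s * target_s + rho_o * v.T @ (target_o - c)
    return np.linalg.solve(A, rhs)                        # exact least-squares solution
```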

3.2.4 Update Lagrangian Multipliers

Similar to the updates of the primal variables, the three Lagrangian multipliers are updated as follows, respectively:

(15a)
(15b)
(15c)
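
The multiplier updates follow the standard ADMM dual-ascent pattern: each multiplier moves along the residual of the constraint it enforces, scaled by the corresponding penalty parameter. The sketch below shows this pattern for the three couplings of Eq. (4); the names lam_*, rho_*, and res_* are ours, and the exact terms of Eqs. (15a)-(15c) are given in the paper.

```python
# Hedged sketch of the dual (multiplier) ascent pattern behind Eqs. (15a)-(15c).
def update_multipliers(lams, rhos, res_a, res_s, res_o):
    lam_a, lam_s, lam_o = lams
    rho_a, rho_s, rho_o = rhos
    lam_a = lam_a + rho_a * res_a    # residual of a_t = u x_t + w s_{t-1} + b
    lam_s = lam_s + rho_s * res_s    # residual of s_t = f(a_t)
    lam_o = lam_o + rho_o * res_o    # residual of o_t = v s_t + c
    return lam_a, lam_s, lam_o
```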

3.2.5 Algorithm

Generally, we update the above parameters in two passes. First, the parameters are updated in a backward order; afterwards, ADMMiRNN reverses the update direction. After all the variables in an RNN cell have been updated, the Lagrangian multipliers are updated. Proceeding with these steps, we arrive at the training procedure for ADMMiRNN outlined in Algorithm 1.

Input: iteration , input , timestep .
Parameter: , , , , , , , , and
Output: , , , , ,

1:  Initialize , , , , , , , , , and .
2:  for  do
3:     for  do
4:        if  then
5:           Update in Eq. (20).
6:        else if  then
7:           Update in Eq. (21).
8:        end if
9:        Update in Eq. (19).
10:        Update in Eq. (18).
11:        if  then
12:           Update in Eq. (13).
13:           Update in Eq. (11).
14:        else if  then
15:           Update in Eq. (14).
16:           Update in Eq. (12).
17:        end if
18:        Update in Eq. (17).
19:        Update in Eq. (16).
20:        Update in Eq. (10).
21:        Update in Eq. (10).
22:        Update in Eq. (16).
23:        Update in Eq. (17).
24:        if  then
25:           Update in Eq. (11).
26:           Update in Eq. (13).
27:        else if  then
28:           Update in Eq. (12).
29:           Update in Eq. (14).
30:        end if
31:        Update in Eq. (18).
32:        Update in Eq. (19).
33:        if  then
34:           Update in Eq. (20).
35:        else if  then
36:           Update in Eq. (21).
37:        end if
38:     end for
39:     Update in Eq. (15a).
40:     Update in Eq. (15b).
41:     Update in Eq. (15c).
42:  end for
43:  return , , , , ,
Algorithm 1 The training algorithm for ADMMiRNN.
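
The skeleton below sketches the control flow of Algorithm 1 only: a backward sweep over the variables of each unfolded cell, a forward sweep in the reverse order, and one multiplier update per outer iteration. The ordered mapping update_fns and the callables it contains are placeholders of ours, standing in for the closed-form solutions of Eqs. (9)-(21).

```python
# Structural sketch of Algorithm 1 (control flow only); update bodies are placeholders.
def admmirnn_train(variables, multipliers, data, num_cells, num_iters,
                   update_fns, dual_update_fn):
    order = list(update_fns)                      # per-cell update order of the variables
    for k in range(num_iters):                    # outer ADMM iterations
        for t in range(num_cells):                # loop over the unfolded cells
            for name in reversed(order):          # backward sweep over the variables
                variables = update_fns[name](variables, multipliers, data, t)
            for name in order:                    # forward sweep in the reversed direction
                variables = update_fns[name](variables, multipliers, data, t)
        # Lagrangian multipliers are updated once per outer iteration, cf. Eqs. (15a)-(15c)
        multipliers = dual_update_fn(variables, multipliers, data)
    return variables, multipliers
```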

3.3 Convergence Analysis

In this section, we present the convergence analysis of ADMMiRNN. For convenience, we first introduce some shorthand notation and then give the following mild assumptions:

Assumption 1

The gradient of the objective function F is L-Lipschitz continuous, i.e., \|\nabla F(x) - \nabla F(y)\| \le L \|x - y\| for all x, y, where L is called the Lipschitz constant. This is equivalent to F(y) \le F(x) + \langle \nabla F(x), y - x \rangle + \frac{L}{2}\|y - x\|^2;

Assumption 2

The gradient of the objective function is bounded, i.e., there exists a constant M > 0 such that \|\nabla F(x)\| \le M;

Assumption 3

The second-order moment of the gradient is uniformly upper bounded by a constant.

Such assumptions are typically used in [7, 33]. Under these assumptions, we have the properties [31] given in the supplementary materials. We can then prove that ADMMiRNN converges via the following theorems.

Theorem 3.1

If the penalty parameters are chosen properly and Assumptions 1-3 hold, then Properties 1-3 in the supplementary materials hold.

Theorem 3.2

If the penalty parameters are chosen properly, then for the variables in Problem 2, starting from any initialization, the generated sequence has at least one limit point, and any limit point is a critical point of Problem 2. In other words, ADMMiRNN reaches a stationary point of Problem 2.

Theorem 3.2 establishes the global convergence of ADMMiRNN.

Figure 2: Training loss (a) and test loss (b) versus iterations of an RNN trained via ADMM, SGD, AdaGrad, Momentum, RMSprop, and Adam. ADMMiRNN achieves the best performance compared with the other optimizers on MNIST.
Theorem 3.3

For a sequence generated by Algorithm 1, define the associated residual quantity c_k; then the convergence rate of c_k is o(1/k).

Theorem 3.3 states that ADMMiRNN converges globally at a rate of o(1/k), which is consistent with existing convergence results for ADMM [32, 31]. Due to space limitations, the proofs of the above theorems are deferred to the supplementary materials.
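
For concreteness, writing c_k for the residual quantity of Theorem 3.3 (its precise definition is part of the theorem and its proof), the o(1/k) rate can be read as follows.

```latex
% Reading of the o(1/k) rate in Theorem 3.3: the residual quantity c_k decays
% strictly faster than 1/k.
c_k = o\!\left(\frac{1}{k}\right)
\quad\Longleftrightarrow\quad
\lim_{k \to \infty} k \, c_k = 0 .
```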

4 Experiments

4.1 Setup

We train the RNN model shown in Fig. 1 on MNIST [17]. The model is implemented in numpy, and its parameters are updated following Algorithm 1. The MNIST dataset contains 55,000 training samples and 10,000 test samples and was first introduced in [17] for handwritten-digit image recognition. For a fair comparison, all MNIST experiments are run for 1000 iterations on a 64-bit Ubuntu 16.04 system.

In addition, our experiments are also conducted on a text dataset, which can be accessed from our open-source code repository. Training on text is a typical RNN task. We implement a typical RNN model in numpy and unfold it into N cells, where N is also the length of the input sequence. In these experiments, we adopt a smoothed loss. They are performed on a MacBook Pro with an Intel 3.1 GHz Core i5 processor and 8 GB of memory.

In all of our experiments, we use fixed values for the hyperparameters, such as the penalty and tuning coefficients.

Figure 3: The results of ADMMiRNN and SGD for different input sequence lengths; the legend indicates the length.

4.2 Convergence Results

4.2.1 Results on MNIST

We train the simple RNN model shown in Fig. 1 with different optimizers, including SGD, Adam, Momentum [23], RMSProp [29], and AdaGrad [3]. We compare ADMMiRNN with these commonly used optimizers in terms of training loss and test loss and display the experimental results on MNIST in Fig. 2(a) and Fig. 2(b), respectively. Fig. 2(a) and Fig. 2(b) indicate that ADMMiRNN converges faster than the other optimizers. Moreover, ADMMiRNN yields a much smoother loss curve, whereas the loss curves of the other optimizers oscillate considerably.

Figure 4: Stability comparison among ADMMiRNN, SGD, Adam, and RMSProp: (a) training loss and (b) test loss. For each optimization method, the experiment is repeated 10 times to obtain the mean and variance of the training loss and test loss against iterations on MNIST.

This indicates that ADMMiRNN trains models in a relatively stable manner. Besides, ADMMiRNN reaches much more promising training and test losses. These results not only show that ADMMiRNN converges on RNN tasks but also confirm that it is a more powerful tool than traditional gradient-based optimizers in deep learning.

4.2.2 Results on Text Data

Besides the experiments on MNIST, we also explore how ADMMiRNN performs on text classification tasks. One critical shortcoming of current RNN models is that they are sensitive to the length of the input sequence: the longer the input sequence, the worse the training results. To investigate the sensitivity of ADMMiRNN to the input length, we measure the performance of ADMMiRNN and SGD on the text data with different input sequence lengths. The results are displayed in Fig. 3. Here, we use the average loss over the input sequence as our metric. Fig. 3 shows that ADMMiRNN consistently produces remarkable results and is nearly immune to the sequence length, performing far more impressively than SGD regardless of the length of the input sequence.

Hyperparameters   Training loss   Test loss
1, 1, 1, 1
0.1, 1, 1, 1
1, 0.1, 1, 1
1, 1, 0.1, 1
1, 1, 10, 1
1, 1, 1, 10
1, 1, 10, 10
Table 2: Training loss and test loss under different hyperparameter settings. All of these values are obtained after 20 iterations.

4.3 Stability

As aforementioned, the initialization of weights and biases is critical in RNN models. In this section, we compare ADMMiRNN with several different optimizers and explore its stability for RNNs. Specifically, we compare ADMMiRNN with SGD, Adam, and RMSProp and repeat each scheme ten times independently. The experimental results are displayed in Fig. 4. The blocks in Fig. 4(a) and Fig. 4(b) represent the standard deviation of the samples drawn from the training and testing process: the smaller the blocks, the more stable the method. From Fig. 4(a) and Fig. 4(b), we observe that SGD fluctuates only slightly at the beginning, but as training progresses the fluctuation becomes increasingly sharp, indicating that SGD tends to become unstable. As for Adam and RMSProp, their variance decreases over time but remains large compared with ADMMiRNN. Depending on the initialization of weights and biases, these optimizers may produce widely different results. In contrast, ADMMiRNN maintains a very small variance from beginning to end, too small to be visible in Fig. 4(a) and Fig. 4(b), which indicates that ADMMiRNN is immune to the initialization of weights and biases and resolves the sensitivity of RNN models to initialization.

Hyperparameters   Training iterations   Test iterations
1, 1, 1, 1        2                     2
0.1, 1, 1, 1      2                     2
1, 0.1, 1, 1      2                     2
1, 1, 0.1, 1      2                     2
1, 1, 10, 1       3                     3
1, 1, 10, 10      11                    11
1, 1, 1, 10       2                     2
Table 3: Iteration counts needed for the training and test accuracy to reach 100.0 under different hyperparameter settings, as a measure of the convergence speed.

No matter how the initialization changes, ADMMiRNN always gives a stable training process and promising results. The results demonstrate that ADMMiRNN is a more stable training algorithm for RNN models than stochastic gradient algorithms.

4.4 Different Hyperparameters

4.4.1 Varying the penalty parameters

In vanilla ADMM, the value of the penalty parameter is critical, and a poor choice may hinder convergence. In this subsection, we try different hyperparameter values in ADMMiRNN and evaluate how they influence the training process. The results are summarized in Table 2 and Table 3. Table 2 implies that the penalty parameters determine how good the final result of ADMMiRNN is. More precisely, larger values delay convergence, and if they are too large, training may even fail to converge. In Table 3, we present the number of iterations needed for the accuracy to reach 100.0 in the training and test process: the largest setting requires 11 iterations, whereas most other settings need only 2. Furthermore, it turns out that some of these hyperparameters matter little in ADMMiRNN, while others play a much more crucial role with regard to convergence and convergence speed.

4.4.2 Varying the coefficient in Eq. (8)

In this subsection, we investigate the influence of the coefficient in Eq. (8). In our experiments on the text data, we fix all the other hyperparameters and set this coefficient to several different values. The curves corresponding to the different values are displayed in Fig. 5.

Figure 5: Training loss versus iterations of ADMMiRNN on a text classification task for different values of the coefficient in Eq. (8).

Fig. 5 suggests that a larger coefficient produces a relatively worse convergence result in ADMMiRNN. A small coefficient not only leads to a small loss but also pushes the training process to converge faster. However, once the coefficient is small enough, its influence on the convergence rate and the final result becomes negligible.

4.5 Extension

In this subsection, we conduct an additional experiment on MNIST to explore how the weight coefficient influences performance; the results are shown in Fig. 6. Fig. 6(a) and Fig. 6(b) illustrate that a large coefficient delays convergence and makes training more oscillatory than a small one. However, its influence on the loss, shown in Fig. 6(c), is not as obvious as that on the training and validation accuracy.

Figure 6: Training an RNN on MNIST via ADMMiRNN: (a) training accuracy, (b) validation accuracy, and (c) training loss versus epochs.

5 Conclusion

In this paper, we proposed a new framework, namely ADMMiRNN, for training RNN models. Since it is difficult to train RNNs with ADMM directly, we built ADMMiRNN upon the unfolded form of RNNs. The convergence analysis of ADMMiRNN is presented, and ADMMiRNN achieves a convergence rate of o(1/k). We further conducted several experiments on real-world datasets based on our theoretical analysis. Experimental comparisons between ADMMiRNN and several popular optimizers show that ADMMiRNN converges faster than these gradient-based optimizers and trains in a much more stable manner. To the best of our knowledge, we are the first to apply ADMM to RNN tasks with a theoretical analysis, and ADMMiRNN is the first approach to alleviate the vanishing and exploding gradients problems and the sensitivity of RNN models to initialization at the same time. In conclusion, ADMMiRNN is a promising tool for training RNN models. In future work, we will explore how to choose the penalty parameters adaptively during training rather than using fixed values. Moreover, we will try to train ADMMiRNN in parallel, as ADMM is well suited to parallel optimization.

Acknowledgement

This work is sponsored in part by the National Key R&D Program of China under Grant No. 2018YFB2101100 and the National Natural Science Foundation of China under Grant No. 61806216, 61702533, 61932001, 61872376 and 61701506.

References

  • [1] Y. Bengio, P. Simard, P. Frasconi, et al. (1994) Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks 5 (2), pp. 157–166. Cited by: §2, footnote 1.
  • [2] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, et al. (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning 3 (1), pp. 1–122. Cited by: §2.
  • [3] J. Duchi, E. Hazan, and Y. Singer (2011) Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12 (Jul), pp. 2121–2159. Cited by: §4.2.1.
  • [4] J. L. Elman (1990) Finding structure in time. Cognitive science 14 (2), pp. 179–211. Cited by: §1.
  • [5] D. Gabay (1983) Augmented lagrangian methods: applications to the solution of boundary-value problems, chapter applications of the method of multipliers to variational inequalities. North-Holland, Amsterdam 3, pp. 4. Cited by: §2.
  • [6] D. Gabay and B. Mercier (1976) A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & mathematics with applications 2 (1), pp. 17–40. Cited by: §2.
  • [7] S. Ghadimi, G. Lan, and H. Zhang (2016) Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming 155 (1-2), pp. 267–305. Cited by: §3.3.
  • [8] R. Glowinski and P. Le Tallec (1989) Augmented lagrangian and operator-splitting methods in nonlinear mechanics. Vol. 9, SIAM. Cited by: §2.
  • [9] D. Goldfarb and Z. Qin (2014) Robust low-rank tensor recovery: models and algorithms. SIAM Journal on Matrix Analysis and Applications 35 (1), pp. 225–253. Cited by: §2.
  • [10] I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT press. Cited by: Figure 1.
  • [11] A. Graves, S. Fernández, and J. Schmidhuber (2007) Multi-dimensional recurrent neural networks. In International conference on artificial neural networks, pp. 549–558. Cited by: §1.
  • [12] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §2.
  • [13] D. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. Computer Science. Cited by: §1.
  • [14] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.
  • [15] S. Lai, L. Xu, K. Liu, and J. Zhao (2015) Recurrent convolutional neural networks for text classification. In AAAI, Vol. 333, pp. 2267–2273. Cited by: §1.
  • [16] Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. nature 521 (7553), pp. 436–444. Cited by: §1.
  • [17] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §4.1.
  • [18] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur (2010) Recurrent neural network based language model. In Proceedings of ISCA, Cited by: §1.
  • [19] R. D. Monteiro and B. F. Svaiter (2010) Iteration-complexity of block-decomposition algorithms and the alternating minimization augmented lagrangian method. Manuscript, School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA, pp. 30332–0205. Cited by: §2.
  • [20] V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pp. 807–814. Cited by: §3.1.
  • [21] T. H. Nguyen, K. Cho, and R. Grishman (2016) Joint event extraction via recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 300–309. Cited by: §1.
  • [22] R. Pascanu, T. Mikolov, and Y. Bengio (2013) On the difficulty of training recurrent neural networks. In International conference on machine learning, pp. 1310–1318. Cited by: §2.
  • [23] N. Qian (1999) On the momentum term in gradient descent learning algorithms. Neural networks 12 (1), pp. 145–151. Cited by: §4.2.1.
  • [24] H. Robbins and S. Monro (1951) A stochastic approximation method. The annals of mathematical statistics, pp. 400–407. Cited by: §1.
  • [25] R. T. Rockafellar (1976) Monotone operators and the proximal point algorithm. SIAM journal on control and optimization 14 (5), pp. 877–898. Cited by: §3.2.1.
  • [26] T. Sun, H. Jiang, L. Cheng, and W. Zhu (2018) Iteratively linearized reweighted alternating direction method of multipliers for a class of nonconvex problems. IEEE Transactions on Signal Processing 66 (20), pp. 5380–5391. Cited by: §2.
  • [27] I. Sutskever, J. Martens, G. Dahl, and G. Hinton (2013) On the importance of initialization and momentum in deep learning. In International conference on machine learning, pp. 1139–1147. Cited by: §1, §2.
  • [28] G. Taylor, R. Burmeister, Z. Xu, B. Singh, A. Patel, and T. Goldstein (2016) Training neural networks without gradients: a scalable admm approach. In International conference on machine learning, pp. 2722–2731. Cited by: §1, §2.
  • [29] T. Tieleman and G. Hinton (2012) Lecture 6.5-rmsprop, coursera: neural networks for machine learning. University of Toronto, Technical Report. Cited by: §1, §4.2.1.
  • [30] J. Wang, F. Yu, X. Chen, and L. Zhao (2019) Admm for efficient deep learning with global convergence. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 111–119. Cited by: §2, §3.1.
  • [31] J. Wang, L. Zhao, and L. Wu (2019) Multi-convex inequality-constrained alternating direction method of multipliers. arXiv preprint arXiv:1902.10882. Cited by: §3.3, §3.3.
  • [32] W. Zhong and J. Kwok (2014) Fast stochastic alternating direction method of multipliers. In International Conference on Machine Learning, pp. 46–54. Cited by: §3.3.
  • [33] F. Zou, L. Shen, Z. Jie, W. Zhang, and W. Liu (2019) A sufficient condition for convergences of adam and rmsprop. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11127–11135. Cited by: §3.3.

Appendix 0.A Appendix A

0.a.1 Update

Similar to the update in Section 3.2.1, we again introduce an auxiliary definition and use the linearized proximal point method; the update is then transformed into

(16)

0.a.2 Update

This parameter is updated by

(17)

0.a.3 Update

Similarly to the above, the corresponding update rule is

(18)

0.a.4 Update

The update of this parameter is quite simple:

(19)

0.a.5 Update

Finally, we update the last variable. Note that it is also updated separately for each timestep. In the first case,

(20)

and in the other,

(21)