Recurrent Neural Networks (RNNs) have made great progress in various fields, such as language modelling, text classification, event extraction, and various real-world applications. Although RNN models are widely used, they remain difficult to train because of the vanishing gradients and exploding gradients problems (more information about vanishing gradients and exploding gradients can be found in [1]). Moreover, RNN models are sensitive to the initialization of weights and biases and may fail to converge when poorly initialized. A method that addresses these problems simultaneously is still needed.
Nowadays, gradient-based training algorithms, such as Stochastic Gradient Descent (SGD) [24], Adam [13], and RMSProp [29], are widely used in deep learning. However, they still suffer from vanishing or exploding gradients. Compared with these traditional gradient-based optimization algorithms, the Alternating Direction Method of Multipliers (ADMM) is a more robust method for training neural networks. It has been recognized as a promising approach to alleviating the vanishing gradients and exploding gradients problems and has attracted considerable attention from researchers. In addition, because it is gradient-free, ADMM is also immune to poor conditioning.
In light of these properties of ADMM, and to alleviate the aforementioned problems in RNNs simultaneously, we are motivated to train RNN models with ADMM. However, applying ADMM to RNNs directly is harder than applying it to MLPs and CNNs because of the recurrent state. The recurrent states are updated over timesteps instead of iterations, which is not compatible with ADMM. In this paper, to tackle this problem, we propose the ADMMiRNN method together with a theoretical analysis. Experimental comparisons between ADMMiRNN and some typical stochastic gradient algorithms, such as SGD and Adam, illustrate that ADMMiRNN avoids the vanishing gradients and exploding gradients problems and surpasses traditional stochastic gradient algorithms in terms of stability and efficiency. The main contributions of this work are four-fold:
- We propose a new framework named ADMMiRNN for training RNN models via ADMM. ADMMiRNN is built upon the unfolded RNN unit, which is a distinctive feature of RNNs, and addresses vanishing or exploding gradients and sensitivity to parameter initialization at the same time. Rather than using vanilla ADMM, our solution adopts several practical techniques that also help convergence.
- To the best of our knowledge, we are the first to handle RNN training problems using ADMM, a gradient-free approach that offers clear stability advantages over traditional stochastic gradient algorithms.
- A theoretical analysis of ADMMiRNN is presented, including update rules and convergence analysis. Our analysis shows that ADMMiRNN converges efficiently and stably. Moreover, the framework proposed in this work could be applied to various RNN-based works.
- Based on our theoretical analysis, numerical experiments are conducted on several real-world datasets. The results demonstrate that the proposed ADMMiRNN is more efficient than several typical optimizers and further verify the stability of our approach.
2 Related Work
The fundamental research on Recurrent Neural Networks was published in the 1980s. RNNs are well suited to modelling problems with a defined order but no clear concept of time, and Long Short-Term Memory (LSTM) is a widely used variant. It has been argued that RNN models are difficult to train because of vanishing gradients and exploding gradients. Moreover, since RNNs are sensitive to the initialization of weights and biases, those parameters should be initialized according to the input data. Other work also reports difficulties in training RNNs. Until now, there has been no method that solves these problems in RNNs at the same time.
Since ADMM can decompose large constrained problems into several small ones, it has become one of the most powerful optimization frameworks and performs well in many fields, such as machine learning, signal processing, and tensor decomposition.
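In general, ADMM seeks to tackle the following problem:
$$\min_{x, y} \; f(x) + g(y) \quad \text{s.t.} \quad Ax + By = c. \tag{1}$$
Here, $f$ and $g$ are usually assumed to be convex functions. In Eq. (1), $x \in \mathbb{R}^{m}$, $y \in \mathbb{R}^{n}$, $A \in \mathbb{R}^{p \times m}$, $B \in \mathbb{R}^{p \times n}$, $c \in \mathbb{R}^{p}$, $Ax + By = c$ is a linear constraint, and $m$, $n$ are the dimensions of $x$, $y$, respectively. It is solved by the Augmented Lagrangian Method, which is formalized as:
$$L_{\rho}(x, y, \lambda) = f(x) + g(y) + \langle \lambda, Ax + By - c \rangle + \frac{\rho}{2} \| Ax + By - c \|_2^2, \tag{2}$$
where $\rho$ is the penalty term and $\lambda$ is the Lagrangian multiplier.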
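As a minimal sketch of how ADMM alternates between block minimizations of the augmented Lagrangian and a multiplier update, the NumPy example below applies scaled-form ADMM to a toy problem that is not taken from this paper: minimize 0.5*||x - b||^2 + lam*||z||_1 subject to x - z = 0. The function names, the toy objective, and the values of `lam` and `rho` are illustrative assumptions. The toy problem is chosen because its exact solution is the soft-thresholding of `b`, which makes the result easy to check.

```python
import numpy as np


def soft_threshold(v, kappa):
    """Elementwise soft-thresholding, the proximal operator of kappa * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)


def admm_soft_threshold(b, lam=1.0, rho=1.0, num_iters=200):
    """Scaled-form ADMM for: min 0.5*||x - b||^2 + lam*||z||_1  s.t.  x - z = 0."""
    b = np.asarray(b, dtype=float)
    x = np.zeros_like(b)
    z = np.zeros_like(b)
    u = np.zeros_like(b)  # scaled multiplier (the multiplier divided by rho)
    for _ in range(num_iters):
        # x-update: closed-form minimizer of 0.5*||x - b||^2 + (rho/2)*||x - z + u||^2
        x = (b + rho * (z - u)) / (1.0 + rho)
        # z-update: proximal step on the l1 term
        z = soft_threshold(x + u, lam / rho)
        # multiplier update: accumulate the constraint residual x - z
        u = u + x - z
    return x, z, u
```

Calling `admm_soft_threshold(np.array([3.0, 0.5, -2.0]))` should drive `x` and `z` toward the same vector, with entries whose magnitude is below `lam` set to zero. Each update only needs a closed-form or proximal step, which is the property that makes ADMM attractive as an alternative to backpropagated gradients.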
Table 1: Important notations.
|the input of RNN cells|
|the output of RNN cells|
|the state at timestep|
|the weight corresponding to the input|
|the weight corresponding to the state|
|the cell numbers after unfolding|
|the loss function|
|the regularization term|
|the iteration count|
Since ADMM was first proposed, plenty of theoretical and practical works have been developed. In 2016, [28] proposed a new method to train neural networks using ADMM. They abandoned traditional optimizers and adopted ADMM, which trains neural networks in a robust and parallel fashion. Furthermore, ADMM was applied to deep learning with remarkable results in [30], which provided a gradient-free method for training neural networks that converges and performs well. Both works show that ADMM is a powerful optimization method for neural networks because of its gradient-free property. However, RNNs are not as simple as a Multilayer Perceptron, and the recurrent state makes training RNNs with ADMM challenging.
3 ADMM for RNNs
Before we dive into the ADMM methods for RNNs, we establish notations in this work. Considering a simple RNN cell as shown in Fig. 1, at timestep , is the input of the RNN cell and is the related output, RNNs could be expressed as:
is an activation function and and are learnable parameters, . These parameters are also shared across timesteps. The recurrent state in RNNs varies over timesteps as well as iterations, which makes it difficult to apply ADMM to RNNs directly. We adopt the unfolded form of the RNN unit shown in Fig. 1 and decouple the above parameters into three sub-problems. Normally, at timestep , the updates are as follows:
is the activation function, such as ReLU or tanh, usually tanh in RNNs. Important notations are summarized in Table 1. In this paper, we consider RNN in an unfolding form and present a theoretical analysis based on it.
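To make the unfolded form concrete, here is a minimal NumPy sketch of the forward pass of a vanilla RNN over N cells. The symbol names (`U`, `W`, `V`, `b`, `c`, `a_t`, `s_t`, `o_t`) and the affine output layer are assumptions chosen for illustration and may differ from the notation in Table 1. The same weights are reused in every cell.

```python
import numpy as np


def rnn_forward(xs, U, W, V, b, c, s0=None):
    """Run a vanilla RNN over a sequence; return pre-activations, states, outputs.

    xs: array of shape (N, input_dim), one row per unfolded cell.
    """
    hidden_dim = W.shape[0]
    s_prev = np.zeros(hidden_dim) if s0 is None else s0
    pre_acts, states, outputs = [], [], []
    for x_t in xs:
        a_t = U @ x_t + W @ s_prev + b  # linear part of the cell
        s_t = np.tanh(a_t)              # recurrent state after the activation
        o_t = V @ s_t + c               # output of the cell
        pre_acts.append(a_t)
        states.append(s_t)
        outputs.append(o_t)
        s_prev = s_t
    return np.array(pre_acts), np.array(states), np.array(outputs)
```

The pre-activation, state, and output of each cell are stored separately because a splitting method of the ADMM type treats such intermediate quantities as distinct variables tied together by constraints, rather than as a single composed function.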
For the sake of convenience, we define in the sequel. To apply ADMM to RNNs, assuming the RNN cell is unfolded into consecutive cells, we try to solve the following mathematical problem:
In Problem 1, is the loss function which is convex and continuous, is the regularization term on the parameter . It is also a convex and continuous function. Rather than solving Problem 1 directly, we can relax it by adding an penalty term and transform Eq. (5) into
where is a tuning parameter. Compared with Problem 1, Problem 2 is much easier to solve. According to [30], the solution of Problem 2 tends to be the solution of Problem 1 when . For simplicity and clarity, we use $\langle \cdot, \cdot \rangle$ to denote the inner product and $\| \cdot \| = \sqrt{\langle \cdot, \cdot \rangle}$. For a positive semidefinite matrix $G$, we define the $G$-norm of a vector $x$ as $\| x \|_G = \sqrt{x^{T} G x}$.
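The effect of relaxing a hard constraint into a penalty term can be seen on a one-dimensional toy problem that is not part of this paper: minimize x^2 subject to x = 1, relaxed to x^2 + (nu/2)*(x - 1)^2. The sketch below assumes that the tuning parameter acts as a penalty weight `nu` that is driven toward infinity; both the name `nu` and that limit are assumptions made for illustration.

```python
def penalized_minimizer(nu):
    """Minimizer of x**2 + (nu / 2) * (x - 1)**2, a relaxation of: min x**2 s.t. x = 1."""
    # Setting the derivative 2*x + nu*(x - 1) to zero gives x = nu / (2 + nu).
    return nu / (2.0 + nu)


def constraint_violation(nu):
    """Distance between the relaxed solution and the constraint x = 1."""
    return abs(penalized_minimizer(nu) - 1.0)
```

The violation equals 2 / (2 + nu), so it shrinks monotonically as `nu` grows and the relaxed solution approaches the constrained one. A finite `nu` therefore trades a small constraint violation for an easier unconstrained subproblem.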
3.2 ADMM Solver for RNN
As explained in Section 2, ADMM uses the Augmented Lagrangian Method to solve problems like Eq. (2). We proceed in the same way and present the corresponding Lagrangian function of Eq. (6), namely Eq. (7):
where is defined in Eq. (8).
Problem 2 is separated into eight subproblems and can be solved by updating the parameters in . Note that in do not change over timestep . Consequently, these parameters are updated over iterations. For clarity, we only describe the update rules for and in the following subsections, because these subproblems involve some useful and representative techniques; the analysis of the other parameters is similar and is detailed in Appendix A.
We begin with the update of in Eq. (7) at iteration . In Eq. (8), and are coupled. As a result, we would need to calculate the pseudo-inverse of the (rectangular) matrix , which makes training harder. To solve this problem, we define and replace it with Eq. (9).
It is equivalent to the linearized proximal point method inspired by [25]:
In this way, the update of is much faster than in vanilla ADMM. It is worth noting that needs to be set properly, since it can also affect the performance of ADMMiRNN.
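The idea of replacing an exact solve by a linearized proximal step can be sketched on a generic quadratic subproblem. In the NumPy example below, `objective` is a stand-in for a quadratic coupling term of the form 0.5*||A - U X||_F^2; the names `Umat`, `X`, `A`, and `tau` are hypothetical and do not correspond to specific symbols in Eq. (8) or Eq. (9). The step minimizes the first-order expansion of the objective plus a proximal term (tau/2)*||U - U_k||_F^2, which has a closed form and avoids the pseudo-inverse of `X`.

```python
import numpy as np


def objective(Umat, X, A):
    """phi(U) = 0.5 * ||A - U X||_F^2, a stand-in for a quadratic ADMM subproblem."""
    return 0.5 * np.sum((A - Umat @ X) ** 2)


def linearized_proximal_step(Umat, X, A, tau):
    """Minimize the linearization of phi at Umat plus (tau/2) * ||U - Umat||_F^2.

    The minimizer is Umat - grad / tau, so no pseudo-inverse of X is required.
    """
    grad = (Umat @ X - A) @ X.T
    return Umat - grad / tau
```

If `tau` is at least the largest eigenvalue of `X @ X.T`, each step does not increase the objective, which is one way to read the remark that this parameter has to be set properly. A value that is too small can make the iteration diverge, while a value that is too large makes progress slow.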
Adding a proximal term similar to that in Section 3.2.1, if , this could be done by
Here is a trick: If is small enough, we have as a result of the property of the tanh function. In this way, we can simplify the calculation of .
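Assuming the trick refers to the first-order approximation tanh(a) ≈ a for small a, which is an assumption about the elided expression, the short check below measures how good that approximation is. The function name is hypothetical.

```python
import math


def tanh_linearization_error(a):
    """Absolute error of the first-order approximation tanh(a) ~ a."""
    return abs(math.tanh(a) - a)
```

Since tanh(a) = a - a^3/3 + ..., the error shrinks cubically: it is roughly 3e-4 at a = 0.1 and below 1e-9 at a = 0.001, whereas at a = 1 it exceeds 0.2. The simplification is therefore only reasonable when the argument stays close to zero.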
The parameter represents the hidden state in the RNN cell shown in Fig. 1. With regard to the update of , there are and in Eq. (8). However, we only consider in the RNN model. This is because has already been updated in the previous unit, and including it would cause redundant computation in the updating process. This is another trick in our solution. Besides, also needs to be decoupled from . If , we could update through
And when ,
3.2.4 Update Lagrangian Multipliers
Similar to the parameters update, , and are updated as follows respectively:
Generally, we update the above parameters in two steps. First, these parameters are updated in a backward way, namely . Afterwards, ADMMiRNN reverses the update direction in . After all the variables in an RNN cell have been updated, the Lagrangian multipliers are updated. Proceeding with the above steps, we arrive at the algorithm for ADMMiRNN, which is outlined in Algorithm 1.
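The two-step schedule can be sketched as a backward pass over an ordered list of update blocks, followed by a forward pass over the same list, followed by the multiplier update. The exact ordering of variables is not reproduced here, so the sketch takes an arbitrary list of block labels; the function names and the generic dual step `lam + rho * residual` are assumptions based on the standard ADMM multiplier update.

```python
def backward_forward_schedule(blocks):
    """One iteration: visit blocks in reverse order, then in the given order, then multipliers."""
    backward = list(reversed(blocks))
    forward = list(blocks)
    return backward + forward + ["multipliers"]


def update_multiplier(lam, residual, rho):
    """Standard ADMM dual step: move the multiplier along the constraint residual."""
    return lam + rho * residual
```

For example, `backward_forward_schedule(["u", "w", "s"])` visits `s`, `w`, `u`, then `u`, `w`, `s`, and finally the multipliers, where the three labels are placeholders. Updating the multipliers last means that the residual they accumulate reflects the most recent values of all primal variables.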
3.3 Convergence Analysis
In this section, we present the convergence analysis of ADMMiRNN. For convenience, we define . First, we give some mild assumptions as follows:
Assumption 1. The gradient of is -Lipschitz continuous, , , and is called the Lipschitz constant. This is equivalent to ;
Assumption 2. The gradient of the objective function is bounded, , there exists a constant such that ;
Assumption 3. The second-order moment of the gradient is uniformly upper-bounded, that is to say .
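As a small illustration of what a Lipschitz-continuous gradient means, the NumPy sketch below uses a quadratic function f(x) = 0.5 * x^T Q x, which is not the objective of this paper. For a symmetric matrix `Q`, the gradient is `Q @ x` and the smallest valid Lipschitz constant is the largest absolute eigenvalue of `Q`. The function names and the choice of `Q` are illustrative assumptions.

```python
import numpy as np


def lipschitz_constant_quadratic(Q):
    """Smallest Lipschitz constant of grad f for f(x) = 0.5 * x^T Q x with symmetric Q."""
    return float(np.max(np.abs(np.linalg.eigvalsh(Q))))


def gradient(Q, x):
    """Gradient of f(x) = 0.5 * x^T Q x for symmetric Q."""
    return Q @ x
```

With this constant H, any two points x and y satisfy ||grad f(x) - grad f(y)|| <= H * ||x - y||, which is the kind of inequality that the Lipschitz-gradient assumption requires of the objective.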
Such assumptions are typically used in [7, 33]. Under these assumptions, we obtain the properties shown in the supplementary materials. Then we can prove that ADMMiRNN converges under the following theorems.
If and Assumptions 1-3 hold, then Properties 1-3 in the supplementary materials hold.
Theorem 3.2 concludes that ADMMiRNN converges globally.
For a sequence generated by Algorithm 1, define , the convergence rate of is .
We train an RNN model shown in Fig. 1 on MNIST. The model is implemented in NumPy, and its parameters are updated following Algorithm 1. The MNIST dataset has 55,000 training samples and 10,000 test samples and was first introduced in [17] for handwritten-digit image recognition. To make a fair comparison, all the experiments related to MNIST are run for 1000 iterations on a 64-bit Ubuntu 16.04 system.
In addition, our experiments are also conducted on a text. The text can also be accessed from our open-source code repository. Training on a text is a typical RNN task. We implement a typical RNN model in NumPy and unfold it to cells, where is also the length of the input sequence. In our experiments, we adopt a kind of smooth loss. These experiments are performed on a MacBook Pro with an Intel 3.1 GHz Core i5 processor and 8 GB of memory.
In all of our experiments, we use fixed values for the hyperparameters, such as and .
4.2 Convergence Results
4.2.1 Results on MNIST
We train the simple RNN model shown in Fig. 1 with different optimizers, including SGD, Adam, Momentum [23], RMSProp [29], and AdaGrad [3]. We compare ADMMiRNN with these commonly used optimizers in terms of training loss and test loss and display our experimental results on MNIST in Fig. 2(a) and Fig. 2(b), respectively. Fig. 2(a) and Fig. 2(b) indicate that ADMMiRNN converges faster than the other optimizers. ADMMiRNN also yields a smoother loss curve, while the loss curves of the other optimizers fluctuate considerably.
Figure 4: The comparison of stability among ADMMiRNN, SGD, Adam and RMSProp. For each optimization method, we repeated experiments 10 times to obtain the mean and variance of the training loss and test loss against iterations on MNIST.
This means that ADMMiRNN trains models in a relatively stable process. Besides, ADMMiRNN achieves a much better training loss and test loss. These results not only show that ADMMiRNN converges in RNN tasks but also suggest that ADMMiRNN is a more powerful tool than traditional gradient-based optimizers in deep learning.
4.2.2 Results on Text Data
Besides experiments on MNIST, we also explore how ADMMiRNN performs in text classification tasks. One critical shortcoming of current RNN models is that they are sensitive to the length of the input sequence: the longer the input sequence is, the worse the training results are. To investigate the sensitivity of ADMMiRNN to the input length, we measure the performance of ADMMiRNN and SGD on the text data with different input sequence lengths. The results are displayed in Fig. 3. Here, we adopt the average loss over the input sequence as our target. Fig. 3 shows that ADMMiRNN consistently produces good results and is nearly insensitive to the input length, outperforming SGD regardless of the length of the input sequence.
|training loss||test loss|
As mentioned above, the initialization of weights and biases is critical in RNN models. In this section, we compare ADMMiRNN with several other optimizers and explore its stability for RNNs. In brief, we compare ADMMiRNN with SGD, Adam, and RMSProp and repeat each scheme ten times independently. The experimental results are displayed in Fig. 4. The blocks in Fig. 4(a) and Fig. 4(b) represent the standard deviation of the samples drawn from the training and testing processes. The smaller the blocks are, the more stable the method is. From Fig. 4(a) and Fig. 4(b), we observe that SGD fluctuates only slightly at the beginning. As training progresses, however, the fluctuation becomes increasingly sharp, which means that SGD tends to become unstable. As for Adam and RMSProp, their variance decreases over time but remains large compared with that of ADMMiRNN. Depending on the initialization of weights and biases, these optimizers may produce results that differ widely from one another. In contrast, ADMMiRNN has a relatively small variance from beginning to end compared with SGD, Adam and RMSProp, so small that it is hard to see in Fig. 4(a) and Fig. 4(b). This indicates that ADMMiRNN is largely insensitive to the initialization of weights and biases and settles the sensitivity of RNN models to initialization.
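The per-iteration mean and spread over repeated runs can be computed as in the following NumPy sketch. The function name and the input layout (one row per run, one column per iteration) are assumptions, and any numbers passed to it in an example are synthetic rather than results from these experiments.

```python
import numpy as np


def summarize_runs(loss_curves):
    """Per-iteration mean and standard deviation over repeated runs.

    loss_curves: array-like of shape (num_runs, num_iterations).
    """
    curves = np.asarray(loss_curves, dtype=float)
    return curves.mean(axis=0), curves.std(axis=0)
```

For two synthetic runs `[[1.0, 0.5], [3.0, 0.5]]`, the mean is `[2.0, 0.5]` and the standard deviation is `[1.0, 0.0]`: the runs disagree at the first iteration and coincide at the second. A narrow band around the mean curve at every iteration is what a method that is stable with respect to initialization would produce.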
|training iterations||test iterations|
No matter how the initialization changes, ADMMiRNN always gives a stable training process and promising results. The results demonstrate that ADMMiRNN is a more stable training algorithm for RNN models than stochastic gradient algorithms.
4.4 Different Hyperparameters
4.4.1 varying s
In vanilla ADMM, the value of the penalty term is critical, and a poor choice may harm convergence. In this subsection, we try different hyperparameters in ADMMiRNN and evaluate how they influence its training process. The results are summarized in Table 2 and Table 3. Table 2 implies that determines the best result in ADMMiRNN. More precisely, a larger slows down the convergence of ADMMiRNN. However, if is too large, it may produce non-convergent results. In Table 3, we present the number of iterations needed for the accuracy to reach 100.0 in the training and test processes. When is 100, we need 11 iterations for the accuracy to reach 100.0, but we only need 2 iterations when or . Furthermore, it turns out that and matter less in ADMMiRNN, while plays a much more crucial role in whether and how fast the method converges.
In this subsection, we investigate the influence of in Eq. 8. In our experiments on text data, we fix all the hyperparameters other than and set it to , , , , respectively. We display the curves corresponding to different values of in Fig. 5.
Fig. 5 suggests that a larger produces a relatively worse convergence result in ADMMiRNN. A small not only leads to a small loss but also makes the training process converge quickly. However, once is small enough, its influence on the convergence rate and the converged result is not obvious.
In this subsection, we conduct an additional experiment on MNIST to explore how the weight coefficient influences the performance and show the results in Fig. 6. Fig. 6(a) and Fig. 6(b) illustrate that a large delays convergence and makes the curves fluctuate more than a small one does. However, the influence on the loss shown in Fig. 6(c) is less obvious than that on the training accuracy and validation accuracy.
In this paper, we proposed a new framework for training RNN models, namely ADMMiRNN. Since it is difficult to train RNNs with ADMM directly, we built ADMMiRNN on the unfolded form of RNNs. The convergence analysis of ADMMiRNN is presented, and ADMMiRNN achieves a convergence rate of . We further conducted several experiments on real-world datasets based on our theoretical analysis. Experimental comparisons between ADMMiRNN and several popular optimizers show that ADMMiRNN converges faster than these gradient-based optimizers and that its training process is much more stable. To the best of our knowledge, we are the first to apply ADMM to RNN tasks with a theoretical analysis, and ADMMiRNN is the first method to alleviate the vanishing and exploding gradients problem and the sensitivity of RNN models to initialization at the same time. In conclusion, ADMMiRNN is a promising tool for training RNN models. In future work, we will explore how to choose the best penalty parameter for ADMMiRNN during training rather than adopting a fixed value. Moreover, we will try to run ADMMiRNN in parallel, as ADMM is well suited to parallel optimization.
This work is sponsored in part by the National Key R&D Program of China under Grant No. 2018YFB2101100 and the National Natural Science Foundation of China under Grant Nos. 61806216, 61702533, 61932001, 61872376 and 61701506.
-  (1994) Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks 5 (2), pp. 157–166. Cited by: §2, footnote 1.
-  (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning 3 (1), pp. 1–122. Cited by: §2.
-  (2011) Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12 (Jul), pp. 2121–2159. Cited by: §4.2.1.
-  (1990) Finding structure in time. Cognitive science 14 (2), pp. 179–211. Cited by: §1.
-  (1983) Augmented lagrangian methods: applications to the solution of boundary-value problems, chapter applications of the method of multipliers to variational inequalities. North-Holland, Amsterdam 3, pp. 4. Cited by: §2.
-  (1976) A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & mathematics with applications 2 (1), pp. 17–40. Cited by: §2.
-  Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming 155 (1-2), pp. 267–305. Cited by: §3.3.
-  (1989) Augmented lagrangian and operator-splitting methods in nonlinear mechanics. Vol. 9, SIAM. Cited by: §2.
-  (2014) Robust low-rank tensor recovery: models and algorithms. SIAM Journal on Matrix Analysis and Applications 35 (1), pp. 225–253. Cited by: §2.
-  (2016) Deep learning. MIT press. Cited by: Figure 1.
-  (2007) Multi-dimensional recurrent neural networks. In International conference on artificial neural networks, pp. 549–558. Cited by: §1.
-  (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §2.
-  (2014) Adam: a method for stochastic optimization. Computer Science. Cited by: §1.
-  (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.
-  (2015) Recurrent convolutional neural networks for text classification. In AAAI, Vol. 333, pp. 2267–2273. Cited by: §1.
-  (2015) Deep learning. nature 521 (7553), pp. 436–444. Cited by: §1.
-  (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §4.1.
-  (2010) Recurrent neural network based language model. In Proceedings of ISCA, Cited by: §1.
-  (2010) Iteration-complexity of block-decomposition algorithms and the alternating minimization augmented lagrangian method. Manuscript, School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA, pp. 30332–0205. Cited by: §2.
-  (2010) Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pp. 807–814. Cited by: §3.1.
-  (2016) Joint event extraction via recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 300–309. Cited by: §1.
-  (2013) On the difficulty of training recurrent neural networks. In International conference on machine learning, pp. 1310–1318. Cited by: §2.
-  (1999) On the momentum term in gradient descent learning algorithms. Neural networks 12 (1), pp. 145–151. Cited by: §4.2.1.
-  (1951) A stochastic approximation method. The annals of mathematical statistics, pp. 400–407. Cited by: §1.
-  (1976) Monotone operators and the proximal point algorithm. SIAM journal on control and optimization 14 (5), pp. 877–898. Cited by: §3.2.1.
-  (2018) Iteratively linearized reweighted alternating direction method of multipliers for a class of nonconvex problems. IEEE Transactions on Signal Processing 66 (20), pp. 5380–5391. Cited by: §2.
-  (2013) On the importance of initialization and momentum in deep learning. In International conference on machine learning, pp. 1139–1147. Cited by: §1, §2.
-  (2016) Training neural networks without gradients: a scalable admm approach. In International conference on machine learning, pp. 2722–2731. Cited by: §1, §2.
-  (2012) Lecture 6.5-rmsprop, coursera: neural networks for machine learning. University of Toronto, Technical Report. Cited by: §1, §4.2.1.
-  (2019) Admm for efficient deep learning with global convergence. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 111–119. Cited by: §2, §3.1.
-  (2019) Multi-convex inequality-constrained alternating direction method of multipliers. arXiv preprint arXiv:1902.10882. Cited by: §3.3, §3.3.
-  (2014) Fast stochastic alternating direction method of multipliers. In International Conference on Machine Learning, pp. 46–54. Cited by: §3.3.
-  (2019) A sufficient condition for convergences of adam and rmsprop. In , pp. 11127–11135. Cited by: §3.3.
Appendix A
Similar to the update of in Section 3.2.1, we also define and use it with the linearized proximal point method; the update of is then transformed into
As far as is concerned, it is updated by
As before, the update rule for is
The update of the parameter is quite simple:
Finally, we update . Note that each is also updated separately. If ,