# An Efficient EKF Based Algorithm For LSTM-Based Online Learning

We investigate online nonlinear regression with long short-term memory (LSTM) based networks, which we refer to as LSTM-based online learning. For LSTM-based online learning, we introduce a highly efficient extended Kalman filter (EKF) based training algorithm with a theoretical convergence guarantee. Through simulations, we illustrate significant performance improvements achieved by our algorithm with respect to conventional LSTM training methods. In particular, we show that our algorithm provides error performance very similar to that of the EKF learning algorithm in 25-40 times shorter training time, depending on the parameter size of the network.

## 1 Introduction

Neural networks provide enhanced performance for a wide range of engineering applications [1, 2] thanks to their strong nonlinear modeling capabilities. Among neural networks, recurrent neural networks (RNNs) in particular are used to model time series and temporal data because their inherent memory stores past information [3]. However, since simple RNNs lack control structures, the norm of the gradient may grow or decay rapidly during training [4]. Therefore, simple RNNs are insufficient to capture long-term dependencies [4]. To circumvent this issue, a novel RNN architecture with control structures, i.e., the LSTM network, was introduced [5]. In this study, we consider online nonlinear regression with LSTM-based networks due to their superior performance in capturing long-term dependencies.

For LSTM-based networks, there exists a wide range of online training methods to learn the network parameters [6, 3, 7]. Among them, the first-order gradient-based methods [3] are widely preferred due to their efficiency. However, the first-order techniques, in general, provide poorer performance compared to the second-order techniques [6]. As a second-order technique, the extended Kalman filter (EKF) learning algorithm has often been favored for its accuracy and speed of convergence [7, 6]. However, the EKF learning algorithm has a quadratic computational requirement in the parameter size, which is usually prohibitive for practical applications due to the large number of parameters in LSTMs. To reduce the computational requirement of EKF, the independent EKF (IEKF) algorithm has been introduced in [8].¹ The main motivation of IEKF is the observation that during EKF-based neural network training, the correlation between the weights belonging to different neural nodes is usually much lower than the correlation between the weights in the same neural node [6, 10]. Based on this observation, in IEKF, each neural node is treated as an independent subsystem, and separate EKF learning algorithms are used to learn the weights in different nodes. By this method, the computational requirement of the learning procedure is reduced by a factor equal to the number of neural nodes in the network while avoiding considerable performance reduction [8]. We note that since practical LSTM models usually consist of many neural nodes, the computational saving with IEKF leads to a considerable run-time reduction in LSTM training.

¹ We note that IEKF was first introduced in [8] under the name Node Level Extended Kalman Filter Algorithm (NEKA). However, in subsequent studies [9, 7], NEKA was renamed IEKF to distinguish the algorithm from the Node Decoupled Extended Kalman Filter algorithm [7]. In this study, we use the name IEKF to avoid confusion.

Although IEKF provides performance comparable to the second-order methods in a considerably shorter run-time, it is more vulnerable to divergence problems (compared to the EKF and stochastic gradient descent (SGD) algorithms) due to its treatment of each node as an independent subsystem [9]. In this study, to provide an efficient and robust LSTM-based online learning procedure, we introduce an IEKF-based training algorithm with a theoretical convergence guarantee. To the best of our knowledge, our paper is the first study that provides a theoretical convergence guarantee for an IEKF-based algorithm in the neural network literature. We note that by using IEKF, we introduce a highly efficient counterpart of the state-of-the-art online learning algorithms [11, 12, 13], especially for LSTM-based online learning, where IEKF provides considerable improvements in training time without significant performance reduction [14].

## 2 Model and Problem Description

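We² define the online regression problem as follows: We sequentially receive the desired data $\{\mathbf{d}_t\}_{t \geq 1}$, $\mathbf{d}_t \in [-1, 1]^{n_d}$, and input vectors $\{\mathbf{x}_t\}_{t \geq 1}$, $\mathbf{x}_t \in \mathbb{R}^{n_x}$, such that our goal is to estimate $\mathbf{d}_t$ based on our current and past observations $\{\cdots, \mathbf{x}_{t-1}, \mathbf{x}_t\}$.³ Given our estimate $\hat{\mathbf{d}}_t$, which can only be a function of $\{\cdots, \mathbf{x}_{t-1}, \mathbf{x}_t\}$ and $\{\cdots, \mathbf{d}_{t-2}, \mathbf{d}_{t-1}\}$, we suffer the loss $\ell(\mathbf{d}_t, \hat{\mathbf{d}}_t)$. The aim is to optimize the network with respect to the loss function $\ell(\cdot, \cdot)$. In this study, we particularly work with the squared error, i.e., $\ell(\mathbf{d}_t, \hat{\mathbf{d}}_t) = \|\mathbf{d}_t - \hat{\mathbf{d}}_t\|^2$. However, our work can be extended to a wide range of cost functions (including the cross-entropy) using the analysis in [15, Section 3].

² All vectors are column vectors and are denoted by boldface lowercase letters. Matrices are represented by boldface capital letters. $\mathbf{I}$ is the identity matrix, whose dimensions are understood from the context. $\|\cdot\|$ and $\mathrm{Tr}(\cdot)$ denote the Euclidean norm and trace operators. Given two vectors $\mathbf{x}$ and $\mathbf{y}$, $[\mathbf{x}; \mathbf{y}]$ is their vertical concatenation. We use bracket notation $[n]$ to denote the set of the first $n$ positive integers, i.e., $[n] = \{1, \cdots, n\}$.

³ We assume $\mathbf{d}_t \in [-1, 1]^{n_d}$ for notational simplicity; however, our derivations hold for any bounded desired data sequence after shifting and scaling in magnitude.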

In this paper, we study online regression with LSTM-based networks. As the LSTM cell, we use the most widely used LSTM model, where the activation functions are set to the hyperbolic tangent function and the peep-hole connections are eliminated. As the network model, we use a single hidden layer based on the LSTM structure, and an output layer with the hyperbolic tangent function. Hence, the network equations are:

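$$
\begin{aligned}
\mathbf{z}_t &= \tanh\left(\mathbf{W}^{(z)} [\mathbf{x}_t; \mathbf{y}_{t-1}]\right) && (1) \\
\mathbf{i}_t &= \sigma\left(\mathbf{W}^{(i)} [\mathbf{x}_t; \mathbf{y}_{t-1}]\right) && (2) \\
\mathbf{f}_t &= \sigma\left(\mathbf{W}^{(f)} [\mathbf{x}_t; \mathbf{y}_{t-1}]\right) && (3) \\
\mathbf{c}_t &= \mathbf{i}_t \odot \mathbf{z}_t + \mathbf{f}_t \odot \mathbf{c}_{t-1} && (4) \\
\mathbf{o}_t &= \sigma\left(\mathbf{W}^{(o)} [\mathbf{x}_t; \mathbf{y}_{t-1}]\right) && (5) \\
\mathbf{y}_t &= \mathbf{o}_t \odot \tanh(\mathbf{c}_t) && (6) \\
\hat{\mathbf{d}}_t &= \tanh\left(\mathbf{W}^{(d)} \mathbf{y}_t\right). && (7)
\end{aligned}
$$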

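Here, $\odot$ denotes the element-wise multiplication, $\mathbf{c}_t \in \mathbb{R}^{n_s}$ is the state vector, $\mathbf{x}_t \in \mathbb{R}^{n_x}$ is the input vector, $\mathbf{y}_t \in \mathbb{R}^{n_s}$ is the output vector, and $\hat{\mathbf{d}}_t \in [-1, 1]^{n_d}$ is our final estimation. Furthermore, $\mathbf{i}_t$, $\mathbf{f}_t$, and $\mathbf{o}_t$ are the input, forget, and output gates, respectively. The sigmoid function $\sigma(\cdot)$ and the hyperbolic tangent function $\tanh(\cdot)$ apply pointwise to the vector elements. The weight matrices are $\mathbf{W}^{(z)}, \mathbf{W}^{(i)}, \mathbf{W}^{(f)}, \mathbf{W}^{(o)} \in \mathbb{R}^{n_s \times (n_x + n_s)}$ and $\mathbf{W}^{(d)} \in \mathbb{R}^{n_d \times n_s}$. We note that although we do not explicitly write the bias terms, they can be included in (1)-(7) by augmenting the input vector with a constant dimension.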
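As an illustration of (1)-(7), the following is a minimal NumPy sketch of one forward step of the network. The function names (`lstm_step`, `init_params`), the dictionary keys, the initialization scale, and the example sizes are assumptions made for this sketch and do not come from the paper.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-a))

def init_params(n_x, n_s, n_d, scale=0.1, seed=0):
    """Random weights with the shapes implied by (1)-(7) (hypothetical initializer)."""
    rng = np.random.default_rng(seed)
    params = {k: scale * rng.standard_normal((n_s, n_x + n_s))
              for k in ("Wz", "Wi", "Wf", "Wo")}
    params["Wd"] = scale * rng.standard_normal((n_d, n_s))
    return params

def lstm_step(params, x_t, y_prev, c_prev):
    """One forward step of equations (1)-(7), without bias terms."""
    u = np.concatenate([x_t, y_prev])      # vertical concatenation [x_t; y_{t-1}]
    z = np.tanh(params["Wz"] @ u)          # (1)
    i = sigmoid(params["Wi"] @ u)          # (2) input gate
    f = sigmoid(params["Wf"] @ u)          # (3) forget gate
    c = i * z + f * c_prev                 # (4) state update
    o = sigmoid(params["Wo"] @ u)          # (5) output gate
    y = o * np.tanh(c)                     # (6)
    d_hat = np.tanh(params["Wd"] @ y)      # (7) final estimate in (-1, 1)
    return d_hat, y, c

# Usage with hypothetical sizes n_x = 3, n_s = 4, n_d = 2.
params = init_params(n_x=3, n_s=4, n_d=2)
y, c = np.zeros(4), np.zeros(4)
for x_t in np.ones((5, 3)):
    d_hat, y, c = lstm_step(params, x_t, y, c)
```

Because the output layer is a hyperbolic tangent, every entry of `d_hat` lies strictly inside (-1, 1), which matches the bounded range assumed for the desired data.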

## 3 Independent Extended Kalman Filter

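In this section, we derive the IEKF update rules for LSTM-based online learning. To convert the LSTM training into a state estimation problem, we model the desired signal as an autoregressive process that is realized by the LSTM network in (1)-(7), which we describe with the following dynamical system:⁴

$$
\begin{aligned}
\boldsymbol{\theta}_t &= \boldsymbol{\theta}_{t-1} && (8) \\
\mathbf{d}_t &= h_t(\{\mathbf{x}_t\}; \boldsymbol{\theta}_t). && (9)
\end{aligned}
$$

Here, we represent the optimal LSTM weights that realize the incoming data stream with a vector $\boldsymbol{\theta}_t$, which is modeled as a stationary process. As detailed in Fig. 1, we use $h_t(\cdot)$ to represent the unfolded version of our network model in (1)-(7) over all the time steps up to the current time step $t$, where all forward passes are parametrized by $\boldsymbol{\theta}_t$. Here, the dependence of $h_t$ on $t$ is due to the increased length of the recursion at each time step.

⁴ For notational convenience, we group all the LSTM parameters, i.e., $\mathbf{W}^{(z)}$, $\mathbf{W}^{(i)}$, $\mathbf{W}^{(f)}$, $\mathbf{W}^{(o)}$, and $\mathbf{W}^{(d)}$, into a vector $\boldsymbol{\theta} \in \mathbb{R}^{n_\theta}$, where $n_\theta = 4 n_s (n_x + n_s) + n_d n_s$. We also use $\{\mathbf{x}_t\}$ to denote the input sequence up to time $t$, i.e., $\{\mathbf{x}_t\} = \{\cdots, \mathbf{x}_{t-1}, \mathbf{x}_t\}$.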

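In the IEKF framework, we apply different (and independent) EKF learning algorithms to the state-space model in (8)-(9) to learn the weights in different nodes. To describe the learning rule, let us denote the LSTM nodes with the first $(4n_s + n_d)$ positive integers, i.e., $[4n_s + n_d]$, and use $i$ to index the nodes, i.e., $i \in [4n_s + n_d]$. Let us also use $\hat{\boldsymbol{\theta}}_{i,t}$ to denote our estimation for the optimal weights in node $i$ at time step $t$. Then, we perform the weight updates in IEKF with the following:

$$
\begin{aligned}
&\text{for } i = 1, \cdots, (4n_s + n_d): \\
&\quad \hat{\boldsymbol{\theta}}_{i,t+1} = \hat{\boldsymbol{\theta}}_{i,t} + \mathbf{K}_{i,t} (\mathbf{d}_t - \hat{\mathbf{d}}_t) && (10) \\
&\quad \mathbf{P}_{i,t+1} = (\mathbf{I} - \mathbf{K}_{i,t} \mathbf{H}_{i,t}) \mathbf{P}_{i,t} + \mathbf{Q}_t && (11) \\
&\quad \mathbf{H}_{i,t} = \left. \frac{\partial h_t(\{\mathbf{x}_t\}; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}_i} \right|_{\boldsymbol{\theta}_i = \hat{\boldsymbol{\theta}}_{i,t}} && (12) \\
&\quad \mathbf{K}_{i,t} = \mathbf{P}_{i,t} \mathbf{H}_{i,t}^T \left( \mathbf{H}_{i,t} \mathbf{P}_{i,t} \mathbf{H}_{i,t}^T + \mathbf{R}_{i,t} \right)^{-1}. && (13)
\end{aligned}
$$

Here, $\hat{\mathbf{d}}_t$ is our prediction for $\mathbf{d}_t$, which is calculated by performing the forward LSTM propagation in (1)-(7) with the estimated weights, i.e., $\hat{\boldsymbol{\theta}}_{i,t}$ for all $i \in [4n_s + n_d]$. $\mathbf{P}_{i,t}$ is the state covariance matrix, $\mathbf{H}_{i,t}$ is the Jacobian matrix, and $\mathbf{K}_{i,t}$ is the Kalman gain matrix corresponding to the LSTM weights in node $i$. The noise covariance matrices $\mathbf{Q}_t$ and $\mathbf{R}_{i,t}$ are artificially introduced to the algorithm to enhance the training performance [6]. In order to efficiently implement the algorithm, we use diagonal matrices for the artificial noise terms, i.e., $\mathbf{Q}_t = q_t \mathbf{I}$ and $\mathbf{R}_{i,t} = r_{i,t} \mathbf{I}$, where $q_t, r_{i,t} > 0$ for all $t$.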
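The per-node update in (10)-(13) can be sketched in a few lines of NumPy. This is a minimal sketch under the diagonal-noise assumption $\mathbf{Q}_t = q_t \mathbf{I}$, $\mathbf{R}_{i,t} = r_{i,t} \mathbf{I}$; the function name `iekf_node_update` and the toy linear measurement model in the usage part are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def iekf_node_update(theta_i, P_i, H_i, err, q, r):
    """Equations (10), (11), (13) for one node.

    theta_i: (n_i,) weight estimate, P_i: (n_i, n_i) state covariance,
    H_i: (n_d, n_i) Jacobian from (12), err: (n_d,) error d_t - d_hat_t.
    """
    n_i, n_d = theta_i.shape[0], H_i.shape[0]
    S = H_i @ P_i @ H_i.T + r * np.eye(n_d)               # H P H^T + R with R = r I
    K = np.linalg.solve(S, H_i @ P_i).T                   # (13), using symmetry of S and P
    theta_new = theta_i + K @ err                         # (10)
    P_new = (np.eye(n_i) - K @ H_i) @ P_i + q * np.eye(n_i)  # (11) with Q = q I
    return theta_new, P_new

# Usage on a toy noise-free linear model d_t = H_t theta_true (hypothetical data).
rng = np.random.default_rng(1)
theta_true = np.array([1.0, -2.0, 0.5])
theta, P = np.zeros(3), np.eye(3)
for _ in range(200):
    H = rng.standard_normal((1, 3))
    err = H @ theta_true - H @ theta
    theta, P = iekf_node_update(theta, P, H, err, q=0.0, r=1e-2)
```

Solving the linear system with `np.linalg.solve` instead of forming the explicit inverse in (13) is a numerical design choice of this sketch; both give the same gain.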

We recall that the EKF algorithm has quadratic complexity in the number of network weights [6], i.e., $O(n_\theta^2)$, where $n_\theta$ is the total number of weights. On the other hand, the computational complexity of the IEKF learning algorithm is $O(n_\theta^2 / (4n_s + n_d))$ due to the update rules in (11) and (13). Since the dimension of the state vectors, i.e., $n_s$, is typically large in practical LSTM models, the reduction in the computational requirement with IEKF leads to considerable run-time savings in LSTM-based online learning with respect to the EKF algorithm. However, as noted earlier, IEKF is vulnerable to divergence problems since it treats each node as an independent system [9]. Therefore, to provide an efficient and robust learning algorithm, we use the IEKF framework and introduce an IEKF-based LSTM training algorithm with a convergence guarantee in the following section.
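To make the saving concrete, the sketch below counts the covariance entries maintained by a single EKF over all weights versus one EKF per node. It assumes that each row of a weight matrix in (1)-(7) forms one node, which gives $4n_s + n_d$ nodes; the sizes `n_x=10`, `n_s=16`, `n_d=1` are hypothetical, and the entry count is only a rough proxy for run-time.

```python
def node_sizes(n_x, n_s, n_d):
    """Number of weights per node, one node per weight-matrix row (assumption)."""
    gate_rows = [n_x + n_s] * (4 * n_s)   # rows of W^(z), W^(i), W^(f), W^(o)
    output_rows = [n_s] * n_d             # rows of W^(d)
    return gate_rows + output_rows

def covariance_entries(n_x, n_s, n_d):
    """Return (n_theta, entries of one full covariance, entries of all per-node covariances)."""
    sizes = node_sizes(n_x, n_s, n_d)
    n_theta = sum(sizes)
    full = n_theta ** 2                    # single EKF: one n_theta x n_theta matrix
    decoupled = sum(n * n for n in sizes)  # IEKF: one small matrix per node
    return n_theta, full, decoupled

n_theta, full, decoupled = covariance_entries(n_x=10, n_s=16, n_d=1)
print(n_theta, full, decoupled, full / decoupled)
```

For these sizes the ratio `full / decoupled` is close to the number of nodes (65), in line with the statement that the requirement is reduced by roughly the number of neural nodes.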

## 4 Algorithm Development

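In this section, we introduce an IEKF-based training algorithm with a theoretical convergence guarantee. For the analysis in the following, we write the error dynamics of the independent EKF structures. To this end, we first write the Taylor series expansion of $h_t(\{\mathbf{x}_t\}; \boldsymbol{\theta}_t)$ around $\hat{\boldsymbol{\theta}}_t$:

$$
h_t(\{\mathbf{x}_t\}; \boldsymbol{\theta}_t) = h_t(\{\mathbf{x}_t\}; \hat{\boldsymbol{\theta}}_t) + \mathbf{H}_t (\boldsymbol{\theta}_t - \hat{\boldsymbol{\theta}}_t) + \boldsymbol{\chi}_t, \qquad (14)
$$

where $\mathbf{H}_t$ is the Jacobian matrix of $h_t(\{\mathbf{x}_t\}; \boldsymbol{\theta})$ evaluated at $\hat{\boldsymbol{\theta}}_t$, and $\boldsymbol{\chi}_t$ is the non-linear term in the expansion. Note that $h_t(\{\mathbf{x}_t\}; \boldsymbol{\theta}_t) = \mathbf{d}_t$ by (9) and $h_t(\{\mathbf{x}_t\}; \hat{\boldsymbol{\theta}}_t) = \hat{\mathbf{d}}_t$. For notational simplicity, we introduce two shorthand notations: $\mathbf{e}_t = \mathbf{d}_t - \hat{\mathbf{d}}_t$, and $\boldsymbol{\zeta}_{i,t} = \boldsymbol{\theta}_{i,t} - \hat{\boldsymbol{\theta}}_{i,t}$. Then, the error dynamics of the EKF learning algorithm applied to node $i$ can be written as:

$$
\begin{aligned}
\mathbf{e}_t &= \sum_{i=1}^{4n_s + n_d} \mathbf{H}_{i,t} \boldsymbol{\zeta}_{i,t} + \boldsymbol{\chi}_t && (15) \\
&= \mathbf{H}_{i,t} \boldsymbol{\zeta}_{i,t} + \sum_{\substack{j=1 \\ j \neq i}}^{4n_s + n_d} \mathbf{H}_{j,t} \boldsymbol{\zeta}_{j,t} + \boldsymbol{\chi}_t && (16) \\
&= \mathbf{H}_{i,t} \boldsymbol{\zeta}_{i,t} + \boldsymbol{\chi}_{i,t}, && (17)
\end{aligned}
$$

where $\boldsymbol{\chi}_{i,t} = \sum_{j \neq i} \mathbf{H}_{j,t} \boldsymbol{\zeta}_{j,t} + \boldsymbol{\chi}_t$, i.e., we consider the effect of partitioning the weights as additional non-linearity for node $i$.

Since our network model only consists of smooth functions, $h_t$ is also smooth, which means the norm of $\boldsymbol{\chi}_{i,t}$ is bounded by a scalar value for all the nodes throughout the training, i.e., $\|\boldsymbol{\chi}_{i,t}\| \leq \bar{\chi}$. In the following subsection, we use $\bar{\chi}$ to guarantee convergence of our algorithm.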

### 4.1 Main Algorithm

In this subsection, we present the main result of this paper, i.e., Algorithm 1. In Algorithm 1, we take as the input. We initialize the state covariance matrix of each independent EKF as , where . In each time step, we first generate a prediction , then receive the desired data , and suffer the loss . We perform the parameter updates only if the loss is bigger than , i.e., . If so, we calculate the Jacobian matrix , measurement noise level , and the Kalman gain matrix for each in lines 7-9 of Algorithm 1. We update the weights and state covariance matrix of the weights belonging to node in lines 10 and 11.
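The control flow described above (predict, receive the desired data, suffer the loss, and update every node only when the loss exceeds a threshold) can be sketched as follows. This is only a toy illustration, not Algorithm 1 itself: it replaces the LSTM with a single tanh layer whose rows act as nodes, uses a hypothetical threshold `eps`, and keeps the measurement noise level `r` fixed instead of computing it per node as Algorithm 1 does.

```python
import numpy as np

def run_deadzone_iekf(xs, ds, eps, q=1e-4, r=1.0, p0=1.0):
    """Per-node EKF updates, skipped whenever the squared error is at most eps."""
    n_d, n_x = ds.shape[1], xs.shape[1]
    W = np.zeros((n_d, n_x))                        # row i holds the weights of node i
    P = [p0 * np.eye(n_x) for _ in range(n_d)]      # one covariance matrix per node
    losses, n_updates = [], 0
    for x, d in zip(xs, ds):
        d_hat = np.tanh(W @ x)                      # generate the prediction first
        err = d - d_hat                             # then receive the desired data
        loss = float(err @ err)                     # squared-error loss
        losses.append(loss)
        if loss <= eps:                             # small loss: no parameter update
            continue
        n_updates += 1
        for i in range(n_d):
            H = np.zeros((n_d, n_x))
            H[i] = (1.0 - d_hat[i] ** 2) * x        # Jacobian of the output w.r.t. node i
            S = H @ P[i] @ H.T + r * np.eye(n_d)
            K = np.linalg.solve(S, H @ P[i]).T      # Kalman gain as in (13)
            W[i] = W[i] + K @ err                   # weight update as in (10)
            P[i] = (np.eye(n_x) - K @ H) @ P[i] + q * np.eye(n_x)  # as in (11)
    return W, np.array(losses), n_updates

# Usage on synthetic data generated by a hypothetical tanh layer.
rng = np.random.default_rng(0)
W_true = 0.5 * rng.standard_normal((2, 3))
xs = rng.standard_normal((500, 3))
ds = np.tanh(xs @ W_true.T)
W, losses, n_updates = run_deadzone_iekf(xs, ds, eps=1e-3)
print(losses[:50].mean(), losses[-50:].mean(), n_updates)
```

Skipping updates for small losses means the covariance matrices and weights stay untouched on those steps, which is the property the Lyapunov argument in the Appendix relies on.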

In the following theorem, we state the theoretical guarantees of Algorithm 1.

###### Theorem 1.

If stays bounded during training, Algorithm 1 guarantees the following statements:

1. The LSTM weights stay bounded during training.

2. The loss sequence converges to the interval in the deterministic sense.

###### Proof.

See the Appendix. ∎

###### Remark 1.

Due to the Kalman gain matrix formulation (line 9 in Algorithm 1), is always smaller than or equal to for each node , i.e., for all . Since , and the artificial process noise level is a user-dependent parameter, the condition in Theorem 1 can be satisfied by the user by selecting sufficiently small .

## 5 Simulations

In this section, we illustrate the performance of our algorithm on two real-life data sets: the elevators data set [16] and the Alcoa stock price data set [17]. To demonstrate the performance improvements of our algorithm, we compare it with two widely used LSTM training methods, i.e., the EKF learning algorithm [6] and the stochastic gradient descent algorithm [3]. In the following, we use Alg1 to denote Algorithm 1, EKF for the EKF learning algorithm, and SGD for the stochastic gradient descent algorithm. We repeat each experiment multiple times and report the mean performance.

We first consider the elevators data set, which is obtained from the procedure that is related to controlling an F16 aircraft [16]. Here, our aim is to predict the scalar variable that expresses the actions of the aircraft, i.e., . For this data set, we use -dimensional input vectors of the dataset with an additional bias dimension, i.e., , where . To get small loss values with relatively lower run-time, we use -dimensional state vectors in LSTM, i.e., . For Alg1, we choose as , the artificial process noise level as for all , and the initial state covariance matrices as for all . In EKF, we use the measurement noise level for all , linearly decreasing process noise level sequence annealed from to , and set the initial state covariance matrix as . For SGD, we choose the learning rate as . In Table 1, we provide the mean squared error and run-time of the compared algorithms for this setup. Here, we observe that the mean squared errors of Alg1 and EKF are very close to each other and significantly smaller than the mean squared error of SGD. However, since we use the IEKF framework to develop our algorithm, Alg1 achieves the resulting error performance with times smaller run-time compared to the EKF learning algorithm.

In our second simulation, we use the Alcoa stock price dataset [17], which contains the daily stock price values of Alcoa Inc. between the years -. Our goal is to predict the opening, closing, highest, lowest and adjacent lowest values of the next day’s stock price by using the observed prices, i.e., . As the input vector, we use the opening, closing, highest, lowest and adjacent lowest stock price values of the current day with an additional bias dimension, where , and . We set the dimension of the state vectors , i.e., . For Alg1, we choose as , and set the initial state covariance matrices as for all . In EKF, we chose the measurement noise level as for all , and set the initial state covariance matrix as . In both Alg1 and EKF, we choose the process noise level as linearly decreasing from to . For SGD, we choose the learning rate as . In Table 1, we provide the mean squared error and run-time of the compared algorithms for this experiment. Similar to the previous experiments, here, we observe that the resulting errors of Alg1 and EKF are very close to each other and lower than the error of SGD. We also observe that Alg1 provides this performance in times shorter training time than EKF, and in a comparable run-time with SGD. We add that Alg1 did not have any divergence problem in both experiments (in a total of simulations), which is parallel with our theoretical results in the previous section.

## 6 Concluding Remarks

We studied online nonlinear regression with long short-term memory (LSTM) based networks. For this problem, we introduced a highly efficient and robust IEKF-based training algorithm with a theoretical convergence guarantee. In the simulations, we demonstrated that our algorithm achieves significant performance improvements with respect to the conventional LSTM training methods [3, 6]. In particular, we showed that our algorithm provides superior error performance compared to SGD and very similar error performance (with 25-40 times smaller run-time) compared to EKF.

## References

• [1] D. F. Specht, “A general regression neural network,” IEEE Transactions on Neural Networks, vol. 2, no. 6, pp. 568–576, Nov 1991.
• [2] J. Y. Goulermas, P. Liatsis, X. Zeng, and P. Cook, “Density-driven generalized regression neural networks (dd-grnn) for function approximation,” IEEE Transactions on Neural Networks, vol. 18, no. 6, pp. 1683–1696, Nov 2007.
• [3] Ronald J. Williams and David Zipser, chapter Gradient-based Learning Algorithms for Recurrent Networks and Their Computational Complexity, pp. 433–486. L. Erlbaum Associates Inc., Hillsdale, NJ, USA, 1995.
• [4] Yoshua Bengio, Patrice Y. Simard, and Paolo Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
• [5] Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, pp. 1735–1780, 1997.
• [6] Simon Haykin, “Kalman filtering and neural networks,” 2001.
• [7] G. V. Puskorius and L. A. Feldkamp, “Neurocontrol of nonlinear dynamical systems with kalman filter trained recurrent networks,” IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 279–297, March 1994.
• [8] Samir Shah, Francesco Palmieri, and Michael Datum, “Optimal filtering algorithms for fast learning in feedforward neural networks,” Neural Networks, vol. 5, no. 5, pp. 779 – 787, 1992.
• [9] G. V. Puskorius and L. A. Feldkamp, “Decoupled extended kalman filter training of feedforward layered networks,” in IJCNN-91-Seattle International Joint Conference on Neural Networks, July 1991, vol. i, pp. 771–777 vol.1.
• [10] S. Shah and F. Palmieri, “Meka-a fast, local algorithm for training feedforward neural networks,” in 1990 IJCNN International Joint Conference on Neural Networks, June 1990, pp. 41–46 vol.3.
• [11] Jose De jesus Rubio Avila and Wen Yu Liu, “Nonlinear system identification with recurrent neural networks and dead-zone kalman filter algorithm,” Neurocomputing, vol. 70, no. 13-15, pp. 2460–2466, 8 2007.
• [12] X. Wang and Y. Huang, “Convergence study in extended kalman filter-based training of recurrent neural networks,” IEEE Transactions on Neural Networks, vol. 22, no. 4, pp. 588–600, April 2011.
• [13] T. Ergen and S. S. Kozat, “Efficient online learning algorithms based on lstm neural networks,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 8, pp. 3772–3783, Aug 2018.
• [14] Juan Antonio Pérez-Ortiz, Felix Gers, Douglas Eck, and Jürgen Schmidhuber, “Kalman filters improve lstm network performance in problems unsolvable by traditional recurrent nets,” Neural Networks, vol. 16, pp. 241–250, Apr 2003.
• [15] G. V. Puskorius and L. A. Feldkamp, “Extensions and enhancements of decoupled extended kalman filter training,” in Proceedings of International Conference on Neural Networks (ICNN’97), June 1997, vol. 3, pp. 1879–1883 vol.3.
• [16] Jesús Alcalá-Fdez, Alberto Fernández, Julián Luengo, Joaquín Derrac, and Salvador García, “Keel data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework.,” Multiple-Valued Logic and Soft Computing, vol. 17, no. 2-3, pp. 255–287, 2011.
• [17] Alcoa Inc, “Common stock,” http://finance.yahoo.com/quote/AA?ltr=1, Accessed: 2019-07-21.

## 7 Appendix

Before the proof, we present several propositions that will be used to prove the theoretical guarantees of Algorithm 1.

###### Lemma 1.

Algorithm 1 guarantees the following statements:

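1. For each node $i$, the difference between the locally optimal weights and LSTM weights is governed by the following dynamical equation:

$$
\boldsymbol{\zeta}_{i,t+1} = (\mathbf{I} - \mathbf{K}_{i,t} \mathbf{H}_{i,t}) \boldsymbol{\zeta}_{i,t} - \mathbf{K}_{i,t} \boldsymbol{\chi}_{i,t}, \qquad (18)
$$

which can also be written as

$$
\boldsymbol{\zeta}_{i,t+1} - \boldsymbol{\zeta}_{i,t} = -\mathbf{K}_{i,t} \mathbf{H}_{i,t} \boldsymbol{\zeta}_{i,t} - \mathbf{K}_{i,t} \boldsymbol{\chi}_{i,t}. \qquad (19)
$$

2. For each node $i$, $\mathbf{P}_{i,t}^{-1}$ and $(\mathbf{P}_{i,t+1} - q_t \mathbf{I})^{-1}$ exist and they are always positive definite, such that

$$
\begin{aligned}
(\mathbf{P}_{i,t+1} - q_t \mathbf{I})^{-1} &= \mathbf{P}_{i,t}^{-1} (\mathbf{I} - \mathbf{K}_{i,t} \mathbf{H}_{i,t})^{-1} && (20) \\
&= \mathbf{P}_{i,t}^{-1} + r_{i,t}^{-1} \mathbf{H}_{i,t}^T \mathbf{H}_{i,t} \succ 0. && (21)
\end{aligned}
$$

3. As a result of the previous two statements,

$$
(\mathbf{P}_{i,t+1} - q_t \mathbf{I})^{-1} \boldsymbol{\zeta}_{i,t+1} = \mathbf{P}_{i,t}^{-1} \boldsymbol{\zeta}_{i,t} - (\mathbf{P}_{i,t+1} - q_t \mathbf{I})^{-1} \mathbf{K}_{i,t} \boldsymbol{\chi}_{i,t} \qquad (22)
$$

holds for each node $i$.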

###### Proof of Lemma 1.
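1. By multiplying both sides of (10) with $-1$, we write:

$$
-\hat{\boldsymbol{\theta}}_{i,t+1} = -\hat{\boldsymbol{\theta}}_{i,t} - \mathbf{K}_{i,t} (\mathbf{d}_t - \hat{\mathbf{d}}_t). \qquad (23)
$$

Then by using (8), we add $\boldsymbol{\theta}_{i,t+1}$ and $\boldsymbol{\theta}_{i,t}$ to the left and right sides of (23), respectively:

$$
\boldsymbol{\theta}_{i,t+1} - \hat{\boldsymbol{\theta}}_{i,t+1} = (\boldsymbol{\theta}_{i,t} - \hat{\boldsymbol{\theta}}_{i,t}) - \mathbf{K}_{i,t} (\mathbf{d}_t - \hat{\mathbf{d}}_t). \qquad (24)
$$

By using the Taylor series expansion in (17) and using the notation $\boldsymbol{\zeta}_{i,t} = \boldsymbol{\theta}_{i,t} - \hat{\boldsymbol{\theta}}_{i,t}$, we write

$$
\boldsymbol{\zeta}_{i,t+1} = \boldsymbol{\zeta}_{i,t} - \mathbf{K}_{i,t} \mathbf{H}_{i,t} \boldsymbol{\zeta}_{i,t} - \mathbf{K}_{i,t} \boldsymbol{\chi}_{i,t}. \qquad (25)
$$

The statements in (18) and (19) follow from (25).

2. By (11), for all $t$,

$$
\begin{aligned}
\mathbf{P}_{i,t+1} - q_t \mathbf{I} &= (\mathbf{I} - \mathbf{K}_{i,t} \mathbf{H}_{i,t}) \mathbf{P}_{i,t} && (26) \\
&= \left( \mathbf{I} - \mathbf{P}_{i,t} \mathbf{H}_{i,t}^T (\mathbf{H}_{i,t} \mathbf{P}_{i,t} \mathbf{H}_{i,t}^T + r_{i,t} \mathbf{I})^{-1} \mathbf{H}_{i,t} \right) \mathbf{P}_{i,t} && (27) \\
&= \mathbf{P}_{i,t} - \mathbf{P}_{i,t} \mathbf{H}_{i,t}^T (\mathbf{H}_{i,t} \mathbf{P}_{i,t} \mathbf{H}_{i,t}^T + r_{i,t} \mathbf{I})^{-1} \mathbf{H}_{i,t} \mathbf{P}_{i,t}, && (28)
\end{aligned}
$$

where we use the formulation of $\mathbf{K}_{i,t}$ in (13) to write (27) from (26). By applying the matrix inversion lemma⁵ to (28), we write

$$
(\mathbf{P}_{i,t+1} - q_t \mathbf{I})^{-1} = \mathbf{P}_{i,t}^{-1} + r_{i,t}^{-1} \mathbf{H}_{i,t}^T \mathbf{H}_{i,t}. \qquad (29)
$$

By noting that $\mathbf{P}_{i,1} \succ 0$, and using (29) as the induction hypothesis, we can show that $(\mathbf{P}_{i,t+1} - q_t \mathbf{I})^{-1}$ exists and $(\mathbf{P}_{i,t+1} - q_t \mathbf{I})^{-1} \succ 0$, for all $t$. Since $q_t > 0$, $\mathbf{P}_{i,t+1}^{-1}$ has the same properties, which leads to (21). Also, (20) can be reached by taking the inverse of both sides in (26).

⁵ Matrix inversion lemma: $(\mathbf{A} - \mathbf{B} \mathbf{D}^{-1} \mathbf{C})^{-1} = \mathbf{A}^{-1} + \mathbf{A}^{-1} \mathbf{B} (\mathbf{D} - \mathbf{C} \mathbf{A}^{-1} \mathbf{B})^{-1} \mathbf{C} \mathbf{A}^{-1}$.

3. By multiplying both sides of (18) with $(\mathbf{P}_{i,t+1} - q_t \mathbf{I})^{-1}$, and using the statement in (20), (22) can be obtained.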
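The identity in (29), obtained from (28) through the matrix inversion lemma, can be checked numerically. The sketch below uses randomly generated matrices with hypothetical sizes and a hypothetical noise level `r`; it is a sanity check of the algebra, not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
n_i, n_d, r = 5, 2, 0.7                    # hypothetical node size, output size, noise level

A = rng.standard_normal((n_i, n_i))
P = A @ A.T + np.eye(n_i)                  # a symmetric positive definite P_{i,t}
H = rng.standard_normal((n_d, n_i))        # a stand-in Jacobian H_{i,t}

M = H @ P @ H.T + r * np.eye(n_d)
rhs_28 = P - P @ H.T @ np.linalg.solve(M, H @ P)   # right-hand side of (28)
lhs_29 = np.linalg.inv(rhs_28)                     # left-hand side of (29)
rhs_29 = np.linalg.inv(P) + (H.T @ H) / r          # right-hand side of (29)

print(np.allclose(lhs_29, rhs_29))
print(np.linalg.eigvalsh(rhs_29).min() > 0)        # positive definiteness as in (21)
```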

Now, we can prove Theorem 1.

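###### Proof of Theorem 1.

In the following, we use the second method of Lyapunov to prove the statements in the theorem. Let us fix an arbitrary node $i$, and choose the Lyapunov function as

$$
V_{i,t} = \boldsymbol{\zeta}_{i,t}^T \mathbf{P}_{i,t}^{-1} \boldsymbol{\zeta}_{i,t}. \qquad (30)
$$

Let us say that $\Delta V_{i,t} = V_{i,t+1} - V_{i,t}$. Since we update $\hat{\boldsymbol{\theta}}_{i,t}$ and $\mathbf{P}_{i,t}$ only when the loss exceeds the update threshold of Algorithm 1, $\Delta V_{i,t} = 0$ at all other time steps. Therefore in the following, we only consider the time steps where we perform the weight update.

To begin with, we write the expanded form of $\Delta V_{i,t}$:

$$
\begin{aligned}
\Delta V_{i,t} &= \boldsymbol{\zeta}_{i,t+1}^T \mathbf{P}_{i,t+1}^{-1} \boldsymbol{\zeta}_{i,t+1} - \boldsymbol{\zeta}_{i,t}^T \mathbf{P}_{i,t}^{-1} \boldsymbol{\zeta}_{i,t} && (31) \\
&\leq \boldsymbol{\zeta}_{i,t+1}^T (\mathbf{P}_{i,t+1} - q_t \mathbf{I})^{-1} \boldsymbol{\zeta}_{i,t+1} - \boldsymbol{\zeta}_{i,t}^T \mathbf{P}_{i,t}^{-1} \boldsymbol{\zeta}_{i,t} && (32) \\
&= \boldsymbol{\zeta}_{i,t+1}^T \mathbf{P}_{i,t}^{-1} \boldsymbol{\zeta}_{i,t} - \boldsymbol{\zeta}_{i,t+1}^T (\mathbf{P}_{i,t+1} - q_t \mathbf{I})^{-1} \mathbf{K}_{i,t} \boldsymbol{\chi}_{i,t} - \boldsymbol{\zeta}_{i,t}^T \mathbf{P}_{i,t}^{-1} \boldsymbol{\zeta}_{i,t} && (33) \\
&= (\boldsymbol{\zeta}_{i,t+1} - \boldsymbol{\zeta}_{i,t})^T \mathbf{P}_{i,t}^{-1} \boldsymbol{\zeta}_{i,t} - \boldsymbol{\zeta}_{i,t+1}^T (\mathbf{P}_{i,t+1} - q_t \mathbf{I})^{-1} \mathbf{K}_{i,t} \boldsymbol{\chi}_{i,t} && (34) \\
&= (-\mathbf{K}_{i,t} \mathbf{H}_{i,t} \boldsymbol{\zeta}_{i,t} - \mathbf{K}_{i,t} \boldsymbol{\chi}_{i,t})^T \mathbf{P}_{i,t}^{-1} \boldsymbol{\zeta}_{i,t} - \boldsymbol{\zeta}_{i,t+1}^T (\mathbf{P}_{i,t+1} - q_t \mathbf{I})^{-1} \mathbf{K}_{i,t} \boldsymbol{\chi}_{i,t} && (35) \\
&= -\boldsymbol{\zeta}_{i,t}^T \mathbf{H}_{i,t}^T \mathbf{K}_{i,t}^T \mathbf{P}_{i,t}^{-1} \boldsymbol{\zeta}_{i,t} - \boldsymbol{\chi}_{i,t}^T \mathbf{K}_{i,t}^T \mathbf{P}_{i,t}^{-1} \boldsymbol{\zeta}_{i,t} - \boldsymbol{\zeta}_{i,t+1}^T (\mathbf{P}_{i,t+1} - q_t \mathbf{I})^{-1} \mathbf{K}_{i,t} \boldsymbol{\chi}_{i,t}. && (36)
\end{aligned}
$$

Here, (32) follows from the 2nd statement in Lemma 1, (33) follows from (22), and (35) follows from (19).

For the sake of notational simplicity, we introduce $\mathbf{M}_{i,t} = \mathbf{H}_{i,t} \mathbf{P}_{i,t} \mathbf{H}_{i,t}^T + r_{i,t} \mathbf{I}$, so that $\mathbf{K}_{i,t} = \mathbf{P}_{i,t} \mathbf{H}_{i,t}^T \mathbf{M}_{i,t}^{-1}$. Then, we write (36) as

$$
\Delta V_{i,t} \leq -\boldsymbol{\zeta}_{i,t}^T \mathbf{H}_{i,t}^T \mathbf{M}_{i,t}^{-1} \mathbf{H}_{i,t} \boldsymbol{\zeta}_{i,t} - \boldsymbol{\chi}_{i,t}^T \mathbf{M}_{i,t}^{-1} \mathbf{H}_{i,t} \boldsymbol{\zeta}_{i,t} - \boldsymbol{\zeta}_{i,t+1}^T (\mathbf{P}_{i,t+1} - q_t \mathbf{I})^{-1} \mathbf{K}_{i,t} \boldsymbol{\chi}_{i,t}. \qquad (37)
$$

We write the last term in (37) as

$$
\begin{aligned}
-\boldsymbol{\zeta}_{i,t+1}^T (\mathbf{P}_{i,t+1} - q_t \mathbf{I})^{-1} \mathbf{K}_{i,t} \boldsymbol{\chi}_{i,t} &= -\boldsymbol{\chi}_{i,t}^T \mathbf{K}_{i,t}^T (\mathbf{P}_{i,t+1} - q_t \mathbf{I})^{-1} \boldsymbol{\zeta}_{i,t+1} && (38) \\
&= -\boldsymbol{\chi}_{i,t}^T \mathbf{K}_{i,t}^T \left( \mathbf{P}_{i,t}^{-1} \boldsymbol{\zeta}_{i,t} - (\mathbf{P}_{i,t+1} - q_t \mathbf{I})^{-1} \mathbf{K}_{i,t} \boldsymbol{\chi}_{i,t} \right) && (39) \\
&= -\boldsymbol{\chi}_{i,t}^T \mathbf{K}_{i,t}^T \mathbf{P}_{i,t}^{-1} \boldsymbol{\zeta}_{i,t} + \boldsymbol{\chi}_{i,t}^T \mathbf{K}_{i,t}^T (\mathbf{P}_{i,t+1} - q_t \mathbf{I})^{-1} \mathbf{K}_{i,t} \boldsymbol{\chi}_{i,t} && (40) \\
&= -\boldsymbol{\chi}_{i,t}^T \mathbf{M}_{i,t}^{-1} \mathbf{H}_{i,t} \boldsymbol{\zeta}_{i,t} + \boldsymbol{\chi}_{i,t}^T \mathbf{K}_{i,t}^T (\mathbf{P}_{i,t+1} - q_t \mathbf{I})^{-1} \mathbf{K}_{i,t} \boldsymbol{\chi}_{i,t}, && (41)
\end{aligned}
$$

where (39) follows from the 3rd statement in Lemma 1.

By using (41) in (37), we write

$$
\Delta V_{i,t} \leq -\boldsymbol{\zeta}_{i,t}^T \mathbf{H}_{i,t}^T \mathbf{M}_{i,t}^{-1} \mathbf{H}_{i,t} \boldsymbol{\zeta}_{i,t} - 2 \boldsymbol{\chi}_{i,t}^T \mathbf{M}_{i,t}^{-1} \mathbf{H}_{i,t} \boldsymbol{\zeta}_{i,t} + \boldsymbol{\chi}_{i,t}^T \mathbf{K}_{i,t}^T (\mathbf{P}_{i,t+1} - q_t \mathbf{I})^{-1} \mathbf{K}_{i,t} \boldsymbol{\chi}_{i,t}. \qquad (42)
$$

We add and subtract $\boldsymbol{\chi}_{i,t}^T \mathbf{M}_{i,t}^{-1} \boldsymbol{\chi}_{i,t}$ in (42), and group the terms as

$$
\Delta V_{i,t} \leq \boldsymbol{\chi}_{i,t}^T \left( \mathbf{K}_{i,t}^T (\mathbf{P}_{i,t+1} - q_t \mathbf{I})^{-1} \mathbf{K}_{i,t} + \mathbf{M}_{i,t}^{-1} \right) \boldsymbol{\chi}_{i,t} - (\mathbf{H}_{i,t} \boldsymbol{\zeta}_{i,t} + \boldsymbol{\chi}_{i,t})^T \mathbf{M}_{i,t}^{-1} (\mathbf{H}_{i,t} \boldsymbol{\zeta}_{i,t} + \boldsymbol{\chi}_{i,t}). \qquad (43)
$$

By using the second statement in Lemma 1, the definition of $\mathbf{e}_t$ in (17), and the formulation of $\mathbf{K}_{i,t}$, we write (43) as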

 ΔVi,t ≤χTi,t(M−1i,tHi,tPi,t(P−1i,t+r−1i,tHTi,tHi,t)Pi,tHTi,tM−1i,t +M−1i,t)χi,t−% eTtM−1i,tet =χTi,t(r−1i,tM−1i,tHi,tPi,tHTi,tHi,tPi,tHTi,tM−1t +M−1i,tHi,tPi,tHTi,tM−1i,t+M−1i,t)χi,t−e