Tikhonov Regularization for Long Short-Term Memory Networks

08/09/2017 ∙ by Andrei Turkin, et al. ∙ National Research University of Electronic Technology (MIET)

It is well known that adding noise to the input data often improves network performance. While dropout may cause memory loss when it is applied to recurrent connections, Tikhonov regularization, which can be regarded as training with additive noise, avoids this issue naturally, although it requires a regularizer to be derived for each architecture. For feedforward neural networks this is straightforward, whereas for networks with recurrent connections and complicated layers it leads to some difficulties. In this paper, a Tikhonov regularizer is derived for Long Short-Term Memory (LSTM) networks. Although it is independent of time for simplicity, it accounts for the interaction between the weights of the LSTM unit, which in theory makes it possible to regularize a unit with complicated dependencies by using only one parameter that measures the input data perturbation. The proposed regularizer has three parameters: one to control the regularization process, and two others to maintain computational stability while the network is being trained. The theory developed in this paper can be applied to obtain such regularizers for other recurrent neural networks with Hadamard products and Lipschitz continuous functions.




1 Introduction

A recurrent neural network with the many-to-one architecture can be viewed as a mapping with a set of parameters , where is the set of indexes each of which is regarded as time when some , was taken. In this formulation, the input data are the following set of inputs: , the output data are .

One way to construct the mapping is to use LSTM units. The concept was introduced in [6] as a remedy for the vanishing gradient problem and was refined in [3] and in later papers (see, for instance, [4, 7]). The LSTM unit has three gates (input, output, and forget) that control the data flow through the unit. Its input and the gates have parameters that must be trained with some regularization, which often improves network performance and prevents overfitting. Despite the tremendous performance gains in many applications and the abundance of techniques to regularize networks, including dropout [5, 11] and weight decay ($L_2$ regularization), the standard regularization approach, recurrent neural networks (RNNs) in general, and LSTMs in particular, may suffer from overfitting. Using these techniques with feedforward neural networks is straightforward, but applying them to RNNs leads to some difficulties. First, when dropout is applied to recurrent connections, it may cause the memory loss problem that the authors of [2, 10, 13] tried to avoid. Second, although $L_2$ regularization can be used, it is not obvious how it should be applied: whether a single regularization parameter should be used, or several parameters that regularize the parts of the unit differently. Note that the latter case is more computationally intensive than the former, which leads to slower training, since the optimal values of all the parameters must be found.

These problems can be addressed by deriving a regularizer for the LSTM unit that is based on the Tikhonov regularization technique. In [1] it was shown that adding noise to the input data is equivalent to Tikhonov regularization. At almost the same time, the authors of [12] showed that the concept can be applied to recurrent neural networks, as the regularizer can be obtained by calculating an upper bound on the squared output disturbance caused by perturbing the input with independent random noise that has zero mean and a given variance.

In this paper, the upper bound is calculated to obtain a regularizer for LSTM networks for the case of a regression task with the sum-of-squares objective, though the regularizer can be derived for any other loss.
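The equivalence between training with noise and Tikhonov regularization can be checked numerically in the simplest setting. The sketch below is an illustration only, assuming a linear model and Gaussian input noise (neither is the LSTM setting of this paper), and all names in it are hypothetical. For such a model the expected sum-of-squares loss under input noise of variance sigma^2 equals the clean loss plus sigma^2 times the squared norm of the weights.

```python
import random

def clean_loss(w, x, y):
    """Squared error of the linear model w.x on a single sample."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    return (pred - y) ** 2

def tikhonov_loss(w, x, y, sigma):
    """Clean squared error plus the penalty sigma^2 * ||w||^2."""
    return clean_loss(w, x, y) + sigma ** 2 * sum(wi * wi for wi in w)

def noisy_loss(w, x, y, sigma, n_draws, rng):
    """Monte Carlo estimate of the expected squared error under input noise."""
    total = 0.0
    for _ in range(n_draws):
        x_noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        total += clean_loss(w, x_noisy, y)
    return total / n_draws

# Illustrative values, chosen arbitrarily for this sketch.
rng = random.Random(0)
w, x, y, sigma = [0.5, -1.0, 2.0], [1.0, 2.0, 0.5], 0.3, 0.1

exact = tikhonov_loss(w, x, y, sigma)
estimate = noisy_loss(w, x, y, sigma, 100_000, rng)
print("penalized loss:", exact)
print("noisy-input estimate:", estimate)
```

The two printed values should agree up to Monte Carlo error. For nonlinear networks the correspondence is no longer exact, which is why an upper bound on the output disturbance is used instead.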

The paper is organized as follows. Section 2 describes the network architecture and the regularizer, which is derived by assessing the upper bound of the output perturbation. Section 3 provides the theoretical justification for the form of the LSTM regularizer. Section 4 describes the learning procedure with the regularizer derived previously and the relaxed optimization problem. Section 5 concludes the paper.

2 The Output Perturbation

It is assumed that a layered network topology with the LSTM units is used. Since it is not important for further analysis which output is used, the standard dense layer with the sigmoid function is considered:


The objective is to assess the upper bound of the output perturbation , which is the result of the input perturbation . Thus, it is possible to write the following equation for output perturbation:


Obviously, the upper bound for (2) can be found by applying the mean value theorem so the result can be written as follows:


where denotes a point, which is somewhere in between and .

Considering that and for any point , it is possible to write the following equation.


The output perturbation depends on the LSTM layer perturbation and on the dense layer parameters only. Therefore, an upper bound on the LSTM layer perturbation must be assessed to obtain the regularizer for the network.
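The argument of this section can be illustrated numerically. The displayed equations are not reproduced in this copy, so the sketch below relies on an assumption: that the bound follows from the mean value theorem together with the fact that the derivative of the sigmoid never exceeds 1/4. Under that assumption, for a single sigmoid output unit with weight vector `w`, the output perturbation is at most 0.25 * ||w|| * ||delta||, where `delta` is the perturbation of the layer input. All names are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense_output(w, b, h):
    """Single sigmoid output unit applied to the layer input h."""
    return sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

def perturbation_bound(w, delta):
    """Upper bound implied by sigmoid'(z) <= 1/4 and Cauchy-Schwarz."""
    return 0.25 * norm(w) * norm(delta)

rng = random.Random(1)
violations = 0
for _ in range(1000):
    w = [rng.uniform(-2.0, 2.0) for _ in range(4)]
    h = [rng.uniform(-1.0, 1.0) for _ in range(4)]
    delta = [rng.gauss(0.0, 0.1) for _ in range(4)]
    b = rng.uniform(-1.0, 1.0)
    h_pert = [hi + di for hi, di in zip(h, delta)]
    gap = abs(dense_output(w, b, h_pert) - dense_output(w, b, h))
    # Small tolerance guards against floating-point rounding only.
    if gap > perturbation_bound(w, delta) + 1e-12:
        violations += 1
print("violations:", violations)
```

The check shows why the remaining work concentrates on the LSTM layer: once the perturbation of the layer input is bounded, the dense layer contributes only a factor that depends on its own weights.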

3 The LSTM unit output perturbation

The LSTM unit [3] can be described by using the following equations:


where are the weight matrices and are the biases, , , , .
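Since the unit equations are not reproduced in this copy, the sketch below shows one step of an LSTM unit in the commonly used formulation with input, forget, and output gates and no peephole connections. This is an assumption: the notation and details of equations (5)-(11) may differ. The parameter names (`input_gate`, `cell_input`, and so on) are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, U, b, x, h):
    """Return W x + U h + b for matrices stored as lists of rows."""
    return [
        sum(w * xj for w, xj in zip(W_row, x))
        + sum(u * hj for u, hj in zip(U_row, h))
        + bi
        for W_row, U_row, bi in zip(W, U, b)
    ]

def lstm_step(params, x, h_prev, c_prev):
    """One time step: returns the new hidden state h and memory cell c."""
    pre = {name: affine(W, U, b, x, h_prev) for name, (W, U, b) in params.items()}
    i = [sigmoid(v) for v in pre["input_gate"]]
    f = [sigmoid(v) for v in pre["forget_gate"]]
    o = [sigmoid(v) for v in pre["output_gate"]]
    g = [math.tanh(v) for v in pre["cell_input"]]
    # Hadamard products combine the gates with the memory and the input.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

# Demo with all-zero parameters: every gate equals 0.5, the cell input is 0.
n_in, n_hid = 2, 3
names = ("input_gate", "forget_gate", "output_gate", "cell_input")
params = {name: (zeros(n_hid, n_in), zeros(n_hid, n_hid), [0.0] * n_hid) for name in names}
h, c = lstm_step(params, [1.0, -1.0], [0.0] * n_hid, [1.0, -2.0, 0.5])
print(h, c)
```

The element-wise products in the last two updates are the Hadamard products that make the perturbation analysis of the unit harder than that of a plain dense layer.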

Considering the equations (1) and (5)-(11), it can be concluded that the objective model has the following set of parameters:


3.1 The upper bound of the recurrent connection perturbation

Before finding the upper bound of , it is necessary to prove the following

Proposition 1.

Suppose that is an open set in , such that contains the line segment from to and and are differentiable real-valued functions on . Then the upper bound of the difference of Hadamard products can be found as follows


where for some points , , and perturbations and , , .


Applying the mean value theorem to each component of the vector on the left side of Equation (13), it is possible to write that

Based on this fact, the norm of the difference of Hadamard products for two functions and can be rewritten as follows:




Applying the Cauchy inequality, one can get


Applying the Cauchy inequality again to the previously obtained equation, it is possible to get the desired result. ∎
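A simpler relative of this result can be checked numerically. The sketch below does not reproduce the bound (13), which is not visible in this copy; it verifies an elementary bound on the difference of Hadamard products obtained from the identity f~ ⊙ g~ − f ⊙ g = f~ ⊙ (g~ − g) + (f~ − f) ⊙ g, the triangle inequality, and the fact that the Euclidean norm of a Hadamard product never exceeds the product of the norms. All names are hypothetical.

```python
import math
import random

def hadamard(a, b):
    """Element-wise (Hadamard) product of two equal-length vectors."""
    return [x * y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def product_gap(f, g, f_pert, g_pert):
    """Norm of the difference between perturbed and unperturbed products."""
    return norm(sub(hadamard(f_pert, g_pert), hadamard(f, g)))

def product_gap_bound(f, g, f_pert, g_pert):
    """Elementary bound: ||f~|| ||g~ - g|| + ||g|| ||f~ - f||."""
    return norm(f_pert) * norm(sub(g_pert, g)) + norm(g) * norm(sub(f_pert, f))

rng = random.Random(2)
violations = 0
for _ in range(1000):
    f = [rng.uniform(-1.0, 1.0) for _ in range(5)]
    g = [rng.uniform(-1.0, 1.0) for _ in range(5)]
    f_pert = [x + rng.gauss(0.0, 0.1) for x in f]
    g_pert = [x + rng.gauss(0.0, 0.1) for x in g]
    # Small tolerance guards against floating-point rounding only.
    if product_gap(f, g, f_pert, g_pert) > product_gap_bound(f, g, f_pert, g_pert) + 1e-12:
        violations += 1
print("violations:", violations)
```

In the proposition, the perturbations of the two factors are in turn bounded through the mean value theorem, which is what ties the bound back to the input perturbation.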

Considering equations (5), (6), and (7) together with Proposition 1, it is possible to write the equation for as follows


where is assessed as


and , , is the line segment from to .

Considering the equation (17), it is possible to state that in order to minimize , it is necessary to assess the following two perturbations:




3.2 The upper bound of the output gate perturbation

The upper bound of the output gate perturbation can be assessed by using the following

Proposition 2.

It holds that

where , , is the line segment from to .


Let and be the following differences: , . Then applying the equation (7) to (19), we can get the following result:


Considering the equation (7) and the following fact


one can write the following dynamic function for some [12]:


where .


then its derivative can be estimated as follows


Therefore, considering that and , it is possible to get the following equation for its derivative:


where .

Thus, applying the equation (25) to (24), the latter one can be rewritten as


where , , .

The upper bound of (24) can be found in the following three steps. First, considering equation (5), Proposition 1, and Hölder's inequality (footnote 2: since , then ), the upper bound of can be obtained as


Second, the upper bound of is


Thus, after some simplifications the upper bound of (24) can be written as


where , , and for some .



Let us assume that the input perturbation and the memory perturbation are either constants or change more slowly than . Thus, the equation (30) can be rewritten as follows



Applying the Grönwall inequality (see, for instance, [9]) to the equation (31) for , one can get the upper bound of :


Substituting the previously defined constants, we can end up with the upper bound of :


which proves the proposition. ∎
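For reference, the differential form of the Grönwall inequality with constant coefficients can be stated as below. The symbols $u$, $\alpha$, and $\beta$ are generic placeholders introduced here, not the constants defined in the proof; the constant-coefficient case matches the assumption that the input and memory perturbations are constant or slowly varying.

```latex
u'(t) \le \alpha\, u(t) + \beta, \quad t \in [0, T]
\;\Longrightarrow\;
u(t) \le u(0)\, e^{\alpha t} + \frac{\beta}{\alpha}\left(e^{\alpha t} - 1\right),
\quad \alpha \ne 0 .
```

The implication follows by multiplying both sides of the differential inequality by $e^{-\alpha t}$ and integrating from $0$ to $t$.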

It should be noted that for the purpose of computational stability it is assumed that .

3.3 The upper bound of the memory perturbation

In order to minimize the output gate perturbation, it is necessary to take into account the memory perturbation (20). This perturbation is the result of applying the following functions to the input data of the unit: the forget gate function (), the input gate function (), and the input of the unit function (). Thus, it is possible to write the following equation for the memory perturbation:


Therefore, the memory perturbation can be assessed by finding the upper bound of the following norm


where .

Considering the equation (37), it is possible to write the following

Proposition 3.

It holds that


where , .


As in the proof of Proposition 2, it is possible to write the following equation for the derivative for :


First, it is necessary to rewrite the second part of the equation by using the following one:


Therefore, the second part of (39) can be rewritten by using the equations (40) and (41) as follows:


where , , .

Considering the equation (42), the equation (39) can be rewritten as follows:


Due to the fact that , it is possible to apply Lemma 2 from [8] to get the following equation:


Applying the mean value theorem, we can find an upper bound of the first part of the equation as follows:


where is the recurrent connection perturbation, , denotes a point, which is between and , ; therefore .

Applying Proposition 1, the upper bound of the second part squared can be assessed as


where , , ; therefore, .

By applying a previously used inequality (), it is possible to write that


Therefore, the upper bound of (39) is


where ,

Applying Lemma 2 from [8], the upper bound of can be calculated as follows


where .

Assuming that and are either constants or change more slowly than , one can get the following result for the interval :


which proves the proposition. ∎

3.4 The upper bound of the output perturbation

Based on Proposition 3, it is possible to rewrite equation (17) as follows:


where and are independent of the time variable and can be calculated based on the parameters of the model only:


After some simplifications, it is possible to conclude that


4 The Learning Procedure

Considering (54) and the upper bound of the output perturbation:


the regularizer can be written as follows


where is a constant that measures the degree of the input perturbation.

It is possible to relax this problem to get the following objective function:


where and , , and are the parameters that must be assessed during the training procedure based on the model evaluation criterion.

The complex regularizer that is the right part of (57) has three parameters: , which is the main parameter of the regularization, and , , which are used to maintain computational stability during training.

5 Conclusion

In this paper, the Tikhonov regularizer is derived for the LSTM unit by finding the upper bound of the output perturbation, which is the difference between the actual output of the network and the output that is observed when noise is added to the inputs of the network. The regularizer has three parameters: the first measures the degree of input perturbation and thus controls the regularization process, while the other two maintain the computational stability of the regularization. The regularizer can be used to address the overfitting problem in LSTM networks by taking into account not only the weights of the gates independently, but also the interaction between them as parts of the complex LSTM structure. A mathematical justification of the derivation is provided, which makes it possible to obtain regularizers for other architectures.


  • [1] Chris M. Bishop. Training with Noise is Equivalent to Tikhonov Regularization. Neural Computation, 7(1):108–116, jan 1995.
  • [2] Yarin Gal and Zoubin Ghahramani. A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. 2016.
  • [3] Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12(10):2451–2471, oct 2000.
  • [4] Alex Graves and Jürgen Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6):602–610, jul 2005.
  • [5] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. jul 2012.
  • [6] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, nov 1997.
  • [7] R. Jozefowicz, W. Zaremba, and I. Sutskever. An empirical exploration of recurrent network architectures. 2015.
  • [8] B. G. Pachpatte. Comparison Theorems Related to Certain Inequality used in the Theory of Differential Equations. Soochow Journal of Mathematics, 22(3):383–394, 1996.
  • [9] B. G. Pachpatte. Inequalities for differential and integral equations. Academic Press, 1998.
  • [10] Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. Recurrent Dropout without Memory Loss. 2016.
  • [11] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
  • [12] Lizhong Wu and John Moody. A Smoothing Regularizer for Feedforward and Recurrent Neural Networks. Neural Computation, 8(3):461–489, apr 1996.
  • [13] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent Neural Network Regularization. 27(3):100, 2014.