A recurrent neural network with the many-to-one architecture can be viewed as a mapping with a set of parameters , where is the set of indexes each of which is regarded as time when some , was taken. In this formulation, the input data are the following set of inputs: , the output data are .
One way to construct the mapping is to use LSTM units. The concept was introduced in Hochreiter1997LongMemory as a remedy for the vanishing gradient problem and refined in Gers2000a and in later papers (see, for instance, Graves2005FramewiseArchitectures ; Jozefowicz2015AnArchitectures ). The LSTM unit has three gates (input, output, and forget) that control the data flow through the unit. The unit input and the gates have parameters that must be trained, usually with some regularization, which often improves network performance and prevents overfitting. Despite the tremendous performance gains in many applications and the abundance of techniques for regularizing networks, including dropout Hinton2012 ; Srivastava2014 and weight decay ($L_2$ regularization), the standard regularization approach, recurrent neural networks (RNNs) in general, and LSTMs in particular, may suffer from overfitting. Applying these techniques to feedforward neural networks is straightforward, but applying them to RNNs leads to some difficulties. First, when dropout is applied to recurrent connections, it may cause the memory loss problem that the authors of Gal2016 ; Semeniuta2016 ; Zaremba2014 tried to avoid. Second, although $L_2$ regularization can be used, it is not obvious how it should be applied: with a single regularization parameter, or with several parameters that regularize different parts of the unit differently. Note that the latter case is more computationally intensive than the former and leads to slower training, since the optimal values of several parameters must be found.
It is feasible to address these problems by deriving a regularizer for the LSTM unit that is based on the Tikhonov regularization technique. In Bishop1995TrainingRegularization it was shown that adding noise to the input data is equivalent to Tikhonov regularization. At almost the same time, the authors of Wu1996ANetworks showed that the concept can be applied to recurrent neural networks, as the regularizer can be obtained by calculating the upper bound of the squared output disturbance , where and is the independent random noise with zero mean and variance.
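The equivalence between input noise and a Tikhonov-type penalty can be illustrated on the simplest possible case, a linear model with a sum-of-squares error. The sketch below is not taken from the cited papers: the linear model, the names `w`, `x`, `y`, and `sigma`, and the Monte Carlo check are assumptions made for illustration only. For such a model the expected error under zero-mean input noise with variance `sigma**2` equals the clean error plus `sigma**2 * ||w||**2`.

```python
import random

def squared_error(w, x, y):
    """Squared error of the linear model w.x on a single sample."""
    prediction = sum(wi * xi for wi, xi in zip(w, x))
    return (prediction - y) ** 2

def noisy_error_estimate(w, x, y, sigma, n_samples=100_000, seed=0):
    """Monte Carlo estimate of the expected error when i.i.d. zero-mean
    Gaussian noise with standard deviation sigma is added to the input."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x_noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        total += squared_error(w, x_noisy, y)
    return total / n_samples

def tikhonov_error(w, x, y, sigma):
    """Clean error plus the penalty sigma^2 * ||w||^2 (linear model only)."""
    return squared_error(w, x, y) + sigma ** 2 * sum(wi * wi for wi in w)

# Hypothetical values chosen for the illustration.
w = [0.5, -1.0, 2.0]
x = [1.0, 0.3, -0.7]
y = 0.2
sigma = 0.1

estimate = noisy_error_estimate(w, x, y, sigma)
closed_form = tikhonov_error(w, x, y, sigma)
print(estimate, closed_form)
```

The two printed numbers should agree up to sampling error. For nonlinear networks the relation is only approximate, which is why an upper bound on the output disturbance is used instead.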
In this paper, the upper bound is calculated to obtain a regularizer for LSTM networks for a regression task with the sum-of-squares objective, although the regularizer can be derived for other losses as well.
The paper is organized as follows. Section 2 describes the network architecture and the regularizer, which is derived by assessing the upper bound of the output perturbation. Section 3 provides the theoretical justification for the form of the LSTM regularizer. Section 4 describes the learning procedure with the regularizer derived previously and the relaxed optimization problem. Section 5 concludes the paper.
2 The Output Perturbation
It is assumed that a layered network topology with the LSTM units is used. Since it is not important for further analysis which output is used, the standard dense layer with the sigmoid function is considered:
The objective is to assess the upper bound of the output perturbation , which is the result of the input perturbation . Thus, it is possible to write the following equation for output perturbation:
Obviously, the upper bound for (2) can be found by applying the mean value theorem so the result can be written as follows:
where denotes a point, which is somewhere in between and .
Considering that and for any point , it is possible to write the following equation.
The output perturbation depends only on the LSTM layer perturbation and on the dense layer parameters. Therefore, an upper bound on the LSTM layer perturbation must be found to obtain the regularizer for the network.
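The mean value theorem argument for the dense layer can be checked numerically. The sketch below assumes a single sigmoid output and uses the standard fact that the sigmoid derivative never exceeds 1/4, so the output change is at most `0.25 * ||w|| * ||delta_h||`. The names `w`, `b`, `h`, and `delta_h` are hypothetical, and the bound shown is this generic one, not necessarily the exact expression derived in this section.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense_output(w, b, h):
    """Scalar output of a dense layer with a sigmoid activation."""
    return sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

def perturbation_and_bound(w, b, h, delta_h):
    """Return the actual output change and the bound 0.25 * ||w|| * ||delta_h||."""
    h_perturbed = [hi + di for hi, di in zip(h, delta_h)]
    actual = abs(dense_output(w, b, h_perturbed) - dense_output(w, b, h))
    bound = 0.25 * norm(w) * norm(delta_h)
    return actual, bound

# Random trials with hypothetical weights, hidden states, and perturbations.
rng = random.Random(0)
results = []
for _ in range(1000):
    w = [rng.uniform(-2.0, 2.0) for _ in range(4)]
    h = [rng.uniform(-1.0, 1.0) for _ in range(4)]
    delta_h = [rng.gauss(0.0, 0.1) for _ in range(4)]
    results.append(perturbation_and_bound(w, 0.1, h, delta_h))

print(all(actual <= bound + 1e-12 for actual, bound in results))
```

The bound involves only the dense layer weights and the norm of the hidden-state perturbation, which matches the observation that the remaining work lies in bounding the LSTM layer perturbation.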
3 The LSTM unit output perturbation
The LSTM unit Gers2000a can be described by using the following equations:
where are the weight matrices and are the biases, , , , .
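For orientation, one forward step of an LSTM unit with a forget gate can be sketched as follows. This is a minimal sketch that assumes the commonly used formulation without peephole connections; the parameter layout (`params` with keys `"forget"`, `"input"`, `"output"`, `"candidate"`) and all function names are hypothetical, and the exact parametrization used in the equations above may differ.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, U, b, x, h):
    """Return W x + U h + b for matrices stored as lists of rows."""
    return [
        sum(w * xj for w, xj in zip(W_row, x))
        + sum(u * hj for u, hj in zip(U_row, h))
        + bi
        for W_row, U_row, bi in zip(W, U, b)
    ]

def lstm_step(params, x, h_prev, c_prev):
    """One step of an LSTM unit; params[name] is a tuple (W, U, b)."""
    f = [sigmoid(z) for z in affine(*params["forget"], x, h_prev)]       # forget gate
    i = [sigmoid(z) for z in affine(*params["input"], x, h_prev)]        # input gate
    o = [sigmoid(z) for z in affine(*params["output"], x, h_prev)]       # output gate
    g = [math.tanh(z) for z in affine(*params["candidate"], x, h_prev)]  # unit input
    # Memory update and recurrent output use element-wise (Hadamard) products.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

# Usage with all-zero parameters: every gate equals 0.5 and the unit input is 0.
zeros = ([[0.0], [0.0]], [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
params = {name: zeros for name in ("forget", "input", "output", "candidate")}
h, c = lstm_step(params, x=[1.0], h_prev=[0.0, 0.0], c_prev=[1.0, -1.0])
print(h, c)
```

The memory and output updates are built from Hadamard products of gate activations, which is why a bound on the difference of Hadamard products is needed in the analysis that follows.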
3.1 The upper bound of the recurrent connection perturbation
Before finding the upper bound of , it is necessary to prove the following proposition.
Proposition 1. Suppose that is an open set in , such that contains the line segment from to and and are differentiable real-valued functions on . Then the upper bound of the difference of Hadamard products can be found as follows
where for some points , , and perturbations and , , .
Proof. Applying the mean value theorem to each component of the vector on the left-hand side of Equation (13), it is possible to write that
Based on this fact, the norm of the difference of Hadamard products for two functions and can be rewritten as follows:
Applying the Cauchy inequality, one can get
Applying the Cauchy inequality again to the previously obtained equation, it is possible to get the desired result. ∎
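The idea behind bounding a difference of Hadamard products can be checked numerically with a simpler, generic inequality. The sketch below is an assumption-laden illustration, not the exact statement of the proposition: it uses the decomposition f'∘g' − f∘g = (f' − f)∘g' + f∘(g' − g) together with ||u∘v|| ≤ ||u||·||v||, and all function and variable names are hypothetical.

```python
import math
import random

def hadamard(u, v):
    """Element-wise (Hadamard) product of two vectors."""
    return [ui * vi for ui, vi in zip(u, v)]

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

def difference_and_bound(f, g, f_pert, g_pert):
    """Norm of f_pert∘g_pert − f∘g and the bound
    ||f_pert − f||·||g_pert|| + ||f||·||g_pert − g||."""
    diff = norm([p - q for p, q in zip(hadamard(f_pert, g_pert), hadamard(f, g))])
    df = [p - q for p, q in zip(f_pert, f)]
    dg = [p - q for p, q in zip(g_pert, g)]
    bound = norm(df) * norm(g_pert) + norm(f) * norm(dg)
    return diff, bound

# Random trials with hypothetical vectors and perturbations.
rng = random.Random(1)
checks = []
for _ in range(1000):
    f = [rng.uniform(-1.0, 1.0) for _ in range(5)]
    g = [rng.uniform(-1.0, 1.0) for _ in range(5)]
    f_pert = [fi + rng.gauss(0.0, 0.05) for fi in f]
    g_pert = [gi + rng.gauss(0.0, 0.05) for gi in g]
    checks.append(difference_and_bound(f, g, f_pert, g_pert))

print(all(diff <= bound + 1e-12 for diff, bound in checks))
```

In the proposition, the perturbations of the two factors are themselves expressed through derivatives by the mean value theorem; the sketch leaves them as plain differences.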
where is assessed as
and , , is the line segment from to .
Considering the equation (17), it is possible to state that in order to minimize , it is necessary to assess the following two perturbations:
3.2 The upper bound of the output gate perturbation
The upper bound of the output gate perturbation can be assessed by using the following proposition.
Proposition 2. It holds that
where , , is the line segment from to .
Proof. Considering Equation (7) and the following fact
one can write the following dynamic function for some Wu1996ANetworks :
then its derivative can be estimated as follows
Therefore, considering that and , it is possible to get the following equation for its derivative:
The upper bound of (24) can be found in the following three steps. First, considering Equation (5), Proposition 1, and Hölder's inequality (since , then ), the upper bound of can be obtained as
Second, the upper bound of is
Thus, after some simplifications the upper bound of (24) can be written as
where , , and for some .
Let us assume that the input perturbation and the memory perturbation are either constants or change more slowly than . Thus, the equation (30) can be rewritten as follows
Substituting the previously defined constants, we obtain the upper bound of :
which proves the proposition. ∎
It should be noted that for the purpose of computational stability it is assumed that .
3.3 The upper bound of the memory perturbation
In order to minimize the output gate perturbation, it is necessary to take into account the memory perturbation (20). This perturbation results from applying the following functions to the input data of the unit: the forget gate function, the input gate function, and the unit input function. Thus, it is possible to write the following equation for the memory perturbation:
Therefore, the memory perturbation can be assessed by finding the upper bound of the following norm
Considering Equation (37), it is possible to state the following proposition.
Proposition 3. It holds that
where , .
Proof. As in Proposition 2, it is possible to write the following equation for the derivative of :
First, it is necessary to rewrite the second part of the equation by using the following one:
where , , .
Due to the fact that , it is possible to apply Lemma 2 from Pachpatte1996 to get the following equation:
Applying the mean value theorem, we can find an upper bound of the first part of the equation as follows:
where is the recurrent connection perturbation, , denotes a point, which is between and , ; therefore .
Applying Proposition 1, the upper bound of the second part squared can be assessed as
where , , ; therefore, .
By applying the previously used inequality, it is possible to write that
Therefore, the upper bound of (39) is
Applying Lemma 2 from Pachpatte1996 , the upper bound of can be calculated as follows
Assuming that and are either constants or change more slowly than , one can get the following result for the interval :
which proves the proposition. ∎
3.4 The upper bound of the output perturbation
where and are independent of the time variable and can be calculated based on the parameters of the model only:
After some simplifications, it is possible to conclude that
4 The Learning Procedure
Considering (54) and the upper bound of the output perturbation:
the regularizer can be written as follows
where is a constant that measures the degree of the input perturbation.
It is possible to relax this problem to get the following objective function:
where and , , and are the parameters that must be assessed during the training procedure based on the model evaluation criterion.
The composite regularizer on the right-hand side of (57) has three parameters: , which is the main regularization parameter, and , , which are used to maintain computational stability during training.
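The general shape of such a regularized objective can be sketched as a sum-of-squares error plus a weighted penalty. The sketch below is only a skeleton under stated assumptions: the 1/2 factor in the error, the names `lam`, `penalty`, and `placeholder_penalty` are hypothetical, and the placeholder penalty is a plain squared weight norm standing in for the derived regularizer, whose exact form is given by the equations above.

```python
def sum_of_squares(predictions, targets):
    """Sum-of-squares error (the 1/2 factor is an assumption of this sketch)."""
    return 0.5 * sum((p - t) ** 2 for p, t in zip(predictions, targets))

def regularized_objective(predictions, targets, weights, lam, penalty):
    """Data-fit term plus lam times a penalty computed from the model weights."""
    return sum_of_squares(predictions, targets) + lam * penalty(weights)

def placeholder_penalty(weights):
    # Stand-in only: a squared weight norm, NOT the regularizer derived in the paper.
    return sum(w * w for w in weights)

# Hypothetical predictions, targets, and weights.
loss = regularized_objective(
    predictions=[0.9, 0.2],
    targets=[1.0, 0.0],
    weights=[0.5, -0.5],
    lam=0.1,
    penalty=placeholder_penalty,
)
print(loss)
```

In practice `lam` and any stability parameters of the penalty would be selected on held-out data according to the model evaluation criterion, as the text describes.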
5 Conclusion
In this paper, a Tikhonov regularizer is derived for the LSTM unit by finding the upper bound of the output perturbation, which is the difference between the actual output of the network and the output observed when noise is added to the network inputs. The regularizer has three parameters: the first measures the degree of input perturbation and thus controls the regularization, and the other two maintain the computational stability of the regularization. The regularizer can be used to address the overfitting problem in LSTM networks by taking into account not only the weights of the individual gates, but also the interaction between them as parts of the complex LSTM structure. A mathematical justification of the derivation is provided, which makes it possible to obtain regularizers for other architectures.
References
-  Chris M. Bishop. Training with Noise is Equivalent to Tikhonov Regularization. Neural Computation, 7(1):108–116, jan 1995.
-  Yarin Gal and Zoubin Ghahramani. A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. 2016.
-  Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12(10):2451–2471, oct 2000.
-  Alex Graves and Jürgen Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6):602–610, jul 2005.
-  Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. jul 2012.
-  Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, nov 1997.
-  R. Jozefowicz, W. Zaremba, and I. Sutskever. An empirical exploration of recurrent network architectures. 2015.
-  B. G. Pachpatte. Comparison Theorems Related to Certain Inequality used in the Theory of Differential Equations. Soochow Journal Of Mathematics, 22(3):383–394, 1996.
-  B. G. Pachpatte. Inequalities for differential and integral equations. Academic Press, 1998.
-  Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. Recurrent Dropout without Memory Loss. 2016.
-  Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
-  Lizhong Wu and John Moody. A Smoothing Regularizer for Feedforward and Recurrent Neural Networks. Neural Computation, 8(3):461–489, apr 1996.
-  Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent Neural Network Regularization. 27(3):100, 2014.