1 Introduction
The study of deep neural network architectures evidenced their efficiency to approximate certain families of complex functions of the input domain
[3, 10]. This characteristic is contrasted with the difficulty of training these neworks using the standard backpropagation algorithm[5]. We present a model that aims to maintain the flexibility of the deep neural networks without producing these problems. Our model increases the expressive power of a recurrent neural network in a way that is comparable with a deep model with the simplicity of training a shallow model.
Our work aims to introduce a modification in the structure of the LSTM cell block, theoreticallydriven by the study of subsequent evaluations of repeating the evaluation of the LSTM cell, updating the hidden state while keeping the input and cell state constant.
We argue that this iterative process defines a dynamic system that accounts for the evolution of the hidden state of the network, driving it to compact regions where the information is presumably retained better. The theory behind the dynamic system studied will be explained in the next sections.
The principal motivation for the proposed modification is the theoretical evidence that the dynamic systems could retain information in the form of an analog state vector. This vector is confined to the basin of attraction corresponding to one of the possible states that are relevant to the resolution of the task
[1]. On the other hand, nonlinear dynamic systems, as the one defined by our model, are capable of defining a complex behavior over the phase plane that is useful to flatten the hidden class manifolds.To decide the number of iterations to perform we found that the usage of a simple logistic regression over the LSTM cell block variables, thresholded to select whether or not to modify the current hidden state by an additional iteration, could serve as a controller that is properly optimized to select whether or not to perform an additional iteration. This simple model avoid the need to select the number of iterations as an hyperparameter of the model.
This controller is inspired by the gating mechanisms already present in LSTM cells and its goal is to weight the importance of the iterations in the final result.
We provide an opensource implementation^{1}^{1}1https://github.com/PalmaLeandro/iterativeLSTM of the LSTM cell network modified as is proposed, along with empirical evidence of the improvements in the performance of the modified models compared to its former baseline on the task of language modeling.
2 Related work
One clear example of an architecture of NN that introduces the iterative scheme is LoopyNN presented in [2]. This recurrent model performs subsequent iterations using the last result as input for the next iteration.
While this behavior is similar to our proposal, differences can be drawn on the method and on the theory behind it.
Nevertheless, much of the characteristics of our model are noted by Caswell and his colleagues. We note that the amount of unrolls of its architecture is related to the amount of iterations performed by our model.
An iterative scheme similar to the proposed one is studied by [7]. That work covers the improvements in the performance of a fully connected NN that performs several evaluations after every new information is exposed to the network. Moreover, such work relates the iterative scheme with recurrent structures found in the human brain. The concept of readout time presented in that work is related to the number of iterations performed by our model. The relation among these concepts is supported by the similar enhancements found in model’s performance while increasing their values.
The analysis performed by [6] explores the chaotic properties of LSTM cells as dynamic systems. Even when the conditions over which its LSTM networks are evaluated differ from real applications such work motivated the study of the nonlinear components of our model in order to avoid the chaotic behavior exhibited and which would prevent the network state to converge towards stable configurations.
3 Iterative LSTM cell
We considered how to fold models that share the weights and state of several layers of LSTM cells to yield a more compact representation that executes a fixed number evaluations.
In order to optimize the model parameters using gradient based methods the state exposed by the network has to be a realvalued vector. Thus, the latching of information in the system is accomplished by the evolution of such vector towards stable configurations[1]. This evolution is governed by the dynamics of a nonautonomous system which is defined by the model’s formulation and is parameterized by the input values at every timestep[9].
For the iterative scheme proposed, the evolution of the network state at every timestep is governed by an autonomous system which would last as long as the model performs additional iterations.
The complete evolution of the network state is the aggregation of the several state changes achieved at every timestep. Such representations are easier to classify by adjacent components due to the sharp frontiers with the set of states that belongs to other attraction basins, and therefore different information.
We note that a model whose state evolution is subject to the conditions mentioned has to organize its fixed points to flatten manifolds into the proper configurations as well as connect or maintain these domains across subsequent timesteps in order to unambiguously recall information.
One important observation about dynamic systems derived from the existence and uniqueness theorem is that the trajectories cannot intersect each other and, over convergent conditions, this shall produce the contraction of the volume of states being attracted. Hence, the contracted volume of states exhibits a simpler surface. This behavior yields sparse and compact regions as more flattened manifolds to be classified by higher layers. Under the stated conditions we expect an increase in the performance of the model modified as is proposed, with respect to its original version. This is supported by the fact that the diffuse or nonlinear edges of latent manifolds jeopardize the model’s performance because of the limitations exhibited by recurrent neural models[4].
In order to achieve the desired convergent conditions we propose a modification to the LSTM structure. This modification aims to induce an autonomous dynamic system whose state, given by the hidden state of the LSTM block, converges to a confined domain closer to the attractor of the basin in which the initial conditions reside. These initial conditions are given by the value of the hidden state at previous timestep.
The model resulting from applying the proposed modification to a LSTM network could be summarized by the following ecuations calculated in the given order, for every iteration .
(1) 
As a result, the evolution of the state through the iterations performed describes a dynamic system. Its phase plane at any timestep of the input sequence is defined by the former internal cell state and the input values since the constants depend on such information.
An iteration activation gate is introduced with the idea of selecting whether to perform an additional iteration and modify the hidden state or expose it to the next layers. This gate consists of a logistic regression of the inputs, the recently calculated state and the internal gate variables , and . Its output is then compared with a threshold which is a parameter of the iteration and varies to more restrictive values to reduce the overall iterations made without constraining the cell to an arbitrary limit.
Another modification introduced in our model is the resolution of a residual mapping of the inputs[5]. This customization of the model is supported by the results found in [5] where the addition of a direct connection of the input allows to reference the result of a deep neural network inference with respect to its input.
Moreover, [7] stated the similarity of the networks that share weights across the depth dimension with residual networks.
Additionally, in [2] the residual mapping is tested experimentally and evidence is found to support its application on neural networks with shared parameters.
4 Convergence of Iterative LSTM cells
The formulation of RNNs produces a wellknown dynamic system studied by [9] and others [1]. Is in the publication of Pascanu et. al. that the effect of the nonautonomous components of the system is explored to expose the challenges that it represents to gradient based methods. Its effect has to be considered in order to produce the intended convergence of the state to the corresponding attractors.
In the proposed iterative scheme this is achieved by maintaining a constant input and cell state values through all the iterations performed by the model within a time step of the input sequence. This constraint yields an autonomous dynamic system whose dynamics are determined by the inputs and cell state values, at a particular timestep of the input sequence.
Then, considering subsequent evaluations of an LSTM network over a constant input and cell state values while varying its hidden state at discrete steps yields the folowing dynamic system
(2) 
where the function corresponds to the calculations required to update the hidden state of the vanilla LSTM network.
This system has been studied by [6], exposing its behavior for more than 200 iterations over a null value of the inputs. The results extracted from that analysis were that the sensibility of the system to variations in the initial conditions produce chaotic trajectories of states.
This implies that sightly different initial states could produce a completely different final state by projecting the behavior of the model long enough. Consequently, the predictions of the model for similar inputs may as well be different.
The following result is intended to provide sufficient conditions to bound the system’s Liapunov coefficients to a subset that leads to a coherent behavior of the model for variations in the initial conditions[12]. By meeting these conditions the difference on the predictions made for slightly different initial states shall be bounded.
Theorem 4.1
Let be a model consisting of a LSTM cell block such that
(3) 
where is the value of the input sequence at the timestep . and are vectors with the values of the internal and exposed state at the previous timestep, respectively.
The subsequent evaluations of implicitly defines the dynamic system
(4) 
where is a function analog to where and are kept constant. Under these conditions the following implication holds
(5) 
where , , and
are the principal singular values of the matrices that weights the recurrent connections of the
, , and gates of the LSTM cell block, respectively. is the Liapunov exponent of the dynamic system defined by (4) at the th dimension. is the number of cell units.Proof
See appendix.
The principal implication of the theorem 4.1 is that the evolution of the hidden state over the iterative scheme proposed is not chaotic for a particular set of the model’s parameters.
An initial configuration of network parameters that matches the conditions of theorem 4.1 could be achieved following the suggestions for initialization in [14] derived from the Geršgorin circle theorem. Moreover, as mentioned in the publication of Zilly et. al., the L1 and L2 regularization techniques could be used to enforce the conditions required for the application of the theorem 4.1.
We believe that the presented analysis could be extended to RNNs in general providing more evidence that the iterative scheme proposed enhances the capabilities of other recurrent models as well.
5 Experiments
Following the configurations in [13]
several experiments were performed training different architectures of RNNs as language models. We used the implementation of the publication of Zaremba et. al. that was released with the TensorFlow library
^{2}^{2}2Code available athttps://github.com/tensorflow/models/tree/master/tutorials/rnn/ptb, as the base for our experiments.
The corpus used to train the model is the Penn Treebank [8]^{3}^{3}3Corpus available at http://www.fit.vutbr.cz/ imikolov/rnnlm/simpleexamples.tgz..
The base arquitecture used for the recurrent network is the corresponding with the ‘medium’ size model presented in [13]
. The medium size model consist of the embedding projection layer, 2 layers of LSTM blocks containing 650 units each and a final softmax layer. The large model architecture is the same but each layer of LSTM cells contained 1500 units and its performance is reported as in its original publication.
The embedding projection layer, the layers of the RNN and the softmax layer are connected through regularization connections that applied the Dropout[11]
method with a probability of keeping the connection of
. The parameters were optimized using minibatch gradient descent with a batch size of. These parameters were initialized by a random uniform distribution at the interval
. The gradients calculated to reduce the loss function are clipped to a norm value of
. The training regime consisted of 39 epochs where for the first 6 epochs a learning rate of
was set and then this value was reduced by a factor of 1.2 for the remaining epochs.Figure 1 presents the results of the experiments performed fixing the amount of iterations executed to incremental values.
The results of the experiments shows that there is an enhancement in the proposed model’s performance as the number of iterations performed increases.
As is visible in the Figure 1 the reduced model constituted by one layer of iterative LSTM cells consistently outperforms its augmented version as the number of iterations increase. We attribute this difference in the performance of the models to the overfitting suffered by the augmented model whose expressive capabilities were improved over the iterative regime and the inclusion of an additional layer.
Such observation supports the conjecture that the iterative scheme proposed improves the expressiveness of the model.
Note that in the case of the execution of a single iteration, where the modified network is comparable to its original form, the performance of the smaller model is worse than the augmented model and this effect is inverted as more iterations are executed. This indicates that the improvements over the models’ performance are not caused by the modifications made to its structure, namely the iteration activation gate and the residual mapping chosen, rather than by the application of the iterative scheme proposed.
Table 1 presents the performance of the trained models in terms of perplexity per word, where lower is better.
Table 1: Perplexity per word on Penn Treebank corpus. Model Size Perplexity on Test Set LSTM 16M 84.48 Iterative LSTM 16M 78.46 LSTM(2 layers) 20M 83.25 Iterative LSTM(2 layers) 20M 78.60 LSTM(2 layers) 50M 78.29
The results reported in table 1 expose that the performance achieved by the model that applies the proposed modification is comparable with larger versions with more than 3 times the total amount of parameters of our model.
6 Conclusions
In this work we studied a modification over the traditional LSTM structure that produces an iterative scheme where the inference is done incrementally.
We presented theoretical evidence to support the proposed scheme based on the study of the dynamic system defined by the iterative evaluation of the recurrent network.
The results of the experiments executed to expose the effect of the proposed modification supports the theoretical motivation that lead to the development of the presented model.
A comparison of the performance achieved by our model showed a capacity comparable to its largest original version, augmented up to 3 times in terms of the total amount of parameters.
7 Appendix
7.1 Proof of theorem 1
Proof
The execution of subsequent evaluations of the proposed model yields the dynamic system defined by (2)
Then, considering an infinitesimal small variation on the initial conditions we look for the variations on the result of the system after evaluations which would be
Where is a vector such that it’s coordinates are the Liapunov exponents of the dynamic system at the th dimension of the vector state .
is the application of the function, defined at (2), times over the state vector .
Then, requiring the Liapunov coefficients to have values that imply the convergence of the sequence of state values yields
Where is the Liapunov exponent of the dynamic system defined by (2) at the th dimension. is the number of cell units. Next, it is possible to derive
applying the chain rule to obtain
Replacing this identity on the conditions required for the convergence of the dynamic system in every dimension yields
Meanwhile, resolving the derivative of the variables and , defined at (1), with respect to a generic state vector yields
where and are diagonal matrices whose coefficients and corresponds to the evaluation of the functions and over the th coordinate of the vector , respectively. Therefore, replacing the definition of obtained from the conditions that Liapunov coefficients have to hold in order to produce the intended convergence yields the inequity
Finally, replacing the derivatives of the variables and with respect to the vector state yields
Hence, if the following condition holds for the matrices the dynamic system, produced by subsequent evaluations over the same input and cell state values while varying the hidden state, produces convergent results for similar initial conditions
where are principal singular values of the matrices respectively.
References
 [1] Bengio, Y., Simard, P., Frasconi, P.: Learning longterm dependencies with gradient descent is difficult. IEEE transactions on neural networks 5(2), 157–166 (1994)
 [2] Caswell, I., Shen, C., Wang, L.: Loopy neural nets: Imitating feedback loops in the human brain (2014)
 [3] Delalleau, O., Bengio, Y.: Shallow vs. deep sumproduct networks. In: Advances in Neural Information Processing Systems. pp. 666–674 (2011)
 [4] Elman, J.L.: Finding structure in time. Cognitive science 14(2), 179–211 (1990)

[5]
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
 [6] Laurent, T., von Brecht, J.: A recurrent neural network without chaos. arXiv preprint arXiv:1612.06212 (2016)
 [7] Liao, Q., Poggio, T.: Bridging the gaps between residual learning, recurrent neural networks and visual cortex. arXiv preprint arXiv:1604.03640 (2016)
 [8] Marcus, M.P., Marcinkiewicz, M.A., Santorini, B.: Building a large annotated corpus of english: The penn treebank. Computational linguistics 19(2), 313–330 (1993)

[9]
Pascanu, R., Mikolov, T., Bengio, Y.: On the difficulty of training recurrent neural networks. In: International Conference on Machine Learning. pp. 1310–1318 (2013)
 [10] Pascanu, R., Montufar, G., Bengio, Y.: On the number of response regions of deep feed forward networks with piecewise linear activations. arXiv preprint arXiv:1312.6098 (2013)
 [11] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1), 1929–1958 (2014)
 [12] Strogatz, S.H.: Nonlinear dynamics and chaos: with applications to physics, biology, chemistry, and engineering. Hachette UK (2014)
 [13] Zaremba, W., Sutskever, I., Vinyals, O.: Recurrent neural network regularization (2014)
 [14] Zilly, J.G., Srivastava, R.K., Koutník, J., Schmidhuber, J.: Recurrent highway networks. arXiv preprint arXiv:1607.03474 (2016)
Comments
There are no comments yet.