1 Introduction
Recurrent Neural Networks (RNNs) (Elman1990FindingSI; Rumelhart1986LearningRB)
are a family of architectures that process sequential data by means of internal hidden states. The parameters of the network are shared across time steps, allowing the RNN to process inputs of variable length. As RNNs suffer from the so-called exploding and vanishing gradient problem (EVGP)
(bengio1993problem; hochreiter1991untersuchungen), which hinders the learning of long-term dependencies (bengio1994learning; pascanu2013difficulty), previous works have proposed to enrich the recurrent cell with gating mechanisms (hochreiter1997long; jing2019gated). For instance, Long Short-Term Memory networks (LSTMs) (hochreiter1997long) use gates to control the information flow towards and from the memory cell and to regulate the forgetting process (gers2000forget). LSTMs are adopted in a wide range of tasks, such as neural machine translation
(bahdanau2015neural; sutskever2014sequence), speech recognition (graves2013speech), and vision-and-language applications like image and video captioning (vinyals2015show; xu2015show; baraldi2017hierarchical).

In this paper, we propose a novel cell-to-gate connection that modifies the classic LSTM block. Our formulation is general and improves overall LSTM performance and training stability without any particular assumption on the underlying task. In the vanilla LSTM formulation, the gates are controlled by the current input of the block and by its previous output, which acts as the hidden state of the network. The long-term memory cell, instead, is employed to store information during the forward pass and provides a safe path for backpropagating the error signal. We argue that the content stored in the memory cell could be useful to regulate the gating mechanisms, too. The key element of our design is a connection between the memory cell and the gates, with a protection mechanism that prevents the cell state from being exposed directly. We draw inspiration from the gated read operation employed to reveal the cell content at the block output, and enrich it with a learnable projection. In this way, the LSTM block can use the knowledge in the cell (acting as a long-term memory) to control the evolution of the whole network in the short term.
A similar concept in cognitive psychology and neuroscience is the so-called working memory (ericsson1995long), a type of memory employed, for instance, to retain partial results while solving an arithmetic problem without paper, or to combine the premises in a lengthy rhetorical argument (hernandez2018neuroethics). Although definitions are not unanimous, working memory is said to be a cognitive system acting as a third type of memory between long-term and short-term memory. Our connections share this characteristic with working memory. For this reason, we call them Working Memory Connections (WMCs).
A first attempt to fuse the information of the cell into the gates was made with the design of peepholes (gers2000recurrent): direct multiplicative connections between the memory cell and the gates. This approach has not been widely adopted in the literature, as recent studies report mixed results (Greff2017LSTMAS) and discourage its use. Since our idea recalls the rationale of peephole connections, we provide an extensive comparison with this previous work. In doing so, we point out the major issues in the peephole formulation that hinder effective learning, and show that WMCs do not suffer from the same problems. In our experiments, we show that an LSTM equipped with Working Memory Connections achieves better results than comparable architectures, reflecting the theoretical advantages of its design. In particular, WMCs surpass the vanilla LSTM and the peephole LSTM in terms of final performance, stability during training, and convergence time. All these aspects attest to the advantage of letting the cell state participate in the gating dynamics. To support our conclusions, we conduct a thorough experimental analysis covering a wide area of current research topics.
To sum up, our contribution is mainly threefold. First, we present a modification of LSTM in which the traditional gates are enriched with Working Memory Connections, which link the memory cell to the gates through a protection mechanism. Then, we demonstrate that exposing the LSTM internal state directly, without proper protection, yields unstable training dynamics that compromise the final performance.
Finally, we show the effectiveness of the proposed solution in a variety of tasks, ranging from toy problems with very long-term dependencies (adding problem, copy task, and sequential MNIST) to language modeling and image captioning.
2 Related Work
Long Short-Term Memory networks (hochreiter1997long) aim to mitigate the exploding and vanishing gradient problem (hochreiter1991untersuchungen; bengio1994learning) through gating mechanisms. Since its introduction, LSTM has gained a lot of attention for its flexibility and efficacy on many different tasks. To simplify the LSTM structure, liu2020simplified propose to exploit the content of the long-term memory cell in a recurrent block with only two gates. However, this model neglects the importance of the LSTM output: while this might be acceptable for simple tasks, it is unlikely to generalize to more complex settings. arpit2018h propose to modify the path of the gradients in order to stabilize training with a stochastic algorithm specific to LSTM optimization. This direction of work is not in contrast with our goal and could be integrated with our proposal, since our connection does not require a specific setup to be optimized. Among the LSTM variants, the Gated Recurrent Unit (GRU)
(cho2014LearningPR; cho2014properties) is the most popular architecture (chung2014empirical), and features a coupling mechanism between the input and forget gates (Greff2017LSTMAS). A recent line of research aims to tailor the LSTM structure to specific tasks. For instance, baraldi2017hierarchical propose a hierarchical model for video captioning, while other works incorporate convolutional models into the LSTM structure (XIAO2020173; LI201841). While these works modify the LSTM towards a specific goal, we propose a general and powerful idea that adapts to a large set of different tasks.

Recently, models based on self-attention, such as the Transformer architecture (vaswani2017attention) and its variants, have achieved state-of-the-art performance on many different tasks, including sequence modeling. For instance, language representations based on BERT (devlin2018bert) can be fine-tuned with an additional output layer to obtain state-of-the-art results on many language-based tasks. However, RNNs require far fewer parameters and operations than Transformer-based architectures and are still widely adopted. Moreover, LSTMs still have a large market in embedded systems and edge devices thanks to their low computational and memory requirements.
3 Proposed Method
In this section, we present a complete overview of Working Memory Connections. First, we recall the LSTM equations. Second, we explain the modifications introduced in our design. Finally, we motivate the choices behind WMCs w.r.t. other approaches. Specifically, we identify key problems in previous celltogate connections that hinder the learning process, and we show that the proposed solution does not suffer from these weaknesses.
3.1 LSTM
The core idea behind Long Short-Term Memory networks is to create a constant error path between subsequent time steps. Being $x_t$ the input vector at time $t$, we can write the roll-out equations for a vanilla LSTM as:

(1)  $g_t = \tanh(W_g x_t + U_g h_{t-1} + b_g)$

(2)  $i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$

(3)  $f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$

(4)  $c_t = f_t \odot c_{t-1} + i_t \odot g_t$

(5)  $o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$

(6)  $h_t = o_t \odot \tanh(c_t)$
Here, $g_t$ is the block input, $i_t$, $f_t$, and $o_t$ are respectively the input, forget, and output gates, $c_t$ represents the memory cell value, and $h_t$ is the block output. In this notation, $\sigma$ is the sigmoid function and $\odot$ denotes the element-wise Hadamard product. In its first formulation (hochreiter1997long), LSTM did not include the multiplicative forget gate. However, being able to forget about past inputs (gers2000forget) allows LSTM to tackle longer sequences while not hindering the backpropagation of the error signal.

3.2 Working Memory Connections
In the following, we introduce Working Memory Connections, which enable the memory cell to influence the value of the gates through a set of recurrent weights. Given a proper design for the connection, we argue that there is a practical advantage in letting the cell state influence the gating mechanisms in the LSTM block directly. In fact, the cell state provides unique information about the previous time steps that is not present in $h_{t-1}$. For instance, $h_t$ may be close to zero as a consequence of the output gate saturating towards zero (see Eq. 6), while $c_t$ may be growing and changing as a result of a sequence of input vectors. In that case, since the cell state cannot control the output gate, the LSTM block is forced to learn which particular value in the input vector is the marker that signals to open the output gate. Instead, with an appropriate connection strategy, the LSTM block could learn a mapping between the cell internal state and the gate values.
Our solution employs a set of recurrent weights $V_i$, $V_f$, and $V_o$, and a nonlinear activation function to model a connection between the memory cell and the gates. The application of a nonlinearity to the memory cell is coherent with the existing LSTM structure: as can be noticed from Eq. 6, a nonlinear activation function is applied to $c_t$ before the Hadamard product with $o_t$ (previous works (Greff2017LSTMAS) have also shown that removing this nonlinearity leads to a significant loss in terms of performance). In light of the above-mentioned intuitions, we modify Eq. 2, 3, and 5 by exposing the cell state through a protection mechanism as follows:

(7)  $i_t = \sigma\big(W_i x_t + U_i h_{t-1} + \tanh(\mathcal{P}_i(c_{t-1})) + b_i\big)$

(8)  $f_t = \sigma\big(W_f x_t + U_f h_{t-1} + \tanh(\mathcal{P}_f(c_{t-1})) + b_f\big)$

(9)  $o_t = \sigma\big(W_o x_t + U_o h_{t-1} + \tanh(\mathcal{P}_o(c_t)) + b_o\big)$

where $\mathcal{P}_i(c) = V_i c$, $\mathcal{P}_f(c) = V_f c$, and $\mathcal{P}_o(c) = V_o c$ denote general linear transformations with learnable weights.
At first glance, Working Memory Connections may seem redundant in the gate structure: in fact, $h_t$ already depends on the value of $c_t$ (Eq. 6). This impression is misleading, as the proposed connections introduce two main elements of novelty. First, the nonlinear activation function operates on three different projections of the cell state, one for each gate type. Second, Eq. 9 shows that the connection on the output gate depends on $c_t$, rather than on $c_{t-1}$, hence allowing for a more responsive control of the output dynamics of the entire LSTM block.
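As a concrete illustration, the forward step defined by Eq. 1, 4, 6, and 7-9 can be sketched in NumPy. This is a minimal sketch, assuming each projection $\mathcal{P}$ is a plain matrix multiplication; all parameter names and the dictionary layout are illustrative, not the paper's:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_wm_step(x, h_prev, c_prev, p):
    """One forward step of an LSTM with Working Memory Connections.

    W* act on the input, U* on the previous output, and V* project the
    cell state, which is squashed by tanh before entering each gate
    (the "protection mechanism" of Eq. 7-9).
    """
    # Block input (unchanged w.r.t. the vanilla LSTM, Eq. 1).
    g = np.tanh(p["Wg"] @ x + p["Ug"] @ h_prev + p["bg"])
    # Input and forget gates read the *previous* cell state through tanh(V c).
    i = sigmoid(p["Wi"] @ x + p["Ui"] @ h_prev + np.tanh(p["Vi"] @ c_prev) + p["bi"])
    f = sigmoid(p["Wf"] @ x + p["Uf"] @ h_prev + np.tanh(p["Vf"] @ c_prev) + p["bf"])
    # Cell update (Eq. 4).
    c = f * c_prev + i * g
    # The output gate reads the *current* cell state c_t (Eq. 9).
    o = sigmoid(p["Wo"] @ x + p["Uo"] @ h_prev + np.tanh(p["Vo"] @ c) + p["bo"])
    # Block output (Eq. 6).
    h = o * np.tanh(c)
    return h, c
```

Note how the only structural change w.r.t. a vanilla LSTM step is the additional `tanh(V @ c)` term inside each gate.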
3.3 Advantages of Working Memory Connections
To formally motivate the improvement given by Working Memory Connections, we start by considering the local gradients of the gates in which the cell interaction is added. We limit our formal analysis to the input gate $i_t$, but the same reasoning generalizes to $f_t$ and $o_t$. If we denote by $z_t$ the argument of the sigmoid activation function (Eq. 7) at time $t$:

(10)  $z_t = W_i x_t + U_i h_{t-1} + \tanh(V_i c_{t-1}) + b_i$

then the local gradient of the input gate is expressed by:

(11)  $\dfrac{\partial i_t}{\partial z_t} = \mathrm{diag}\big(i_t \odot (\mathbf{1} - i_t)\big)$

where $\mathbf{1}$ denotes a vector of ones, and $\mathrm{diag}(v)$ indicates a diagonal matrix whose diagonal contains the $N$ elements of vector $v$.
From here, we can easily derive the local gradients on the recurrent weights $W_i$, $U_i$, and $V_i$ at time $t$:

(12)  $\dfrac{\partial i_t}{\partial W_i} = \big(i_t \odot (\mathbf{1} - i_t)\big) \otimes x_t$

(13)  $\dfrac{\partial i_t}{\partial U_i} = \big(i_t \odot (\mathbf{1} - i_t)\big) \otimes h_{t-1}$

(14)  $\dfrac{\partial i_t}{\partial V_i} = \big(i_t \odot (\mathbf{1} - i_t) \odot r_t\big) \otimes c_{t-1}$

where $\otimes$ denotes the outer product of two vectors, and:

(15)  $r_t = \mathbf{1} - \tanh^2(V_i c_{t-1})$
Now, let us consider what happens as $t$ grows: we observe that $x_t$ and $h_{t-1}$ are bounded to a limited interval. In particular, $x_t$ is a sample of the input data, and $h_{t-1}$ is bounded in the interval $(-1, 1)$ by construction. Instead, the cell state $c_t$ can grow linearly with the number of recursive steps, making its domain extremely task-dependent. This is a well-known problem, which motivated the introduction of the forget gate in the original LSTM structure (gers2000forget). Despite this, the range of possible values of $c_t$ cannot be restricted to a fixed domain. The hyperbolic tangent nonlinearity helps avoid an excessive influence of the unbounded cell state on the gate mechanics, hence preventing unwanted saturation: as can be seen in Eq. 7, 8, and 9, the term related to the cell state is bounded in the interval $(-1, 1)$.
Even if $c_t$ grew linearly with the number of time steps, its influence on the sigmoid argument would be mitigated, and it could not push the sigmoid function into its saturated regime against the other two terms, driven by $x_t$ and $h_{t-1}$ respectively. On the other hand, the growth of the cell state would push the hyperbolic tangent towards its own saturated regime. This behavior helps protect the weight matrix employed in the connection from unstable updates.
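A quick numerical check illustrates this point: as the magnitude of the cell state grows, the squashed WMC term stays inside $(-1, 1)$, while an unprotected multiplicative term grows with the cell. The values below are toy values of our own, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
V = rng.normal(size=(n, n))        # connection weights (illustrative values)
c = rng.normal(size=n)             # a reference cell-state direction

for scale in (1.0, 10.0, 100.0):
    grown = scale * c              # emulate a cell state growing over time
    wm_term = np.tanh(V @ grown)   # WMC contribution to the sigmoid argument
    raw_term = np.diag(V) * grown  # unprotected contribution (no squashing)
    print(f"scale={scale:>5}: max|tanh(Vc)|={np.abs(wm_term).max():.3f}, "
          f"max|p*c|={np.abs(raw_term).max():.1f}")
```

The first column never exceeds 1 regardless of the scale, while the second grows linearly with it.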
Peephole Connections and their Limitations. We now turn our attention to a related connection, namely the peephole connection (gers2000recurrent), which is no longer common in the LSTM formulation. Peephole connections were introduced by Gers and Schmidhuber (gers2000recurrent), and enrich the LSTM equations with recurrent weights $P_i$, $P_f$, and $P_o$:

(16)  $i_t = \sigma\big(W_i x_t + U_i h_{t-1} + P_i c_{t-1} + b_i\big)$

(17)  $f_t = \sigma\big(W_f x_t + U_f h_{t-1} + P_f c_{t-1} + b_f\big)$

(18)  $o_t = \sigma\big(W_o x_t + U_o h_{t-1} + P_o c_t + b_o\big)$

with $P_i$, $P_f$, and $P_o$ generally constrained to be diagonal (graves2013generating; Greff2017LSTMAS). While this formulation allows for a more precise control of the gates, two issues limit its effectiveness. In this case, the local gradient at time $t$ is expressed by:

(19)  $\dfrac{\partial i_t}{\partial \tilde{z}_t} = \mathrm{diag}\big(i_t \odot (\mathbf{1} - i_t)\big)$

with $\tilde{z}_t$ being the argument of the sigmoid function in Eq. 16:

(20)  $\tilde{z}_t = W_i x_t + U_i h_{t-1} + P_i c_{t-1} + b_i$
In light of this difference, Eq. 14 and 15 become:

(21)  $\dfrac{\partial i_t}{\partial P_i} = \big(i_t \odot (\mathbf{1} - i_t)\big) \otimes c_{t-1}$
We observe that, both in Eq. 7 and in Eq. 16, the magnitude of the product between the recurrent weights and the cell state can in principle grow unbounded. The activation function introduced in WMCs squashes this term into a closed, bounded interval. In peephole connections, however, this term is added inside the gate without an adequate protection (see Fig. 1). The result is that, in the peephole formulation, the sigmoid function applied immediately afterwards can be pushed towards its saturating regime independently of the value of $x_t$ and $h_{t-1}$. In theory, the LSTM block can recover from this situation by setting all the weights in the peephole connection to zero, but in practice this might not happen if the sigmoid gate is saturated most of the time. Even if the two other summands can compensate for the growth of $c_t$, hence letting gradients flow through the gate, a key issue still hinders learning: as shown in Eq. 21, the gradients on the recurrent peephole weights grow linearly with $c_{t-1}$, making updates unstable.
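The difference between Eq. 14 and Eq. 21 can also be verified numerically: scaling the cell state scales the peephole weight gradient linearly, while the tanh derivative (Eq. 15) in the WMC gradient vanishes and damps the update. Again, toy values of our own for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
V = rng.normal(size=(n, n)) * 0.1       # WMC weights (illustrative)
i_gate = rng.random(n)                  # a generic input-gate activation in (0, 1)
sig_prime = i_gate * (1 - i_gate)       # i_t * (1 - i_t), as in Eq. 11
c = rng.normal(size=n)

for scale in (1.0, 10.0, 100.0):
    grown = scale * c
    # Eq. 21 (peephole): gradient grows linearly with the cell state.
    g_ph = np.outer(sig_prime, grown)
    # Eq. 14 (WMC): the tanh derivative r_t (Eq. 15) vanishes as |V c| grows.
    tanh_prime = 1 - np.tanh(V @ grown) ** 2
    g_wm = np.outer(sig_prime * tanh_prime, grown)
    print(f"scale={scale:>5}: max|grad_ph|={np.abs(g_ph).max():.2f}, "
          f"max|grad_wm|={np.abs(g_wm).max():.4f}")
```

The peephole gradient magnitude is exactly proportional to the scale of the cell state, while the WMC gradient is kept in check by the saturating tanh.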
To exemplify this behavior, we report the Euclidean norm of the cell state $c_t$ during the early training stages in Fig. 2. After a small number of time steps, the content of the cell floods the gates of the peephole LSTM. A possible consequence is that both the input and the forget gates saturate towards one. In our example, this aspect leads to an additional, uncontrolled growth of the magnitude of $c_t$. As can be seen, Working Memory Connections exhibit a much more regular behavior than peepholes and can prevent the uncontrolled growth of the memory cell.
4 Experiments and Results
[Figure 3: Comparison among the traditional LSTM, the proposed LSTM with Working Memory Connections, and the peephole LSTM on three tasks: the adding problem (T=200 and T=400, top), the copying task (T=100 and T=200, center), and the sequential/permuted MNIST (bottom). In all the plots, shading indicates the standard error of the mean.]
The effectiveness of Working Memory Connections and their general benefits can be appreciated in many different tasks. The proposed experiments cover a wide area of applications: two different toy problems, digit recognition, language modeling, and image captioning. While the analysis on simple tasks helps to clarify the inherent advantages of the proposed approach, results on more challenging real-world applications motivate a wider adoption of our novel connections, especially for long sequences. We compare our model (LSTM-WM) to a traditional LSTM and to an LSTM with peephole connections (LSTM-PH).
4.1 Adding Problem and Copying Tasks
In the adding problem (hochreiter1997long), the input to the network consists of a series of $T$ pairs $(a_k, m_k)$. The first element $a_k$ is a real-valued number between 0 and 1, and $m_k$ is a corresponding marker. In the entire sequence, only two markers $m_i$ and $m_j$ are set to 1, while the others are set to 0. The goal is to predict the sum $a_i + a_j$ of the corresponding real-valued items. In our experiments, we test with $T=200$ and $T=400$, and we measure the performance using the mean squared error (MSE). All networks share the same hidden size and number of training epochs, and are optimized with SGD (Nesterov update rule and momentum) with gradient norm clipping. Results are reported in Fig. 3 (top), where we plot the MSE on the test set for every epoch of training. LSTM-WM achieves the best convergence time for $T=200$, while the final performance in this setup is similar among the three models. The effectiveness of WMCs is striking in the $T=400$ setup: the proposed model solves the adding problem, while the other two architectures cannot learn the task and remain stuck on the trivial solution (i.e., always predicting the expected value of the sum).
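The input format described above can be generated as follows. This is a sketch; the tensor layout and helper names are our own, not the paper's:

```python
import numpy as np

def adding_problem_batch(batch_size, T, rng):
    """Generate one batch of the adding problem: each sequence is T pairs
    (value, marker); exactly two markers are set to 1 and the target is
    the sum of the two marked values."""
    values = rng.random((batch_size, T))
    markers = np.zeros((batch_size, T))
    for b in range(batch_size):
        i, j = rng.choice(T, size=2, replace=False)
        markers[b, i] = markers[b, j] = 1.0
    inputs = np.stack([values, markers], axis=-1)   # shape (B, T, 2)
    targets = (values * markers).sum(axis=1)        # shape (B,)
    return inputs, targets
```

Since each marked value lies in [0, 1], the target always lies in [0, 2], which is why a constant prediction is the trivial baseline.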
In the copying task (hochreiter1997long), the network observes a sequence of input symbols, waits for $T$ time steps (we use $T=100$ and $T=200$), and then must reproduce the same sequence as output. For this experiment, we adopt the setup described in (arjovsky2016unitary) and keep the same implementation details used for the adding problem, except for the number of training epochs. In Fig. 3 (center), we plot the test accuracy achieved by the three models at each epoch. In both setups, WMCs play an important role in terms of final performance and convergence time. As in the adding problem, the performance gain given by the proposed architecture is more evident on longer sequences: for $T=200$, WMCs outperform both the peephole LSTM and the vanilla LSTM by a clear margin.
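The copying-task sequences, in the style of (arjovsky2016unitary), can be sketched as follows; the symbol encoding (0 as blank, a dedicated recall cue) is a common convention and the exact layout here is our own assumption:

```python
import numpy as np

def copying_task_batch(batch_size, T, n_symbols=8, seq_len=10, rng=None):
    """Generate copying-task data: `seq_len` symbols to memorize, a delay of
    T blank steps, then the model must reproduce the sequence at the end."""
    if rng is None:
        rng = np.random.default_rng()
    blank, cue = 0, n_symbols + 1
    seq = rng.integers(1, n_symbols + 1, size=(batch_size, seq_len))
    total = seq_len + T + seq_len
    x = np.full((batch_size, total), blank)
    y = np.full((batch_size, total), blank)
    x[:, :seq_len] = seq                 # symbols to remember
    x[:, seq_len + T - 1] = cue          # cue: start recalling
    y[:, -seq_len:] = seq                # target: the memorized sequence
    return x, y
```

Predicting blanks everywhere except the last `seq_len` positions is the trivial baseline the models must beat.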
4.2 Permuted Sequential MNIST
The sequential MNIST (sMNIST) task (le2015simple) is the sequential version of the MNIST digit recognition task (lecun1998gradient): the image pixels are fed to the network one at a time (from left to right, and top to bottom). The permuted sequential MNIST (pMNIST) is a variant in which the pixels are permuted in a random but fixed order. In both tasks, the goal is to predict the correct digit label after the last input pixel. Following the setup proposed in (arpit2018h), we split the images into training, validation, and test sets. All networks share the same hidden size and are trained with SGD (momentum and Nesterov update rule) with gradient norm clipping.
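The construction of the two pixel sequences can be sketched as below; the permutation handling follows the common convention for pMNIST, and the helper name is ours:

```python
import numpy as np

def to_pixel_sequence(image, permutation=None):
    """Flatten a 28x28 image into a length-784 sequence of pixels, read
    left to right and top to bottom (sMNIST); for pMNIST, apply a fixed
    random permutation shared by all images."""
    seq = np.asarray(image).reshape(-1)
    return seq if permutation is None else seq[permutation]
```

For pMNIST, the same `permutation` (e.g., `np.random.default_rng(seed).permutation(784)`) must be reused for every image at both training and evaluation time.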
Fig. 3 (bottom) reports the mean test accuracy of the three LSTM variants on both setups, with the standard error of the mean shown as a shaded area. On the sMNIST task, the peephole LSTM performs slightly better than the vanilla LSTM. The LSTM with Working Memory Connections, instead, outperforms the competing architectures in terms of both final accuracy and convergence speed: our architecture needs only 50 epochs to surpass the accuracy at which the other models are still generally stuck. In this experiment, we also find that WMCs help stabilize training: the shaded area given by the standard error of the mean is much thinner for our approach than for the other two variants, in particular during the early stages of training. On the pMNIST task, all the models achieve good final results, with the LSTM with Working Memory Connections still being the best option.
Model                        sMNIST   pMNIST
iRNN (le2015simple)          97.00    82.00
uRNN (arjovsky2016unitary)   95.10    91.40
h-detach (arpit2018h)        98.50    92.30
LSTM (128)                   98.16    92.94
LSTM (256)                   97.68    93.97
LSTM-PH (128)                98.58    93.25
LSTM-PH (256)                98.33    93.40
LSTM-WM (128)                98.63    93.97
Numerical results, reported in Table 1, confirm that our model outperforms the classic LSTM by a clear margin on both the sequential and the permuted MNIST. Since WMCs introduce additional learnable parameters in the LSTM structure, we also compare with the vanilla and peephole LSTM with an increased hidden size (256 instead of 128). Note that, in this setting, LSTM and LSTM-PH have considerably more learnable parameters than LSTM-WM. Despite this, LSTM-WM achieves the best results on both tasks. It is worth noting that, while the additional parameters improve the vanilla LSTM results on pMNIST, they are not helpful on the sMNIST task. The flexibility given by WMCs, instead, allows the proposed model to achieve the best result in both setups. In Table 1, we also compare with two state-of-the-art RNNs (le2015simple; arjovsky2016unitary) and with a training algorithm for LSTM (arpit2018h). The proposed LSTM-WM outperforms all competitors in terms of test accuracy.
Test Bits per Character (BPC)

             Fixed # Params                   Fixed # Hidden Units
Model        T=150           T=300            T=150           T=300
LSTM         1.334 ± 0.0006  1.343 ± 0.0004   1.386 ± 0.0005  1.395 ± 0.0005
LSTM-PH      1.339 ± 0.0048  1.343 ± 0.0009   1.383 ± 0.0004  1.394 ± 0.0005
LSTM-WM      1.299 ± 0.0005  1.302 ± 0.0008   1.299 ± 0.0005  1.302 ± 0.0008
4.3 Penn Treebank (PTB) Character-Level Language Modeling
Character-level language modeling requires predicting a single character at each time step, given an observed sequence of text. In our experiments on the Penn Treebank (PTB) dataset (marcus1993building), we evaluate the performance of the three LSTM variants in terms of mean test bits per character (BPC), where lower BPC denotes better performance. We report the results in Table 2, where we compare truncated backpropagation through time (TBPTT) over 150 and 300 steps. Since our connection introduces new learnable weights, we consider an additional setup in which we keep a fixed number of parameters for the three networks. For this experiment, we follow the setup proposed by merityAnalysis, with the only exception that we employ a single LSTM layer instead of three. The advantage of using Working Memory Connections is more evident for an equal number of hidden units, where the proposed architecture outperforms the vanilla LSTM and the peephole LSTM by a significant margin. Even when the number of parameters is fixed for all the models, LSTM-WM outperforms the competitors for both 150 and 300 steps. It is worth noting that the peephole LSTM performs similarly to, or even worse than, the vanilla LSTM on this task.
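As a side note, when the training objective is the usual per-character cross-entropy in nats, the BPC values reported above are simply that loss rescaled to base 2:

```python
import numpy as np

def bits_per_character(mean_cross_entropy_nats):
    """Convert a mean per-character cross-entropy, expressed in nats (the
    default of most frameworks), into bits per character (BPC)."""
    return mean_cross_entropy_nats / np.log(2.0)
```

For example, a mean cross-entropy of ln 2 ≈ 0.693 nats per character corresponds to exactly 1 BPC.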
4.4 Image Captioning
We evaluate the performance of our LSTM with Working Memory Connections on the image captioning task, which consists of generating textual descriptions for images. We apply our approach to two different captioning models: Show and Tell (vinyals2015show) and Up-Down (anderson2018bottom). The first model includes a single LSTM layer and does not employ attention, while the second is composed of two LSTM layers and integrates attention mechanisms over image regions. We use the Microsoft COCO dataset (lin2014microsoft), following the splits defined in (karpathy2015deep). To represent images, we employ a global feature vector extracted from the average pooling layer of ResNet-152 (he2016deep) for the Show and Tell model, and multiple feature vectors extracted from Faster R-CNN (ren2015faster) for the Up-Down architecture. We train both models with the Adam optimizer (kingma2015adam); all other hyperparameters are the same as those suggested in the original papers.

Model      BLEU-1  BLEU-4  METEOR  ROUGE  CIDEr  SPICE

No Attention, ResNet-152
LSTM        70.9    27.9    24.4    51.7    92.0   17.6
GRU         69.5    26.2    22.7    50.4    82.3   15.6
LSTM-PH     71.4    27.8    24.3    51.7    91.1   17.5
LSTM-WM     71.4    28.3    24.6    52.4    94.0   17.8

Attention, Faster R-CNN
LSTM        75.9    36.1    27.4    56.3   111.9   20.3
GRU         76.0    36.1    27.0    56.5   111.0   20.2
LSTM-PH     75.8    35.9    27.3    56.3   111.5   20.2
LSTM-WM     76.2    36.1    27.5    56.5   112.7   20.4
Numerical results are reported in Table 3 using standard captioning evaluation metrics, i.e., BLEU-1, BLEU-4 (papineni2002bleu), METEOR (banerjee2005meteor), ROUGE (lin2004rouge), CIDEr (vedantam2015cider), and SPICE (anderson2016spice). For all of these, higher values indicate better performance, with CIDEr being the metric that best correlates with human judgment. In both settings, our LSTM-WM outperforms the traditional LSTM and LSTM-PH by a clear margin. Specifically, LSTM-WM improves the vanilla LSTM results by 2.0 CIDEr points on the model without attention and by 0.8 CIDEr points on the model with attention over image regions, demonstrating the contribution of WMCs also for this task. As an additional comparison, we replace the LSTM layers with GRU layers: numerical results suggest that there is not a clear advantage in using GRUs instead of LSTMs for this task. In Fig. 4, we plot the metric gap between LSTM-WM and the two competitors in terms of METEOR and CIDEr. On the X-axis, we report the length of the generated captions, meaning that we consider only the first words of each predicted sentence up to that length. On the Y-axis, a value of zero means that our proposal performs equally, i.e., it has no performance gap w.r.t. the competitor, while a higher value indicates better performance for our model. With this analysis, we aim to check whether the improvement given by WMCs is restricted to a particular subset of the dataset. As one can observe, the metric gap generally increases with the caption length, especially w.r.t. the peephole LSTM. We can deduce that the contribution of WMCs grows with the number of time steps.

5 Discussion
With Working Memory Connections, we show that information stored in the LSTM cell should be accessible in the gate structure. We compare the performance of WMCs to a similar approach named peephole connections (gers2000recurrent), and to the vanilla LSTM. We find that the structure of WMCs allows for two distinct improvements:

A more precise control of the gates. The multiplicative gates in the LSTM block must regulate the information flowing through the cell, yet in the traditional LSTM formulation they cannot access the state of that same cell. The presence of the cell state in the multiplicative gates explains the improvements of LSTM-WM w.r.t. the vanilla LSTM.

Increased stability during training compared to peephole connections. Exposing the cell state without squashing its content turns out to be a critical weakness of the LSTM-PH. This element of novelty in our design explains why WMCs provide a boost in performance even when peepholes fail.
As a consequence of these two improvements, WMCs combine the theoretical benefits of peephole connections, originally described by gers2000recurrent, with the training stability and versatility of the vanilla LSTM.
It is worth noting that, for tasks that do not require accessing the content of the memory cell, Working Memory Connections would probably not bring any benefit to the LSTM formulation, while peepholes might still hinder the whole learning process because of unstable updates.
At the same time, when training stacked LSTMs, the benefits given by WMCs may become less significant. We suppose that this is due to the increased complexity of the network structure, where multiple LSTM blocks can interact through the various layers. Similarly, many architectures employ LSTMs as building blocks together with different components, and the influence of WMCs in these compound deep networks cannot be easily determined. The experiments on image captioning proposed in this paper partially answer this question and prove that WMCs afford a small yet measurable improvement even in this scenario. However, there are many other complex tasks involving vision, language, and other modalities that are worth investigating.
6 Conclusion
A current limitation of Long Short-Term Memory networks is that the cell state does not directly influence the gate dynamics. In this paper, we propose Working Memory Connections (WMCs) for LSTM, which provide an efficient way of using intra-cell knowledge inside the network. The proposed design performs noticeably better than the vanilla LSTM and overcomes important issues in previous formulations. We formally motivate this improvement as a consequence of more stable training dynamics. Experimental results reflect the theoretical benefits of the proposed approach and motivate further study in this direction. A future direction might consist in testing the efficacy of Working Memory Connections on an even wider set of tasks.
Acknowledgments
This work has been supported by “Fondazione di Modena” under the project “AI for Digital Humanities” and by the national project “IDEHA: Innovation for Data Elaboration in Heritage Areas” (PON ARS01_00421), co-funded by the Italian Ministry of University and Research.