Working Memory Connections for LSTM

08/31/2021
by   Federico Landi, et al.

Recurrent Neural Networks with Long Short-Term Memory (LSTM) make use of gating mechanisms to mitigate exploding and vanishing gradients when learning long-term dependencies. For this reason, LSTMs and other gated RNNs are widely adopted, being the de facto standard for many sequence modeling tasks. Although the memory cell inside the LSTM contains essential information, it is not allowed to influence the gating mechanism directly. In this work, we improve the gate potential by including information coming from the internal cell state. The proposed modification, named Working Memory Connection, consists of adding a learnable nonlinear projection of the cell content into the network gates. This modification fits into the classical LSTM gates without any assumption on the underlying task, and is particularly effective when dealing with longer sequences. Previous research efforts in this direction, dating back to the early 2000s, could not bring a consistent improvement over vanilla LSTM. As part of this paper, we identify a key issue in those earlier connections that heavily limits their effectiveness, hence preventing a successful integration of the knowledge coming from the internal cell state. We show through extensive experimental evaluation that Working Memory Connections consistently improve the performance of LSTMs on a variety of tasks. Numerical results suggest that the cell state contains useful information that is worth including in the gate structure.


1 Introduction

Recurrent Neural Networks (RNNs) (Elman1990FindingSI; Rumelhart1986LearningRB) are a family of architectures that process sequential data by means of internal hidden states. The set of parameters of the network is shared across time steps, allowing the RNN to process inputs of variable length. As RNNs suffer from the so-called exploding and vanishing gradient problem (EVGP) (bengio1993problem; hochreiter1991untersuchungen), which hinders the learning of long-term dependencies (bengio1994learning; pascanu2013difficulty), previous works have proposed to enrich the recurrent cell with gating mechanisms (hochreiter1997long; jing2019gated). For instance, Long Short-Term Memory networks (LSTMs) (hochreiter1997long) use gates to control the information flow towards and from the memory cell and to regulate the forgetting process (gers2000forget). LSTMs are adopted in a wide range of tasks, such as neural machine translation (bahdanau2015neural; sutskever2014sequence), speech recognition (graves2013speech), and vision-and-language applications like image and video captioning (vinyals2015show; xu2015show; baraldi2017hierarchical).

In this paper, we propose a novel cell-to-gate connection that modifies the classic LSTM block. Our formulation is general and improves the overall performance and training stability of LSTM without any particular assumption on the underlying task. In the vanilla LSTM formulation, the gates are controlled by the current input of the block and its previous output, which acts as the hidden state for the network. The long-term memory cell, instead, is employed to store information during the forward pass and provides a safe path for back-propagating the error signal. We argue that the content stored in the memory cell could be useful to regulate the gating mechanisms, too. The key element of our design is a connection between the memory cell and the gates with a protection mechanism that prevents the cell state from being exposed directly. We draw inspiration from the gated read operation employed to reveal the cell content at the block output, and enrich it with a learnable projection. In this way, the LSTM block can use the knowledge in the cell (acting as a long-term memory) to control the evolution of the whole network in the short term.

A similar concept in cognitive psychology and neuroscience is the so-called working memory (ericsson1995long), a type of memory employed, for instance, to retain partial results while solving an arithmetic problem without paper, or to combine the premises of a lengthy rhetorical argument (hernandez2018neuroethics). Although definitions are not unanimous, working memory is described as a cognitive system acting as a third type of memory between long-term and short-term memory. Our connections share this characteristic with working memory. For this reason, we call them Working Memory Connections (WMCs).

A first attempt to fuse the information of the cell into the gates was made with the design of peepholes (gers2000recurrent): direct multiplicative connections between the memory cell and the gates. This approach has not been widely adopted in the literature, as recent studies report mixed results (Greff2017LSTMAS) and discourage its use. Since our idea recalls the rationale of peephole connections, we provide an extensive comparison with this previous work. By doing so, we point out the major issues in the peephole formulation that hinder effective learning, and we verify that WMCs do not suffer from the same problems. In our experiments, we show that an LSTM equipped with Working Memory Connections achieves better results than comparable architectures, thus reflecting the theoretical advantages of their design. In particular, WMCs surpass vanilla LSTM and peephole LSTM in terms of final performance, stability during training, and convergence time. All these aspects testify to the advantage of letting the cell state participate in the gating dynamics. In order to support our conclusions, we conduct a thorough experimental analysis covering a wide area of current research topics.

To sum up, our contribution is three-fold. First, we present a modification of LSTM in which traditional gates are enriched with Working Memory Connections, linking the memory cell with the gates through a protection mechanism. Second, we demonstrate that exposing the LSTM internal state directly, without proper protection, yields unstable training dynamics that compromise the final performance.

Finally, we show the effectiveness of the proposed solution in a variety of tasks, ranging from toy problems with very long-term dependencies (adding problem, copy task, and sequential MNIST) to language modeling and image captioning.

2 Related Work

Long Short-Term Memory networks (hochreiter1997long) aim to mitigate the exploding and vanishing gradient problem (hochreiter1991untersuchungen; bengio1994learning) with the use of gating mechanisms. Since its introduction, LSTM has gained a lot of attention for its flexibility and efficacy in many different tasks. To simplify the LSTM structure, liu2020simplified propose to exploit the content of the long-term memory cell in a recurrent block with only two gates. However, this model neglects the importance of the LSTM output: while this might be acceptable for simple tasks, it is unlikely to generalize to more complex settings. arpit2018h propose to modify the path of the gradients in order to stabilize training with a stochastic algorithm specific to LSTM optimization. This direction of work is not in contrast with our goal, and could possibly be integrated with our proposal, since our connection does not require a specific setup to be optimized. Among the LSTM variants, the Gated Recurrent Unit (GRU) (cho2014LearningPR; cho2014properties) is the most popular (chung2014empirical), and features a coupling mechanism between input and forget gates (Greff2017LSTMAS). A recent line of research aims to tailor the LSTM structure to specific tasks. For instance, baraldi2017hierarchical propose a hierarchical model for video captioning, while other works incorporate convolutional models into the LSTM structure (XIAO2020173; LI201841). While these works modify the LSTM towards a specific goal, we propose a general and powerful idea that adapts to a large set of different tasks.

Recently, models based on self-attention, such as the Transformer architecture (vaswani2017attention) and its variants, achieve state-of-the-art performance on many different tasks, including sequence modeling. For instance, language representations based on BERT (devlin2018bert) can be fine-tuned with an additional output layer to obtain state-of-the-art results on many language-based tasks. However, RNNs require far fewer parameters and operations than Transformer-based architectures and are still widely adopted. Moreover, LSTMs remain attractive for embedded systems and edge devices because of their low computational and memory requirements.

3 Proposed Method

In this section, we present a complete overview of Working Memory Connections. First, we recall the LSTM equations. Second, we explain the modifications introduced in our design. Finally, we motivate the choices behind WMCs w.r.t. other approaches. Specifically, we identify key problems in previous cell-to-gate connections that hinder the learning process, and we show that the proposed solution does not suffer from these weaknesses.

3.1 LSTM

The core idea behind Long Short-Term Memory networks is to create a constant error path between subsequent time steps. Being x_t the input vector at time t, we can write the rollout equations for a vanilla LSTM as:

g_t = \tanh(W_g x_t + R_g h_{t-1} + b_g)    (1)
i_t = \sigma(W_i x_t + R_i h_{t-1} + b_i)    (2)
f_t = \sigma(W_f x_t + R_f h_{t-1} + b_f)    (3)
c_t = f_t \odot c_{t-1} + i_t \odot g_t    (4)
o_t = \sigma(W_o x_t + R_o h_{t-1} + b_o)    (5)
h_t = o_t \odot \tanh(c_t)    (6)

Here, g_t is the block input, i_t, f_t, and o_t are respectively the input, forget, and output gates, c_t represents the memory cell value, and h_t is the block output; W_* and R_* denote input and recurrent weight matrices, and b_* are bias vectors. In this notation, \sigma is the sigmoid function and \odot denotes the element-wise Hadamard product. In its first formulation (hochreiter1997long), LSTM did not include the multiplicative forget gate. However, being able to forget past inputs (gers2000forget) allows LSTM to tackle longer sequences while not hindering the back-propagation of the error signal.
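For concreteness, Eqs. 1-6 translate into the following single-step forward pass. This is a minimal PyTorch sketch written from the equations above (the weight containers W, R, b are our own shorthand), not the implementation used in the experiments.

    import torch

    def lstm_step(x_t, h_prev, c_prev, W, R, b):
        # W, R, b: dicts keyed by 'g', 'i', 'f', 'o' (input weights, recurrent weights, biases)
        g_t = torch.tanh(x_t @ W['g'].T + h_prev @ R['g'].T + b['g'])     # Eq. 1, block input
        i_t = torch.sigmoid(x_t @ W['i'].T + h_prev @ R['i'].T + b['i'])  # Eq. 2, input gate
        f_t = torch.sigmoid(x_t @ W['f'].T + h_prev @ R['f'].T + b['f'])  # Eq. 3, forget gate
        c_t = f_t * c_prev + i_t * g_t                                    # Eq. 4, memory cell
        o_t = torch.sigmoid(x_t @ W['o'].T + h_prev @ R['o'].T + b['o'])  # Eq. 5, output gate
        h_t = o_t * torch.tanh(c_t)                                       # Eq. 6, block output
        return h_t, c_t

    # toy usage: batch of 2, input size 4, hidden size 8
    B, D, H = 2, 4, 8
    W = {k: torch.randn(H, D) for k in 'gifo'}
    R = {k: torch.randn(H, H) for k in 'gifo'}
    b = {k: torch.zeros(H) for k in 'gifo'}
    h, c = lstm_step(torch.randn(B, D), torch.zeros(B, H), torch.zeros(B, H), W, R, b)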

Figure 1: Comparison between a vanilla LSTM gate, a peephole connection, and a Working Memory Connection.

3.2 Working Memory Connections

In the following, we introduce Working Memory Connections, which enable the memory cell to influence the value of the gates through a set of recurrent weights. Given a proper design for the connection, we argue that there is a practical advantage in letting the cell state influence the gating mechanisms in the LSTM block directly. In fact, the cell state provides unique information about the previous time steps that is not present in h_{t-1}. For instance, h_{t-1} may be close to zero as a consequence of the output gate saturating towards zero (see Eq. 6), while c_{t-1} may be growing and changing as a result of a sequence of input vectors. In that case, since the cell state cannot control the output gate, the LSTM block is forced to learn which particular value in the input vector is the marker that signals to open the output gate. Instead, with an appropriate connection strategy, the LSTM block could learn a mapping between the cell internal state and the gate values.

Our solution employs a set of recurrent weights V_i, V_f, V_o and a nonlinear activation function to model a connection between the memory cell and the gates. The application of a non-linearity to the memory cell is coherent with the present LSTM structure: as can be noticed from Eq. 6, a nonlinear activation function is applied to c_t before the Hadamard product with o_t (previous works (Greff2017LSTMAS) have also shown that removing this non-linearity leads to a significant loss in terms of performance). In light of the above-mentioned intuitions, we modify Eq. 2, 3, and 5 by exposing the cell state (at time t-1 for the input and forget gates, and at time t for the output gate) through a protection mechanism as follows:

i_t = \sigma(W_i x_t + R_i h_{t-1} + \tanh(V_i c_{t-1}) + b_i)    (7)
f_t = \sigma(W_f x_t + R_f h_{t-1} + \tanh(V_f c_{t-1}) + b_f)    (8)
o_t = \sigma(W_o x_t + R_o h_{t-1} + \tanh(V_o c_t) + b_o)    (9)

where each V_* denotes a general linear transformation of the cell state.

At first glance, Working Memory Connections may seem redundant in the gate structure. In fact, h_{t-1} already depends on the value of c_{t-1} (Eq. 6). This impression is misleading, as the proposed connections introduce two main aspects of novelty. First, the non-linear activation function operates on three different projections of the cell state, one for each gate type. Second, Eq. 9 shows that the connection on the output gate depends on c_t, rather than on c_{t-1}, hence allowing for a more responsive control of the output dynamics of the entire LSTM block.
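In code, the modification of Eqs. 7-9 amounts to three extra projection matrices and a tanh inside each gate. The sketch below extends the previous step accordingly; it follows our reading of the equations above (full matrices V_i, V_f, V_o) and is meant as an illustration rather than a reference implementation.

    import torch

    def lstm_wm_step(x_t, h_prev, c_prev, W, R, b, V):
        # V: dict of full projection matrices 'i', 'f', 'o' (the Working Memory Connections)
        g_t = torch.tanh(x_t @ W['g'].T + h_prev @ R['g'].T + b['g'])                                # Eq. 1
        i_t = torch.sigmoid(x_t @ W['i'].T + h_prev @ R['i'].T + torch.tanh(c_prev @ V['i'].T) + b['i'])  # Eq. 7
        f_t = torch.sigmoid(x_t @ W['f'].T + h_prev @ R['f'].T + torch.tanh(c_prev @ V['f'].T) + b['f'])  # Eq. 8
        c_t = f_t * c_prev + i_t * g_t                                                               # Eq. 4
        o_t = torch.sigmoid(x_t @ W['o'].T + h_prev @ R['o'].T + torch.tanh(c_t @ V['o'].T) + b['o'])     # Eq. 9, uses c_t
        h_t = o_t * torch.tanh(c_t)                                                                  # Eq. 6
        return h_t, c_t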

3.3 Advantages of Working Memory Connections

To formally motivate the improvement given by Working Memory Connections, we start by considering the local gradients of the gates in which the cell interaction is added. We limit our formal analysis to the input gate i_t, but our reasoning can be generalized to f_t and o_t. If we denote by s_t the argument of the sigmoid activation function (Eq. 7) at time t:

s_t = W_i x_t + R_i h_{t-1} + \tanh(V_i c_{t-1}) + b_i    (10)

then the local gradient of the input gate is expressed by:

\partial i_t / \partial s_t = \mathrm{diag}(i_t \odot (\mathbf{1} - i_t))    (11)

where \mathbf{1} denotes a vector of ones, and \mathrm{diag}(v) indicates a diagonal matrix whose diagonal contains the N elements of vector v.

From here, we can easily derive the local gradients on the recurrent weights W_i, R_i, and V_i at time t:

\partial i_t / \partial W_i = (i_t \odot (\mathbf{1} - i_t)) \otimes x_t    (12)
\partial i_t / \partial R_i = (i_t \odot (\mathbf{1} - i_t)) \otimes h_{t-1}    (13)
\partial i_t / \partial V_i = (i_t \odot (\mathbf{1} - i_t) \odot (\mathbf{1} - k_t \odot k_t)) \otimes c_{t-1}    (14)

where \otimes denotes the outer product of two vectors, and:

k_t = \tanh(V_i c_{t-1})    (15)

Now, let us consider what happens as t grows: we observe that x_t and h_{t-1} are bounded to a limited interval. In particular, x_t is a sample of the input data, and h_{t-1} is bounded in the interval (-1, 1) by construction. Instead, the cell c_t can grow linearly with the number of recursive steps, making its domain extremely task-dependent. This is a well-known problem, which motivated the introduction of the forget gate in the original LSTM structure (gers2000forget). Despite this, the range of possible values of c_t cannot be restricted to a fixed domain. The hyperbolic tangent non-linearity helps to avoid an excessive influence of the unbounded cell state on the gate mechanics, hence preventing unwanted saturation: as can be seen in Eq. 7, 8, and 9, the term related to the cell state is bounded in the interval (-1, 1). Additionally, it helps screen the connection weights from unstable updates.

Even if c_{t-1} grew linearly with the number of time steps, its influence on the sigmoid argument would be mitigated, and it could not push the sigmoid function into its saturated regime against the other two terms, driven by x_t and h_{t-1} respectively. On the other hand, the growth of the cell state would push the hyperbolic tangent towards its own saturated regime. This behavior helps protect the weight matrix employed in the connection from unstable updates.
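As a quick numerical illustration of this argument (a toy sketch with an arbitrary projection matrix of our own choosing), the cell-related term of Eqs. 7-9 stays inside (-1, 1) no matter how large the cell grows, whereas a raw, unsquashed term such as the one used by peepholes does not:

    import numpy as np

    rng = np.random.default_rng(0)
    H = 16
    V = 0.1 * rng.standard_normal((H, H))   # an arbitrary WMC projection matrix
    p = 0.1 * np.ones(H)                    # peephole-style diagonal weights

    for scale in (1, 10, 100, 1000):        # a cell state growing over time
        c = scale * rng.standard_normal(H)
        wm_term = np.tanh(V @ c)            # bounded contribution to the gate pre-activation
        raw_term = p * c                    # unbounded contribution (no squashing)
        print(scale, np.abs(wm_term).max(), np.abs(raw_term).max())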

Peephole Connections and their Limitations. We now turn our attention to a related connection, namely the peephole connection (gers2000recurrent), which is no longer common in LSTM formulations. Peephole connections were introduced by Gers and Schmidhuber (gers2000recurrent), and enrich the LSTM equations with recurrent weights p_i, p_f, p_o:

i_t = \sigma(W_i x_t + R_i h_{t-1} + p_i c_{t-1} + b_i)    (16)
f_t = \sigma(W_f x_t + R_f h_{t-1} + p_f c_{t-1} + b_f)    (17)
o_t = \sigma(W_o x_t + R_o h_{t-1} + p_o c_t + b_o)    (18)

with p_* generally constrained to be diagonal (graves2013generating; Greff2017LSTMAS). While this formulation allows for a more precise control of the gates, there are two issues that limit its effectiveness. In this case, the local gradient at time t is expressed by:

\partial i_t / \partial s_t = \mathrm{diag}(i_t \odot (\mathbf{1} - i_t))    (19)

with s_t being the argument of the sigmoid function in Eq. 16:

s_t = W_i x_t + R_i h_{t-1} + p_i c_{t-1} + b_i    (20)

In light of this difference, Eq. 14 and 15 become:

\partial i_t / \partial p_i = (i_t \odot (\mathbf{1} - i_t)) \otimes c_{t-1}    (21)
Figure 2: The cell state may grow linearly with the number of time steps. Peephole connections directly expose c_t, creating a key issue (b). Data for this plot is taken from the first training iterations of the sequential MNIST (see § 4.2).

We observe that, both in Eq. 7 and in Eq. 16, the magnitude of the product between the connection weights and the cell state can in principle grow unbounded. The activation function introduced in WMCs squashes this term into a closed bounded interval. In peephole connections, however, this term is added inside the gate without an adequate protection (see Fig. 1). The result is that, in the peephole formulation, the sigmoid function applied immediately after could be pushed towards its saturating regime independently of the values of x_t and h_{t-1}. In theory, the LSTM block can recover from this situation by setting all the weights in the peephole connection to zero, but in practice this might not happen if the sigmoid gate is saturated most of the time. Even if the two other summands can compensate for the growth of c_{t-1}, hence letting gradients flow through the gate, there is still a key issue that hinders learning. In fact, as shown in Eq. 21, the gradients on the recurrent peephole weights grow linearly with c_{t-1}, making updates unstable.

To exemplify this behavior, we report the Euclidean norm of the cell state during the early training stages in Fig. 2. After a small number of time steps, the content of the cell floods the gates of the peephole LSTM. A possible consequence is that both the input and the forget gates saturate towards one. In our example, this aspect leads to an additional and uncontrolled growth of the magnitude of c_t. As can be seen, Working Memory Connections exhibit a much more regular behavior than peepholes and prevent the uncontrolled growth of the memory cell.
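The gradient side of the argument can be checked with the same kind of toy computation. Using the closed-form expressions of Eqs. 14, 15, and 21 as reconstructed above, and letting the other summands compensate the peephole term so that the sigmoid stays in its responsive regime, the gradient on the peephole weights scales with the cell magnitude, while the tanh derivative damps the gradient on the WMC weights:

    import numpy as np

    H = 16
    V = 0.1 * np.eye(H)                        # toy WMC projection
    p = 0.1 * np.ones(H)                       # toy peephole diagonal weights
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    for scale in (1, 10, 100, 1000):
        c = scale * np.ones(H)                 # cell state growing over time

        # Peephole gate: other summands cancel p * c, so the sigmoid is not saturated.
        i_ph = sigmoid(np.zeros(H))            # i_t = 0.5 everywhere
        grad_p = i_ph * (1 - i_ph) * c         # Eq. 21: grows linearly with c

        # WMC gate: the tanh keeps the pre-activation bounded, no compensation needed.
        k = np.tanh(V @ c)                                       # Eq. 15
        i_wm = sigmoid(k)
        grad_V = np.outer(i_wm * (1 - i_wm) * (1 - k ** 2), c)   # Eq. 14: damped by (1 - k^2)

        print(scale, np.abs(grad_p).max(), np.abs(grad_V).max())

Running this loop shows the peephole gradient growing proportionally to the cell magnitude, while the WMC gradient peaks and then vanishes as the tanh saturates.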

4 Experiments and Results

Figure 3: Comparison among traditional LSTM, the proposed LSTM with Working Memory Connections, and peephole LSTM on three different tasks: the adding problem with T=200 and T=400 (top), the copying task with T=100 and T=200 (center), and the sequential and permuted MNIST (bottom). In all the plots, shading indicates the standard error of the mean.

The effectiveness of Working Memory Connections and their general benefits can be appreciated in many different tasks. The proposed experiments cover a wide range of applications: two different toy problems, digit recognition, language modeling, and image captioning. While the analysis on simple tasks helps to clarify the inherent advantages of the proposed approach, results on more challenging real-world applications motivate a wider adoption of our novel connections, especially for long sequences. We compare our model (LSTM-WM) to a traditional LSTM and to an LSTM with peephole connections (LSTM-PH).

4.1 Adding Problem and Copying Tasks

In the adding problem (hochreiter1997long), the input to the network consists of a series of pairs (a_i, m_i), with i = 1, ..., T. The first element a_i is a real-valued number between 0 and 1, and m_i is a corresponding marker. In the entire sequence, only two markers are set to 1, while the others are set to 0. The goal is to predict the sum of the two real-valued items whose marker equals 1. In our experiments, we test with T = 200 and T = 400, and we measure the performance using the mean squared error. For this experiment, all three networks share the same hidden size and number of training epochs; we optimize the parameters using SGD with the Nesterov update rule, with a fixed learning rate, momentum factor, and batch size, and we clip the gradient norm. Results are reported in Fig. 3 (top), where we plot the MSE on the test set for every epoch of training. LSTM-WM achieves the best convergence time for T = 200, while the final performance on this setup is similar among the three models. The effectiveness of WMCs is striking in the T = 400 setup: the proposed model solves the adding problem within the training budget, while the other two architectures cannot learn the task and remain stuck on the trivial solution (predicting the expected value of the sum regardless of the input).
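For reference, a generic generator for this task (a sketch of the standard formulation of hochreiter1997long, not the exact data pipeline used in our experiments) looks as follows:

    import numpy as np

    def adding_problem_batch(batch_size, T, rng=None):
        # pairs (a_i, m_i): values a_i in [0, 1) and binary markers m_i,
        # with exactly two markers set to 1 in each sequence
        if rng is None:
            rng = np.random.default_rng()
        values = rng.random((batch_size, T))
        markers = np.zeros((batch_size, T))
        for b in range(batch_size):
            j, k = rng.choice(T, size=2, replace=False)
            markers[b, j] = markers[b, k] = 1.0
        x = np.stack([values, markers], axis=-1)   # shape (batch_size, T, 2)
        y = (values * markers).sum(axis=1)         # target: sum of the two marked values
        return x, y

    x, y = adding_problem_batch(32, T=200)         # T = 200 or T = 400 in our setup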

In the copying task (hochreiter1997long), the network observes a sequence of input symbols, waits for T time steps (we use T = 100 and T = 200), and then must reproduce the same sequence as output. For this experiment, we adopt the same setup described in (arjovsky2016unitary), and we keep the same implementation details described for the adding problem, apart from the number of training epochs. In Fig. 3 (center), we plot the test accuracy achieved by the three models at each epoch. In both setups, WMCs play an important role in terms of final performance and convergence time. As in the adding problem, the performance gain given by the proposed architecture is more evident when working on longer sequences: for T = 200, WMCs outperform both peephole LSTM and vanilla LSTM by a clear margin.
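A minimal version of the copying-task data, in the spirit of the setup of arjovsky2016unitary, can be generated as sketched below (the choice of 8 data symbols and a 10-symbol sequence to memorize is an assumption of this sketch):

    import numpy as np

    def copying_task_batch(batch_size, T, n_symbols=8, seq_len=10, rng=None):
        # symbols 1..n_symbols carry data, 0 is 'blank', n_symbols+1 is the 'reproduce now' delimiter
        if rng is None:
            rng = np.random.default_rng()
        data = rng.integers(1, n_symbols + 1, size=(batch_size, seq_len))
        x = np.zeros((batch_size, T + 2 * seq_len), dtype=np.int64)
        x[:, :seq_len] = data                          # symbols to remember
        x[:, T + seq_len - 1] = n_symbols + 1          # delimiter after the waiting period
        y = np.zeros_like(x)
        y[:, -seq_len:] = data                         # the network must reproduce the sequence at the end
        return x, y

    x, y = copying_task_batch(32, T=100)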

4.2 Permuted Sequential MNIST

The sequential MNIST (sMNIST) (le2015simple) is the sequential version of the MNIST digit recognition task (lecun1998gradient): the image pixels are fed to the network one at a time, from left to right and from top to bottom. The permuted sequential MNIST (pMNIST) is a variant in which the pixels are shuffled according to a random but fixed order. In both tasks, the goal is to predict the correct digit label after the last input pixel. Following the setup proposed in (arpit2018h), we split the images into training, validation, and test sets. We set the hidden size to 128 for all the networks and train with SGD (momentum and Nesterov update rule), keeping the learning rate, batch size, and number of epochs identical across the three models; we also clip the gradient norms.
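The two variants differ only in the order in which the 784 pixel values are presented to the network; a sketch of the input preparation (assuming the images are given as 28x28 arrays) is:

    import numpy as np

    def to_pixel_sequences(images, permute=False, seed=0):
        # images: array of shape (N, 28, 28) with pixel intensities
        seq = images.reshape(len(images), 28 * 28, 1)      # sMNIST: one pixel per time step,
                                                           # scanned left-to-right, top-to-bottom
        if permute:                                        # pMNIST: a random but fixed pixel order
            order = np.random.default_rng(seed).permutation(28 * 28)
            seq = seq[:, order, :]
        return seq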

Fig. 3 (bottom) reports the mean test accuracy of the three LSTM variants for both setups, with the standard error of the mean shown as a shaded area. For the sMNIST task, peephole LSTM performs slightly better than vanilla LSTM. LSTM with Working Memory Connections, instead, outperforms the competing architectures in terms of both final accuracy and convergence speed: our architecture needs only 50 epochs to approach its final accuracy, while vanilla LSTM and LSTM-PH are still well below their final accuracy at that point. In this experiment, we also find that WMCs help stabilize training: the band given by the standard error of the mean is much thinner for our approach than for the other two variants, in particular during the early stages of training. On the pMNIST task, all the models achieve good final results, with LSTM with Working Memory Connections still being the best option.

Model                      sMNIST   pMNIST
iRNN (le2015simple)        97.00    82.00
uRNN (arjovsky2016unitary) 95.10    91.40
h-detach (arpit2018h)      98.50    92.30
LSTM (128 hidden units)    98.16    92.94
LSTM (256 hidden units)    97.68    93.97
LSTM-PH (128 hidden units) 98.58    93.25
LSTM-PH (256 hidden units) 98.33    93.40
LSTM-WM (128 hidden units) 98.63    93.97
Table 1: Test accuracy (%) on the sequential and permuted MNIST tasks.

Numerical results, reported in Table 1, confirm that our model outperforms the classic LSTM by a clear margin (0.47 and 1.03 accuracy points on the sequential and permuted MNIST, respectively). Since WMCs introduce additional learnable parameters in the LSTM structure, we also compare with vanilla and peephole LSTM with an increased hidden size (256 instead of 128). Note that, in this setting, LSTM and LSTM-PH have more than twice the number of learnable parameters of LSTM-WM. Despite this, LSTM-WM achieves the best results on both tasks. It is worth noting that, while the additional parameters improve the results of vanilla LSTM on pMNIST, they are not helpful on the sMNIST task. The flexibility given by WMCs, instead, allows the proposed model to achieve the best result in both setups. In Table 1, we also compare with two state-of-the-art RNNs (le2015simple; arjovsky2016unitary) and with a training algorithm for LSTM (arpit2018h). The proposed LSTM-WM outperforms these competitors in terms of test accuracy.

Test Bits per Character (BPC)
           Fixed # Params                      Fixed # Hidden Units
Model      tBPTT 150        tBPTT 300          tBPTT 150        tBPTT 300
LSTM       1.334 ± 0.0006   1.343 ± 0.0004     1.386 ± 0.0005   1.395 ± 0.0005
LSTM-PH    1.339 ± 0.0048   1.343 ± 0.0009     1.383 ± 0.0004   1.394 ± 0.0005
LSTM-WM    1.299 ± 0.0005   1.302 ± 0.0008     1.299 ± 0.0005   1.302 ± 0.0008
Table 2: Mean test bits per character on the PTB test set. Error ranges indicate the standard error of the mean. Columns report truncated back-propagation through time (tBPTT) over 150 and 300 steps, with either the number of parameters or the number of hidden units fixed across models.

4.3 Penn Treebank (PTB) Character-Level Language Modeling

Character-level language modeling requires predicting a single character at each time step, given an observed sequence of text. In our experiments on the Penn Treebank (PTB) dataset (marcus1993building), we evaluate the performance of the three LSTM variants in terms of mean test bits per character (BPC), where a lower BPC denotes better performance. We report the results in Table 2, where we compare truncated back-propagation through time (tBPTT) over 150 and 300 steps. Since our connection introduces new learnable weights, we also consider a setup in which we keep a fixed number of parameters for the three networks. For this experiment, we follow the setup proposed by merityAnalysis, with the only exception that we employ a single LSTM layer instead of three. The advantage of using Working Memory Connections is more evident for an equal number of hidden units, where the proposed architecture outperforms vanilla LSTM and peephole LSTM by a significant margin. Even when the number of parameters is fixed for all the models, LSTM-WM outperforms the competitors by 0.035 and 0.041 BPC for tBPTT over 150 and 300 steps, respectively. It is worth noting that peephole LSTM performs similarly to or even worse than vanilla LSTM on this task.
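As a side note on the metric, bits per character is simply the mean per-character cross-entropy expressed in base 2; assuming the training loss is the usual cross-entropy in nats, the conversion is:

    import math

    def bits_per_character(cross_entropy_nats):
        # convert a mean per-character cross-entropy from nats to bits
        return cross_entropy_nats / math.log(2)

    print(round(bits_per_character(0.9005), 3))   # ~1.299 BPC, i.e. about 0.90 nats per character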

4.4 Image Captioning

We evaluate the performance of our LSTM with Working Memory Connections on the image captioning task, which consists of generating textual descriptions of images. We apply our approach to two different captioning models: Show and Tell (vinyals2015show) and Up-Down (anderson2018bottom). The first model includes a single LSTM layer and does not employ attention, while the second is composed of two LSTM layers and integrates attention mechanisms over image regions. We use the Microsoft COCO dataset (lin2014microsoft), following the splits defined in (karpathy2015deep). To represent images, we employ a global feature vector extracted from the average pooling layer of ResNet-152 (he2016deep) for the Show and Tell model, and multiple feature vectors extracted from Faster R-CNN (ren2015faster) for the Up-Down architecture. We train both models with the Adam optimizer (kingma2015adam) using a fixed learning rate; all other hyper-parameters are left the same as those suggested in the original papers.

Model BLEU-1 BLEU-4 METEOR ROUGE CIDEr SPICE
No Attention, ResNet-152
     LSTM 70.9 27.9 24.4 51.7 92.0 17.6
     GRU 69.5 26.2 22.7 50.4 82.3 15.6
     LSTM-PH 71.4 27.8 24.3 51.7 91.1 17.5
     LSTM-WM 71.4 28.3 24.6 52.4 94.0 17.8
Attention, Faster R-CNN
     LSTM 75.9 36.1 27.4 56.3 111.9 20.3
     GRU 76.0 36.1 27.0 56.5 111.0 20.2
     LSTM-PH 75.8 35.9 27.3 56.3 111.5 20.2
     LSTM-WM 76.2 36.1 27.5 56.5 112.7 20.4
Table 3: Image captioning results on COCO test set.
Figure 4: Metric gaps on the image captioning task for increasing caption lengths, for the model without attention (ResNet-152 features) and the model with attention (Faster R-CNN features).

Numerical results are reported in Table 3 using standard captioning evaluation metrics, i.e. BLEU-1, BLEU-4 (papineni2002bleu), METEOR (banerjee2005meteor), ROUGE (lin2004rouge), CIDEr (vedantam2015cider), and SPICE (anderson2016spice). For all of these, higher values indicate better performance, with CIDEr being the metric that best correlates with human judgment. In both settings, our LSTM-WM outperforms traditional LSTM and LSTM-PH by a clear margin. Specifically, LSTM-WM improves the vanilla LSTM results by 2.0 CIDEr points on the model without attention and by 0.8 CIDEr points on the model with attention over image regions, demonstrating the contribution of WMCs also in this task. As an additional comparison, we replace the LSTM layers with GRU layers: numerical results suggest that there is no clear advantage in using GRUs instead of LSTMs for this task. In Fig. 4, we plot the metric gap between LSTM-WM and the two competitors in terms of METEOR and CIDEr. On the X-axis, we report the length of the generated captions, meaning that we consider only the first words of each predicted sentence up to that length. On the Y-axis, a value of zero means that our proposal performs on par with the competitor, i.e. there is no performance gap, while a higher value indicates better performance for our model. With this analysis, we aim to check whether the improvement given by WMCs is restricted to a particular subset of the dataset. As one can observe, the metric gap generally increases with the caption length, especially w.r.t. peephole LSTM. We can deduce that the contribution of WMCs grows with the number of time steps.

5 Discussion

With Working Memory Connections, we show that the information stored in the LSTM cell should be made accessible to the gate structure. We compare the performance of WMCs to a similar approach named peephole connections (gers2000recurrent), and to vanilla LSTM. We find that the structure of WMCs allows for two distinct improvements:

  1. A more precise control of the gates. The multiplicative gates in the LSTM block must regulate the information flowing through the cell, but they cannot access the state of that same cell in the traditional LSTM formulation. The presence of the cell state in the multiplicative gates motivates the improvements of LSTM-WM w.r.t. vanilla LSTM.

  2. Increased stability during training compared to peephole connections. Exposing projections of the cell state without squashing their content appears to be the critical weakness of LSTM-PH. This element of novelty in our design explains why WMCs provide a boost in performance even when peepholes fail.

As a consequence of these two improvements, WMCs incorporate the theoretical benefits of peephole connections, originally described by gers2000recurrent, with the training stability and versatility of vanilla LSTM.

It is worth noting that, for tasks that do not require accessing the content of the memory cell, Working Memory Connections would probably not bring any benefit to the LSTM formulation, while peepholes might still hinder the whole learning process because of unstable updates.

At the same time, when training stacked LSTMs, the benefits given by WMCs may become less significant. We suppose that this is due to the increased complexity of the network structure, where multiple LSTM blocks can interact through the various layers. Similarly, many architectures employ LSTMs as building blocks together with different components, and the influence of WMCs in these compound deep networks cannot be easily determined. The experiments on image captioning proposed in this paper partially answer this question and show that WMCs afford a small yet consistent improvement even in this scenario. However, there are many other complex tasks, involving vision, language, and other modalities, that are worth investigating.

6 Conclusion

A limitation of current Long Short-Term Memory networks is that the cell state cannot influence the gate dynamics directly. In this paper, we propose Working Memory Connections (WMCs) for LSTM, which provide an efficient way of using intra-cell knowledge inside the network. The proposed design performs noticeably better than the vanilla LSTM and overcomes important issues in previous formulations. We formally motivate this improvement as a consequence of more stable training dynamics. Experimental results reflect the theoretical benefits of the proposed approach and motivate further study in this direction. One future direction is to test the efficacy of Working Memory Connections on an even wider set of tasks.

Acknowledgments

This work has been supported by “Fondazione di Modena” under the project “AI for Digital Humanities” and by the national project “IDEHA: Innovation for Data Elaboration in Heritage Areas” (PON ARS01_00421), cofunded by the Italian Ministry of University and Research.

References