MCRM: Mother Compact Recurrent Memory

08/04/2018 ∙ by Abduallah A. Mohamed, et al. ∙ The University of Texas at Austin

LSTMs and GRUs are the most common recurrent neural network architectures used to solve temporal sequence problems. The two architectures have differing data flows dealing with a common component called the cell state (also referred to as the memory). We attempt to enhance the memory by presenting a modification that we call the Mother Compact Recurrent Memory (MCRM). MCRMs are a type of nested LSTM-GRU architecture where the cell state is the GRU hidden state. The concatenation of the forget gate and input gate interactions from the LSTM is treated as the input to the GRU cell. Because MCRMs have this type of nesting, they have a compact memory pattern consisting of neurons that act explicitly in both long-term and short-term fashions. For some specific tasks, empirical results show that MCRMs outperform previously used architectures.




1 Introduction

Recurrent neural networks (RNNs) are a class of neural networks that can relate temporal information. They have been widely used for many problems, including image caption generation [23, 43, 38, 31], text-to-speech generation [11, 39, 1], object detection and tracking [33, 46, 22, 50, 42, 25], and neural machine translation [3, 35, 34, 48, 47]. RNNs are generally categorized into three well-known architectures: vanilla RNNs, long short-term memory (LSTM) networks proposed by [18], and gated recurrent units (GRUs) proposed by [5]. Vanishing and exploding gradients are well-known problems of vanilla RNNs, which GRUs and LSTMs address. Vanilla RNNs also lack the ability to remember long-term sequences, unlike GRUs and LSTMs. The main difference between LSTMs and GRUs is architectural: LSTMs have more control gates than GRUs do, and an LSTM's output is only a part of its cell state content, unlike a GRU, whose cell state is its output. Another difference is that GRUs are computationally less expensive than LSTMs. Still, there is no clear evidence as to whether GRUs are better than LSTMs; see for instance the work of [4].

The core idea of LSTMs and GRUs is to control the information flow to the cell state, which can be described as the memory, through control gates. Yet the cell state itself is a very simple neural network layer. The core idea of this article is to investigate whether the performance of LSTMs or GRUs can be enhanced by developing a better cell state architecture. We noticed that the Nested LSTM (NLSTM) introduced by [37] outperforms previous RNN architectures; nonetheless, the inner LSTM is not fully utilized, as only its cell state is exposed to the outer NLSTM via the output gate. Thus we chose to create a new deep recurrent model that has a GRU unit nested within an LSTM. The GRU is chosen to be inside the LSTM because it fully exposes its hidden state. The GRU represents the cell state of the LSTM. We call this architecture the Mother Compact Recurrent Memory (MCRM). The term Mother came from our visualization of the LSTM as a mother that carries the GRU as a fetus. The term compact came from the compact memory pattern produced by MCRM, which inherits both GRU and LSTM memory behaviors.

The contributions of MCRMs are positioned as follows:

  1. A novel class of nested RNNs.

  2. A compact memory pattern that supports both long-term and short-term behaviors.

  3. The model is validated using empirical test problems.

The rest of this article is organized as follows. Section 2 reviews the history of RNNs, LSTMs, and GRUs and their development, highlighting similar approaches. Section 3 discusses the MCRM model in detail and provides its mathematical model. An experimental validation of MCRM against different recurrent architectures on well-known benchmark recurrent tasks is shown in Section 4. Section 5 shows a visualization of the MCRM hidden state and compares it with the cell states of other RNNs, demonstrating the compact memory pattern outlined earlier. The MCRM source code is available at:

2 Related Work

One of the earliest works in the history of RNNs was by [20]; it represented an early concept of state in neural networks. It described a recurrent connection with an in-unit loop and successfully integrated time series data into a neural network. A simpler RNN architecture, which can be called the vanilla RNN, was proposed by [10]. It simplified the concept of [20] to use a context unit, or what can be called a hidden unit, removing the in-unit loop that was previously introduced. The work done by [32] provides more details about the history of RNNs and their development until they were formalized into the LSTM architecture. Different attempts to improve RNNs themselves have been made, including the introduction of an auxiliary memory to enhance performance [45]. [7] introduced a gated feedback RNN architecture which stacks multiple RNNs and controls whether signals flow from upper layers to lower layers. Another modification of RNNs is the Clockwork RNN [26], which introduced a method that makes RNNs work for long-term sequence requirements; it was shown to outperform LSTMs and RNNs on some specific tasks.

LSTMs were originally developed by [18]. The main motivation of LSTMs is to solve the problem of vanishing gradients in vanilla RNNs and to remember longer sequences. The hidden layer of an RNN was replaced by a memory unit, or what is called a memory cell. The LSTM has specific function gates to control the flow of data and its storage within the memory cell. One extra gate, called the forget gate, was added to the LSTM by [12] to give the LSTM the ability to forget specific information from the memory cell. From this point, multiple developments have been made to improve the performance of LSTMs. One of them is to replace the feed-forward units with convolutional neural networks (CNNs), introduced by [29], to improve the LSTM's ability on visual sequence problems [9]. Other approaches involved stacking LSTMs [14, 41, 15] or introducing a depth gate between stacked LSTMs [49]. Also, some proposed hyper-architectures between RNNs and LSTMs, such as the work of [27]. Nesting an LSTM within another LSTM, resulting in a nested LSTM (NLSTM), is the focus of [37], which is used as a reference in this article. [21] organized LSTMs in the form of a multidimensional grid. An extensive work by [17] explored different variations of the LSTM by introducing six variants of the architecture. It concluded that the current LSTM indeed performs well compared to these variants. It also found that the forget gate and output gate activation functions are very critical to LSTM performance.

The GRU architecture, which is on par with LSTMs in terms of performance, was introduced by [5]. GRUs require less memory and are computationally less expensive than LSTMs. GRUs in some cases may outperform LSTMs, as shown in the comparative study by [6]. Also, GRUs fully expose the hidden state content, unlike LSTMs. As a development of GRUs, the work done by [40] introduced the shuttleNet concept. The shuttleNet uses multiple GRUs treated as processors. These processors are connected in loops to mimic the feedback and feed-forward connections of the human brain. A study of three different variants of GRUs done by [8] concluded that the current GRU has a similar performance to these three variants.

3 Mother Compact Recurrent Memory (MCRM)

In this section, we first define the notations used in this article. We then recall the LSTM and GRU models, and use them to derive the MCRM model.

Mathematical notations

The following mathematical notations will be used: $t$ stands for the current time step, $\odot$ is the Hadamard product, $\sigma$ is the sigmoid activation function, and $\tanh$ is the tanh activation function. $\oplus$ is the concatenation symbol defined in [44]. The input to any architecture is $x_t$. Weights that interact with the input are denoted $W$. Weights that interact with the hidden state are denoted $U$. The $b$ terms are the biases. The subscripted gate symbols ($i_t$, $f_t$, $o_t$ for the LSTM; $r_t$, $z_t$ for the GRU) represent the different gates in the LSTM and GRU equations (1) and (2), respectively.

LSTMs architecture

LSTMs address the vanishing gradient problem commonly found in RNNs by controlling the information flow through specific function gates. At each time step $t$, an LSTM maintains a hidden state vector $h_t$ and a memory state vector $c_t$ responsible for controlling the state updates and generating the outputs. The computation at time step $t$ is defined as follows:
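The equations themselves did not survive extraction; a reconstruction of the standard LSTM update, which equation (1) presumably follows in the notation above (with $g_t$ denoting the candidate cell input), reads:

```latex
\begin{align}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \nonumber \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \nonumber \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \nonumber \\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g) \nonumber \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \nonumber \\
h_t &= o_t \odot \tanh(c_t) \tag{1}
\end{align}
```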


$c_t$ is usually referred to as the cell or the memory state, where the information is stored. The hidden state $h_t$ is the output or the exposed state of an LSTM. The LSTM operation is as follows: the input gate $i_t$ at time step $t$ decides how much information to take from the input into the memory. The forget gate $f_t$ decides how much information to keep from the previous cell state $c_{t-1}$. Both the input gate interaction and the forget gate interaction are used to compute the new cell state $c_t$. Then the output gate $o_t$ is used to compute the quantity of information to expose from the cell state to the outside world through the hidden state $h_t$, representing the output of the LSTM.

GRUs architecture

GRUs also address the same problems found in RNNs. They were introduced by [5]. The main difference between GRUs and LSTMs is that GRUs totally expose the hidden state information through $h_t$. They require less memory and are computationally less expensive than LSTMs. The computation at time step $t$ is defined as follows:
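A reconstruction of the standard GRU update, which equation (2) presumably follows in the notation above (with $n_t$ denoting the node state described below), reads:

```latex
\begin{align}
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \nonumber \\
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \nonumber \\
n_t &= \tanh(W_n x_t + U_n (r_t \odot h_{t-1}) + b_n) \nonumber \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot n_t \tag{2}
\end{align}
```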


The GRU operates as follows: the reset gate $r_t$ computes how much information to remove from the previous hidden state $h_{t-1}$. This reset gate interaction is combined with the input and saved into an intermediate vessel called the node state $n_t$. The update gate $z_t$ decides how much information from the previous hidden state should be added to the node state to form the new hidden state $h_t$. This new hidden state is the output of the GRU cell.

MCRMs architecture

MCRM nests a GRU cell inside an LSTM, treating the GRU hidden state as the LSTM cell state $c_t$. The input to the GRU unit is $\hat{x}_t$, the concatenation of what the LSTM should forget and what it should remember from the input coming into the MCRM cell; it is defined in equation (3).
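A plausible reconstruction of equation (3) from the prose description (the forget-gate interaction concatenated with the input-gate interaction, with $g_t$ the LSTM candidate cell input) is:

```latex
\hat{x}_t = (f_t \odot c_{t-1}) \oplus (i_t \odot g_t) \tag{3}
```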


The modified equations of the LSTM cell now become:
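A plausible reconstruction of equation (4): the gates are computed as in a standard LSTM, but the cell state is replaced by the inner GRU's hidden state $h^{\mathrm{GRU}}_t$:

```latex
\begin{align}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \nonumber \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \nonumber \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \nonumber \\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g) \nonumber \\
c_t &= h^{\mathrm{GRU}}_t \nonumber \\
h_t &= o_t \odot \tanh(c_t) \tag{4}
\end{align}
```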


The modified equations of the GRU cell now become:
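A plausible reconstruction of equation (5), assuming the inner GRU takes $\hat{x}_t$ as its input and the previous LSTM cell state $c_{t-1}$ as its previous hidden state:

```latex
\begin{align}
r_t &= \sigma(W_r \hat{x}_t + U_r c_{t-1} + b_r) \nonumber \\
z_t &= \sigma(W_z \hat{x}_t + U_z c_{t-1} + b_z) \nonumber \\
n_t &= \tanh(W_n \hat{x}_t + U_n (r_t \odot c_{t-1}) + b_n) \nonumber \\
h^{\mathrm{GRU}}_t &= z_t \odot c_{t-1} + (1 - z_t) \odot n_t \tag{5}
\end{align}
```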


Figure 1 illustrates the data flow inside the MCRM, following equations (4) and (5). The closest architecture to MCRMs is the Nested LSTM (NLSTM) introduced by [37], which nests an LSTM inside another LSTM. MCRMs have the following advantages over NLSTMs: first, they are computationally less expensive than NLSTMs, as they use a GRU instead of an LSTM as the cell state. Second, they utilize their neurons better, as the full hidden state of the inner GRU is exposed to the LSTM, unlike in NLSTMs, where the inner LSTM is not fully utilized because only its cell state is exposed.
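As a concrete illustration of this data flow, here is a minimal NumPy sketch of one MCRM step, reimplemented from the prose description above. This is not the authors' code: weight shapes, initialization, and gate naming are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class MCRMCell:
    """Sketch of an MCRM cell: an LSTM whose cell state is the hidden
    state of an inner GRU (hypothetical reimplementation, not the
    authors' code)."""

    def __init__(self, input_size, hidden_size, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        s = 0.1
        # Outer LSTM gates: input i, forget f, output o, candidate g.
        self.W = {k: rng.normal(0, s, (hidden_size, input_size)) for k in "ifog"}
        self.U = {k: rng.normal(0, s, (hidden_size, hidden_size)) for k in "ifog"}
        self.b = {k: np.zeros(hidden_size) for k in "ifog"}
        # Inner GRU gates: reset r, update z, node n. Its input is the
        # concatenation of the two LSTM gate interactions (2 * hidden_size).
        self.Wg = {k: rng.normal(0, s, (hidden_size, 2 * hidden_size)) for k in "rzn"}
        self.Ug = {k: rng.normal(0, s, (hidden_size, hidden_size)) for k in "rzn"}
        self.bg = {k: np.zeros(hidden_size) for k in "rzn"}

    def step(self, x, h, c):
        # Outer LSTM gate activations.
        i = sigmoid(self.W["i"] @ x + self.U["i"] @ h + self.b["i"])
        f = sigmoid(self.W["f"] @ x + self.U["f"] @ h + self.b["f"])
        o = sigmoid(self.W["o"] @ x + self.U["o"] @ h + self.b["o"])
        g = np.tanh(self.W["g"] @ x + self.U["g"] @ h + self.b["g"])
        # GRU input: concatenation of forget- and input-gate interactions.
        x_hat = np.concatenate([f * c, i * g])
        # Inner GRU update; its previous hidden state is the LSTM cell state c.
        r = sigmoid(self.Wg["r"] @ x_hat + self.Ug["r"] @ c + self.bg["r"])
        z = sigmoid(self.Wg["z"] @ x_hat + self.Ug["z"] @ c + self.bg["z"])
        n = np.tanh(self.Wg["n"] @ x_hat + self.Ug["n"] @ (r * c) + self.bg["n"])
        c_new = z * c + (1.0 - z) * n   # new cell state = GRU hidden state
        h_new = o * np.tanh(c_new)      # exposed LSTM hidden state
        return h_new, c_new
```

The sketch makes the nesting explicit: the inner GRU never sees the raw input, only the gated interactions of the outer LSTM.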

Figure 1: MCRM detailed data flow diagram starting from the previous hidden state to the next hidden state. The black colored lines refer to the LSTM part and the blue colored lines refer to the GRU part.

4 Experiments and results

In this section, the MCRM performance is evaluated empirically against different recurrent architectures on well-known tasks. For fairness of comparison, we use the same hyper-parameters as in [4], and our results are consistent with the results from [4]. These experiments were executed under a controlled environment, using the same initial random seed and weight initializations. The number of parameters was kept the same across models to check which architecture makes better use of its neurons. The different configuration parameters are shown in table 2. Each experiment was executed multiple times with different initial seeds, and the reported performance metrics are the mean performance over these executions. We avoided the use of any drop-out or batch normalization layers to have a fair evaluation of the performance.

Sequence Modeling Task Model Size RNN GRU LSTM NLSTM MCRM
Seq. MNIST (accuracy ↑) 152K 19.57 98.58 85.16 91.02 98.79
Adding problem (loss ↓) 95K 0.165 3.2e-04 0.001 0.004 4.0e-06
Copy memory (loss ↓) 3.3M 0.021 0.013 0.004 7.3e-05 8.5e-06
Char-level PTB (bpc ↓) 17.1M 1.683 1.397 1.374 1.365 1.331
Word-level PTB (ppl ↓) 1.3M 140.58 110.6 110.64 140.1 120.9
Table 1: Evaluation of MCRM versus different recurrent architectures on synthetic stress tests, character-level language modeling, and word-level language modeling. The MCRM architecture outperforms the other recurrent networks on most tasks and is competitive on the rest. ↑ means that higher is better. ↓ means that lower is better.

The adding problem

The adding problem has been used as a stress test for sequence models. It was introduced by [18]. The test consists of creating an input sequence of length $T$ and depth 2. The first dimension is randomly chosen between 0 and 1. The second dimension is all zeros except for two elements marked by 1. The objective is to find the sum of the two random elements marked by 1 in the second dimension. We used a fixed sequence length $T$. The test results are shown in table 1. MCRM outperforms all other models with a loss of 4.0e-06. The GRU has a close performance with a loss of 3.2e-04, which helps explain why the MCRM performs this well. The learning curves are shown in figure 2.
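A data generator for this task can be sketched as follows. This is an assumed formulation based on the description above; the paper's exact marker placement may differ.

```python
import numpy as np

def adding_problem_batch(batch_size, seq_len, rng=None):
    """Generate one batch of the adding problem. Each input has two
    channels: random values in [0, 1) and a marker channel that is all
    zeros except for two entries set to 1. The target is the sum of the
    two marked values."""
    if rng is None:
        rng = np.random.default_rng(0)
    values = rng.random((batch_size, seq_len))
    markers = np.zeros((batch_size, seq_len))
    targets = np.zeros(batch_size)
    for b in range(batch_size):
        idx = rng.choice(seq_len, size=2, replace=False)
        markers[b, idx] = 1.0
        targets[b] = values[b, idx].sum()
    x = np.stack([values, markers], axis=-1)  # shape (batch, seq_len, 2)
    return x, targets
```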

Figure 2: The loss as a function of iteration for the adding problem test predictions for both train and test datasets.

Copy memory test

This test has been used previously in [51, 2, 19] for measuring the performance of a recurrent architecture at remembering information seen $T$ time steps earlier. The input is a sequence of digits in which most entries are a zero digit: a short pattern of digits randomly chosen from 1 through 8 appears at the start, and a delimiter set to the digit 9 marks where the model should reproduce it. The model is expected to generate an output identical to the input pattern. A test was conducted with a fixed sequence length. The results are shown in table 1. The MCRM outperforms all other models with a loss of 8.5e-06. The learning curves are shown in figure 3.
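A generator for this task can be sketched as below, following the common formulation (e.g. [51, 2]) in which the target reproduces the pattern only after the delimiter; the paper's exact setup may differ.

```python
import numpy as np

def copy_memory_batch(batch_size, T, n_copy=10, rng=None):
    """Sketch of copy-memory task data (assumed formulation). The input
    of length T + 2*n_copy starts with a random pattern of digits 1-8,
    followed by T zeros, a delimiter digit 9, and n_copy - 1 zeros. The
    target repeats the pattern in the last n_copy positions."""
    if rng is None:
        rng = np.random.default_rng(0)
    seq_len = T + 2 * n_copy
    x = np.zeros((batch_size, seq_len), dtype=np.int64)
    y = np.zeros((batch_size, seq_len), dtype=np.int64)
    pattern = rng.integers(1, 9, size=(batch_size, n_copy))  # digits 1..8
    x[:, :n_copy] = pattern
    x[:, T + n_copy] = 9          # delimiter: start copying
    y[:, -n_copy:] = pattern      # expected output: the initial pattern
    return x, y
```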

Figure 3: The loss as a function of iteration for the copy memory test predictions for both train and test datasets.

Sequential MNIST test

This test is similar in intent to the copy memory test. In this task, the MNIST dataset [30] images are presented to the model as an input sequence of pixel intensity values, and the recurrent model must classify the digit from this sequence. This test was used as a stress test in several recurrent-related problems [28, 52]. From table 1, MCRM achieved an accuracy of 98.79%, outperforming every other model. Surprisingly, the LSTM and NLSTM had poor performance, unlike the GRU. We relate the success of MCRM in this task to the GRU core inside it. The learning curves are presented in figure 4.

Figure 4: The accuracy as a function of iteration for the sequential MNIST predictions for both train and test datasets.

PennTreebank character and word levels tests

The PennTreebank (PTB) [36] is a text dataset for both character-level and word-level language modeling tasks. It is widely used for evaluating the performance of RNN architectures. The PTB is divided into train, test, and validation datasets. To measure character-level task performance, bits per character (bpc) is used as the performance index. BPC was introduced by [16] and is defined as the cross-entropy loss [13] divided by $\ln(2)$. The performance index for the word-level language modeling task is the perplexity (ppl), defined as the exponential of the cross-entropy loss. The results of the two tasks are reported in table 1; the reported values are from the validation dataset. When the PTB is used as a character-level language corpus, the MCRM outperforms the other models with a bpc of 1.331, exceeding the NLSTM by 0.034 bpc. The learning curves are shown in figure 5. When the PTB is used as a word-level language corpus, MCRM performance falls in between the GRU and the NLSTM. The learning curves are shown in figure 6.
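The two metric definitions above amount to a pair of one-line conversions from the cross-entropy loss (in nats):

```python
import math

def bits_per_character(nats_loss):
    """Character-level metric: cross-entropy loss in nats divided by
    ln(2), i.e. the loss expressed in bits."""
    return nats_loss / math.log(2)

def perplexity(nats_loss):
    """Word-level metric: the exponential of the cross-entropy loss."""
    return math.exp(nats_loss)
```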

Figure 5: The bpc performance index as a function of iteration for character-level prediction on PTB’s train and test data sets.

Figure 6: The ppl performance index as a function of iteration for word-level prediction on PTB’s train and test data sets.
Task RNN GRU LSTM NLSTM MCRM Optimizer (lr), (grad clip) Exceptions
Adding 308 177 153 77 85 Adam (1e-3), (0.5) NLSTM (0.01), (0.1)
Seq. MNIST 384 222 192 108 97 RMSprop (1e-3), (1.0) NLSTM (0.25), LSTM (1e-4)
Copy Memory 1800 1050 900 448 500 RMSprop (5e-4), (1.0) NLSTM (1e-4), (0.25)
Word-level PTB 125 119 117 100 109 SGD (30), (0.35) -
Char-level PTB 2900 1680 1050 920 1000 Adam (1e-3), (0.15) -
Table 2: Key parameter settings for the experiments in Section 4. The integers beneath each architecture type are the hidden state sizes. lr stands for learning rate; grad clip stands for gradient clipping.

5 Visualization

To understand the internal behavior of MCRM, we performed a visual analysis of the memory cell. Following an approach similar to the work of [24], specific neurons of interest are monitored against an input sequence. The work of [24] is also extended by introducing a method for selecting these neurons. This method consists of a heat map of the propagation of all neuron activation values in the memory against an input sequence (shown in figure 7). MCRM, NLSTM, LSTM, and GRU were trained on the PTB character dataset, fixing the memory cell size to around 150 neurons.

In figure 7, the heat maps column represents a step in this visual analysis. LSTM cell states tend to have neurons that change slowly over the sequence. This is in contrast to GRU architectures, wherein neuron activation values change rapidly. The NLSTM outer cell acts in a short-term fashion, remembering small sequences, while the inner cell of the NLSTM acts in a long-term fashion to support longer sequences. The MCRM memory cell has some neurons acting in an explicitly long-term fashion and some acting in an explicitly short-term fashion. This means that MCRM naturally inherits both LSTM and GRU behaviors in a single memory cell, which leads to better neuron utilization, an important advantage of MCRM. The neurons of interest column in figure 7 shows specific neurons, extracted from the heat maps column, whose long- or short-term behaviors support the analysis of the heat maps.

Figure 7: The heat maps column shows the propagation of all memory state neurons versus a sequence of character inputs. The vertical axis represents neuron activation values and the horizontal axis represents the characters input to the cell. Each row is the propagation of a specific neuron's activation values, from left to right. The neurons of interest column is a visualization of the activation values of specific neurons of interest versus an input sequence of characters. Red denotes a negative cell state value, and blue a positive one. A darker shade denotes a larger magnitude. The memory state of the GRU is its hidden state $h_t$. The memory state of the LSTM is its cell state $c_t$. The memory states of the NLSTM are its outer and inner cell states, respectively. The memory state of the MCRM cell is its cell state $c_t$.

6 Conclusion

Mother Compact Recurrent Memories (MCRMs) are a nested LSTM-GRU architecture. They create a unique compact memory pattern that supports both long-term and short-term behaviors. MCRMs can outperform other RNN architectures on benchmark tests. Because of their promising results, MCRMs could be used in temporal sequence modeling tasks.


Acknowledgments

This work was partially supported by the National Science Foundation under grant 1739964: CPS: Medium: Augmented reality for control of reservation-based intersections with mixed autonomous-non autonomous flows.


  • [1] S. O. Arik, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, A. Ng, J. Raiman, et al. (2017) Deep voice: real-time neural text-to-speech. arXiv preprint arXiv:1702.07825. Cited by: §1.
  • [2] M. Arjovsky, A. Shah, and Y. Bengio (2015) Unitary evolution recurrent neural networks. CoRR abs/1511.06464. External Links: Link, 1511.06464 Cited by: §4.
  • [3] D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §1.
  • [4] S. Bai, J. Z. Kolter, and V. Koltun (2018) An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271. Cited by: §1, §4.
  • [5] K. Cho, B. van Merrienboer, Ç. Gülçehre, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR abs/1406.1078. External Links: Link, 1406.1078 Cited by: §1, §2, §3.
  • [6] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR abs/1412.3555. External Links: Link, 1412.3555 Cited by: §2.
  • [7] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2015) Gated feedback recurrent neural networks. In International Conference on Machine Learning, pp. 2067–2075. Cited by: §2.
  • [8] R. Dey and F. M. Salem (2017) Gate-variants of gated recurrent unit (GRU) neural networks. arXiv preprint arXiv. Cited by: §2.
  • [9] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell (2015) Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2625–2634. Cited by: §2.
  • [10] J. L. Elman (1990) Finding structure in time. Cognitive science 14 (2), pp. 179–211. Cited by: §2.
  • [11] Y. Fan, Y. Qian, F. Xie, and F. K. Soong (2014) TTS synthesis with bidirectional lstm based recurrent neural networks. In Fifteenth Annual Conference of the International Speech Communication Association, Cited by: §1.
  • [12] F. A. Gers, J. Schmidhuber, and F. Cummins (1999) Learning to forget: continual prediction with lstm. Cited by: §2.
  • [13] I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT Press. Note: Cited by: §4.
  • [14] A. Graves, A. Mohamed, and G. E. Hinton (2013) Speech recognition with deep recurrent neural networks. CoRR abs/1303.5778. External Links: Link, 1303.5778 Cited by: §2.
  • [15] A. Graves (2012) Supervised sequence labelling. In Supervised sequence labelling with recurrent neural networks, pp. 5–13. Cited by: §2.
  • [16] A. Graves (2013) Generating sequences with recurrent neural networks. CoRR abs/1308.0850. External Links: Link, 1308.0850 Cited by: §4.
  • [17] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber (2017) LSTM: a search space odyssey. IEEE transactions on neural networks and learning systems 28 (10), pp. 2222–2232. Cited by: §2.
  • [18] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §1, §2, §3, §4.
  • [19] L. Jing, Y. Shen, T. Dubcek, J. Peurifoy, S. A. Skirlo, M. Tegmark, and M. Soljacic (2016) Tunable efficient unitary neural networks (EUNN) and their application to RNN. CoRR abs/1612.05231. External Links: Link, 1612.05231 Cited by: §4.
  • [20] M. I. Jordan (1997) Serial order: a parallel distributed processing approach. In Advances in psychology, Vol. 121, pp. 471–495. Cited by: §2.
  • [21] N. Kalchbrenner, I. Danihelka, and A. Graves (2015) Grid long short-term memory. CoRR abs/1507.01526. External Links: Link, 1507.01526 Cited by: §2.
  • [22] K. Kang, H. Li, T. Xiao, W. Ouyang, J. Yan, X. Liu, and X. Wang (2017) Object detection in videos with tubelet proposal networks. In Proc. CVPR, Vol. 2, pp. 7. Cited by: §1.
  • [23] A. Karpathy and L. Fei-Fei (2015) Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3128–3137. Cited by: §1.
  • [24] A. Karpathy, J. Johnson, and F. Li (2015) Visualizing and understanding recurrent networks. CoRR abs/1506.02078. External Links: Link, 1506.02078 Cited by: §5.
  • [25] Cited by: §1.
  • [26] J. Koutník, K. Greff, F. J. Gomez, and J. Schmidhuber (2014) A clockwork RNN. CoRR abs/1402.3511. External Links: Link, 1402.3511 Cited by: §2.
  • [27] B. Krause, L. Lu, I. Murray, and S. Renals (2016) Multiplicative LSTM for sequence modelling. CoRR abs/1609.07959. External Links: Link, 1609.07959 Cited by: §2.
  • [28] Q. V. Le, N. Jaitly, and G. E. Hinton (2015) A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941. Cited by: §4.
  • [29] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §2.
  • [30] Y. LeCun, C. Cortes, and C. Burges (2010) MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist. Cited by: §4.
  • [31] Cited by: §1.
  • [32] Z. C. Lipton (2015) A critical review of recurrent neural networks for sequence learning. CoRR abs/1506.00019. External Links: Link, 1506.00019 Cited by: §2.
  • [33] Y. Lu, C. Lu, and C. Tang (2017) Online video object detection using association lstm. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2344–2352. Cited by: §1.
  • [34] M. Luong and C. D. Manning (2015) Stanford neural machine translation systems for spoken language domains. In Proceedings of the International Workshop on Spoken Language Translation, pp. 76–79. Cited by: §1.
  • [35] M. Luong, H. Pham, and C. D. Manning (2015) Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025. Cited by: §1.
  • [36] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini (1993) Building a large annotated corpus of english: the penn treebank. Computational linguistics 19 (2), pp. 313–330. Cited by: §4.
  • [37] J. R. A. Moniz and D. Krueger (2018) Nested lstms. CoRR abs/1801.10308. External Links: Link, 1801.10308 Cited by: §1, §2, §3.
  • [38] Z. Niu, M. Zhou, L. Wang, X. Gao, and G. Hua (2017) Hierarchical multimodal lstm for dense visual-semantic embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1881–1889. Cited by: §1.
  • [39] K. Rao, F. Peng, H. Sak, and F. Beaufays (2015) Grapheme-to-phoneme conversion using long short-term memory recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pp. 4225–4229. Cited by: §1.
  • [40] Y. Shi, Y. Tian, Y. Wang, W. Zeng, and T. Huang (2017) Learning long-term dependencies for action recognition with a biologically-inspired deep network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 716–725. Cited by: §2.
  • [41] I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. CoRR abs/1409.3215. External Links: Link, 1409.3215 Cited by: §2.
  • [42] S. Tripathi, Z. C. Lipton, S. Belongie, and T. Nguyen (2016) Context matters: refining object detection in video with recurrent neural networks. arXiv preprint arXiv:1607.04648. Cited by: §1.
  • [43] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan (2015) Show and tell: a neural image caption generator. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pp. 3156–3164. Cited by: §1.
  • [44] P. Wadler (1989) Theorems for free!. In Proceedings of the Fourth International Conference on Functional Programming Languages and Computer Architecture, FPCA ’89, New York, NY, USA, pp. 347–359. External Links: ISBN 0-89791-328-0, Link, Document Cited by: §3.
  • [45] J. Wang, L. Zhang, Q. Guo, and Z. Yi (2017) Recurrent neural networks with auxiliary memory units. IEEE transactions on neural networks and learning systems. Cited by: §2.
  • [46] C. Wolf (2017) Recurrent neural networks for object detection and motion recognition. Cited by: §1.
  • [47] Cited by: §1.
  • [48] Cited by: §1.
  • [49] K. Yao, T. Cohn, K. Vylomova, K. Duh, and C. Dyer (2015) Depth-gated recurrent neural networks. arXiv preprint. Cited by: §2.
  • [50] Y. Yuan, X. Liang, X. Wang, D. Y. Yeung, and A. Gupta (2017) Temporal dynamic graph lstm for action-driven video object detection. arXiv preprint arXiv:1708.00666. Cited by: §1.
  • [51] S. Zhang, Y. Wu, T. Che, Z. Lin, R. Memisevic, R. Salakhutdinov, and Y. Bengio (2016) Architectural complexity measures of recurrent neural networks. CoRR abs/1602.08210. External Links: Link, 1602.08210 Cited by: §4.
  • [52] S. Zhang, Y. Wu, T. Che, Z. Lin, R. Memisevic, R. R. Salakhutdinov, and Y. Bengio (2016) Architectural complexity measures of recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 1822–1830. Cited by: §4.