MCRM: Mother Compact Recurrent Memory
LSTMs and GRUs are the most common recurrent neural network architectures used to solve temporal sequence problems. The two architectures have differing data flows through a common component called the cell state (also referred to as the memory). We attempt to enhance the memory by presenting a modification that we call the Mother Compact Recurrent Memory (MCRM). MCRMs are a type of nested LSTM-GRU architecture where the cell state is the GRU hidden state. The concatenation of the forget gate and input gate interactions from the LSTM is treated as the input to the GRU cell. Because of this nesting, MCRMs have a compact memory pattern consisting of neurons that act explicitly in both long-term and short-term fashions. For some specific tasks, empirical results show that MCRMs outperform previously used architectures.
Recurrent neural networks (RNNs) are a class of neural networks that can relate temporal information. They have been widely used for many problems, including image caption generation [23, 43, 38, 31], text-to-speech generation [11, 39, 1], object detection and tracking [33, 46, 22, 50, 42, 25], and other applications [3, 35, 34, 48, 47]. RNNs are generally categorized into three well-known architectures: vanilla RNNs, long short-term memory (LSTM) networks proposed by Hochreiter and Schmidhuber, and gated recurrent units (GRUs) proposed by Cho et al. Vanishing and exploding gradients are well-known vanilla RNN problems, which GRUs and LSTMs solve. Vanilla RNNs also lack the ability to remember long-term sequences, unlike GRUs and LSTMs. The main difference between LSTMs and GRUs is architectural: LSTMs have more control gates than GRUs do, and an LSTM exposes only part of its cell state content as output, unlike a GRU, whose cell state is its output. Another difference is that GRUs are clearly less computationally expensive than LSTMs. Still, there is no clear evidence on whether GRUs are better than LSTMs; see, for instance, the comparative studies in the literature.
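The cost difference between the two architectures can be made concrete by counting parameters. The sketch below assumes the standard single-layer formulations with input size `D` and hidden size `H`; the function names are ours, chosen for illustration:

```python
def lstm_param_count(D, H):
    """Four gate blocks (input, forget, output, candidate), each with an
    input weight (H x D), a recurrent weight (H x H) and a bias (H)."""
    return 4 * (H * D + H * H + H)

def gru_param_count(D, H):
    """Three gate blocks (reset, update, candidate/node state)."""
    return 3 * (H * D + H * H + H)
```

For equal `D` and `H`, a GRU layer has exactly three quarters of the parameters of an LSTM layer, which is why GRUs are the cheaper choice at equal width.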
The core idea of LSTMs and GRUs is to control information flow to the cell state, which can be described as the memory, through control gates. Yet the cell state itself is a very simple neural network layer. The core idea of this article is to investigate whether the performance of LSTMs or GRUs can be enhanced by developing a better cell state architecture. We noticed that the Nested LSTM (NLSTM) outperforms previous RNN architectures; nonetheless, its inner LSTM is not fully utilized, as the cell state is exposed to the outer NLSTM via the output gate. Thus we chose to create a new deep recurrent model that has a GRU unit nested within an LSTM. The GRU is chosen to be inside the LSTM as it fully exposes its hidden state. The GRU represents the cell state of the LSTM. We call this architecture Mother Compact Recurrent Memory (MCRM). The "Mother" term comes from our visualization of the LSTM as a mother that carries the GRU as a fetus. The "compact" term comes from the compact memory pattern produced by the MCRM, which inherits both GRU and LSTM memory behaviors.
The contributions of MCRMs are as follows:
A novel class of nested RNNs.
A compact memory pattern that supports both long- and short-term behaviors.
The model is validated using empirical test problems.
The rest of this article is organized as follows. Section 2 reviews the history of RNNs, LSTMs and GRUs and their development, highlighting similar approaches. Section 3 discusses the MCRM model in detail and provides its mathematical model. An experimental validation of MCRM against different recurrent architectures on well-known benchmark recurrent tasks is shown in Section 4. Section 5 shows a visualization of the MCRM hidden state and compares it with other RNN cell states, demonstrating the compact memory pattern outlined earlier. The MCRM source code is available at: https://github.com/abduallahmohamed/MCRM.
One of the earliest works in the history of RNNs represented an early concept of state in neural networks. It described a recurrent connection with an in-unit loop, and it successfully integrated time-series data into a neural network.
A simpler RNN architecture, which can be called the vanilla RNN, was proposed later. It simplified the earlier concept by using a context unit, or what can be called a hidden unit, removing the in-unit loop that was previously introduced. Later work provides more details about the history of RNNs and their development until they were formalized into the LSTM architecture. Different attempts have been made to improve RNNs themselves, including the introduction of an auxiliary memory to enhance performance. One approach introduced a gated-feedback RNN architecture, which stacks multiple RNNs and gates the signals flowing from upper layers to lower layers. Another modification of RNNs is the Clockwork RNN, which introduced a method that makes RNNs work for long-term sequence requirements. It was shown to outperform LSTMs and RNNs on some specific tasks.
LSTMs were originally developed by Hochreiter and Schmidhuber. The main motivation of LSTMs is to solve the problem of vanishing gradients in vanilla RNNs and to remember longer sequences. The hidden layer of an RNN was replaced by a memory unit, or what is called a memory cell. The LSTM has specific function gates to control the flow of data and its storage within the memory cell. One extra gate, called the forget gate, was later added by Gers et al. to give the LSTM the ability to forget specific information from the memory cell. From this point, multiple developments have been made to improve the performance of LSTMs. One of them replaces the feed-forward units with convolutional neural networks (CNNs) to improve the LSTM's ability on visual sequence problems. Other approaches involved stacking LSTMs [14, 41, 15] or introducing a depth gate between stacked LSTMs. Some also proposed hyper-architectures between RNNs and LSTMs. Nesting the LSTM within another LSTM, resulting in a nested LSTM (NLSTM), is the focus of the work used as a reference in this article. Another approach organized LSTMs in the form of a multidimensional grid. An extensive study explored different variations of the LSTM by introducing six variants to the architecture. It concluded that the current LSTM indeed performs well compared to these variants. It also found that the forget gate and output gate activation functions are very critical to LSTM performance.
The GRU architecture, which is on par with LSTMs in terms of performance, was introduced by Cho et al. GRUs require less memory and are computationally less expensive than LSTMs. GRUs may in some cases outperform LSTMs, as shown in comparative studies. GRUs also fully expose the hidden state content, unlike LSTMs. As a development of GRUs, one approach introduced the shuttleNet concept. The shuttleNet uses multiple GRUs treated as processors; these processors are loop-connected to mimic the human brain's feedback and feed-forward connections. A study on three different variants of GRUs concluded that the current GRU has performance similar to these three variants.
In this section, we first define the notations used in this article. We then recall the LSTM and GRU models, and use them to derive the MCRM model.
The following mathematical notations will be used: $t$ stands for the current time step, $\odot$ is the Hadamard product, $\sigma$ is the sigmoid activation function, and $\tanh$ is the hyperbolic tangent activation function. $\oplus$ is the concatenation symbol. The input to any architecture is $x_t$. Weights which interact with the input are denoted $W$. Weights which interact with the hidden state are denoted $U$. The $b$ terms are the biases. $i_t$, $f_t$, $o_t$ and $r_t$, $z_t$ represent the different gates in the LSTM and GRU equations (1) and (2) respectively.
LSTMs address the vanishing gradient problem commonly found in RNNs by controlling the information flow through specific function gates. At each time step, an LSTM maintains a hidden state vector $h_t$ and a memory state vector $c_t$ responsible for controlling the state updates and generating the outputs. The computation at time step $t$ is defined as follows:

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
\tag{1}
$$
$c_t$ is usually referred to as the cell or the memory state, where the information is stored. The hidden state $h_t$ is the output, or the exposed state, of an LSTM. The LSTM operates as follows: the input gate $i_t$ at time step $t$ decides how much information to take from the input into the memory. The forget gate $f_t$ decides how much information to keep from the previous cell state $c_{t-1}$. Both the input gate interaction and the forget gate interaction are used to compute the new cell state $c_t$. Then the output gate $o_t$ is used to compute the quantity of information to expose from the cell state to the outside world through the hidden state $h_t$, representing the output of the LSTM.
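The gate interactions above can be sketched in NumPy. This is a minimal single-step illustration, not the authors' implementation; the stacked weight shapes and gate ordering are our own assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4H, D), U: (4H, H), b: (4H,), stacking i, f, o, g."""
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:H])          # input gate: how much of the input to take
    f = sigmoid(z[H:2*H])        # forget gate: how much of c_prev to keep
    o = sigmoid(z[2*H:3*H])      # output gate: how much of the cell to expose
    g = np.tanh(z[3*H:4*H])      # candidate cell content
    c = f * c_prev + i * g       # new cell state
    h = o * np.tanh(c)           # exposed hidden state (the LSTM output)
    return h, c
```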
GRUs also address the same problems found in RNNs. They were introduced by Cho et al. The main difference between GRUs and LSTMs is that GRUs totally expose the hidden state information through $h_t$. They require less memory and are computationally less expensive than LSTMs. The computation at time step $t$ is defined as follows:

$$
\begin{aligned}
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
n_t &= \tanh(W_n x_t + U_n (r_t \odot h_{t-1}) + b_n) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot n_t
\end{aligned}
\tag{2}
$$
The GRU operates as follows: the reset gate $r_t$ is used to compute how much information to remove from the previous hidden state $h_{t-1}$. This reset gate interaction is added to the input and saved into an intermediate vessel called the node state $n_t$. The update gate $z_t$ decides how much information from the previous hidden state should be added to the node state to form the new hidden state $h_t$. This new hidden state is the output of the GRU cell.
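As with the LSTM, the GRU step can be sketched in NumPy. This is an illustrative single-step sketch under our own assumed weight stacking, not reference code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W, U, b):
    """One GRU step. W: (3H, D), U: (3H, H), b: (3H,), stacking r, z, n."""
    H = h_prev.shape[0]
    r = sigmoid(W[:H] @ x_t + U[:H] @ h_prev + b[:H])              # reset gate
    z = sigmoid(W[H:2*H] @ x_t + U[H:2*H] @ h_prev + b[H:2*H])     # update gate
    n = np.tanh(W[2*H:] @ x_t + U[2*H:] @ (r * h_prev) + b[2*H:])  # node state
    h = z * h_prev + (1 - z) * n  # blend the old state with the node state
    return h
```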
MCRM nests a GRU cell inside an LSTM, treating the GRU hidden state as the LSTM cell state $c_t$.
The input to the GRU unit is $x^g_t$, which is the concatenation of what the LSTM should forget and what it should remember from the input coming into the MCRM cell. It is defined in equation (3):

$$
x^g_t = (f_t \odot c_{t-1}) \oplus (i_t \odot g_t)
\tag{3}
$$
The modified equations of the LSTM cell now become:

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g) \\
c_t &= h^g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
\tag{4}
$$
And the modified equations of the GRU cell now become:

$$
\begin{aligned}
r_t &= \sigma(W_r x^g_t + U_r c_{t-1} + b_r) \\
z_t &= \sigma(W_z x^g_t + U_z c_{t-1} + b_z) \\
n_t &= \tanh(W_n x^g_t + U_n (r_t \odot c_{t-1}) + b_n) \\
h^g_t &= z_t \odot c_{t-1} + (1 - z_t) \odot n_t
\end{aligned}
\tag{5}
$$
Figure 1 illustrates the data flow inside the MCRM, following equations (4) and (5). The closest architecture to MCRMs is the Nested LSTM (NLSTM), which nests an LSTM inside another LSTM. MCRMs have the following advantages over NLSTMs: first, they are less computationally expensive than NLSTMs, as they use a GRU instead of an LSTM as the cell state. Secondly, they have better neuron utilization, as the full hidden state is exposed from the GRU to the LSTM, unlike NLSTMs, where the inner LSTM is not fully utilized because only its cell state is used.
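Putting the pieces together, a single MCRM step can be sketched as follows. This is our illustrative reading of equation (3) and the modified LSTM and GRU equations, not the authors' released implementation (see the linked repository for that); the stacked weight shapes are our assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mcrm_step(x_t, h_prev, c_prev, Wl, Ul, bl, Wg, Ug, bg):
    """One MCRM step: an LSTM whose cell state is the hidden state of an
    inner GRU. Wl: (4H, D), Ul: (4H, H); the inner GRU sees a 2H-dim
    input, so Wg: (3H, 2H), Ug: (3H, H)."""
    H = h_prev.shape[0]
    z = Wl @ x_t + Ul @ h_prev + bl
    i, f, o = sigmoid(z[:H]), sigmoid(z[H:2*H]), sigmoid(z[2*H:3*H])
    g = np.tanh(z[3*H:])
    # GRU input: concatenation of the forget- and input-gate interactions
    x_g = np.concatenate([f * c_prev, i * g])
    # Inner GRU update; its hidden state becomes the new cell state
    rz = sigmoid(Wg[:2*H] @ x_g + Ug[:2*H] @ c_prev + bg[:2*H])
    r, zg = rz[:H], rz[H:]
    n = np.tanh(Wg[2*H:] @ x_g + Ug[2*H:] @ (r * c_prev) + bg[2*H:])
    c = zg * c_prev + (1 - zg) * n
    h = o * np.tanh(c)  # exposed output, as in a standard LSTM
    return h, c
```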
In this section, the MCRM performance is evaluated empirically against different recurrent architectures on well-known tasks. For fairness of comparison, we use the same hyper-parameters as in prior work, and our results are consistent with the results reported there. These experiments were executed under a controlled environment, using the same initial random seed and weight initializations. The parameter counts of the models were kept the same between experiments to check which architecture makes better use of its neurons. The different configuration parameters are shown in table 2. Each experiment was executed multiple times with different initial seeds, and the reported performance metrics are the mean performance over these executions. We avoided the use of any drop-out or batch-normalization layers to have a fair evaluation of the performance.
| Sequence Modeling Task | Model Size | RNN | GRU | LSTM | NLSTM | MCRM |
|---|---|---|---|---|---|---|
| Seq. MNIST (accuracy) | 152K | 19.57 | 98.58 | 85.16 | 91.02 | 98.79 |
| Adding problem (loss) | 95K | 0.165 | 3.2e-04 | 0.001 | 0.004 | 4.0e-06 |
| Copy memory (loss) | 3.3M | 0.021 | 0.013 | 0.004 | 7.3e-05 | 8.5e-06 |
| Char-level PTB (bpc) | 17.1M | 1.683 | 1.397 | 1.374 | 1.365 | 1.331 |
| Word-level PTB (ppl) | 1.3M | 140.58 | 110.6 | 110.64 | 140.1 | 120.9 |
The adding problem has been used as a stress test for sequence models. The test consists of creating an input sequence of length $T$ and depth 2. The first dimension is randomly chosen between 0 and 1. The second dimension is all zeros except for two elements marked by 1. The objective is to find the sum of the two random elements marked by 1 in the second dimension. A fixed sequence length was used. The test results are shown in table 1. MCRM outperforms all other models with a loss of 4.0e-06. The GRU has a close performance with a loss of 3.2e-04, which helps explain why the MCRM performs this well. The learning curves are shown in figure 2.
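A batch of adding-problem data can be generated as below. We place the two markers at random positions, which is the common form of the task; the marker-placement convention and the function name are our assumptions, not taken from the paper:

```python
import numpy as np

def adding_problem_batch(batch_size, T, rng):
    """Generate inputs x of shape (batch, T, 2) and targets y of shape (batch,)."""
    values = rng.random((batch_size, T))            # channel 1: uniform [0, 1)
    markers = np.zeros((batch_size, T))
    for b in range(batch_size):
        idx = rng.choice(T, size=2, replace=False)  # two marked positions
        markers[b, idx] = 1.0
    x = np.stack([values, markers], axis=-1)
    y = (values * markers).sum(axis=1)              # sum of the two marked values
    return x, y
```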
This task has been used previously in [51, 2, 19] for measuring the performance of a recurrent architecture in the context of remembering information seen $T$ time steps earlier. The input sequence consists of symbols to memorize followed by blanks and a delimiter. The blank is chosen to be the zero digit, the symbols to memorize are randomly chosen from the digits 1 through 8, and the delimiter is set to be the digit 9. The model is expected to generate an output identical to the memorized input sequence after the delimiter. A test was conducted with a fixed sequence length. The results are shown in table 1. The MCRM outperforms all other models with a loss of 8.5e-06. The learning curves are shown in figure 3.
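The copy-memory data layout described above can be sketched as follows. This follows the standard formulation of the task (sequence, blanks, delimiter, then silence); the exact lengths and the helper name are our assumptions:

```python
import numpy as np

def copy_task_batch(batch_size, T, n_copy=10, seed=0):
    """x and y both have length n_copy + (T - 1) + 1 + n_copy = T + 2*n_copy."""
    rng = np.random.default_rng(seed)
    seq = rng.integers(1, 9, size=(batch_size, n_copy))        # digits 1..8
    blanks = np.zeros((batch_size, T - 1), dtype=np.int64)     # zero-digit filler
    delim = np.full((batch_size, 1), 9, dtype=np.int64)        # digit 9 delimiter
    tail = np.zeros((batch_size, n_copy), dtype=np.int64)
    x = np.concatenate([seq, blanks, delim, tail], axis=1)
    # the target is silent until after the delimiter, then repeats the sequence
    y = np.concatenate(
        [np.zeros((batch_size, T + n_copy), dtype=np.int64), seq], axis=1)
    return x, y
```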
This test is similar in intent to the copy memory test. In this task, MNIST dataset images are presented to the model as an input sequence of pixel intensity values. The recurrent model should be able to reconstruct the image again. This test has been used as a stress test in several recurrent-modeling works [28, 52]. From table 1, MCRM achieves an accuracy of 98.79%, outperforming every other model. Surprisingly, the LSTM and NLSTM perform poorly, unlike the GRU. We attribute the success of MCRM on this task to the GRU core inside it. The learning curves are presented in figure 4.
The Penn Treebank (PTB) is a text dataset for both character-level and word-level language modeling tasks. It is widely used for evaluating the performance of RNN architectures. The PTB is divided into train, test and validation datasets. To measure character-level task performance, bits per character (bpc) is used as the performance index; bpc is defined as the cross-entropy loss divided by $\ln 2$. The performance index for the word-level language modeling task is the perplexity (ppl), defined as the exponential of the cross-entropy loss. The results of the two tasks are reported in table 1; the reported values are from the validation dataset. When the PTB is used as a character-level language corpus, the MCRM outperforms the other models with a bpc of 1.331, exceeding the NLSTM by 0.034 bpc. The learning curves are shown in figure 5. When the PTB is used as a word-level language corpus, the MCRM performance is in between the GRU and the NLSTM. The learning curves are shown in figure 6.
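Both performance indices are simple transforms of the same cross-entropy loss. A small sketch, assuming the loss is measured in nats (the usual convention):

```python
import math

def bits_per_character(cross_entropy_nats):
    """bpc: the cross-entropy (in nats) converted to bits, i.e. divided by ln 2."""
    return cross_entropy_nats / math.log(2)

def perplexity(cross_entropy_nats):
    """ppl: the exponential of the cross-entropy."""
    return math.exp(cross_entropy_nats)
```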
| Task | RNN | GRU | LSTM | NLSTM | MCRM | Optimizer (lr), (grad clip) | Regularization |
|---|---|---|---|---|---|---|---|
| Adding | 308 | 177 | 153 | 77 | 85 | Adam (1e-3), (0.5) | NLSTM (0.01), (0.1) |
| Seq. MNIST | 384 | 222 | 192 | 108 | 97 | RMSprop (1e-3), (1.0) | NLSTM (0.25), LSTM (1e-4) |
| Copy Memory | 1800 | 1050 | 900 | 448 | 500 | RMSprop (5e-4), (1.0) | NLSTM (1e-4), (0.25) |
| Word-level PTB | 125 | 119 | 117 | 100 | 109 | SGD (30), (0.35) | - |
| Char-level PTB | 2900 | 1680 | 1050 | 920 | 1000 | Adam (1e-3), (0.15) | - |
The second value in parentheses after the optimizer's learning rate stands for gradient clipping.
To understand the internal behavior of MCRM, we performed a visual analysis of the memory cell. Following a similar approach to prior work, specific neurons of interest are monitored against an input sequence. That approach is also extended by introducing a method for selecting these neurons. The method consists of a heat map of the propagation of all neuron activation values in the memory against an input sequence (shown in figure 7). MCRM, NLSTM, LSTM and GRU are trained on the PTB character dataset, fixing the memory cell size to around 150 neurons.
In figure 7, the heat-map columns represent the steps of this visual analysis. LSTM cell states tend to have neurons that change slowly over the sequence. This is in contrast to GRU architectures, wherein neuron activation values change rapidly. The NLSTM outer cell acts in a short-term fashion, remembering small sequences, while the inner cell of the NLSTM acts in a long-term fashion to support longer sequences. The MCRM memory cell has some neurons acting in an explicitly long-term fashion and some acting in an explicitly short-term fashion. This means that, by nature, MCRM inherits both LSTM and GRU behaviors in one memory cell. This leads to better neuron utilization, which is an important advantage of MCRM. The neurons-of-interest column in figure 7 shows specific neurons, extracted from the heat-map column, whose long- or short-term behaviors support the analysis of the heat maps.
Mother Compact Recurrent Memory (MCRM) is a nested LSTM-GRU architecture. It creates a unique compact memory pattern that supports both long- and short-term behaviors. MCRMs can outperform other RNN architectures on benchmark tests. Because of their promising results, MCRMs could be used in temporal sequence modeling tasks.
This work was partially supported by the National Science Foundation under grant 1739964: CPS: Medium: Augmented reality for control of reservation-based intersections with mixed autonomous-non autonomous flows.