1 Introduction
Recent years have witnessed the great success of deep learning models. Owing to increasing computational resources and strong model capacity, neural network models have been applied in numerous applications. Among them, Recurrent Neural Networks (RNNs) williams1986learning have shown notable potential on sequence modeling tasks, e.g. speech recognition pmlrv48amodei16 and machine translation cho2014learning sutskever2014sequence, and therefore receive particular attention. With increasing exploration of RNNs, several variants, such as Long Short-Term Memory (LSTM) hochreiter1997long and the Gated Recurrent Unit (GRU) chung2015gated, have been proposed successively. The key advantage of RNNs comes from the recurrent structure, which carries out the same transition at all time steps and eventually contributes to satisfactory performance. Yet this merit holds only under the assumption that all sequences follow the same pattern, so the conventional RNN may be inappropriate when processing sequences with multiple patterns. As mentioned in goodfellow2016deep ijcai2017205, it is difficult to optimize the network when using the same parameters at all time steps under multiple-pattern scenarios. To this end, more adaptive RNN networks are required.
Recently, some extended mechanisms have been proposed to augment the adaptability of RNNs. The first is the attention mechanism bahdanau2014neural, a popular technique in machine translation. The attention mechanism suggests aligning data to capture different patterns when translating different words. Instead of aligning to different parts of the current input, our PRNN aligns the current sample to similar samples in historical data. Another attractive mechanism is the memory mechanism weston2014memory. Its basic idea is to set up an external memory for each sequence. However, most memory-based approaches build a "temporal" memory that is reset when a new sequence arrives, so this "temporal" memory can hardly capture the principal patterns across training sequences. For instance, when people read documents, their comprehension rests not only on context in the current document, but also on knowledge accumulated from previous reading and life experience. Therefore, besides the "temporal" memory, a persistent memory is needed to capture all historical principal patterns.
In this paper, an adaptive persistent memory (p-memory) augmented RNN, called PRNN, is proposed. Different from the external memories in existing works, our p-memory holds principal patterns across both the training and testing phases. The memory is accessed by content-based addressing and updated through gradient descent. Each slot of the p-memory (a column of the memory matrix) denotes one principal pattern. The relationship between memory accessing and the mixture model will be illustrated. By introducing the p-memory, our PRNN offers a stronger capacity for processing sequences with multiple patterns than conventional RNNs. Moreover, the PRNN model can be flexibly extended when prior knowledge of the data is available.
The contributions of our work are summarized as follows:

We propose a novel persistent memory and construct PRNN to adaptively process sequences with multiple patterns.

We derive content-based addressing for memory accessing, and interpret memory accessing from the mixture model perspective.

We show that our method can be easily extended by incorporating prior knowledge of the data.

We evaluate the proposed PRNN in extensive experiments, including a time series prediction task and a language modeling task. The experimental results demonstrate significant advantages of our PRNN.
The remainder of the paper is organized as follows. Section 2 reviews related work. Section 3 introduces the persistent memory in detail, together with its extension to LSTM and its combination with prior knowledge. Experimental evaluations are reported in Section 4, and Section 5 concludes with an outlook on future work.
2 Related Work
The research on RNNs can be traced back to the 1980s williams1986learning. In the past decades, a number of RNN variants have appeared, among which the most popular is Long Short-Term Memory (LSTM) hochreiter1997long zaremba2014recurrent pmlrv48amodei16. LSTM networks introduce memory cells and utilize a gating mechanism to control the information flow. Another classic RNN variant, the Gated Recurrent Unit (GRU), simplifies LSTM with a single update gate, which controls the forgetting factor and the updating factor simultaneously chung2015gated.
Recently, two advanced mechanisms, attention and memory, have appeared to extend RNNs. The attention mechanism is particularly useful in machine translation cho2014learning sutskever2014sequence, where extra data alignment is required and the similarity between the encoded source sentence and the output word is calculated. A comprehensive study of the attention mechanism can be found in NIPS2017_7181. The attention model is often used in specific scenarios (such as Seq2Seq); in contrast, our PRNN focuses on modeling multiple patterns and serves as a universal building block for RNNs.
In terms of the memory mechanism, its basic idea is to attach an external memory to each sequence weston2014memory. The Memory-Augmented Neural Network (MANN) aims to memorize recent, useful samples and then compare them with the current one in the input sequence; it is often used in meta-learning and one-shot learning tasks graves2014neural santoro2016meta. Although named after "memory", our approach differs from previous memory approaches in that a "persistent" memory is proposed. The persistent memory works across both the training and testing phases, and is used to store the principal patterns of the training sequences.
3 PRNN Model
3.1 Persistent Memory
In conventional RNNs, the hidden states are updated with a unique cell at all time steps, which can be expressed as:

h_t = f(h_{t-1}, x_t),  (1)

where $x_t$ is the input and $h_t$ the hidden state at time step $t$.
In this paper, an external persistent memory $M \in \mathbb{R}^{d_m \times K}$ (called p-memory), holding $K$ slots of dimension $d_m$, is employed to memorize the principal patterns in the training sequences, and the resulting persistent memory augmented RNN (PRNN) can flexibly process sequences with multiple patterns. The structure of PRNN is shown in Figure 1, and the hidden states in PRNN are updated by:

h_t = f(h_{t-1}, x_t, r_t),  (2)

where $r_t$ denotes the memory accessing via content-based addressing. Note that when the p-memory is introduced, the function $f$ remains similar, which means the RNN cell need not change its inner structure. In this manner, our persistent memory can equip any RNN cell, and thus is generally applicable.
3.1.1 Memory accessing
The memory matrix $M$ contains $K$ different slots $m_1, \dots, m_K$, and each slot is utilized to represent a certain pattern in the training sequences. Given an input sequence $x_1, \dots, x_T$, the hidden state $h_{t-1}$ is able to represent the subsequence $x_1, \dots, x_{t-1}$. Following content-based addressing, the similarity between $h_{t-1}$ and each slot $m_i$ of the p-memory, denoted $s_{t,i} = s(h_{t-1}, m_i)$, can be easily calculated by retrieving each column of the memory matrix. This similarity is used to produce a weight vector $w_t$, with elements computed according to a softmax:

w_{t,i} = \frac{\exp(s_{t,i})}{\sum_{j=1}^{K} \exp(s_{t,j})}.  (3)

The weight vector is the strength to amplify or attenuate the slots in the p-memory. Consequently, the memory accessing is formulated as:

r_t = \sum_{i=1}^{K} w_{t,i} m_i = M w_t.  (4)
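The addressing above can be sketched in a few lines of numpy. This is an illustrative simplification of our own, not the paper's code: the hidden state and the memory slots are given the same dimension (so no projection is needed), and cosine similarity is used.

```python
import numpy as np

def access_memory(h_prev, M, similarity):
    """Content-based addressing: softmax over per-slot similarities,
    then a convex combination of the slots."""
    K = M.shape[1]
    s = np.array([similarity(h_prev, M[:, i]) for i in range(K)])
    w = np.exp(s - s.max())
    w /= w.sum()                      # softmax weights, one per slot
    return M @ w, w                   # read vector r_t and the weights

# toy example: 8 slots of dimension 16, hidden state in the same space
rng = np.random.default_rng(0)
M = rng.normal(size=(16, 8))
h = rng.normal(size=16)
cosine = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
r, w = access_memory(h, M, cosine)
```

Since the weights are a softmax, the read vector is a convex combination of the slots, and slots most similar to the current hidden state dominate it.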
In terms of the similarity measure, two alternative approaches are suggested in this section. The first is derived from the Mahalanobis distance and defined as:

s(h_{t-1}, m_i) = -\frac{1}{2} (h_{t-1} - P m_i)^\top \Lambda (h_{t-1} - P m_i),  (5)

where $P$ is the projection matrix from the p-memory space to the hidden-state space, and $\Lambda$ denotes the precision matrix. The other measure is the cosine similarity, which is widely applied in related works graves2014neural santoro2016meta considering its robustness and computational efficiency:

s(h_{t-1}, m_i) = \frac{h_{t-1}^\top P m_i}{\|h_{t-1}\| \, \|P m_i\|}.  (6)
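Both measures are straightforward to implement. The snippet below is our own sketch of one natural reading of a Mahalanobis-derived score and a projected cosine score; the random projection matrix and identity precision matrix are purely illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_m = 6, 4
P = rng.normal(size=(d_h, d_m))   # projection from memory space to hidden space
Lam = np.eye(d_h)                 # precision matrix (identity here for simplicity)

def mahalanobis_score(h, m):
    """Negative half squared Mahalanobis distance between h and the projected slot."""
    diff = h - P @ m
    return -0.5 * diff @ Lam @ diff

def cosine_score(h, m):
    """Cosine similarity between h and the projected slot."""
    v = P @ m
    return h @ v / (np.linalg.norm(h) * np.linalg.norm(v))

h, m = rng.normal(size=d_h), rng.normal(size=d_m)
s_mah, s_cos = mahalanobis_score(h, m), cosine_score(h, m)
```

The Mahalanobis score is always non-positive (larger means closer), while the cosine score lies in [-1, 1] and ignores vector magnitudes.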
3.1.2 Memory updating
The memory in graves2014neural santoro2016meta acts as a "temporal" memory and is explicitly updated in both the training and testing phases. In this paper, the p-memory is updated in an entirely implicit manner: the updating mechanism in PRNN is based purely on gradient descent, without the read weights, write weights, and usage weights utilized in previous works. The memory matrix is updated straightforwardly during the training phase, leading to a simple training procedure for memory augmented networks.
Specifically, the procedure for updating the p-memory is based on the gradient of the loss function $L$ with respect to the $i$-th slot $m_i$ of the p-memory, which is given by^1:

\frac{\partial L}{\partial m_i} = w_{t,i} \left( g_t + \big( (m_i - r_t)^\top g_t \big) \frac{\partial s_{t,i}}{\partial m_i} \right),  (7)

where

g_t = \frac{\partial L}{\partial r_t}.  (8)

When the similarity is calculated by Eq (5), then:

\frac{\partial s_{t,i}}{\partial m_i} = P^\top \Lambda (h_{t-1} - P m_i),  (9)

and when the similarity measure follows the cosine similarity, then:

\frac{\partial s_{t,i}}{\partial m_i} = \frac{P^\top h_{t-1}}{\|h_{t-1}\| \, \|P m_i\|} - s_{t,i} \frac{P^\top P m_i}{\|P m_i\|^2}.  (10)

^1 The detailed derivation can be found in the supplementary material.
3.1.3 Mixture model perspective
As the p-memory has $K$ slots, the memory matrix intends to partition all hidden states into $K$ clusters $C_1, \dots, C_K$. Given the hidden state $h_{t-1}$, the probability that $h_{t-1}$ belongs to the $i$-th cluster is:

p(C_i \mid h_{t-1}) = \frac{\pi_i \, p(h_{t-1} \mid C_i)}{\sum_{j=1}^{K} \pi_j \, p(h_{t-1} \mid C_j)}.  (11)

Assume the hidden states are Gaussian variables; then

p(h_{t-1} \mid C_i) = \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (h_{t-1} - \mu_i)^\top \Sigma_i^{-1} (h_{t-1} - \mu_i) \right),  (12)

where $d$ is the dimension of $h_{t-1}$, and $\mu_i$ and $\Sigma_i$ are the mean and covariance matrix of component $i$ respectively.

For a uniformly distributed prior, i.e. $\pi_i = 1/K$, assume all clusters share the same covariance matrix ($\Sigma_i = \Sigma = \Lambda^{-1}$) and let $\mu_i = P m_i$; we have:

p(C_i \mid h_{t-1}) = \frac{\exp\left( -\frac{1}{2} (h_{t-1} - P m_i)^\top \Lambda (h_{t-1} - P m_i) \right)}{\sum_{j=1}^{K} \exp\left( -\frac{1}{2} (h_{t-1} - P m_j)^\top \Lambda (h_{t-1} - P m_j) \right)},  (13)

which is consistent with Eq (3) when the similarity measure follows Eq (5). Thus, the process of accessing the p-memory is indeed soft clustering, and $h_{t-1}$ is assigned to the corresponding centers:

h_{t-1} \approx \sum_{i=1}^{K} p(C_i \mid h_{t-1}) \, \mu_i = \sum_{i=1}^{K} w_{t,i} P m_i.  (14)

Let $P^{\dagger}$ be the Moore-Penrose inverse of $P$; we have:

P^{\dagger} h_{t-1} \approx \sum_{i=1}^{K} w_{t,i} m_i = r_t,  (15)

and thus the memory accessing in Eq (4) can be interpreted from the mixture model perspective.
As a matter of fact, the p-memory accessing and updating can be seen as a variant of the EM algorithm. For a given hidden state $h_{t-1}$ and the current memory matrix $M$, the E-step assigns a "responsibility" to each cluster via memory addressing. The M-step optimizes the parameters in each memory slot based on the gradient $\partial L / \partial m_i$. Rather than the explicit parameter updating of the conventional EM procedure for the Gaussian mixture model, the p-memory updating is implicit and derived from gradient descent.
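The equivalence between softmax addressing and mixture responsibilities is easy to verify numerically. In the sketch below (random centers standing in for the projected slots, a shared covariance, a uniform prior — all illustrative choices of our own), the addressing weights coincide with the Gaussian posterior responsibilities because the shared normalization constant cancels.

```python
import numpy as np

rng = np.random.default_rng(1)
d, K = 3, 4
h = rng.normal(size=d)
centers = rng.normal(size=(K, d))       # cluster centers, playing the role of P m_i
A = rng.normal(size=(d, d))
Sigma = A @ A.T + d * np.eye(d)         # shared covariance matrix
Lam = np.linalg.inv(Sigma)              # precision matrix

# addressing weights: softmax over negative half squared Mahalanobis distances
s = np.array([-0.5 * (h - mu) @ Lam @ (h - mu) for mu in centers])
w = np.exp(s - s.max())
w /= w.sum()

# mixture responsibilities with uniform prior: the Gaussian normalization
# constant is shared by all components, so it cancels in the ratio
dens = np.exp(s)
resp = dens / dens.sum()
```

Both vectors sum to one and agree elementwise, which is exactly the soft-clustering reading of memory addressing.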
Instead of Eq (5), when the cosine similarity is selected and $h_{t-1}$ follows a von Mises-Fisher distribution with probability density function:

p(h_{t-1} \mid C_i) = C_d(\kappa) \exp\left( \kappa \, \mu_i^\top \frac{h_{t-1}}{\|h_{t-1}\|} \right),  (16)

where $\|\mu_i\| = 1$ and $C_d(\kappa)$ is the normalization constant, we also have $p(C_i \mid h_{t-1}) = \exp(\kappa s_{t,i}) / \sum_j \exp(\kappa s_{t,j})$, which matches Eq (3) up to the concentration parameter $\kappa$, and the memory accessing can be represented from the mixture model perspective as well.
3.2 LSTM with Persistent Memory
As a general external memory mechanism, the persistent memory is able to equip almost all RNN models. In this section, we take LSTM as an example and illustrate how the p-memory works in practice. Since the persistent memory is added to all gates and cells, the forget gate and the input gate are revised as:

f_t = \sigma(W_f x_t + U_f h_{t-1} + V_f r_t + b_f),  (17)

i_t = \sigma(W_i x_t + U_i h_{t-1} + V_i r_t + b_i).  (18)

Then the memory cell can be updated adaptively:

\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + V_c r_t + b_c),  (19)

c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t.  (20)

Consequently, the output gate and the hidden state become:

o_t = \sigma(W_o x_t + U_o h_{t-1} + V_o r_t + b_o),  (21)

h_t = o_t \odot \tanh(c_t).  (22)
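A single step of the revised cell can be sketched in numpy. The read vector entering every gate follows the standard LSTM pattern; the exact parameterization and shapes below are our own guess at one reasonable instantiation, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(2)
dx, dh, dm = 5, 8, 6                 # input, hidden, and read-vector dimensions
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# one weight matrix per gate ('f'orget, 'i'nput, 'c'andidate, 'o'utput)
W = {g: rng.normal(scale=0.1, size=(dh, dx)) for g in "fico"}
U = {g: rng.normal(scale=0.1, size=(dh, dh)) for g in "fico"}
V = {g: rng.normal(scale=0.1, size=(dh, dm)) for g in "fico"}  # p-memory read terms
b = {g: np.zeros(dh) for g in "fico"}

def plstm_step(x, h_prev, c_prev, r):
    """One LSTM step with the read vector r added to every gate and the cell."""
    pre = {g: W[g] @ x + U[g] @ h_prev + V[g] @ r + b[g] for g in "fico"}
    f, i, o = sigmoid(pre["f"]), sigmoid(pre["i"]), sigmoid(pre["o"])
    c_tilde = np.tanh(pre["c"])
    c = f * c_prev + i * c_tilde      # adaptively updated memory cell
    h = o * np.tanh(c)
    return h, c

x = rng.normal(size=dx)
r = rng.normal(size=dm)               # read vector from the p-memory
h1, c1 = plstm_step(x, np.zeros(dh), np.zeros(dh), r)
```

Note that only the pre-activations change; the gating structure itself is untouched, which is what makes the p-memory pluggable into other recurrent cells.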
3.3 Persistent Memory with Prior Knowledge
In many practical situations, some prior or domain knowledge, either about the sample distribution or about the data pattern, is known beforehand. Such prior knowledge can provide much information and help to build more efficient persistent memories, which substantially benefits the training process.
In this subsection, we select an example in text modeling for illustration. The category information (e.g. sports, science) is often provided in advance in text modeling. To leverage the prior knowledge, we set up multiple persistent memories for the RNN. Assuming $B$ buckets (categories) exist in the training data, an independent p-memory $M^{(b)}$ ($b = 1, \dots, B$) is allocated for each bucket. Therefore, the memory accessing is extended to:

r_t = M^{(b)} w_t^{(b)},  (23)

where $b$ is the bucket of the current sequence and the weights $w_t^{(b)}$ are computed by Eq (3) against the slots of $M^{(b)}$.
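Routing by category then reduces to keeping one memory matrix per bucket and indexing by the known label before addressing as usual; a minimal sketch (names, shapes, and bucket labels are our own illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
d, K, n_buckets = 8, 4, 3
# one independent p-memory per category (bucket)
memories = {bucket: rng.normal(size=(d, K)) for bucket in range(n_buckets)}

def access_with_prior(h, bucket):
    """Select the bucket's memory by the prior label, then address by content."""
    M = memories[bucket]
    s = (M.T @ h) / (np.linalg.norm(h) * np.linalg.norm(M, axis=0))
    w = np.exp(s - s.max())
    w /= w.sum()
    return M @ w

h = rng.normal(size=d)
r_sports = access_with_prior(h, 0)    # e.g. bucket 0 = "sports"
r_science = access_with_prior(h, 1)   # e.g. bucket 1 = "science"
```

Each bucket's memory is trained only on sequences of that category, so the slots specialize to the patterns within the category.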
4 Experiments
The performance of the PRNN is evaluated on two different tasks: time series prediction and language modeling. Experiments are implemented in TensorFlow abadi2016tensorflow. Two different similarity measures are suggested in Section 3.1; considering the popularity of cosine similarity, we select this measure in all experiments for consistency. For any baseline model (e.g. RNN or LSTM), we name the persistent memory augmented model with the prefix P (e.g. PRNN or PLSTM), and the prior knowledge based memory model with the prefix PP (e.g. PPRNN or PPLSTM). When tuning the hyperparameters (e.g. the dimension and the slot number of the p-memory), an independent validation set is randomly drawn from the training set (except for PTB, where an extra validation set is provided), and the model is trained on the remaining samples. Once the hyperparameters are determined, the model is trained again on the entire training set. All experiments are run several times and the average results are reported.
4.1 Time Series Prediction
We conduct time series prediction experiments on two datasets: Power Consumption (PC) and Sales Forecast (SF).
4.1.1 Dataset

Power Consumption (PC) Lichman:2013. This dataset contains measurements of electric power consumption in one household over a period of 198 weeks^2. The global-active-power is aggregated into hourly averaged time series, and the prediction target is the global-active-power at every hour of the next day. The entire dataset is divided into two parts: a training part (dates in [2007-04-08, 2010-11-19]) and a testing part (dates in [2010-11-20, 2010-11-26]).
^2 https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption

Sales Forecast (SF). This dataset is collected from one of the largest E-commerce platforms in the world, and contains four features (i.e. browse times, customer number, price, and sales) of 1 million items from 1500 categories over 13 weeks. The target is to predict each item's total sales in the next week. As the category information is provided, it can be utilized as prior knowledge. The training set includes over 3 million samples, and the testing set contains about 2 thousand samples (the testing sample list is provided by large merchants).
4.1.2 Setup
We select LSTM as the baseline model. The LSTM is first compared with the classical ARIMA model hamilton1994time. Then the p-memory is added to the LSTM to construct our approach. The hyperparameters are tuned on the two datasets as follows:

Power Consumption (PC). For each hour, the input sequence contains the power consumption on the last 56 days. At every step, the sequence considers the values at three adjacent hours centered on the target hour. The goal is to predict the global-active-power at every hour of the next day. The LSTM has a single layer with 32 hidden units. In terms of our approach, the p-memory augmented LSTM (PLSTM) is constructed by adding a p-memory (with its dimension tuned on the validation set) to the LSTM. The models are trained for 30 epochs.

Sales Forecast (SF). For each item, the input is a sequence of the four features over the most recent 56 days, and the target is the total sales in the next week. The LSTM also has a single layer, with 64 hidden units. Similarly, a PLSTM model with a p-memory is constructed. In addition, since the category information is available as prior knowledge, we construct the prior knowledge based PLSTM (PPLSTM) and assign a p-memory to each category. The models are trained for 15 epochs.
All trainable parameters are initialized randomly from a uniform distribution within [-0.05, 0.05]. The parameters are updated through backpropagation with the Adam rule kinga2015method, and the learning rate is 0.001.
4.1.3 Results
The results are measured in Relative Mean Absolute Error (RMAE), written as:

\mathrm{RMAE} = \frac{1}{N} \sum_{i=1}^{N} \frac{|y_i - \hat{y}_i|}{y_i},  (24)

where $N$ is the number of testing samples, $y_i$ is the true value and $\hat{y}_i$ is the prediction result. All RMAE results are shown in Table 1.
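Under one common reading of relative MAE (the per-sample absolute error normalized by the true value, then averaged — our assumption about the exact normalization), the metric is a one-liner:

```python
import numpy as np

def rmae(y_true, y_pred):
    """Relative MAE: mean over samples of |y_i - yhat_i| / |y_i|."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred) / np.abs(y_true)))
```

For example, `rmae([2, 4], [1, 4])` averages the relative errors 0.5 and 0.0, giving 0.25.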
From Table 1, it is evident that our PLSTM model significantly outperforms LSTM and ARIMA on both the PC and SF datasets, which fully demonstrates the advantage of our p-memory. In particular, on the SF dataset, by utilizing the prior knowledge, the PPLSTM further enhances the prediction accuracy, which also indicates the strong adaptability of our approach.
4.2 Language Modeling
We conduct word-level prediction experiments on two datasets: Penn Treebank (PTB) and 20Newsgroup (20NG).
4.2.1 Dataset

Penn Treebank (PTB) marcus1993building. The PTB is a popular dataset in language modeling. In this experiment, we select the version provided by mikolov2010recurrent^3. The dataset consists of 1 million words, with a vocabulary of 10 thousand words.
^3 http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz

20Newsgroup (20NG) lang1995newsweeder. This dataset is originally a benchmark for text categorization, in which 20 thousand news documents are evenly categorized into 20 groups. In word-level prediction, the group information is considered as the prior knowledge. The preprocessed data can be found in 2007:phdAnaCardosoCachopo^4; it consists of 5 million words, with a vocabulary of 74 thousand words.
^4 http://ana.cachopo.org/datasets-for-single-label-text-categorization
Table 1: RMAE results on the time series prediction tasks.

Dataset | Model  | RMAE (%)
--------|--------|---------
PC      | ARIMA  | 40.2
PC      | LSTM   | 35.4
PC      | PLSTM  | 34.4
SF      | ARIMA  | 84.2
SF      | LSTM   | 56.8
SF      | PLSTM  | 52.5
SF      | PPLSTM | 42.9
Table 2: Word-level perplexity results on the language modeling tasks.

Dataset | Simple | Perp. | Complex | Perp.
--------|--------|-------|---------|------
PTB     | LSTM   | 148.6 | LARGE   | 78.4
PTB     | PLSTM  | 146.2 | PLARGE  | 78.2
20NG    | LSTM   | 178.9 | LARGE   | 109.1
20NG    | PLSTM  | 168.2 | PLARGE  | 108.4
20NG    | PPLSTM | 157.1 | PPLARGE | 105.4
4.2.2 Setup
There are two different settings for the experiments: a simple setting and a complex setting. The purpose of the simple setting is to demonstrate the benefit of the p-memory on a simple LSTM model; the complex setting extends the advantages of our approach to more complicated models.

Simple setting. In this setting, each word is embedded into a 32-dimensional representation. The baseline LSTM model has 128 hidden units. The dimension of the p-memory in PLSTM is tuned on the validation set. On the 20NG dataset, since the group information is used as prior knowledge, we construct PPLSTM and allocate a p-memory for each group. Other settings, such as the parameter initialization, optimization method and learning rate, remain the same as in the time series prediction tasks. The model is trained for 20 epochs.

Complex setting. The "large" network architecture in zaremba2014recurrent is one of the state-of-the-art models and provides a strong baseline for language modeling tasks; an open source implementation is available^5. It contains many extensions, including multiple stacked layers, dropout, gradient clipping, learning rate decay and so on. In this setting, we select the "large" network (denoted LARGE) as the baseline model. In terms of our approach, a p-memory is added to LARGE to construct the persistent memory augmented LARGE (PLARGE) model. Similarly, the prior knowledge based PLARGE (PPLARGE) is established by introducing a separate p-memory for each group on the 20NG dataset.
^5 https://github.com/tensorflow/models/blob/master/tutorials/rnn/ptb/ptb_word_lm.py
With more sophisticated models zilly2016recurrent merity2016pointer or ensembles of multiple models, lower perplexities on the word-level prediction tasks can be achieved. Here we simply focus on assessing the impact of the p-memory when added to an existing architecture, rather than on absolute state-of-the-art performance.
4.2.3 Results
All results are measured in perplexity, which is a popular metric for evaluating language models katz1987estimation. The results are summarized in Table 2.
In the simple setting, the benefit from the p-memory is significant on both the PTB and 20NG datasets, and especially on 20NG. One possible explanation is that the content in 20NG is semantically richer, so our model can more easily distinguish the various semantic patterns in the dataset. Similarly, with the assistance of prior knowledge, the efficiency of the p-memory is further improved.
The advantages of our approach persist in the complex setting. According to the results in Table 2, the LARGE model has already achieved high accuracy on both the PTB and 20NG datasets, and our approach is able to further enhance the model performance. Thus, the general superiority of our approach is demonstrated under both the simple and complex settings. In addition, another advantage of our approach appears in model convergence: by introducing the p-memory, the PLARGE model converges faster. Taking the PTB dataset as an example, the word-level perplexity on the validation set during training is shown in Figure 3. We can clearly see that the PLARGE model requires fewer epochs before it converges. Hence, beyond the superiority in accuracy, the advantage in convergence also makes our approach appealing.
As mentioned before, the memory accessing can be interpreted from the mixture model perspective, and the p-memory stores the center of each cluster (or the principal pattern) in each slot. To gain more insight into the memory matrix, we take the simple setting on the 20NG dataset as an example, and plot the change of the average Euclidean distance among all centers during training in Figure 3. The green solid line denotes the word-level perplexity on the testing set and the red dotted line is the average Euclidean distance. As we can see, all slots are close to each other after initialization at the beginning of training, and they gradually become isolated as the model converges, which means each cluster center represents a distinct pattern in the training data.
5 Conclusion
Conventional RNNs are limited in adaptively processing sequences with multiple patterns. In this paper, a novel PRNN approach is proposed, in which an external p-memory is introduced to memorize the principal patterns in the training sequences. Through content-based accessing, the PRNN applies an adaptive transition at each time step. The p-memory is updated by gradient descent, and the memory accessing can be interpreted as a mixture model. Moreover, the proposed approach can easily incorporate prior knowledge of the data. Experiments on time series prediction and language modeling tasks demonstrate the superiority and effectiveness of our PRNN method.
The proposed p-memory is a universal block, and we look forward to applying it to other types of neural networks, such as feedforward networks and convolutional neural networks. Another interesting topic for further study is the memory updating mechanism. The gradient descent based updating mechanism works very well in practice, and it simplifies end-to-end training. However, it is still worth exploring more appropriate updating mechanisms and comparing them with existing mechanisms comprehensively.
Appendix A Supplementary Material
A.1 Derivation of Updating the Persistent Memory
According to the chain rule of calculus, we have:

\frac{\partial L}{\partial m_i} = \left( \frac{\partial r_t}{\partial m_i} \right)^{\top} \frac{\partial L}{\partial r_t}.  (25)

$\partial L / \partial r_t$ is the gradient of the final loss function with respect to the p-memory accessing, and can be calculated by backpropagating through the RNN cell:

g_t = \frac{\partial L}{\partial r_t} = \left( \frac{\partial h_t}{\partial r_t} \right)^{\top} \frac{\partial L}{\partial h_t}.  (26)

The gradient of the strength $w_{t,j}$ with respect to the $i$-th p-memory slot is:

\frac{\partial w_{t,j}}{\partial m_i} = w_{t,j} (\delta_{ji} - w_{t,i}) \frac{\partial s_{t,i}}{\partial m_i},  (27)

where

\delta_{ji} = \begin{cases} 1, & j = i, \\ 0, & j \neq i. \end{cases}  (28)

So we only need to care about $\partial s_{t,i} / \partial m_i$. When the similarity is calculated by Eq (5):

\frac{\partial s_{t,i}}{\partial m_i} = P^\top \Lambda (h_{t-1} - P m_i),  (29)

and when the similarity is calculated by Eq (6):

\frac{\partial s_{t,i}}{\partial m_i} = \frac{P^\top h_{t-1}}{\|h_{t-1}\| \, \|P m_i\|} - s_{t,i} \frac{P^\top P m_i}{\|P m_i\|^2}.  (30)

Then we get:

\frac{\partial r_t}{\partial m_i} = w_{t,i} I + \sum_{j=1}^{K} m_j \left( \frac{\partial w_{t,j}}{\partial m_i} \right)^{\top} = w_{t,i} \left( I + (m_i - r_t) \, u_i^\top \right),  (31)

where $u_i = \partial s_{t,i} / \partial m_i$ is given by

u_i = P^\top \Lambda (h_{t-1} - P m_i)  (32)

or

u_i = \frac{P^\top h_{t-1}}{\|h_{t-1}\| \, \|P m_i\|} - s_{t,i} \frac{P^\top P m_i}{\|P m_i\|^2}.  (33)

Finally, we get the gradient of the final loss function with respect to the $i$-th p-memory slot by substituting Eq (26) and Eq (31) into Eq (25):

\frac{\partial L}{\partial m_i} = w_{t,i} \left( g_t + \alpha_i u_i \right),  (34)

where

\alpha_i = (m_i - r_t)^\top g_t.  (35)
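A gradient of this shape can be sanity-checked numerically. The snippet below builds a toy read r = M softmax(s) with cosine similarity (the projection matrix taken as the identity, a simplification of ours), uses a random vector c in place of the upstream gradient dL/dr_t, and compares the closed-form slot gradient against central finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 4, 3
M = rng.normal(size=(d, K))          # p-memory slots as columns
h = rng.normal(size=d)               # previous hidden state
c = rng.normal(size=d)               # stand-in for dL/dr_t

def forward(M):
    norm_h = np.linalg.norm(h)
    s = np.array([h @ M[:, i] / (norm_h * np.linalg.norm(M[:, i]))
                  for i in range(K)])
    w = np.exp(s - s.max())
    w /= w.sum()
    r = M @ w
    return s, w, r, float(c @ r)     # toy loss L = c . r

s, w, r, L = forward(M)

# closed-form gradient: dL/dm_i = w_i * (c + ((m_i - r) . c) * ds_i/dm_i)
grad = np.zeros_like(M)
norm_h = np.linalg.norm(h)
for i in range(K):
    m_i = M[:, i]
    norm_m = np.linalg.norm(m_i)
    u_i = h / (norm_h * norm_m) - s[i] * m_i / norm_m**2   # ds_i/dm_i (cosine)
    grad[:, i] = w[i] * (c + ((m_i - r) @ c) * u_i)

# central finite differences over every memory entry
eps = 1e-6
num = np.zeros_like(M)
for idx in np.ndindex(*M.shape):
    Mp, Mm = M.copy(), M.copy()
    Mp[idx] += eps
    Mm[idx] -= eps
    num[idx] = (forward(Mp)[3] - forward(Mm)[3]) / (2 * eps)
```

The two gradients agree to within numerical precision, which is the usual way to validate such a derivation before trusting an implementation.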
References
 [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, et al. Tensorflow: Largescale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
 [2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
 [3] A. Cardoso-Cachopo. Improving methods for single-label text categorization. PhD Thesis, Instituto Superior Tecnico, Universidade Tecnica de Lisboa, 2007.
 [4] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, et al. Learning phrase representations using rnn encoderdecoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

 [5] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Gated feedback recurrent neural networks. In Proceedings of the International Conference on Machine Learning, pages 2067–2075, 2015.
 [6] D. Amodei, S. Ananthanarayanan, R. Anubhai, et al. Deep speech 2: End-to-end speech recognition in English and Mandarin. In Proceedings of the International Conference on Machine Learning, pages 173–182, 2016.
 [7] I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. MIT press, 2016.
 [8] A. Graves, G. Wayne, and I. Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.
 [9] J. D. Hamilton. Time series analysis, volume 2. Princeton University Press, 1994.
 [10] S. Hochreiter and J. Schmidhuber. Long shortterm memory. Neural Computation, 9(8):1735–1780, 1997.
 [11] S. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(3):400–401, 1987.
 [12] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
 [13] K. Lang. Newsweeder: Learning to filter netnews. In Proceedings of the International Conference on Machine Learning, pages 331–339, 1995.
 [14] M. Lichman. UCI machine learning repository, 2013.
 [15] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
 [16] S. Merity, C. Xiong, J. Bradbury, and R. Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
 [17] T. Mikolov, M. Karafiát, L. Burget, J. Cernockỳ, and S. Khudanpur. Recurrent neural network based language model. In Interspeech, volume 2, page 3, 2010.
 [18] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap. Metalearning with memoryaugmented neural networks. In International Conference on Machine Learning, pages 1842–1850, 2016.
 [19] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
 [20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, et al. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, pages 6000–6010. 2017.
 [21] J. Weston, S. Chopra, and A. Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014.
 [22] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
 [23] W. Zaremba, I. Sutskever, and O. Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.

 [24] C. Zhu, L. Wang, and G. de Melo. Multiple-weight recurrent neural networks. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 1483–1489, 2017.
 [25] J. G. Zilly, R. K. Srivastava, J. Koutník, and J. Schmidhuber. Recurrent highway networks. arXiv preprint arXiv:1607.03474, 2016.