Recent years have witnessed the great success of deep learning models. Owing to increasing computation resources and strong model capacity, neural network models have been applied in numerous applications. Among them, Recurrent Neural Networks (RNNs) williams1986learning have shown notable potential on sequence modeling tasks, e.g., speech recognition pmlr-v48-amodei16 and machine translation cho2014learning sutskever2014sequence, and therefore receive particular attention. With increasing exploration of RNNs, several variants, such as Long Short-Term Memory (LSTM) hochreiter1997long and Gated Recurrent Unit (GRU) chung2015gated, have been proposed successively.
The key advantage of RNNs comes from the recurrent structure, which carries out the same transition at all time steps and eventually contributes to satisfactory performance. Yet this merit holds only under the assumption that all sequences follow the same pattern; the conventional RNN may be inappropriate when processing sequences with multiple patterns. As mentioned in goodfellow2016deep ijcai2017-205, it is difficult to optimize the network when the same parameters are used at all time steps under multiple-pattern scenarios. To this end, more adaptive RNN architectures are required.
Recently, some extended mechanisms have been proposed to augment the adaptability of RNNs. The first is the attention mechanism bahdanau2014neural, a popular technique in machine translation. Attention aligns data to capture different patterns when translating different words. Instead of aligning to different parts of the current input, our PRNN aligns the current sample to similar samples in historical data. Another attractive mechanism is the memory mechanism weston2014memory. Its basic idea is to set up an external memory for each sequence. However, most memory-based approaches build a “temporal” memory that is reset when a new sequence arrives, so it can hardly capture the principal patterns across training sequences. For instance, when people read documents, their comprehension rests not only on the context of the current document, but also on knowledge accumulated from previous reading and life experience. Therefore, beyond the “temporal” memory, a persistent memory is needed to capture the principal patterns in all historical data.
In this paper, an adaptive persistent memory (p-memory) augmented RNN, called PRNN, is proposed. Different from the external memories in existing works, our p-memory holds principal patterns throughout both training and testing. The memory is accessed by content-based addressing and updated through gradient descent. Each slot of the p-memory (a column of the memory matrix) denotes one principal pattern. The relationship between memory accessing and mixture models will be illustrated. By introducing the p-memory, our PRNN is better able to process sequences with multiple patterns than conventional RNNs. Moreover, the PRNN model can be flexibly extended when prior knowledge of the data is available.
The contributions of our work are summarized as follows:
We propose a novel persistent memory and construct PRNN to adaptively process sequences with multiple patterns.
We derive content-based addressing for memory accessing, and interpret memory accessing from the mixture model perspective.
We show that our method can be easily extended by combining the prior knowledge of data.
We evaluate the proposed PRNN in extensive experiments, including time series prediction and language modeling tasks. The experimental results demonstrate significant advantages of our PRNN.
The remainder of the paper is organized as follows. Section 2 reviews related work. Section 3 introduces the persistent memory in detail, together with its extension to LSTM and its combination with prior knowledge. Experimental evaluations appear in Section 4, and Section 5 concludes with directions for future work.
2 Related Work
The research on RNNs can be traced back to the 1980s williams1986learning . In the past decades, a number of variants of the RNN model have appeared, among which the most popular is the Long Short-Term Memory (LSTM) hochreiter1997long zaremba2014recurrent pmlr-v48-amodei16 . LSTM networks introduce memory cells and utilize a gating mechanism to control information flow. Another classic variant, the Gated Recurrent Unit (GRU), simplifies LSTM with a single update gate, which controls the forgetting and updating factors simultaneously chung2015gated .
Recently, two advanced mechanisms, attention and memory, have been proposed to extend RNNs. The attention mechanism is particularly useful in machine translation cho2014learning sutskever2014sequence , where extra data alignment is required and the similarity between the encoded source sentence and the output word is calculated. A comprehensive study of the attention mechanism can be found in NIPS2017_7181 . The attention model is often used in specific scenarios (such as Seq2Seq); in contrast, our PRNN focuses on modeling multiple patterns and serves as a universal building block for RNNs.
In terms of the memory mechanism, its basic idea is to attach an external memory to each sequence weston2014memory . Memory-Augmented Neural Networks (MANNs) aim to memorize recent and useful samples and compare them with the current sample in the input sequence; they are often used in meta-learning and one-shot learning tasks graves2014neural santoro2016meta . Although also named after “memory”, our approach differs from previous memory approaches in that a “persistent” memory is proposed: it works throughout both training and testing phases and stores the principal patterns of the training sequences.
3 PRNN Model
3.1 Persistent Memory
In conventional RNNs, the hidden states are updated with a unique cell at all time steps, which can be expressed as:
$$h_t = f(x_t, h_{t-1}), \qquad (1)$$
where $x_t$ is the input and $h_t$ the hidden state at time step $t$.
In this paper, an external persistent memory $\mathbf{M} \in \mathbb{R}^{d_m \times k}$ (called p-memory) is employed to memorize the principal patterns in the training sequences, and the resulting persistent memory augmented RNN (PRNN) can flexibly process sequences with multiple patterns. The structure of PRNN is shown in Figure 1, and the hidden states in PRNN are updated by:
$$h_t = f(x_t, h_{t-1}, r_t), \qquad (2)$$
where $r_t$ denotes the result of memory accessing via content-based addressing. Note that when the p-memory is introduced, the transition function $f$ remains similar, which means the RNN cell need not change its inner structure. In this manner, our persistent memory can equip any RNN cell and is thus generally applicable.
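As a concrete illustration, the augmented transition can be sketched as follows. This is a minimal NumPy sketch with a plain tanh cell and cosine addressing; all function and variable names here are our own and not part of the paper:

```python
import numpy as np

def cosine_read(M, h, eps=1e-8):
    """Content-based read: softmax over cosine similarities between the
    hidden state h and each memory slot (a column of M)."""
    sims = (M.T @ h) / (np.linalg.norm(M, axis=0) * np.linalg.norm(h) + eps)
    w = np.exp(sims - sims.max())
    w /= w.sum()
    return M @ w  # convex combination of the slots

def prnn_step(x, h_prev, M, W, b):
    """One PRNN transition: the cell is an ordinary tanh RNN whose input is
    simply extended with the memory read, so its inner structure is unchanged."""
    r = cosine_read(M, h_prev)
    z = np.concatenate([x, h_prev, r])
    return np.tanh(W @ z + b)
```

The only change relative to a conventional cell is the extra read vector concatenated to the input, which is what makes the mechanism applicable to any RNN cell.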
3.1.1 Memory accessing
The memory matrix $\mathbf{M}$ contains $k$ different slots $m_1, \dots, m_k$, and each slot is utilized to represent a certain pattern in the training sequences. Given an input sequence $(x_1, \dots, x_T)$, the hidden state $h_{t-1}$ is able to represent the subsequence $(x_1, \dots, x_{t-1})$. Following content-based addressing, the similarity between $h_{t-1}$ and each slot $m_i$ of the p-memory, denoted as $s(h_{t-1}, m_i)$, can be easily calculated by retrieving each column of the memory matrix. This similarity is used to produce a weight vector $w_t$, with elements computed according to a softmax:
$$w_{t,i} = \frac{\exp(s(h_{t-1}, m_i))}{\sum_{j=1}^{k} \exp(s(h_{t-1}, m_j))}. \qquad (3)$$
The weight vector is the strength with which the slots in the p-memory are amplified or attenuated. Consequently, the memory accessing is formulated as:
$$r_t = \sum_{i=1}^{k} w_{t,i}\, m_i = \mathbf{M} w_t. \qquad (4)$$
In terms of the similarity measure, two alternative approaches are suggested in this section. The first is derived from the Mahalanobis distance and defined as:
$$s(h_{t-1}, m_i) = -\tfrac{1}{2}\,(h_{t-1} - \mathbf{P} m_i)^{\top} \Sigma^{-1} (h_{t-1} - \mathbf{P} m_i), \qquad (5)$$
where $\mathbf{P}$ is the projection matrix from the p-memory to the hidden states and $\Sigma^{-1}$ denotes the precision matrix. The other measure is the cosine similarity, which is widely applied in related works graves2014neural santoro2016meta considering its robustness and computational efficiency:
$$s(h_{t-1}, m_i) = \frac{h_{t-1} \cdot m_i}{\|h_{t-1}\|\,\|m_i\|}. \qquad (6)$$
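The two similarity measures and the resulting addressing weights can be computed as below. This is an illustrative sketch with stand-in values; the identity choices for the projection and precision matrices are ours, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 3
h = rng.standard_normal(d)          # current hidden state
M = rng.standard_normal((d, k))     # p-memory, one pattern per column
P = np.eye(d)                       # projection from slots to hidden space (demo)
Sigma_inv = np.eye(d)               # precision matrix (identity for the demo)

# Mahalanobis-derived similarity: negative halved squared distance per slot
diffs = h[:, None] - P @ M
s_maha = -0.5 * np.einsum('ij,ik,kj->j', diffs, Sigma_inv, diffs)

# Cosine similarity per slot
s_cos = (M.T @ h) / (np.linalg.norm(M, axis=0) * np.linalg.norm(h))

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

w = softmax(s_cos)   # addressing weights
r = M @ w            # memory read: convex combination of the slots
```

Either similarity yields a proper weight vector (non-negative, summing to one), so the read is always a convex combination of the stored patterns.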
3.1.2 Memory updating
The memory in graves2014neural santoro2016meta acts as a “temporal memory” and is explicitly updated in both training and testing phases. In this paper, the p-memory is updated in an entirely implicit manner: the updating mechanism in PRNN is based purely on gradient descent, without the read, write, and usage weights utilized in previous works. The memory matrix is updated straightforwardly during the training phase, leading to a simple training procedure for memory augmented networks.
Specifically, the procedure for updating the p-memory is based on the gradient of the loss function $L$ with respect to the $i$-th slot $m_i$ of the p-memory, which is given by (the detailed derivation can be found in the supplementary material):
$$\frac{\partial L}{\partial m_i} = \frac{\partial L}{\partial r_t}\,\frac{\partial r_t}{\partial m_i}.$$
When the similarity is calculated by Eq. (5), then:
$$\frac{\partial s(h_{t-1}, m_i)}{\partial m_i} = \mathbf{P}^{\top}\,\Sigma^{-1}\,(h_{t-1} - \mathbf{P} m_i),$$
and when the similarity measure follows the cosine similarity in Eq. (6), then:
$$\frac{\partial s(h_{t-1}, m_i)}{\partial m_i} = \frac{h_{t-1}}{\|h_{t-1}\|\,\|m_i\|} - \frac{(h_{t-1} \cdot m_i)\, m_i}{\|h_{t-1}\|\,\|m_i\|^{3}}.$$
3.1.3 Mixture model perspective
As the p-memory has $k$ slots, the memory matrix intends to partition all hidden states into $k$ clusters $(C_1, \dots, C_k)$. Given the hidden state $h$, the probability that $h$ belongs to the $i$-th cluster is:
$$p(C_i \mid h) = \frac{p(h \mid C_i)\, p(C_i)}{\sum_{j=1}^{k} p(h \mid C_j)\, p(C_j)}.$$
Assume the hidden states are Gaussian variables; then
$$p(h \mid C_i) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_i|^{1/2}} \exp\!\Big(-\frac{1}{2}(h - \mu_i)^{\top} \Sigma_i^{-1} (h - \mu_i)\Big),$$
where $d$ is the dimension of $h$, and $\mu_i$ and $\Sigma_i$ are the mean and covariance matrix of component $C_i$, respectively. For a uniformly distributed prior, i.e., $p(C_i) = 1/k$, assume all clusters share the same covariance matrix ($\Sigma_i = \Sigma$) and let $\mu_i = \mathbf{P} m_i$; we have:
$$p(C_i \mid h) = \frac{\exp\!\big(-\frac{1}{2}(h - \mathbf{P} m_i)^{\top} \Sigma^{-1} (h - \mathbf{P} m_i)\big)}{\sum_{j=1}^{k} \exp\!\big(-\frac{1}{2}(h - \mathbf{P} m_j)^{\top} \Sigma^{-1} (h - \mathbf{P} m_j)\big)},$$
which coincides with the softmax weights in Eq. (3) under the similarity of Eq. (5). Let $\mathbf{P}^{+}$ be the Moore–Penrose inverse of $\mathbf{P}$; we have:
$$r_t = \sum_{i=1}^{k} p(C_i \mid h_{t-1})\, m_i = \mathbf{P}^{+} \sum_{i=1}^{k} p(C_i \mid h_{t-1})\, \mu_i,$$
and thus the memory accessing in Eq. (4) can be interpreted from the mixture model perspective.
As a matter of fact, p-memory accessing and updating can be seen as a variant of the EM algorithm. Given the hidden state $h$ and the current memory matrix $\mathbf{M}$, the E-step assigns a “responsibility” to each cluster via memory addressing, and the M-step optimizes the parameters in each memory slot based on $\partial L / \partial m_i$. Rather than the explicit parameter updating of the conventional EM procedure for Gaussian mixture models, the p-memory updating is implicit and derived from gradient descent.
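The equivalence behind this mixture-model view can be checked numerically: with a uniform prior and a shared covariance, the Gaussian posterior over clusters equals the softmax of the (halved, negated) Mahalanobis distances. The values below are stand-ins of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 3, 4
h = rng.standard_normal(d)
mu = rng.standard_normal((d, k))   # cluster means (P m_i in the text)
Sigma_inv = np.eye(d)              # shared precision matrix

# Gaussian likelihoods under a shared covariance; the normalising constant
# is identical across clusters and cancels in the posterior.
quad = np.array([(h - mu[:, i]) @ Sigma_inv @ (h - mu[:, i]) for i in range(k)])
lik = np.exp(-0.5 * quad)
posterior = lik / lik.sum()        # p(C_i | h) under a uniform prior

# Softmax over the Mahalanobis-derived similarity gives the same weights
s = -0.5 * quad
w = np.exp(s - s.max())
w /= w.sum()
```

The two weight vectors agree to machine precision, which is exactly the correspondence exploited in the mixture-model interpretation.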
3.2 LSTM with Persistent Memory
As a general external memory mechanism, the persistent memory is able to equip almost all RNN models. In this section, we take LSTM as an example and illustrate how the p-memory works in practice. Since the persistent memory is added to all gates and cells, the forget gate and the input gate are revised as:
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t, r_t] + b_f), \qquad i_t = \sigma(W_i \cdot [h_{t-1}, x_t, r_t] + b_i).$$
Then the memory cell can be updated adaptively:
$$\tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t, r_t] + b_c), \qquad c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t.$$
Consequently, the output gate and cell become:
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t, r_t] + b_o), \qquad h_t = o_t \odot \tanh(c_t).$$
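A single PLSTM step can be sketched as follows, with the memory read appended to the input of every gate. The stacked-weight layout and all names are our own illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def plstm_step(x, h_prev, c_prev, r, W, b):
    """One LSTM step where the memory read r is appended to the input of
    every gate; W stacks the four gate weight matrices row-wise."""
    d = h_prev.shape[0]
    z = np.concatenate([h_prev, x, r])
    pre = W @ z + b                 # shape (4*d,)
    f = sigmoid(pre[0:d])           # forget gate
    i = sigmoid(pre[d:2*d])         # input gate
    g = np.tanh(pre[2*d:3*d])       # candidate cell state
    o = sigmoid(pre[3*d:4*d])       # output gate
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c
```

Since the gates keep their usual form and only receive one extra input, any existing LSTM implementation can be extended this way without touching the cell's inner logic.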
3.3 Persistent Memory with Prior Knowledge
In many practical situations, prior or domain knowledge, either about the sample distribution or about the data pattern, is known beforehand. Such prior knowledge provides additional information and helps to build more efficient persistent memories, which substantially benefits the training process.
In this subsection, we select an example from text modeling for illustration. Category information (e.g., sports, science) is often provided in advance in text modeling. To leverage this prior knowledge, we set up multiple persistent memories for the RNN. Assume $B$ buckets (categories) exist in the training data; an independent p-memory $\mathbf{M}^{(b)}$ ($b = 1, \dots, B$) is allocated for each bucket. The memory accessing is therefore extended so that each sequence addresses only the p-memory of its own bucket:
$$r_t = \mathbf{M}^{(b)} w_t^{(b)},$$
where $b$ is the bucket of the current sequence.
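The bucket-wise extension amounts to selecting one memory per category before the usual content-based read. A minimal sketch, with a Python dict standing in for the collection of per-bucket memories:

```python
import numpy as np

def bucket_read(memories, bucket, h, eps=1e-8):
    """Prior-knowledge variant: one p-memory per bucket (category); a
    sequence addresses only the memory of its own bucket."""
    M = memories[bucket]            # select the memory for this sequence
    sims = (M.T @ h) / (np.linalg.norm(M, axis=0) * np.linalg.norm(h) + eps)
    w = np.exp(sims - sims.max())
    w /= w.sum()
    return M @ w
```

Each bucket's memory thus only ever sees hidden states from its own category, which lets its slots specialize to that category's patterns.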
4 Experiments
The performance of PRNN is evaluated on two different tasks: time series prediction and language modeling. All experiments are implemented in TensorFlow abadi2016tensorflow . Two measures for calculating similarity are suggested in Section 3.1; considering the popularity of cosine similarity, we select this measure in all experiments for consistency. For any baseline model (e.g., RNN or LSTM), we name the persistent memory augmented model with the prefix P (e.g., PRNN or PLSTM), and the prior knowledge based memory model with the prefix PP (e.g., PPRNN or PPLSTM).
When tuning the hyper-parameters (e.g., the dimension and the slot number of the p-memory), an independent validation set is randomly drawn from the training set (except for PTB, where an extra validation set is provided), and the model is trained on the remaining samples. Once the hyper-parameters are determined, the model is retrained on the entire training set. All experiments are run several times and the average results are reported.
4.1 Time Series Prediction
We conduct time series prediction experiments on two datasets: Power Consumption (PC) and Sales Forecast (SF).
Power Consumption (PC) Lichman:2013 . This dataset contains measurements of electric power consumption in one household over a period of 198 weeks (https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption). The global-active-power is aggregated into hourly averaged time series, and the prediction target is the global-active-power at every hour on the next day. The dataset is divided into a training part (dates in [2007-04-08, 2010-11-19]) and a testing part (dates in [2010-11-20, 2010-11-26]).
Sales Forecast (SF). This dataset is collected from one of the largest E-commerce platforms in the world, and contains four features (i.e., browse times, customer number, price and sales) of 1 million items from 1500 categories over 13 weeks. The target is to predict the total sales in the next week for each item. As the category information is provided, it can be utilized as the prior knowledge. The training set includes over 3 million samples, and the size of the testing set is about 2 thousand (the testing sample list is provided by large merchants).
We select the LSTM as the baseline model. The LSTM is first compared with the classical ARIMA model hamilton1994time . Then the p-memory is added to the LSTM to construct our approach. The hyper-parameters are tuned on two datasets as follows:
Power Consumption (PC). For each hour, the input sequence contains the power consumption of the last 56 days. At every step, the sequence considers values at three adjacent hours centered at the target hour. The goal is to predict the global-active-power at every hour on the next day. The LSTM has a single layer with 32 hidden units. In terms of our approach, the p-memory augmented LSTM (PLSTM) is constructed by adding the p-memory to the LSTM. The models are trained for 30 epochs.
Sales Forecast (SF). For each item, the input is a sequence of the four features over the most recent 56 days, and the target is the total sales in the next week. The LSTM also has a single layer with 64 hidden units. Similarly, a PLSTM model with a p-memory is constructed. In addition, since the category information serves as prior knowledge, we construct the prior knowledge based PLSTM (PPLSTM) and assign a p-memory to each category. The models are trained for 15 epochs.
All trainable parameters are initialized randomly from a uniform distribution within [-0.05, 0.05]. The parameters are updated through back propagation using Adam kinga2015method with a learning rate of 0.001.
The results are measured in Relative Mean Absolute Error (RMAE), written as:
$$\mathrm{RMAE} = \frac{\sum_{i=1}^{n} |\hat{y}_i - y_i|}{\sum_{i=1}^{n} y_i},$$
where $n$ is the number of testing samples, $y_i$ is the true value and $\hat{y}_i$ is the prediction. All RMAE results are shown in Table 2.
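The metric is straightforward to compute; a short sketch (using our reading of the metric as total absolute error normalised by total true value, since other normalisations of a relative MAE exist):

```python
import numpy as np

def rmae(y_true, y_pred):
    """Relative Mean Absolute Error: sum of absolute errors divided by the
    sum of true values over the test set."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.abs(y_pred - y_true).sum() / y_true.sum()
```

For example, predictions [1, 3] against true values [2, 2] give an RMAE of 0.5.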
From Table 2, it is evident that our PLSTM model significantly outperforms LSTM and ARIMA on both the PC and SF datasets, which fully demonstrates the advantages of our p-memory. In particular, on the SF dataset, by utilizing the prior knowledge, PPLSTM further enhances the prediction accuracy, which also indicates the strong adaptability of our approach.
4.2 Language Modeling
We conduct word-level prediction experiments on two datasets: Penn Treebank (PTB) and 20-Newsgroup (20NG).
Penn Treebank (PTB) marcus1993building . The PTB is a popular dataset in language modeling. In this experiment, we select the version provided by mikolov2010recurrent (http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz). The dataset consists of 1 million words with a vocabulary of 10 thousand words.
20-Newsgroup (20NG) lang1995newsweeder . This dataset was originally a benchmark for text categorization, in which 20 thousand news documents are evenly categorized into 20 groups. In word-level prediction, the group information is considered as the prior knowledge. The preprocessed data can be found in 2007:phd-Ana-Cardoso-Cachopo (http://ana.cachopo.org/datasets-for-single-label-text-categorization); it consists of 5 million words with a vocabulary of 74 thousand words.
There are two different settings for the experiments: simple setting and complex setting. The purpose of the simple setting is to demonstrate the benefit of p-memory based on a simple LSTM model. The complex setting extends the advantages of our approach to more complicated models.
Simple setting. In this setting, each word is embedded into a 32-dimensional representation, and the baseline LSTM model has 128 hidden units. A p-memory is added to the LSTM to construct PLSTM. On the 20NG dataset, since the group information is used as prior knowledge, we construct PPLSTM and allocate a p-memory for each group. Other settings, such as the parameter initialization, optimization method and learning rate, remain the same as in the time series prediction tasks. The model is trained for 20 epochs.
Complex setting. The “large” network architecture in zaremba2014recurrent is one of the state-of-art models and provides a strong baseline for language modeling tasks, and there is an open source implementation555https://github.com/tensorflow/models/blob/master/tutorials/rnn/ptb/ptb_word_lm.py
. It contains many extensions, including multiple layers stacking, dropout, gradient clipping, learning rate decay and so on. In this setting, we select the “large” network (named by LARGE) as the baseline model. In terms of our approach, a p-memoryis added into LARGE to construct the persistent memory augmented LARGE (PLARGE) model. Similarly, the prior knowledge based PLARGE (PPLARGE) is established by introducing a particular p-memory for each group on the 20NG dataset.
With more sophisticated models zilly2016recurrent merity2016pointer or ensembles of multiple models, lower perplexities on word-level prediction tasks can be achieved. Here we simply focus on assessing the impact of the p-memory when added to an existing architecture, rather than on absolute state-of-the-art performance.
In the simple setting, the benefit from the p-memory is significant on both the PTB and 20NG datasets, and especially on 20NG. One possible explanation is that the content of 20NG is semantically richer, so our model can more easily distinguish the various semantic patterns in the dataset. Similarly, with the assistance of prior knowledge, the efficiency of the p-memory is further improved.
The advantages of our approach persist in the complex setting. According to the results in Table 2, the LARGE model already achieves high accuracy on both the PTB and 20NG datasets, and our approach is able to further enhance its performance. Thus, the general superiority of our approach is demonstrated under both the simple and complex settings. In addition, another advantage of our approach lies in model convergence. By introducing the p-memory, the PLARGE model converges faster. Taking the PTB dataset as an example, the word-level perplexity on the validation set during training is shown in Figure 3: the PLARGE model requires fewer epochs before it converges. Hence, beyond the superiority in accuracy, the convergence behavior also makes our approach appealing.
As mentioned before, the memory accessing can be interpreted from the mixture model perspective, and the p-memory stores the center of each cluster (i.e., a principal pattern) in each slot. To gain more insight into the memory matrix, we take the simple setting on the 20NG dataset as an example and plot the change of the average Euclidean distance among all centers during training in Figure 3. The green solid line denotes the word-level perplexity on the testing set and the red dotted line the average Euclidean distance. As shown, all slots are close after initialization at the beginning of training, and they gradually become separated as the model converges, which means each cluster center comes to represent a distinct pattern in the training data.
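The diagnostic plotted above is simply the mean pairwise distance between memory slots; it can be computed as follows (a small helper of our own):

```python
import numpy as np

def mean_slot_distance(M):
    """Average Euclidean distance over all pairs of memory slots (columns of
    M); slots drifting apart indicates the stored patterns are specializing."""
    k = M.shape[1]
    dists = [np.linalg.norm(M[:, i] - M[:, j])
             for i in range(k) for j in range(i + 1, k)]
    return float(np.mean(dists))
```

Evaluating this quantity once per epoch reproduces the red dotted curve described above.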
5 Conclusion
Conventional RNNs are limited in adaptively processing sequences with multiple patterns. In this paper, a novel PRNN approach is proposed, in which an external p-memory memorizes the principal patterns in the training sequences. Through content-based accessing, PRNN applies an adaptive transition at each time step. The p-memory is updated by gradient descent, and the memory accessing can be interpreted as a mixture model. Moreover, the proposed approach can easily incorporate prior knowledge of the data. Experiments on time series prediction and language modeling tasks demonstrate the superiority and effectiveness of our PRNN method.
The proposed p-memory is a universal building block, and we look forward to applying it to other types of neural networks, such as feed-forward and convolutional neural networks. Another interesting topic for further study is the memory updating mechanism. The gradient descent based updating works well in practice and simplifies end-to-end training; nevertheless, it remains worthwhile to explore more appropriate updating mechanisms and compare them comprehensively with existing ones.
Appendix A Supplementary Material
A.1 Derivation of Updating the Persistent Memory
According to the chain rule of calculus, we have:
$$\frac{\partial L}{\partial m_i} = \frac{\partial L}{\partial r_t}\,\frac{\partial r_t}{\partial m_i}.$$
Here $\partial L / \partial r_t$ is the gradient of the final loss function with respect to the p-memory accessing $r_t$ and can be calculated by back-propagation through the RNN cell. Since $r_t = \sum_{j} w_{t,j}\, m_j$, the gradient of the strength on the $j$-th p-memory slot with respect to $m_i$ is:
$$\frac{\partial w_{t,j}}{\partial m_i} = w_{t,j}\,(\delta_{ij} - w_{t,i})\,\frac{\partial s(h_{t-1}, m_i)}{\partial m_i}.$$
So we only need to care about $\partial s(h_{t-1}, m_i)/\partial m_i$. When the similarity is calculated by Eq. (5):
$$\frac{\partial s(h_{t-1}, m_i)}{\partial m_i} = \mathbf{P}^{\top}\,\Sigma^{-1}\,(h_{t-1} - \mathbf{P} m_i),$$
and when the similarity is calculated by Eq. (6):
$$\frac{\partial s(h_{t-1}, m_i)}{\partial m_i} = \frac{h_{t-1}}{\|h_{t-1}\|\,\|m_i\|} - \frac{(h_{t-1} \cdot m_i)\, m_i}{\|h_{t-1}\|\,\|m_i\|^{3}}.$$
Combining the terms above yields $\partial L / \partial m_i$, which is used to update the p-memory by gradient descent.
-  M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
-  D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
-  A. Cardoso-Cachopo. Improving methods for single-label text categorization. PhD Thesis, Instituto Superior Tecnico, Universidade Tecnica de Lisboa, 2007.
-  K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, et al. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
-  J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Gated feedback recurrent neural networks. In Proceedings of the International Conference on Machine Learning, pages 2067–2075, 2015.
-  D. Amodei, S. Ananthanarayanan, R. Anubhai, et al. Deep speech 2: End-to-end speech recognition in english and mandarin. In Proceedings of the International Conference on Machine Learning, pages 173–182, 2016.
-  I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. MIT press, 2016.
-  A. Graves, G. Wayne, and I. Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.
-  J. D. Hamilton. Time series analysis, volume 2. Princeton University Press, 1994.
-  S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
-  S. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(3):400–401, 1987.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
-  K. Lang. Newsweeder: Learning to filter netnews. In Proceedings of the International Conference on Machine Learning, pages 331–339, 1995.
-  M. Lichman. UCI machine learning repository, 2013.
-  M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of english: The penn treebank. Computational Linguistics, 19(2):313–330, 1993.
-  S. Merity, C. Xiong, J. Bradbury, and R. Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
-  T. Mikolov, M. Karafiát, L. Burget, J. Cernockỳ, and S. Khudanpur. Recurrent neural network based language model. In Interspeech, volume 2, page 3, 2010.
-  A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning, pages 1842–1850, 2016.
-  I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
-  A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, et al. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, pages 6000–6010. 2017.
-  J. Weston, S. Chopra, and A. Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014.
-  D. Williams and G. Hinton. Learning representations by back-propagating errors. Nature, 323(6088):533–538, 1986.
-  W. Zaremba, I. Sutskever, and O. Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
-  Z. Cao, L. Wang, and G. de Melo. Multiple-weight recurrent neural networks. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 1483–1489, 2017.
-  J. G. Zilly, R. K. Srivastava, J. Koutník, and J. Schmidhuber. Recurrent highway networks. arXiv preprint arXiv:1607.03474, 2016.