1 Introduction
Recent years have witnessed the success of sequence machine learning models in tasks such as speech recognition 6638947 , machine translation NIPS2014_5346 ; Bahdanau2014 ; NIPS2017_7181 , text summarization Rush2015 ; K161028 and music generation Eck2002 . In particular, generative sequence modeling is usually framed as an unsupervised learning problem, namely the estimation of the joint probability distribution $p(x)$ of variable-length sequences $x = (x_1, \dots, x_T)$. For example, in word-level language modeling, $x$ is usually a sentence and $x_t$ is the $t$-th word in the sentence Jelinek1980 ; Bengio2003 . In image modeling, $x$ is the image and $x_t$ is the value of the $t$-th pixel in the image Oord2016 . At the core of most sequence models is the factorization of the joint distribution into conditional distributions:
$p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})$ (1)
For instance, in an $n$-gram model the conditional distribution only depends on the most recent $n-1$ elements in the sequence. In recurrent neural networks (RNNs), the conditional distribution implicitly depends on the entire history of the sequence through hidden states represented by fixed-length vectors. In self-attention networks like Transformers NIPS2017_7181 , the conditional distribution explicitly depends on the entire history of the sequence.

The way that the sequence history is exploited in the conditional distribution profoundly determines the temporal correlation in the joint probability distribution. It is obvious that the $n$-gram model cannot capture dependence longer than $n$ time steps, such as unbounded syntactic movements. A formal mathematical statement of this fact can be made through the temporal scaling property of the model: in the joint distribution generated by $n$-gram models, the mutual information between symbols decays exponentially with their temporal distance. When the correlation or the mutual information between two symbols is small enough, the model cannot distinguish the dependence between them from noise. Beyond the simple
$n$-gram model, it is known that both regular languages and hidden Markov models (HMMs) exhibit similar exponential temporal scaling behavior Li1987 ; Lin2017 .

On a seemingly separate note, there has been intense interest in the statistical analysis of natural sequences since the 1990s. It is found that the slow algebraic or power-law decay of mutual information is ubiquitous in natural sequences, including human DNA sequences Li_1992 ; Peng1992 ; PhysRevLett.68.3805 , natural languages doi:10.1142/S0218348X93000083 ; Ebeling_1994 ; doi:10.1142/S0218348X02001257 , computer programs doi:10.1142/S0218348X93000083 ; Kokol1999 , music rhythms Levitin3716 ; 6791788 , stock prices 10.2307/2938368 ; DING199383 , etc. The origin of the power-law scaling behavior is still debated and is not the focus of this paper. Nevertheless, it is clear that the exponential temporal scaling of models such as $n$-gram models and HMMs sets a fundamental limitation on their ability to capture the long-range dependence in natural sequences.
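The exponential temporal scaling of Markovian models can be checked directly in a toy example. The following sketch (with a hypothetical two-state transition matrix, not taken from this paper) computes the auto-mutual information of a stationary Markov chain exactly; the decay is geometric, governed by the second eigenvalue of the transition matrix:

```python
import math

# Exponential decay of mutual information in a two-state Markov chain.
# The transition matrix P and stationary distribution pi are illustrative
# values, not taken from the paper.
P = [[0.9, 0.1],
     [0.2, 0.8]]
pi = [2/3, 1/3]  # stationary distribution: pi = pi P

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def mutual_information(joint):
    px = [sum(row) for row in joint]
    py = [sum(joint[i][j] for i in range(2)) for j in range(2)]
    return sum(joint[i][j] * math.log(joint[i][j] / (px[i] * py[j]))
               for i in range(2) for j in range(2) if joint[i][j] > 0)

Pt = [[1.0, 0.0], [0.0, 1.0]]
mi = []
for tau in range(1, 9):
    Pt = matmul(Pt, P)                                     # P^tau
    joint = [[pi[i] * Pt[i][j] for j in range(2)] for i in range(2)]
    mi.append(mutual_information(joint))
# mi decays geometrically with ratio ~ lambda_2^2 = 0.7^2 = 0.49,
# where lambda_2 is the second eigenvalue of P.
```

The decay ratio approaches the square of the subleading eigenvalue because, for weak dependence, the mutual information is quadratic in the correlation.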
It is then natural to ask what the temporal scaling behavior is in sequence model architectures such as RNNs and self-attention networks, and why. In this paper, we study the mutual information scaling of RNNs and Transformers. We show that the mutual information decays exponentially with temporal distance, rigorously in linear RNNs and empirically in nonlinear RNNs including long short-term memories (LSTMs) doi:10.1162/neco.1997.9.8.1735 . In contrast, long-range dependence, including power-law decaying mutual information, can be captured efficiently by Transformers. This indicates that Transformers are more suitable for modeling natural sequences with power-law long-range correlations. We also discuss the connection of these results with statistical mechanics. Finally, we notice a discrepancy in statistical properties between the training and validation sets of many natural language datasets. This nonuniformity problem may prevent sequence models from learning long-range dependence.
2 Related Work
Expressive Power of RNNs
Essentially, this work studies the expressive power of RNNs and Transformers. Closely related works are Refs. NIPS2013_5166 ; Karpathy2015 ; P181027 , where different approaches or metrics are adopted to empirically study the ability of RNNs as language models to capture long-range temporal dependence. In Refs. NIPS2013_5166 ; Karpathy2015 , character-level RNNs and LSTMs are shown to be able to correctly close parentheses or braces that are far apart. In Ref. P181027 , ablation studies found that word-level LSTMs have an effective context size of around 200 tokens, but only sharply distinguish the most recent 50 tokens.
Mutual Information Diagnosis
Mutual information flow during network training is studied in Refs. Schwartzziv ; Goldfeld . Using the temporal scaling property to diagnose sequence models is rarely mentioned in the context of deep learning. The only works the author is aware of are Refs. Lin2017 ; 10.1371/journal.pone.0189326 . In Ref. Lin2017 , it is argued that in theory deep LSTMs are effective in capturing long-range correlations. Although this is a tempting proposal, shallow LSTMs are empirically found to perform as well as, if not better than, deep LSTMs Melis2017 . The empirical study in this work on multiple datasets also confirms this result. Ref. 10.1371/journal.pone.0189326 is an empirical study of natural language models and focuses mainly on "one-point" statistics like Zipf's law. In its last section, it is mentioned that LSTMs cannot reproduce the power-law decay of the autocorrelation function in natural languages, which is consistent with the findings of this work.

To the best of the author's knowledge, this work is the first to theoretically study the mutual information scaling in RNNs. It is also the first work that systematically studies the mutual information scaling in both RNNs and Transformers.
3 Mutual Information
The mutual information between two random variables $X$ and $Y$ is defined as
$I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$ (2)
It has many equivalent definitions such as $I(X;Y) = H(X) + H(Y) - H(X,Y)$, where $H(X)$ is the entropy of the random variable $X$. Roughly speaking, it measures the dependence between two random variables. Consider a discrete-time random process $\{X_t\}$. With the above definition of mutual information, the auto-mutual information of the random process is defined as $I_X(t_1, t_2) = I(X_{t_1}; X_{t_2})$. The random process is assumed to be stationary, such that the auto-mutual information only depends on the temporal distance between the random variables. In this case, the auto-mutual information can be characterized solely by the time lag $\tau$: $I_X(\tau) \equiv I(X_t; X_{t+\tau})$. In the rest of this paper, we always assume stationarity and adopt the above definition. We also use "mutual information" and "auto-mutual information" interchangeably, and drop the subscript in $I_X(\tau)$ when the underlying random variable is evident.
At least two notions of "expressive power" can be defined from the auto-mutual information: (i) the scaling behavior with the temporal distance, e.g. whether $I(\tau)$ decays algebraically or exponentially with $\tau$; (ii) the absolute magnitude of the mutual information for a given $\tau$. In this paper, we mainly focus on the first notion when $\tau$ is large, which ideally is determined by the intrinsic structure of the model. The second notion is not as universal and is also critically affected by the number of parameters in the model.
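For discrete sequences, the auto-mutual information can be estimated with a simple plug-in estimator over empirical pair frequencies. Below is a minimal sketch with two toy sanity checks (this is an illustration, not necessarily the estimator used in the experiments):

```python
import math
import random
from collections import Counter

def auto_mutual_information(seq, tau):
    """Plug-in estimate of I(x_t; x_{t+tau}) from empirical pair frequencies."""
    pairs = list(zip(seq, seq[tau:]))
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(a for a, _ in pairs)
    py = Counter(b for _, b in pairs)
    return sum((c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

# Sanity check 1: a perfectly periodic sequence, where one symbol
# determines the other, so I(tau) = ln 2 up to boundary effects.
periodic = "01" * 500

# Sanity check 2: an i.i.d. sequence, where I(tau) ~ 0 up to the
# plug-in estimator bias of order 1/n.
random.seed(0)
iid = [random.randint(0, 1) for _ in range(100000)]
```

The plug-in estimator is biased upward by roughly (number of cells)/(2n), which is why long sequences are needed to resolve small mutual information at large lags.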
4 Recurrent Neural Networks
4.1 Linear RNNs as Gaussian Processes
We start with an analytical analysis of the mutual information scaling in linear RNNs with Gaussian output. Consider the classical Elman RNN with linear activation:

$h_t = W h_{t-1} + V x_{t-1}$ (3)

$o_t = U h_t$ (4)

where $o_t$ parameterizes the probability distribution from which $x_t$ is sampled, and $h_t$ is the hidden state. In the following, we assume $x_t$ is sampled from a multivariate Gaussian distribution with mean $o_t$ and covariance matrix proportional to the identity, i.e. $x_t \sim \mathcal{N}(o_t, \sigma^2 I)$. It follows from iteratively applying Equation (3) that

$o_t = U h_t = U \sum_{k=0}^{t-1} W^{t-1-k} V x_k + U W^t h_0$ (5)
Since $o_t$ depends on the entire history of the sequence $x_0, \dots, x_{t-1}$, the random process specified by $p(x)$ is not Markovian. Therefore, it is not obvious how the mutual information decays with temporal distance. In the following, we sketch the proof that the mutual information in the above RNN decreases exponentially with time if the RNN does not simply memorize the initial condition. The full proof is presented in the Appendix.
The hidden state is often initialized as $h_0 = 0$. Under this initial condition, $x_0$ is multivariate Gaussian, and so is the joint distribution of the entire sequence $(x_0, \dots, x_t)$. In this way, the random process is a Gaussian process. Since we are interested in the mutual information between $x_{t_1}$ and $x_{t_2}$ for generic $t_1$ and $t_2$, without loss of generality we can set $t_1 = 0$ and let $p(x_0)$ be a generic multivariate Gaussian distribution. We can also let $p(x_0)$ be the distribution that is already averaged over the entire sequence. In any case, we will see that the asymptotic behavior of the mutual information is almost independent of $p(x_0)$.
We are interested in the joint distribution $p(x_0, x_t)$, hence the block covariance matrices $\Sigma_{00}$, $\Sigma_{tt}$ and $\Sigma_{0t}$. The cross-covariance $\Sigma_{0t}$ can be derived recursively as

$\Sigma_{0t} = C_t U^T$, with $C_t \equiv \mathrm{Cov}(x_0, h_t) = C_{t-1} (W + VU)^T$ (6)

which can be solved with generating functions. Define the formal power series $F(z) = \sum_{t} \Sigma_{0t} z^t$. Its closed-form expression is computed as

$F(z) = z\, \Sigma_{00} V^T \left[ I - z (W + VU)^T \right]^{-1} U^T$ (7)
The long-time asymptotic behavior of $\Sigma_{0t}$ can be analyzed by treating $F(z)$ as a function on the complex plane and studying its singularities flajolet2009analytic . Because Equation (7) is a rational function of $z$, it can be shown that the elements of $\Sigma_{0t}$ either decrease or increase exponentially with $t$, with an exponent bounded by $W$, $V$ and $U$ only, independent of the initial condition $\Sigma_{00}$. In the exponentially increasing case, the network simply remembers the initial condition, which is not desirable. Therefore, in any working network every element of $\Sigma_{0t}$ decreases exponentially in $t$.
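The exponential decay of the cross-covariance can also be checked numerically, without any generating-function machinery. Below is a Monte Carlo sketch for a scalar toy version of the linear Gaussian RNN (the weights $w$, $v$, $u$ and the sample size are illustrative assumptions, not values from the paper). Iterating the scalar recursion by hand gives $\mathrm{Cov}(x_0, x_t) \propto (w + uv)^{t-1}$, so successive covariances should shrink by a constant factor:

```python
import random
random.seed(1)

# Scalar toy linear Gaussian RNN: h_t = w*h_{t-1} + v*x_{t-1}, x_t ~ N(u*h_t, 1).
# Hand-iterating the recursion gives Cov(x_0, x_t) = u*v*(w + u*v)^(t-1),
# i.e. exponential decay with a rate set jointly by all three weights.
# All numbers are illustrative, not from the paper.
w, v, u = 0.55, 0.5, 0.5     # w + u*v = 0.8 < 1: no memorization of h_0
T, n_traj = 6, 200000
sums = [0.0] * (T + 1)       # running sums of x_0 * x_t
for _ in range(n_traj):
    h, xs = 0.0, []
    for t in range(T + 1):
        x = random.gauss(u * h, 1.0)   # sample x_t given the current hidden state
        xs.append(x)
        h = w * h + v * x              # hidden-state update
    for t in range(T + 1):
        sums[t] += xs[0] * xs[t]
cov = [s / n_traj for s in sums]       # estimates E[x_0 x_t]; the means are zero here
ratios = [cov[t + 1] / cov[t] for t in range(1, T)]
# Each ratio should be close to w + u*v = 0.8: a constant decay factor,
# i.e. exponential decay of the covariance.
```

With matrix-valued weights the same calculation goes through element-wise, with the scalar factor $w + uv$ replaced by the spectrum of the corresponding weight combination.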
The mutual information between $x_0$ and $x_t$ is computed as

$I(x_0; x_t) = \frac{1}{2} \log \frac{\det \Sigma_{00} \det \Sigma_{tt}}{\det \Sigma} \approx \frac{1}{2} \mathrm{tr}\left( \Sigma_{00}^{-1} \Sigma_{0t} \Sigma_{tt}^{-1} \Sigma_{0t}^T \right)$ (8)

where $\Sigma$ is the full covariance matrix of $(x_0, x_t)$. $\Sigma_{00}$ is time-independent. $\Sigma_{tt}$ can be proved to be non-degenerate in the $t \to \infty$ limit, because $x_t$ is sampled conditionally from a Gaussian distribution with covariance matrix $\sigma^2 I$. Therefore, $\Sigma_{tt}^{-1}$ tends to a finite constant matrix when $t$ is large. Each element of $\Sigma_{0t}$ decays exponentially with $t$. In this way, the elements of $\Sigma_{00}^{-1} \Sigma_{0t} \Sigma_{tt}^{-1} \Sigma_{0t}^T$ are exponentially small for large $t$, which justifies the last equality in Equation (8). Because the trace is a linear function, the mutual information also decreases exponentially with $t$. This finishes the proof that in any linear Elman RNN with Gaussian output that does not simply memorize the initial condition, the mutual information decays exponentially with time. We note that adding bias terms in Equations (3) and (4) does not affect the conclusion, because the mutual information of Gaussian random variables only depends on the covariance matrix, while the bias terms only affect the mean. We will discuss this result further in the Discussion section.
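For intuition on Equation (8), consider its one-dimensional analogue: for jointly Gaussian variables with correlation coefficient $\rho$, the mutual information is $-\frac{1}{2}\ln(1-\rho^2) \approx \rho^2/2$ for small $\rho$, the scalar counterpart of the trace approximation. A minimal sketch:

```python
import math

# Scalar analogue of Equation (8): for jointly Gaussian variables with
# correlation rho, I = -0.5*ln(1 - rho^2), and for small rho this is
# approximately rho^2/2 -- the 1-D version of the trace approximation.
def gaussian_mi(rho):
    return -0.5 * math.log(1.0 - rho ** 2)

rho = 0.05
exact = gaussian_mi(rho)
approx = 0.5 * rho ** 2
# If rho decays exponentially with t, then I ~ rho^2/2 decays
# exponentially too (with twice the decay exponent).
```

This makes explicit why an exponentially decaying cross-covariance forces an exponentially decaying mutual information.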
4.2 Nonlinear RNNs
4.2.1 Binary Sequence
We now study how the linear RNN result generalizes to nonlinear RNNs on symbolic sequences. Our first dataset consists of artificial binary sequences. This dataset is simple and clean, in the sense that it only contains two symbols and is strictly scaleless. The training set contains 10000 sequences of length 512, whose mutual information decays as a power law in the temporal distance. During training, we do not truncate the backpropagation through time (BPTT). After training, we unconditionally generate 2000 sequences and estimate their mutual information. The generation algorithm of the dataset, along with experiment details, is reported in the Appendix.
Vanilla RNN
It is very clear from the straight line in the semi-log plot (inset of Figure 1(a)) that $I(\tau)$ decays exponentially with $\tau$ in vanilla RNNs:

$I(\tau) \propto e^{-\tau/\xi}$ (9)

where we have defined the "correlation length" $\xi$.
If $\xi$ increased very rapidly with the width (hidden unit dimension of RNN layers) or the depth (number of RNN layers) of the network, the exponential decay would not bother us practically. However, this is not the case. The correlation length as a function of the network width is fitted in Figure 1(c). For small networks, the correlation length increases logarithmically with the hidden unit dimension. When the network becomes large enough, the correlation length saturates. The almost invisible error bars indicate the goodness of the exponential fit of the temporal mutual information. The correlation length as a function of the network depth is fitted in Figure 1(d); it increases linearly for shallow networks. Therefore, increasing the depth of vanilla RNNs is more efficient for capturing long-range temporal correlations than increasing the width. For relatively deep networks, the performance deteriorates, probably due to the increased difficulty of training. Interestingly, the network in Figure 1(a) overestimates the short-range correlation in order to compensate for the rapid exponential decay at long distances.
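The correlation length fits described above amount to a least-squares line through $(\tau, \log I(\tau))$. A minimal sketch of such a fit on synthetic data (the value $\xi = 5$ and the prefactor are arbitrary illustrations, not fitted values from the paper):

```python
import math

# Fitting the correlation length xi from I(tau) ~ exp(-tau/xi) via a
# least-squares line through (tau, log I(tau)). Synthetic data with xi = 5.
taus = list(range(1, 21))
xi_true = 5.0
mi = [0.3 * math.exp(-t / xi_true) for t in taus]

def fit_correlation_length(taus, mi):
    ys = [math.log(v) for v in mi]
    n = len(taus)
    mx, my = sum(taus) / n, sum(ys) / n
    slope = (sum((t - mx) * (y - my) for t, y in zip(taus, ys))
             / sum((t - mx) ** 2 for t in taus))
    return -1.0 / slope          # slope of log I(tau) is -1/xi

xi_fit = fit_correlation_length(taus, mi)   # recovers xi = 5 on clean data
```

On real estimates the fit would be restricted to a window of lags where the estimation error is under control, as done in the figures.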
LSTM
LSTMs perform much better at capturing long-range correlations, although the exponential decay can still be seen clearly in very small LSTMs (Figure 1(b) inset). In larger LSTMs, it is hard to distinguish the exponential decay from the algebraic decay. Nevertheless, we still fit the correlation length. Note that the fitting on the training set yields a baseline correlation length comparable to the sequence length. The correlation length also increases linearly with width in small LSTMs and then saturates. The depth dependence is similar to that of vanilla RNNs, which is consistent with previous studies showing that shallow LSTMs usually perform as well as, if not better than, deep LSTMs Melis2017 .
[Figure 1 caption fragment: ... is used due to the estimation error at long distances. The error bars represent the 95% confidence interval. (d) Same as (c) but as a function of the RNN depth. The width of all networks is 8.]
To summarize, both vanilla RNNs and LSTMs show exponentially decaying mutual information on the binary dataset. The correlation length of LSTMs has a much better scaling behavior than that of vanilla RNNs.
4.2.2 Natural Language
We extend the scaling analysis to the more realistic natural language dataset WikiText-2 Merity2016 . Since vanilla RNNs cannot even capture the long-range dependence in the simple binary dataset, we only focus on LSTMs here. During training, BPTT is truncated to 100 characters at the character level and 50 tokens at the word level. After training, we unconditionally generate a long sequence of 2 MB of characters and estimate its mutual information at the character level.
Different from binary sequences, multiple time scales exist in WikiText-2. At short distances (word level), the mutual information decays exponentially, potentially due to the arbitrariness of characters within a word. At long distances (paragraph level), the mutual information follows a power-law decay. The qualitative behavior is mixed at intermediate distances (sentence level).
Strikingly, there is a significant discrepancy between the mutual information profiles of the training and validation sets. Not only is the mutual information on the validation set much larger, but the algebraic decay at long distances is missing as well. The fact that we always pick the best model on the validation set for sequence generation may prevent the model from learning the long-range mutual information. Therefore, we should interpret results, especially from word-level models, cautiously. We also find similar nonuniformity in datasets such as Penn Treebank. See the Appendix for more details.
Character-level LSTMs can capture the short-range correlation quite well as long as the width is not too small. At intermediate distances, it seems from the inset of Figure 2(a) that all LSTMs show an exponentially decaying mutual information. In large models, the short-range mutual information is overestimated to compensate for the rapid decay at intermediate distances. No power-law decay at long distances is captured at all.
Word-level LSTMs can capture the mutual information up to intermediate distances. It is no surprise that the short-range correlation is captured well, as the word spelling is already encoded through the word embedding. There is even long-range power-law dependence in the largest model, although the mutual information is overestimated by approximately a factor of two across all distances.
In this dataset, the mutual information of LSTMs always decays to some nonzero constant instead of to zero as in binary sequences. We speculate that this is likely due to a short-range memory effect, similar to the non-decaying mutual information in repetitive sequences. The simplest example is the binary sequence where 01 and 10 are repeated with probability $p$ and $1-p$ respectively, which can be generated by a periodic HMM. It is not hard to prove that its mutual information is a constant for all temporal distances greater than one. See the Appendix for a proof.
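The repeated-block example above can be verified exactly. The sketch below (a hand-derived illustration, not the Appendix proof) averages over the unobserved phase of the block boundaries to build the joint distribution of two symbols at distance $\tau \ge 2$, and checks that the resulting mutual information is a nonzero constant independent of $\tau$:

```python
import math

# Constant mutual information in a periodic HMM: blocks "01" and "10" are
# emitted i.i.d. with probability p and 1-p, observed with a random phase.
# For any tau >= 2 the two symbols sit in different blocks; averaging over
# the (unobserved) phase yields a tau-independent joint distribution.
p = 0.7
q = 1 - p

def bern(r):
    return [r, 1 - r]          # P(symbol = 0) = r

def joint_at(tau):
    # A block starts with 0 w.p. p; the second symbol is 0 w.p. 1-p.
    # Same parity (tau even): both symbols occupy the same within-block slot;
    # opposite parity (tau odd): they occupy opposite slots.
    if tau % 2 == 0:
        phases = [(bern(p), bern(p)), (bern(q), bern(q))]
    else:
        phases = [(bern(p), bern(q)), (bern(q), bern(p))]
    return [[0.5 * sum(a[i] * b[j] for a, b in phases) for j in range(2)]
            for i in range(2)]

def mi_of_joint(joint):
    px = [sum(row) for row in joint]
    py = [sum(joint[i][j] for i in range(2)) for j in range(2)]
    return sum(joint[i][j] * math.log(joint[i][j] / (px[i] * py[j]))
               for i in range(2) for j in range(2))

mis = [mi_of_joint(joint_at(tau)) for tau in (2, 3, 4, 5)]
# All four values coincide: the mutual information is a nonzero constant.
```

The odd-distance joint distribution is just the even-distance one with one variable relabeled, which is why the mutual information is exactly the same for all $\tau \ge 2$.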
5 Self-Attention Networks
We now turn to the empirical study of the original Transformer model NIPS2017_7181 . In principle, the conditional distribution in Transformers explicitly depends on the entire sequence history. For complexity reasons, during training the lookback history is usually truncated to the most recent $n$ elements. In this sense, Transformers are like $n$-gram models with large $n$. For the purpose of this paper, the truncation will not bother us, because we are only interested in the mutual information with $\tau < n$. On WikiText-2, $n$ is limited to 512 characters at the character level and 128 tokens at the word level. However, if one is really interested in $\tau \gg n$, the result on the $n$-gram model suggests that the mutual information is bound to decay exponentially. The network width is defined as the total hidden unit dimension. For binary sequences, we use Transformers with four heads, and for WikiText-2, we use eight heads.
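To see why Transformers resemble long-range interacting systems, note that causal self-attention lets every position attend directly to every earlier position, with no built-in exponential forgetting. A minimal single-head sketch in NumPy (the dimensions and random weights are arbitrary illustrations, not the architecture used in the experiments):

```python
import numpy as np
np.random.seed(0)

# Minimal single-head causal self-attention. Every position attends directly
# to all earlier positions, so the last position "interacts" with the first
# one in a single step -- unlike the recursive hidden-state update of an RNN.
T, d = 16, 8                       # sequence length and model dimension (toy values)
x = np.random.randn(T, d)
Wq, Wk, Wv = (np.random.randn(d, d) / np.sqrt(d) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = q @ k.T / np.sqrt(d)
mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # forbid attending to the future
scores[mask] = -np.inf
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)      # softmax over the visible past
out = weights @ v
# weights[T-1, 0] > 0: direct, non-vanishing access to the distant past.
```

Because the softmax assigns strictly positive weight to every visible position, there is no architectural mechanism forcing the influence of distant tokens to decay exponentially.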
5.1 Binary Sequence
Transformers can very efficiently capture the long-range dependence in binary sequences. Even a single-layer Transformer can already capture the algebraic behavior quite well. Interestingly, the mutual information does not decay exponentially even in the simplest model. Moreover, the magnitude of the mutual information always coincides with the training data very well, while that of LSTMs is almost always slightly smaller. The correlation length fitting can be found in the Appendix, although the mutual information does not fit the exponential function well.
[Figure 2 caption fragment: (a) Binary sequences; (b) WikiText-2 at the character level; (c) WikiText-2 at the word level; (d) GPT-2 model trained on WebText with byte pair encoding (BPE) sennrichetal2016neural .]

5.2 Natural Language
The mutual information of character-level Transformers already looks very similar to that of word-level LSTMs. Both the short- and intermediate-range mutual information is captured, and there is an overestimated power-law tail at long distances for the single-layer model. Even small word-level Transformers can track the mutual information very closely up to intermediate distances. Moreover, our three- and five-layer Transformer models have a slowly decaying power-law tail, although the power does not track the training data very well. As a bonus, we evaluate the current state-of-the-art language model GPT-2 Radford2019 , where the long-range algebraic decay persists in both the small (117M parameters) and large (1542M parameters) models. Note that the magnitude of the mutual information is much larger than that in WikiText-2, probably due to the better quality of the WebText dataset. This is beneficial for training, because it is more difficult for the network to distinguish dependence from noise when the magnitude of the mutual information is small. The discrepancy between the training and validation sets of WikiText-2 may also make learning long-range dependence harder. Therefore, we speculate that the poor dataset quality is the main reason why our word-level Transformer model cannot learn the power-law behavior very well on WikiText-2.
Finally, we observe a connection between the quality of the generated text and the mutual information scaling behavior. Short-range mutual information is connected to the ability to spell words correctly, and intermediate-range mutual information is related to the ability to close "braces" in section titles correctly, or to consecutively generate two or three coherent sentences. Long-range mutual information is reflected in long paragraphs sharing a single topic. Some text samples are presented in the Appendix.
6 Discussion
RNNs encode the sequence history into a fixed-length and continuous-valued vector. In linear RNNs, after we "integrate out" the hidden state, the history dependence in Equation (5) takes a form that is exponential in time. In this way, the network always "forgets" the past exponentially fast, and thus cannot capture the power-law dependence in the sequence. In nonlinear RNNs, although we cannot analytically "integrate out" the hidden state, the experimental results suggest that RNNs still forget the history exponentially fast.
It is very tempting to connect the above result to statistical mechanics. At thermal equilibrium, systems with power-law decaying mutual information or correlation are called "critical". It is well known as van Hove's theorem that in one dimension, systems with short-range interactions cannot be critical, and the mutual information always decays exponentially VANHOVE1950137 ; ruelle1999statistical ; landau2013statistical ; Cuesta2004 . Importantly, here "short range" means the interaction strength has a finite range or decays exponentially with distance. On the other hand, one-dimensional systems with long-range interactions can be critical. A classical example is the long-range Ising model, where a finite-temperature critical point exists when the interaction strength decays sufficiently slowly as a power law of the distance dyson1969 . At exactly the critical point, the mutual information decays algebraically, and in a very vast parameter space near the critical point, the mutual information decays so slowly that no parameter fine-tuning is needed PhysRevE.91.052703 .
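Van Hove-type behavior can be illustrated with the textbook one-dimensional nearest-neighbour Ising model, whose transfer-matrix solution gives $\langle s_0 s_r \rangle = \tanh(\beta J)^r$ at zero field, i.e. a pure exponential decay at every finite temperature. A short sketch of this standard result:

```python
import math

# 1-D nearest-neighbour Ising model at zero field: the transfer matrix
# T = [[e^K, e^-K], [e^-K, e^K]] with K = beta*J has eigenvalues
# 2*cosh(K) and 2*sinh(K), and <s_0 s_r> = (lam2/lam1)^r = tanh(K)^r.
# Exponential decay at every finite temperature: never critical in 1-D.
def ising_correlation(beta_J, r):
    lam1 = 2.0 * math.cosh(beta_J)   # largest transfer-matrix eigenvalue
    lam2 = 2.0 * math.sinh(beta_J)   # subleading eigenvalue
    return (lam2 / lam1) ** r        # = tanh(beta_J)^r

corrs = [ising_correlation(1.0, r) for r in range(1, 6)]
# Successive ratios are constant (= tanh(1)), i.e. a pure exponential;
# the correlation length is xi = -1/log(tanh(beta_J)), finite for all beta.
```

This is the statistical-mechanics counterpart of the exponential forgetting in RNNs: short-range interactions in one dimension can only produce exponentially decaying correlations.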
Linear RNNs resemble statistical mechanical systems with exponentially decaying interactions; therefore they cannot accommodate long-range correlations. Transformers exploit the entire sequence history explicitly. In fact, there is no natural notion of "distance" in Transformers, and the distance is artificially encoded using positional encoding layers. Because Transformers resemble statistical mechanical systems with long-range interactions, it is no surprise that they can capture long-range correlations well.
An advantage of the statistical mechanical argument is its robustness. Essentially, the result identifies the universality class that a sequence model belongs to, which usually does not depend on many microscopic details. Analyzing nonlinear models and finding their universality classes is left as future work. However, one should be cautious that because the sampling is only conditioned on the past but not the future, the mapping between the Boltzmann distribution and the conditional distribution is straightforward only in limited cases. See the Appendix for a simple example.
The implication of the above result is that self-attention networks are more efficient at capturing long-range temporal dependence. However, this ability comes at the cost of extra computation. In order to generate a sequence of length $T$, RNNs only need $O(T)$ time while Transformers need $O(T^2)$ (assuming no truncation of the window size). How to maintain long-range interactions while at the same time reducing the computational complexity will be a challenging and interesting problem, even from the perspective of statistical physics. It is also interesting to see how augmenting RNNs with memory can improve the long-range scaling behavior Graves2014 ; NIPS2015_5648 ; NIPS2015_5846 ; NIPS2015_5857 ; Grave2017 .
Last but not least, the theoretical study of linear RNNs only concerns the intrinsic expressive power of the architecture. In practice, the expressive power is also heavily affected by how well the stochastic gradient descent algorithm performs on the model. The fact that Transformers are long-range interacting also helps the backpropagation algorithm chaptergradientflow2001 . It will be interesting to connect the statistical physics to the gradient flow dynamics.

7 Conclusion
This paper demonstrates, both theoretically and empirically, that RNNs are not efficient at capturing long-range temporal correlations. Self-attention models like Transformers can capture long-range correlations much more efficiently than RNNs, and can reproduce the power-law mutual information of natural language texts.
We also note the nonuniformity problem in popular natural language datasets. We believe a high-quality dataset is essential for the network to learn long-range dependence.
We hope this work provides a new perspective on understanding the expressive power of sequence models and sheds new light on improving both their architecture and training dynamics.
Acknowledgments
HS thanks Yaodong Li, Tianxiao Shen, Jiawei Yan, Pengfei Zhang, Wei Hu and YiZhuang You for valuable discussions and suggestions.
References
 (1) A. Graves, A. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6645–6649, 2013. https://doi.org/10.1109/ICASSP.2013.6638947.
 (2) I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to Sequence Learning with Neural Networks,” in Advances in Neural Information Processing Systems 27 (Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, eds.), pp. 3104–3112, Curran Associates, Inc., 2014. http://papers.nips.cc/paper/5346sequencetosequencelearningwithneuralnetworks.pdf.
 (3) D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” 2014. https://arxiv.org/abs/1409.0473.
 (4) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is All you Need,” in Advances in Neural Information Processing Systems 30 (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds.), pp. 5998–6008, Curran Associates, Inc., 2017. http://papers.nips.cc/paper/7181attentionisallyouneed.pdf.

 (5) A. M. Rush, S. Chopra, and J. Weston, "A Neural Attention Model for Abstractive Sentence Summarization," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 379–389, Association for Computational Linguistics, 2015. https://doi.org/10.18653/v1/D151044.
 (6) R. Nallapati, B. Zhou, C. dos Santos, Ç. Gülçehre, and B. Xiang, "Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond," in Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pp. 280–290, Association for Computational Linguistics, 2016. https://doi.org/10.18653/v1/K161028.
 (7) D. Eck and J. Schmidhuber, “Finding temporal structure in music: blues improvisation with LSTM recurrent networks,” in Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing, pp. 747–756, 2002. https://doi.org/10.1109/NNSP.2002.1030094.

 (8) F. Jelinek and R. Mercer, "Interpolated estimation of Markov source parameters from sparse data," in Proceedings of the Workshop on Pattern Recognition in Practice, pp. 381–397, 1980.
 (9) Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, "A Neural Probabilistic Language Model," Journal of Machine Learning Research, vol. 3, pp. 1137–1155, 2003. http://jmlr.org/papers/v3/bengio03a.html.
 (10) A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel Recurrent Neural Networks,” 2016. https://arxiv.org/abs/1601.06759.
 (11) W. Li, “Power Spectra of Regular Languages and Cellular Automata,” Complex Systems, vol. 1, pp. 107–130, 1987. https://www.complexsystems.com/abstracts/v01_i01_a08/.
 (12) H. Lin and M. Tegmark, “Critical Behavior in Physics and Probabilistic Formal Languages,” Entropy, vol. 19, no. 7, p. 299, 2017. https://doi.org/10.3390/e19070299.
 (13) W. Li and K. Kaneko, "Long-Range Correlation and Partial $1/f^\alpha$ Spectrum in a Noncoding DNA Sequence," Europhysics Letters (EPL), vol. 17, no. 7, pp. 655–660, 1992. https://doi.org/10.1209/02955075/17/7/014.
 (14) C.-K. Peng, S. V. Buldyrev, A. L. Goldberger, S. Havlin, F. Sciortino, M. Simons, and H. E. Stanley, "Long-range correlations in nucleotide sequences," Nature, vol. 356, no. 6365, pp. 168–170, 1992. https://doi.org/10.1038/356168a0.
 (15) R. F. Voss, "Evolution of long-range fractal correlations and 1/f noise in DNA base sequences," Phys. Rev. Lett., vol. 68, pp. 3805–3808, 1992. https://doi.org/10.1103/PhysRevLett.68.3805.
 (16) A. Schenkel, J. Zhang, and Y.-C. Zhang, "Long Range Correlation in Human Writings," Fractals, vol. 1, no. 1, pp. 47–57, 1993. https://doi.org/10.1142/S0218348X93000083.
 (17) W. Ebeling and T. Pöschel, "Entropy and Long-Range Correlations in Literary English," Europhysics Letters (EPL), vol. 26, no. 4, pp. 241–246, 1994. https://doi.org/10.1209/02955075/26/4/001.
 (18) M. A. Montemurro and P. A. Pury, "Long-Range Fractal Correlations in Literary Corpora," Fractals, vol. 10, no. 4, pp. 451–461, 2002. https://doi.org/10.1142/S0218348X02001257.
 (19) P. Kokol, V. Podgorelec, M. Zorman, T. Kokol, and T. Njivar, "Computer and natural language texts—A comparison based on long-range correlations," Journal of the American Society for Information Science, vol. 50, no. 14, pp. 1295–1301, 1999. https://doi.org/10.1002/(SICI)10974571(1999)50:14<1295::AIDASI4>3.0.CO;25.
 (20) D. J. Levitin, P. Chordia, and V. Menon, “Musical rhythm spectra from Bach to Joplin obey a power law,” Proceedings of the National Academy of Sciences, vol. 109, no. 10, pp. 3716–3720, 2012. https://doi.org/10.1073/pnas.1113828109.
 (21) B. Manaris, J. Romero, P. Machado, D. Krehbiel, T. Hirzel, W. Pharr, and R. B. Davis, “Zipf’s Law, Music Classification, and Aesthetics,” Computer Music Journal, vol. 29, no. 1, pp. 55–69, 2005. https://doi.org/10.1162/comj.2005.29.1.55.
 (22) A. W. Lo, "Long-Term Memory in Stock Market Prices," Econometrica, vol. 59, no. 5, pp. 1279–1313, 1991. https://doi.org/10.2307/2938368.
 (23) Z. Ding, C. W. Granger, and R. F. Engle, "A long memory property of stock market returns and a new model," Journal of Empirical Finance, vol. 1, no. 1, pp. 83–106, 1993. https://doi.org/10.1016/09275398(93)90006D.
 (24) S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997. https://doi.org/10.1162/neco.1997.9.8.1735.
 (25) M. Hermans and B. Schrauwen, “Training and Analysing Deep Recurrent Neural Networks,” in Advances in Neural Information Processing Systems 26 (C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, eds.), pp. 190–198, Curran Associates, Inc., 2013. http://papers.nips.cc/paper/5166trainingandanalysingdeeprecurrentneuralnetworks.pdf.
 (26) A. Karpathy, J. Johnson, and L. FeiFei, “Visualizing and Understanding Recurrent Networks,” 2015. https://arxiv.org/abs/1506.02078.
 (27) U. Khandelwal, H. He, P. Qi, and D. Jurafsky, “Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 284–294, Association for Computational Linguistics, 2018. https://www.aclweb.org/anthology/P181027.
 (28) R. Schwartzziv and N. Tishby, “Opening the black box of Deep Neural Networks via Information,” 2017. https://arxiv.org/abs/1703.00810.
 (29) Z. Goldfeld, E. V. D. Berg, K. Greenewald, I. Melnyk, N. Nguyen, B. Kingsbury, and Y. Polyanskiy, “Estimating Information Flow in Neural Networks,” 2018. https://arxiv.org/abs/1810.05728.
 (30) S. Takahashi and K. TanakaIshii, “Do neural nets learn statistical laws behind natural language?,” PLOS ONE, vol. 12, no. 12, 2017. https://doi.org/10.1371/journal.pone.0189326.
 (31) G. Melis, C. Dyer, and P. Blunsom, “On the State of the Art of Evaluation in Neural Language Models,” 2017. https://arxiv.org/abs/1707.05589.
 (32) P. Flajolet and R. Sedgewick, Analytic Combinatorics. Cambridge University Press, 2009.
 (33) S. Merity, C. Xiong, J. Bradbury, and R. Socher, “Pointer Sentinel Mixture Models,” 2016. https://arxiv.org/abs/1609.07843.

 (34) R. Sennrich, B. Haddow, and A. Birch, "Neural Machine Translation of Rare Words with Subword Units," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), (Berlin, Germany), pp. 1715–1725, Association for Computational Linguistics, 2016. https://doi.org/10.18653/v1/P161162.
 (35) A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, "Language Models are Unsupervised Multitask Learners," 2019. https://d4mucfpksywv.cloudfront.net/betterlanguagemodels/language_models_are_unsupervised_multitask_learners.pdf.
 (36) L. van Hove, “Sur L’intégrale de Configuration Pour Les Systèmes De Particules À Une Dimension,” Physica, vol. 16, no. 2, pp. 137–143, 1950. https://doi.org/10.1016/0031-8914(50)90072-3.
 (37) D. Ruelle, Statistical Mechanics: Rigorous Results. World Scientific, 1999.
 (38) L. Landau and E. Lifshitz, Statistical Physics, Volume 5. Elsevier Science, 2013.

 (39) J. A. Cuesta and A. Sánchez, “General Non-Existence Theorem for Phase Transitions in One-Dimensional Systems with Short Range Interactions, and Physical Examples of Such Transitions,” Journal of Statistical Physics, vol. 115, no. 3, pp. 869–893, 2004. https://doi.org/10.1023/B:JOSS.0000022373.63640.4e.
 (40) F. J. Dyson, “Existence of a phase-transition in a one-dimensional Ising ferromagnet,” Comm. Math. Phys., vol. 12, no. 2, pp. 91–107, 1969. https://projecteuclid.org:443/euclid.cmp/1103841344.
 (41) A. Colliva, R. Pellegrini, A. Testori, and M. Caselle, “Ising-model description of long-range correlations in DNA sequences,” Phys. Rev. E, vol. 91, p. 052703, 2015. https://doi.org/10.1103/PhysRevE.91.052703.
 (42) A. Graves, G. Wayne, and I. Danihelka, “Neural Turing Machines,” 2014. https://arxiv.org/abs/1410.5401.
 (43) E. Grefenstette, K. M. Hermann, M. Suleyman, and P. Blunsom, “Learning to Transduce with Unbounded Memory,” in Advances in Neural Information Processing Systems 28 (C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, eds.), pp. 1828–1836, Curran Associates, Inc., 2015. http://papers.nips.cc/paper/5648-learning-to-transduce-with-unbounded-memory.pdf.
 (44) S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus, “End-To-End Memory Networks,” in Advances in Neural Information Processing Systems 28 (C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, eds.), pp. 2440–2448, Curran Associates, Inc., 2015. http://papers.nips.cc/paper/5846-end-to-end-memory-networks.pdf.
 (45) A. Joulin and T. Mikolov, “Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets,” in Advances in Neural Information Processing Systems 28 (C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, eds.), pp. 190–198, Curran Associates, Inc., 2015. http://papers.nips.cc/paper/5857-inferring-algorithmic-patterns-with-stack-augmented-recurrent-nets.pdf.
 (46) E. Grave, A. Joulin, and N. Usunier, “Improving Neural Language Models with a Continuous Cache,” 2017. https://arxiv.org/abs/1612.04426.
 (47) S. Hochreiter, Y. Bengio, and P. Frasconi, “Gradient Flow in Recurrent Nets: the Difficulty of Learning Long-Term Dependencies,” in Field Guide to Dynamical Recurrent Networks (J. Kolen and S. Kremer, eds.), IEEE Press, 2001. https://doi.org/10.1109/9780470544037.ch14.
 (48) P. Grassberger, “Entropy Estimates from Insufficient Samplings,” 2008. https://arxiv.org/abs/physics/0307138.
 (49) M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz, “Building a Large Annotated Corpus of English: The Penn Treebank,” Computational Linguistics, vol. 19, no. 2, 1993. https://www.aclweb.org/anthology/J93-2004.
 (50) M. Mahoney, “Relationship of Wikipedia Text to Clean Text,” 2006. http://mattmahoney.net/dc/textdata.html.

 (51) G. Hinton, N. Srivastava, and K. Swersky, “Lecture 6e—RMSProp: Divide the gradient by a running average of its recent magnitude.” Coursera: Neural Networks for Machine Learning, 2012.

 (52) X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (Y. W. Teh and M. Titterington, eds.), vol. 9 of Proceedings of Machine Learning Research, (Chia Laguna Resort, Sardinia, Italy), pp. 249–256, PMLR, 2010. http://proceedings.mlr.press/v9/glorot10a.html.
 (53) A. M. Saxe, J. L. McClelland, and S. Ganguli, “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks,” 2013. https://arxiv.org/abs/1312.6120.
 (54) O. Press and L. Wolf, “Using the Output Embedding to Improve Language Models,” in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, (Valencia, Spain), pp. 157–163, Association for Computational Linguistics, 2017. https://www.aclweb.org/anthology/E17-2025.
 (55) H. Inan, K. Khosravi, and R. Socher, “Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling,” 2016. https://arxiv.org/abs/1611.01462.
 (56) D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” 2014. https://arxiv.org/abs/1412.6980.
 (57) I. Loshchilov and F. Hutter, “SGDR: Stochastic Gradient Descent with Warm Restarts,” 2016. https://arxiv.org/abs/1608.03983.
 (58) S. Merity, N. S. Keskar, and R. Socher, “Regularizing and Optimizing LSTM Language Models,” 2017. https://arxiv.org/abs/1708.02182.
Appendix A Exponential Mutual Information in Linear RNNs with Gaussian Output
In this section, we present the full proof of exponential mutual information in linear Elman RNNs with Gaussian output. For the sake of self-containedness, some of the arguments in the main text will be repeated. The proof can be roughly divided into four steps:

Derive the recurrence relations of the block covariance matrix;

Solve the recurrence relation using generating functions;

Perform asymptotic analysis by studying the singularities of generating functions;

Compute the mutual information based on the asymptotic analysis.
Problem Setup
The classical Elman RNN with the linear activation is given by:
(10) $h_t = W x_t + U h_{t-1}$
(11) $o_t = V h_t$
$o_t$ parameterizes the probability distribution from which $x_{t+1}$ is sampled, and $h_t$ is the hidden state. In this way, the shapes of the weight matrices are $W \in \mathbb{R}^{d_h \times d_x}$, $U \in \mathbb{R}^{d_h \times d_h}$, and $V \in \mathbb{R}^{d_x \times d_h}$. In the following, we assume $x_{t+1}$ is sampled from a multivariate Gaussian distribution with mean $o_t$ and covariance matrix proportional to the identity, i.e. $x_{t+1} \sim \mathcal{N}(o_t, \sigma^2 I)$. It follows from iteratively applying Equation (10) that $h_t = U^t h_0 + \sum_{s=1}^{t} U^{t-s} W x_s$. Therefore
(12) $x_{t+1} \mid x_1, \ldots, x_t \sim \mathcal{N}\left(V U^t h_0 + V \sum_{s=1}^{t} U^{t-s} W x_s,\ \sigma^2 I\right)$
The joint probability distribution of the entire sequence factorizes as
(13) 
where
(14) 
Here we have assumed is also a multivariate Gaussian random variable.
The hidden state is often initialized as . Under this initial condition, is multivariate Gaussian, and so is the joint distribution of the entire sequence . In this way, the random process is a Gaussian process. Since we are interested in the mutual information between and for some generic , without loss of generality we can set and let be a generic multivariate Gaussian distribution. We can also let be the distribution that is already averaged over the entire sequence. In any case, we will see the asymptotic behavior of the mutual information is almost independent of .
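As an aside, the generative process described in this setup is easy to simulate directly. The sketch below is illustrative only: the weight names (W, U, V), the dimensions, and the noise scale sigma are our assumptions, not quantities fixed by the proof.

```python
import numpy as np

rng = np.random.default_rng(0)

d_h, d_x, T = 8, 3, 100                      # hidden size, output size, sequence length (illustrative)
W = rng.normal(scale=0.2, size=(d_h, d_x))   # input-to-hidden weights
U = rng.normal(scale=0.2, size=(d_h, d_h))   # hidden-to-hidden weights
V = rng.normal(scale=0.2, size=(d_x, d_h))   # hidden-to-output weights
sigma = 0.5                                  # emission noise scale

h = np.zeros(d_h)                            # hidden state initialized to zero
x = rng.normal(size=d_x)                     # first symbol drawn from a generic Gaussian
xs = [x]
for t in range(1, T):
    h = W @ x + U @ h                        # linear (identity-activation) Elman update
    x = V @ h + sigma * rng.normal(size=d_x) # next symbol ~ N(V h, sigma^2 I)
    xs.append(x)
xs = np.stack(xs)                            # the sampled sequence, one row per time step
assert xs.shape == (T, d_x)
```

With the small weight scales chosen here the recurrence is contractive, so the sampled sequence stays bounded rather than memorizing the initial condition.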
Deriving Recurrence Relations
We are particularly interested in the distribution . Because a marginal of a multivariate Gaussian random variable is still multivariate Gaussian, we only need to compute the block covariance matrices and . They can be derived easily by noting , where and is independent. In this way,
(15) 
where in the second line we have used the decomposition and the independence of , and in the last line we have used the fact that
(16) 
Similarly,
(17) 
As a sanity check, Equations (15) and (17) can also be derived directly from the probability density function, Equation (13). Note that and only depend on and for . First rewrite Equation (13) in the canonical form of a multivariate Gaussian distribution:
(18) 
where and is a symmetric matrix. The last row of is
(19) 
and the second column of is
(20) 
Since , the product of the last row of and the second column of should be zero. This gives
(21) 
Similarly, by considering the last row of and the last column of , we obtain
(22) 
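The recurrences above are linear in the covariance blocks, so they can be iterated numerically. As a sketch under simplified, assumed conditions (a stationary linear-Gaussian recursion h_t = W h_{t-1} + noise with noise covariance Q; the names W, Q, P are hypothetical), the covariance obeys P_t = W P_{t-1} W^T + Q and converges to the fixed point of the discrete Lyapunov equation whenever W is a contraction:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
W = 0.6 * rng.normal(size=(d, d)) / np.sqrt(d)   # contraction: spectral radius < 1
Q = np.eye(d)                                    # noise covariance (sigma^2 = 1)

# Iterate the covariance recurrence P_t = W P_{t-1} W^T + Q.
P = np.zeros((d, d))
for _ in range(500):
    P = W @ P @ W.T + Q

# The limit solves the discrete Lyapunov equation P = W P W^T + Q.
residual = np.linalg.norm(P - (W @ P @ W.T + Q))
assert residual < 1e-10
```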
Solving Recurrence Relations
We solve the linear recurrence relation of , given by Equation (15). Define the auxiliary sequence
(23) 
and the following three formal power series:
(24)  
(25)  
(26) 
With the identity
(27) 
Equation (15) is equivalent to
(28) 
or
(29) 
To this end, we assume the square matrix admits an eigenvalue decomposition, where one factor is the diagonal matrix of eigenvalues and the other factor is an orthogonal matrix. With this assumption, we can obtain the closed-form expressions of and :
(30) 
or in matrix form
(31) 
Similarly,
(32) 
or in matrix form
(33)  
(34) 
Inserting Equations (31) and (34) into (29), we obtain the formal solution for :
(35) 
Asymptotic Analysis
The long-time asymptotic behavior of can be analyzed by treating as a function on the complex plane and studying its singularities flajolet2009analytic . Since the matrix inverse can be computed from Cramer’s rule, is a rational function of , and its singularities always occur as poles. Therefore, the asymptotic behavior of will be exponential in , with the exponent determined by the position of the pole closest to the origin. The order of the pole determines the polynomial subexponential factor.
Because we are dealing with a matrix here, each element has its own set of poles. Denote the pole closest to the origin associated with as and its order as . In this way, when is large. Unless is exactly on the unit circle, which is of zero measure in the parameter space, the solution either increases or decreases exponentially with . Even if is on the unit circle, the power of the polynomial can only be an integer. If any element in increases exponentially with , the network simply remembers the initial condition, which is not desirable. Therefore, in any working network every element of decreases exponentially in .
The pole position is bounded by only , and , independent of the initial condition . For example, consider the term proportional to in Equation (35):
(36) 
where . Equation (36) is exactly the partial fraction decomposition of a rational function. (Of course, may be decomposed further.) In this way, is the pole closest to the origin, among all poles of and for all , unless for that particular is exactly zero, in which case the resulting pole will be further away from the origin, leading to a faster decay.
We then analyze the asymptotic behavior of . The only property we need of it is non-degeneracy. Intuitively, this holds because it is sampled conditionally from a Gaussian distribution with covariance matrix . Formally, according to the Courant minimax principle, the minimal eigenvalue of is given by
(37) 
where we have inserted Equation (17) and used the fact that is positive semidefinite. In this way, , where , is the eigenvalue of the covariance matrix and is an orthogonal matrix.
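To make the pole argument concrete with a toy computation: the generating function of the matrix powers W^t is the resolvent (I - z W)^{-1}, whose pole closest to the origin sits at z = 1/rho(W), so the entries of W^t decay exponentially at rate rho(W), the spectral radius. A minimal numerical check with an assumed symmetric matrix (so that the eigendecomposition is orthogonal, matching the assumption used above):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5
# Build W with an orthogonal eigendecomposition and spectral radius 0.8.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
lam = np.array([0.8, 0.5, 0.3, -0.2, 0.1])
W = Q @ np.diag(lam) @ Q.T

# Entries of W^t decay like rho(W)^t = 0.8^t: the pole of the resolvent
# (I - z W)^{-1} closest to the origin sits at z = 1/0.8.
t = 60
Wt = np.linalg.matrix_power(W, t)
rate = np.max(np.abs(Wt)) ** (1 / t)   # empirical per-step decay rate
assert abs(rate - 0.8) < 0.1
```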
Mutual Information Computation
Let , be two multivariate Gaussian random variables with a joint distribution , where
(38) 
Using one of the definitions of the mutual information, $I(X;Y) = H(X) + H(Y) - H(X,Y)$, the entropy of multivariate Gaussian random variables, and the Schur decomposition
(39) 
we have
(40) 
Using the above formula, the mutual information between and can then be computed as
(41) 
is time-independent. Each element in decays exponentially with . Because the minimal eigenvalue of is bounded by , is well-defined and tends to a finite constant matrix . In this way, the elements in are exponentially small in the large limit, which justifies the last equality in Equation (41). Because the trace is a linear function, the mutual information decreases exponentially with . This finishes the proof that in any linear Elman RNN with Gaussian output that does not simply memorize the initial condition, the mutual information decays exponentially with time. The only technical assumption we made in the proof is that the covariance matrix decays exponentially in time instead of growing exponentially, without which the network simply memorizes the initial condition.
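The entropy-based log-determinant formula for the mutual information of jointly Gaussian variables can be verified numerically. In the sketch below (the helper name gaussian_mi is ours, not the paper's), it is applied to a stationary scalar AR(1) process, whose lag-t correlation is a^t, so that I(x_0; x_t) = -(1/2) log(1 - a^{2t}) indeed decays exponentially in t:

```python
import numpy as np

def gaussian_mi(cov, dx):
    """I(X;Y) = (1/2)(log det S_X + log det S_Y - log det S) for jointly Gaussian X, Y."""
    sx, sy = cov[:dx, :dx], cov[dx:, dx:]
    return 0.5 * (np.linalg.slogdet(sx)[1] + np.linalg.slogdet(sy)[1]
                  - np.linalg.slogdet(cov)[1])

# Stationary scalar AR(1): corr(x_0, x_t) = a^t, so the MI decays exponentially.
a = 0.8
mis = []
for t in (5, 10, 15):
    rho = a ** t
    cov = np.array([[1.0, rho], [rho, 1.0]])
    mis.append(gaussian_mi(cov, 1))

# Each 5-step increase in lag multiplies the MI by roughly a^10.
ratios = [mis[1] / mis[0], mis[2] / mis[1]]
assert all(abs(r - a ** 10) < 0.02 for r in ratios)
```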
Appendix B Binary Sequence Generation from Multivariate Gaussian
In this section, we report a method to sample fixed-length sequences of binary random variables with arbitrary designated mutual information. The method is used to generate the binary sequence dataset used in this paper.
Consider a bivariate normal random variable with mean zero and covariance matrix
(42) 
where due to the non-negativity of the variance. The condition that the covariance matrix is positive semidefinite is that . Define the bivariate Bernoulli random variable , where the sign function is applied element-wise. The mutual information between and is
(43) 
where we have used the fact that for all . Although an analytical expression for general is available, for simplicity we take and such that . Straightforward integration yields
(44)  
(45) 
The mutual information as a function of is then
(46) 
As a sanity check, when or , by the properties of the multivariate normal distribution, and become independent and so do and . This is consistent with the fact that . When or
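The construction in this appendix can be checked numerically for the zero-mean, unit-variance case. The sketch below (helper names are ours) computes the mutual information of the thresholded pair in closed form via the standard Gaussian orthant probability P(X1 > 0, X2 > 0) = 1/4 + arcsin(rho)/(2 pi), and compares it against a Monte-Carlo estimate:

```python
import numpy as np

rng = np.random.default_rng(3)

def binary_mi(rho):
    """MI of (sign(X1), sign(X2)) for zero-mean, unit-variance Gaussians with correlation rho."""
    p11 = 0.25 + np.arcsin(rho) / (2 * np.pi)   # orthant probability P(X1>0, X2>0)
    p10 = 0.5 - p11
    probs = np.array([p11, p10, p10, p11])      # joint table; both marginals are 1/2
    return float(np.sum(probs * np.log(probs / 0.25)))

# Empirical check: sample the bivariate Gaussian, threshold, estimate MI from counts.
rho, n = 0.7, 200_000
cov = np.array([[1.0, rho], [rho, 1.0]])
x = rng.multivariate_normal([0.0, 0.0], cov, size=n)
y = (x > 0).astype(int)
joint = np.histogram2d(y[:, 0], y[:, 1], bins=2)[0] / n
marg = joint.sum(1, keepdims=True) @ joint.sum(0, keepdims=True)
mi_emp = float(np.sum(joint * np.log(joint / marg)))
assert abs(mi_emp - binary_mi(rho)) < 0.01
```

Sweeping rho then gives binary pairs with any designated mutual information between 0 and log 2.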