Mutual Information Scaling and Expressive Power of Sequence Models

05/10/2019, by Huitao Shen et al. (MIT)

Sequence models assign probabilities to variable-length sequences such as natural language texts. The ability of sequence models to capture temporal dependence can be characterized by the temporal scaling of correlation and mutual information. In this paper, we study the mutual information of recurrent neural networks (RNNs), including long short-term memories, and of self-attention networks such as Transformers. Through a combination of theoretical study of linear RNNs and empirical study of nonlinear RNNs, we find that their mutual information decays exponentially in temporal distance. On the other hand, Transformers can capture long-range mutual information more efficiently, making them preferable for modeling sequences with slow power-law mutual information, such as natural languages and stock prices. We discuss the connection of these results with statistical mechanics. We also point out the non-uniformity problem in many natural language datasets. We hope this work provides a new perspective on understanding the expressive power of sequence models and sheds new light on improving their architectures.


1 Introduction

Recent years have witnessed the success of sequence machine learning models in tasks such as speech recognition 6638947 , machine translation NIPS2014_5346 ; Bahdanau2014 ; NIPS2017_7181 , text summarization Rush2015 ; K16-1028 and music generation Eck2002 . In particular, generative sequence modeling is usually framed as an unsupervised learning problem, namely the estimation of the joint probability distribution $p(x_1, \dots, x_T)$ of variable-length sequences $x = (x_1, \dots, x_T)$. For example, in word-level language modeling, $x$ is usually a sentence and $x_t$ is the $t$-th word in the sentence Jelinek1980 ; Bengio2003 . In image modeling, $x$ is an image and $x_t$ is the value of the $t$-th pixel Oord2016 .

At the core of most sequence models is the factorization of the joint distribution into conditional distributions:

$$p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1}). \qquad (1)$$

For instance, in an $N$-gram model the conditional distribution only depends on the $N-1$ most recent elements of the sequence. In recurrent neural networks (RNNs), the conditional distribution implicitly depends on the entire history of the sequence through hidden states represented by fixed-length vectors. In self-attention networks like Transformers NIPS2017_7181 , the conditional distribution explicitly depends on the entire history of the sequence.
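
To make the factorization in Equation (1) concrete, the following minimal sketch samples a sequence token by token from an arbitrary conditional distribution; the `conditional` callback and the toy bigram-style example are hypothetical illustrations, not the models studied in this paper.

```python
import numpy as np

def sample_sequence(conditional, vocab_size, length, rng=np.random.default_rng(0)):
    """Sample x_1..x_T from p(x) = prod_t p(x_t | x_1..x_{t-1}).

    `conditional(history)` is any callable returning a probability vector
    over the vocabulary given the previously sampled tokens; an N-gram
    table, an RNN step, or a Transformer forward pass all fit this interface.
    """
    history = []
    for _ in range(length):
        probs = conditional(history)            # p(x_t | x_1, ..., x_{t-1})
        token = rng.choice(vocab_size, p=probs)
        history.append(int(token))
    return history

# Toy example: a bigram-like conditional that prefers to repeat the last token.
def toy_conditional(history, vocab_size=2, stay=0.9):
    if not history:
        return np.full(vocab_size, 1.0 / vocab_size)
    probs = np.full(vocab_size, (1.0 - stay) / (vocab_size - 1))
    probs[history[-1]] = stay
    return probs

print(sample_sequence(toy_conditional, vocab_size=2, length=20))
```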

The way that the sequence history is exploited in the conditional distribution profoundly determines the temporal correlation in the joint probability distribution. It is obvious that the $N$-gram model cannot capture dependence longer than $N$ time steps, such as unbounded syntactic movements. A formal mathematical statement of this fact can be made through the temporal scaling property of the model: in the joint distribution generated by $N$-gram models, the mutual information between symbols decays exponentially in temporal distance. When the correlation or the mutual information between two symbols is small enough, the model cannot distinguish the dependence between them from noise. Beyond the simple $N$-gram model, it is known that both regular languages and hidden Markov models (HMMs) have similar exponential temporal scaling behavior Li1987 ; Lin2017 .

On a seemingly separate note, there has been intense interest in the statistical analysis of natural sequences since the 1990s. Slow algebraic, or power-law, decay of mutual information has been found to be ubiquitous in natural sequences, including human DNA sequences Li_1992 ; Peng1992 ; PhysRevLett.68.3805 , natural languages doi:10.1142/S0218348X93000083 ; Ebeling_1994 ; doi:10.1142/S0218348X02001257 , computer programs doi:10.1142/S0218348X93000083 ; Kokol1999 , music rhythms Levitin3716 ; 6791788 , stock prices 10.2307/2938368 ; DING199383 , etc. The origin of this power-law scaling behavior is still debated and is not the focus of this paper. Nevertheless, it is clear that the exponential temporal scaling in models such as $N$-gram models and HMMs sets a fundamental limitation on their ability to capture the long-range dependence in natural sequences.

It is then natural to ask what the temporal scaling behavior of sequence model architectures such as RNNs and self-attention networks is, and why. In this paper, we study the mutual information scaling of RNNs and Transformers. We show that the mutual information decays exponentially in temporal distance, rigorously in linear RNNs and empirically in nonlinear RNNs including long short-term memories (LSTMs) doi:10.1162/neco.1997.9.8.1735 . In contrast, long-range dependence, including power-law decaying mutual information, can be captured efficiently by Transformers. This indicates that Transformers are more suitable for modeling natural sequences with power-law long-range correlations. We also discuss the connection of these results with statistical mechanics. Finally, we notice a discrepancy in statistical properties between the training and validation sets of many natural language datasets. This non-uniformity problem may prevent sequence models from learning long-range dependence.

2 Related Work

Expressive Power of RNNs

Essentially, this work studies the expressive power of RNNs and Transformers. Closely related works are Refs. NIPS2013_5166 ; Karpathy2015 ; P18-1027 , where different approaches or metrics are adopted to empirically study the ability of RNNs as language models to capture long-range temporal dependence. In Refs. NIPS2013_5166 ; Karpathy2015 , character-level RNNs and LSTMs are shown to be able to correctly close parentheses or braces that are far apart. In Ref. P18-1027 , ablation studies found that word-level LSTMs have an effective context size of around 200 tokens, but only sharply distinguish the most recent 50 tokens.

Mutual Information Diagnosis

Mutual information flow during network training is studied in Refs. Schwartz-ziv ; Goldfeld . Using the temporal scaling property to diagnose sequence modeling is rarely mentioned in the context of deep learning. The only works the author is aware of are Refs. Lin2017 ; 10.1371/journal.pone.0189326 . In Ref. Lin2017 , it is argued that in theory deep LSTMs are effective in capturing long-range correlations. Although this is a tempting proposal, shallow LSTMs are empirically found to perform as well as, if not better than, deep LSTMs Melis2017 . The empirical study in this work on multiple datasets also confirms this result. Ref. 10.1371/journal.pone.0189326 is an empirical study of natural language models and focuses mainly on "one-point" statistics like Zipf's law. In its last section, it is mentioned that LSTMs cannot reproduce the power-law decay of the autocorrelation function in natural languages, which is consistent with the findings of this work.

To the best of the author's knowledge, this is the first work that theoretically studies the mutual information scaling in RNNs, and the first that systematically studies the mutual information scaling in both RNNs and Transformers.

3 Mutual Information

The mutual information between two random variables $X$ and $Y$ is defined as

$$I(X;Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}. \qquad (2)$$

It has many equivalent definitions, such as $I(X;Y) = H(X) + H(Y) - H(X,Y)$, where $H(X)$ is the entropy of the random variable $X$. Roughly speaking, it measures the dependence between two random variables. Consider a discrete-time random process $\{X_t\}$. With the above definition of mutual information, the auto-mutual information of the random process is defined as $I_X(t, t+\tau) = I(X_t; X_{t+\tau})$. The random process is assumed to be stationary, so that the auto-mutual information only depends on the temporal distance between the two random variables. In this case, the auto-mutual information can be characterized solely by the time lag $\tau$: $I_X(\tau) \equiv I(X_t; X_{t+\tau})$. In the rest of this paper, we always assume stationarity and adopt the above definition. We also use "mutual information" and "auto-mutual information" interchangeably, and drop the subscript in $I_X(\tau)$ when the underlying random variable is evident.

At least two notions of "expressive power" can be defined from the auto-mutual information: (i) the scaling behavior with the temporal distance, e.g. whether $I(\tau)$ decays algebraically or exponentially with $\tau$; (ii) the absolute magnitude of the mutual information for a given $\tau$. In this paper, we mainly focus on the first notion when $\tau$ is large, which ideally is determined by the intrinsic structure of the model. The second notion is not as universal and is also critically affected by the number of parameters in the model.
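
For reference, a minimal plug-in estimator of the auto-mutual information of a symbolic sequence, in the spirit of the estimates reported below, can be written as follows. This is an illustrative sketch only; the naive plug-in estimate is biased at long distances, which is why bias-corrected entropy estimators such as Grassberger's (see the references) are commonly used in practice.

```python
import numpy as np
from collections import Counter

def auto_mutual_information(seq, lag):
    """Plug-in estimate of I(X_t; X_{t+lag}) from a single symbolic sequence.

    Uses empirical frequencies of single symbols and of symbol pairs at the
    given lag. For short sequences this estimator is biased; corrections
    such as Grassberger's entropy estimator can be applied on top of it.
    """
    pairs = list(zip(seq[:-lag], seq[lag:]))
    joint = Counter(pairs)
    left = Counter(x for x, _ in pairs)
    right = Counter(y for _, y in pairs)
    n = len(pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        p_x = left[x] / n
        p_y = right[y] / n
        mi += p_xy * np.log2(p_xy / (p_x * p_y))
    return mi  # in bits

# Example: an i.i.d. fair-coin sequence should have mutual information near zero.
rng = np.random.default_rng(0)
seq = rng.integers(0, 2, size=100000).tolist()
print(auto_mutual_information(seq, lag=10))
```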

4 Recurrent Neural Networks

4.1 Linear RNNs as Gaussian Processes

We start with an analytical analysis of the mutual information scaling in linear RNNs with Gaussian output. Consider the classical Elman RNN with the linear activation:

$$h_t = W_{hh} h_{t-1} + W_{xh} x_{t-1}, \qquad (3)$$
$$o_t = W_{ho} h_t, \qquad (4)$$

where $o_t$ parameterizes the probability distribution from which $x_t$ is sampled, and $h_t$ is the hidden state. In the following, we assume $x_t$ is sampled from a multivariate Gaussian distribution with mean $o_t$ and covariance matrix proportional to the identity, i.e. $x_t \sim \mathcal{N}(o_t, \sigma^2 I)$. It follows from iteratively applying Equation (3) that

$$h_t = W_{hh}^{\,t} h_0 + \sum_{s=1}^{t} W_{hh}^{\,t-s} W_{xh}\, x_{s-1}. \qquad (5)$$

Since $h_t$ depends on the entire history $x_0, \dots, x_{t-1}$, the random process specified by $\{x_t\}$ is not Markovian. Therefore, it is not obvious how the mutual information decays with temporal distance. In the following, we sketch the proof that the mutual information in the above RNN decreases exponentially with time if the RNN does not simply memorize the initial condition. The full proof is presented in the Appendix.

The hidden state is often initialized as $h_0 = 0$. Under this initial condition, $x_0$ is multivariate Gaussian, and so is the joint distribution of the entire sequence. In this way, the random process is a Gaussian process. Since we are interested in the mutual information between $x_t$ and $x_{t+\tau}$ for some generic $t$, without loss of generality we can set $t = 0$ and let $p(x_0)$ be a generic multivariate Gaussian distribution. We can also let $p(x_0)$ be the distribution that is already averaged over the entire sequence. In any case, we will see that the asymptotic behavior of the mutual information is almost independent of $p(x_0)$.

We are interested in the distribution $p(x_0, x_t)$, hence in the block covariance matrices $\Sigma_{0t} = \mathrm{Cov}(x_0, x_t)$ and $\Sigma_{tt} = \mathrm{Cov}(x_t, x_t)$. $\Sigma_{0t}$ can be derived recursively as

(6)

which can be solved with generating functions. Define the formal power series $F(z) = \sum_{t \ge 0} \Sigma_{0t}\, z^t$. Its closed-form expression is computed as

(7)

The long-time asymptotic behavior of $\Sigma_{0t}$ can be analyzed by treating $F(z)$ as a function on the complex plane and studying its singularities flajolet2009analytic . Because Equation (7) is a rational function of $z$, it can be shown that the elements of $\Sigma_{0t}$ either decrease or increase exponentially with $t$, and the exponent is bounded in terms of $W_{hh}$, $W_{xh}$ and $W_{ho}$ only, independent of the initial condition $p(x_0)$. In the exponentially increasing case, $x_t$ simply remembers the initial condition, which is not desirable. Therefore, in any working network every element of $\Sigma_{0t}$ decreases exponentially in $t$.
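
The step from a rational generating function to exponential decay is a standard transfer theorem from analytic combinatorics flajolet2009analytic ; stated here in scalar form as a reminder (generic coefficients $a_t$, not the paper's exact notation): if $F(z) = \sum_{t \ge 0} a_t z^t$ is rational and its pole closest to the origin is $z_0$ with order $m$, then

$$|a_t| \sim C\, t^{\,m-1}\, |z_0|^{-t}, \qquad t \to \infty,$$

up to an oscillatory factor when $z_0$ is complex, so the coefficients decay exponentially when $|z_0| > 1$ and grow exponentially when $|z_0| < 1$.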

The mutual information between $x_0$ and $x_t$ is computed as

$$I(x_0; x_t) = -\frac{1}{2}\log\det\!\left(I - \Sigma_{00}^{-1}\Sigma_{0t}\Sigma_{tt}^{-1}\Sigma_{t0}\right) \approx \frac{1}{2}\operatorname{tr}\!\left(\Sigma_{00}^{-1}\Sigma_{0t}\Sigma_{tt}^{-1}\Sigma_{t0}\right). \qquad (8)$$

$\Sigma_{00}$ is time-independent. $\Sigma_{tt}$ can be proved to be non-degenerate in the large-$t$ limit, because $x_t$ is sampled conditionally from a Gaussian distribution with covariance matrix $\sigma^2 I$. Therefore, $\Sigma_{tt}^{-1}$ tends to a finite constant matrix when $t$ is large. Each element in $\Sigma_{0t}$ decays exponentially with $t$. In this way, every element of $\Sigma_{00}^{-1}\Sigma_{0t}\Sigma_{tt}^{-1}\Sigma_{t0}$ is exponentially small as $t$ increases, which justifies the last (approximate) equality in Equation (8). Because the trace is a linear function, the mutual information also decreases exponentially with $t$. This finishes the proof that, in any linear Elman RNN with Gaussian output that does not simply memorize the initial condition, the mutual information decays exponentially with time. We note that adding bias terms in Equations (3) and (4) does not affect the conclusion, because the mutual information of Gaussian random variables only depends on the covariance matrix, while the bias terms only affect the mean. We discuss this result further in the Discussion section.
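
A quick numerical check of this result can be run in a few lines of NumPy: simulate many sequences from a randomly initialized, stable linear Elman RNN with Gaussian output, estimate the covariance blocks, and evaluate the Gaussian mutual information formula of Equation (8). The dimensions, weight scales, and sample sizes below are arbitrary illustrative choices, not the paper's experimental settings.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_x, sigma = 16, 4, 1.0
# Random weights, with W_hh rescaled to spectral radius < 1 so the network is stable.
W_hh = rng.normal(size=(d_h, d_h)); W_hh *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_hh)))
W_xh = rng.normal(size=(d_h, d_x)) * 0.1
W_ho = rng.normal(size=(d_x, d_h)) * 0.1

def simulate(T, n_seq=20000):
    """Sample n_seq sequences x_0..x_T from the linear Elman RNN with Gaussian output."""
    h = np.zeros((n_seq, d_h))
    xs = []
    x = rng.normal(scale=sigma, size=(n_seq, d_x))      # x_0 ~ N(0, sigma^2 I)
    for _ in range(T + 1):
        xs.append(x)
        h = h @ W_hh.T + x @ W_xh.T                     # h_t = W_hh h_{t-1} + W_xh x_{t-1}
        x = h @ W_ho.T + rng.normal(scale=sigma, size=(n_seq, d_x))  # x_t ~ N(o_t, sigma^2 I)
    return xs

def gaussian_mi(a, b):
    """I(a;b) for jointly Gaussian samples via -1/2 log det(I - S00^-1 S0t Stt^-1 St0)."""
    d = a.shape[1]
    S = np.cov(np.hstack([a, b]).T)
    S00, S0t, Stt = S[:d, :d], S[:d, d:], S[d:, d:]
    M = np.linalg.solve(S00, S0t) @ np.linalg.solve(Stt, S0t.T)
    return -0.5 * np.log(np.linalg.det(np.eye(d) - M))

xs = simulate(T=30)
for tau in (1, 5, 10, 20, 30):
    # Falls off roughly exponentially in tau, down to the estimation noise floor.
    print(tau, gaussian_mi(xs[0], xs[tau]))
```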

4.2 Nonlinear RNNs

4.2.1 Binary Sequence

We now study how the linear RNN result generalizes to nonlinear RNNs on symbolic sequences. Our first dataset consists of artificial binary sequences. This dataset is simple and clean, in the sense that it only contains two symbols and is strictly scaleless. The training set contains 10000 sequences of length 512, whose mutual information decays as a power law of the temporal distance. During training, we do not truncate the backpropagation through time (BPTT). After training, we unconditionally generate 2000 sequences and estimate their mutual information. The generation algorithm of the dataset, along with experiment details, is reported in the Appendix.

Vanilla RNN

It is very clear from the straight line in the semi-log plot (inset of Figure 1(a)) that $I(\tau)$ decays exponentially with $\tau$ in vanilla RNNs:

$$I(\tau) \propto e^{-\tau/\xi}, \qquad (9)$$

where we have defined the "correlation length" $\xi$.

If $\xi$ increased very rapidly with the width (hidden unit dimension of the RNN layers) or the depth (number of RNN layers) of the network, the exponential decay would not bother us practically. However, this is not the case. The correlation length as a function of the network width is fitted in Figure 1(c). For small networks, the correlation length increases logarithmically with the hidden unit dimension. When the network becomes large enough, the correlation length saturates. The almost invisible error bars indicate the goodness of the exponential fit of the temporal mutual information. The correlation length as a function of the network depth is fitted in Figure 1(d); it increases linearly for shallow networks. Therefore, increasing the depth of vanilla RNNs is more efficient in capturing long-range temporal correlation than increasing the width. For relatively deep networks, the performance deteriorates, probably due to the increased difficulty of training. Interestingly, the network in Figure 1(a) overestimates the short-range correlation in order to compensate for the rapid exponential decay at long distances.
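
The correlation length in Figure 1(c,d) comes from an exponential fit of the estimated mutual information. One simple way to perform such a fit is a straight-line regression of log I(τ) against τ, sketched below; the cutoff value is a hypothetical placeholder, not the paper's threshold.

```python
import numpy as np

def fit_correlation_length(taus, mi, mi_floor=1e-4):
    """Fit I(tau) ~ A * exp(-tau / xi) by linear regression of log I(tau) on tau.

    Points below `mi_floor` are discarded, since long-distance estimates are
    dominated by estimation error. Returns (xi, xi_stderr).
    """
    taus = np.asarray(taus, dtype=float)
    mi = np.asarray(mi, dtype=float)
    keep = mi > mi_floor
    x, y = taus[keep], np.log(mi[keep])
    (slope, _), cov = np.polyfit(x, y, 1, cov=True)
    xi = -1.0 / slope
    xi_err = np.sqrt(cov[0, 0]) / slope**2   # error propagation: d(-1/s)/ds = 1/s^2
    return xi, xi_err

# Synthetic example with xi = 5:
taus = np.arange(1, 40)
mi = 0.5 * np.exp(-taus / 5.0) * np.exp(0.05 * np.random.default_rng(1).normal(size=taus.size))
print(fit_correlation_length(taus, mi))
```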

LSTM

LSTMs perform much better in capturing long-range correlations, although the exponential decay can still be seen clearly in very small LSTMs (Figure 1(b) inset). For larger widths, it is hard to distinguish the exponential decay from an algebraic decay. Nevertheless, we still fit the correlation length. Note that the fitting on the training set yields a baseline correlation length comparable to the sequence length. The correlation length also increases linearly with the width in small LSTMs and then saturates. The depth dependence is similar to that of vanilla RNNs, which is consistent with previous studies finding that shallow LSTMs usually perform as well as, if not better than, deep LSTMs Melis2017 .

Figure 1: (a) Estimated mutual information of unconditionally generated sequences from vanilla RNNs on binary sequences. The legend denotes width (depth, if depth > 1); (b) same as (a) but for LSTMs; (c) fitted correlation length as a function of the RNN width. The depth of all networks is one. Only data points above a threshold are used, due to the estimation error at long distances. The error bars represent the 95% confidence interval. (d) Same as (c) but as a function of the RNN depth. The width of all networks is 8.

To summarize, both vanilla RNNs and LSTMs show exponentially decaying mutual information on the binary dataset. The correlation length of LSTMs has a much better scaling behavior than that of vanilla RNNs.

4.2.2 Natural Language

We extend the scaling analysis to the more realistic natural language dataset WikiText-2 Merity2016 . Since vanilla RNNs cannot even capture the long-range dependence in the simple binary dataset, we focus only on LSTMs here. During training, BPTT is truncated to 100 characters at the character level and 50 tokens at the word level. After training, we unconditionally generate a long sequence of 2MB of characters and estimate its mutual information at the character level.

Different from binary sequences, multiple time scales exist in WikiText-2. At short distances (word level), the mutual information decays exponentially, potentially due to the arbitrariness of characters within a word. At long distances (paragraph level), the mutual information follows a power-law decay. The qualitative behavior is mixed at intermediate distances (sentence level).

Strikingly, there is a significant discrepancy between the mutual information profiles of the training and validation sets. Not only is the mutual information on the validation set much larger, but the algebraic decay at long distances is missing as well. The fact that we always pick the best model on the validation set for sequence generation may therefore prevent the model from learning the long-range mutual information. As a result, we should interpret results, especially from word-level models, cautiously. We also find similar non-uniformity in datasets such as Penn Treebank. See the Appendix for more details.

Figure 2: Estimated mutual information of unconditionally generated sequences from LSTMs on WikiText-2. The legend denotes width. The depth of all networks is one. (a) Character-level LSTM; (b) Word-level LSTM.

Character-level LSTMs can capture the short-range correlation quite well as long as the width is not too small. At intermediate distances, it seems from the inset of Figure 2(a) that all LSTMs show an exponentially decaying mutual information. In large models, the short-range mutual information is overestimated to compensate for the rapid decay at intermediate distances. No power-law decay at long distances is captured at all.

Word-level LSTMs can capture the mutual information up to intermediate distances. It is no surprise that the short-range correlation is captured well, as word spelling is already encoded through the word embeddings. There is even long-range power-law dependence in the largest model, although the mutual information is overestimated by approximately a factor of two at all distances.

In this dataset, the mutual information of LSTMs always decays to some nonzero constant instead of to zero as it does for binary sequences. We speculate that this is likely due to a short-range memory effect, similar to the non-decaying mutual information in repetitive sequences. The simplest example is the binary sequence in which the patterns 01 and 10 are repeated with probabilities $p$ and $1-p$, respectively, which can be generated by a periodic HMM. It is not hard to prove that the mutual information is constant for all temporal distances greater than one. See the Appendix for a proof.

5 Self-Attention Networks

We now turn to the empirical study of the original Transformer model NIPS2017_7181 . In principle, the conditional distribution in Transformers explicitly depends on the entire sequence history. For complexity reasons, during training the look-back history is usually truncated to the most recent $N$ elements. In this sense, Transformers are like $N$-gram models with large $N$. For the purposes of this paper, the truncation will not bother us, because we are only interested in the mutual information at distances $\tau < N$. On WikiText-2, $N$ is limited to 512 characters at the character level and 128 tokens at the word level. However, if one is really interested in $\tau > N$, the result for the $N$-gram model suggests that the mutual information is bound to decay exponentially. The network width is defined as the total hidden unit dimension across all attention heads. For binary sequences we use Transformers with four heads, and for WikiText-2 we use eight heads.
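
To make the contrast with RNNs concrete, the sketch below implements the core of a single causal self-attention head in NumPy: every position attends directly to every earlier position within the (optionally truncated) window, so no information has to survive a fixed-length recurrent state. This is a bare-bones illustration, not the Transformer configuration used in the experiments.

```python
import numpy as np

def causal_self_attention(X, Wq, Wk, Wv, window=None):
    """Single-head causal self-attention.

    X: (T, d) sequence of embeddings. Each output position t is a weighted
    average of values at positions <= t (optionally limited to the last
    `window` positions), so position t depends on its history explicitly
    rather than through a fixed-length recurrent state.
    """
    T = X.shape[0]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])           # (T, T) attention logits
    mask = np.tril(np.ones((T, T), dtype=bool))      # causal mask: no peeking ahead
    if window is not None:                           # optional truncated look-back
        mask &= ~np.tril(np.ones((T, T), dtype=bool), k=-window)
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)    # row-wise softmax
    return weights @ V

# Example with random weights:
rng = np.random.default_rng(0)
T, d = 10, 8
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(causal_self_attention(X, Wq, Wk, Wv, window=4).shape)   # (10, 8)
```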

5.1 Binary Sequence

Transformers can very efficiently capture long-range dependence in binary sequences. Even a single-layer Transformer of modest width can already capture the algebraic behavior quite well. Interestingly, the mutual information does not decay exponentially even in the simplest model. Moreover, the magnitude of the mutual information always coincides with the training data very well, while that of LSTMs is almost always slightly smaller. The correlation length fitting can be found in the Appendix, although the mutual information does not fit the exponential form very well.

Figure 3: Estimated mutual information of unconditionally generated sequences from Transformers on various datasets. The legend denotes width (depth, if depth > 1). (a) Binary sequences; (b) WikiText-2 at character level; (c) WikiText-2 at word level; (d) GPT-2 model trained on WebText with byte pair encoding (BPE) sennrich-etal-2016-neural .

5.2 Natural Language

The mutual information of character-level Transformers already looks very similar to that of word-level LSTMs. Both short- and intermediate-range mutual information are captured, and there is an overestimated power-law tail at long distances for the single-layer model. Even small word-level Transformers can track the mutual information very closely up to intermediate distances. Moreover, our three- and five-layer Transformer models have a slowly decaying power-law tail, although the power does not track the training data very well. As a bonus, we evaluate the current state-of-the-art language model GPT-2 Radford2019 , where the long-range algebraic decay is persistent in both the small (117M parameters) and large (1542M parameters) models. Note that the magnitude of the mutual information is much larger than that in WikiText-2, probably due to the better quality of the WebText dataset. This is beneficial to training, because it is more difficult for the network to distinguish dependence from noise when the magnitude of the mutual information is small. The discrepancy between the training and validation sets of WikiText-2 may also make learning long-range dependence harder. Therefore, we speculate that poor dataset quality is the main reason why our word-level Transformer model cannot learn the power-law behavior very well on WikiText-2.

Finally, we observe a connection between the quality of the generated text and the mutual information scaling behavior. Short-range mutual information is connected to the ability to spell words correctly, and intermediate-range mutual information is related to the ability to close "braces" in section titles correctly, or to consecutively generate two or three coherent sentences. Long-range mutual information is reflected in long paragraphs sharing a single topic. Some text samples are presented in the Appendix.

6 Discussion

RNNs encode the sequence history into a fixed-length, continuous-valued vector. In linear RNNs, after we "integrate out" the hidden state, the history dependence in Equation (5) takes a form that is exponential in time. In this way, the network always "forgets" the past exponentially fast, and thus cannot capture power-law dependence in the sequence. In nonlinear RNNs, although we cannot analytically "integrate out" the hidden state, the experimental results suggest that RNNs still forget the history exponentially fast.

It is very tempting to connect the above result to statistical mechanics. At thermal equilibrium, systems with power-law decaying mutual information or correlation are called "critical". It is well known, as van Hove's theorem, that in one dimension systems with short-range interactions cannot be critical, and the mutual information always decays exponentially VANHOVE1950137 ; ruelle1999statistical ; landau2013statistical ; Cuesta2004 . Importantly, here "short range" means the interaction strength has a finite range or decays exponentially in distance. On the other hand, one-dimensional systems with long-range interactions can be critical. A classical example is the long-range Ising model, where a finite-temperature critical point exists when the interaction strength decays as $r^{-\alpha}$ with $1 < \alpha < 2$ dyson1969 . Exactly at the critical point, the mutual information decays algebraically, and in a vast parameter space near the critical point the mutual information decays very slowly, so that no parameter fine-tuning is needed PhysRevE.91.052703 .
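
For concreteness, the long-range Ising model referred to here can be written (in standard notation, not taken verbatim from the paper) as

$$H = -\sum_{i < j} \frac{J}{|i-j|^{\alpha}}\, s_i s_j, \qquad s_i = \pm 1,$$

for which Dyson proved the existence of a finite-temperature phase transition in one dimension when $1 < \alpha < 2$ dyson1969 . The power-law interaction plays a role analogous to the explicit attention to distant tokens in a Transformer.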

Linear RNNs resemble statistical mechanical systems with exponentially decaying interactions; therefore they cannot accommodate long-range correlations. Transformers exploit the entire sequence history explicitly. In fact, there is no natural notion of "distance" in Transformers, and the distance is artificially encoded using positional encoding layers. Because Transformers resemble statistical mechanical systems with long-range interactions, it is no surprise that they can capture long-range correlations well.

An advantage of the statistical mechanical argument is its robustness. Essentially, the result identifies the universality class that a sequence model belongs to, which usually does not depend on many microscopic details. Analyzing nonlinear models and finding their universality classes is left as future work. However, one should be cautious: because sampling is conditioned only on the past and not on the future, the mapping between the Boltzmann distribution and the conditional distribution is straightforward only in limited cases. See the Appendix for a simple example.

The implication of the above result is that self-attention networks are more efficient in capturing long-range temporal dependence. However, this ability comes at the cost of extra computation. To generate a sequence of length $T$, RNNs only need $O(T)$ time while Transformers need $O(T^2)$ (assuming no truncation of the window size). How to maintain long-range interactions while at the same time reducing the computational complexity is a challenging and interesting problem, even from the perspective of statistical physics. It is also interesting to see how augmenting RNNs with memory can improve the long-range scaling behavior Graves2014 ; NIPS2015_5648 ; NIPS2015_5846 ; NIPS2015_5857 ; Grave2017 .

Last but not least, the theoretical study of linear RNNs only focuses on the intrinsic expressive power of the architecture. In reality, the practical expressive power is also heavily affected by how well the stochastic gradient descent algorithm performs on the model. The fact that Transformers have long-range interactions also helps the backpropagation algorithm chapter-gradient-flow-2001 . It will be interesting to connect the statistical physics picture to the gradient flow dynamics.

7 Conclusion

This paper demonstrates, both theoretically and empirically, that RNNs are not efficient in capturing long-range temporal correlations. Self-attention models like Transformers can capture long-range correlations much more efficiently than RNNs and can reproduce the power-law mutual information of natural language texts.

We also notice the non-uniformity problem in popular natural language datasets. We believe a high-quality dataset is essential for the network to learn long-range dependence.

We hope this work provides a new perspective on understanding the expressive power of sequence models and sheds new light on improving both their architectures and their training dynamics.

Acknowledgments

HS thanks Yaodong Li, Tianxiao Shen, Jiawei Yan, Pengfei Zhang, Wei Hu and Yi-Zhuang You for valuable discussions and suggestions.

References

  • (1) A. Graves, A. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6645–6649, 2013. https://doi.org/10.1109/ICASSP.2013.6638947.
  • (2) I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to Sequence Learning with Neural Networks,” in Advances in Neural Information Processing Systems 27 (Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, eds.), pp. 3104–3112, Curran Associates, Inc., 2014. http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf.
  • (3) D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” 2014. https://arxiv.org/abs/1409.0473.
  • (4) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is All you Need,” in Advances in Neural Information Processing Systems 30 (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds.), pp. 5998–6008, Curran Associates, Inc., 2017. http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.
  • (5) A. M. Rush, S. Chopra, and J. Weston, “A Neural Attention Model for Abstractive Sentence Summarization,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 379–389, Association for Computational Linguistics, 2015. https://doi.org/10.18653/v1/D15-1044.
  • (6) R. Nallapati, B. Zhou, C. dos Santos, c. Gu̇lçehre, and B. Xiang, “Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond,” in Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pp. 280–290, Association for Computational Linguistics, 2016. https://doi.org/10.18653/v1/K16-1028.
  • (7) D. Eck and J. Schmidhuber, “Finding temporal structure in music: blues improvisation with LSTM recurrent networks,” in Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing, pp. 747–756, 2002. https://doi.org/10.1109/NNSP.2002.1030094.
  • (8) F. Jelinek and R. Mercer, “Interpolated estimation of Markov source parameters from sparse data,” in Proceedings of the Workshop on Pattern Recognition in Practice, pp. 381–397, 1980.
  • (9) Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, “A Neural Probabilistic Language Model,” Journal of Machine Learning Research, vol. 3, pp. 1137–1155, 2003. http://jmlr.org/papers/v3/bengio03a.html.
  • (10) A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel Recurrent Neural Networks,” 2016. https://arxiv.org/abs/1601.06759.
  • (11) W. Li, “Power Spectra of Regular Languages and Cellular Automata,” Complex Systems, vol. 1, pp. 107–130, 1987. https://www.complex-systems.com/abstracts/v01_i01_a08/.
  • (12) H. Lin and M. Tegmark, “Critical Behavior in Physics and Probabilistic Formal Languages,” Entropy, vol. 19, no. 7, p. 299, 2017. https://doi.org/10.3390/e19070299.
  • (13) W. Li and K. Kaneko, “Long-Range Correlation and Partial Spectrum in a Noncoding DNA Sequence,” Europhysics Letters (EPL), vol. 17, no. 7, pp. 655–660, 1992. https://doi.org/10.1209/0295-5075/17/7/014.
  • (14) C.-K. Peng, S. V. Buldyrev, A. L. Goldberger, S. Havlin, F. Sciortino, M. Simons, and H. E. Stanley, “Long-range correlations in nucleotide sequences,” Nature, vol. 356, no. 6365, pp. 168–170, 1992. https://doi.org/10.1038/356168a0.
  • (15) R. F. Voss, “Evolution of long-range fractal correlations and 1/f noise in DNA base sequences,” Phys. Rev. Lett., vol. 68, pp. 3805–3808, 1992. https://doi.org/10.1103/PhysRevLett.68.3805.
  • (16) A. Schenkel, J. Zhang, and Y.-C. Zhang, “Long range correlation in human writings,” Fractals, vol. 01, no. 01, pp. 47–57, 1993. https://doi.org/10.1142/S0218348X93000083.
  • (17) W. Ebeling and T. Pöschel, “Entropy and Long-Range Correlations in Literary English,” Europhysics Letters (EPL), vol. 26, no. 4, pp. 241–246, 1994. https://doi.org/10.1209/0295-5075/26/4/001.
  • (18) M. A. Montemurro and P. A. Pury, “Long-range fractal correlations in literary corpora,” Fractals, vol. 10, no. 04, pp. 451–461, 2002. https://doi.org/10.1142/S0218348X02001257.
  • (19) P. Kokol, V. Podgorelec, M. Zorman, T. Kokol, and T. Njivar, “Computer and natural language texts—A comparison based on long-range correlations,” Journal of the American Society for Information Science, vol. 50, no. 14, pp. 1295–1301, 1999. https://doi.org/10.1002/(SICI)1097-4571(1999)50:14<1295::AID-ASI4>3.0.CO;2-5.
  • (20) D. J. Levitin, P. Chordia, and V. Menon, “Musical rhythm spectra from Bach to Joplin obey a power law,” Proceedings of the National Academy of Sciences, vol. 109, no. 10, pp. 3716–3720, 2012. https://doi.org/10.1073/pnas.1113828109.
  • (21) B. Manaris, J. Romero, P. Machado, D. Krehbiel, T. Hirzel, W. Pharr, and R. B. Davis, “Zipf’s Law, Music Classification, and Aesthetics,” Computer Music Journal, vol. 29, no. 1, pp. 55–69, 2005. https://doi.org/10.1162/comj.2005.29.1.55.
  • (22) A. W. Lo, “Long-Term Memory in Stock Market Prices,” Econometrica, vol. 59, no. 5, pp. 1279–1313, 1991. https://doi.org/10.2307/2938368.
  • (23) Z. Ding, C. W. Granger, and R. F. Engle, “A long memory property of stock market returns and a new model,” Journal of Empirical Finance, vol. 1, no. 1, pp. 83 – 106, 1993. https://doi.org/10.1016/0927-5398(93)90006-D.
  • (24) S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997. https://doi.org/10.1162/neco.1997.9.8.1735.
  • (25) M. Hermans and B. Schrauwen, “Training and Analysing Deep Recurrent Neural Networks,” in Advances in Neural Information Processing Systems 26 (C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, eds.), pp. 190–198, Curran Associates, Inc., 2013. http://papers.nips.cc/paper/5166-training-and-analysing-deep-recurrent-neural-networks.pdf.
  • (26) A. Karpathy, J. Johnson, and L. Fei-Fei, “Visualizing and Understanding Recurrent Networks,” 2015. https://arxiv.org/abs/1506.02078.
  • (27) U. Khandelwal, H. He, P. Qi, and D. Jurafsky, “Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 284–294, Association for Computational Linguistics, 2018. https://www.aclweb.org/anthology/P18-1027.
  • (28) R. Schwartz-ziv and N. Tishby, “Opening the black box of Deep Neural Networks via Information,” 2017. https://arxiv.org/abs/1703.00810.
  • (29) Z. Goldfeld, E. V. D. Berg, K. Greenewald, I. Melnyk, N. Nguyen, B. Kingsbury, and Y. Polyanskiy, “Estimating Information Flow in Neural Networks,” 2018. https://arxiv.org/abs/1810.05728.
  • (30) S. Takahashi and K. Tanaka-Ishii, “Do neural nets learn statistical laws behind natural language?,” PLOS ONE, vol. 12, no. 12, 2017. https://doi.org/10.1371/journal.pone.0189326.
  • (31) G. Melis, C. Dyer, and P. Blunsom, “On the State of the Art of Evaluation in Neural Language Models,” 2017. https://arxiv.org/abs/1707.05589.
  • (32) P. Flajolet and R. Sedgewick, Analytic Combinatorics. Cambridge University Press, 2009.
  • (33) S. Merity, C. Xiong, J. Bradbury, and R. Socher, “Pointer Sentinel Mixture Models,” 2016. https://arxiv.org/abs/1609.07843.
  • (34) R. Sennrich, B. Haddow, and A. Birch, “Neural Machine Translation of Rare Words with Subword Units,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), (Berlin, Germany), pp. 1715–1725, Association for Computational Linguistics, 2016. https://doi.org/10.18653/v1/P16-1162.
  • (35) A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language Models are Unsupervised Multitask Learners,” 2019. https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
  • (36) L. van Hove, “Sur L’intégrale de Configuration Pour Les Systèmes De Particules À Une Dimension,” Physica, vol. 16, no. 2, pp. 137 – 143, 1950. https://doi.org/10.1016/0031-8914(50)90072-3.
  • (37) D. Ruelle, Statistical Mechanics: Rigorous Results. World Scientific, 1999.
  • (38) L. Landau and E. Lifshitz, Statistical Physics, Volume 5. Elsevier Science, 2013.
  • (39) J. A. Cuesta and A. Sánchez, “General Non-Existence Theorem for Phase Transitions in One-Dimensional Systems with Short Range Interactions, and Physical Examples of Such Transitions,” Journal of Statistical Physics, vol. 115, no. 3, pp. 869–893, 2004. https://doi.org/10.1023/B:JOSS.0000022373.63640.4e.
  • (40) F. J. Dyson, “Existence of a phase-transition in a one-dimensional Ising ferromagnet,” Comm. Math. Phys., vol. 12, no. 2, pp. 91–107, 1969. https://projecteuclid.org:443/euclid.cmp/1103841344.
  • (41) A. Colliva, R. Pellegrini, A. Testori, and M. Caselle, “Ising-model description of long-range correlations in DNA sequences,” Phys. Rev. E, vol. 91, p. 052703, 2015. https://doi.org/10.1103/PhysRevE.91.052703.
  • (42) A. Graves, G. Wayne, and I. Danihelka, “Neural Turing Machines,” 2014. https://arxiv.org/abs/1410.5401.
  • (43) E. Grefenstette, K. M. Hermann, M. Suleyman, and P. Blunsom, “Learning to Transduce with Unbounded Memory,” in Advances in Neural Information Processing Systems 28 (C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, eds.), pp. 1828–1836, Curran Associates, Inc., 2015. http://papers.nips.cc/paper/5648-learning-to-transduce-with-unbounded-memory.pdf.
  • (44) S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus, “End-To-End Memory Networks,” in Advances in Neural Information Processing Systems 28 (C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, eds.), pp. 2440–2448, Curran Associates, Inc., 2015. http://papers.nips.cc/paper/5846-end-to-end-memory-networks.pdf.
  • (45) A. Joulin and T. Mikolov, “Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets,” in Advances in Neural Information Processing Systems 28 (C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, eds.), pp. 190–198, Curran Associates, Inc., 2015. http://papers.nips.cc/paper/5857-inferring-algorithmic-patterns-with-stack-augmented-recurrent-nets.pdf.
  • (46) E. Grave, A. Joulin, and N. Usunier, “Improving Neural Language Models with a Continuous Cache,” 2017. https://arxiv.org/abs/1612.04426.
  • (47) S. Hochreiter, Y. Bengio, and P. Frasconi, “Gradient Flow in Recurrent Nets: the Difficulty of Learning Long-Term Dependencies,” in Field Guide to Dynamical Recurrent Networks (J. Kolen and S. Kremer, eds.), IEEE Press, 2001. https://doi.org/10.1109/9780470544037.ch14.
  • (48) P. Grassberger, “Entropy Estimates from Insufficient Samplings,” 2008. https://arxiv.org/abs/physics/0307138.
  • (49) M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz, “Building a Large Annotated Corpus of English: The Penn Treebank,” Computational Linguistics, vol. 19, no. 2, 1993. https://www.aclweb.org/anthology/J93-2004.
  • (50) M. Mahoney, “Relationship of Wikipedia Text to Clean Text,” 2006. http://mattmahoney.net/dc/textdata.html.
  • (51) G. Hinton, N. Srivastava, and K. Swersky, “Lecture 6e—RmsProp: Divide the gradient by a running average of its recent magnitude.” Coursera: Neural Networks for Machine Learning, 2012.
  • (52) X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (Y. W. Teh and M. Titterington, eds.), vol. 9 of Proceedings of Machine Learning Research, (Chia Laguna Resort, Sardinia, Italy), pp. 249–256, PMLR, 2010. http://proceedings.mlr.press/v9/glorot10a.html.
  • (53) A. M. Saxe, J. L. McClelland, and S. Ganguli, “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks,” 2013. https://arxiv.org/abs/1312.6120.
  • (54) O. Press and L. Wolf, “Using the Output Embedding to Improve Language Models,” in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, (Valencia, Spain), pp. 157–163, Association for Computational Linguistics, 2017. https://www.aclweb.org/anthology/E17-2025.
  • (55) H. Inan, K. Khosravi, and R. Socher, “Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling,” 2016. https://arxiv.org/abs/1611.01462.
  • (56) D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” 2014. https://arxiv.org/abs/1412.6980.
  • (57) I. Loshchilov and F. Hutter, “SGDR: Stochastic Gradient Descent with Warm Restarts,” 2016. https://arxiv.org/abs/1608.03983.
  • (58) S. Merity, N. S. Keskar, and R. Socher, “Regularizing and Optimizing LSTM Language Models,” 2017. https://arxiv.org/abs/1708.02182.

Appendix A Exponential Mutual Information in Linear RNNs with Gaussian Output

In this section, we present the full proof that the mutual information decays exponentially in linear Elman RNNs with Gaussian output. For the sake of self-containment, some of the arguments from the main text are repeated. The proof can be roughly divided into four steps:

  1. Derive the recurrence relations of the block covariance matrix;

  2. Solve the recurrence relation using generating functions;

  3. Perform asymptotic analysis by studying the singularities of generating functions;

  4. Compute the mutual information based on the asymptotic analysis.

Problem Setup

The classical Elman RNN with the linear activation is given by:

$$h_t = W_{hh} h_{t-1} + W_{xh} x_{t-1}, \qquad (10)$$
$$o_t = W_{ho} h_t, \qquad (11)$$

where $o_t$ parameterizes the probability distribution from which $x_t$ is sampled, and $h_t$ is the hidden state. In this way, the shapes of the weight matrices are $W_{hh} \in \mathbb{R}^{d_h \times d_h}$, $W_{xh} \in \mathbb{R}^{d_h \times d_x}$, and $W_{ho} \in \mathbb{R}^{d_x \times d_h}$, where $d_h$ is the hidden state dimension and $d_x$ is the data dimension. In the following, we assume $x_t$ is sampled from a multivariate Gaussian distribution with mean $o_t$ and covariance matrix proportional to the identity, i.e. $x_t \sim \mathcal{N}(o_t, \sigma^2 I)$. It follows from iteratively applying Equation (10) that $h_t = W_{hh}^{\,t} h_0 + \sum_{s=1}^{t} W_{hh}^{\,t-s} W_{xh}\, x_{s-1}$. Therefore

$$o_t = W_{ho} W_{hh}^{\,t} h_0 + \sum_{s=1}^{t} W_{ho} W_{hh}^{\,t-s} W_{xh}\, x_{s-1}. \qquad (12)$$

The joint probability distribution of the entire sequence factorizes as

$$p(x_0, x_1, \dots, x_T) = p(x_0) \prod_{t=1}^{T} p(x_t \mid x_0, \dots, x_{t-1}), \qquad (13)$$

where

$$p(x_t \mid x_0, \dots, x_{t-1}) = \mathcal{N}\!\left(x_t;\, o_t,\, \sigma^2 I\right). \qquad (14)$$

Here we have assumed $x_0$ is also a multivariate Gaussian random variable.

The hidden state is often initialized as $h_0 = 0$. Under this initial condition, $x_0$ is multivariate Gaussian, and so is the joint distribution of the entire sequence. In this way, the random process is a Gaussian process. Since we are interested in the mutual information between $x_t$ and $x_{t+\tau}$ for some generic $t$, without loss of generality we can set $t = 0$ and let $p(x_0)$ be a generic multivariate Gaussian distribution. We can also let $p(x_0)$ be the distribution that is already averaged over the entire sequence. In any case, we will see that the asymptotic behavior of the mutual information is almost independent of $p(x_0)$.

Deriving Recurrence Relations

We are particularly interested in the distribution $p(x_0, x_t)$. Because the marginal of a multivariate Gaussian random variable is still multivariate Gaussian, we only need to compute the block covariance matrices $\Sigma_{0t} = \mathrm{Cov}(x_0, x_t)$ and $\Sigma_{tt} = \mathrm{Cov}(x_t, x_t)$. They can be derived by noting that $x_t = o_t + \sigma\epsilon_t$, where $\epsilon_t \sim \mathcal{N}(0, I)$ is independent of the history. In this way,

(15)

where in the second line we have used this decomposition and the independence of $\epsilon_t$, and in the last line we have used the fact that

(16)

Similarly,

(17)

As a sanity check, Equations (15) and (17) can also be derived directly from the probability density function, Equation (13). First rewrite Equation (13) in the canonical form of a multivariate Gaussian distribution:

(18)

where the matrix in the quadratic form (the precision matrix) is symmetric. The last row of the precision matrix is

(19)

and the second column of the covariance matrix is

(20)

Since the precision matrix is the inverse of the covariance matrix, the product of the last row of the former with the second column of the latter must vanish. This gives

(21)

Similarly, by considering the last row of the precision matrix and the last column of the covariance matrix, we obtain

(22)
Solving Recurrence Relations

We solve the linear recurrence relation for the cross-covariance, given by Equation (15). Define the auxiliary sequence

(23)

and the following three formal power series:

(24)
(25)
(26)

With the identity

(27)

Equation (15) is equivalent to

(28)

or

(29)

To this end, we assume that the square weight matrix admits an eigenvalue decomposition, with a diagonal matrix of eigenvalues and an orthogonal matrix of eigenvectors. With this assumption, we can obtain closed-form expressions for the power series:

(30)

or in matrix form

(31)

Similarly,

(32)

or in matrix form

(33)
(34)

Inserting Equations (31) and (34) into (29), we obtain the formal solution for the generating function:

(35)
Asymptotic Analysis

The long-time asymptotic behavior of $\Sigma_{0t}$ can be analyzed by treating the generating function as a function on the complex plane and studying its singularities flajolet2009analytic . Since matrix inversion can be computed from Cramer’s rule, the generating function is a rational function of $z$, and its singularities always occur as poles. Therefore, the asymptotic behavior of $\Sigma_{0t}$ is exponential in $t$, with the exponent determined by the position of the pole closest to the origin. The order of the pole determines the polynomial sub-exponential factor.

Because we are dealing with a matrix here, each element has its own set of poles. Denote the pole closest to the origin associated with a given element by $z_0$ and its order by $m$. In this way, that element behaves like $t^{\,m-1} |z_0|^{-t}$ when $t$ is large. Unless $z_0$ lies exactly on the unit circle, which is of zero measure in the parameter space, the solution either increases or decreases exponentially with $t$. Even if $z_0$ is on the unit circle, the power of the polynomial factor can only be an integer. If any element in $\Sigma_{0t}$ increases exponentially with $t$, $x_t$ simply remembers the initial condition, which is not desirable. Therefore, in any working network every element of $\Sigma_{0t}$ decreases exponentially in $t$.

The pole positions are bounded in terms of the weight matrices $W_{hh}$, $W_{xh}$ and $W_{ho}$ only, independent of the initial condition $p(x_0)$. For example, consider the term proportional to the initial covariance in Equation (35):

(36)

Equation (36) is exactly the partial fraction decomposition of a rational function. (Of course, it may be decomposed further.) The pole closest to the origin among the poles appearing in this decomposition determines the decay rate, unless the corresponding coefficient happens to be exactly zero, in which case the relevant pole lies further away from the origin, leading to a faster decay.

We then analyze the asymptotic behavior of $\Sigma_{tt}$. The only property that we need is its non-degeneracy. Intuitively, this holds because $x_t$ is sampled conditionally from a Gaussian distribution with covariance matrix $\sigma^2 I$. Formally, according to the Courant minimax principle, the minimal eigenvalue of $\Sigma_{tt}$ satisfies

(37)

where we have inserted Equation (17) and used the fact that the remaining term is positive semi-definite. In this way, the minimal eigenvalue of $\Sigma_{tt}$ is bounded from below by $\sigma^2$.

Mutual Information Computation

Let $X$, $Y$ be two multivariate Gaussian random variables with a joint Gaussian distribution whose covariance matrix is

$$\Sigma = \begin{pmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{pmatrix}. \qquad (38)$$

Using one of the definitions of the mutual information, $I(X;Y) = H(X) + H(Y) - H(X,Y)$, the entropy of multivariate Gaussian random variables, and the Schur decomposition

$$\det\Sigma = \det\Sigma_{YY}\,\det\!\left(\Sigma_{XX} - \Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}\right), \qquad (39)$$

we have

$$I(X;Y) = -\frac{1}{2}\log\det\!\left(I - \Sigma_{XX}^{-1}\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}\right). \qquad (40)$$

Using the above formula, the mutual information between $x_0$ and $x_t$ can then be computed as

$$I(x_0; x_t) = -\frac{1}{2}\log\det\!\left(I - \Sigma_{00}^{-1}\Sigma_{0t}\Sigma_{tt}^{-1}\Sigma_{t0}\right) \approx \frac{1}{2}\operatorname{tr}\!\left(\Sigma_{00}^{-1}\Sigma_{0t}\Sigma_{tt}^{-1}\Sigma_{t0}\right). \qquad (41)$$

$\Sigma_{00}$ is time-independent. Each element in $\Sigma_{0t}$ decays exponentially with $t$. Because the minimal eigenvalue of $\Sigma_{tt}$ is bounded from below by $\sigma^2$, $\Sigma_{tt}^{-1}$ is well-defined and tends to a finite constant matrix. In this way, every element of $\Sigma_{00}^{-1}\Sigma_{0t}\Sigma_{tt}^{-1}\Sigma_{t0}$ is exponentially small in the large-$t$ limit, which justifies the last (approximate) equality in Equation (41). Because the trace is a linear function, the mutual information decreases exponentially with $t$. This finishes the proof that, in any linear Elman RNN with Gaussian output that does not simply memorize the initial condition, the mutual information decays exponentially with time. The only technical assumption made in the proof is that the cross-covariance matrix decreases exponentially in time instead of increasing exponentially, without which the network simply memorizes the initial condition.
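
The approximate equality in Equation (41) (and in Equation (8) of the main text) follows from the standard expansion of the log-determinant for a matrix $M$ with small entries, stated here for completeness:

$$-\frac{1}{2}\log\det(I - M) = -\frac{1}{2}\operatorname{tr}\log(I - M) = \frac{1}{2}\operatorname{tr} M + \frac{1}{4}\operatorname{tr} M^2 + \cdots \approx \frac{1}{2}\operatorname{tr} M,$$

with $M = \Sigma_{00}^{-1}\Sigma_{0t}\Sigma_{tt}^{-1}\Sigma_{t0}$, so the exponential decay of $\Sigma_{0t}$ translates directly into exponential decay of $I(x_0; x_t)$.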

We note that adding bias terms in Equation (3) and (4) does not affect the conclusion because the mutual information of Gaussian random variables only depends on the covariance matrix, while the bias terms only affect their mean.

Appendix B Binary Sequence Generation from Multivariate Gaussian

In this section, we report a method to sample fixed-length sequences of binary random variables with an arbitrary designated mutual information profile. The method is used to generate the binary sequence dataset used in this paper.
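
A minimal sketch of this construction is given below: draw stationary Gaussian sequences with a designated covariance profile and threshold them at zero, so that the correlation of the Gaussians controls the mutual information of the resulting binary symbols. The power-law covariance profile and its parameters are illustrative placeholders, not necessarily those used to generate the paper's dataset.

```python
import numpy as np

def sample_binary_sequences(n_seq, length, cov_fn, rng=np.random.default_rng(0)):
    """Sample binary sequences by thresholding a stationary Gaussian process.

    cov_fn(tau) gives the covariance between Gaussian variables at lag tau
    (with cov_fn(0) = 1). The sign of each Gaussian variable yields a binary
    symbol; the correlation of the Gaussians controls the mutual information
    between symbols via the orthant probability
        P(same sign at lag tau) = 1/2 + arcsin(rho(tau)) / pi .
    """
    lags = np.abs(np.subtract.outer(np.arange(length), np.arange(length)))
    Sigma = cov_fn(lags)                                  # Toeplitz covariance matrix
    z = rng.multivariate_normal(np.zeros(length), Sigma, size=n_seq)
    return (z > 0).astype(np.int8)                        # threshold at zero

# Illustrative power-law covariance profile (hypothetical parameters).
def powerlaw_cov(tau, alpha=0.5):
    return 1.0 / (1.0 + tau) ** alpha

seqs = sample_binary_sequences(n_seq=2000, length=512, cov_fn=powerlaw_cov)
print(seqs.shape)   # (2000, 512)
```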

Consider a bivariate normal random variable $(Z_1, Z_2)$ with mean zero and covariance matrix

$$\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\,\sigma_1\sigma_2 \\ \rho\,\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}, \qquad (42)$$

where $\sigma_1^2, \sigma_2^2 \ge 0$ due to the non-negativity of the variance. The condition that the covariance matrix is positive semi-definite is $|\rho| \le 1$.

Define the bivariate Bernoulli random variable $(B_1, B_2) = (\mathrm{sign}(Z_1), \mathrm{sign}(Z_2))$, where the sign function applies element-wise. The mutual information between $B_1$ and $B_2$ is

(43)

where we have used the fact that $P(B_i = +1) = P(B_i = -1) = 1/2$ for all $\rho$, since the Gaussian has zero mean. Although an analytical expression for general $\sigma_1$ and $\sigma_2$ is viable, for simplicity we take $\sigma_1 = \sigma_2 = 1$, so that the off-diagonal element of the covariance matrix equals $\rho$. Straightforward integration yields

(44)
(45)

The mutual information as a function of $\rho$ is then

(46)

As a sanity check, when $\rho = 0$, due to the properties of the multivariate normal distribution, $Z_1$ and $Z_2$ become independent and so do $B_1$ and $B_2$. This is consistent with the fact that the mutual information vanishes there. When $\rho \to \pm 1$, $B_1$ and $B_2$ become perfectly (anti-)correlated and the mutual information attains its maximal value of $\log 2$ (one bit).