Mutual Information Scaling and Expressive Power of Sequence Models

05/10/2019
by Huitao Shen, et al.

Sequence models assign probabilities to variable-length sequences such as natural language texts. The ability of sequence models to capture temporal dependence can be characterized by the temporal scaling of correlation and mutual information. In this paper, we study the mutual information of recurrent neural networks (RNNs), including long short-term memories (LSTMs), and of self-attention networks such as Transformers. Through a combination of theoretical analysis of linear RNNs and empirical study of nonlinear RNNs, we find that their mutual information decays exponentially in temporal distance. Transformers, on the other hand, can capture long-range mutual information more efficiently, making them preferable for modeling sequences with slowly decaying, power-law mutual information, such as natural languages and stock prices. We discuss the connection of these results with statistical mechanics. We also point out the non-uniformity problem in many natural language datasets. We hope this work provides a new perspective on understanding the expressive power of sequence models and sheds new light on improving their architectures.
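The mutual information scaling discussed above can be measured directly on symbol sequences. As a minimal sketch (not code from the paper), the snippet below estimates the empirical mutual information I(X_t; X_{t+d}) between symbols separated by a temporal distance d from pair frequencies; plotting this quantity against d reveals whether the dependence decays exponentially or as a power law. The function name and the toy sequences are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's code): estimate the
# empirical mutual information between symbols at temporal distance d.
from collections import Counter
from math import log2

def mutual_information(seq, d):
    """Empirical mutual information (in bits) between seq[t] and seq[t+d]."""
    pairs = list(zip(seq, seq[d:]))
    n = len(pairs)
    joint = Counter(pairs)                 # joint distribution of (X_t, X_{t+d})
    left = Counter(a for a, _ in pairs)    # marginal of X_t
    right = Counter(b for _, b in pairs)   # marginal of X_{t+d}
    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n
        # p_ab * log2(p_ab / (p_a * p_b)), with counts expanded
        mi += p_ab * log2(p_ab * n * n / (left[a] * right[b]))
    return mi

# A strictly periodic sequence retains one bit of MI at every distance,
# whereas a constant sequence carries no information about its future.
seq = "abab" * 1000
print([round(mutual_information(seq, d), 3) for d in (1, 2, 3)])
# → [1.0, 1.0, 1.0]
```

Applying this estimator to natural-language corpora at increasing d is one way to exhibit the slow power-law decay the abstract attributes to such data, in contrast to the exponential decay of samples drawn from an RNN.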

