Better Long-Range Dependency By Bootstrapping A Mutual Information Regularizer

05/28/2019 ∙ by Yanshuai Cao, et al. ∙ Borealis AI

In this work, we develop a novel regularizer to improve the learning of long-range dependency of sequence data. Applied to language modelling, our regularizer expresses the inductive bias that sequence variables should have high mutual information even though the model might not see abundant observations for complex long-range dependency. We show how the "next sentence prediction (classification)" heuristic can be derived in a principled way from our mutual information estimation framework, and be further extended to maximize the mutual information of sequence variables. The proposed approach is not only effective at increasing the mutual information of segments under the learned model but, more importantly, leads to a higher likelihood on holdout data and improved generation quality.



1 Introduction

Learning long-range dependency in sequential data such as text is challenging, and the difficulty has mostly been attributed to the vanishing gradient problem in autoregressive neural networks such as RNNs hochreiter2001gradient. There is a vast literature trying to solve this gradient flow problem through better architecture hochreiter2001gradient ; mikolov2014learning ; vaswani2017attention, better optimization martens2011learning or better initialization le2015simple. However, there is an orthogonal issue that has received less attention: statistical dependency over a short span is usually abundant in data, e.g., bigrams, common phrases and idioms; long-range dependency, by contrast, typically involves more complex or abstract relationships of a large number of tokens (high order interactions). In other words, there is a sampling mismatch between observations supporting local correlations and evidence for high order interactions, even though the latter require more samples to learn from in the first place because they involve more variables. We conjecture that in addition to the gradient flow issue, this problem of sparse sampling of high order statistical relations renders learning long-range dependency hard in natural language processing.

Take language modelling for example: with a vocabulary of size $K$, the number of possible sequences grows as $K^m$ with sequence length $m$. Neural language models use distributed representation to overcome this issue bengio2003neural, as not all sequences form plausible natural language utterances, and there is shared semantics and compositionality in different texts. However, the parametrization does not change the fundamental fact that in the training data, there is an abundance of observations for local patterns, but much sparser observations for the different high-level ideas. As language evolved to express the endless possibilities of the world, even among the set of “plausible” long sequences, a training set can only cover a small fraction. Therefore, there is an inherent imbalance of sampling between short and long range dependencies. Because this is a data sparsity issue at the core, it cannot be completely solved by better architecture or optimization.

The natural remedy facing limited data is to regularize the model using prior knowledge. In this work, we propose a novel approach for incorporating into the usual maximum likelihood objective the additional prior that long-range dependency exists in texts. We achieve this by bootstrapping a lower bound on the mutual information (MI) over groups of variables (segments or sentences) and subsequently applying the bound to encourage high MI. Both the bootstrapping and the application of the bound improve long-range dependency learning: first, the bootstrap step helps the neural network’s hidden representation to recognize evidence for high mutual information that exists in the data distribution; second, the information lower bound value as the reward encourages the model distribution to exhibit high mutual information as well. We apply the proposed method to language modelling, although the general framework could apply to other problems as well.

Our work offers a new perspective on why the heuristic of next sentence prediction used in previous works trinh2018learning ; devlin2018bert is a useful auxiliary task, while revealing missing ingredients, which we complete in the proposed algorithm. We demonstrate improved perplexity on two established benchmarks, reflecting the positive regularizing effect. We also show that our proposed method can help the model generate higher-quality samples, with more diversity as measured by reverse perplexity zhao2018adversarially and more dependency as measured by an empirical lower bound of mutual information.

2 Background

2.1 MLE Language Model and Sparsely Observed High Order Dependency

A language model (LM) assigns a probability to a sequence of tokens (characters, bytes, or words). Let $\tau_i$ denote token variables. An LM $Q$ factorizes the joint distribution of the $\tau_i$’s into a product of conditionals from left to right, leveraging the inherent order of text: $Q(\tau_1, \ldots, \tau_k) = \prod_{i=1}^{k} Q(\tau_i \mid \tau_{<i})$, where $\tau_{<i}$ denotes all token variables with index less than $i$, and $Q(\tau_1 \mid \tau_{<1}) = Q(\tau_1)$. Let $(t_i)_{i=1}^{n}$ be an observed sequence of tokens as training data, sampled from the data distribution $P$. Learning simply maximizes the log likelihood of the observations with respect to the parameters $\omega$ of $Q$ (we will use the notation $Q$ and $Q_\omega$ interchangeably):

$$L_{\mathrm{MLE}}(\omega) = \sum_{i=1}^{n} \log Q(\tau_i = t_i \mid t_{<i}, \omega) \tag{1}$$

As $L_{\mathrm{MLE}}$ requires $Q$ to focus its probability mass on observed subsequent tokens given the preceding ones, maximum likelihood does have the ability to enforce long-range dependencies of sequence variables. However, Eq. 1 hides issues about high order interactions where a relatively smaller fraction of the valid outcomes are observed. To see this, take a partition of the sequence variables $(\tau_i)_{i=1}^{n}$ into $[\tau_{<a}, X, Y]$, where $X = (\tau_a, \ldots, \tau_b)$ and $Y = (\tau_{b+1}, \ldots, \tau_n)$. Then Eq. 1 is equivalent to:

$$L_{\mathrm{MLE}}(\omega) = \sum_{i=1}^{b} \log Q(\tau_i = t_i \mid t_{<i}, \omega) + \log Q\big(Y = (t_{b+1}, \ldots, t_n) \mid X = (t_a, \ldots, t_b), t_{<a}, \omega\big)$$

Now we can see that, as in the case of a single next token prediction, MLE prefers $Q$ to commit its prediction to the particular observed sequence(s) of $Y$, but this observed set is too sparse for the much larger configuration space. We propose to use MI as a way to express the belief that there is some dependency between $X$ and $Y$ without committing to particular instantiated predictions.
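The equivalence between the token-by-token objective and the partitioned form can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses a hypothetical three-token bigram model (`INIT`, `TRANS` are made-up numbers, not anything from the paper), and computes the block conditional for the trailing segment by brute-force enumeration of every alternative continuation, which also makes the size of the configuration space explicit.

```python
import itertools
import math

VOCAB = [0, 1, 2]
# Hypothetical toy bigram LM (illustrative numbers only):
# INIT[t] = Q(tau_1 = t), TRANS[s][t] = Q(tau_i = t | tau_{i-1} = s).
INIT = [0.5, 0.3, 0.2]
TRANS = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.3, 0.4],
]


def log_cond(tokens, i):
    """log Q(tau_i = tokens[i] | tokens[:i]) under the toy bigram model (0-based i)."""
    if i == 0:
        return math.log(INIT[tokens[0]])
    return math.log(TRANS[tokens[i - 1]][tokens[i]])


def log_joint(tokens):
    """Left-to-right factorization: sum of the log conditionals."""
    return sum(log_cond(tokens, i) for i in range(len(tokens)))


def partitioned_objective(tokens, b):
    """Score the first b tokens one by one, then the remaining block Y as a whole.

    log Q(Y | prefix) is computed by brute force as
    Q(prefix, Y) / sum over all alternative blocks Y' of Q(prefix, Y').
    """
    prefix, y = tokens[:b], tokens[b:]
    head = sum(log_cond(tokens, i) for i in range(b))
    numer = math.exp(log_joint(prefix + y))
    denom = sum(
        math.exp(log_joint(prefix + list(alt)))
        for alt in itertools.product(VOCAB, repeat=len(y))
    )
    return head + math.log(numer / denom)


observed = [0, 1, 1, 2, 0]
full = log_joint(observed)                      # token-by-token form (Eq. 1)
split = partitioned_objective(observed, b=3)    # partitioned form
num_y_configs = len(VOCAB) ** 2                 # configurations of a length-2 block Y
print(full, split, num_y_configs)
```

Even in this toy setting the block $Y$ has $|V|^{|Y|}$ possible values while the data provides a single observed one, which is the sparsity the paragraph above describes.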

2.2 Regularizing Mutual Information

Mutual information (MI) is a measure of how much observing one random variable reveals about another (and vice versa). It is zero if and only if the two are independent. The MI $I(X;Y)$ between two random variables $X$ and $Y$ (scalars or vectors) is the Kullback-Leibler (KL) divergence between the joint $P(X,Y)$ and the product of the marginal distributions $P(X) \otimes P(Y)$ of the two random variables:

$$I(X;Y) = \mathrm{KL}\big(P(X,Y) \,\|\, P(X) \otimes P(Y)\big) \tag{2}$$

For text data, $X$ and $Y$ can be sentences or segments of tokens (potentially extending over sentence boundaries). As MI is defined with respect to the distribution, rather than the particular observed values, it enables us to enforce dependency without committing to instantiated predictions.

We can also write $I(X;Y)$ as the difference between entropy and conditional entropy:

$$I(X;Y) = H(Y) - H(Y \mid X) = H(X) - H(X \mid Y) \tag{3}$$

Hence, high MI can be achieved by minimizing conditional entropy or maximizing marginal entropy (or both). Unlike MLE, which can only maximize MI by reducing the conditional entropy, an MI regularizer has the option to encourage long-range dependency without forcing $Q$ to commit its prediction to observed sequence(s), by instead increasing the marginal entropy $H(Y)$.
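As a small worked example of the two equivalent views of MI in Eq. 2 and Eq. 3, the sketch below computes both on a hypothetical 2x2 joint table (the numbers in `JOINT` are made up for illustration) and shows that the KL form and the entropy-difference form agree.

```python
import math

# Hypothetical joint distribution P(X, Y) over two binary variables
# (rows index x, columns index y). Illustrative numbers only.
JOINT = [[0.4, 0.1],
         [0.1, 0.4]]


def marginals(joint):
    """Return (P(X), P(Y)) as lists."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return px, py


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(v * math.log(v) for v in p if v > 0)


def mi_as_kl(joint):
    """I(X;Y) = KL(P(X,Y) || P(X) P(Y))."""
    px, py = marginals(joint)
    return sum(
        pxy * math.log(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )


def conditional_entropy_y_given_x(joint):
    """H(Y|X) = -sum_{x,y} P(x,y) log P(y|x)."""
    px, _ = marginals(joint)
    return -sum(
        pxy * math.log(pxy / px[i])
        for i, row in enumerate(joint)
        for pxy in row
        if pxy > 0
    )


def mi_as_entropy_gap(joint):
    """I(X;Y) = H(Y) - H(Y|X)."""
    _, py = marginals(joint)
    return entropy(py) - conditional_entropy_y_given_x(joint)


print(mi_as_kl(JOINT), mi_as_entropy_gap(JOINT))
```

The entropy form makes the two levers visible: MI grows either when `conditional_entropy_y_given_x` shrinks or when the marginal entropy of $Y$ grows.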

Note that the definitions in Eq. 2 and Eq. 3 depend on the distribution used, so under the data and model distributions ($P$ and $Q$), the MI is not the same in general. Henceforth, we will distinguish $I^P$ and $I^Q$ in our notation.

$I^P$ cannot be directly computed due to the lack of a functional form of $P$. For autoregressive models such as RNNs, evaluating $I^Q$ is computationally intractable since it requires summation over all possible sequences. Hence, we will instead lower bound $I^P$ and $I^Q$ in a computationally tractable way.

3 Bootstrapping a Mutual Information Regularizer


(a) Mutual information lower bound: learn to classify the correct next sentence from a randomly sampled one. This is essentially the next sentence prediction task, which was previously considered a heuristic devlin2018bert.
(b) Importance-Weighted RAML: sample another nearby sentence ($Y$), and maximize its conditional log likelihood given $X$, but with an appropriate weight, which is calculated using the MI estimator from Fig. 1(a).
Figure 1: Overview of the two key components of the proposed approach

Our operating assumption is that longer segments in the data should have high $I^P$ with each other, and our goal is for sequence variables under the model $Q$ to have similarly high $I^Q$.

At a high level, our method adds regularization terms to the MLE objective Eq. 1, in two separate phases. The illustrations in Fig. 1(a)-1(b) capture the core of our proposal. In the first phase, we bootstrap an MI lower bound by doing next sentence prediction, which is a binary classification of the correct next sentence versus a randomly sampled sentence. After some switching condition is met, we proceed to the second phase, where the MI estimator is also used to produce a reward for optimizing $I^Q$ directly using reward augmented maximum likelihood.

In order to compute the proposed regularizers, we add a small discriminator net (parametrized by $\theta$) on top of the RNN’s hidden features (parametrized by $\omega$). The discriminator then looks at pairs of segments or sequences, the sentences shown in Fig. 1(a), trying to distinguish pairs following some joint distribution (pairs with dependency) from pairs following the product of marginals (independent pairs).

The discriminator serves the MI regularization in both phases. For the first phase, Sec. 3.1 will show that making this bound tight automatically forces the hidden representation of $Q$ to preserve as much MI as possible, making the model good at recognizing related information. After the RNN and discriminator are sufficiently well trained, the learned parameters $(\theta, \omega)$ can then be applied to MI under the $Q$ distribution, to get a lower bound $I^Q_{\theta,\omega} \le I^Q$. This leads to the second phase, where in addition to continuing to optimize $I^P_{\theta,\omega}$, we use $I^Q_{\theta,\omega}$ as a reward to encourage high MI under $Q$. This has a more direct regularizing effect than $I^P_{\theta,\omega}$.

Directly optimizing $I^Q_{\theta,\omega}$ requires sampling from $Q$ and learning by policy gradient (or other gradient estimators). However, sequential sampling from $Q$ is slow, while deep RL converges slowly due to high variance. Hence, we explore an alternative, the reward augmented maximum likelihood (RAML) norouzi2016reward. Because RAML does not directly support our MI bound as the reward, we develop a modification via importance reweighting in Sec. 3.2.3. The overall algorithm is summarized in Alg. 1.

3.1 Phase-I: Next Sentence Prediction Bootstraps a Lower Bound of $I^P(X;Y)$

As previously mentioned, $I^P$ cannot be directly computed, but can be lower bounded in a number of ways, for example, via the MINE lower bound belghazi2018mine:

$$I^P(X;Y) \ge I^P_\zeta(X,Y) := E_{P(X,Y)}\big[T_\zeta(X,Y)\big] - \log E_{P(X) \otimes P(Y)}\big[e^{T_\zeta(X,Y)}\big] \tag{4}$$

where $T_\zeta(X,Y)$ is a parametrized test function trying to distinguish samples of the joint distribution from those of the product of marginals. $T_\zeta$ can be any function, and optimizing $\zeta$ makes the bound tighter. Hence, we compose some intermediary hidden layer representation $\phi_\omega(\cdot)$ of the RNN with a discriminator $D_\theta$, in order to form the test function $T_{\theta,\omega}$:

$$T_{\theta,\omega}(X,Y) = D_\theta\big(\phi_\omega(X), \phi_\omega(Y)\big) \tag{5}$$

For brevity, we will write $\phi^X_\omega = \phi_\omega(X)$ and $\phi^Y_\omega = \phi_\omega(Y)$ henceforth.

In this work, we take $X$ and $Y$ of $P(X,Y)$ to be consecutive pairs of sentences. Other pairs could also be regularized in theory, such as consecutive segments, or pairs of sentences at special positions in a document, like the first sentences of consecutive paragraphs.

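Eq. 4 can be optimized using noise contrastive estimation, by turning it into a binary classification problem as in hjelm2018learning. To sample positive examples from $P(X,Y)$, we draw $X = S_l$ for some sentence index $l$ and $Y = S_{l+1}$, i.e. $(X,Y) = (S_l, S_{l+1})$. To sample negatives from the product of marginals $P(X) \otimes P(Y)$, we take $X = S_l$, and sample $Y = S_k$ where $S_k$ is randomly drawn from the training corpus. Fig. 1(a) depicts our overall approach to bootstrap this lower bound. As pointed out by hjelm2018learning, when the goal is to maximize the MI rather than to estimate its particular value, one can use a proxy $\tilde{I}^P_{\theta,\omega}$ that has better gradient properties than $I^P_{\theta,\omega}$:

$$\tilde{I}^P_{\theta,\omega} = E_{P(X,Y)}\big[-\mathrm{SP}\big(-D_\theta(\phi^X_\omega, \phi^Y_\omega)\big)\big] - E_{P(X) \otimes P(Y)}\big[\mathrm{SP}\big(D_\theta(\phi^X_\omega, \phi^Y_\omega)\big)\big] \tag{6}$$

where $\mathrm{SP}(x) = \log(1 + e^x)$. $I^P_{\theta,\omega}$ remains a lower bound for any parameters.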
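To make the Phase-I computation concrete, the following sketch shows how positive and negative sentence pairs could be formed and how the MINE bound (Eq. 4) and the softplus proxy (Eq. 6) could be evaluated from discriminator scores. It is a minimal illustration, not the authors' implementation: the function names are hypothetical, the scores are assumed to be plain floats already produced by some discriminator, and expectations are replaced by mini-batch averages.

```python
import math
import random


def softplus(x):
    """Numerically stable SP(x) = log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def make_pairs(sentences, rng):
    """Positives are consecutive sentences (S_l, S_{l+1}); negatives pair S_l
    with a sentence drawn uniformly at random from the corpus."""
    positives = [(sentences[l], sentences[l + 1]) for l in range(len(sentences) - 1)]
    negatives = [(sentences[l], rng.choice(sentences)) for l in range(len(sentences) - 1)]
    return positives, negatives


def mine_bound(pos_scores, neg_scores):
    """Mini-batch estimate of E_joint[T] - log E_marginals[exp(T)]."""
    mean_pos = sum(pos_scores) / len(pos_scores)
    log_mean_exp = math.log(sum(math.exp(s) for s in neg_scores) / len(neg_scores))
    return mean_pos - log_mean_exp


def proxy_objective(pos_scores, neg_scores):
    """Mini-batch estimate of E_joint[-SP(-D)] - E_marginals[SP(D)].

    This equals minus the summed binary cross-entropy of a logistic classifier
    that labels positives as 1 and negatives as 0.
    """
    pos_term = sum(-softplus(-s) for s in pos_scores) / len(pos_scores)
    neg_term = sum(softplus(s) for s in neg_scores) / len(neg_scores)
    return pos_term - neg_term


rng = random.Random(0)
positives, negatives = make_pairs(["a", "b", "c", "d"], rng)
# Hypothetical discriminator scores: high on true pairs, low on random pairs.
print(mine_bound([2.0, 2.0], [-2.0, -2.0]), proxy_objective([2.0, 2.0], [-2.0, -2.0]))
```

The proxy is bounded above by zero and is maximized by the same kind of discriminator that tightens the MINE bound, which is why it can serve as the training signal while Eq. 4 is used for reporting.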

3.1.1 Regularizing Effect on Model $Q$

To understand how maximizing $I^P_{\theta,\omega}$ regularizes the model $Q$, note that the MI between the encodings is a lower bound on the MI of the raw inputs, by the Data Processing Inequality cover2012elements. In other words, $I^P(X;Y) \ge I^P(\phi^X_\omega; \phi^Y_\omega)$ (proof in Suppl. Appendix A.1). Because $D_\theta$ is also the test function for the joint versus product of marginals on the random variables $\phi^X_\omega$ and $\phi^Y_\omega$, we have $I^P(X;Y) \ge I^P(\phi^X_\omega; \phi^Y_\omega) \ge I^P_\theta(\phi^X_\omega, \phi^Y_\omega) = I^P_{\theta,\omega}(X,Y)$, i.e. the MI of features is sandwiched between the MI of data and our parametric lower bound.

Therefore, while $I^P(X;Y)$ is a fixed value for the data, estimating a bound for $I^P$ by optimizing both $\theta$ and $\omega$ pushes the hidden representation to capture as much data MI as possible. Viewed from a different angle, it is equivalent to estimating a bound for the MI between $\phi^X_\omega$ and $\phi^Y_\omega$, $I^P(\phi^X_\omega; \phi^Y_\omega)$ (using the add-on discriminator $D_\theta$), and then optimizing the $Q$-model features $\phi^X_\omega$ and $\phi^Y_\omega$ to have high mutual information.

Intuitively, this step encourages the $\phi_\omega$’s to recognize related information in the data. In the next section, we will develop a method to directly optimize $I^Q(X;Y)$.

3.2 Phase-II: Directly Optimizing $I^Q(X;Y)$

As mentioned, the regularization effect of Phase-I is indirect, as the expectation is with respect to the data distribution $P$. We now discuss how to directly and efficiently optimize $I^Q(X;Y)$.

To this end, after sufficient training from Phase-I, we take the learned parameters $\theta, \omega$ to initialize the lower bound $I^Q_{\theta,\omega}$. Optimizing $I^Q_{\theta,\omega}$ poses a series of challenges which we will tackle in the next subsections (Sec. 3.2.1-3.2.3). We emphasize that during Phase-II, we still optimize $I^P_{\theta,\omega}$ from Phase-I, but with an additional regularization term; together they approximate $I^Q_{\theta,\omega}$.

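3.2.1 Difficulty with optimizing $I^Q_{\theta,\omega}$

Because the MINE bound holds for any parameters, we can instead use the binary classification form to optimize the parameters, similar to what we do for $I^P_{\theta,\omega}$ and as done in hjelm2018learning. The proxy objective has the form $\tilde{I}^Q_{\theta,\omega} = E_{Q(X,Y)}\big[R^+_{\theta,\omega}\big] - E_{Q(X) \otimes Q(Y)}\big[R^-_{\theta,\omega}\big]$, where

$$R^+_{\theta,\omega} = -\mathrm{SP}\big(-D_\theta(\phi^X_\omega, \phi^Y_\omega)\big), \qquad R^-_{\theta,\omega} = \mathrm{SP}\big(D_\theta(\phi^X_\omega, \phi^Y_\omega)\big) \tag{7}$$

To optimize $\tilde{I}^Q_{\theta,\omega}$ with respect to $\zeta = (\theta, \omega)$, the gradient has two terms, $\nabla_\zeta \tilde{I}^Q_{\theta,\omega} = g_1 + g_2$, where

$$g_1 = E_{Q(X,Y)}\big[\nabla R^+_{\theta,\omega}\big] - E_{Q(X) \otimes Q(Y)}\big[\nabla R^-_{\theta,\omega}\big] \tag{8}$$

$$g_2 = E_{Q(X,Y)}\big[R^+_{\theta,\omega} \nabla \log Q(X,Y)\big] - E_{Q(X) \otimes Q(Y)}\big[R^-_{\theta,\omega} \big(\nabla \log Q(X) + \nabla \log Q(Y)\big)\big] \tag{9}$$

$g_2$ uses policy gradient (i.e. the likelihood ratio estimator) with $Q$ being the policy while $R^+$ and $-R^-$ are the reward (and penalty). $g_2$ can be variance-reduced by control-variate methods, e.g. rennie2017self.

However, deep RL is known to converge slowly due to high variance, and our trials confirm the difficulty in this particular case. Furthermore, sampling from $Q$ is generally slow for autoregressive models as it cannot be easily parallelized. These two issues compounded mean that we would like to avoid sampling from $Q$. To this end, we develop a modification of the reward augmented maximum likelihood (RAML) norouzi2016reward, which avoids the high variance and slow $Q$-sampling.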

For the $g_1$ part (Eq. 8), if we simply replace the $Q$ distributions with $P$ in the expectation, we recover the Phase-I regularizer Eq. 6, which we can use to approximate $g_1$. The bias of this approximation is:

$$\sum_{X,Y} \big(Q(X,Y) - P(X,Y)\big) \nabla R^+ - \sum_{X,Y} \big(Q(X)Q(Y) - P(X)P(Y)\big) \nabla R^- \tag{10}$$

which becomes small as the maximum likelihood learning progresses, because in both terms, the total variation distance $\sum |Q - P|$ is bounded by $\sqrt{2\,\mathrm{KL}(P \,\|\, Q)}$ via Pinsker’s inequality Tsybakov:2008:INE:1522486.
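Pinsker's inequality, in the form used for this argument, states that the $L_1$ distance between two distributions satisfies $\sum |P - Q| \le \sqrt{2\,\mathrm{KL}(P \,\|\, Q)}$. The sketch below checks it numerically on hypothetical three-outcome distributions (made-up numbers), to illustrate why driving the KL divergence down through maximum likelihood also shrinks the gap between expectations taken under $P$ and under $Q$.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def l1_distance(p, q):
    """Sum of absolute differences (twice the total variation distance)."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))


def pinsker_slack(p, q):
    """sqrt(2 KL(p || q)) - ||p - q||_1, which Pinsker's inequality says is >= 0."""
    return math.sqrt(2.0 * kl_divergence(p, q)) - l1_distance(p, q)


P = [0.5, 0.3, 0.2]                   # stand-in for the data distribution
CANDIDATES = [
    [0.2, 0.2, 0.6],                  # poor model
    [0.4, 0.3, 0.3],                  # better model
    [0.5, 0.3, 0.2],                  # perfect model
]
for q in CANDIDATES:
    print(kl_divergence(P, q), l1_distance(P, q), pinsker_slack(P, q))
```

As the candidate model moves toward `P`, both the KL divergence and the $L_1$ distance go to zero, so any bias term of the form $\sum (Q - P) f$ with bounded $f$ vanishes with them.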

3.2.2 IW-RAML: RAML background

RAML can be viewed as optimizing the reverse direction of the KL divergence compared to the entropy-regularized policy gradient RL objective. We leave the details of RAML to Appendix A.2 and refer readers to the work norouzi2016reward. For our purpose here, the important information is that the RAML gradient and the policy gradient are:

$$\nabla L_{\mathrm{RAML}} = -E_{p^*_\beta(Y \mid Y^*)}\big[\nabla \log Q_\omega(Y \mid X)\big] \tag{11}$$

$$\nabla L_{\mathrm{RL}} = -E_{Q_\omega(Y \mid X)}\big[r(Y, Y^*)\, \nabla \log Q_\omega(Y \mid X)\big] \tag{12}$$

where $p^*_\beta(Y \mid Y^*)$ is the exponentiated pay-off distribution defined as:

$$p^*_\beta(Y \mid Y^*) = \exp\{r(Y, Y^*)/\beta\} \,/\, Z(Y^*, \beta) \tag{13}$$

$r(Y, Y^*)$ is a reward function that measures some similarity of $Y$ with respect to the ground truth $Y^*$ (e.g. negative edit-distance). The RAML gradient Eq. 11 samples from a stationary distribution, while the policy gradient Eq. 12 samples from the changing $Q_\omega$ distribution. Furthermore, by definition, samples from $p^*_\beta(Y \mid Y^*)$ have a higher chance of high reward, while samples from $Q_\omega(Y \mid X)$ rely on exploration. For these reasons, RAML has much lower variance than RL.

3.2.3 IW-RAML: MI Reward

Unfortunately, sampling from $p^*_\beta(Y \mid Y^*)$ can only be done efficiently for some special classes of reward such as the edit-distance used in norouzi2016reward. Here, we would like to use the learned MI estimator, more specifically the classifier scores, as the reward. Assume $Y^*$ is the sentence following $X$ in the corpus; then for any other $Y$, the reward is:

$$r(Y, Y^*; X) = D_\theta(\phi^X_\omega, \phi^Y_\omega) - D_\theta(\phi^X_\omega, \phi^{Y^*}_\omega) \tag{14}$$

In the illustration in Fig. 1(b), $X$ and $Y^*$ are a sentence and its successor, and another sentence $Y$ is sampled to be evaluated. $Y$ could also be any other sentence/segment not in the dataset.

As the deep-neural-net-computed scores lack the simple structure of edit-distance that can be exploited for efficient sampling from $p^*_\beta(Y \mid Y^*)$, direct application of RAML to the MI reward is not possible. We will instead develop an efficient alternative based on importance sampling.

Intuitively, a sentence that is near $X$ in the text would tend to be more related to it, and vice versa. Therefore, we can use a geometric distribution based at the index of $Y^*$ as the proposal distribution, as illustrated in Fig. 1(b). Let $Y^*$ have sentence/segment index $m$, then

$$G(Y = S_k \mid Y^* = S_m) = (1 - \lambda)^{(k - m)} \lambda \tag{15}$$

where $\lambda$ is a hyperparameter (set without tuning). Other proposals are also possible. With $G$ as the proposal, our importance weighted RAML (IW-RAML) gradient is then:

$$\nabla L_{\mathrm{IW\text{-}RAML}} = -E_{G}\Big[\nabla \log Q_\omega(Y \mid X)\, \frac{p^*_\beta(Y \mid Y^*)}{G(Y \mid Y^*)}\Big] \tag{16}$$

Because the reward in Eq. 14 is shift-standardized with respect to the discriminator score at $Y^*$, we assume that the normalization constant $Z$ in Eq. 13 does not vary heavily for different $Y^*$, so that we can perform self-normalizing importance sampling by averaging across the mini-batches.
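The importance-weighting step can be sketched as follows. This is an illustrative reading of the procedure rather than the authors' code: it assumes the geometric proposal is over non-negative index offsets with probability `lam * (1 - lam) ** offset`, that rewards are discriminator score differences already computed elsewhere, and that the unknown normalization constant is handled by self-normalizing across the mini-batch. All names and the example numbers are hypothetical.

```python
import math
import random


def geometric_pmf(offset, lam):
    """Assumed proposal: probability of sampling the sentence `offset` positions away."""
    return lam * (1.0 - lam) ** offset


def sample_offset(lam, rng):
    """Draw an offset from the geometric proposal (number of failures before a success)."""
    offset = 0
    while rng.random() >= lam:
        offset += 1
    return offset


def iw_raml_weights(rewards, offsets, lam, temperature):
    """Self-normalized importance weights proportional to exp(r / temperature) / G."""
    raw = [
        math.exp(r / temperature) / geometric_pmf(d, lam)
        for r, d in zip(rewards, offsets)
    ]
    total = sum(raw)
    return [w / total for w in raw]


def iw_raml_loss(weights, log_probs):
    """Weighted negative log likelihood; log_probs[i] stands in for log Q(Y_i | X_i)."""
    return -sum(w * lp for w, lp in zip(weights, log_probs))


rng = random.Random(0)
offsets = [sample_offset(0.3, rng) for _ in range(4)]
# Hypothetical rewards and model log likelihoods for the four sampled sentences.
rewards = [0.0, -0.5, -1.0, -2.0]
weights = iw_raml_weights(rewards, offsets, lam=0.3, temperature=1.0)
loss = iw_raml_loss(weights, [-12.0, -15.0, -11.0, -20.0])
print(offsets, weights, loss)
```

Because the proposal is fixed, the sampling step never touches the model, and only the weights change as the discriminator is updated.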

3.2.4 IW-RAML: Bias-Variance Trade-off

A side benefit of introducing $G$ is to re-establish the stationarity of the sampling distribution in the RAML gradient estimator. Because the reward function Eq. 14 depends on $(\theta, \omega)$, the exponentiated pay-off distribution is no longer stationary like in the original RAML with simple reward norouzi2016reward, but we regain stationarity through the fixed proposal $G$, keeping the variance low. Stationarity of the sampling distribution is one of the reasons for the lower variance in RAML.

Choosing IW-RAML over RL is a bias-variance trade-off. The RL objective gradient in Eq. 8-9 is the unbiased one, and IW-RAML as introduced has a few biases: using the opposite direction of the KL divergence (analyzed in norouzi2016reward); dropping the softplus nonlinearity in the reward definition Eq. 14; and the distribution support of $G$ being smaller than that of $p^*_\beta(Y \mid Y^*)$. Each of these approximations introduces some bias, but the overall variance is significantly reduced, as the empirical analysis in Sec. 5.3 shows.

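1:   Input: batch size $M$, dataset $\Omega$, proposal distribution $G$, maximum number of iterations $N$.
2:   phase-two := false
3:   for itr $= 1, \ldots, N$ do
4:       Compute LM objective $L_{\mathrm{MLE}}(\omega)$ from Eq. 1 and its gradient; # (1)
5:       Sample a mini-batch of consecutive sentences $\{X_i, Y_i\}_{i=1}^{M}$ from $\Omega$ as samples from $P(X,Y)$;
6:       Sample another mini-batch of $\{Y^-_i\}_{i=1}^{M}$ from $\Omega$ to form $\{X_i, Y^-_i\}_{i=1}^{M}$ as samples from $P(X) \otimes P(Y)$;
7:       Extract features $\phi^X_\omega$, $\phi^Y_\omega$ and $\phi^{Y^-}_\omega$, and compute $\tilde{I}^P_{\theta,\omega}$ according to Eq. 6 and its gradient; # (2)
8:       if phase-two then
9:           Sample a mini-batch of $\{\tilde{Y}_i\}_{i=1}^{M}$ from $\Omega$ according to $G$, each with corresponding $Y^* = Y_i$.
10:           Compute IW-RAML gradients according to Eq. 16, with $Y^* = Y_i$, $Y = \tilde{Y}_i$, and $X = X_i$. # (3)
11:       end if
12:       Add gradient contributions from (1), (2), (3) and update parameters $\omega$ and $\theta$.
13:       if not phase-two and meeting switch condition then
14:           phase-two := true
15:       end if
16:   end for
Algorithm 1 Language Model Learning with BMI regularizer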

4 Related Work

Long Range Dependency and Gradient Flow  

Capturing long-range dependency has been a major challenge in sequence learning. Most works have focused on the gradient flow in backpropagation through time (BPTT). The LSTM architecture lstm1997 was invented to address the very problem of vanishing and exploding gradients in RNNs hochreiter2001gradient. There is a vast literature on improving the gradient flow with new architectural modifications or regularization mikolov2014learning ; koutnik2014clockwork ; wu2016multiplicative ; li2018independently. Seq-to-seq with attention or memory bahdanau2014neural ; cho2015describing ; sukhbaatar2015end ; joulin2015inferring is a major neural architecture advance that improves the gradient flow by shortening the path that relevant information needs to traverse in the computation graph. The recent invention of the Transformer architecture vaswani2017attention, and the subsequent large scale pre-training successes devlin2018bert ; radford2018improving ; gpt2, are further examples of better architecture improving gradient flow.

Regularization via Auxiliary Tasks   Closer to our method are works that use auxiliary prediction tasks as regularization trinh2018learning ; devlin2018bert. trinh2018learning uses an auxiliary task of predicting some random future or past subsequence with a reconstruction loss. Their focus is still on vanishing/exploding gradients and issues caused by BPTT. Their method is justified empirically, and it is unclear whether the auxiliary task losses are compatible with the maximum likelihood objective of language modelling, which they did not experiment on. devlin2018bert adds a “next sentence prediction” task to its masked language model objective, which tries to classify whether a sentence is the correct next one or randomly sampled. This task is the same as our Phase-I for learning the lower bound $I^P_{\theta,\omega}$, but we are the first to draw the theoretical connection to mutual information and to explain its regularization effect on the model (Sec. 3.1.1). Applying the bootstrapped MI bound for more direct regularization in Phase-II is novel to our method.

Language Modeling with Extra Context   Modeling long range dependency is crucial to language models, since capturing the larger context effectively can help predict the next token. In order to capture this dependency, there are some works that feed an additional representation of larger context into the network including additional block, document or corpus level topic or discourse information mikolov2012context ; wang2015larger ; dieng2016topicrnn ; wang2017topic . Our work is orthogonal to them and can be combined.

5 Experiments

We experiment on two widely-used benchmarks for word-level language modeling, Penn Treebank (PTB) mikolov2012context and WikiText-2 (WT2) merity2016pointer. We choose the recent state-of-the-art model among RNN-based models on these two benchmarks, AWD-LSTM-MoS yang2017breaking, as our baseline.

We compare the baseline with the same model augmented with variants of our proposed regularizer, the Bootstrapping Mutual Information (BMI) regularizer: (1) BMI-base: apply Phase-I throughout the training; (2) BMI-full: apply Phase-I until we learn a good enough $I^P_{\theta,\omega}$, then apply both Phase-I and Phase-II. To switch from Phase-I to Phase-II, we adopt the same condition used for switching from SGD to ASGD polyak1992acceleration in training RNN language models, first proposed by merity2017regularizing.

Experimental Setup 

We apply max-pooling over the hidden states of all the layers in the LSTM and concatenate them as our $\phi$-encoding. We use a one-layer feedforward network with features similar to conneau-EtAl:2017:EMNLP2017 as $D_\theta$ for our test function. The ADAM kingma2014adam optimizer with weight decay is applied on $\theta$, while $\omega$ is optimized in the same way as in merity2017regularizing ; yang2017breaking with SGD then ASGD polyak1992acceleration. All the above hyperparameters are chosen by validation perplexity on PTB and applied directly to WT2. The weight of the regularizer term is chosen separately for PTB and WT2 by validation perplexity on their respective datasets. The remaining architecture and hyperparameters follow exactly the code released by yang2017breaking. As mentioned previously, the temperature hyperparameter in RAML and the hyperparameter $\lambda$ of the importance sampling proposal are both set without tuning.

5.1 Perplexity and Reverse Perplexity

Table 1 presents the main results of language modeling. We evaluate the baseline and variants of our approach with and without the finetuning described in the baseline paper yang2017breaking. In all settings, the models with BMI outperform the baseline, and BMI-full (with IW-RAML) yields further improvement on top of BMI-base (without IW-RAML).

Following zhao2018adversarially, we use reverse perplexity to measure the diversity aspect of generation quality. We generate a chunk of text from each model and train a second RNN language model (RNN-LM) on the generated text; we then evaluate the perplexity of the held-out data from PTB and WikiText2 under the second language model. Note that the second RNN-LM is a regular LM trained from scratch and used for evaluation only. As shown in Table 1, the models with the BMI regularizer improve the reverse perplexity over the baseline by a significant margin, indicating better generation diversity, which is to be expected as the MI regularizer encourages higher marginal entropy (in addition to lower conditional entropy).
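The reverse perplexity protocol can be illustrated end to end on a toy scale. The sketch below is a stand-in under loud assumptions: it replaces the second RNN-LM with an add-one-smoothed unigram model (which ignores word order and is not what the paper uses), and the "generated" and "held-out" token lists are made up. It only shows the mechanics: fit an evaluation model on generated text, then score real held-out text with it.

```python
import math
from collections import Counter


def train_unigram(tokens, vocab):
    """Fit an add-one-smoothed unigram LM; a toy stand-in for the evaluation RNN-LM."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


def perplexity(model, tokens):
    """exp of the average negative log likelihood per token."""
    nll = -sum(math.log(model[t]) for t in tokens) / len(tokens)
    return math.exp(nll)


def reverse_perplexity(generated, heldout, vocab):
    """Train on generated text, evaluate on real held-out text."""
    return perplexity(train_unigram(generated, vocab), heldout)


VOCAB = ["a", "b", "c"]
HELDOUT = ["a", "b", "c", "a", "b", "c"]          # hypothetical real data
DIVERSE = ["a", "b", "c", "a", "b", "c"]          # generator covering all tokens
COLLAPSED = ["a", "a", "a", "a", "a", "a"]        # generator stuck on one token

ppl_diverse = reverse_perplexity(DIVERSE, HELDOUT, VOCAB)
ppl_collapsed = reverse_perplexity(COLLAPSED, HELDOUT, VOCAB)
print(ppl_diverse, ppl_collapsed)
```

A generator that covers less of the data distribution yields an evaluation model that is surprised by real text, so low diversity shows up as high reverse perplexity.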

Fig. 2 shows the learning curves of each model on both datasets after switching to ASGD, as mentioned earlier in Experimental Setup. The validation perplexities of the BMI models decrease faster than that of the baseline AWD-LSTM-MoS. In addition, BMI-full is consistently better than BMI-base and can further decrease the perplexity after BMI-base and AWD-LSTM-MoS stop improving.

5.2 Empirical MI on generations

To verify that BMI indeed increases $I^Q$, we measure the sample MI of generated texts as well as of the training corpus. Since the MI of long sequence pairs cannot be directly computed from samples, we instead estimate lower bounds by learning evaluation discriminators $D_{\mathrm{eval}}$ on the generated text. $D_{\mathrm{eval}}$ is completely separate from the learned model, and is much smaller in size. We train the $D_{\mathrm{eval}}$’s using the proxy objective in Eq. 6 and early-stop based on the MINE lower bound Eq. 4 on the validation set, then report the MINE bound value on the test set. This estimated lower bound essentially measures the degree of dependency. Table 2 shows that BMI generations exhibit higher MI than those of the baseline AWD-LSTM-MoS, while BMI-full improves over BMI-base.

5.3 Analysis: RL vs. IW-RAML variance

Fig. 3 compares the gradient variance under RL and IW-RAML on PTB. The gradient variance for each parameter is estimated over a number of iterations after the initial learning stops and switches to ASGD; the ratio of the variances of the corresponding parameters is then aggregated into the histogram. For RL, we use policy gradient with a self-critical baseline for variance reduction rennie2017self. Only gradient contributions from the regularizers are measured, while the language model MLE objective is excluded.

The histogram shows that the RL variance is substantially larger than that of IW-RAML on average, with almost all of the parameters having higher gradient variance under RL. A significant portion also has orders of magnitude higher variance under RL than under IW-RAML. For this reason, policy gradient RL does not contribute to learning when applied in Phase-II in our trials.
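The per-parameter variance comparison described above can be sketched as follows. This is an assumed reading of the analysis, not the authors' script: gradients are represented as plain lists of floats per iteration, the variance is the sample variance across iterations, and the quantity collected for the histogram is the base-10 log of the RL to IW-RAML variance ratio. The numbers are fabricated purely to exercise the functions.

```python
import math
import statistics


def per_parameter_variance(grad_samples):
    """grad_samples[t][j] is the gradient of parameter j at iteration t.

    Returns the sample variance of each parameter's gradient across iterations.
    """
    return [statistics.variance(column) for column in zip(*grad_samples)]


def log10_variance_ratio(rl_grads, iw_grads):
    """log10(Var_RL / Var_IW-RAML) per parameter; positive means RL is noisier."""
    rl_var = per_parameter_variance(rl_grads)
    iw_var = per_parameter_variance(iw_grads)
    return [math.log10(a / b) for a, b in zip(rl_var, iw_var)]


# Toy gradients for two parameters over two iterations (illustrative only).
RL_GRADS = [[0.0, 0.0], [10.0, 2.0]]
IW_GRADS = [[0.0, 0.0], [1.0, 2.0]]

ratios = log10_variance_ratio(RL_GRADS, IW_GRADS)
fraction_rl_higher = sum(r > 0 for r in ratios) / len(ratios)
print(ratios, fraction_rl_higher)
```

Binning `ratios` gives a histogram of the kind summarized in Fig. 3, where mass to the right of zero corresponds to parameters whose gradient is noisier under RL.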

                      PTB                            WT2
                      PPL           Reverse PPL      PPL           Reverse PPL
Model                 Valid  Test   Valid  Test      Valid  Test   Valid  Test
AWD-LSTM-MoS          58.08  55.97  82.88  77.57     66.01  63.33  93.52  88.79
BMI-base              57.16  55.02  80.64  75.31     64.24  61.67  90.95  86.31
BMI-full              56.85  54.65  78.46  73.73     63.86  61.37  90.20  85.11
AWD-LSTM-MoS (ft.)    56.54  54.44  80.29  75.51     63.88  61.45  91.32  85.69
BMI-base (ft.)        56.05  53.97  78.04  73.35     63.14  60.61  89.09  84.01
BMI-full (ft.)        55.61  53.67  75.81  71.81     62.99  60.51  88.27  83.43
Table 1: Perplexity and reverse perplexity on PTB and WT2.
(a) PTB
(b) WT2
Figure 2: Learning curve for validation perplexity on PTB and WT2 after switching.
Table 2: Estimated MI (lower bounds) of $X$ and $Y$, two random segments separated by a fixed number of tokens. Estimations using cross-validation and testing.
Generations     PTB            WT2
AWD-LSTM-MoS    0.25 ± 0.03    0.76 ± 0.03
BMI-base        0.47 ± 0.03    0.88 ± 0.05
BMI-full        0.48 ± 0.03    1.01 ± 0.06
Real Data       1.18 ± 0.08    2.14 ± 0.07
Figure 3: Grad variance ratio (RL / IW-RAML)

6 Conclusion

We have proposed a principled mutual information regularizer for improving long-range dependency in sequence modelling. To the best of our knowledge, this is the first work to recognize and address the sparse sampling of high order interactions as an issue hindering long-range dependency learning, orthogonal to the gradient flow problem.


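Appendix A

A.1 Proof of $I^P(X;Y) \ge I^P(\phi^X_\omega; \phi^Y_\omega)$

Proof: We apply the Data Processing Inequality (DPI) cover2012elements twice: $I(X;Y) \ge I(X; \phi^Y_\omega) \ge I(\phi^X_\omega; \phi^Y_\omega)$. The first inequality holds due to the DPI applied on the Markov chain $X \to Y \to \phi^Y_\omega$; the second one then follows from the DPI applied on $\phi^Y_\omega \to X \to \phi^X_\omega$.

Note: the Markov chains are not additional assumptions, but merely a statement that $\phi^X_\omega$ does not depend on $\phi^Y_\omega$ when $X$ is given (similarly for the first Markov chain).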

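A.2 RAML Background

The key idea behind RAML is to observe that the entropy-regularized policy gradient RL objective can be written as (up to constant and scaling):

$$L_{\mathrm{RL}} = \sum_{(X, Y^*)} \mathrm{KL}\big(Q_\omega(Y \mid X) \,\|\, p^*_\beta(Y \mid Y^*)\big) \tag{17}$$

where $p^*_\beta(Y \mid Y^*)$ is the exponentiated pay-off distribution defined as:

$$p^*_\beta(Y \mid Y^*) = \exp\{r(Y, Y^*)/\beta\} \,/\, Z(Y^*, \beta) \tag{18}$$

$r(Y, Y^*)$ is a reward function that measures some similarity of $Y$ with respect to the ground truth $Y^*$ (e.g. negative edit-distance). In RAML norouzi2016reward, by contrast, one optimizes the KL in the reverse direction:

$$L_{\mathrm{RAML}} = \sum_{(X, Y^*)} \mathrm{KL}\big(p^*_\beta(Y \mid Y^*) \,\|\, Q_\omega(Y \mid X)\big) \tag{19}$$

It was shown that these two losses have the same global extremum, and that away from it their gap is bounded under some conditions norouzi2016reward. Compare the RAML gradient with the policy gradient:

$$\nabla L_{\mathrm{RAML}} = -E_{p^*_\beta(Y \mid Y^*)}\big[\nabla \log Q_\omega(Y \mid X)\big] \tag{20}$$

$$\nabla L_{\mathrm{RL}} = -E_{Q_\omega(Y \mid X)}\big[r(Y, Y^*)\, \nabla \log Q_\omega(Y \mid X)\big] \tag{21}$$

The RAML gradient samples from a stationary distribution, while the policy gradient samples from the changing $Q_\omega$ distribution. Furthermore, samples from $p^*_\beta(Y \mid Y^*)$ have a higher chance of landing in configurations of high reward by definition, while samples from $Q_\omega(Y \mid X)$ rely on random exploration to discover sequences with high reward. For these reasons, RAML has much lower variance than RL.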

A.3 Additional Experiment Details

All experiments are conducted on single (1080Ti) GPUs with PyTorch.

We manually tune the following hyperparameters based on validation perplexity: the BMI regularizer weight, the hidden state size, and the Adam learning rate.