1 Introduction
Learning long-range dependency in sequential data such as text is challenging, and the difficulty has mostly been attributed to the vanishing gradient problem in autoregressive neural networks such as RNNs hochreiter2001gradient . There is a vast literature trying to solve this gradient flow problem through better architecture hochreiter2001gradient ; mikolov2014learning ; vaswani2017attention , better optimization martens2011learning or better initialization le2015simple . However, there is an orthogonal issue that has received less attention: statistical dependency over a short span is usually abundant in data, e.g., bigrams, common phrases and idioms; long-range dependency, by contrast, typically involves more complex or abstract relationships among a large number of tokens (high order interactions). In other words, there is a sampling mismatch between observations supporting local correlations and evidence for high order interactions, even though the latter require more samples to learn in the first place because they involve more variables. We conjecture that, in addition to the gradient flow issue, this sparse sampling of high order statistical relations renders learning long-range dependency hard in natural language processing.
Take language modelling for example: with a vocabulary of size $|V|$, the number of possible sequences grows as $|V|^T$ with sequence length $T$. Neural language models use distributed representation to overcome this issue bengio2003neural , as not all sequences form plausible natural language utterances, and there is shared semantics and compositionality in different texts. However, the parametrization does not change the fundamental fact that in the training data, there is an abundance of observations for local patterns, but much sparser observations for the different high-level ideas. As language evolved to express the endless possibilities of the world, even among the set of “plausible” long sequences, a training set can only cover a small fraction. Therefore, there is an inherent imbalance of sampling between short and long range dependencies. Because this is a data sparsity issue at its core, it cannot be completely solved by better architecture or optimization.
The natural remedy when facing limited data is to regularize the model using prior knowledge. In this work, we propose a novel approach for incorporating into the usual maximum likelihood objective the additional prior that long-range dependency exists in texts. We achieve this by bootstrapping a lower bound on the mutual information (MI) over groups of variables (segments or sentences) and subsequently applying the bound to encourage high MI. Both the bootstrapping and the application of the bound improve long-range dependency learning: first, the bootstrap step helps the neural network’s hidden representation to recognize evidence for high mutual information that exists in the data distribution; second, the information lower bound value as the reward encourages the model distribution to exhibit high mutual information as well. We apply the proposed method to language modelling, although the general framework could apply to other problems as well.
Our work offers a new perspective on why the heuristic of next sentence prediction used in previous works trinh2018learning ; devlin2018bert is a useful auxiliary task, while revealing missing ingredients, which we complete in the proposed algorithm. We demonstrate improved perplexity on two established benchmarks, reflecting the positive regularizing effect. We also show that our proposed method can help the model generate higher-quality samples with more diversity, measured by reverse perplexity zhao2018adversarially , and more dependency, measured by an empirical lower bound of mutual information.
2 Background
2.1 MLE Language Model and Sparsely Observed High Order Dependency
A language model (LM) assigns a probability to a sequence of tokens (characters, bytes, or words). Let $x_i$ denote token variables; an LM $Q$ factorizes the joint distribution of the $x_i$’s into a product of conditionals from left to right, leveraging the inherent order of text: $Q(x_1, \ldots, x_k) = \prod_{i=1}^{k} Q(x_i \mid x_{<i})$, where $x_{<i}$ denotes all token variables with index less than $i$, and $Q(x_1 \mid x_{<1}) = Q(x_1)$. Let $(x_i)_{i=1}^{n}$ be an observed sequence of tokens as training data, sampled from the data distribution $P$. Learning simply maximizes the log likelihood of the observations with respect to the parameters $\omega$ of $Q$ (we will use the notation $Q$ and $Q_\omega$ interchangeably):
$$L_{\mathrm{MLE}}(\omega) = \sum_{i=1}^{n} \log Q_\omega(x_i \mid x_{<i}) \qquad (1)$$
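As a minimal sketch of the left-to-right factorization behind Eq. 1, the snippet below sums the log conditionals of one observed sequence. The bigram table `TOY_BIGRAM` and the helper names are hypothetical stand-ins for a real neural LM and are not part of the original method.

```python
import math

def sequence_log_likelihood(tokens, cond_prob):
    """Return sum_i log Q(x_i | x_{<i}) for one observed token sequence."""
    total = 0.0
    for i, token in enumerate(tokens):
        # Each factor conditions on the full prefix x_{<i}.
        total += math.log(cond_prob(token, tokens[:i]))
    return total

# Hypothetical toy model: a bigram table, so only the last prefix token matters.
TOY_BIGRAM = {
    None: {"a": 0.5, "b": 0.5},  # distribution of the first token
    "a": {"a": 0.1, "b": 0.9},
    "b": {"a": 0.6, "b": 0.4},
}

def toy_cond_prob(token, prefix):
    previous = prefix[-1] if prefix else None
    return TOY_BIGRAM[previous][token]

log_lik = sequence_log_likelihood(["a", "b", "a"], toy_cond_prob)
print(log_lik)  # equals log(0.5) + log(0.9) + log(0.6)
```

Maximum likelihood training would adjust the table entries (or network weights) to increase this sum over the training corpus.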
As Eq. 1 requires the model $Q$ to focus its probability mass on the observed subsequent tokens given the preceding ones, maximum likelihood does have the ability to enforce long-range dependencies of sequence variables. However, Eq. 1 hides issues about high order interactions where a relatively smaller fraction of the valid outcomes are observed. To see this, take a partition of the sequence variables $(x_i)_{i=1}^{n}$ into $[x_{<a}, X, Y]$, where $X = (x_a, \ldots, x_b)$ and $Y = (x_{b+1}, \ldots, x_n)$; then Eq. 1 is equivalent to:
$$L_{\mathrm{MLE}}(\omega) = \sum_{i=1}^{b} \log Q_\omega(x_i \mid x_{<i}) + \log Q_\omega(Y \mid X, x_{<a})$$
Now we can see that, as in the case of a single next token prediction, MLE prefers to commit its prediction to the particular observed sequence(s) of $Y$, but this observed set is too sparse for the much larger configuration space. We propose to use MI as a way to express the belief that there is some dependency between $X$ and $Y$ without committing to particular instantiated predictions.
2.2 Regularizing Mutual Information
Mutual information (MI) measures how much observing one random variable reveals about another (and vice versa). It is zero if and only if the two are independent. The MI $I(X;Y)$ between two random variables $X$ and $Y$ (scalars or vectors) is the Kullback-Leibler (KL) divergence between the joint distribution $P_{XY}$ and the product of the marginal distributions $P_X \otimes P_Y$ of the two random variables:
$$I(X;Y) = \mathrm{KL}(P_{XY} \,\|\, P_X \otimes P_Y) \qquad (2)$$
For text data, $X$ and $Y$ can be sentences or segments of tokens (potentially extending over sentence boundaries). As MI is defined with respect to the distribution, rather than the particular observed values, it enables us to enforce dependency without committing to instantiated predictions.
We can also write $I(X;Y)$ as the difference between entropy and conditional entropy:
$$I(X;Y) = H(Y) - H(Y \mid X) \qquad (3)$$
Hence, high MI can be achieved by minimizing the conditional entropy $H(Y \mid X)$ or maximizing the marginal entropy $H(Y)$ (or both). Unlike MLE, which can only increase MI by reducing the conditional entropy, an MI regularizer has the option to encourage long-range dependency without forcing the model $Q$ to commit its prediction to the observed sequence(s), by instead increasing the marginal entropy $H(Y)$.
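To make the two views of MI concrete (KL divergence between the joint and the product of marginals, and entropy minus conditional entropy), here is a small sketch on a finite joint table. The tables `DEPENDENT` and `INDEPENDENT` are made-up toy distributions used only for illustration.

```python
import math

def marginals(joint):
    """Marginal distributions of X and Y from a {(x, y): prob} table."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return px, py

def mi_as_kl(joint):
    """MI as KL(joint || product of marginals)."""
    px, py = marginals(joint)
    return sum(p * math.log(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

def mi_as_entropy_gap(joint):
    """MI as H(Y) - H(Y | X)."""
    px, py = marginals(joint)
    h_y = -sum(p * math.log(p) for p in py.values() if p > 0)
    h_y_given_x = -sum(p * math.log(p / px[x])
                       for (x, y), p in joint.items() if p > 0)
    return h_y - h_y_given_x

DEPENDENT = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
INDEPENDENT = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

print(mi_as_kl(DEPENDENT), mi_as_entropy_gap(DEPENDENT))  # the two values agree
print(mi_as_kl(INDEPENDENT))                              # zero for independent variables
```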
Note that the definitions in Eq. 2 and Eq. 3 depend on the distribution used, so the MI under the data distribution $P$ and the MI under the model distribution $Q$ are not the same in general. Henceforth, we will distinguish the two as $I_P$ and $I_Q$ in our notation.
$I_P$ cannot be directly computed because the functional form of $P$ is unknown. For autoregressive models such as RNNs, evaluating $I_Q$ is computationally intractable since it requires summation over all possible sequences. Hence, we will instead lower bound $I_P$ and $I_Q$ in a computationally tractable way.
3 Bootstrapping a Mutual Information Regularizer
Our operating assumption is that longer segments in the data should have high MI under the data distribution ($I_P$) with each other; our goal is for sequence variables under the model $Q$ to have similarly high MI ($I_Q$).
At a high level, our method adds regularization terms to the MLE objective Eq. 1, in two separate phases. The illustrations in Fig. 0(a) and 0(b) capture the core of our proposal. In the first phase, we bootstrap an MI lower bound by doing next sentence prediction, which is a binary classification of the correct next sentence versus a randomly sampled sentence. After some switching condition is met, we proceed to the second phase, where the MI estimator is also used to produce a reward for optimizing $I_Q$ directly using reward augmented maximum likelihood.
In order to compute the proposed regularizers, we add a small discriminator net (parametrized by $\theta$) on top of the RNN’s hidden features (parametrized by $\omega$). The discriminator then looks at pairs of segments or sequences, as shown in Fig. 0(a), trying to distinguish pairs following some joint distribution (segments with dependency) from pairs following the product of marginals (independent segments).
The discriminator serves the MI regularization in both phases. For the first phase, Sec. 3.1 will show that making this bound tight automatically forces the hidden representation of $Q$ to preserve as much MI as possible, making the model good at recognizing related information. After the RNN and discriminator are sufficiently well trained, the learned parameters $(\omega, \theta)$ can then be applied to the MI under the model distribution $Q$, to get a lower bound on $I_Q$. This leads to the second phase, where, in addition to continuing to optimize the lower bound on $I_P$, we use the lower bound on $I_Q$ as a reward to encourage high MI under $Q$. This has a more direct regularizing effect than the lower bound on $I_P$ alone.
Directly optimizing $I_Q$ requires sampling from $Q$ and learning by policy gradient (or other gradient estimators). However, sequential sampling from $Q$ is slow, and deep RL converges slowly due to high variance. Hence, we explore an alternative, the reward augmented maximum likelihood (RAML) norouzi2016reward . Because RAML does not directly support our MI bound as the reward, we develop a modification via importance reweighting in Sec. 3.2.3. The overall algorithm is summarized in Alg. 1.
3.1 Phase-I: Next Sentence Prediction Bootstraps a Lower Bound of $I_P$
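As previously mentioned, $I_P$ cannot be directly computed, but can be lower bounded in a number of ways, for example, via the MINE lower bound belghazi2018mine :
$$I_P(X;Y) \ge \mathbb{E}_{P_{XY}}\left[T(X,Y)\right] - \log \mathbb{E}_{P_X \otimes P_Y}\left[e^{T(X,Y)}\right] \qquad (4)$$
where $T(X,Y)$ is a parametrized test function trying to distinguish samples of the joint distribution from those of the product of marginals. $T$ can be any function, and optimizing its parameters makes the bound tighter. Hence, we compose an intermediary hidden layer representation $\phi_\omega(\cdot)$ of the RNN with a discriminator $D_\theta$ to form the test function $T_{\omega,\theta}$:
$$T_{\omega,\theta}(X,Y) = D_\theta(\phi_\omega(X), \phi_\omega(Y)) \qquad (5)$$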
For brevity, we will write and henceforth.
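The behaviour of the MINE-style bound (expected test-function score under the joint minus the log of the expected exponentiated score under the product of marginals) can be checked on a small discrete example where all expectations are computed exactly. The joint table and both test functions below are illustrative assumptions; in the method itself the test function is a neural discriminator and the expectations are estimated from samples.

```python
import math

def marginals(joint):
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return px, py

def exact_mi(joint):
    px, py = marginals(joint)
    return sum(p * math.log(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

def mine_bound(joint, test_fn):
    """E_joint[T] - log E_product[exp(T)], computed exactly on a finite table."""
    px, py = marginals(joint)
    joint_term = sum(p * test_fn(x, y) for (x, y), p in joint.items())
    product_term = sum(px[x] * py[y] * math.exp(test_fn(x, y))
                       for x in px for y in py)
    return joint_term - math.log(product_term)

JOINT = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
PX, PY = marginals(JOINT)

def optimal_test_fn(x, y):
    # The log density ratio makes the bound tight.
    return math.log(JOINT[(x, y)] / (PX[x] * PY[y]))

def crude_test_fn(x, y):
    # A weaker, hand-picked score: still a valid bound, but looser.
    return 0.5 if x == y else -0.5

true_mi = exact_mi(JOINT)
tight = mine_bound(JOINT, optimal_test_fn)
loose = mine_bound(JOINT, crude_test_fn)
print(true_mi, tight, loose)
```

Any test function yields a valid lower bound, which is why improving the discriminator can only tighten the estimate.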
In this work, we take $X$ and $Y$ to be a consecutive pair of sentences. Other pairs could also be regularized in theory, such as consecutive segments, or pairs of sentences at special positions in a document, like the first sentences of consecutive paragraphs.
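Eq. 4 can be optimized using noise contrastive estimation, by turning it into a binary classification problem as in hjelm2018learning . To sample positive examples from the joint distribution, we draw a sentence from the training corpus as $X$ and take the sentence that immediately follows it as $Y$. To sample negatives from the product of marginals, we take $X$ in the same way and draw $Y$ as a sentence sampled at random from the training corpus. Fig. 0(a) depicts our overall approach to bootstrap this lower bound. As pointed out by hjelm2018learning , when the goal is to maximize the MI rather than to estimate its particular value, one can use a proxy that has better gradient properties than the bound in Eq. 4:
$$\tilde{I}_P(X;Y) = \mathbb{E}_{P_{XY}}\left[-\mathrm{SP}(-T_{\omega,\theta}(X,Y))\right] - \mathbb{E}_{P_X \otimes P_Y}\left[\mathrm{SP}(T_{\omega,\theta}(X,Y))\right] \qquad (6)$$
where $\mathrm{SP}(z) = \log(1 + e^z)$ is the softplus function and $T_{\omega,\theta}(X,Y)$ is the discriminator score for the pair. The bound in Eq. 4 remains a lower bound for any parameters.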
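The next-sentence-prediction setup can be sketched as follows: positives are consecutive sentence pairs, negatives pair a sentence with a randomly drawn one, and the discriminator scores are combined with a softplus-based proxy. This assumes the proxy is the softplus (Jensen-Shannon-style) objective of hjelm2018learning; the function names and the example scores are hypothetical.

```python
import math
import random

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def make_pairs(sentences, rng):
    """Positive pairs are consecutive sentences; negatives use a random partner."""
    positives = [(sentences[i], sentences[i + 1])
                 for i in range(len(sentences) - 1)]
    # A random draw may occasionally hit the true next sentence; this sketch
    # tolerates that noise rather than filtering it out.
    negatives = [(sentences[i], rng.choice(sentences))
                 for i in range(len(sentences) - 1)]
    return positives, negatives

def proxy_objective(pos_scores, neg_scores):
    """Mean of -softplus(-T) on positives minus mean of softplus(T) on negatives."""
    pos_term = sum(-softplus(-s) for s in pos_scores) / len(pos_scores)
    neg_term = sum(softplus(s) for s in neg_scores) / len(neg_scores)
    return pos_term - neg_term

corpus = ["s0", "s1", "s2", "s3"]
positives, negatives = make_pairs(corpus, random.Random(0))

uninformed = proxy_objective([0.0, 0.0, 0.0], [0.0, 0.0, 0.0])
confident = proxy_objective([5.0, 5.0, 5.0], [-5.0, -5.0, -5.0])
print(uninformed, confident)  # the confident discriminator scores higher
```

Maximizing this quantity with respect to both the feature extractor and the discriminator is the same as training a binary classifier with a logistic loss.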
3.1.1 Regularizing Effect on Model
To understand how maximizing the lower bound regularizes the model $Q$, note that the MI between the encodings $\phi_\omega(X)$ and $\phi_\omega(Y)$ is a lower bound on the MI of the raw inputs, by the Data Processing Inequality cover2012elements . In other words, $I_P(X;Y) \ge I_P(\phi_\omega(X); \phi_\omega(Y))$ (proof in Suppl. Appendix A.1). Because the discriminator $D_\theta$ is also the test function for the joint versus product of marginals on the random variables $\phi_\omega(X)$ and $\phi_\omega(Y)$, the parametric lower bound is also a lower bound on $I_P(\phi_\omega(X); \phi_\omega(Y))$, i.e. the MI of features is sandwiched between the MI of data and our parametric lower bound.
Therefore, while $I_P(X;Y)$ is a fixed value for the data, estimating a bound for $I_P$ by optimizing both $\omega$ and $\theta$ pushes the hidden representation to capture as much data MI as possible. Viewed from a different angle, it is equivalent to estimating a bound for the MI between $\phi_\omega(X)$ and $\phi_\omega(Y)$ (using the add-on discriminator $D_\theta$), and then optimizing the model features $\phi_\omega(X)$ and $\phi_\omega(Y)$ to have high mutual information.
Intuitively, this step encourages the features to recognize related information in the data. In the next section, we will develop a method to directly optimize $I_Q$.
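The Data Processing Inequality argument above can be illustrated numerically: applying deterministic feature maps to both variables can only lose mutual information. The joint table and the coarse feature map `x // 2` are illustrative assumptions, standing in for the data distribution and the RNN encoder.

```python
import math

def mutual_information(joint):
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

def push_through(joint, f, g):
    """Joint distribution of the features (f(X), g(Y))."""
    out = {}
    for (x, y), p in joint.items():
        key = (f(x), g(y))
        out[key] = out.get(key, 0.0) + p
    return out

# Toy data: X uniform on {0,1,2,3}; Y equals X with probability 0.7.
JOINT = {(x, y): 0.25 * (0.7 if x == y else 0.1)
         for x in range(4) for y in range(4)}

coarse = push_through(JOINT, lambda x: x // 2, lambda y: y // 2)
lossless = push_through(JOINT, lambda x: x, lambda y: y)

mi_inputs = mutual_information(JOINT)
mi_coarse = mutual_information(coarse)
mi_lossless = mutual_information(lossless)
print(mi_inputs, mi_coarse, mi_lossless)
```

A lossy encoder discards some of the dependency, so training the encoder to raise the bound pushes it toward features that retain more of the data MI.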
3.2 Phase-II: Directly Optimizing $I_Q$
As mentioned, the regularization effect of Phase-I is indirect, as the expectation is with respect to the data distribution $P$. We now discuss how to directly and efficiently optimize $I_Q$.
To this end, after sufficient training from Phase-I, we take the learned parameters $(\omega, \theta)$ to initialize the lower bound on $I_Q$. Optimizing this bound poses a series of challenges, which we tackle in the next subsections (Secs. 3.2.1 to 3.2.3). We emphasize that during Phase-II, we still optimize the Phase-I bound, but with an additional regularization term, and the two together approximate the lower bound on $I_Q$.
3.2.1 Difficulty with optimizing
Because the MINE bound holds for any parameters, we can instead use the binary classification form to optimize the parameters, similar to what we do for and as done in hjelm2018learning . The proxy objective has the form: where,
(7) 
To optimize with respect to , the gradient has two terms , where
(8)  
(9) 
uses policy gradient (i.e. likelihood ratio estimator) with being the policy while and being the reward (and penalty). can be variancereduced by controlvariate methods, e.g. rennie2017self .
However, deep RL is known to converge slowly due to high variance, and our trials confirm the difficulty in this particular case. Furthermore, sampling from $Q$ is generally slow for autoregressive models as it cannot be easily parallelized. These two issues combined mean that we would like to avoid sampling from $Q$. To this end, we develop a modification of the reward augmented maximum likelihood (RAML) norouzi2016reward , which avoids the high variance and slow sampling.
For the part in Eq. 8, if we simply replace the model distribution $Q$ with the data distribution $P$ in the expectation, we recover the Phase-I regularizer Eq. 6, which we can use as an approximation. The bias of this approximation is:
(10) 
which becomes small as the maximum likelihood learning progresses, because in both terms the total variation distance between $P$ and $Q$ is bounded in terms of the KL divergence $\mathrm{KL}(P \,\|\, Q)$ via Pinsker’s inequality Tsybakov:2008:INE:1522486 .
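Pinsker’s inequality states that the total variation distance is at most the square root of half the KL divergence. The following sketch checks it on two made-up model distributions, one far from and one close to a made-up data distribution, to show how the bound shrinks as the model fits the data better.

```python
import math

def total_variation(p, q):
    """Total variation distance: half the L1 distance between two pmfs."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kl_divergence(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def pinsker_bound(p, q):
    return math.sqrt(0.5 * kl_divergence(p, q))

data_dist = [0.5, 0.3, 0.2]       # stands in for the data distribution
model_early = [0.2, 0.2, 0.6]     # poorly fitted model
model_late = [0.45, 0.33, 0.22]   # model after more maximum likelihood training

for model in (model_early, model_late):
    print(total_variation(data_dist, model), pinsker_bound(data_dist, model))
```

Since maximum likelihood minimizes the KL divergence from the data distribution to the model, the bound on the total variation distance tightens as training proceeds.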
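3.2.2 IW-RAML: RAML Background
RAML can be viewed as optimizing the reverse direction of the KL divergence compared to the entropy-regularized policy gradient RL objective. We leave the details of RAML to Appendix A.2 and refer readers to the work norouzi2016reward . For our purpose here, the important point is the comparison of the RAML gradient with the policy gradient:
$$\nabla L_{\mathrm{RAML}} = -\mathbb{E}_{P^*_\tau(Y \mid Y^*)}\left[\nabla \log Q_\omega(Y \mid X)\right] \qquad (11)$$
$$\nabla L_{\mathrm{RL}} = -\mathbb{E}_{Q_\omega(Y \mid X)}\left[r(Y, Y^*)\, \nabla \log Q_\omega(Y \mid X)\right] \qquad (12)$$
where $P^*_\tau(Y \mid Y^*)$ is the exponentiated payoff distribution with temperature $\tau$, defined as:
$$P^*_\tau(Y \mid Y^*) = \frac{\exp\left(r(Y, Y^*)/\tau\right)}{\sum_{Y'} \exp\left(r(Y', Y^*)/\tau\right)} \qquad (13)$$
$r(Y, Y^*)$ is a reward function that measures some similarity of $Y$ with respect to the ground truth $Y^*$ (e.g. negative edit distance). The RAML gradient Eq. 11 samples from a stationary distribution, while the policy gradient Eq. 12 samples from the changing model distribution $Q_\omega$. Furthermore, by definition, samples from $P^*_\tau$ have a higher chance of high reward, while samples from $Q_\omega$ rely on exploration. For these reasons, RAML has much lower variance than RL.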
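As a small illustration of the exponentiated payoff distribution, the sketch below normalizes exponentiated rewards over a finite candidate set. The candidate list, the temperature values, and the use of negative Hamming distance (a simple stand-in for negative edit distance on equal-length sequences) are all assumptions made for the example.

```python
import math

def negative_hamming(candidate, target):
    """Reward: minus the number of mismatched positions (equal lengths assumed)."""
    return -sum(a != b for a, b in zip(candidate, target))

def payoff_distribution(candidates, target, reward, temperature):
    """Softmax of reward / temperature over a finite candidate set."""
    scaled = [reward(c, target) / temperature for c in candidates]
    shift = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - shift) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

target = ("the", "cat", "sat")
candidates = [target, ("the", "dog", "sat"), ("a", "dog", "ran")]

sharp = payoff_distribution(candidates, target, negative_hamming, temperature=0.5)
flat = payoff_distribution(candidates, target, negative_hamming, temperature=5.0)
print(sharp)
print(flat)
```

A lower temperature concentrates the mass on the ground truth, while a higher one spreads it over lower-reward candidates.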
3.2.3 IW-RAML: MI Reward
Unfortunately, sampling from the exponentiated payoff distribution can only be done efficiently for some special classes of reward such as the edit distance used in norouzi2016reward . Here, we would like to use the learned MI estimator, more specifically the classifier scores, as the reward. Assume $Y^*$ is the sentence following $X$ in the corpus; then for any other $Y$, the reward is:
$$r(Y, Y^*; X) = T_{\omega,\theta}(X, Y) - T_{\omega,\theta}(X, Y^*) \qquad (14)$$
where $T_{\omega,\theta}(X, Y)$ denotes the discriminator score for the pair $(X, Y)$.
In the illustration Fig. 0(b), $X$ and $Y^*$ would be a pair of consecutive sentences, and another $Y$ is sampled to be evaluated. $Y$ could also be any other sentence/segment not in the dataset.
As the deep-neural-net-computed scores lack the simple structure of edit distance that can be exploited for efficient sampling from the exponentiated payoff distribution, direct application of RAML to the MI reward is not possible. We will instead develop an efficient alternative based on importance sampling.
Intuitively, a sentence that is near $X$ in the text would tend to be more related to it, and vice versa. Therefore, we can use a geometric distribution based at the index of $Y^*$ as the proposal distribution, as illustrated in Fig. 0(b).
as the proposal distribution, as illustrated in Fig. 0(b). Let have sentence/segment index , then(15) 
where
is a hyperparameter (we set to
without tuning it). Other proposals are also possible. With this geometric distribution as the proposal, our importance weighted RAML (IW-RAML) gradient is then: (16)
Because the reward in Eq. 14 is shift-standardized with respect to the discriminator score at $Y^*$, we assume that the normalization constant in Eq. 13 does not vary heavily for different $Y^*$, so that we can perform self-normalizing importance sampling by averaging across the minibatches.
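The importance-weighting idea can be sketched as follows: candidate sentences are drawn at a geometrically distributed index gap from the true next sentence, and each candidate receives a weight proportional to its exponentiated reward divided by its proposal probability, normalized within the minibatch. The exact parametrization of the geometric proposal, the value `lam=0.3`, the temperature, and the rewards are all hypothetical choices made for this example, not values taken from the text.

```python
import math
import random

def geometric_pmf(gap, lam):
    """Probability of an index gap of `gap` (0, 1, 2, ...) under the proposal."""
    return (1.0 - lam) ** gap * lam

def sample_gap(lam, rng):
    """Draw an index gap from the geometric proposal."""
    gap = 0
    while rng.random() >= lam:
        gap += 1
    return gap

def self_normalized_weights(rewards, gaps, lam, temperature):
    """Weights proportional to exp(reward / temperature) / proposal probability."""
    log_w = [r / temperature - math.log(geometric_pmf(g, lam))
             for r, g in zip(rewards, gaps)]
    shift = max(log_w)  # stabilize the exponentials
    w = [math.exp(v - shift) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

rng = random.Random(0)
sampled_gaps = [sample_gap(0.3, rng) for _ in range(5)]

# Hypothetical minibatch: rewards are discriminator score differences.
weights = self_normalized_weights(rewards=[0.0, 0.0, -1.0], gaps=[0, 2, 0],
                                  lam=0.3, temperature=1.0)
print(sampled_gaps, weights)
```

Each weight would then multiply the log-likelihood gradient of the corresponding sampled sentence, so that frequently proposed but low-reward candidates contribute less.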
3.2.4 IW-RAML: Bias-Variance Tradeoff
A side benefit of introducing the proposal distribution is to re-establish the stationarity of the sampling distribution in the RAML gradient estimator. Because the reward function Eq. 14 depends on the learned parameters $(\omega, \theta)$, the exponentiated payoff distribution is no longer stationary as in the original RAML with a simple reward norouzi2016reward , but we regain stationarity through the fixed proposal, keeping the variance low. Stationarity of the sampling distribution is one of the reasons for the lower variance in RAML.
Choosing IW-RAML over RL is a bias-variance tradeoff. The RL objective gradient in Eqs. 8 and 9 is the unbiased one, and IW-RAML as introduced has a few biases: using the opposite direction of the KL divergence (analyzed in norouzi2016reward ); dropping the softplus nonlinearity in the reward definition Eq. 14; and the support of the proposal distribution being smaller than that of the exponentiated payoff distribution. Each of these approximations introduces some bias, but the overall variance is significantly reduced, as the empirical analysis in Sec. 5.3 shows.
4 Related Work
Long Range Dependency and Gradient Flow
Capturing long-range dependency has been a major challenge in sequence learning. Most works have focused on the gradient flow in backpropagation through time (BPTT). The LSTM architecture lstm1997 was invented to address the very problem of vanishing and exploding gradients in RNNs hochreiter2001gradient . There is a vast literature on improving the gradient flow with new architectural modifications or regularization mikolov2014learning ; koutnik2014clockwork ; wu2016multiplicative ; li2018independently . Seq-to-seq with attention or memory bahdanau2014neural ; cho2015describing ; sukhbaatar2015end ; joulin2015inferring is a major neural architecture advance that improves the gradient flow by shortening the path that relevant information needs to traverse in the computation graph. The recent invention of the Transformer architecture vaswani2017attention , and the subsequent large scale pretraining successes devlin2018bert ; radford2018improving ; gpt2 are further examples of better architecture improving gradient flow.
Regularization via Auxiliary Tasks
Closer to our method are works that use auxiliary prediction tasks as regularization trinh2018learning ; devlin2018bert . trinh2018learning uses an auxiliary task of predicting some random future or past subsequence with reconstruction loss. Their focus is still on vanishing/exploding gradients and issues caused by BPTT. Their method is justified empirically, and it is unclear if the auxiliary task losses are compatible with the maximum likelihood objective of language modelling, which they did not experiment on. devlin2018bert adds a “next sentence prediction” task to its masked language model objective, which tries to classify if a sentence is the correct next one or randomly sampled. This task is the same as our Phase-I for learning the lower bound on $I_P$, but we are the first to draw the theoretical connection to mutual information, explaining its regularization effect on the model (Sec. 3.1.1). Moreover, applying the bootstrapped MI bound for more direct regularization in Phase-II is novel to our method.
Language Modeling with Extra Context
Modeling long-range dependency is crucial to language models, since capturing the larger context effectively can help predict the next token. In order to capture this dependency, some works feed an additional representation of the larger context into the network, including additional block, document or corpus level topic or discourse information mikolov2012context ; wang2015larger ; dieng2016topicrnn ; wang2017topic . Our work is orthogonal to them and can be combined with them.
5 Experiments
We experiment on two widely-used benchmarks on word-level language modeling, Penn Treebank (PTB) mikolov2012context and WikiText-2 (WT2) merity2016pointer . We choose the recent state-of-the-art model among RNN-based models on these two benchmarks, AWD-LSTM-MoS yang2017breaking , as our baseline.
We compare the baseline with the same model adding variants of our proposed regularizer, the Bootstrapping Mutual Information (BMI) regularizer: (1) BMI-base: apply Phase-I throughout the training; (2) BMI-full: apply Phase-I until we learn a good enough bound, then apply both Phase-I and Phase-II. To switch from Phase-I to Phase-II, we adopt the same switching condition as the one used to switch from SGD to ASGD polyak1992acceleration in training RNN language models, first proposed by merity2017regularizing .
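As a rough sketch of such a switching rule, the function below implements a non-monotone trigger in the spirit of the SGD-to-ASGD switch of merity2017regularizing : switch once the newest validation perplexity is worse than the best value recorded more than `nonmono` checks ago. The function name, the window size of 5, and the example histories are assumptions for illustration.

```python
def should_switch(val_history, new_val, nonmono=5):
    """Return True when validation perplexity has stopped improving.

    val_history: earlier validation perplexities, oldest first.
    new_val: the most recent validation perplexity.
    nonmono: number of most recent checks excluded from the comparison.
    """
    if len(val_history) <= nonmono:
        return False  # not enough history yet
    best_before_window = min(val_history[:-nonmono])
    return new_val > best_before_window

# Still improving: keep running Phase-I only.
print(should_switch([10, 9, 8, 7, 6, 5], 4))   # False
# Stagnating above an older best value: move on to Phase-II.
print(should_switch([5, 6, 6, 6, 6, 6], 6))    # True
```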
Experimental Setup
We apply max-pooling over the hidden states for all the layers in the LSTM and concatenate them as our encoding. We use a one-layer feed-forward network with features similar to conneauEtAl:2017:EMNLP2017 as the discriminator for our test function, whose number of hidden units is . The ADAM kingma2014adam optimizer with learning rate and weight decay of is applied on the discriminator parameters, while the language model is optimized in the same way as in merity2017regularizing ; yang2017breaking with SGD then ASGD polyak1992acceleration . All the above hyperparameters are chosen by validation perplexity on PTB and applied directly to WT2. The weight of the regularizer term is set to for PTB and for WT2, chosen by validation perplexity on their respective datasets. The remaining architecture and hyperparameters exactly follow the code released by yang2017breaking . As mentioned previously, we set the temperature hyperparameter in RAML to , and the hyperparameter of the importance sampling proposal to , both without tuning.
5.1 Perplexity and Reverse Perplexity
Table 2 presents the main results of language modeling. We evaluate the baseline and variants of our approach with and without the fine-tuning described in the baseline paper yang2017breaking . In all settings, the models with BMI outperform the baseline, and BMI-full (with IW-RAML) yields further improvement on top of BMI-base (without IW-RAML).
Following zhao2018adversarially , we use reverse perplexity to measure the diversity aspect of generation quality. We generate a chunk of text with tokens from each model, train a second RNN language model (RNN-LM) on the generated text, then evaluate the perplexity of the held-out data from PTB and WikiText-2 under the second language model. Note that the second RNN-LM is a regular LM trained from scratch and used for evaluation only. As shown in Table 2, the models with the BMI regularizer improve the reverse perplexity over the baseline by a significant margin, indicating better generation diversity, which is to be expected as the MI regularizer encourages higher marginal entropy (in addition to lower conditional entropy).
Fig. 2 shows the learning curves of each model on both datasets after switching to ASGD as mentioned earlier in Experimental Setup. The validation perplexities of BMI models decrease faster than the baseline AWD-LSTM-MoS. In addition, BMI-full is also consistently better than BMI-base and can further decrease the perplexity after BMI-base and AWD-LSTM-MoS stop decreasing.
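The reverse perplexity protocol described in this section can be sketched with a toy example: fit an evaluation model on generated text, then measure perplexity of held-out real text under it. Here a smoothed unigram model stands in for the second RNN language model, and the token sequences are made up, so this only illustrates why low-diversity generations are penalized.

```python
import math
from collections import Counter

def train_unigram(tokens, vocab):
    """Add-one smoothed unigram model fitted on generated text."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def perplexity(model, tokens):
    """exp of the negative mean log-probability of the tokens."""
    mean_log_prob = sum(math.log(model[t]) for t in tokens) / len(tokens)
    return math.exp(-mean_log_prob)

VOCAB = ["a", "b", "c"]
held_out = ["a", "b", "c", "a", "b", "c"]   # stands in for real held-out data
diverse_sample = ["a", "b", "c"] * 3        # generator covering the data well
collapsed_sample = ["a"] * 9                # generator with little diversity

reverse_ppl_diverse = perplexity(train_unigram(diverse_sample, VOCAB), held_out)
reverse_ppl_collapsed = perplexity(train_unigram(collapsed_sample, VOCAB), held_out)
print(reverse_ppl_diverse, reverse_ppl_collapsed)
```

The collapsed generator yields a higher reverse perplexity because the evaluation model trained on its output assigns little probability to tokens it never saw.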
5.2 Empirical MI on generations
To verify that BMI indeed increased $I_Q$, we measure the sample MI of generated texts as well as the training corpus. Since the MI of long sequence pairs cannot be directly computed from samples, we instead estimate lower bounds by learning evaluation discriminators on the generated text. Each evaluation discriminator is completely separate from the learned model, and is much smaller in size. We train the evaluation discriminators using the proxy objective in Eq. 6 and early-stop based on the MINE lower bound Eq. 4 on the validation set, then report the MINE bound value on the test set. This estimated lower bound essentially measures the degree of dependency. Table 2 shows that BMI generations exhibit higher MI than those of the baseline AWD-LSTM-MoS, while BMI-full improves over BMI-base.
5.3 Analysis: RL vs. IW-RAML variance
Fig. 3 compares the gradient variance under RL and IW-RAML on PTB. The gradient variance for each parameter is estimated over iterations after the initial learning stops and switches to ASGD; the ratio of variance of the corresponding parameters is then aggregated into the histogram. For RL, we use policy gradient with a self-critical baseline for variance reduction rennie2017self . Only gradient contributions from the regularizers are measured, while the language model MLE objective is excluded.
The histogram shows that the RL variance is more than times larger than IW-RAML on average, and almost all of the parameters have higher gradient variance under RL. A significant portion also has  orders of magnitude higher variance under RL than under IW-RAML. For this reason, policy gradient RL does not contribute to learning when applied in Phase-II in our trials.
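A per-parameter variance ratio of this kind could be computed along the following lines. The gradient values are made up purely to exercise the function, and the function name and input layout (a list of per-iteration gradient vectors) are assumptions.

```python
import math
import statistics

def log10_variance_ratios(grads_a, grads_b):
    """Per-parameter log10 of Var_a / Var_b.

    grads_a, grads_b: lists over iterations, each a list of per-parameter
    gradient values from the two estimators being compared.
    """
    ratios = []
    for series_a, series_b in zip(zip(*grads_a), zip(*grads_b)):
        var_a = statistics.pvariance(series_a)
        var_b = statistics.pvariance(series_b)
        ratios.append(math.log10(var_a / var_b))
    return ratios

# Made-up gradients for two parameters over two iterations.
noisy_estimator = [[1.0, 10.0], [-1.0, -10.0]]
quiet_estimator = [[0.1, 1.0], [-0.1, -1.0]]

ratios = log10_variance_ratios(noisy_estimator, quiet_estimator)
print(ratios)  # each entry is the order-of-magnitude gap for one parameter
```

Binning these values gives a histogram in which entries above zero correspond to parameters whose gradient is noisier under the first estimator.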
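| Model | PTB PPL (Valid) | PTB PPL (Test) | PTB Reverse PPL (Valid) | PTB Reverse PPL (Test) | WT2 PPL (Valid) | WT2 PPL (Test) | WT2 Reverse PPL (Valid) | WT2 Reverse PPL (Test) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AWD-LSTM-MoS | 58.08 | 55.97 | 82.88 | 77.57 | 66.01 | 63.33 | 93.52 | 88.79 |
| BMI-base | 57.16 | 55.02 | 80.64 | 75.31 | 64.24 | 61.67 | 90.95 | 86.31 |
| BMI-full | 56.85 | 54.65 | 78.46 | 73.73 | 63.86 | 61.37 | 90.20 | 85.11 |
| AWD-LSTM-MoS (ft.) | 56.54 | 54.44 | 80.29 | 75.51 | 63.88 | 61.45 | 91.32 | 85.69 |
| BMI-base (ft.) | 56.05 | 53.97 | 78.04 | 73.35 | 63.14 | 60.61 | 89.09 | 84.01 |
| BMI-full (ft.) | 55.61 | 53.67 | 75.81 | 71.81 | 62.99 | 60.51 | 88.27 | 83.43 |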
6 Conclusion
We have proposed a principled mutual information regularizer for improving long-range dependency in sequence modelling. To the best of our knowledge, this is the first work to recognize and address the sparse sampling of high order interactions as an issue hindering long-range dependency learning, orthogonal to the gradient flow problem.
References
 [1] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
 [2] I. Belghazi, S. Rajeswar, A. Baratin, R. D. Hjelm, and A. Courville. Mine: mutual information neural estimation. arXiv preprint arXiv:1801.04062, 2018.

 [3] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155, 2003.
 [4] K. Cho, A. Courville, and Y. Bengio. Describing multimedia content using attention-based encoder-decoder networks. IEEE Transactions on Multimedia, 17(11):1875–1886, 2015.
 [5] T. M. Cover and J. A. Thomas. Elements of information theory. John Wiley & Sons, 2012.
 [6] J. Devlin, M.W. Chang, K. Lee, and K. Toutanova. Bert: Pretraining of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
 [7] A. B. Dieng, C. Wang, J. Gao, and J. Paisley. Topicrnn: A recurrent neural network with longrange semantic dependency. arXiv preprint arXiv:1611.01702, 2016.
 [8] R. D. Hjelm, A. Fedorov, S. LavoieMarchildon, K. Grewal, A. Trischler, and Y. Bengio. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.

 [9] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, volume 1. A field guide to dynamical recurrent neural networks. IEEE Press, 2001.
 [10] S. Hochreiter and J. Schmidhuber. Long shortterm memory. Neural computation, 9:1735–80, 12 1997.
 [11] A. Joulin and T. Mikolov. Inferring algorithmic patterns with stackaugmented recurrent nets. In Advances in neural information processing systems, pages 190–198, 2015.
 [12] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [13] J. Koutnik, K. Greff, F. Gomez, and J. Schmidhuber. A clockwork rnn. arXiv preprint arXiv:1402.3511, 2014.
 [14] Q. V. Le, N. Jaitly, and G. E. Hinton. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015.

 [15] S. Li, W. Li, C. Cook, C. Zhu, and Y. Gao. Independently recurrent neural network (indrnn): Building a longer and deeper rnn. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5457–5466, 2018.
 [16] J. Martens and I. Sutskever. Learning recurrent neural networks with hessian-free optimization. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1033–1040. Citeseer, 2011.
 [17] S. Merity, N. S. Keskar, and R. Socher. Regularizing and optimizing lstm language models. arXiv preprint arXiv:1708.02182, 2017.
 [18] S. Merity, C. Xiong, J. Bradbury, and R. Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
 [19] T. Mikolov, A. Joulin, S. Chopra, M. Mathieu, and M. Ranzato. Learning longer memory in recurrent neural networks. arXiv preprint arXiv:1412.7753, 2014.
 [20] T. Mikolov and G. Zweig. Context dependent recurrent neural network language model. SLT, 12(234239):8, 2012.
 [21] M. Norouzi, S. Bengio, N. Jaitly, M. Schuster, Y. Wu, D. Schuurmans, et al. Reward augmented maximum likelihood for neural structured prediction. In Advances In Neural Information Processing Systems, pages 1723–1731, 2016.
 [22] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
 [23] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever. Improving language understanding by generative pretraining. OpenAI Blog, 2018.
 [24] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 2018.
 [25] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel. Selfcritical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7008–7024, 2017.
 [26] S. Sukhbaatar, J. Weston, R. Fergus, et al. Endtoend memory networks. In Advances in neural information processing systems, pages 2440–2448, 2015.
 [27] T. H. Trinh, A. M. Dai, T. Luong, and Q. V. Le. Learning longerterm dependencies in rnns with auxiliary losses. arXiv preprint arXiv:1803.00144, 2018.
 [28] A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer Publishing Company, Incorporated, 1st edition, 2008.
 [29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
 [30] T. Wang and K. Cho. Largercontext language modelling. arXiv preprint arXiv:1511.03729, 2015.
 [31] W. Wang, Z. Gan, W. Wang, D. Shen, J. Huang, W. Ping, S. Satheesh, and L. Carin. Topic compositional neural language model. arXiv preprint arXiv:1712.09783, 2017.
 [32] Y. Wu, S. Zhang, Y. Zhang, Y. Bengio, and R. R. Salakhutdinov. On multiplicative integration with recurrent neural networks. In Advances in neural information processing systems, pages 2856–2864, 2016.
 [33] Z. Yang, Z. Dai, R. Salakhutdinov, and W. W. Cohen. Breaking the softmax bottleneck: A highrank rnn language model. arXiv preprint arXiv:1711.03953, 2017.

 [34] J. Zhao, Y. Kim, K. Zhang, A. Rush, and Y. LeCun. Adversarially regularized autoencoders. Proceedings of the 35th International Conference on Machine Learning, 2018.
Appendix A Appendix
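A.1
Proof: We apply the Data Processing Inequality (DPI) [5] twice: $I_P(X;Y) \ge I_P(X; \phi_\omega(Y)) \ge I_P(\phi_\omega(X); \phi_\omega(Y))$. The first inequality holds due to the DPI applied on the Markov chain $X \to Y \to \phi_\omega(Y)$; the second holds due to the DPI applied on $\phi_\omega(Y) \to X \to \phi_\omega(X)$.
Note: the Markov chains are not additional assumptions, but merely a statement that $\phi_\omega(X)$ does not depend on $\phi_\omega(Y)$ when $X$ is given (similarly for the first Markov chain).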
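A.2 RAML Background
The key idea behind RAML is to observe that the entropy-regularized policy gradient RL objective can be written as (up to constant and scaling):
$$L_{\mathrm{RL}} = \mathrm{KL}\left(Q_\omega(Y \mid X) \,\|\, P^*_\tau(Y \mid Y^*)\right) \qquad (17)$$
where $P^*_\tau(Y \mid Y^*)$ is the exponentiated payoff distribution with temperature $\tau$, defined as:
$$P^*_\tau(Y \mid Y^*) = \frac{\exp\left(r(Y, Y^*)/\tau\right)}{\sum_{Y'} \exp\left(r(Y', Y^*)/\tau\right)} \qquad (18)$$
$r(Y, Y^*)$ is a reward function that measures some similarity of $Y$ with respect to the ground truth $Y^*$ (e.g. negative edit distance). In RAML [21], by contrast, one optimizes the KL in the reverse direction:
$$L_{\mathrm{RAML}} = \mathrm{KL}\left(P^*_\tau(Y \mid Y^*) \,\|\, Q_\omega(Y \mid X)\right) \qquad (19)$$
It was shown that these two losses have the same global extremum, and that away from it their gap is bounded under some conditions [21]. Compare the RAML gradient with the policy gradient:
$$\nabla L_{\mathrm{RAML}} = -\mathbb{E}_{P^*_\tau(Y \mid Y^*)}\left[\nabla \log Q_\omega(Y \mid X)\right] \qquad (20)$$
$$\nabla L_{\mathrm{RL}} = -\mathbb{E}_{Q_\omega(Y \mid X)}\left[r(Y, Y^*)\, \nabla \log Q_\omega(Y \mid X)\right] \qquad (21)$$
The RAML gradient samples from a stationary distribution, while the policy gradient samples from the changing model distribution $Q_\omega$. Furthermore, samples from $P^*_\tau$ have a higher chance of landing in configurations of high reward by definition, while samples from $Q_\omega$ rely on random exploration to discover sequences with high reward. For these reasons, RAML has much lower variance than RL.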
A.3 Additional Experiment Details
All experiments are conducted on single (1080Ti) GPUs with PyTorch.
We manually tune the following hyperparameters based on validation perplexity: the BMI regularizer weights in ; hidden state size is chosen from , Adam learning rate from .