1 Introduction
Sequence prediction has been widely used in tasks where the outputs are sequentially structured and mutually dependent. Recently, extensive explorations in this area have been made to solve practical problems, such as machine translation Bahdanau et al. (2014); Ma et al. (2017); Norouzi et al. (2016), syntactic parsing Vinyals et al. (2015), spelling correction Bahdanau et al. (2014), image captioning Xu et al. (2015) and speech recognition Chorowski et al. (2015). Armed with modern computation power, deep LSTM Hochreiter and Schmidhuber (1997) or GRU Chung et al. (2014) based neural sequence prediction models have achieved state-of-the-art performance.
The typical training algorithm for sequence prediction is Maximum Likelihood Estimation (MLE), which maximizes the likelihood of the target sequences conditioned on the source ones:
(2) $\theta^{\ast} = \arg\max_{\theta} \sum_{(x, y) \in \mathcal{D}} \log p_{\theta}(y \mid x)$ (1)
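For intuition, the per-pair MLE loss is simply a sum of stepwise negative log-probabilities of the gold tokens; a minimal sketch (the toy vocabulary and probabilities below are invented for illustration):

```python
import math

def mle_nll(step_probs, target):
    """Negative log-likelihood of one target sequence: at each step t,
    take -log of the probability the model assigns to the gold token."""
    return -sum(math.log(step_probs[t][tok]) for t, tok in enumerate(target))

# Toy model output over a 3-token vocabulary, target sequence [1, 2]
step_probs = [
    {0: 0.1, 1: 0.8, 2: 0.1},  # distribution at step 1
    {0: 0.2, 1: 0.1, 2: 0.7},  # distribution at step 2
]
loss = mle_nll(step_probs, [1, 2])  # = -log(0.8) - log(0.7)
```

Maximizing the likelihood in Equation (1) is equivalent to minimizing this loss summed over the training pairs.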
Despite the popularity of MLE, or teacher forcing Doya (1992), in neural sequence prediction tasks, two general issues remain: 1) data sparsity and 2) a tendency toward overfitting, both of which harm model generalization.
To combat data sparsity, different strategies have been proposed. Most of them take advantage of monolingual data Sennrich et al. (2015); Zhang and Zong (2016); Cheng et al. (2016). Others modify the ground truth target based on derived rules to obtain more similar examples for training Norouzi et al. (2016); Ma et al. (2017). To alleviate overfitting, regularization techniques such as confidence penalization Pereyra et al. (2017) and posterior regularization Zhang et al. (2017) have been proposed recently.
As shown in Figure 1, we propose a novel learning architecture, titled Generative Bridging Network (GBN), to combine the benefits of both synthetic data and regularization. Within the architecture, the bridge module (bridge) first transforms the point-wise ground truth into a bridge distribution, which can be viewed as a target proposer from which more target examples are drawn to train the generator. By introducing different constraints, the bridge can be set or trained to possess a specific property, with which the drawn samples can augment target-side data (alleviating data sparsity) while regularizing the training (avoiding overfitting) of the generator network (generator).
In this paper, we introduce three different constraints to build three bridge modules. Together with the generator network, three GBN systems are constructed: 1) a uniform GBN, instantiating the constraint as a uniform distribution to penalize confidence; 2) a language-model GBN, instantiating the constraint as a pre-trained neural language model to increase language smoothness; 3) a coaching GBN, instantiating the constraint as the generator's output distribution to seek a close-to-generator distribution, which enables the bridge to draw easy-to-learn samples for the generator to learn. Without any constraint, our GBN degrades to MLE. The uniform GBN is proved to minimize KL-divergence with a so-called payoff distribution, as in reward augmented maximum likelihood, or RAML Norouzi et al. (2016).
Experiments are conducted on two sequence prediction tasks, namely machine translation and abstractive text summarization. On both of them, our proposed GBNs significantly improve task performance over strong baselines, and among them the coaching GBN performs best. Samples from the three different bridges are presented to confirm the expected impacts they have on the training of the generator. In summary, our contributions are:

A novel GBN architecture is proposed for sequence prediction to alleviate the data sparsity and overfitting problems, where the bridge module and the generator network are integrated and jointly trained.

Different constraints are introduced to build GBN variants: uniform GBN, language-model GBN and coaching GBN. Our GBN architecture is proved to be a generalized form of both MLE and RAML.

All proposed GBN variants outperform the MLE baselines on machine translation and abstractive text summarization. In the translation task, relative improvements are similar to those reported for recent state-of-the-art methods. We also demonstrate the advantage of our GBNs qualitatively by comparing the ground truth with samples from the bridges.
2 Generative Bridging Network
In this section, we first give a conceptual interpretation of our novel learning architecture, which is sketched in Figure 2. Since data augmentation and regularization are two standard remedies for data sparsity and overfitting, we design an architecture that integrates the benefits of both. The basic idea is to use a so-called bridge, which transforms the ground truth into an easy-to-sample distribution, and then use this distribution (via its samples) to train and meanwhile regularize the sequence prediction model (the generator).
The bridge is viewed as a conditional distribution $p_{\eta}(y'|y)$^1 (^1 $\eta$ should be treated as an index of the bridge distribution; it is not necessarily a set of parameters to be learned.) that proposes more targets $y'$ given the ground truth $y$, so as to construct more training pairs $(x, y')$. In the meantime, we can inject (empirical) prior knowledge into the bridge through its optimization objective, which is inspired by the design of the payoff distribution in RAML. We formulate the optimization objective in Equation (2) with two parts: a) an expected similarity score computed through a similarity score function $\Delta(y', y)$, interpolated with b) a knowledge injection constraint $C$^2 (^2 Note that, in our paper, we specify $C$ to be the KL-divergence between the bridge distribution and a certain constraint distribution $p_C$; however, we believe the mathematical form of $C$ is not restricted, which could motivate further development.), where $\tau$ controls the strength of the regularization. Formally, we write the objective function as follows:
(2) $\mathcal{L}_B(p_{\eta}) = \mathbb{E}_{y' \sim p_{\eta}(y'|y)}\big[-\Delta(y', y)\big] + \tau \, C\big(p_{\eta}, p_C\big)$ (2)
Minimizing it empowers the bridge distribution not only to concentrate its mass around the ground truth but also to adopt a desired property from the constraint distribution $p_C$. With the constructed bridge distribution, we optimize the generator network to match its output distribution to the bridge distribution by minimizing their KL-divergence:
(2) $\mathcal{L}_G(\theta) = \mathrm{KL}\big(p_{\eta}(y'|y) \,\|\, p_{\theta}(y'|x)\big)$ (3)
In practice, the KL-divergence is approximated through a sampling process detailed in Sec. 2.3. As a matter of fact, the bridge is the crux of the integration: it synthesizes new targets to alleviate data sparsity and then uses the synthetic data as regularization to overcome overfitting. GBN is thus a regularization-by-synthetic-example approach, very similar in spirit to the prior-incorporation-by-virtual-example method Niyogi et al. (1998).
2.1 Generator Network
Our generator network is parameterized with the commonly used encoder-decoder architecture Bahdanau et al. (2014); Cho et al. (2014). The encoder encodes the input sequence $x$ into a sequence of hidden states, based on which an attention mechanism is leveraged to compute context vectors at the decoding stage. At each time step, the context vector, the decoder's previous hidden state and the previously predicted label are used to compute the next hidden state and predict an output label.
As stated in Equation (3), the generator network is not trained to maximize the likelihood of the ground truth; instead, it matches the bridge distribution, which acts as a delegate of the ground truth. We use gradient descent to optimize the KL-divergence with respect to the generator:
(2) $\nabla_{\theta} \mathcal{L}_G = -\mathbb{E}_{y' \sim p_{\eta}(y'|y)}\big[\nabla_{\theta} \log p_{\theta}(y'|x)\big]$ (4)
The optimization process can be viewed as the generator maximizing the likelihood of samples drawn from the bridge. This may alleviate data sparsity and overfitting by posing more unseen scenarios to the generator and may help the generator generalize better in test time.
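Concretely, the sampled approximation of the gradient in Equation (4) reduces to ordinary MLE gradients averaged over targets drawn from the bridge; a schematic sketch (the toy vector representation of gradients is ours, not from the paper):

```python
def generator_grad(bridge_samples, nll_grad):
    """Monte-Carlo estimate of Eq. (4): average the generator's NLL
    gradient over targets sampled from the bridge distribution."""
    grads = [nll_grad(y) for y in bridge_samples]
    n = len(grads)
    return [sum(g[i] for g in grads) / n for i in range(len(grads[0]))]

# Toy stand-in: the "gradient" is a 2-d vector depending on target length
toy_nll_grad = lambda y: [float(len(y)), 1.0]
g = generator_grad([[7], [7, 8]], toy_nll_grad)  # -> [1.5, 1.0]
```

With the delta bridge, the sample set collapses to the ground truth and this estimator recovers the standard MLE gradient.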
2.2 Bridge Module^3 (^3 Although we name it a bridge module, we do not explicitly learn it when a closed-form static solution exists in terms of Equation (5); otherwise, we adopt an encoder-decoder to construct a dynamic bridge network.)
Our bridge module is designed to transform a single target example $y$ into a bridge distribution $p_{\eta}(y'|y)$. Its optimization target in Equation (2) consists of two terms, namely a concentration requirement and a constraint. The constraint is instantiated as the KL-divergence between the bridge and a constraint distribution $p_C$. We transform Equation (2) as follows, which is convenient for the mathematical manipulation later:
(2) $\mathcal{L}_B(p_{\eta}) = \tau \, \mathbb{E}_{y' \sim p_{\eta}(y'|y)}\big[\log p_{\eta}(y'|y) - \log p_C(y') - \Delta(y', y)/\tau\big]$ (5)
$\Delta(y', y)$ is a predefined score function which measures the similarity between $y'$ and $y$ and peaks when $y' = y$, while $p_C$ reshapes the bridge distribution. More specifically, the first term ensures that the bridge concentrates around the ground truth $y$, and the second introduces a desired property which can help regularize the generator. The hyperparameter $\tau$ can be interpreted as a temperature which scales the score function. In the following bridge specifications, the score function is instantiated according to Sec. 3.1.
1. Delta Bridge
The delta bridge can be seen as the simplest case, where $\tau \rightarrow 0$ or no constraint is imposed. The bridge seeks to minimize $\mathbb{E}_{y' \sim p_{\eta}(y'|y)}[-\Delta(y', y)]$. The optimum is reached when the bridge only samples $y$; the resulting Dirac delta distribution is:
(2) $p_{\eta}(y'|y) = \delta(y'|y) = \begin{cases} 1 & y' = y \\ 0 & \text{otherwise} \end{cases}$ (6)
This exactly corresponds to MLE, where only examples in the dataset are used to train the generator. We regard this case as our baseline.
2. Uniform Bridge
The uniform bridge adopts a uniform distribution $\mathcal{U}$ as constraint. This bridge is motivated to include noise in the target example, which is similar to label smoothing Szegedy et al. (2016). The loss function can be written as:
(2) $\mathcal{L}_B(p_{\eta}) = \mathbb{E}_{y' \sim p_{\eta}(y'|y)}\big[-\Delta(y', y)\big] + \tau \, \mathrm{KL}\big(p_{\eta}(y'|y) \,\|\, \mathcal{U}(y')\big)$ (7)
We can rewrite it as follows by adding a constant, which does not change the optimization result:
(2) $\mathcal{L}_B(p_{\eta}) = \tau \, \mathrm{KL}\Big(p_{\eta}(y'|y) \,\Big\|\, \frac{\exp\big(\Delta(y', y)/\tau\big)}{Z}\Big) + \text{const}$ (8)
This bridge is static, since it has a closed-form solution:
(2) $p_{\eta}(y'|y) = \frac{\exp\big(\Delta(y', y)/\tau\big)}{Z(y)}$ (9)
where $Z(y)$ is the partition function. Note that our uniform bridge corresponds to the payoff distribution described in RAML Norouzi et al. (2016).
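On a tiny candidate set, the closed form of Equation (9) can be computed directly; a sketch in which the similarity score (negative Hamming distance) is an invented stand-in for the paper's n-gram reward:

```python
import math

def uniform_bridge(candidates, y, tau):
    """Payoff-style distribution: p(y'|y) proportional to exp(delta(y', y)/tau)."""
    delta = lambda a, b: -sum(u != v for u, v in zip(a, b))  # stand-in score
    w = [math.exp(delta(c, y) / tau) for c in candidates]
    z = sum(w)  # partition function Z(y)
    return [wi / z for wi in w]

cands = [("a", "b"), ("a", "c"), ("d", "c")]
probs = uniform_bridge(cands, ("a", "b"), tau=1.0)
# the exact reference receives the largest mass; lowering tau sharpens the peak
```

As the temperature decreases, the distribution approaches the delta bridge of Equation (6).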
3. Language-model (LM) Bridge
The LM bridge utilizes a pre-trained neural language model $p_{LM}$ as constraint, which encourages the bridge to propose target examples that conform to language fluency:
(2) $\mathcal{L}_B(p_{\eta}) = \mathbb{E}_{y' \sim p_{\eta}(y'|y)}\big[-\Delta(y', y)\big] + \tau \, \mathrm{KL}\big(p_{\eta}(y'|y) \,\|\, p_{LM}(y')\big)$ (10)
Similar to the uniform bridge case, we can rewrite the loss function to a KLdivergence:
(2) $\mathcal{L}_B(p_{\eta}) = \tau \, \mathrm{KL}\Big(p_{\eta}(y'|y) \,\Big\|\, \frac{p_{LM}(y')\exp\big(\Delta(y', y)/\tau\big)}{Z}\Big) + \text{const}$ (11)
Thus, the LM bridge is also static and can be seen as an extension of the uniform bridge: the exponentiated similarity score is reweighted by a pre-trained LM score and renormalized:
(2) $p_{\eta}(y'|y) = \frac{p_{LM}(y')\,\exp\big(\Delta(y', y)/\tau\big)}{Z(y)}$ (12)
where $Z(y)$ is the partition function. The above equation looks just like the payoff distribution, except that an additional language-model factor is included.
4. Coaching Bridge
The coaching bridge utilizes the generator's output distribution as constraint, which encourages the bridge to generate training samples that are easy for the generator to understand, so as to relieve its learning burden. The coaching bridge follows the same spirit as the coach proposed in imitation learning by coaching He et al. (2012), which, in reinforcement-learning vocabulary, advocates guiding the policy (the generator) with easy-to-learn action trajectories and letting it gradually approach the oracle when the optimal action is hard to achieve.
(2) $\mathcal{L}_B(p_{\eta}) = \mathbb{E}_{y' \sim p_{\eta}(y'|y)}\big[-\Delta(y', y)\big] + \tau \, \mathrm{KL}\big(p_{\theta}(y'|x) \,\|\, p_{\eta}(y'|y)\big)$ (13)
Since the KL constraint is a moving target as the generator is updated, the coaching bridge cannot remain static. Therefore, we perform iterative optimization to train the bridge and the generator jointly. Formally, the derivatives for the coaching bridge are written as follows:
(2) $\nabla_{\eta} \mathcal{L}_B = -\mathbb{E}_{y' \sim p_{\eta}(y'|y)}\big[\Delta(y', y)\,\nabla_{\eta} \log p_{\eta}(y'|y)\big] - \tau \, \mathbb{E}_{y' \sim p_{\theta}(y'|x)}\big[\nabla_{\eta} \log p_{\eta}(y'|y)\big]$ (14)
The first term corresponds to the policy gradient algorithm described in REINFORCE Williams (1992), where the coefficient $\Delta(y', y)$ plays the role of the reward function. Due to the mutual dependence between the bridge module and the generator network, we design an iterative training strategy: the two networks take turns updating their own parameters, each treating the other as fixed.
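The REINFORCE-style first term of Equation (14) can be checked on a toy problem where the true gradient is known; the Bernoulli setup below is purely illustrative:

```python
import random

def score_function_grad(sample, logp_grad, reward, n=20000, seed=0):
    """Estimate grad E[reward] as the sample mean of reward(y) * grad log p(y),
    the same score-function estimator used for the first term of Eq. (14)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        y = sample(rng)
        total += reward(y) * logp_grad(y)
    return total / n

p = 0.3  # Bernoulli parameter; d/dp E[y] = 1 exactly
est = score_function_grad(
    sample=lambda rng: 1 if rng.random() < p else 0,
    logp_grad=lambda y: y / p - (1 - y) / (1 - p),  # grad_p of log Bernoulli
    reward=float,
)
```

The estimate converges to the analytic gradient of 1 as the sample count grows; in the coaching bridge, the sampler is the bridge itself and the reward is the similarity score.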
2.3 Training
The training of the three variants is illustrated in Figure 3. The proposed bridges can be divided into static ones, which only require pre-training, and dynamic ones, which require continual training with the generator; we describe their training processes in detail respectively.
2.3.1 Stratified-Sampled Training
Since closed-form optimal distributions exist for the uniform/LM GBNs, we only need to draw samples from the static bridge distributions to train our sequence generator. Unfortunately, due to the intractability of these bridge distributions, direct sampling is infeasible. Therefore, we follow Norouzi et al. (2016); Ma et al. (2017) and adopt stratified sampling to approximate the direct sampling process. Given a sentence, we first sample an edit distance and then randomly select that many positions to replace the original tokens. The difference between the uniform and the LM bridge is that the uniform bridge draws substitutions from a uniform distribution, while the LM bridge conditions on the history and draws substitutions from its stepwise distribution.
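A minimal sketch of this sampling procedure (uniform-bridge variant; the edit-distance prior is simplified here to a uniform draw, whereas the actual systems derive it from the payoff distribution):

```python
import random

def stratified_sample(sentence, vocab, max_edits, rng):
    """Sample an edit distance e, then replace e randomly chosen positions
    with tokens drawn uniformly from the vocabulary."""
    e = rng.randint(0, min(max_edits, len(sentence)))
    out = list(sentence)
    for i in rng.sample(range(len(out)), e):
        out[i] = rng.choice(vocab)  # an LM bridge would condition on history here
    return out

rng = random.Random(0)
noisy = stratified_sample(["the", "cat", "sat"], ["dog", "mat", "hat"], 2, rng)
```

Each call yields a perturbed target whose distance to the original is bounded by the sampled edit distance.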
2.3.2 Iterative Training
Since the KL constraint is a moving target for the coaching bridge, an iterative training strategy is designed to alternately update both the generator and the bridge (Algorithm 1). We first pre-train both the generator and the bridge and then start alternately updating their parameters. Figure 4 intuitively demonstrates the intertwined optimization effects on the coaching bridge and the generator. We hypothesize that iterative training with easy-to-learn guidance benefits the gradient updates and thus results in a better local minimum.
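Structurally, the iterative procedure amounts to an alternating-update loop; a sketch with placeholder update methods (none of these names come from the paper, and the toy classes only record calls):

```python
def train_gbn(bridge, generator, data, epochs):
    """Alternate parameter updates: the bridge chases the (frozen) generator,
    then the generator fits samples drawn from the (frozen) bridge."""
    for _ in range(epochs):
        for x, y in data:
            bridge.update(y, generator)     # coaching objective: -score + tau * KL
            samples = bridge.sample(y)      # propose easy-to-learn targets
            generator.update(x, samples)    # MLE on the proposed targets
    return bridge, generator

class ToyBridge:
    def __init__(self):
        self.updates = 0
    def update(self, y, generator):
        self.updates += 1
    def sample(self, y):
        return [y]                          # degenerate bridge: echo the truth

class ToyGenerator:
    def __init__(self):
        self.seen = []
    def update(self, x, samples):
        self.seen.append((x, tuple(samples)))

bridge, gen = train_gbn(ToyBridge(), ToyGenerator(), [("x1", "y1")], epochs=3)
```

With the degenerate echo bridge the loop reduces to plain MLE, matching the delta-bridge special case.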
3 Experiment
We select machine translation and abstractive text summarization as benchmarks to verify our GBN framework.
3.1 Similarity Score Function
In our experiments, instead of directly using BLEU or ROUGE as the reward to guide the bridge network's policy search, we design a simple surrogate n-gram matching reward as follows:
(15) 
$\Delta(y', y)$ represents the n-gram matching score between $y'$ and $y$. In order to alleviate reward sparsity at the sequence level, we further decompose the global reward into a series of local rewards at every time step. Formally, we write the stepwise reward as follows:
(16) 
where the count term represents the occurrence of a subsequence in the whole sequence. Specifically, a subsequence of $y'$ receives reward only if it has so far appeared fewer times in $y'$ than in the reference $y$. Formally, we rewrite the step-level gradient for each sampled $y'$ as follows:
(17) 
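One reading of the stepwise reward above is a clipped n-gram match: the n-gram ending at step t earns reward only while its running count in the hypothesis has not exceeded its count in the reference. A sketch under that reading (the exact scaling in the paper may differ):

```python
from collections import Counter

def stepwise_rewards(hyp, ref, n=2):
    """Per-step clipped n-gram matching reward (one reading of Eq. 16)."""
    ngrams = lambda s: Counter(tuple(s[i:i + n]) for i in range(len(s) - n + 1))
    ref_counts, used = ngrams(ref), Counter()
    rewards = []
    for t in range(len(hyp)):
        if t + 1 < n:
            rewards.append(0.0)  # no complete n-gram ends here yet
            continue
        g = tuple(hyp[t + 1 - n:t + 1])
        hit = used[g] < ref_counts[g]  # still under the reference count?
        used[g] += 1
        rewards.append(1.0 if hit else 0.0)
    return rewards
```

The clipping prevents a degenerate bridge from collecting reward by repeating one matching n-gram.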
3.2 Machine Translation
Dataset
We follow Ranzato et al. (2015); Bahdanau et al. (2016) and select the German-English machine translation track of the IWSLT 2014 evaluation campaign. The corpus contains sentence-wise aligned subtitles of TED and TEDx talks. We use the Moses toolkit Koehn et al. (2007) to preprocess the corpus, lowercasing the text and removing sentences longer than 50 words. The evaluation metric is BLEU
Papineni et al. (2002), computed via multibleu.perl.
Table 1: BLEU results on IWSLT 2014 German-English; absolute gains over each system's own baseline in parentheses. The three Coaching GBN rows correspond to different temperature settings.
Methods  Baseline  Model
MIXER  20.10  21.81 (+1.71)
BSO  24.03  26.36 (+2.33)
AC  27.56  28.53 (+0.97)
SoftmaxQ  27.66  28.77 (+1.11)
Uniform GBN  29.10  29.80 (+0.70)
LM GBN  29.10  29.90 (+0.80)
Coaching GBN  29.10  29.98 (+0.88)
Coaching GBN  29.10  30.15 (+1.05)
Coaching GBN  29.10  30.18 (+1.08)
System Setting
We use a unified GRU-based RNN Chung et al. (2014) for both the generator and the coaching bridge. In order to compare with existing papers, we use a similar system setting, with 512 RNN hidden units and an embedding size of 256. We use an attentive encoder-decoder to build our system Bahdanau et al. (2014). During training, we apply ADADELTA Zeiler (2012) to optimize the parameters of the generator and the coaching bridge. During decoding, a beam size of 8 is used to approximate the full search space. An important hyperparameter for our experiments is the temperature $\tau$. For the uniform/LM bridge, we follow Norouzi et al. (2016) and adopt their optimal temperature. For the coaching bridge, we test several temperature settings. Besides comparing with our fine-tuned baseline, the other systems used for comparison of relative BLEU improvement are: MIXER Ranzato et al. (2015), BSO Wiseman and Rush (2016), AC Bahdanau et al. (2016), and SoftmaxQ Ma et al. (2017).
Results
The experimental results are summarized in Table 1. Our fine-tuned MLE baseline (29.10) already outperforms the other systems, and our proposed GBNs yield further improvements. We also observe that the LM GBN and the coaching GBN both achieve better performance than the uniform GBN, which confirms that better regularization effects are achieved and that the generators become more robust and generalize better. We draw the learning curves of both the bridge and the generator in Figure 5 to demonstrate how they cooperate during training. The interaction between them is easy to observe: as the generator makes progress, the coaching bridge also improves itself to propose harsher targets for the generator to learn.
3.3 Abstractive Text Summarization
Dataset
We follow previous work Rush et al. (2015); Zhou et al. (2017) and use the corpus from the Annotated English Gigaword dataset Napoles et al. (2012). To be comparable, we use the script released by Rush et al. (2015) (https://github.com/facebookarchive/NAMAS) to preprocess and extract the training and validation sets. For the test set, we use the English Gigaword test set released by Rush et al. (2015) and evaluate our system with ROUGE Lin (2004). Following previous work, we employ ROUGE-1, ROUGE-2, and ROUGE-L as the evaluation metrics in the reported experimental results.
Table 2: ROUGE results on the Gigaword test set. The three Coaching GBN rows correspond to different temperature settings.
Methods  RG-1  RG-2  RG-L
ABS  29.55  11.32  26.42
ABS+  29.76  11.88  26.96
Luong-NMT  33.10  14.45  30.71
SEASS  36.15  17.54  33.63
seq2seq+att  34.04  15.95  31.68
Uniform GBN  34.10  16.70  31.75
LM GBN  34.32  16.88  31.89
Coaching GBN  34.49  16.70  31.95
Coaching GBN  34.83  16.83  32.25
Coaching GBN  35.26  17.22  32.67
System Setting
We follow Zhou et al. (2017); Rush et al. (2015) in setting the input and output vocabularies to 119,504 and 68,883 words respectively, the word embedding size to 300, and all GRU hidden state sizes to 512. We adopt dropout Srivastava et al. (2014) in our output layer. We use the attention-based sequence-to-sequence model Bahdanau et al. (2014); Cho et al. (2014) as our baseline and reproduce the baseline results reported in Zhou et al. (2017). As stated there, the attentive encoder-decoder architecture can already outperform the existing ABS/ABS+ systems Rush et al. (2015). In the coaching GBN, because the input $x$ of abstractive summarization contains more information than the summary target $y$, directly training the bridge to understand the generator is infeasible. Therefore, we redesign the coaching bridge to receive both source and target input, and we enlarge its vocabulary size to 88,883 to encompass more information about the source side. In the uniform/LM GBN experiments, we again fix the temperature to the optimal setting.
Results
The experimental results are summarized in Table 2. We observe significant improvements from our GBN systems. As before, the coaching GBN achieves the strongest performance among all, which again supports our assumption that more sophisticated regularization benefits the generator's training. We draw the learning curves of the coaching GBN in Figure 6 to demonstrate how the bridge and the generator promote each other.
4 Analysis
By introducing different constraints into the bridge module, the bridge distribution proposes different training samples for the generator to learn from. From Table 3, we can observe that most samples still preserve their original meaning. The uniform bridge simply performs random replacement without considering any linguistic constraint. The LM bridge smooths the reference sentence with high-frequency words. The coaching bridge simplifies difficult expressions to relieve the generator's learning burden. From our experimental results, the more rational and aggressive diversification of the coaching GBN clearly benefits the generator the most and helps it generalize to more unseen scenarios.
5 Related Literature
5.1 Data Augmentation and Self-training
In order to resolve the data sparsity problem in Neural Machine Translation (NMT), many works have been conducted to augment the dataset. The most popular strategy is self-learning, which incorporates self-generated data directly into training.
Zhang and Zong (2016) and Sennrich et al. (2015) both use self-learning to leverage massive monolingual data for NMT training. In contrast, our bridge takes advantage of the parallel training data only, instead of external monolingual data, to synthesize new training examples.
5.2 Reward Augmented Maximum Likelihood
Reward augmented maximum likelihood, or RAML Norouzi et al. (2016), proposes to integrate task-level reward into MLE training by using an exponentiated payoff distribution. The KL-divergence between the payoff distribution and the generator's output distribution is minimized to achieve an optimal task-level reward. Following this work, Ma et al. (2017) introduce the softmax Q-distribution to interpret RAML and reveal its relation to Bayesian decision theory. These two works both alleviate the data sparsity problem by augmenting target examples based on the ground truth. Our method draws inspiration from them but proposes the more general Generative Bridging Network, which can transform the ground truth into different bridge distributions, from which samples are drawn to account for different interpretable factors.
5.3 Coaching
Our coaching GBN system is inspired by imitation learning by coaching He et al. (2012). Instead of directly behavior-cloning the oracle, they advocate learning hope actions as targets from a coach that interpolates between the learner's policy and the environment loss. As the learner makes progress, the targets provided by the coach become harsher, gradually improving the learner. Similarly, our proposed coaching GBN is motivated to construct an easy-to-learn bridge distribution which lies between the ground truth and the generator. Our experimental results confirm its effectiveness in relieving the learning burden.
Table 3: Ground-truth references and samples from the three bridges.
System  Uniform GBN
Property  Random Replacement  
Reference  the question is , is it worth it ?  
Bridge  the question lemon , was it worth it ?  
System  Language-model GBN
Property  Word Replacement  
Reference  now how can this help us ?  
Bridge  so how can this help us ?  
System  Coaching GBN  
Property  Reordering  
Reference  i need to have a health care lexicon . 

Bridge  i need a lexicon for health care .  
Property  Simplification  
Reference 


Bridge  most of us learned to bind our shoes . 
6 Conclusion
In this paper, we present the Generative Bridging Network (GBN) to overcome the data sparsity and overfitting issues of Maximum Likelihood Estimation in neural sequence prediction. Our implemented systems significantly improve performance compared with strong baselines. We believe the concept of a bridge distribution is applicable to a wide range of distribution matching tasks in probabilistic learning. In the future, we intend to explore more of GBN's applications as well as its provable computational and statistical guarantees.
References
 Bahdanau et al. (2016) Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2016. An actorcritic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086 .
 Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 .
 Cheng et al. (2016) Yong Cheng, Wei Xu, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Semisupervised learning for neural machine translation. arXiv preprint arXiv:1606.04596 .
 Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using rnn encoderdecoder for statistical machine translation. arXiv preprint arXiv:1406.1078 .
 Chorowski et al. (2015) Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. 2015. Attentionbased models for speech recognition. In Advances in Neural Information Processing Systems. pages 577–585.
 Chung et al. (2014) Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 .

 Doya (1992) Kenji Doya. 1992. Bifurcations in the learning of recurrent neural networks. In Circuits and Systems, 1992. ISCAS'92. Proceedings., 1992 IEEE International Symposium on. IEEE, volume 6, pages 2777–2780.
 He et al. (2012) He He, Jason Eisner, and Hal Daume. 2012. Imitation learning by coaching. In Advances in Neural Information Processing Systems. pages 3149–3157.
 Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long shortterm memory. Neural computation 9(8):1735–1780.
 Koehn et al. (2007) Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris CallisonBurch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions. Association for Computational Linguistics, pages 177–180.
 Lin (2004) ChinYew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out: Proceedings of the ACL04 workshop. Barcelona, Spain, volume 8.
 Ma et al. (2017) Xuezhe Ma, Pengcheng Yin, Jingzhou Liu, Graham Neubig, and Eduard Hovy. 2017. Softmax qdistribution estimation for structured prediction: A theoretical interpretation for raml. arXiv preprint arXiv:1705.07136 .
 Napoles et al. (2012) Courtney Napoles, Matthew Gormley, and Benjamin Van Durme. 2012. Annotated gigaword. In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Webscale Knowledge Extraction. Association for Computational Linguistics, pages 95–100.

 Niyogi et al. (1998) Partha Niyogi, Federico Girosi, and Tomaso Poggio. 1998. Incorporating prior information in machine learning by creating virtual examples. Proceedings of the IEEE 86(11):2196–2209.
 Norouzi et al. (2016) Mohammad Norouzi, Samy Bengio, Navdeep Jaitly, Mike Schuster, Yonghui Wu, Dale Schuurmans, et al. 2016. Reward augmented maximum likelihood for neural structured prediction. In Advances In Neural Information Processing Systems. pages 1723–1731.
 Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and WeiJing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, pages 311–318.
 Pereyra et al. (2017) Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. 2017. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548 .
 Ranzato et al. (2015) Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732 .
 Rush et al. (2015) Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685 .
 Sennrich et al. (2015) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709 .
 Srivastava et al. (2014) Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1):1929–1958.

 Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pages 2818–2826.
 Vinyals et al. (2015) Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015. Grammar as a foreign language. In Advances in Neural Information Processing Systems. pages 2773–2781.
 Williams (1992) Ronald J Williams. 1992. Simple statistical gradientfollowing algorithms for connectionist reinforcement learning. Machine learning 8(34):229–256.
 Wiseman and Rush (2016) Sam Wiseman and Alexander M Rush. 2016. Sequencetosequence learning as beamsearch optimization. arXiv preprint arXiv:1606.02960 .
 Xu et al. (2015) Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning. pages 2048–2057.
 Zeiler (2012) Matthew D Zeiler. 2012. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701 .
 Zhang et al. (2017) Jiacheng Zhang, Yang Liu, Huanbo Luan, Jingfang Xu, and Maosong Sun. 2017. Prior knowledge integration for neural machine translation using posterior regularization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). volume 1, pages 1514–1523.
 Zhang and Zong (2016) Jiajun Zhang and Chengqing Zong. 2016. Exploiting sourceside monolingual data in neural machine translation. In EMNLP. pages 1535–1545.
 Zhou et al. (2017) Qingyu Zhou, Nan Yang, Furu Wei, and Ming Zhou. 2017. Selective encoding for abstractive sentence summarization. arXiv preprint arXiv:1704.07073 .
Appendix A Supplemental Material
This part first provides detailed derivations of Equations (8) and (11) from Equations (7) and (10), since our uniform and language-model bridge distributions have closed-form solutions given a fixed uniform distribution and a language model as constraints. We then explain Equation (13), the objective function of the coaching bridge, whose constraint is the inverse KL compared with the previous two bridges, and give a detailed derivation of the gradient update in Equation (14).
Derivation of Equation (8)
(2) $\mathcal{L}_B(p_{\eta}) = \mathbb{E}_{y' \sim p_{\eta}(y'|y)}\big[-\Delta(y', y)\big] + \tau \, \mathbb{E}_{y' \sim p_{\eta}(y'|y)}\big[\log p_{\eta}(y'|y) - \log \mathcal{U}(y')\big] = \tau \, \mathbb{E}_{y' \sim p_{\eta}(y'|y)}\Big[\log p_{\eta}(y'|y) - \frac{\Delta(y', y)}{\tau}\Big] + \text{const} = \tau \, \mathrm{KL}\Big(p_{\eta}(y'|y) \,\Big\|\, \frac{\exp(\Delta(y', y)/\tau)}{Z}\Big) - \tau \log Z + \text{const}$ (18)
Here, the related constant $Z$ is needed to transform an unnormalized similarity score into a probability:
(2) $Z = \sum_{y'} \exp\big(\Delta(y', y)/\tau\big)$ (19)
Derivation of Equation (11)
(2) $\mathcal{L}_B(p_{\eta}) = \mathbb{E}_{y' \sim p_{\eta}(y'|y)}\big[-\Delta(y', y)\big] + \tau \, \mathbb{E}_{y' \sim p_{\eta}(y'|y)}\big[\log p_{\eta}(y'|y) - \log p_{LM}(y')\big] = \tau \, \mathrm{KL}\Big(p_{\eta}(y'|y) \,\Big\|\, \frac{p_{LM}(y')\exp(\Delta(y', y)/\tau)}{Z}\Big) - \tau \log Z$ (20)
Here, the related constant $Z$ is needed to transform an unnormalized weighted similarity score into a probability:
(2) $Z = \sum_{y'} p_{LM}(y')\,\exp\big(\Delta(y', y)/\tau\big)$ (21)
Explanation of Equation (13)
This equation is the objective function of our coaching bridge, which uses an inverse KL term^6 (^6 That is, the use of $\mathrm{KL}(p_{\theta}\,\|\,p_{\eta})$ instead of $\mathrm{KL}(p_{\eta}\,\|\,p_{\theta})$.) as part of its objective. The inverse KL is chosen for computational stability, for two reasons: 1) the inverse KL does not change the effect of the constraint; 2) the inverse KL requires sampling from the generator and uses those samples as targets to train the bridge, which has the same gradient form as MLE, so we do not need to consider baseline tricks from reinforcement learning implementations.
Gradient derivation of Equation (13)
(2) $\nabla_{\eta} \mathcal{L}_B = \nabla_{\eta}\,\mathbb{E}_{y' \sim p_{\eta}(y'|y)}\big[-\Delta(y', y)\big] - \tau \, \nabla_{\eta}\,\mathbb{E}_{y' \sim p_{\theta}(y'|x)}\big[\log p_{\eta}(y'|y)\big] = -\mathbb{E}_{y' \sim p_{\eta}(y'|y)}\big[\Delta(y', y)\,\nabla_{\eta} \log p_{\eta}(y'|y)\big] - \tau \, \mathbb{E}_{y' \sim p_{\theta}(y'|x)}\big[\nabla_{\eta} \log p_{\eta}(y'|y)\big]$ (22)