In recent years, encoder-decoder based models Bahdanau et al. (2015); Vaswani et al. (2017); Xu et al. (2015) have become the fundamental instrument for sequence-to-sequence learning, especially in tasks that involve natural language. The attention mechanism Luong et al. (2015); Bahdanau et al. (2015) proves essential for encoder-decoder based models to efficiently draw useful source information from the encoder. However, in the conventional model, the representations of the source sequence turn from fine-grained to increasingly coarse-grained as more encoder layers are stacked.
Recently, the Transformer Vaswani et al. (2017), which is based solely on attention mechanisms and eliminates recurrence entirely, has been proposed and has established the state-of-the-art in multiple NLP tasks. As shown in Figure 1 (a), among the representations output by different encoder layers, the common practice for the decoder in Transformer is to draw information only from the final encoder layer. While coarse-grained representations are good at expressing the overall meaning of the sentence, they are less precise in the finer details. This hurts the understanding of long sentences, which can be abundant in concrete, detailed information that coarse-grained representations are ill-suited to abstract over (see Figure 1 (b)). In Table 1, we give an example of the target sentences translated by different methods from the German-to-English translation task using Transformer. As we observe, the target sentence generated by the Base model reveals two common shortcomings in neural sequence-to-sequence models: 1) the generated text is unfaithful to the source (e.g., three compared to dreizehn “thirteen”); and 2) repeated texts are generated (e.g., at the age of three) See et al. (2017). Both can be attributed to the lack of detailed and accurate information, i.e., fine-grained representations. In such cases, it is beneficial to pass information from other levels to the decoder, so that in decoding it can resort to multiple views of the source sequence that carry more precise source information, which leads to better generation, as also shown in Table 1. Therefore, our work concerns the use of source representations of different levels in multi-layer encoder-decoder models, which has not been well studied in natural language processing.
|Source||aber ich hatte das große glück , ihn sehr früh , mit dreizehn jahren , kennenzulernen und so war ich schon zu meiner schulzeit auf seinen kursen .|
|Target||but i had the good fortune to meet him at a very young age , when i was thirteen , and so i always attended his courses while i was at school .|
|Base||but i was very lucky to meet him at the age of three at the age of three , and so at the time , i was on his classes .|
|Ours||but i was very lucky to meet him at the age of thirteen, and so at the time while i was at school , i always on his courses .|
In this paper, in order to explore whether the use of source representations of different granularity is instrumental in the encoder-decoder framework, we investigate multiple cross-view decoding strategies (see Figure 2). We propose to reuse the representations of different encoder layers as views of the source sequence and directly inject the information into the existing attention computation. However, we find it infeasible to entirely substitute the final layer representations because of a problem that we call the short circuit phenomenon, where the top encoder layers are shorted and contribute less to the decoding process. As a result, the ability of the multi-layer encoder to represent the source sequence is impaired, which in turn leads to performance degradation. To address the problem, we further propose soft integration and continued learning to stabilize the multi-layer training without loss of representational ability. Our cross-view decoding approach requires minimal parameter increase and can be easily integrated into existing encoder-decoder based models.
Through our experiments on machine translation with two strong baselines, i.e., Transformer Vaswani et al. (2017) and DynamicConv Wu et al. (2019), which are the previous state-of-the-art in machine translation, we find that one of the proposed variants, i.e., granularity consistent attention (GCA) (see Figure 2 (a)), improves the performance substantially. We speculate that the reason is that the GCA builds connections between the corresponding layers in the encoder and the decoder, so that the first decoder layer pays attention to the global information, i.e., coarse-grained representations, of the source sequence, which is instrumental in language modeling, while the last decoder layer pays attention to the fine-grained representations of the source sequence, which is helpful for generating more precise words. In fact, a similar connection pattern has been considered in biomedical image segmentation, i.e., the U-Net Ronneberger et al. (2015), which applies a skip connection between layers of similar representational granularity in the pyramid-like encoder and the inverted pyramid-like decoder. To the best of our knowledge, this is the first work that routes the source representations of different levels to the decoder layers for cross-view decoding in multi-layer encoder-decoder models in NLP.
Overall, our contributions are as follows:
- Instead of the conventional single-view decoding in sequence-to-sequence learning, we propose to consider representations of the source sequence from different encoder layers as multiple views of diverse granularity for cross-view decoding. We investigate multiple cross-view decoding strategies that encourage the decoder to take full advantage of the expressive power of the encoder.
- To enable efficient training with cross-view decoding and address the short circuit phenomenon, we propose to adopt soft integration and continued learning, which can be easily implemented on existing models.
- Experiments and analyses demonstrate that our approach works for representative models, i.e., Transformer and DynamicConv, and tasks, including machine translation, in natural language processing. Specifically, we advance the state-of-the-art of the machine translation task with an almost negligible parameter increase.
We first briefly review the conventional encoder-decoder model and then introduce the proposed cross-view decoding strategies realized on the Transformer model.
2.1 Encoder-Decoder Model and Transformer
In the conventional encoder-decoder model, the encoder encodes the source sequence with stacked encoder layers, defined as
By repeating the same process once for each encoder layer, the encoder outputs the representations of the source sequence. Transformer Vaswani et al. (2017) proposes Multi-Head Attention (MHA) and Feed-Forward Network (FFN) to implement the encoder layers, that is,
Then, the conventional decoder generates the target sequence with stacked decoder layers. Typically, only the representations from the final encoder layer serve as the bridge between the source and the target, via the attention mechanism. Unlike the encoder, the decoder has an extra objective that handles the intermediate representations from the encoder, defined as
For Transformer, the decoder achieves the extra objective by an extra multi-head attention:
where x and y stand for the query and the key/value, respectively.
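For orientation, the layer definitions discussed above can be sketched in the standard Transformer form. This is an assumed reconstruction in hypothetical notation, not necessarily the authors' exact equations: $N$ is the number of layers, $h^{l}$ and $y^{l}$ denote the outputs of encoder layer $l$ and decoder layer $l$, $\mathrm{MHA}(x, y)$ takes the query $x$ and the key/value $y$, and residual connections and layer normalization are omitted for brevity.

```latex
\begin{align}
h^{l} &= \mathrm{FFN}\big(\mathrm{MHA}(h^{l-1},\, h^{l-1})\big), & l &= 1, \dots, N, \\
y^{l} &= \mathrm{FFN}\big(\mathrm{MHA}(\mathrm{MHA}(y^{l-1},\, y^{l-1}),\; h^{N})\big), & l &= 1, \dots, N.
\end{align}
```

In this sketch every decoder layer queries the same memory $h^{N}$, which is the single-view pattern that cross-view decoding relaxes.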
Figure 1 (a) shows a conceptual illustration of the conventional Transformer model, focusing on the attention patterns. As we can see, it only makes use of the coarse-grained representations at the last encoder layer in decoding, which may fail to capture accurate source information that is better framed by representations from the first encoder layers, i.e., fine-grained representations. The effect is more obvious for longer source text, because it is difficult for the model to find the correct source information in the coarse-grained representations, which can make the generated target sentence unfaithful to the source sentence.
2.2 Cross-View Decoding
In order to encourage the decoder to make full use of the source sequence information from the global and local perspectives, we enhance the decoding process with multiple views of the source sequence at different granularity, so that the expressive power of the model can be fully exploited. In particular, we propose to conduct layer-wise cross-view decoding (see Figure 2): for each decoder layer, a different combination of the source views is considered.
Taking the generality into consideration, the cross-view decoding is implemented in a non-intrusive way:
Here, the decoder layer additionally receives the source representations of different granularity from the encoder layers, and a layer-wise function, elaborated in the following, implements the cross-view decoding strategy.
2.2.1 Cross-View Decoding Strategies
In the multi-layer setting, each layer adds another level of abstraction over the previous layer. It is believed that the representations in the encoder become increasingly coarse-grained and describe more global context. However, in the decoder, the process is more complex, since each decoder layer receives both the source information and the target information, which raises the question of how to properly incorporate different source views. We systematically investigate diverse strategies of routing the source views (see Figure 2), which are introduced in the following:
Granularity Consistent Attention (GCA):
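Figure 2 (a) illustrates the proposal. In this strategy, each decoding step is seen as a realization process, where an abstract, coarse-grained idea gradually turns into concrete, fine-grained words through the layers. Hence, the granularity consistent attention keeps the granularity of the source views in order with the decoder layers: the first decoder layer attends to the coarse-grained output of the final encoder layer, the last decoder layer attends to the fine-grained output of the first encoder layer, and the layers in between are paired in the same order.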
Granularity Parallel Attention (GPA):
In contrast to the GCA strategy, which keeps the granularity order in attention, we reverse the source granularity order. In this strategy, decoding is regarded the same as encoding, where each layer abstracts the source and the target sequences together, hence Granularity Parallel Attention (GPA): each decoder layer attends to the output of the encoder layer at the same depth.
Fine-Grained Attention (FGA):
In contrast to the conventional Transformer (see Figure 1), which only draws information from the final encoder layer, i.e., coarse-grained representations, we experiment with adding only the information of the first encoder layer, i.e., fine-grained representations. This strategy is named Fine-Grained Attention (FGA): every decoder layer receives the output of the first encoder layer as its source view.
Full Matching Attention (FMA):
We also consider routing all source views into each decoder layer, i.e., Full Matching Attention (FMA). In implementation, we combine the views by linear transformations. Note that FMA is also layer-wise, because each decoder layer uses its own linear transformation parameters.
Adaptive Matching Attention (AMA):
Finally, we apply an attention mechanism to inject information of various granularity levels adaptively, i.e., Adaptive Matching Attention (AMA). We build an independent vector for each encoder layer to predict its attention weight, and the AMA view is the combination of the encoder layer outputs weighted by the attention weights computed in this way.
It should be noted that, in our experiments, all variants are viable in practice and can improve the performance substantially, which validates our motivation and corroborates the effectiveness of our approach. Although all those variants enable cross-view decoding, the GCA strategy is more in line with the characteristics of the multi-layer encoding and decoding process, and provides the information more efficiently. The other variants may route unneeded information to the decoder layers and cause difficulty in learning.
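The routing rules of the strategies above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the encoder and the decoder are assumed to have the same number of layers, the function names `source_view_index` and `adaptive_view` are hypothetical, and the AMA-style mix is simplified to a softmax over one scalar score per encoder layer. FMA, which uses learned linear transformations, is left out.

```python
import numpy as np


def source_view_index(strategy: str, dec_layer: int, num_layers: int) -> int:
    """Return the 1-based encoder layer that decoder layer `dec_layer` attends to."""
    if not 1 <= dec_layer <= num_layers:
        raise ValueError("dec_layer must lie in [1, num_layers]")
    if strategy == "conventional":  # every decoder layer sees the final encoder layer
        return num_layers
    if strategy == "gca":  # coarse views for early decoder layers, fine views for late ones
        return num_layers - dec_layer + 1
    if strategy == "gpa":  # same depth in the encoder and the decoder
        return dec_layer
    if strategy == "fga":  # always the first (finest) encoder layer
        return 1
    raise ValueError(f"unknown strategy: {strategy}")


def adaptive_view(encoder_outputs: np.ndarray, scores: np.ndarray) -> np.ndarray:
    """AMA-style mix: softmax over per-layer scores, then a weighted sum of the views.

    encoder_outputs has shape (num_layers, seq_len, dim); scores has shape (num_layers,).
    """
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights = weights / weights.sum()
    # Contract the layer axis: (num_layers,) x (num_layers, seq_len, dim) -> (seq_len, dim)
    return np.tensordot(weights, encoder_outputs, axes=1)


if __name__ == "__main__":
    n = 6
    for name in ("conventional", "gca", "gpa", "fga"):
        print(name, [source_view_index(name, i, n) for i in range(1, n + 1)])
```

With six layers, the GCA mapping pairs decoder layers 1 to 6 with encoder layers 6 down to 1, whereas GPA pairs them with encoder layers 1 up to 6.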
2.2.2 Short Circuit Phenomenon
However, if we apply the cross-view decoding to the model directly, the top encoder layers of the model may be shorted (Short Circuit Phenomenon). To explain the phenomenon, we take the GCA (see Figure 2 (a)) as an example. Intuitively, the direct connection between the first encoder layer and the last decoder layer is shorter than any other connection between the encoder layers and the decoder layers, so the gradient information can bypass the remaining encoder layers and reach the first encoder layer directly. In consequence, the ability of the multi-layer encoder to represent the source sequence is impaired, which in turn leads to performance degradation.
Formally, the short circuit phenomenon can be explained by the following analysis. Consider the loss function of the conventional model and the parameters of each encoder layer and each decoder layer. Taking the most significant case, the top encoder layer, as an example and applying back-propagation LeCun et al. (1989), we can get the gradient of the top encoder layer in the conventional model and similarly for models with GCA:
As we can see, the norm of the gradient under GCA could be significantly smaller than the norm of the gradient in the conventional model, which indicates that the top encoder layer of the conventional model has a greater impact on the final output than that of the GCA. In other words, the top encoder layer of GCA has less effect on the final output, i.e., the layer is shorted. This causes the information to flow directly from the bottom encoder layer, and less through the top encoder layer. Our experimental results also show that the direct application of cross-view decoding causes performance degradation.
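The comparison can be sketched with the chain rule. The notation here is assumed for illustration and may differ from the authors' own: $h^{N}$ is the output of the top encoder layer with parameters $\theta_{e}^{N}$, $c_{j}$ is the output of the encoder-decoder attention in decoder layer $j$, and $\delta_{j} = \partial \mathcal{L} / \partial c_{j}$ is the gradient back-propagated to $c_{j}$. In the conventional model all $N$ decoder layers read $h^{N}$, whereas under a hard GCA replacement only the first decoder layer does.

```latex
\begin{align}
\text{conventional:} \quad
\frac{\partial \mathcal{L}}{\partial \theta_{e}^{N}}
  &= \sum_{j=1}^{N} \delta_{j}\,
     \frac{\partial c_{j}}{\partial h^{N}}\,
     \frac{\partial h^{N}}{\partial \theta_{e}^{N}}, \\
\text{hard GCA:} \quad
\frac{\partial \mathcal{L}_{\mathrm{GCA}}}{\partial \theta_{e}^{N}}
  &= \delta_{1}^{\mathrm{GCA}}\,
     \frac{\partial c_{1}}{\partial h^{N}}\,
     \frac{\partial h^{N}}{\partial \theta_{e}^{N}}.
\end{align}
```

Under these assumptions the conventional gradient aggregates $N$ paths while the GCA gradient keeps a single one, which is consistent with, though it does not by itself prove, the smaller gradient norm described above.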
We adopt two mitigations, i.e., soft integration and continued learning, to address this structural problem.
To properly train the encoder stack, the final encoder layer needs comprehensive error signals. A simple way to achieve that is to incorporate the additional source views on top of the representations of the final encoder layer. Instead of a hard replacement, this is a soft integration of cross-view decoding, which means the decoder of the final model is updated as follows:
where LN stands for Layer Normalization Ba et al. (2016). The layer normalization is needed to keep the scale of the representations for stability in deep neural networks.
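One plausible reading of soft integration with GCA can be sketched in NumPy as follows. The exact update rule is an assumption: here the routed source view is added to the final encoder output and the sum is layer-normalized before the decoder attends to it. The helper names `layer_norm` and `soft_integrated_memory` are hypothetical, and the learned gain and bias of layer normalization are omitted.

```python
import numpy as np


def layer_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize the last axis to zero mean and unit variance (no learned gain/bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def soft_integrated_memory(encoder_outputs: list, dec_layer: int) -> np.ndarray:
    """Memory attended to by decoder layer `dec_layer` (1-based) under soft GCA.

    encoder_outputs[k] is the output of encoder layer k + 1, shape (seq_len, dim).
    The GCA view (encoder layer N - dec_layer + 1) is added on top of the final
    encoder output instead of replacing it, so the top layer keeps its error signal.
    """
    num_layers = len(encoder_outputs)
    final = encoder_outputs[-1]
    view = encoder_outputs[num_layers - dec_layer]  # 0-based index of layer N - i + 1
    return layer_norm(final + view)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    outputs = [rng.normal(size=(5, 8)) for _ in range(6)]
    memory = soft_integrated_memory(outputs, dec_layer=6)
    print(memory.shape)
```

Because the final encoder output appears in the memory of every decoder layer, the top encoder layer still receives gradients from all decoder layers, which is the property the hard replacement loses.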
For the conventional encoder-decoder models, once trained, their encoder should be capable of describing the source sequence at different granularity. Such ability is crucial to the cross-view decoding. Moreover, owing to the augmentative nature of the proposal, the model structure can be extended seamlessly upon the original model. We therefore propose to conduct continued learning based on the trained conventional models. In practice, the conventional model is first trained normally, and then the cross-view decoding is applied to further fine-tune the attention structure. It also means that previously trained and publicly available models can be re-used to implement our approach, greatly reducing training cost Dodge et al. (2020); Li et al. (2020c); Bachlechner et al. (2020).
|Methods||EN-DE||EN-FR||DE-EN|
|Transformer Vaswani et al. (2017)||28.4||41.0||-|
|Layer-wise Coordination He et al. (2018)||29.0||-||35.1|
|Fixup Zhang et al. (2019)||29.3||-||34.5|
|Deep Representations Dou et al. (2018)||29.2||-||-|
|Transformer Ott et al. (2018)||29.3||43.2||-|
|Evolved Transformer So et al. (2019)||29.8||41.3||-|
|DynamicConv Wu et al. (2019)||29.7||43.2||35.2|
|Transformer + Cross-View Dec.||29.9 (+0.9)||42.6 (+1.5)||35.9 (+1.2)|
|DynamicConv + Cross-View Dec.||29.8 (+0.5)||43.5 (+0.4)||36.2 (+1.0)|
denotes statistically significant results (t-test with). As a whole, the proposed cross-view decoding with granularity consistent attention significantly improves the baselines.
Our main experiments focus on machine translation Bahdanau et al. (2015); Garg et al. (2020); Sun et al. (2020b); Li et al. (2020b); Sun et al. (2020a); Zhao et al. (2019a, b), which is arguably the most important sequence-to-sequence learning task in natural language processing. We report results using the granularity consistent attention (GCA) for cross-view decoding, which is the best-performing strategy in our preliminary experiments.
As the proposal only relates to the injection of different mixes of source representations and is augmentative to the existing models, we keep the inner structure of the baselines untouched and preserve the original settings. For soft integration, we initialize the original attention structures with the parameters of the re-implemented baseline models. For continued learning, we further fine-tune the full model with the number of training steps used to re-implement the baseline model. We experiment on Transformer Vaswani et al. (2017) and DynamicConv Wu et al. (2019), where the former has six blocks, each of which consists of a dot-product based multi-head attention layer and a feed-forward layer. The latter is based on convolutional neural networks and is built on lightweight convolutions that predict a different kernel at every time-step, similar to the attention weights computed by self-attention. Notably, DynamicConv Wu et al. (2019) established a state-of-the-art in the WMT EN-DE and EN-FR translation tasks in comparable settings, i.e., the model did not use a much larger extra dataset for training, as in BackTranslate Edunov et al. (2018); Dou et al. (2020). For the re-implementation of DynamicConv, we use the configuration of six blocks for both encoder and decoder. For a detailed introduction to the tasks and our implementation details, please refer to Appendix A.
We report results on three benchmarks, including two large WMT-2014 datasets, i.e., English-German (EN-DE) and English-French (EN-FR), and a small IWSLT-2014 dataset, i.e., German-English (DE-EN). As shown in Table 2, our approach outperforms all the baselines on the three datasets. Based on the Transformer, we improve the baseline by 0.9, 1.5 and 1.2 BLEU score for EN-DE, EN-FR and DE-EN, respectively. More encouragingly, based on DynamicConv, which is the previous state-of-the-art, our approach sets a new state-of-the-art performance on the three datasets, achieving 29.8, 43.5 and 36.2 BLEU score on EN-DE, EN-FR and DE-EN, respectively. The improvements on various datasets demonstrate the effectiveness of the proposal. Since GCA simply reroutes the original source information without conducting further transformations, the results also suggest that the expressive power of the existing models is overlooked and that providing the decoder with representations of the appropriate granularity has great potential in improving parameter efficiency.
In this section, we conduct analysis from different perspectives to better understand our approach. Unless otherwise specified, the analysis is conducted using Transformer-Base.
Analysis on Cross-View Decoding Strategies.
The results of the granularity consistent attention (GCA) strategy and other variants, i.e., granularity parallel attention (GPA), fine-grained attention (FGA), full matching attention (FMA) and adaptive matching attention (AMA), on DE-EN are summarized in Table 4. As we can see, all the considered strategies can improve the performance, of which GCA stands out the most. GPA cannot provide the fine-grained representations for the last decoder layer. FGA only provides a single view of the source sequence. FMA combines all views and may introduce redundant noise, while the AMA strategy has a hard time learning proper weights for different views. Those results further demonstrate that the GCA strategy is more in line with the characteristics of the multi-layer encoding and decoding process, and provides the information more efficiently.
|Baseline + GCA||33.6|
|Baseline + GCA + Soft Integration||34.0|
|Baseline + GCA + Continued Learning||34.9|
In order to analyze the performance of our approach on sentences of different lengths, we group sentences of similar lengths together and calculate the BLEU score for each group. As shown in Figure 3, our approach is superior to the baseline in all length segments on the DE-EN and EN-DE datasets, and the longer the sentences, the larger the improvements. Intuitively, it is hard for the global representation from the final encoder layer to retain all the detailed input information, especially for longer sentences. However, in the conventional encoder-decoder model, the decoder is only equipped with a single view of the source sequence, which causes a dilemma: although both global and local information are important, only one can be used. In contrast, we can avoid the dilemma by adopting cross-view decoding, which injects fine-grained representations into the decoder and keeps the original global representation at the same time.
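The length-based grouping described above can be sketched as follows. This is an illustrative helper rather than the authors' evaluation script: the bucket edges are arbitrary, sentences are grouped by whitespace-tokenized source length (an assumption, since the grouping criterion is not specified), and the metric is passed in as a callable so that any corpus-level BLEU implementation can be plugged in. The toy `exact_match` metric only serves to keep the example self-contained.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence


def length_bucket(num_tokens: int, edges: Sequence[int]) -> str:
    """Map a sentence length to a bucket label such as '11-20' or '>50'."""
    lower = 0
    for upper in edges:
        if num_tokens <= upper:
            return f"{lower + 1}-{upper}"
        lower = upper
    return f">{edges[-1]}"


def score_by_length(
    sources: List[str],
    hypotheses: List[str],
    references: List[str],
    metric: Callable[[List[str], List[str]], float],
    edges: Sequence[int] = (10, 20, 30, 40, 50),
) -> Dict[str, float]:
    """Group sentence triples by source length and score each group separately."""
    groups = defaultdict(lambda: ([], []))
    for src, hyp, ref in zip(sources, hypotheses, references):
        hyps, refs = groups[length_bucket(len(src.split()), edges)]
        hyps.append(hyp)
        refs.append(ref)
    return {name: metric(hyps, refs) for name, (hyps, refs) in groups.items()}


def exact_match(hyps: List[str], refs: List[str]) -> float:
    """Toy stand-in for a corpus-level metric: fraction of exactly matching outputs."""
    return sum(h == r for h, r in zip(hyps, refs)) / len(hyps)


if __name__ == "__main__":
    srcs = ["a b c", "a b c d e f g h i j k l"]
    print(score_by_length(srcs, ["x", "y"], ["x", "z"], exact_match, edges=(10,)))
```

Running the same function once with the baseline outputs and once with the cross-view outputs gives the per-bucket comparison of the kind plotted in Figure 3.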
In this section, we conduct the ablation analysis to investigate the contribution of each strategy in the proposal (see Table 4).
Effect of Soft Integration. The introduction of the soft integration strategy, i.e., keeping the conventional model unchanged, improves the performance of the Baseline + GCA. As expected, the soft integration strategy is essential to stabilize the propagation of information and ease the optimization of the model. However, the performance of Baseline + GCA + Soft Integration is still lower than the baseline model. The reason is that cross-view decoding presupposes that the encoder is capable of generating representations of different granularity. However, the granularity consistent attention strategy, even in an augmenting manner, allows the error signals to bypass the top encoder layers and leaves the top layers insufficiently trained, which in turn affects the decoder’s ability to capture source semantics. To fully overcome the problem, a possible solution is that the encoder should be adequately trained first, before the integration of cross-view information in decoding.
|Methods||Abstractive Summarization||Image Captioning|
|Metric||ROUGE-1||ROUGE-2||ROUGE-L||SPICE||CIDEr|
|+ Cross-View Dec.||39.9 (+0.6)||18.0 (+0.7)||36.8 (+0.6)||22.6 (+1.4)||129.4 (+4.5)|
Experimental results of text summarization and image captioning. The significance marker is defined as in Table 2.
Effect of Continued Learning. As we can see, when the model lacks the continued learning strategy (i.e., Baseline w/ GCA), it experiences significant performance degradation. The reason is that the proposal depends on the hierarchical representation of conventional base model, and the chaotic representations in Baseline w/ GCA will prevent the model from capturing diverse representations of source sequence. Besides, the results also validate the motivations of the initialization strategy.
Overall, those mitigations are essential to the cross-view decoding approach, which also make the proposal fully augmentative, so that we can easily integrate it into existing systems. It also eliminates the need to train the whole model together from scratch, if pre-trained parameters are available.
Analysis on Longer Training. To verify whether the improvements are simply due to the longer training that comes with continued learning in our approach, we further fine-tune the baseline model, i.e., Transformer, using the same number of training steps as our full model, and measure the performance of the best single model. The results show that the performance on the DE-EN dataset is not further improved, and there are only improvements of 0.1 and 0.3 BLEU on EN-DE and EN-FR, respectively, when the baseline models are trained further. In contrast, our proposal enables the baseline model to achieve gains of 1.2, 0.9 and 1.5 BLEU on the DE-EN, EN-DE and EN-FR datasets, respectively (see Table 2).
It is interesting to see whether the proposed approach works for other sequence-to-sequence tasks. We further conduct experiments on the CNN-Daily Mail dataset for abstractive summarization Hermann et al. (2015); Çelikyilmaz et al. (2018); Lin et al. (2018); Ma et al. (2018) and the COCO dataset for image captioning Chen et al. (2015); Anderson et al. (2018); Liu et al. (2018, 2019a, 2019b). For detailed introduction to the two tasks and our implementation details, please refer to Appendix A.
Abstractive Summarization. Table 5 shows the results in terms of ROUGE-1, ROUGE-2 and ROUGE-L Lin (2004). As we can see, the proposal achieves an advantage over the Transformer, which indicates that our approach generalizes well to tasks with much longer source sequences (around 400 words) than the other tasks and is effective in dealing with longer sequences. In fact, in such scenarios, where the summary should be concise but also accurate in detail, the GCA naturally streamlines the decoding process.
Image Captioning. This task combines image understanding and language generation and is a cross-modal task compared to machine translation and text summarization. The source sequence contains non-ordered region-of-interest features Anderson et al. (2018). Table 5 reports the results on the test set in terms of SPICE Anderson et al. (2016) and CIDEr Vedantam et al. (2015), which are specifically designed to evaluate image captioning systems. As we can see, our approach further improves the performance of the baseline to 22.6 SPICE score and 129.4 CIDEr score, which is competitive with the state-of-the-art model AoANet Huang et al. (2019) (22.4 SPICE score and 129.8 CIDEr score). The results suggest that the proposal can be applied to a wide range of sequence generation tasks, no matter what the source representation is, which demonstrates the generality of our approach.
In Figure 4, we list some examples on IWSLT DE-EN and COCO image captioning task to analyze how our approach improves the baseline. The examples show that our approach does not alter the structure of the output sentence significantly compared to the baselines. The reason is that our approach can be seen as an extension for fine-tuning existing models. However, our approach can capture more detailed information about the source sequence, which helps improve the quality of the generated sentences in terms of details. For example, in machine translation, our approach enables the model to use words that are more precise, including verb forms, and singulars/plurals, especially when the baseline is unable to choose a proper word to continue the sentence, e.g., repetition. For the image captioning, our approach helps the model to generate more detailed captions in colors (e.g., “pink” umbrella) and attributes (e.g., “rainy” street), for each object.
5 Related Work
Sequence-to-sequence models typically depend on the encoder-decoder framework to map a source sequence to a target sequence, such as in machine translation and abstractive summarization. The encoder network computes intermediate representations for the source sequence and the decoder network defines a probability distribution over target sentences given those intermediate representations. Specifically, to allow a more efficient use of the source sequences, a series of attention methods have been proposed Bahdanau et al. (2015); Vaswani et al. (2017); Xu et al. (2015); Luong et al. (2015) to directly provide the decoder with source information. In particular, fully-attentive models, e.g., Transformer Vaswani et al. (2017), in which no recurrence is required, have recently been proposed and successfully applied to multiple tasks, e.g., neural machine translation. The work on attention reveals that attention is efficient, necessary, and powerful at combining information from diverse sources. Despite the dominance of these models in the last few years, little work has been done discussing the effect of the connection between the encoder and the decoder in sequence-to-sequence learning.
Using Deep Representations. In natural language processing, several efforts Peters et al. (2018); Shen et al. (2018); Wang et al. (2018); Dou et al. (2018); Bapna et al. (2018); Dou et al. (2019); Li et al. (2020a) have investigated strategies to make the best use of deep representations among layers, e.g., using linear combination Peters et al. (2018), dense connection Shen et al. (2018) and hierarchical layer aggregation Dou et al. (2018). However, they focused on the information within the encoder or the decoder, and excluded the effect of the information flow from the encoder to the decoder Dou et al. (2018, 2019); Li et al. (2020a), which is the main consideration of this work. In fact, a related study by He et al. (2018) has proposed a layer-wise attention approach in order to build the encoder and the decoder gradually layer by layer, yet the complete model still adopts the conventional pattern that the decoder’s last layer receives information only from the encoder’s last layer. In short, this is the first systematic study on injecting various kinds of encoder representations into the decoder layers using different attention connections. Of the five strategies considered, all of which perform better than the commonly-used strategy, the proposed granularity consistent attention stands out and is distinct from previous work.
In computer vision, injecting deep representation information across layers to better fuse semantic and spatial information has achieved great success in promoting various downstream tasks Huang et al. (2017); Yu et al. (2018); Ronneberger et al. (2015). In particular, in image segmentation Ronneberger et al. (2015); Long et al. (2015), U-Net Ronneberger et al. (2015) considered the connection between the encoder and the decoder. As far as the resulting method is concerned, the granularity consistent attention strategy shares the same spirit with U-Net. Nonetheless, the different sizes of the feature maps in computer vision make the granularity consistent version the only sensible connection for encoder-decoder-based segmentation models. Thus, the U-Net, as a specific segmentation method, only considers one kind of connection, while we study different kinds of connections suitable for NLP.
In this work, we focus on enhancing the information transfer between the encoder and the decoder for sequence-to-sequence learning, through injecting fine-grained source representations into the generation process. We propose the layer-wise cross-view decoding approach to route source representations of different granularity to different decoder layers. Two mitigations are devised to address the short circuit phenomenon that comes with cross-view decoding, which also make the proposal augmentative to existing models. Out of several cross-view strategies, we find that the granularity consistent attention strategy for context attention shows the best improvements. Experiments on the machine translation task verify the effectiveness of our approach. In particular, it outperforms the DynamicConv model, which is the previous state-of-the-art in machine translation. The analyses demonstrate that the use of different types of representations from the encoder, which provide views of different granularity of the source sequence, is instrumental in exerting the expressive power of encoder-decoder models.
- SPICE: Semantic propositional image caption evaluation. In ECCV, Cited by: §A.3, §4.
- Bottom-up and top-down attention for image captioning and VQA. In CVPR, Cited by: §A.3, §4, §4.
- Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §2.2.2.
- ReZero is all you need: fast convergence at large depth. arXiv preprint arXiv:2003.04887. Cited by: §2.2.2.
- Neural machine translation by jointly learning to align and translate. In ICLR, Cited by: §1, §3, §5.
- METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In ACL Workshop, Cited by: §A.3.
- Training deeper neural machine translation models with transparent attention. In EMNLP, Cited by: §5.
- Deep communicating agents for abstractive summarization. In NAACL-HLT, Cited by: §4.
- Microsoft COCO captions: data collection and evaluation server. arXiv preprint arXiv:1504.00325. Cited by: §A.3, §4.
- Fine-tuning pretrained language models: weight initializations, data orders, and early stopping. arXiv preprint arXiv:2002.06305. Cited by: §2.2.2.
- Dynamic data selection and weighting for iterative back-translation. arXiv preprint arXiv:2004.03672. Cited by: §3.
- Exploiting deep representations for neural machine translation. In EMNLP, Cited by: Table 2, §5.
- Dynamic layer aggregation for neural machine translation with routing-by-agreement. In AAAI, Cited by: §1, §5.
- Understanding back-translation at scale. In EMNLP, Cited by: §3.
- Controllable abstractive summarization. In NMT@ACL 2018, Cited by: §A.2.
- Echo state neural machine translation. arXiv preprint arXiv:2002.11847. Cited by: §3.
- Layer-wise coordination between encoder and decoder for neural machine translation. In NeurIPS, Cited by: Table 2, §5.
- Teaching machines to read and comprehend. In NIPS, Cited by: §A.2, §4.
- Densely connected convolutional networks. In CVPR, Cited by: §5.
- Attention on attention for image captioning. In ICCV, Cited by: §4.
- Deep visual-semantic alignments for generating image descriptions. In CVPR, Cited by: §A.3.
- OpenNMT: open-source toolkit for neural machine translation. In ACL, Cited by: §A.2.
- Backpropagation applied to handwritten zip code recognition. Neural Computation. Cited by: §2.2.2.
- Neuron interaction based representation composition for neural machine translation. In AAAI, Cited by: §5.
- Regularized context gates on transformer for machine translation. arXiv preprint arXiv:1908.11020. Cited by: §3.
- Train large, then compress: rethinking model size for efficient training and inference of transformers. arXiv preprint arXiv:2002.11794. Cited by: §2.2.2.
- ROUGE: A package for automatic evaluation of summaries. In ACL, Cited by: §A.2, §A.3, §4.
- Global encoding for abstractive summarization. In ACL, Cited by: §4.
- Aligning visual regions and textual concepts for semantic-grounded image representations. In NeurIPS, Cited by: §4.
- Exploring and distilling cross-modal information for image captioning. In IJCAI, Cited by: §4.
- SimNet: stepwise image-topic merging network for generating detailed and comprehensive image captions. In EMNLP, Cited by: §4.
- Fully convolutional networks for semantic segmentation. In CVPR, Cited by: §5.
- Effective approaches to attention-based neural machine translation. In EMNLP, Cited by: §1, §5.
- A hierarchical end-to-end model for jointly improving text summarization and sentiment classification. In IJCAI, Cited by: §4.
- Fairseq: A fast, extensible toolkit for sequence modeling. In NAACL-HLT, Cited by: §A.1.
- Scaling neural machine translation. In WMT, pp. 1–9. Cited by: §A.1, Table 2.
- BLEU: A method for automatic evaluation of machine translation. In ACL, Cited by: §A.3.
- A deep reinforced model for abstractive summarization. In ICLR, Cited by: §A.2.
- Deep contextualized word representations. In NAACL-HLT, Cited by: §5.
- U-net: convolutional networks for biomedical image segmentation. In MICCAI, Cited by: §1, §5.
- Get to the point: summarization with pointer-generator networks. In ACL, Cited by: §A.2, §1.
- Dense information flow for neural machine translation. In NAACL-HLT, Cited by: §5.
- The evolved transformer. In ICML, Cited by: Table 2.
- Knowledge distillation for multilingual unsupervised neural machine translation. arXiv preprint arXiv:2004.10171. Cited by: §3.
- Self-training for unsupervised neural machine translation in unbalanced training data scenarios. arXiv preprint arXiv:2004.04507. Cited by: §3.
- Attention is all you need. In NIPS, Cited by: §A.1, Figure 1, §1, §1, §1, §2.1, Table 2, §3, §5.
- CIDEr: consensus-based image description evaluation. In CVPR, Cited by: §A.3, §4.
- Multi-layer representation fusion for neural machine translation. In COLING, Cited by: §5.
- Pay less attention with lightweight and dynamic convolutions. In ICLR, Cited by: §A.1, Table 2, §3.
- Show, attend and tell: neural image caption generation with visual attention. In ICML, Cited by: §1, §5.
- Deep layer aggregation. In CVPR, Cited by: §5.
- Fixup initialization: residual learning without normalization. In ICLR, Cited by: Table 2.
- Explicit sparse transformer: concentrated attention through explicit selection. arXiv preprint arXiv:1912.11637. Cited by: §3.
- MUSE: parallel multi-scale attention for sequence to sequence learning. arXiv preprint arXiv:1911.09483. Cited by: §3.
Appendix A Task and Implementation Details
A.1 Machine Translation
We report results on three benchmarks: two large WMT-2014 datasets, English-German (EN-DE) and English-French (EN-FR), and one small IWSLT-2014 dataset, German-English (DE-EN). Following common practice (Vaswani et al.; Wu et al.), for the EN-DE dataset we use newstest2013 for development and newstest2014 for testing. For EN-FR, we validate on newstest2012+2013 and test on newstest2014. We use fairseq (Ott et al.) both for our re-implementation of the baselines and for the baselines equipped with our proposal. For the Transformer experiments on the two much larger WMT datasets, we use the Transformer-Big configuration and train on 8 GPUs. For the experiments on the relatively small IWSLT DE-EN dataset, we use the Transformer-Base configuration and train on a single GPU. For all datasets, we report single-model performance obtained by averaging the last 10 checkpoints. We use beam search with a beam size of 4 and a length penalty of 0.6 for EN-DE and EN-FR, and a beam size of 5 for DE-EN. For fair comparison, following Wu et al. and Vaswani et al., for WMT EN-DE and EN-FR we measure case-sensitive tokenized BLEU (multi-bleu.pl) against the reference translations. For IWSLT DE-EN, we report case-insensitive BLEU; since the target language is English, the results remain valid. For WMT EN-DE only, we apply compound splitting similar to Wu et al. For the WMT EN-DE and EN-FR datasets, we also accumulate gradients over 16 batches before applying an update (Ott et al.), except for Transformer on EN-FR, where we do not accumulate gradients. Note that we report results averaged over 5 different seeds for each dataset, except for the WMT EN-DE and EN-FR datasets, because training on them is computationally expensive.
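As an illustration of the checkpoint averaging mentioned above, the following is a minimal sketch, assuming each checkpoint is a plain dictionary that maps a parameter name to a flat list of floats. The function name `average_checkpoints` and this checkpoint layout are hypothetical and chosen for the example; this is not the fairseq implementation.

```python
from typing import Dict, List, Sequence

# Hypothetical checkpoint layout: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def average_checkpoints(checkpoints: Sequence[Params], last_n: int = 10) -> Params:
    """Return the element-wise mean of the last `last_n` checkpoints."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    selected = checkpoints[-last_n:]
    averaged: Params = {}
    for name in selected[0]:
        # Group the i-th weight of this parameter across all selected checkpoints.
        columns = zip(*(ckpt[name] for ckpt in selected))
        averaged[name] = [sum(col) / len(selected) for col in columns]
    return averaged


# Toy usage: 12 saved checkpoints, of which only the last 10 are averaged.
toy_checkpoints = [{"w": [float(i), 2.0 * i]} for i in range(12)]
averaged_params = average_checkpoints(toy_checkpoints, last_n=10)
```

The averaged parameters are then loaded into a single model for evaluation, which is why the reported numbers count as single-model performance rather than an ensemble.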
A.2 Abstractive Summarization
We train the models on the CNN-Daily Mail dataset (Hermann et al.), which contains online news articles (781 tokens on average) paired with multi-sentence summaries (56 tokens on average). Following See et al., we truncate each source article to 400 words and each target summary to 100 words. This dataset tests the ability of our approach to deal with longer texts. For training, we use the default settings provided by OpenNMT (Klein et al.). For the experiments with Transformer, we use the Transformer-Base configuration and train on a single GPU. When generating summaries, we follow standard practice in tuning the maximum output length, disallowing repetition of the same trigram, and applying a stepwise length penalty (Paulus et al.; Fan et al.). ROUGE-1, ROUGE-2, and ROUGE-L (Lin) are used to evaluate model performance.
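The truncation and trigram-blocking steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: inputs are whitespace-tokenized word lists, and the helper names `truncate` and `blocked_next_tokens` are hypothetical rather than taken from OpenNMT.

```python
from typing import List, Sequence, Set


def truncate(tokens: Sequence[str], max_len: int) -> List[str]:
    """Keep only the first `max_len` tokens (e.g. 400 for sources, 100 for targets)."""
    return list(tokens[:max_len])


def blocked_next_tokens(prefix: Sequence[str], n: int = 3) -> Set[str]:
    """Tokens that would repeat an n-gram already present in `prefix`."""
    if len(prefix) < n:
        return set()  # no complete n-gram has been generated yet
    # The last n-1 generated tokens form the context of the next n-gram.
    context = tuple(prefix[len(prefix) - (n - 1):])
    blocked: Set[str] = set()
    for i in range(len(prefix) - n + 1):
        if tuple(prefix[i:i + n - 1]) == context:
            blocked.add(prefix[i + n - 1])
    return blocked


# Toy usage: "at the age" was already generated, so "age" is blocked after "at the".
generated = "at the age of three at the".split()
banned = blocked_next_tokens(generated, n=3)
```

During beam search, a decoder would typically assign the banned tokens a score of negative infinity at the current step, so that no hypothesis contains the same trigram twice.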
A.3 Image Captioning
This task combines image understanding and language generation; unlike machine translation and text summarization, it is cross-modal. We conduct experiments on the popular COCO dataset (Chen et al.). We use the publicly available splits from Karpathy and Li for evaluation; the validation set and the test set contain 5,000 images each. For the experiments on the image captioning dataset, we use the Transformer-Base model and train on a single GPU. For fair comparison, we use the RCNN-based image features provided by Anderson et al. We report results using the evaluation toolkit of Chen et al., which includes the commonly used metrics SPICE (Anderson et al.), CIDEr (Vedantam et al.), BLEU (Papineni et al.), METEOR (Banerjee and Lavie), and ROUGE (Lin).