Encoder-decoder based neural machine translation (NMT) models Sutskever et al. (2014); Bahdanau et al. (2014); Wu et al. (2016); Vaswani et al. (2017); Zhang et al. (2019) have made great progress and drawn much attention in recent years. In practical applications, NMT systems are often fed document-level input, which requires reference resolution, consistency of tenses and noun expressions, and so on. Many researchers have shown that contextual information is essential for generating coherent and consistent document-level translations Hardmeier (2012); Meyer and Webber (2013); Sim Smith (2017); Jean et al. (2017); Maruf and Haffari (2018); Miculicich et al. (2018); Zhang et al. (2018); Voita et al. (2018); Wang et al. (2017); Tu et al. (2018); Maruf et al. (2019). Despite their great success, the encoder-decoder NMT models above are designed for sentence-level translation, and their architectures exclude contextual information.
On these grounds, the widely used hierarchical attention network (HAN) was proposed to integrate contextual information into document-level translation Miculicich et al. (2018). In this method, the context sentences are considered in the form of hierarchical attention retrieved by the current generation step. That is, it uses word-level attention to represent a sentence and then sentence-level attention to represent all the involved context. In this way, the final attention representation has to encode all the information needed for coherent and consistent translation, including reference information, tenses, expressions and so on. To obtain this multi-perspective information, it is necessary to distinguish the role of each context word and to model the relationships among context words, especially when one context word can take on multiple roles Zhang et al. (2018). However, this is difficult for the HAN model, because its final context representation is produced from the isolated relevance of each context word to the query word, which ignores relations with the other context words.
To address this problem, we bring capsule networks to document-level translation; they have proven good at modelling the part-whole relations between low-level capsules and high-level capsules Hinton et al. (2011); Xiao et al. (2018); Sabour et al. (2017); Hinton et al. (2018); Gu and Feng (2019). With capsule networks, the words in a context source sentence are taken as low-level capsules and the information of different perspectives is treated as high-level capsules. Then, in the dynamic routing process, the low-level capsules trade off against each other, consider all the high-level capsules, and distribute themselves over the high-level capsules in proper proportions. In this way, both the relations among low-level capsules and the relations between low-level and high-level capsules are explored. To make sure that the high-level capsules indeed cluster the information needed by the target translation, we apply capsule networks to both sides of the current sentence and add a regularization layer that uses Pearson Correlation Coefficients to pull the high-level capsules on the two sides toward each other. In addition, we still need to ensure that the final output of the capsule networks is relevant to the current sentence. Therefore, we propose a Query-guided Capsule Network (QCN) in which the current source sentence takes part in the routing process, so that the high-level capsules retain information related to the current source sentence.
To the best of our knowledge, this is the first work which applies capsule networks to document-level translation tasks and QCN is also the first attempt to customize attention for capsule networks in translation tasks. We conducted experiments on three English-German translation data sets in different domains and the results demonstrate that our method can significantly improve the performance of document-level translation compared with strong baselines.
2.1 Sentence-level NMT and Transformer Model
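Sentence-level NMT models are generally based on an encoder-decoder framework Sutskever et al. (2014); Bahdanau et al. (2014); Wu et al. (2016); Vaswani et al. (2017); Zhang et al. (2019); Meng and Zhang (2019). In this framework, the encoder encodes the source sentence $\mathbf{x} = (x_1, \ldots, x_S)$ into a sequence of continuous representations $\mathbf{z} = (z_1, \ldots, z_S)$. Given $\mathbf{z}$, the decoder predicts target words in order and returns the target translation $\mathbf{y} = (y_1, \ldots, y_T)$. The objective function of NMT is to maximize the log-likelihood of a set of source-target language sentence pairs as
$$\mathcal{L}(\theta) = \sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{C}} \log P(\mathbf{y} \mid \mathbf{x}; \theta) \tag{1}$$
where $\mathcal{C}$ is the training set and $\theta$ denotes the model parameters.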
As we implement our approach on the Transformer architecture Vaswani et al. (2017), which is a strong sentence-level NMT baseline, we give a brief description of the Transformer model. The encoder and decoder are composed of similar layers, each of which consists of two general types of sub-layers: a multi-head attention mechanism and a position-wise fully connected feed-forward network (FFN).
Encoder: The encoder of the Transformer is composed of $N$ identical layers. Each of these layers includes a multi-head self-attention mechanism that allows each position of the output of the previous encoder layer to attend to all other positions, as well as a position-wise FFN, stacked on top of the multi-head self-attention, which is composed of two linear transformations and a ReLU activation function.
Decoder: The architecture of the decoder is similar to that of the encoder; however, it employs an additional multi-head attention sub-layer over the encoder output, placed between the multi-head self-attention sub-layer and the position-wise FFN sub-layer. The multi-head self-attention sub-layer must mask future target tokens so that no position attends to subsequent positions.
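The masking of future target tokens can be illustrated with a minimal, dependency-free sketch. The helper names `causal_mask` and `masked_softmax` and the example scores are illustrative assumptions, not part of the Transformer reference implementation.

```python
import math

def causal_mask(length):
    """mask[i][j] is True when target position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]

def masked_softmax(scores, allowed):
    """Softmax over one row of attention scores; disallowed positions get weight 0."""
    kept = [s for s, ok in zip(scores, allowed) if ok]
    peak = max(kept)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(3)
# Hypothetical raw scores for target position 1; position 2 lies in the future.
weights = masked_softmax([0.5, 2.0, 1.0], mask[1])
```

In this sketch the future position receives exactly zero weight, so the remaining weights are renormalized over the visible prefix only.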
2.2 Document-level NMT
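The document-level translation task is to translate each source sentence with consideration of the previous context in the document. Formally, given a document containing $N$ sentence pairs, the source document $X = (\mathbf{x}^1, \ldots, \mathbf{x}^N)$ is provided in order and the translation system generates the translations $Y = (\mathbf{y}^1, \ldots, \mathbf{y}^N)$ in order. The document translation probability can be defined as:
$$P(Y \mid X) = \prod_{n=1}^{N} P(\mathbf{y}^n \mid \mathbf{x}^n, D_{<n}) \tag{2}$$
where $\mathbf{x}^n$ and $\mathbf{y}^n$ denote the $n$-th source and target sentence respectively, and $D_{<n}$ denotes all of the previous sentence pairs in the document. For each sentence $\mathbf{x}^n$, each target word is generated according to the source representation and the generated target hypothesis; therefore, Eq. (2) can be formulated as:
$$P(Y \mid X) = \prod_{n=1}^{N} \prod_{t=1}^{T_n} P(y^n_t \mid \mathbf{x}^n, \mathbf{y}^n_{<t}, D_{<n}) \tag{3}$$
where $y^n_t$ denotes the $t$-th word of the translation $\mathbf{y}^n$, whose length is $T_n$, and $\mathbf{y}^n_{<t}$ denotes the generated target hypothesis.
The training objective of document-level NMT is to maximize the log-likelihood of the translations in document context as follows:
$$\mathcal{L}(\theta) = \sum_{(X, Y) \in \mathcal{C}_D} \sum_{n=1}^{N} \log P(\mathbf{y}^n \mid \mathbf{x}^n, D_{<n}; \theta) \tag{4}$$
where $\mathcal{C}_D$ is the training set of the DocNMT.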
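As a small worked example of this factorization, the log-likelihood of a document is the sum, over its sentences and over the words of each sentence, of the log-probability of each target word. The sketch below assumes the per-word probabilities have already been produced by some context-aware model; the numbers and function names are hypothetical.

```python
import math

def sentence_log_prob(word_probs):
    """Sum of log-probabilities of the target words of one sentence."""
    return sum(math.log(p) for p in word_probs)

def document_log_prob(doc_word_probs):
    """Sum of sentence log-probabilities over all sentences of one document."""
    return sum(sentence_log_prob(sentence) for sentence in doc_word_probs)

# Hypothetical model outputs: a two-sentence document with 2 and 1 target words.
doc = [[0.5, 0.25], [0.5]]
log_likelihood = document_log_prob(doc)
```

Because the product of the word probabilities is 0.5 x 0.25 x 0.5 = 0.0625, the result equals log(0.0625); the training objective sums this quantity over all documents in the training set.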
Compared with the sentence-level NMTs, the critical part of document-level NMTs is to effectively capture and utilize the related contextual information when translating the to-be-translated source sentence.
3 Our Approach
In this section, we introduce the proposed Query-guided Capsule Network (QCN) for enhancing the document-level NMT. First, we present the overall architecture of the network, and then we describe the QCN in detail.
3.1 Overall Architecture
We aim to enhance document-level NMT performance by effectively capturing and employing the contextual features in each historical sentence that are related to the current sentence. We integrate a novel Query-guided Capsule Network into the sentence-level Transformer-based NMT model Vaswani et al. (2017) to capture the document-level contextual information for translating the current source sentence.
As shown in Figure 1, the overall architecture of our translation model is composed of three modules:
Query-guided Capsule Network: takes the to-be-translated source sentence as the query to guide the retrieval of related and helpful contextual features from historical sentences with a novel dynamic routing algorithm.
Sub-layer-expanded Transformer: contains a new sub-layer that attends to the contextual features extracted by the QCN so that they are used effectively for translation.
Regularization Layer: contains two conventional Capsule Networks that unify the source sentence and the target sentence into an identical semantic space by computing an extra PCCs loss term at the training stage.
3.2 Query-guided Capsule Network
The Capsule Network (CapsNet) Sabour et al. (2017) was proposed to build part-whole relationships through an iterative routing procedure, which can be used to capture features in historical sentences from low level to high level. Capsules in the lower layer vote for those in the higher layer by aggregating their transformations with iteratively updated coupling coefficients. However, directly applying the Capsule Network to document-level NMT for capturing contextual features has an obvious drawback: the CapsNet can only extract internal features, without considering whether these features are related to the to-be-translated source sentence.
To address this issue, we propose the Query-guided Capsule Network (QCN), which employs the representation of the to-be-translated source sentence as the query to guide the feature extraction procedure of the Capsule Network. In this way, the contextual features generated by the QCN are based on both the internal semantic relations within the historical sentences and the external semantic relations between the historical sentences and the to-be-translated source sentence. The QCN is based on an improved dynamic routing algorithm, which is introduced in detail in the next section.
Improved Dynamic Routing of QCN
Given a query vector $\mathbf{q}$ and a set of input capsules $\{\mathbf{u}_i\}$, the dynamic routing algorithm iteratively calculates the correlation between the query vector and each input capsule and updates the output capsules. Specifically, the query vector $\mathbf{q}$ is the representation $\mathbf{h}$ of the to-be-translated source sentence and the input capsules are all the word embeddings $\mathbf{e}_i$ in the previous sentences, which can be formulated as:
$$\mathbf{q} = f_q(\mathbf{h}), \qquad \mathbf{u}_i = f_u([\mathbf{e}_i; \mathbf{o}_k])$$
where $f_q$ and $f_u$ are both linear transformation functions and $k$ indicates the distance of the historical sentence from the to-be-translated sentence. Each word embedding in the historical sentences is concatenated with a distance-determined one-hot vector $\mathbf{o}_k$ to provide positional markers. Compared to the dynamic routing method proposed by Sabour et al. (2017), our improved dynamic routing method of QCN can model the information among the input capsules that is related to the query.
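The concatenation of each context word embedding with a distance-determined one-hot vector can be sketched as follows. This is only an illustration under stated assumptions: the function names, the toy two-dimensional embeddings, and the choice of `max_distance` are hypothetical, and the subsequent linear transformation is omitted.

```python
def distance_one_hot(distance, max_distance):
    """One-hot marker for a context sentence `distance` sentences before the current one."""
    if not 1 <= distance <= max_distance:
        raise ValueError("distance must lie in 1..max_distance")
    return [1.0 if k == distance else 0.0 for k in range(1, max_distance + 1)]

def mark_context_words(context, max_distance):
    """Append the distance marker to every word embedding of every context sentence.

    context[0] holds the word embeddings of the nearest previous sentence,
    context[1] those of the sentence before it, and so on.
    """
    marked = []
    for offset, sentence in enumerate(context):
        marker = distance_one_hot(offset + 1, max_distance)
        for embedding in sentence:
            marked.append(list(embedding) + marker)
    return marked

context = [
    [[0.1, 0.2], [0.3, 0.4]],  # distance 1: two words
    [[0.5, 0.6]],              # distance 2: one word
]
capsule_inputs = mark_context_words(context, max_distance=3)
```

Each resulting vector keeps the original embedding in its first dimensions and carries the sentence distance in the trailing one-hot dimensions, so words from different historical sentences remain distinguishable after they are pooled into one set of input capsules.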
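Algorithm 1 shows the details of the algorithm and Figure 2 illustrates the interaction of the various components in the QCN. Initially, the improved dynamic routing algorithm of QCN receives a sequence of lower-level capsules $\{\mathbf{u}_i\}$ and a query vector $\mathbf{q}$ and then calculates the Pearson Correlation Coefficients (PCCs) between $\mathbf{q}$ and each input capsule $\mathbf{u}_i$ (line 7). The PCC is a measure of the linear correlation between two variables. A value close to +1 indicates a very strong positive linear correlation, and a value close to -1 indicates a total negative correlation. Given a pair of variables $(\mathbf{a}, \mathbf{b})$, the formula for computing the PCC is
$$\rho_{\mathbf{a},\mathbf{b}} = \frac{\mathrm{cov}(\mathbf{a}, \mathbf{b})}{\sigma_{\mathbf{a}} \, \sigma_{\mathbf{b}}}$$
where $\mathbf{a}$ and $\mathbf{b}$ are both $d$-dimension vectors, $\mathrm{cov}(\mathbf{a}, \mathbf{b})$ is the covariance, and $\sigma_{\mathbf{a}}$ and $\sigma_{\mathbf{b}}$ are the standard deviations of $\mathbf{a}$ and $\mathbf{b}$ respectively.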
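The Pearson Correlation Coefficient between two vectors of equal dimension can be computed as in the following sketch. The function name `pearson` and the convention of returning 0.0 for a constant vector (where the coefficient is undefined) are assumptions made for illustration.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length vectors."""
    if len(a) != len(b) or not a:
        raise ValueError("vectors must be non-empty and of equal length")
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    # The 1/n factors of the covariance and of the standard deviations cancel.
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    denom = math.sqrt(var_a * var_b)
    return cov / denom if denom > 0 else 0.0  # assumed convention for constant input

positive = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
negative = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

Here `positive` is a perfectly correlated pair and `negative` a perfectly anti-correlated pair, which correspond to the two extremes of the coefficient described above.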
The routing iteration process then computes the coupling coefficients, denoted as $c_{ij}$ (line 16), between an input capsule $\mathbf{u}_i$ and all the higher-level capsules $\mathbf{v}_j$. In the original dynamic routing algorithm Sabour et al. (2017), the coupling coefficients are determined only by the cumulative “agreement” $b_{ij}$, the log prior probabilities that capsule $i$ should be coupled to capsule $j$. However, in the DocNMT setting, a naive “agreement” is far from enough to cluster, from the lower-level capsules, the high-level information that is related to the query vector. To address this issue, we reduce the “agreement” when the PCC shows a negative linear correlation between the query vector $\mathbf{q}$ and the input vector $\mathbf{u}_i$, and increase it when the PCC is positive (line 27). The query vector is initially tiled to the number of higher-level capsules and is updated with the corresponding higher-level capsule in each iteration (line 29).
Our routing iteration updates higher-level capsules by adding to . This step can add more information related to the query vector and cut off the unrelated features (line 20). The lengths of the output capsules must then be shrunk into the interval between 0 and 1 using the “squash” function (line 21) proposed by Sabour et al. (2017), which is shown in Eq. (8):
$$\mathrm{squash}(\mathbf{s}) = \frac{\|\mathbf{s}\|^2}{1 + \|\mathbf{s}\|^2} \, \frac{\mathbf{s}}{\|\mathbf{s}\|} \tag{8}$$
where $\mathbf{s}$ can be the initial input capsule or the vector for predicting the output capsule.
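To make the interplay of "agreement", PCCs, query tiling and squashing more concrete, the following dependency-free sketch implements one plausible reading of a query-guided routing loop. It is not a reproduction of Algorithm 1: the way the PCC is added to the agreement, the way each tiled query is updated, the per-capsule weight matrices `weights`, and all function names are assumptions made for illustration.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    denom = math.sqrt(var_a * var_b)
    return cov / denom if denom > 0 else 0.0

def squash(s):
    """Shrink the length of s into [0, 1) while keeping its direction."""
    norm_sq = sum(x * x for x in s)
    if norm_sq == 0.0:
        return [0.0] * len(s)
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in s]

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def query_guided_routing(query, inputs, weights, iterations=3):
    """Return (output capsules, coupling coefficients); requires iterations >= 1.

    query:   list of d floats
    inputs:  lower-level capsules, each a list of d floats
    weights: one d x d matrix per higher-level capsule (hypothetical parameters)
    """
    num_out, dim = len(weights), len(query)
    # votes[i][j]: prediction of input capsule i for higher-level capsule j.
    votes = [[matvec(w, u) for w in weights] for u in inputs]
    logits = [[0.0] * num_out for _ in inputs]       # cumulative "agreement"
    queries = [list(query) for _ in range(num_out)]  # query tiled per output capsule
    for _ in range(iterations):
        coupling = [softmax(row) for row in logits]
        outputs = []
        for j in range(num_out):
            s = [sum(c[j] * v[j][k] for c, v in zip(coupling, votes)) for k in range(dim)]
            outputs.append(squash(s))
        for i, u in enumerate(inputs):
            for j in range(num_out):
                agreement = sum(a * b for a, b in zip(votes[i][j], outputs[j]))
                # Assumed combination: a positive PCC raises the logit, a negative one lowers it.
                logits[i][j] += agreement + pearson(queries[j], u)
        # Assumed update: each tiled query absorbs its higher-level capsule.
        queries = [[q + v for q, v in zip(qs, vs)] for qs, vs in zip(queries, outputs)]
    return outputs, coupling

random.seed(0)
dim, num_out = 3, 2
weights = [[[random.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
           for _ in range(num_out)]
inputs = [[0.2, 0.5, 0.9], [0.9, 0.1, 0.3], [0.4, 0.8, 0.1], [0.7, 0.6, 0.2]]
outputs, coupling = query_guided_routing([0.1, 0.4, 0.8], inputs, weights)
```

In this sketch, the PCC term is identical for all higher-level capsules in the first iteration because the tiled queries are still equal, so it only starts to influence the coupling coefficients once the queries have been updated with their respective higher-level capsules.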
3.3 Sub-layer-expanded Transformer
To effectively utilize the contextual features extracted from each historical sentence by the QCN, we introduce an additional context-aware multi-head attention sub-layer in each layer of the Transformer encoder. This multi-head attention can attend to all the positions of the contextual features, using the outputs of the previous sub-layer as the query. The output of the context-aware multi-head attention sub-layer is then fed into the position-wise feed-forward sub-layer in each layer of the encoder. The right part of Figure 1 shows the details of the Sub-layer-expanded Transformer. Specifically, multi-head attention is computed as follows:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$$
where $W_i^Q$, $W_i^K$, $W_i^V$ and $W^O$ are the parameter matrices, and $Q$, $K$ and $V$ indicate the query, key and value representations. The attention function is computed as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$
where $d_k$ indicates the dimension of the queries and keys.
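The attention function used by the context-aware sub-layer is standard scaled dot-product attention, which can be sketched for a single head as follows. In the setting described above, the queries would come from the previous sub-layer and the keys and values from the contextual features of the QCN; the toy vectors and the function names below are hypothetical, and the learned projections and the multi-head concatenation are omitted.

```python
import math

def softmax(xs):
    peak = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Single-head attention over lists of vectors; returns one output per query."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(len(values[0]))])
    return outputs

# A zero query scores every key equally, so the output is the mean of the values.
context_keys = [[1.0, 0.0], [0.0, 1.0]]
context_values = [[2.0, 0.0], [0.0, 4.0]]
attended = scaled_dot_product_attention([[0.0, 0.0]], context_keys, context_values)
```

The division by the square root of the key dimension keeps the scores in a range where the softmax does not saturate as the dimension grows.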
3.4 Regularization Layer
To better model the document-level translation task, we incorporate a regularization layer (Figure 3) into the whole architecture to restrict the source sentence and the target sentence to an identical semantic space. This layer separately feeds the inputs of the encoder and the decoder into two capsule networks, one for the source side and one for the target side, and computes the PCCs between the outputs of the two networks. We regard the PCCs as an extra regularization term in the final objective at the training stage. The loss function of our model can be formulated as:
where are parameters of the model, are historical sentences of the to-be-translated source sentence, is the to-be-translated sentence and denotes the generated target hypothesis.
Table 1: Corpora statistics. For each of the three datasets, the table reports the number of sentences (Sent No.) and the average document length (Doc len avg).
Datasets and Evaluation Metrics
We carry out experiments on English-German translation tasks in three different domains: talks, news, and speeches. The corpora statistics are shown in Table 1.
News. We take the sentence-aligned document-delimited News Commentary v11 corpus (http://www.casmacat.eu/corpus/news-commentary.html) as our training set. The WMT’16 news-test2015 and news-test2016 sets are used for development and testing respectively.
We download all of the above extracted corpora (https://github.com/sameenmaruf/selective-attn/tree/master/data) from Maruf et al. (2019). Tokenization and truecasing are applied to all datasets using the scripts of the Moses Toolkit (https://github.com/moses-smt/mosesdecoder/tree/master/scripts) Koehn et al. (2007). We also apply segmentation into BPE subword units (https://pypi.org/project/subword-nmt/) Sennrich et al. (2016) with 30K merge operations.
Models and Baselines
We configure our models according to the settings of Maruf and Haffari (2018). Specifically, for the Transformer, we set the hidden size and the position-wise FFN size to 512 and 2048 respectively. We use 4 layers and 8 attention heads in both the encoder and the decoder. All dropout rates are set to 0.1 for the context-agnostic model and 0.2 for the context-aware model.
In the training phase, we use the default Adam optimizer Kingma and Ba (2014) with a fixed learning rate of 0.0001. The batch size is 1500 on the TED dataset and 900 on both the News and Europarl datasets.
4.2 Results and Analysis
Table 2 shows that our model surpasses all the context-agnostic Vaswani et al. (2017) and context-aware Zhang et al. (2018); Miculicich et al. (2018); Maruf and Haffari (2018) baselines on the TED and Europarl datasets. On the TED dataset, our model greatly exceeds all other baselines and is better than Miculicich et al. (2018) by +0.59 BLEU and +0.61 Meteor. On the Europarl dataset, our model improves BLEU by +0.07, while its Meteor score is +0.64 higher than that of Maruf et al. (2019), who utilize the whole document as contextual information, whereas we use only 3 previous sentences.
Results on the sentence-level Transformer and our DocNMT models show that the captured contextual features provide helpful semantic information for enhancing translation quality. The regularization term that we propose further improves model performance on the TED and Europarl datasets. Owing to GPU memory limits, we have to filter out long sentences to keep our model running. Although this hurts model performance on the News dataset, which contains many long sentences, the QCN module and the regularization term still bring improvements.
Effect of Contextual Information Scope
To investigate the effect of the contextual information scope, we run experiments on the TED talk dataset that vary the number of historical sentences. We fix the hyper-parameters of the QCN by setting both the number of higher-level capsules and the number of routing iterations to four, and investigate the impact of the number of historical sentences on the BLEU and Meteor scores. Figure 4 shows that using one historical sentence in the QCN obtains the best Meteor score, while the highest BLEU score is reached with two historical sentences. We found that the Meteor and BLEU scores move in opposite directions. Therefore, the choice of how many historical sentences to use is a trade-off between the two scores. We choose three historical sentences as our final setting. Through experimentation, we also found that using more historical sentences does not necessarily yield better translation performance.
Effect of Feature Capsule Number
QCN is the crucial part of our overall architecture, and whether its impact is positive or negative depends on its configuration. Therefore, we also investigate the effect of one hyper-parameter of the QCN: the number of feature capsules.
We set the number of historical sentences to 3 according to the previous experimental results, and the number of routing iterations to 4. Figure 5 shows that the Meteor score is highest when the number of higher-level capsules is set to 2, while the BLEU score is best at 4. We choose 4 as the final setting because both BLEU and Meteor obtain relatively good results.
Visualization of Agreement and PCCs
Coupling Coefficients (CCs) can indirectly reflect the variation of the “agreement”, so we visualize the coefficients in each routing iteration at the decoding stage, as shown in Figure 6. In the first iteration, all coupling coefficients are initialized to a uniform distribution, so all higher-level capsules are voted for equally by the lower-level capsules. Then, the lower-level capsules are iteratively updated to send more information to the proper higher-level capsule. Figure 6 shows that most of the input capsules eventually tend to vote for the feature capsule.
Different from the CCs, the PCCs show the linear correlation between the query and the inputs. Initially, the query vector is tiled to the number of feature capsules and is used to calculate the PCCs with each input capsule; accordingly, Figure 7 shows that the color of every column is identical. Once the iterative routing begins, each query is updated according to the higher-level capsules. The function of the PCCs is to increase or decrease the coupling coefficient values according to whether the PCC is positive or negative. Figure 7 shows that the PCCs vary across iterations.
5 Related Work
Document-level Machine Translation
Document-level machine translation became a hot research direction in the later stage of the statistical machine translation (SMT) era. Hardmeier and Federico (2010) represented the links between word pairs in the context using a word dependency model for SMT to improve the translation of anaphoric pronouns. Hardmeier et al. (2012, 2013) first proposed a new document-level SMT paradigm that translates whole documents as units. However, most work in this period did not achieve compelling results or focused on only part of the difficulties.
With the arrival of the Neural Machine Translation era, many studies began to focus on document-level NMT. Xiong et al. (2019) trained a reward teacher to refine the translation quality from a document perspective. Tiedemann and Scherrer (2017) simply concatenated sentences in one document as the model’s input or output. Jean et al. (2017) used an additional context encoder to capture larger-context information. Kuang et al. (2017); Tu et al. (2018) used a cache to memorize the most relevant words or features in previous sentences or translations.
Recently, several studies integrated additional modules into Transformer-based NMT models for modeling contextual information Voita et al. (2018); Zhang et al. (2018). Maruf and Haffari (2018) proposed a document-level NMT model using memory networks, and Wang et al. (2017) and Miculicich et al. (2018) integrated hierarchical attention networks into RNN-based NMT or the Transformer to model document-level information. Maruf et al. (2019) used the whole document as the contextual information and were the first to divide document-level translation tasks into two types: offline and online.
Hinton et al. (2011) proposed the concept of a capsule, which uses a vector to describe the pose of an object. The dynamic routing algorithm was proposed by Sabour et al. (2017) to build the part-whole relationship through an iterative routing procedure. Hinton et al. (2018) designed a new routing style based on the EM algorithm. Some researchers have investigated applying the capsule network to various tasks. Wang et al. (2018) investigated a novel capsule network with dynamic routing for linear time NMT. Yang et al. (2018) explored capsule networks for text classification with strategies to stabilize the dynamic routing process. Gu and Feng (2019) introduced capsule networks into the Transformer to model the relations between different heads in multi-head attention. We specifically investigate dynamic routing algorithms for document-level NMT.
We have proposed a novel Query-guided Capsule Network with an improved dynamic routing algorithm to enhance context modeling for document-level neural machine translation. Experiments on English-German translation in different domains showed that our model significantly outperforms sentence-level NMT models and achieves state-of-the-art performance on two of three datasets, which demonstrates the effectiveness of our approach.
We thank the anonymous reviewers for their insightful comments. This work was supported by the National Natural Science Foundation of China (NO.61662077, NO.61876174) and National Key R&D Program of China (NO.2017YFE9132900).
- Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- WIT3: web inventory of transcribed and translated talks. In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT), Trento, Italy, pp. 261–268.
- Improving multi-head attention with capsule networks. In Proceedings of the 8th CCF International Conference on Natural Language Processing and Chinese Computing.
- Modelling pronominal anaphora in statistical machine translation. In IWSLT (International Workshop on Spoken Language Translation), Paris, France, December 2nd and 3rd, 2010, pp. 283–289.
- Document-wide decoding for phrase-based statistical machine translation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Jeju Island, Korea, pp. 1179–1190.
- Docent: a document-level decoder for phrase-based statistical machine translation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Sofia, Bulgaria, pp. 193–198.
- Discourse in statistical machine translation. A survey and a case study. Discours. Revue de linguistique, psycholinguistique et informatique. A journal of linguistics, psycholinguistics and computational linguistics (11).
- Transforming auto-encoders. In International Conference on Artificial Neural Networks, pp. 44–51.
- Matrix capsules with EM routing. In International Conference on Learning Representations.
- Does neural machine translation benefit from larger context? CoRR abs/1704.05135.
- Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Moses: open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pp. 177–180.
- Europarl: a parallel corpus for statistical machine translation. In MT Summit, Vol. 5, pp. 79–86.
- Cache-based document-level neural machine translation. arXiv preprint arXiv:1711.11221.
- METEOR: an automatic metric for MT evaluation with high levels of correlation with human judgments. In Proceedings of the Second Workshop on Statistical Machine Translation, pp. 228–231.
- Document context neural machine translation with memory networks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 1275–1284.
- Selective attention for context-aware neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, Minnesota.
- DTMT: a novel deep transition architecture for neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 224–231.
- Implicitation of discourse connectives in (machine) translation. In Proceedings of the Workshop on Discourse in Machine Translation, pp. 19–26.
- Document-level neural machine translation with hierarchical attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 2947–2954.
- BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318.
- Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pp. 3856–3866.
- Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1715–1725.
- On integrating discourse in machine translation. In Proceedings of the Third Workshop on Discourse in Machine Translation, Copenhagen, Denmark, pp. 110–121.
- Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 3104–3112.
- Neural machine translation with extended context. In Proceedings of the Third Workshop on Discourse in Machine Translation, Copenhagen, Denmark, pp. 82–92.
- Learning to remember translation history with a continuous cache. Transactions of the Association for Computational Linguistics 6, pp. 407–420.
- Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
- Context-aware neural machine translation learns anaphora resolution. In ACL.
- Exploiting cross-sentence context for neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 2826–2831.
- Towards linear time neural machine translation with capsule networks. CoRR abs/1811.00287.
- Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
- MCapsNet: capsule network for text with multi-task learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4565–4574.
- Modeling coherence for discourse neural machine translation. CoRR abs/1811.05683.
- Investigating capsule networks with dynamic routing for text classification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 3110–3119.
- Improving the transformer translation model with document-level context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 533–542.
- Bridging the gap between training and inference for neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4334–4343.
- Refining source representations with relation networks for neural machine translation. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 1292–1303.