Neural machine translation (NMT), which leverages neural networks to map between natural languages, has made remarkable progress in the past several years [22, 2, 24]. Capable of learning representations from data, NMT has achieved significant improvements over conventional statistical machine translation (SMT) and become the new de facto paradigm in the machine translation community.
Despite its success, NMT suffers from a major drawback: there is no alignment to explicitly indicate the correspondence between the input and the output. As all internal information of an NMT model is represented as real-valued vectors or matrices, it is hard to associate a source word with its translational equivalents on the target side. Although the attention weights between the input and the output are available in the RNNsearch model, these weights only reflect relevance rather than translational equivalence. To aggravate the situation, attention weights between the input and the output are even unavailable in modern NMT models such as Transformer.
The lack of alignment in NMT leads to at least three problems. First, it is difficult to interpret the translation process of NMT models without alignment. In conventional SMT, the translation process can be seen as a sequence of interpretable decisions, in which alignment plays a central role. It is hard to include such interpretable decisions in NMT models without access to alignment. Although visualization tools such as layer-wise relevance propagation can be used to measure the relevance between two arbitrary neurons in NMT models, the hidden states in neural networks still do not have clear connections to interpretable language structures.
Second, it is difficult to impose lexical constraints on NMT systems without alignment. For example, given the English sentence "American poet Edgar Allan Poe", one may require that the English phrase "Edgar Allan Poe" must be translated by NMT systems as the Chinese word "ailunpo". Such lexical constraints are important for both automatic MT and interactive MT. In automatic MT, it is desirable to incorporate the translations of infrequent numbers, named entities, and technical terms into NMT systems. In interactive MT, human experts expect that the NMT system can be controlled to include specified translations in the system output. Although Hokamp:17 (Hokamp:17) and Post:18 (Post:18) provide solutions to impose lexical constraints, their methods can only ensure that the specified target words or phrases will appear in the system output. As a result, ignoring the alignment to the source side may degrade the adequacy of the system output (see Table 3).
Third, it is difficult to impose structural constraints on NMT systems without alignment. Figure 1 shows an example of a webpage and its HTML code. Unlike lexical constraints, structural constraints require that source strings enclosed in paired HTML tags must be translated as single units and that the translations must be enclosed in the same paired HTML tags. For example, the Chinese translation of "<a>The Raven</a>" should be "<a>wuya</a>". It is challenging for NMT models trained on plain text to translate such structured text. Removing the HTML tags before translation and inserting them back after translation maintains translation quality but often violates structural constraints [26, 10], while translating only the plain text within tags and concatenating the translations and tags monotonically strictly conforms to structural constraints but impairs translation quality.
In this work, we propose to introduce phrase alignment into the translation process of arbitrary NMT models. The basic idea is to develop an NMT model that treats phrase alignment as a latent variable. During decoding, the NMT model is used to score a search space similar to that of conventional phrase-based SMT, in which phrase alignment is readily available. While the use of the trained NMT model retains the capabilities of NMT in learning representations from data and capturing non-local dependencies, the availability of phrase alignment makes it possible to include interpretable decisions in the translation process. Also thanks to the availability of phrase alignment, we design a new decoding algorithm that applies to unconstrained, lexically constrained, and structurally constrained translation tasks. Experiments show that the use of the phrase-based search space does not hurt the translation performance of NMT models on the unconstrained translation task. Moreover, our approach significantly improves over state-of-the-art methods on the lexically and structurally constrained translation tasks.
Our work is related to three lines of research: (1) interpreting NMT, (2) constrained decoding for NMT, and (3) combining SMT and NMT.
Interpreting NMT
Our work is related to attempts at interpreting NMT [6, 15]. Ding:17 (Ding:17) calculate the relevance between source and target words with layer-wise relevance propagation. Such relevance measures the contribution of each source word to each target word rather than the translational equivalence between source and target words. li:19 (li:19) predict alignment with an external alignment model trained on the output of a statistical word aligner and use prediction differences to quantify the relevance between source and target words. However, their external alignment model is not identical to the alignment used in the translation process. Our approach differs from prior studies by introducing explicit phrase alignment into the translation process of NMT models, which makes each step in generating a target sentence interpretable to human experts.
Constrained Decoding for NMT
Our work is also closely related to imposing lexical constraints on the decoding process of NMT [8, 3, 19]. Hokamp:17 (Hokamp:17) propose a lexically constrained decoding algorithm for NMT. Their approach can ensure that pre-specified target strings will appear in the system output. Post:18 (Post:18) improve the efficiency of lexically constrained decoding by introducing dynamic beam allocation. One drawback of the two methods is that they cannot impose lexical constraints on the source side due to the lack of alignment. Chatterjee:17 (Chatterjee:17) and hasler:18 (hasler:18) rely on the attention weights in the RNNsearch model to impose source-aware lexical constraints with guided beam search. However, their methods cannot be applied to Transformer. With translation options, it is easy to impose source-aware lexical constraints using our approach for arbitrary NMT models.
The direction of imposing structural constraints remains largely unexplored, especially for NMT. Most prior studies have focused on SMT. Although the ideal solution is to directly train NMT models on parallel corpora of structured text [7, 9, 23], such labeled datasets are hard to construct and remain limited in quantity. Therefore, a more practical solution is to use off-the-shelf MT systems tailored for unstructured text to translate structured text [1, 26, 10]. However, these approaches risk performance degradation or failure to impose structural constraints correctly. Our work proposes a structurally constrained decoding algorithm for NMT that preserves structural constraints without sacrificing translation quality.
Combining SMT and NMT
Several authors have endeavored to combine the merits of SMT and NMT [21, 11, 5]. Stahlberg:16 (Stahlberg:16) propose to use the lattice output by SMT as the search space of NMT. The major difference is that our work allows for both source word omission and target word insertion, which seem to be helpful in reducing the gap between phrase-based and neural search spaces. In this work, we only use NMT models to score the translations in a phrase-based space. It is possible to further exploit SMT features as suggested by Dahlmann:17 (Dahlmann:17).
Our work aims to introduce phrase alignment into the translation process of arbitrary NMT models. Figure 2 illustrates the central idea of our approach. During decoding, the target sentence and phrase alignment are generated simultaneously. As the target sentence grows from left to right, it is easy to apply arbitrary NMT models to calculate translation probabilities in an incremental way. A key difference of our approach from conventional phrase-based SMT is that unaligned source and target phrases are allowed, which reduces the discrepancy between the search spaces of SMT and NMT models. For example, Figure 2(b) uses an unaligned target phrase (i.e., "de") and Figure 2(e) uses an unaligned source phrase (i.e., "Edgar"). With access to phrase alignment, we develop a decoding algorithm that is capable of preserving lexical and structural constraints without sacrificing translation quality.
Let $\mathbf{x} = x_1, \dots, x_I$ be a source sentence and $\mathbf{y} = y_1, \dots, y_J$ be a target sentence. We use $x_0$ to denote an empty source word that connects to all unaligned target phrases and $y_0$ to denote an empty target word that connects to all unaligned source phrases. We use $\mathbf{a}$ to denote the phrase alignment between the source and target sentences. Each link $a = (p, q, s, t)$ is a 4-tuple, where $p$ is the beginning position of the source phrase, $q$ is the ending position of the source phrase, $s$ is the beginning position of the target phrase, and $t$ is the ending position of the target phrase. The phrase alignment in Figure 2, for example, comprises five links. For convenience, we use $x_p^q$ to denote the source phrase spanning from $p$ to $q$ and $y_s^t$ to denote the target phrase spanning from $s$ to $t$. For example, in Figure 2 one source phrase $x_p^q$ is "Allan Poe" and its aligned target phrase $y_s^t$ is "alunpo".
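The link representation above can be sketched in code. This is an illustrative sketch, not the paper's implementation; the names `AlignLink`, `source_phrase`, and `target_phrase` are our own, and the example link positions are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AlignLink:
    """A link is a 4-tuple: source span [p, q] aligned to target span [s, t]."""
    p: int  # beginning position of the source phrase (1-indexed)
    q: int  # ending position of the source phrase
    s: int  # beginning position of the target phrase
    t: int  # ending position of the target phrase

def source_phrase(x, link):
    """Return the source phrase x_p..x_q (1-indexed, inclusive)."""
    return x[link.p - 1 : link.q]

def target_phrase(y, link):
    """Return the target phrase y_s..y_t (1-indexed, inclusive)."""
    return y[link.s - 1 : link.t]

x = "American poet Edgar Allan Poe".split()
y = "meiguo shiren ailunpo".split()  # illustrative target sentence
link = AlignLink(p=4, q=5, s=3, t=3)  # "Allan Poe" aligned to "ailunpo"
```

A link whose source span is $(0, 0)$ or whose target span is $(0, 0)$ would model the empty source and target words.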
More formally, our approach is based on a latent variable model given by

$$P(\mathbf{y}|\mathbf{x}; \boldsymbol{\theta}) = \sum_{\mathbf{a}} P(\mathbf{y}, \mathbf{a}|\mathbf{x}; \boldsymbol{\theta}), \qquad (1)$$

where $\boldsymbol{\theta}$ is a set of model parameters.
The probability of generating the target sentence and phrase alignment given the source sentence can be further factored as

$$P(\mathbf{y}, \mathbf{a}|\mathbf{x}; \boldsymbol{\theta}) = \prod_{k=1}^{K} P(a_k|\mathbf{x}, \mathbf{a}_{<k}, \mathbf{y}_{<k}; \boldsymbol{\theta})\, P(y_{s_k}^{t_k}|\mathbf{x}, \mathbf{a}_{\le k}, \mathbf{y}_{<k}; \boldsymbol{\theta}), \qquad (2)$$

where $P(a_k|\mathbf{x}, \mathbf{a}_{<k}, \mathbf{y}_{<k}; \boldsymbol{\theta})$ is a phrase alignment model and $P(y_{s_k}^{t_k}|\mathbf{x}, \mathbf{a}_{\le k}, \mathbf{y}_{<k}; \boldsymbol{\theta})$ is a phrase translation model. Note that $\mathbf{a}_{<k} = a_1, \dots, a_{k-1}$ is a partial phrase alignment. As it is challenging to estimate the phrase alignment model from data due to the exponential search space of phrase alignments, we assume that the alignment model has a uniform distribution for simplicity and leave the learning of the alignment model for future work.
We distinguish between two kinds of phrase translation models: non-empty and empty. For non-empty target phrases, the phrase translation probability can be decomposed as a product of word-level translation probabilities:

$$P(y_{s_k}^{t_k}|\mathbf{x}, \mathbf{a}_{\le k}, \mathbf{y}_{<k}; \boldsymbol{\theta}_{ne}) = \prod_{j=s_k}^{t_k} P(y_j|\mathbf{x}, \mathbf{y}_{<j}; \boldsymbol{\theta}_{ne}), \qquad (3)$$

where $y_j$ is the $j$-th word in the target phrase and $\boldsymbol{\theta}_{ne}$ denotes the set of model parameters related to non-empty phrases. Note that the word-level translation probabilities in Eq. (3) can be easily calculated by arbitrary NMT models.
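The incremental scoring in Eq. (3) can be sketched as follows. This is a minimal illustration in which `nmt_word_prob` is a hypothetical stand-in for the real NMT model's next-word distribution $P(y_j|\mathbf{x}, \mathbf{y}_{<j})$.

```python
import math

def phrase_log_prob(nmt_word_prob, src, prefix, phrase):
    """Sum log P(y_j | x, y_<j) over the words of a candidate target phrase,
    extending the target prefix one word at a time."""
    logp = 0.0
    context = list(prefix)
    for word in phrase:
        logp += math.log(nmt_word_prob(src, context, word))
        context.append(word)  # the prefix grows left to right
    return logp
```

Because the target sentence grows left to right, any autoregressive model that exposes next-word probabilities can plug in for `nmt_word_prob` without modification.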
For an empty target phrase, we define the phrase translation probability as

$$P(y_0|x_p^q, \mathbf{c}; \boldsymbol{\theta}_{e}), \qquad (4)$$

where $x_p^q$ is the source phrase aligned to $y_0$, $\mathbf{c}$ is the surrounding context on the source side, and $\boldsymbol{\theta}_{e}$ is the set of model parameters related to empty phrases. For simplicity, we restrict the unaligned source phrase to be a single source word.
We use a self-attention based encoder to model the translation probability of empty target phrases. The encoder takes the unaligned source word and its surrounding context as input and outputs the probability of omitting that word.
Given a parallel corpus $D = \{\langle \mathbf{x}^{(n)}, \mathbf{y}^{(n)} \rangle\}_{n=1}^{N}$, the standard training objective is to maximize the log-likelihood of the training data:

$$\hat{\boldsymbol{\theta}} = \mathop{\arg\max}_{\boldsymbol{\theta}} \sum_{n=1}^{N} \log P(\mathbf{y}^{(n)}|\mathbf{x}^{(n)}; \boldsymbol{\theta}). \qquad (5)$$
As training the latent-variable model requires enumerating all possible phrase alignments, it is impractical to directly estimate $\boldsymbol{\theta}_{ne}$ and $\boldsymbol{\theta}_{e}$ jointly. Instead, we propose to train the two models separately. For the non-empty translation model in Eq. (3), the training objective is given by

$$\hat{\boldsymbol{\theta}}_{ne} = \mathop{\arg\max}_{\boldsymbol{\theta}_{ne}} \sum_{n=1}^{N} \log P(\mathbf{y}^{(n)}|\mathbf{x}^{(n)}; \boldsymbol{\theta}_{ne}). \qquad (6)$$
For the empty translation model in Eq. (4), we can use an external word alignment tool to generate word alignments for the parallel corpus $D$. It is then easy to decide whether a source word is unaligned based on these word alignments. As a result, the training objective for the empty translation model is given by

$$\hat{\boldsymbol{\theta}}_{e} = \mathop{\arg\min}_{\boldsymbol{\theta}_{e}} \sum_{n=1}^{N} \ell(\mathbf{u}^{(n)}, \mathbf{x}^{(n)}; \boldsymbol{\theta}_{e}), \qquad (7)$$

where $\mathbf{u}^{(n)}$ is an indicator vector corresponding to the $n$-th source sentence that indicates whether each source word is unaligned and $\ell(\cdot)$ is the cross-entropy loss defined as

$$\ell(\mathbf{u}, \mathbf{x}; \boldsymbol{\theta}_{e}) = -\sum_{i=1}^{I} \Big[ u_i \log P(y_0|x_i, \mathbf{c}_i; \boldsymbol{\theta}_{e}) + (1 - u_i) \log \big(1 - P(y_0|x_i, \mathbf{c}_i; \boldsymbol{\theta}_{e})\big) \Big]. \qquad (8)$$
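The cross-entropy objective above is a per-word binary classification loss and can be sketched as follows. `predict_omit` is a hypothetical stand-in for the self-attention encoder's omission probability; `u` is the indicator vector derived from the external word aligner.

```python
import math

def empty_model_loss(predict_omit, sentence, u):
    """Binary cross-entropy between predicted omission probabilities and the
    gold indicators u (u[i] = 1 iff source word i is unaligned)."""
    loss = 0.0
    for i, _word in enumerate(sentence):
        p = predict_omit(sentence, i)  # P(word i is unaligned | context)
        loss -= u[i] * math.log(p) + (1 - u[i]) * math.log(1 - p)
    return loss
```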
Given the learned model parameters $\hat{\boldsymbol{\theta}}$ and an unseen source sentence $\mathbf{x}$, our goal is to find the target sentence and phrase alignment with the highest probability that do not violate the pre-specified constraints:

$$\langle \hat{\mathbf{y}}, \hat{\mathbf{a}} \rangle = \mathop{\arg\max}_{\mathbf{y}, \mathbf{a}} \Big\{ P(\mathbf{y}, \mathbf{a}|\mathbf{x}; \hat{\boldsymbol{\theta}}) \times \delta(\mathbf{x}, \mathbf{y}, \mathbf{a}, C) \Big\}, \qquad (9)$$

where $\delta(\mathbf{x}, \mathbf{y}, \mathbf{a}, C)$ is a function that checks whether the resulting translation and alignment conform to a set of pre-specified constraints $C$. The function returns 1 if all constraints are satisfied and 0 otherwise.
As it is computationally expensive to enumerate all possible phrases and alignments during decoding, we resort to an external bilingual phrase table to restrict the search space. Before decoding, the candidate translations of each source phrase, usually referred to as translation options, are collected by matching the phrase table against the input sentence. Note that unlike Koehn:03 (Koehn:03), our approach allows a source phrase or a target phrase to be unaligned.
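Option collection amounts to matching every source span against the phrase table. A minimal sketch, assuming a hypothetical table format that maps source-phrase tuples to lists of candidate target phrases:

```python
def collect_options(src, phrase_table, max_len=4):
    """Match every source span of length <= max_len against the phrase table.
    Returns {(p, q): [candidate target phrases]} with half-open spans [p, q)."""
    options = {}
    for p in range(len(src)):
        for q in range(p + 1, min(p + max_len, len(src)) + 1):
            phrase = tuple(src[p:q])
            if phrase in phrase_table:
                options[(p, q)] = phrase_table[phrase]
    return options
```

Unaligned phrases fit the same structure: an empty source span can carry insertable target words, and a source span can carry an empty target candidate.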
It is easy for our approach to impose lexical constraints during the option collection process: we simply replace the translation options of a pre-specified source phrase with the pre-specified target phrase. To achieve this, we require that (1) the pre-specified source phrase must be translated as a contiguous segment and (2) its translation options do not overlap with those of other words. To impose structural constraints, we require that the translation options within a pair of HTML tags do not intersect with those outside.
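The restriction above can be sketched as a filter over the collected options. This is an illustrative sketch over a hypothetical `{(p, q): candidates}` dictionary with half-open spans, not the paper's implementation:

```python
def apply_lexical_constraint(options, span, target):
    """Force source span [p, q) to translate as `target`: drop every option
    overlapping the constrained span, then install the constrained translation."""
    p, q = span
    pruned = {
        (a, b): cands
        for (a, b), cands in options.items()
        if b <= p or a >= q  # keep only options disjoint from [p, q)
    }
    pruned[span] = [target]
    return pruned
```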
As unconstrained decoding is a special case of structurally constrained decoding and lexically constrained decoding can be achieved by restricting translation options, we focus on describing the structurally constrained decoding algorithm. We use a deductive system to formally describe the decoding process. An item in the deductive system is a 4-tuple defined as follows (as it is easy to obtain phrase alignment during the decoding process, we omit it in the item for simplicity):
Source sentence: To capture structural constraints, we add open constraint tags (e.g., "<c1>" and "<c2>") and close constraint tags (e.g., "</c1>" and "</c2>") to the input, as shown in Figure 3. Note that sentence boundaries can also be seen as constraints.
Coverage vector: A vector that consists of 0's and 1's to indicate which source words have been covered. The coverage vector is initialized with all 0's.
Stack : A stack that stores constraint tags. The decoding algorithm uses the stack to preserve structural constraints.
Translation : Partial translation generated during the decoding process.
Each item is associated with a log probability yielded by our model. Note that a translation option can also be represented as an item: except for the positions covered by the source phrase, all other positions in the coverage vector are set to 0, and the translation is simply the target phrase. The log probability of a translation option is set to 0.
As shown in Figure 4, the deductive system comprises three inference rules:
Translate: Translate a source phrase using a translation option. In Figure 4, one antecedent is the current item and the other is a translation option. This rule is activated in two cases: the translation option covers an uncovered source phrase within the constraint at the top of the stack, or the source phrase is empty. (By "within the constraint", we mean that the constraint is the innermost one that encloses a token. For example, in Figure 3, "Edgar" is within the inner constraint <c2> rather than the outer constraint <c1>.)
Push: Push a constraint tag onto the stack. The algorithm constructs a special translation option for each constraint tag. For an open constraint tag "<c>", this rule is activated when all source words within the constraint are uncovered and the algorithm starts to translate a source phrase within the constraint. For a close constraint tag "</c>", this rule is activated when all source words within the constraint are covered.
Pop: Pop the top two constraint tags from the stack. This rule is activated if the top two elements in the stack are paired open and close tags (e.g., "<c1>" and "</c1>").
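The activation conditions of the Push and Pop rules can be sketched as simple checks over the coverage vector and the tag stack. This is a simplified illustration with hypothetical names; real items also carry the partial translation and the model score.

```python
def can_push_open(tag, coverage, spans):
    """An open tag <c> may be pushed only if no source word inside its
    constraint span [lo, hi) is covered yet."""
    lo, hi = spans[tag]
    return not any(coverage[lo:hi])

def can_push_close(tag, coverage, spans):
    """A close tag </c> may be pushed only once every source word inside
    the constraint span is covered."""
    lo, hi = spans[tag]
    return all(coverage[lo:hi])

def try_pop(stack):
    """Pop when the top two stack elements form a matching open/close pair,
    e.g. 'c1' under '/c1'. Returns True if a pop happened."""
    if len(stack) >= 2 and stack[-1] == "/" + stack[-2]:
        stack.pop()
        stack.pop()
        return True
    return False
```

Because tags can only be popped in matched pairs, any item that covers the whole input with an empty stack is guaranteed to respect the nesting of the constraints.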
Similar to lexically constrained decoding [8, 19], we use an $I \times J$ matrix to store all items generated during decoding, where $I$ is the length of the input and $J$ is the maximum length of the output. Each element of the matrix is a stack of items with $i$ source words covered and $j$ target words generated. While the time complexity of the decoding algorithm in standard NMT is $O(bJ)$, the time complexity of our algorithm is $O(bIJ)$, where $b$ is the beam size (i.e., the maximum number of items stored in each stack). To speed up decoding, our approach only keeps the top-$b$ items across all stacks with the same number of generated target words (i.e., the same $j$). As a result, the time complexity of our algorithm is reduced to $O(bJ)$, which is identical to that of Post:18 (Post:18).
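The pruning step that restores the $O(bJ)$ complexity can be sketched as a merge over one column of the item matrix. A minimal illustration, assuming items are `(score, payload)` pairs grouped by coverage count:

```python
import heapq

def prune_column(stacks_for_j, b):
    """Among all stacks that share the same number of generated target
    words j (one stack per coverage count i), keep only the b items with
    the highest scores."""
    merged = [item for stack in stacks_for_j for item in stack]
    return heapq.nlargest(b, merged, key=lambda item: item[0])
```

Merging the column before expansion means each decoding step expands at most $b$ items instead of $b \cdot I$, matching the dynamic beam allocation of Post:18.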
We evaluated our approach on the Chinese-English translation task. The training set contains 1.25M sentence pairs from LDC (composed of LDC2002E18, LDC2003E07, LDC2003E14, part of LDC2004T07, LDC2004T08, and LDC2005T06) with 29.8M Chinese tokens and 35.8M English tokens after byte pair encoding with 32K merges. The NIST 2006 dataset is used as the development set and the NIST 2008 dataset is used as the test set. The evaluation metric is case-insensitive BLEU4 as calculated by the multi-bleu.perl script.
The NMT model used in our experiments is Transformer. The number of layers is set to 6 for both the encoder and decoder. The hidden size is set to 512 and the filter size to 2,048. There are 8 separate heads in the multi-head attention. We used Adam to optimize model parameters. During training, each batch contains approximately 25,000 tokens. We adopt the learning rate decay policy described by Vaswani:17 (Vaswani:17). The length penalty is used and its hyper-parameter is set to 0.6.
For our approach, we used the training set to train the non-empty translation model in Eq. (3). The same training set was also word-aligned using GIZA++, from which we extract a bilingual phrase table to collect translation options and train the empty translation model in Eq. (4). The translation options of the empty source word are restricted to the most frequent target words whose probabilities of aligning to the empty source word are higher than 0.2 on the training set.
Results on Unconstrained Decoding
In this experiment, we compared our method with the standard Transformer model .
Effect of Empty Phrases
Table 1 shows the effect of empty source and target phrases on the development set. The empty source phrase allows for target word insertion and the empty target phrase permits source word omission. It is clear that introducing empty phrases on both sides is beneficial for improving translation quality, suggesting that it is important to use empty phrases to reduce the discrepancy between the phrase-based search space and neural models. An interesting finding is that allowing for target word insertion but disabling source word omission dramatically hurts the translation performance (i.e., 39.28). We find that the decoder tends to insert many meaningless target words.
Comparison with Transformer
Table 2 shows the comparison between the standard Transformer model and our latent variable model. Our model is different from the standard model in two aspects. First, our model uses a phrase lattice to represent the search space. Second, empty phrases are introduced to make the search space more flexible than that of conventional SMT. We find that our model slightly improves over the standard model, suggesting that we can use the phrase-based search space to replace the standard search space for lexically and structurally constrained decoding.
Lexical constraints: ("taose", "color blossoms")
Source: " taose " qian ban duan hai ting hao de , dajia dou shi xinren .
Reference: the first half of " color blossoms " is quite good . they are all first-timers .
No constraint: the first half of the " peach color " is still quite good . people are new people .
DBA: in the first half of the " peach color blossoms , " people are new people .
This work: the first half of the " color blossoms " is still quite good . people are new people .

Lexical constraints: ("yaoguanju", "fda"), ("7 yue 30 ri", "july 30"), ("wendiya", "avandia")
Source: yaoguanju jiang yu 7 yue 30 ri juxing youguan wendiya anquanxing de tingzhenghui
Reference: the fda will hold a hearing into the safety of avandia on july 30 .
No constraint: the drug administration will hold hearings on the safety of wendiya on july 30 .
DBA: fda avandia will hold a hearing on the safety of man dim on july 30 .
This work: the fda will hold hearings on the safety of avandia on july 30 .
Figure 5 compares attention with alignment. As the Transformer model does not expose a single attention matrix between the input and the output, the heatmap in Figure 5 is taken from the encoder-decoder attention in the third layer, with the attention weights averaged over 8 different heads. While the attention matrix only reveals the relevance between source and target words, the phrase alignment generated by our model is more useful for achieving lexically and structurally constrained decoding.
Results on Lexically Constrained Decoding
In this experiment, we compared our method with dynamic beam allocation (DBA) proposed by Post:18 (Post:18). We asked human experts to pre-specify 467 distinct lexical constraints with 1,005 occurrences for the NIST 2008 dataset. They are mostly translations of named entities.
We find that imposing lexical constraints using DBA achieves a BLEU score of 38.54 and our approach achieves a BLEU score of 39.43. Table 3 shows some example translations. Given a lexical constraint (“taose”, “color blossoms”), unconstrained decoding fails to generate “color blossoms” on the target side. DBA is capable of enforcing the target phrase of the lexical constraint to appear in the translation. However, there is an extra target word “peach” (highlighted in bold) that is also connected to “taose”. In other words, “taose” is translated twice in a wrong way. To make things worse, DBA omits the source phrase “hai ting hao de” (highlighted in italic). Similar findings are also observed on the second example, in which the Chinese word “wendiya” is translated twice by DBA: “avandia” and “man dim” (highlighted in bold).
We observe that 6.9% of the source phrases of lexical constraints on the test set are repeatedly translated by DBA while the proportion drops to 0.3% for our approach. One possible reason is that DBA ignores the source side of a lexical constraint and thus inevitably impairs the adequacy of the resulting translation.
Results on Structurally Constrained Decoding
We evaluated our structurally constrained decoding algorithm on a webpage translation task.
As labeled data is limited in quantity for webpage translation, we still use the unstructured Chinese-English dataset that contains 1.25M sentence pairs as the training set. We built a test set for Chinese-English structured text translation based on the webpages of Wikipedia. The test set contains 200 sentences with HTML tags retained. On average, each sentence in the test set has 36.9 words and 2.6 pairs of HTML tags.
We compared our approach with the following five baselines (we did not compare with methods that train SMT models on parallel corpora for webpage translation because those datasets are not publicly available):
Remove: Remove all HTML tags before decoding and do not insert tags back to translations after decoding.
Split: Split the input by tags before decoding, translate the textual parts independently, and concatenate the translations monotonically after decoding.
Match: Remove all HTML tags before decoding and insert tags back into the translations by matching.
Align: Remove all HTML tags before decoding and insert tags back into the translations using word alignments generated by GIZA++.
Google: The Google Translate online system (https://translate.google.com/).
All the baselines except Google share the same Transformer model with our approach.
[Table 4: results of each method under the "w/o tag", "w/ tag", and "in tag" metrics.]
Results on Webpage Translation
Table 4 compares our method of imposing structural constraints with existing methods on the test set. As Remove ignores all HTML tags, it is not capable of imposing structural constraints. Split ensures that the structural constraints are imposed correctly because the sentence segments between HTML tags are translated independently, but its translation quality drops dramatically. Match and Align take full advantage of standard NMT to translate the textual parts but often fail to recover the HTML tags correctly after decoding. Judging from its translations, Google seems to use a strategy similar to Split but achieves much higher BLEU scores because it is trained on much larger data than all the other methods. Our approach achieves the best performance in terms of all evaluation metrics by fully preserving the structural constraints without losing translation quality.
We have presented a latent variable model for neural machine translation that treats phrase alignment as an unobserved latent variable. The introduction of phrase alignment makes it possible to decompose the translation process of arbitrary NMT models into interpretable steps. In addition, thanks to the availability of phrase alignment, our approach conveniently imposes lexical and structural constraints. Experiments show that the proposed method achieves significantly better performance on both lexically and structurally constrained translation tasks.
- (1997) Automatic English/Arabic HTML home page translation tool. In Proceedings of The First Workshop on Technologies for Arabizing the Internet.
- (2015) Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR 2015.
- (2017) Guiding neural machine translation decoding with external knowledge. In Proceedings of the Second Conference on Machine Translation.
- (2016) PRIMT: a pick-revise framework for interactive machine translation. In Proceedings of NAACL 2016.
- (2017) Neural machine translation leveraging phrase-based models in a hybrid search. In Proceedings of ACL 2017.
- (2017) Visualizing and understanding neural machine translation. In Proceedings of ACL 2017.
- (2010) TMX markup: a challenge when adapting SMT to the localisation environment. European Association for Machine Translation.
- (2017) Lexically constrained decoding for sequence generation using grid beam search. In Proceedings of ACL 2017.
- (2011) The integration of Moses into the localization industry. In Proceedings of EAMT 2011.
- (2013) Transferring markup tags in statistical machine translation: a two-stream approach. Machine Translation Summit XIV.
- (2017) Neural lattice search for domain adaptation in machine translation. In Proceedings of IJCNLP 2017.
- (2015) Adam: a method for stochastic optimization. In Proceedings of ICLR 2015.
- (2017) Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation.
- (2003) Statistical phrase-based translation. In Proceedings of NAACL 2003.
- (2019) On the word alignment from neural machine translation. In Proceedings of ACL 2019.
- (2015) Addressing the rare word problem in neural machine translation. In Proceedings of ACL 2015.
- (2003) A systematic comparison of various statistical alignment models. Computational Linguistics.
- (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of ACL 2002.
- (2018) Fast lexically constrained decoding with dynamic beam allocation for neural machine translation. In Proceedings of NAACL 2018.
- (2016) Neural machine translation of rare words with subword units. In Proceedings of ACL 2016.
- (2016) Syntactically guided neural machine translation. In Proceedings of ACL 2016.
- (2014) Sequence to sequence learning with neural networks. In Proceedings of NIPS 2014.
- (2011) SMT-CAT integration in a technical domain: handling XML markup using pre & post-processing methods. In Proceedings of EAMT 2011.
- (2017) Attention is all you need. In Proceedings of NIPS 2017.
- (2016) Google's neural machine translation system: bridging the gap between human and machine translation. arXiv:1609.08144v2.
- (2001) An automatic English-Arabic HTML page translation system. Journal of Network and Computer Applications.