Machine Translation with Unsupervised Length-Constraints

by   Jan Niehues, et al.
Maastricht University

We have seen significant improvements in machine translation due to the use of deep learning. While the improvements in translation quality are impressive, the encoder-decoder architecture enables many more possibilities. In this paper, we explore one of these: the generation of constrained translations. We focus on length constraints, which are essential if the translation should be displayed in a given format. In this work, we propose an end-to-end approach for this task. Compared to a traditional method that first translates and then performs sentence compression, the text compression is learned in a completely unsupervised manner. By combining the idea with zero-shot multilingual machine translation, we are also able to perform unsupervised monolingual sentence compression. In order to fulfill the length constraints, we investigate several methods to integrate the constraints into the model. Using the presented techniques, we are able to significantly improve the translation quality under constraints and, furthermore, to perform unsupervised monolingual sentence compression.




1 Introduction

Neural machine translation (NMT) Sutskever et al. (2014); Bahdanau et al. (2014) exploits neural networks to directly learn to transform sentences in a source language into sentences in a target language. This technique has significantly improved the quality of machine translation Bojar et al. (2016); Cettolo et al. (2015). The advances in quality also allow for the application of this technology in new real-world applications.

While research systems often focus purely on high translation quality, real-world applications often have additional requirements for the output of the system. One example is the mapping of markup information from the source text to the target text Zenkel et al. (2019). In this work, we focus on another use case: the generation of translations under given length constraints. Thereby, we concentrate on compression, i.e. the target length is shorter than the length of the unconstrained translation. When translating from one language to another, the length of the source text usually differs from the length of the target text. While for most applications of machine translation this does not pose a problem, for some applications it significantly deteriorates the user experience. For example, if the translation should be displayed in the same layout as the source text (e.g. on a website), it is advantageous if the length stays the same. Another use case is captions for videos. A human is only capable of reading text at a given speed. For an optimal user experience, it is therefore not only important to present an accurate translation, but also to present a translation that stays within a maximum number of words.

A first approach to address this challenge would be a cascade of a machine translation and a sentence compression system. In this case, we would need training data to train the machine translation system and additional training data to train the sentence compression system. It is very difficult and sometimes even impossible to collect the training data for the sentence compression task. Furthermore, we need a sentence compression model with a parametric length reduction ratio; for a supervised model, we would therefore need examples with different length reduction ratios. Consequently, this work focuses on unsupervised sentence compression. In this case, the cascaded approach cannot be applied directly, and we instead start with an end-to-end approach to length-constrained translation.

While our work focuses on the end-to-end approach to translation combined with sentence compression, monolingual sentence compression is another important task. For example, human-generated captions are often not an accurate transcription of the audio; in addition, the text is shortened. This is due to cognitive processing constraints: a user can listen to more words in a given time than they can read in the same amount of time. When combining length-constrained machine translation with the idea of zero-shot machine translation, the proposed method is also able to perform monolingual sentence compression. Compared to related work, this method learns the text compression in an unsupervised manner, without ever seeing a compressed sentence.

The main contribution of this work is an end-to-end approach to length-constrained translation that jointly performs machine translation and sentence compression. We are able to show that for this task an end-to-end approach outperforms the cascade of machine translation and unsupervised sentence compression.

Secondly, we perform an in-depth analysis of how to integrate additional constraints into neural machine translation. We investigate methods to restrict the search to translations with a given length as well as adaptations of the transformer-based encoder-decoder. In the analysis, we also investigate the portability of the technique to other constraints: we apply the same techniques to avoiding difficult words.

A third contribution of this work is to extend the presented approach to unsupervised monolingual sentence compression. By combining the presented approach with multilingual machine translation, we are also able to generate paraphrases with a given length constraint. The investigation shows that a system trained on several languages is able to successfully generate monolingual paraphrases.

2 Constraint decoding

In this work, we address the challenge of generating machine translations with additional constraints. The main application is length-constrained translation, that is, generating a translation with a given target length. We focus thereby on the case of shortening the translations. While the length can be measured in words, sub-word tokens or letters, in the experiments we measure it in sub-word tokens.

A first straightforward approach is to restrict the search space to generate only translations with a given length. The length of the output is modeled by the probability of the end-of-sentence (EOS) token. By modifying this probability, we introduce a hard constraint that is always fulfilled.

Based on the experiments with this type of length constraint, we then investigate methods to include the length constraints directly in the model. For this we use two techniques successfully applied in encoder-decoder models: embeddings and positional encodings. In this case, the target length is modeled as a soft constraint. However, by combining both techniques, we can again achieve a hard constraint.

2.1 Search space

A first strategy to include the additional constraints is to ignore them during training and to restrict the search space during inference to translations that fulfill the constraint. For length constraints, this can be achieved by manipulating the end-of-sentence token probability. First, we need to ensure that the EOS token is not generated before the desired output length l. This can be ensured by setting the probability of the end-of-sentence token to zero for all positions j before the desired length and renormalizing the distribution:

    p'(y_j = EOS) = 0,    p'(y_j = w) = p(y_j = w) / (1 - p(y_j = EOS))    for w ≠ EOS, j < l

Finally, we ensure that the search stops at the desired length by setting the probability of the end-of-sentence token to one once the output sequence has reached this length:

    p'(y_l = EOS) = 1
While this approach guarantees that the output of the translation system always meets the length condition (hard constraint), it also has one major drawback. Until the system reaches the constrained length, it is not aware of how many words it is still allowed to generate. Therefore, it is not able to shorten the beginning of the sentence in order to fulfill the length constraint.
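The two modifications of the EOS probability can be sketched as follows. This is a minimal numpy sketch; the function name and exact interface are our own, not from the paper.

```python
import numpy as np

def apply_length_constraint(step_probs, n_generated, target_len, eos_id):
    """Force the output to have exactly `target_len` tokens by manipulating
    the end-of-sentence (EOS) probability at each decoding step.

    step_probs:   next-token distribution produced by the decoder
    n_generated:  number of tokens generated so far
    """
    probs = step_probs.astype(float).copy()
    if n_generated < target_len:
        # Forbid EOS before the desired length and renormalize.
        probs[eos_id] = 0.0
        probs /= probs.sum()
    else:
        # Force EOS once the desired length is reached.
        probs[:] = 0.0
        probs[eos_id] = 1.0
    return probs
```

In beam search, this masking would be applied to every hypothesis before the top-k selection at each step.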

Motivated by this observation, we investigate methods to integrate the length constraint into the model and not only apply it during inference.

2.2 Pseudo-supervised training

The first challenge we need to address when including the length constraint in the model itself is the question of training data. While there are large amounts of parallel training data, it is hard to acquire training data with length constraints. Therefore, we investigate methods to train the model with standard parallel training data.

We perform the training by a type of pseudo-supervision. For each source sentence in the training data, we also know the translation and thereby its length. The main idea is to assume that this translation was generated under the constraint to produce an output with exactly the length of the given translation. Of course, this is mostly not the case: the human translator generated a translation that appropriately expresses the meaning of the source sentence, not a sentence that fulfills a length constraint.

Therefore, learning is more difficult in this case. Since the translations were generated without the given length constraints, the system might learn to simply ignore the length information and instead generate a normal translation, putting all the information of the source sentence into the target sentence. In this case, we would not have the possibility to control the target length by specifying our desired length.

Therefore, we continue to investigate different possibilities how to encode the constrained target length in the model architecture.
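The pseudo-supervised data construction described above amounts to the following preparation step. This is an illustrative sketch; the dictionary layout is our own, and the length is counted in whitespace tokens here rather than in BPE units as in the experiments.

```python
def make_pseudo_supervised(parallel_corpus):
    """Pair each (source, target) sentence pair with the observed target
    length, pretending the human translation was produced under exactly
    that length constraint."""
    examples = []
    for src, tgt in parallel_corpus:
        examples.append({
            "source": src,
            "target": tgt,
            # The observed length of the reference becomes the constraint.
            "target_length": len(tgt.split()),
        })
    return examples
```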

2.3 Length representation

In this work, we investigate three different methods to represent the target length in the model. In all cases, our training data consists of a source sentence S, a target sentence T and the target length l, which during training is the length of T.

Source embedding

A first method is to include the target length in the source sentence as an additional token. This is motivated by successful approaches for multilingual machine translation Ha et al. (2016), domain adaptation Kobus et al. (2017) and formality levels Sennrich et al. (2016). We change the training procedure so that the input to the encoder of the NMT system is not the source sentence alone, but the source sentence with the target length l prepended as an additional token. Thereby, the encoder will learn an embedding for each target length seen during training.

There are two challenges with this approach. First, the dependency between the specified length and the output is quite long within the model. Therefore, the model might ignore the information and just learn to generate the best translation for a given source sentence. Secondly, the representations for the possible target lengths are independent of each other: the embedding for length l is not constrained to be similar to the one for length l+1. This poses a special challenge for long sentences, which occur less frequently, so the embeddings of these lengths will not be learned as well as the frequent ones.
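A minimal sketch of the source-side length token. The `<len_N>` token format is a hypothetical choice of ours; the paper only specifies that the target length is added as an additional input token.

```python
def add_length_token(source_tokens, target_len):
    """Prepend the desired target length as a pseudo-token, analogous to
    the target-language tokens used in multilingual NMT."""
    return [f"<len_{target_len}>"] + list(source_tokens)
```

Since each distinct length yields a distinct token, the encoder learns one independent embedding per observed length, which is exactly the sparsity problem discussed above.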

Target embedding

We address the first challenge by integrating the length constraint directly into the decoder. This is motivated by similar approaches to supervised sentence compression Kikuchi et al. (2016) and zero-shot machine translation Ha et al. (2017). We incorporate the information about the number of remaining target words at each target position. First, this should ensure that the length information is not lost during the decoding process. Secondly, since the numbers embedded towards the end of a sentence are small and therefore occur frequently in the corpus, the problem of rare sentence lengths matters less.

Formally, at each decoder step j the baseline model starts with the word embedding emb(y_{j-1}) of the last target word y_{j-1}. In the original transformer architecture, the positional encoding pos(j) is applied on top of the embedding to generate the first hidden representation:

    h_j = emb(y_{j-1}) + pos(j)

In our proposed architecture, we include the number of remaining target words to be generated, l - j. We concatenate emb(y_{j-1}) with a length embedding lemb(l - j) and then apply a linear transformation W and a non-linearity f to reduce the hidden size to that of the original word embedding:

    h_j = f(W [emb(y_{j-1}); lemb(l - j)])
The proposed architecture allows the model to consider the number of remaining target words at each decoding step. While the baseline model will only cut off the end of the sentence, this model is able to take the constraints into account already at the beginning of the sentence.
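The concatenation scheme can be sketched as follows. This is a toy numpy version with random, untrained parameters; in the real model, the length-embedding table and the projection are learned jointly with the rest of the network, and the dimensions match the transformer's model size.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8  # toy dimension for illustration (the experiments use 512)

# Toy parameters: one embedding per possible remaining length, plus a
# projection from the concatenated vector back to d_model.
len_emb = rng.normal(size=(100, d_model))
W = rng.normal(size=(2 * d_model, d_model)) * 0.1
b = np.zeros(d_model)

def decoder_input(word_emb, remaining):
    """Concatenate the previous word's embedding with the embedding of the
    number of remaining target words, project back to d_model, and apply
    a non-linearity (tanh here)."""
    h = np.concatenate([word_emb, len_emb[remaining]])
    return np.tanh(h @ W + b)
```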

Positional encoding

Finally, we also address the challenge of representing sentence lengths that are less frequent. The transformer architecture Vaswani et al. (2017) introduced the positional encoding, which encodes the position within the sentence using a set of trigonometric functions. While this method encodes the position relative to the start of the sentence, we follow Takase and Okazaki (2019) and encode the position relative to the end of the sentence. Thereby, at each position we encode the number of remaining words of the sentence. Formally, we replace the encoding of position j by the encoding of the number of remaining tokens, l - j.
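The reversed encoding can be sketched as follows, using the standard transformer sinusoid evaluated at the number of remaining tokens instead of the absolute position. The loop implementation is for clarity, not efficiency.

```python
import numpy as np

def reverse_positional_encoding(target_len, d_model):
    """Sinusoidal encoding of the number of *remaining* tokens: position j
    receives the standard encoding evaluated at (target_len - j), so that
    the last position always encodes '1 token remaining'."""
    pe = np.zeros((target_len, d_model))
    for j in range(target_len):
        remaining = target_len - j
        for i in range(0, d_model, 2):
            angle = remaining / (10000 ** (i / d_model))
            pe[j, i] = np.sin(angle)
            if i + 1 < d_model:
                pe[j, i + 1] = np.cos(angle)
    return pe
```

Because the encoding depends only on the remaining count, positions near the end of sentences of any length share representations, which mitigates the rare-length problem.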

2.4 Additional constraints

Besides constraining the number of words, other constraints can be implemented just as easily using the same framework. In this work, we show this by limiting the number of complex and difficult words. One use case is the generation of paraphrases in simplified language. A metric to measure text difficulty, the Dale-Chall readability metric Chall and Dale (1995), for example, counts such difficult words. In an NMT system, longer words are typically split into subword units by byte-pair encoding (BPE) Sennrich et al. (2016). A complex word like marshmallow is split into several parts like mar@@ shm@@ allow, where @@ indicates that the word is not yet finished.

We can encourage the system to use fewer complex words that need to be represented by several BPE tokens by counting the number of tokens that do not end a word. In the proposed encoding scheme, these are all tokens that end in @@. During decoding, we then try to generate translations with a minimal number of these tokens.
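The counting rule above reduces to counting BPE continuation markers; a direct sketch:

```python
def count_complex_tokens(bpe_tokens):
    """Count tokens that do not end a word, i.e. BPE continuation tokens
    ending in '@@' (e.g. 'mar@@ shm@@ allow' contributes two)."""
    return sum(1 for tok in bpe_tokens if tok.endswith("@@"))
```

This count can then take the place of the target length in the constraint representations described in Section 2.3.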

3 Evaluation

The lack of appropriate data is not only a challenge for training the model but also for evaluating it. The default approach to evaluate a machine translation method is to compare the system translation with a human translation using some automatic metric (e.g. BLEU Papineni et al. (2001)).

In our case, we would need a human-generated translation that also fulfills the additional constraints, for example, a translation whose length is shortened to 80% of the input length.

Since this type of translation data is not available, we investigate methods to compare the length-constrained output of the system with standard human translations that do not fulfill any specific constraints.

3.1 Word matching metrics

While there is significant research on automatic metrics for machine translation Ma et al. (2018, 2019), BLEU is still the most commonly used metric. Therefore, a first approach would be to use BLEU to compare the automatic translation with length constraints to the human translation without constraints. Using length constraints leads to a low BLEU score due to the brevity penalty of the metric. But since all systems must fulfill the length constraint, the penalty is the same for all outputs and we can still compare the different outputs.

Reference: So CEOs, a little bit better than average, but here’s where it gets interesting.
Baseline: CEOs are a little bit
Constraint: the CEOs are interesting .
Table 1: Example of constraint translation

A problem with using the BLEU metric for this task is illustrated by the example translations in Table 1. The baseline system only uses the length constraint for restricting the search space. In the constraint system, we also use the length constraint as additional embeddings in the decoder. Looking at this example sentence, a human would always rate the constraint translation better than the baseline translation. The problem of the baseline model is that it often generates a prefix of the full translation. While this does not lead to a good constrained translation, it still leads to a relatively high BLEU score; in this case, we have one matching 3-gram, two bigrams and four unigrams.

In contrast, the output of the constraint model contains only words matching the reference scattered over the sentence; in this case, we only have two unigram matches. Guided by this observation, we use further metrics to evaluate the models.

3.2 Embedding-based metrics

In order to address the challenges mentioned in the last section, we use metrics that are based on sentence embeddings instead of word- or character-based representations of the sentence. This way, it is no longer important that the words occur in the same order in the automatic translation and the reference. Based on the performance of the automatic metrics in the WMT 2018 evaluation campaign, we use the RUSE metric Shimanaka et al. (2019). It uses sentence embeddings from three different models: InferSent, Quick-Thought and the Universal Sentence Encoder. The quality is then estimated by an MLP based on the representations of the hypothesis and the reference translation.

4 Experiments

4.1 Data

We train our systems on the TED data from the IWSLT 2017 multilingual evaluation campaign Cettolo et al. (2017). The data contains parallel data between German, English, Italian, Dutch and Romanian. We create three different systems: the first is trained only on the German-English data, the second on German-English and English-German data, and the last on {German, Dutch, Italian, Romanian} and English data in both directions.

The data is preprocessed using standard MT procedures including tokenization, truecasing and byte-pair encoding with 40K codes. For model selection, the checkpoints performing best on the validation data (dev2010 and tst2010 combined) are averaged, which is then used to translate the tst2017 test set.

In the experiments, we address two different targeted lengths. In order to not use any information from the reference, we measure length limits relative to the source sentence length by counting subword units. We aim to shorten the translation to produce output that is 80% and 50% of the source sentence length. In all experiments, the length is measured by the number of sub-word tokens.
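The targeted lengths can be derived from the source sentence as follows. This is a sketch; the rounding rule and the minimum of one token are our assumptions, as the paper does not specify them.

```python
def target_length(source_bpe_tokens, ratio):
    """Desired output length as a fraction of the source length, measured
    in sub-word (BPE) tokens and rounded to the nearest integer."""
    return max(1, round(len(source_bpe_tokens) * ratio))
```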

4.2 System

We use the transformer architecture Vaswani et al. (2017) and increase the number of layers to eight. The layer size is 512 and the inner size is 2048. Furthermore, we apply word dropout Gal and Ghahramani (2016) and layer dropout as in the original work, and we use the same learning rate schedule as in the original work. The implementation is available at

4.3 Task difficulty

Initially, we wanted to investigate the difficulty added by the length constraints. Therefore, we used the length of the human reference translation as a first target length. One could even argue that this should make the translation task easier, since some information about the translation is known. The results of this experiment are shown in Table 2. Since we do not perform compression in this experiment, the aforementioned problem with BLEU should not apply here.

Encoding BLEU RUSE
Baseline 30.80 -0.085
Only Search 28.32 -0.124
Source Emb 28.56 -0.126
Decoder Emb 27.88 -0.140
Decoder Pos 28.80 -0.138
Table 2: Using oracle length

However, the results indicate that the baseline system achieves the best BLEU score as well as the best RUSE score. All other models generate translations that perfectly fit the desired target length, but this leads to a drop in translation quality. Therefore, even if the target length is the same as that of the reference translation, the restriction increases the difficulty of the problem. One reason could be that the machine translation system rarely generates translations which exactly match the reference; forcing the translation to have an exact predefined length therefore makes the problem harder.

4.4 Length representation

In a first series of experiments, we analyze the different techniques to encode the length of the output. First, we are interested in whether the different length representations are able to enforce an output that has the length we are aiming at (soft constraints). For the German-to-English translation task, the length differences of the different encoding versions are shown in Table 3. We define the length difference as the average difference between the targeted output length, given in BPE units, and the length of the output of the translation system.
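The length-difference metric can be sketched as follows, assuming the absolute difference is averaged over the test set (the text does not state whether the difference is signed).

```python
def avg_length_difference(target_lengths, outputs):
    """Average absolute difference (in BPE tokens) between the targeted
    length and the length of each system output."""
    diffs = [abs(t - len(out)) for t, out in zip(target_lengths, outputs)]
    return sum(diffs) / len(diffs)
```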

First, without adding any constraints, the model generates translations that differ by 3.90 and 10.29 tokens from the targeted length. By specifying the length on the source side, we can reduce the length difference to half a token for a targeted length of 80% and one and a half tokens for 50% of the source length. The models using the decoder embeddings and the decoder positional encoding were able to generate translations with almost perfectly the correct number of tokens.

Encoding Avg. length difference
80% 50%
Baseline 3.90 10.29
Source Emb 0.55 1.40
Decoder Emb 0.07 0.16
Decoder Pos 0.09 0.19
Table 3: Avg. Length distance

Besides fulfilling the length constraints, the translations need to be accurate. Since we want a fair comparison, we evaluate the output with a restricted search space, so that only translations with the correct number of tokens are generated (hard constraints). The results are summarized in Table 4.

Encoding RUSE
80% 50%
Baseline -0.272 -0.605
Source Emb -0.263 -0.587
Decoder Emb -0.2469 -0.555
Decoder Pos -0.2598 -0.577
Table 4: German-English translation quality

As shown by the results, we see improvements in translation quality when using the source embedding within the encoder. We obtain further improvements if we represent the targeted length within the decoder; in this case, we can improve the RUSE score by 2% and 5% absolute. The two decoder encodings perform similarly, with a small advantage for embeddings over positional encodings. Therefore, in the remainder of the experiments we use the embeddings.

4.5 Multi-lingual

In a second series of experiments, we combine the constrained translation approach with multilingual machine translation. The combination of both offers the unique opportunity to perform unsupervised sentence compression: we can treat the translation of English to English as a zero-shot direction Johnson et al. (2017); Ha et al. (2016). This has not been addressed in traditional multilingual machine translation, since in this case the model will often just copy the source sentence to the target side. By adding the length constraints, we force the model to reformulate the sentence in order to fulfill the length constraint.

The results for these experiments are shown in Table 5. We compare three scenarios: first, a model trained only to translate from German to English; secondly, a model trained to translate from German to English and English to German; finally, a model trained on four languages to and from English.

System   Target Length 0.8                    Target Length 0.5
         Baseline         Dec. Emb            Baseline         Dec. Emb
         DE-EN   EN-EN    DE-EN   EN-EN       DE-EN   EN-EN    DE-EN   EN-EN
DE-EN    -0.272  -        -0.247  -           -0.587  -        -0.554  -
DE+EN    -0.264  -0.817   -0.223  -0.905      -0.598  -0.954   -0.523  -0.978
All      -0.225  -0.102   -0.214   0.020      -0.560  -0.525   -0.548  -0.481
Table 5: Multi-lingual systems (RUSE scores for the German-English and English-English directions)

First of all, since the models are trained on relatively small data, we always gain from using more language pairs. Secondly, for all models translating from German to English, the decoder embeddings are clearly better than the baseline. Finally, to perform paraphrasing, we need a multilingual system with several language pairs: the models trained only on the German-English and English-German data fail to generate adequate English-to-English output. In contrast, if we look at the translation from English to English for the multilingual model, the scores are clearly better than the ones from German to English. Furthermore, the system with decoder embeddings is again clearly better than the baseline system.

In addition, we performed the same experiment with a target length of half the source length (Table 5). Although the absolute scores are significantly lower, since the model has to reduce the length further, the tendency is the same.

4.6 End2End vs. Cascaded

Length Model DE-EN EN-EN
0.8 End2End -0.247 0.020
Cascade -0.259 -0.118
Cascade Fix. Pivot -0.166
0.5 End2End -0.555 -0.481
Cascade -0.575 -0.521
Cascade Fix. Pivot -0.544
Table 6: Comparison of End-to-End and Cascaded approach

In this work, we are able to combine machine translation and sentence compression. In a third series of experiments, we wanted to investigate the advantage of modelling it in an end-to-end fashion compared to a cascade of different models. We performed this investigation again for two tasks: German to English and English to English.

The cascade system for German to English first translates the German text to English with a baseline machine translation system; in a second step, the output is compressed with the multilingual MT system. For the English-to-English task, the cascade system removes the zero-shot condition: we first translate from English to German with the baseline system and then translate with length constraints from German to English. In the Cascade Fix. Pivot condition, the English-to-German system already fulfills the length constraint as well.

As shown in Table 6, in all conditions the end-to-end approach outperforms the cascaded version. This is especially the case for the English-to-English task. In contrast to common findings in multilingual machine translation, for this task it seems to be beneficial to perform the zero-shot task directly instead of using a pivot language.

4.7 Simplification

Metric DE-EN DE+EN All
Base Simp. Base Simp. Base Simp
BPE tokens 1961 1053 1978 1041 1899 991
DCI 7.63 7.47 7.69 7.5 7.66 7.45
FRE 83.86 86.18 84.31 85.49 82.98 85.59
BLEU 30.80 30.62 32.25 31.38 32.84 31.29
RUSE -0.085 -0.092 -0.082 -0.080 -0.042 -0.084
Table 7: Simplification

In a last series of experiments (Table 7), we investigate the ability of the framework to generate simpler sentences. As described in Section 2.4, we concentrate on reducing the number of rare and complex words. Again, we use the decoder embedding, here representing the number of BPE continuation tokens in the sentence. We use the systems trained on one language pair, on two translation directions and on all languages. First, the systems are able to reduce the number of BPE continuation tokens significantly, by up to 48%. Since the number of words stays nearly the same, this indicates the use of simpler words, which is also reflected in the better readability scores. On the other hand, we see that the translation quality is only affected slightly.

4.8 Qualitative Results

For the length-restricted system, we also present examples in Table 8. The translations were generated with the multilingual system using the restricted search space with 0.8 times and 0.5 times the length of the source sentence, where the length is measured in subword tokens.

Source: Und, obwohl es wirklich einfach scheint, ist es tatsächlich richtig schwer, weil es Leute drängt sehr schnell zusammenzuarbeiten.
Reference: And, though it seems really simple, it's actually pretty hard because it forces people to collaborate very quickly.
Base 0.8: and even though it really seems simple , it is actually really hard , because it really pushes
Dec. Emb. 0.8: and although it really seems simple , it is really hard because it drives people to work together .
Base 0.5: and even though it really seems simple , it is really hard
Dec. Emb. 0.5: it is really hard because it drives people to work together .

Source: Konstrukteure erkennen diese Art der Zusammenarbeit als Kern eines iterativen Vorgangs.
Reference: Designers recognize this type of collaboration as the essence of the iterative process.
Base 0.8: now , traditional constructors recognize this kind of collaboration as the core
Dec. Emb. 0.8: designers recognize this kind of collaboration as the core of iterative .
Base 0.5: now , traditional constructors recognize this kind
Dec. Emb. 0.5: developers recognize this kind of collaboration .

Table 8: Examples

In the examples we clearly see the problem of the baseline model with the restricted search space: it mainly outputs a prefix of the long translation and does not try to put the main content into the shorter segment. In contrast, the systems using the decoder embeddings are aware, when generating a word, of how much space they still have to fill with content. Therefore, they do not just cut off part of the sentence, but compress it and extract its most important part. While the first example concentrates more on the second part of the original sentence, the second one focuses on the beginning. Although the models reducing the length by 50% have to remove some content of the original sentence, the output is still understandable.

5 Related Work

The most common approach to model the target length within NMT is the use of coverage models Tu et al. (2016). More recently, Lakew et al. (2019) used similar techniques to generate translations with the same length as the source sentence. Compared to these works, we try to significantly reduce the length of the sentence and thereby face a situation where the training and testing conditions differ significantly. This work on length-controlled machine translation is strongly related to sentence compression, where the compression is performed in the monolingual case. First approaches used rules Dorr et al. (2003) for extractive sentence compression. For abstractive compression, methods using syntactic translation Cohn and Lapata (2008) and phrase-based machine translation Wubben et al. (2012) were investigated. The success of encoder-decoder models in many areas of natural language processing Sutskever et al. (2014); Bahdanau et al. (2014) motivated their successful application to sentence compression. Kikuchi et al. (2016) and Takase and Okazaki (2019) investigated approaches to directly control the output length. Although their methods use techniques similar to ours, their models are trained in a supervised way. Motivated by recent success in unsupervised machine translation Artetxe et al. (2018); Lample et al. (2018), a first approach to learn text compression in an unsupervised fashion was presented in Fevry and Phang (2018). Text compression in a supervised fashion for subtitles was investigated in Angerbauer et al. (2019).

Besides text compression, the combination of readability and machine translation has also been researched recently. Agrawal and Carpuat (2019) presented an approach to model the readability using source-side annotations. In contrast to our work, they concentrated on the scenario where manually created training data is available. In Marchisio et al. (2019), the authors specified the desired readability either by a source token or through the architecture by different encoders. While they concentrate on a single task and have only a limited number of difficulty classes, the work presented here is able to handle a huge number of possible output classes (e.g., in text compression, the number of words) and can be applied to different tasks.

6 Conclusion

In this work, we presented a first approach to length-restricted machine translation. In contrast to work on monolingual sentence compression, we thereby focus on unsupervised methods. By combining the approach with multilingual machine translation, we are also able to perform monolingual unsupervised sentence compression.

We have shown that it is important to provide the length constraints to the decoder in order to obtain translations that fulfill the constraints. Furthermore, modeling the task in an end-to-end fashion outperforms splitting it into separate sub-tasks. This even holds under zero-shot conditions.


  • S. Agrawal and M. Carpuat (2019) Controlling Text Complexity in Neural Machine Translation. arXiv:1911.00835 [cs]. Note: arXiv: 1911.00835 External Links: Link Cited by: §5.
  • K. Angerbauer, H. Adel, and T. Vu (2019) Automatic Compression of Subtitles with Neural Networks and its Effect on User Experience. In Proceedings of the 20th Annual Conference of the International Speech Communication Association (Interspeech 2019), pp. 594–598. External Links: Document Cited by: §5.
  • M. Artetxe, G. Labaka, E. Agirre, and K. Cho (2018) Unsupervised Neural Machine Translation. In International Conference on Learning Representations, External Links: Link Cited by: §5.
  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural Machine Translation by Jointly Learning to Align and Translate. CoRR abs/1409.0. External Links: Link Cited by: §1, §5.
  • O. Bojar, R. Chatterjee, C. Federmann, Y. Graham, B. Haddow, M. Huck, A. Jimeno Yepes, P. Koehn, V. Logacheva, C. Monz, M. Negri, A. Névéol, M. Neves, M. Popel, M. Post, R. Rubino, C. Scarton, L. Specia, M. Turchi, K. Verspoor, and M. Zampieri (2016) Findings of the 2016 Conference on Machine Translation. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, Berlin, Germany, pp. 131–198. External Links: Link, Document Cited by: §1.
  • M. Cettolo, M. Federico, L. Bentivogli, J. Niehues, S. Stüker, K. Sudoh, K. Yoshino, and C. Federmann (2017) Overview of the IWSLT 2017 Evaluation Campaign. In Proceedings of the 14th International Workshop on Spoken Language Translation (IWSLT 2017), Tokyo, Japan. External Links: Link Cited by: §4.1.
  • M. Cettolo, J. Niehues, S. Stüker, L. Bentivogli, R. Cattoni, and M. Federico (2015) The IWSLT 2015 Evaluation Campaign. In Proceedings of the Twelfth International Workshop on Spoken Language Translation (IWSLT 2015), Da Nang, Vietnam. External Links: Link Cited by: §1.
  • J.S. Chall and E. Dale (1995) Readability revisited: the new Dale-Chall readability formula. Brookline Books. External Links: ISBN 978-1-57129-008-3, Link, LCCN 95016034 Cited by: §2.4.
  • T. Cohn and M. Lapata (2008) Sentence Compression Beyond Word Deletion. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), Manchester, UK, pp. 137–144. External Links: Link Cited by: §5.
  • B. Dorr, D. Zajic, and R. Schwartz (2003) Hedge Trimmer: A Parse-and-Trim Approach to Headline Generation. In Proceedings of the HLT-NAACL 03 Text Summarization Workshop, pp. 1–8. External Links: Link Cited by: §5.
  • T. Fevry and J. Phang (2018) Unsupervised Sentence Compression using Denoising Auto-Encoders. In Proceedings of the 22nd Conference on Computational Natural Language Learning, Brussels, Belgium, pp. 413–422. External Links: Link, Document Cited by: §5.
  • Y. Gal and Z. Ghahramani (2016) A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, USA, pp. 1027–1035. Note: event-place: Barcelona, Spain External Links: ISBN 978-1-5108-3881-9, Link Cited by: §4.2.
  • T. L. Ha, J. Niehues, and A. Waibel (2016) Toward Multilingual Neural Machine Translation with Universal Encoder and Decoder. In Proceedings of the 13th International Workshop on Spoken Language Translation (IWSLT 2016), Seattle, USA. External Links: Link Cited by: §2.3, §4.5.
  • T. L. Ha, J. Niehues, and A. Waibel (2017) Effective Strategies in Zero-Shot Neural Machine Translation. In Proceedings of the 14th International Workshop on Spoken Language Translation (IWSLT 2017), Tokyo, Japan. External Links: Link Cited by: §2.3.
  • M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg, G. Corrado, M. Hughes, and J. Dean (2017) Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. Transactions of the Association for Computational Linguistics 5, pp. 339–351. External Links: Link, Document Cited by: §4.5.
  • Y. Kikuchi, G. Neubig, R. Sasano, H. Takamura, and M. Okumura (2016) Controlling Output Length in Neural Encoder-Decoders. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 1328–1338 (en). External Links: Link, Document Cited by: §2.3, §5.
  • C. Kobus, J. Crego, and J. Senellart (2017) Domain Control for Neural Machine Translation. In Proceedings of Recent Advances in Natural Language Processing (RANLP 2017), Varna, Bulgaria, pp. 372–378. External Links: Document Cited by: §2.3.
  • S. M. Lakew, M. Di Gangi, and M. Federico (2019) Controlling the Output Length of Neural Machine Translation. In Proceedings of the 16th International Workshop on Spoken Language Translation (IWSLT 2019), Hong Kong. External Links: Link, Document Cited by: §5.
  • G. Lample, A. Conneau, L. Denoyer, and M. Ranzato (2018) Unsupervised Machine Translation Using Monolingual Corpora Only. In International Conference on Learning Representations, External Links: Link Cited by: §5.
  • Q. Ma, O. Bojar, and Y. Graham (2018) Results of the WMT18 Metrics Shared Task: Both characters and embeddings achieve good performance. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, Belgium, Brussels, pp. 671–688. External Links: Link, Document Cited by: §3.1.
  • Q. Ma, J. Wei, O. Bojar, and Y. Graham (2019) Results of the WMT19 Metrics Shared Task: Segment-Level and Strong MT Systems Pose Big Challenges. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), Florence, Italy, pp. 62–90. External Links: Link, Document Cited by: §3.1.
  • K. Marchisio, J. Guo, C. Lai, and P. Koehn (2019) Controlling the Reading Level of Machine Translation Output. In Proceedings of MT Summit XVII, Vol. 1, pp. 11 (en). Cited by: §5.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2001) BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics - ACL ’02, Morristown, NJ, USA, pp. 311. External Links: Link, Document Cited by: §3.
  • R. Sennrich, A. Birch, and B. Haddow (2016) Controlling Politeness in Neural Machine Translation via Side Constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2016), San Diego, California, USA, pp. 35–40. External Links: Link Cited by: §2.3.
  • R. Sennrich, B. Haddow, and A. Birch (2016) Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1715–1725. External Links: Link, Document Cited by: §2.4.
  • H. Shimanaka, T. Kajiwara, and M. Komachi (2019) RUSE: Regressor Using Sentence Embeddings for Automatic Machine Translation Evaluation. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, Stroudsburg, PA, USA, pp. 751–758. External Links: Link, Document Cited by: §3.2.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to Sequence Learning with Neural Networks. Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, pp. 3104–3112. External Links: Link Cited by: §1, §5.
  • S. Takase and N. Okazaki (2019) Positional Encoding to Control Output Sequence Length. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 3999–4004. External Links: Link, Document Cited by: §2.3, §5.
  • Z. Tu, Z. Lu, Y. Liu, X. Liu, and H. Li (2016) Modeling Coverage for Neural Machine Translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 76–85. External Links: Link, Document Cited by: §5.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention Is All You Need. CoRR abs/1706.0. External Links: Link Cited by: §2.3, §4.2.
  • S. Wubben, A. van den Bosch, and E. Krahmer (2012) Sentence Simplification by Monolingual Machine Translation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jeju Island, Korea, pp. 1015–1024. External Links: Link Cited by: §5.
  • T. Zenkel, J. Wuebker, and J. DeNero (2019) Adding Interpretable Attention to Neural Translation Models Improves Word Alignment. arXiv:1901.11359 [cs]. Note: arXiv: 1901.11359 External Links: Link Cited by: §1.