Levenshtein Transformer

05/27/2019 ∙ by Jiatao Gu, et al. ∙ Facebook ∙ NYU

Modern neural sequence generation models are built to either generate tokens step-by-step from scratch or (iteratively) modify a sequence of tokens bounded by a fixed length. In this work, we develop Levenshtein Transformer, a new partially autoregressive model devised for more flexible and amenable sequence generation. Unlike previous approaches, the atomic operations of our model are insertion and deletion. The combination of them facilitates not only generation but also sequence refinement, allowing dynamic length changes. We also propose a set of new training techniques dedicated to them, effectively exploiting one as the other's learning signal thanks to their complementary nature. Experiments applying the proposed model achieve comparable performance but much-improved efficiency on both generation (e.g. machine translation, text summarization) and refinement tasks (e.g. automatic post-editing). We further confirm the flexibility of our model by showing that a Levenshtein Transformer trained on machine translation can straightforwardly be used for automatic post-editing.




1 Introduction

Neural sequence generation models are widely developed and deployed in tasks such as machine translation (Bahdanau et al., 2015; Vaswani et al., 2017). As we examine the current frameworks, the most popular autoregressive models generate tokens step-by-step. If not better, recent non-autoregressive approaches (Gu et al., 2018; Kaiser et al., 2018; Lee et al., 2018) have proved it possible to perform generation within a much smaller number of decoding iterations.

In this paper, we propose Levenshtein Transformer (LevT), aiming to address the lack of flexibility of the current decoding models. Notably, in the existing frameworks, the length of generated sequences is either fixed or monotonically increased as the decoding proceeds. This remains incompatible with human-level intelligence, where humans can revise, replace, revoke or delete any part of their generated text. Hence, LevT is proposed to bridge this gap by breaking the so-far standardized decoding mechanism and replacing it with two atomic operations: insertion and deletion.

We train the LevT using imitation learning. The resulting model contains two policies, which are executed in an alternating manner. Empirically, we show that LevT achieves comparable or better results than a standard Transformer model on machine translation and summarization, while maintaining the efficiency advantages of parallel decoding, similarly to Lee et al. (2018). With this model, we argue that decoding becomes more flexible. For example, when the decoder is given an empty sequence, it falls back to a normal sequence generation model. On the other hand, the decoder acts as a refinement model when the initial state is a low-quality generated sequence. Indeed, we show that a LevT trained on machine translation is directly applicable to translation post-editing without any change. This would not be possible with any framework in the literature, because generation and refinement are treated as two different tasks due to the model’s inductive bias.

One crucial component in the LevT framework is the learning algorithm. We leverage a characteristic of insertion and deletion: they are complementary but also adversarial. The algorithm we propose is called “dual policy learning”. The idea is that when training one policy (insertion or deletion), we use the output from its adversary at the previous iteration as input. An expert policy, on the other hand, is drawn to provide a correction signal. Although this learning algorithm is, in theory, applicable to other imitation learning scenarios where a dual adversarial policy exists, in this work we primarily focus on a proof of concept that applies it to training the proposed LevT model.

To this end, we summarize the contributions as follows:

  • We propose Levenshtein Transformer (LevT), a new sequence generation model composed of the insertion and deletion operations. This model achieves comparable or better results than a strong Transformer baseline in both machine translation and text summarization, but with much better decoding efficiency (see Table 1);

  • We propose a corresponding learning algorithm under the theoretical framework of imitation learning, tackling the complementary and adversarial nature of the dual policies;

  • We regard our model as a pioneering attempt to unify sequence generation and refinement, thanks to its built-in flexibility. With this unification, we empirically validate the feasibility of applying a LevT model trained on machine translation directly to translation post-editing, without any change.

2 Problem Formulation

2.1 Sequence Generation and Refinement

We unify the general problems of sequence generation and refinement by casting them as a Markov Decision Process (MDP) defined by a tuple $(\mathcal{Y}, \mathcal{A}, \mathcal{E}, \mathcal{R}, y^0)$. We consider a setup consisting of an agent interacting with an environment $\mathcal{E}$, which receives the agent’s editing actions and returns the modified sequence. We define $\mathcal{Y} = \mathcal{V}^{N_{\max}}$ as a set of discrete sequences up to length $N_{\max}$, where $\mathcal{V}$ is a vocabulary of symbols. At every decoding iteration, the agent receives an input $y$ drawn from scratch or from an uncompleted generation, chooses an action $a$ and gets a reward $r$. We use $\mathcal{A}$ to denote the set of actions and $\mathcal{R}$ for the reward function. Generally, the reward function measures the distance between the generation and the ground-truth sequence, $\mathcal{R}(y) = -\mathcal{D}(y, y^*)$, where $\mathcal{D}$ can be any distance measurement such as the Levenshtein distance (Levenshtein, 1965). It is crucial to incorporate the initial sequence $y^0 \in \mathcal{Y}$ into our formulation: when $y^0$ is an already generated sequence from another system, the agent essentially learns to do refinement, while it falls back to generation if $y^0$ is an empty sequence. The agent is modeled by a policy, $\pi$, that maps the current generation to a probability distribution over $\mathcal{A}$. That is, $\pi: \mathcal{Y} \rightarrow P(\mathcal{A})$.

2.2 Actions: Deletion & Insertion

Following the above MDP formulation, with a subsequence $y^k = (y_1, y_2, \ldots, y_n)$, the two atomic actions, deletion and insertion, are called to generate $y^{k+1} = \mathcal{E}(y^k, a^{k+1})$. Here we let $y_1$ and $y_n$ be the special symbols <s> and </s>, respectively. Since we mainly focus on the policy of a single round of generation, the superscripts are normally omitted in this section. For conditional generation like MT, our policy also includes an input of source information $x$, which is also omitted here.

Deletion

The deletion policy reads the input sequence $y$, and for every token $y_i \in y$, the deletion policy $\pi^{\mathrm{del}}(d \mid i, y)$ makes a binary decision, which is 1 (delete this token) or 0 (keep it). We additionally constrain $\pi^{\mathrm{del}}(0 \mid 1, y) = \pi^{\mathrm{del}}(0 \mid n, y) = 1$ to avoid breaking the sequence boundary.

Insertion

It is slightly more complex to build the insertion operation because it involves two phases, placeholder prediction and token prediction, so that it is able to insert multiple tokens at the same slot. First, among all the possible insertion slots $(y_i, y_{i+1})$ in $y$, $\pi^{\mathrm{plh}}(p \mid i, y)$ predicts the possibility of adding one or several placeholders. Then, for every placeholder predicted as above, a token prediction policy $\pi^{\mathrm{tok}}(t \mid i, y)$ replaces the placeholders with actual tokens in the vocabulary.

Policy combination

Recall that our two atomic operations are complementary. Hence we combine them in an alternating fashion. For example, in sequence generation from the empty sequence, the insertion policy is called first, followed by deletion, and the two are repeated until a stopping condition is fulfilled. Indeed, it is possible to leverage the parallelism in this combination. We essentially decompose one iteration of our sequence generator into three phases: “delete tokens – insert placeholders – replace placeholders with new tokens”. Within each phase, all operations are performed in parallel. More precisely, given the current sequence $y = (y_1, \ldots, y_n)$, and supposing the action to predict is $a = (d, p, t)$, where $d$, $p$ and $t$ collect the deletion, placeholder and token decisions, our policy for one iteration is:

$$\pi(a \mid y) = \prod_{d_i \in d} \pi^{\mathrm{del}}(d_i \mid i, y) \cdot \prod_{p_i \in p} \pi^{\mathrm{plh}}(p_i \mid i, y') \cdot \prod_{t_i \in t} \pi^{\mathrm{tok}}(t_i \mid i, y'') \qquad (1)$$

where $y' = \mathcal{E}(y, d)$ and $y'' = \mathcal{E}(y', p)$. We parallelize the computation within each sub-task.
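To make the three phases concrete, here is a minimal sketch of how one iteration edits a token list once the three sets of decisions are known. The function names (`apply_deletion`, `apply_placeholders`, `apply_tokens`, `one_iteration`) are illustrative assumptions, not names from the paper, and the decisions are supplied by hand rather than predicted by a model.

```python
PLH = "<PLH>"  # placeholder symbol, as in the text

def apply_deletion(y, d):
    """Drop y[i] where d[i] == 1; the boundary tokens must be kept."""
    assert len(d) == len(y) and d[0] == 0 and d[-1] == 0
    return [tok for tok, flag in zip(y, d) if flag == 0]

def apply_placeholders(y, p):
    """Insert p[i] placeholders into the slot between y[i] and y[i + 1]."""
    assert len(p) == len(y) - 1
    out = []
    for tok, count in zip(y, p):
        out.append(tok)
        out.extend([PLH] * count)
    out.append(y[-1])
    return out

def apply_tokens(y, t):
    """Replace the placeholders, left to right, with the tokens in t."""
    assert y.count(PLH) == len(t)
    fills = iter(t)
    return [next(fills) if tok == PLH else tok for tok in y]

def one_iteration(y, d, p, t):
    """Delete tokens, insert placeholders, then fill the placeholders."""
    y_deleted = apply_deletion(y, d)
    y_with_plh = apply_placeholders(y_deleted, p)
    return apply_tokens(y_with_plh, t)

y = ["<s>", "a", "x", "c", "</s>"]
result = one_iteration(y, d=[0, 0, 1, 0, 0], p=[0, 1, 0], t=["b"])
print(result)  # ['<s>', 'a', 'b', 'c', '</s>']
```

Note that the placeholder decisions are indexed against the sequence after deletion, and the token decisions against the sequence after placeholder insertion, which mirrors the conditioning on the intermediate sequences in the factorized policy.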

3 Levenshtein Transformer

In this section, we cover the specifics of Levenshtein Transformer and the dual-policy learning algorithm. Overall, our model takes a sequence of tokens (or none) as the input and then iteratively modifies it by alternating between insertion and deletion, until the two policies combined converge. We describe the detailed learning and inference algorithms in the Appendix.

Figure 1:

The overall framework of the decoder of the proposed Levenshtein Transformer. We show how the same architecture can be applied for three different tasks with specific classifiers. For simplicity, the attention between the encoder outputs is omitted within each Transformer-Block.

3.1 Model

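We use Transformer (Vaswani et al., 2017) as the basic building block. For conditional generation, the source $x$ is included in each TransformerBlock. The states from the $l$-th block are:

$$h_1^{(l+1)}, \ldots, h_n^{(l+1)} = \begin{cases} E_{y_1} + P_1, \ldots, E_{y_n} + P_n, & l = 0 \\ \mathrm{TransformerBlock}_l\big(h_1^{(l)}, \ldots, h_n^{(l)}\big), & l > 0 \end{cases} \qquad (2)$$

where $E \in \mathbb{R}^{|\mathcal{V}| \times d_{\mathrm{model}}}$ and $P \in \mathbb{R}^{N_{\max} \times d_{\mathrm{model}}}$ are the token and position embeddings, respectively.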

Policy Classifiers

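The decoder outputs $(h_1, h_2, \ldots, h_n)$ are passed to three policy classifiers:

  1. Deletion Classifier: LevT scans over the input tokens (except for the boundaries) and predicts “deleted” (1) or “kept” (0) for each token position,

    $$\pi_\theta^{\mathrm{del}}(d \mid i, y) = \mathrm{softmax}\big(h_i \cdot A^\top\big), \quad i = 2, \ldots, n-1 \qquad (3)$$

    where $A \in \mathbb{R}^{2 \times d_{\mathrm{model}}}$, and we always keep the boundary tokens.

  2. Placeholder Classifier: LevT predicts the number of tokens to be inserted at every pair of consecutive positions, by casting the representation to a categorical distribution:

    $$\pi_\theta^{\mathrm{plh}}(p \mid i, y) = \mathrm{softmax}\big(\mathrm{concat}(h_i, h_{i+1}) \cdot B^\top\big), \quad i = 1, \ldots, n-1 \qquad (4)$$

    where $B \in \mathbb{R}^{(K_{\max}+1) \times 2 d_{\mathrm{model}}}$. Based on the number ($0, \ldots, K_{\max}$) of tokens it predicts, we insert the corresponding number of placeholders at the current position. In our implementation, a placeholder is represented by a special token <PLH> which is reserved in the vocabulary.

  3. Token Classifier: following the placeholder prediction, LevT needs to fill in tokens replacing all the placeholders. This is achieved by training a token predictor as follows:

    $$\pi_\theta^{\mathrm{tok}}(t \mid i, y) = \mathrm{softmax}\big(h_i \cdot C^\top\big), \quad \forall\, y_i = \text{<PLH>} \qquad (5)$$

    where $C \in \mathbb{R}^{|\mathcal{V}| \times d_{\mathrm{model}}}$, with parameters shared with the embedding matrix.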
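The shapes of the three heads can be illustrated with a small dependency-free sketch. Everything below is an assumption made for illustration: the random weights, the tiny dimensions, and the helper names (`deletion_head`, `placeholder_head`, `token_head`) stand in for trained parameters and real decoder states. Indices in the code are 0-based, so the boundary tokens sit at positions 0 and n-1.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def linear(vec, weight):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def deletion_head(h, A):
    """One binary distribution per non-boundary position."""
    return [softmax(linear(h[i], A)) for i in range(1, len(h) - 1)]

def placeholder_head(h, B):
    """One distribution over 0..K_max per slot between neighbours."""
    # h[i] + h[i + 1] is list concatenation, i.e. concat(h_i, h_{i+1}).
    return [softmax(linear(h[i] + h[i + 1], B)) for i in range(len(h) - 1)]

def token_head(h, C, tokens):
    """One vocabulary distribution per placeholder position."""
    return {i: softmax(linear(h[i], C))
            for i, tok in enumerate(tokens) if tok == "<PLH>"}

rng = random.Random(0)

def rand_matrix(rows, cols):
    """Random stand-in for trained weights or decoder states."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

d_model, k_max, vocab_size = 4, 2, 6   # toy sizes, chosen arbitrarily
tokens = ["<s>", "a", "<PLH>", "c", "</s>"]
h = rand_matrix(len(tokens), d_model)   # pretend decoder outputs
A = rand_matrix(2, d_model)
B = rand_matrix(k_max + 1, 2 * d_model)
C = rand_matrix(vocab_size, d_model)

del_probs = deletion_head(h, A)      # 3 positions, 2 classes each
plh_probs = placeholder_head(h, B)   # 4 slots, k_max + 1 classes each
tok_probs = token_head(h, C, tokens) # 1 placeholder, vocab_size classes
```

In the model itself the three heads would not all read the same states in one pass, since each phase changes the sequence; the sketch only shows which positions each head covers and how large each output distribution is.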

Early Exit

Although it is parameter-efficient to share the same Transformer architecture across the above three heads, there is room for improvement, as one decoding iteration requires three full passes of the network. To trade off performance against computational cost, we propose to perform early exit (attaching the head to an intermediate block instead of the last one) for the deletion and placeholder classifiers ($\pi^{\mathrm{del}}$ and $\pi^{\mathrm{plh}}$), while keeping the token classifier ($\pi^{\mathrm{tok}}$) always based on the last block, considering that token prediction is usually more challenging than the other two tasks.

3.2 Dual-policy Learning

Imitation Learning

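We use imitation learning to train the Levenshtein Transformer. Essentially, we let the agent imitate the behaviors that we draw from some expert policy $\pi^*$. The expert policy is derived either from direct usage of the ground-truth targets or from a less noisy version filtered by sequence distillation (Kim and Rush, 2016). The objective is to maximize the following expectation:

$$\mathbb{E}_{y_{\mathrm{del}} \sim d_{\tilde\pi_{\mathrm{del}}},\ d^* \sim \pi^*} \sum_{d_i^* \in d^*} \log \pi_\theta^{\mathrm{del}}(d_i^* \mid i, y_{\mathrm{del}}) + \mathbb{E}_{y_{\mathrm{ins}} \sim d_{\tilde\pi_{\mathrm{ins}}},\ p^*, t^* \sim \pi^*} \Big[ \sum_{p_i^* \in p^*} \log \pi_\theta^{\mathrm{plh}}(p_i^* \mid i, y_{\mathrm{ins}}) + \sum_{t_i^* \in t^*} \log \pi_\theta^{\mathrm{tok}}(t_i^* \mid i, y'_{\mathrm{ins}}) \Big]$$

where $y'_{\mathrm{ins}}$ is the output after inserting the placeholders $p^*$ into $y_{\mathrm{ins}}$. $\tilde\pi_{\mathrm{del}}$ and $\tilde\pi_{\mathrm{ins}}$ are the roll-in policies, and we repeatedly draw states (sequences) from their induced state distributions $d_{\tilde\pi_{\mathrm{del}}}$ and $d_{\tilde\pi_{\mathrm{ins}}}$. These states are first passed to the expert policy, which returns the suggested actions, and then we maximize the conditional log-likelihood over them.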

Roll-in Policy

Figure 2: The data-flow of learning.

By definition, the roll-in policy determines the state distribution fed to $\pi_\theta$ during training. In this work, we have two strategies to construct the roll-in policy: adding noise to the ground-truth or using the output from the adversary policy. Figure 2 shows a diagram of this learning paradigm. We formally write down the roll-in policies as follows.

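  1. Learning to Delete: we design $\tilde\pi_{\mathrm{del}}$ as a stochastic mixture between the initial input $y^0$ and the output of applying insertion from the model, with some mixture factor $\alpha \in [0, 1]$:

    $$d_{\tilde\pi_{\mathrm{del}}} = \begin{cases} y^0, & u < \alpha \\ \mathcal{E}\big(\mathcal{E}(y', p^*), \tilde t\big),\ \ p^* \sim \pi^*,\ \tilde t \sim \pi_\theta, & \text{otherwise} \end{cases} \qquad (6)$$

    where $u \sim \mathrm{Uniform}[0, 1]$ and $y'$ is any sequence ready to insert tokens. $\tilde t$ is obtained by sampling instead of taking the argmax from Eq. (5).

  2. Learning to Insert: similar to the deletion step, we apply a mixture of the deletion output and a random word-dropping sequence of the ground-truth, inspired by recent advances in training masked language models (Devlin et al., 2018). We use random dropping as a form of noise injection to encourage more exploration. Let $\beta \in [0, 1]$ and $u \sim \mathrm{Uniform}[0, 1]$,

    $$d_{\tilde\pi_{\mathrm{ins}}} = \begin{cases} \mathcal{E}(y^0, d^*),\ \ d^* \sim \pi^*, & u < \beta \\ \mathcal{E}(y^*, \tilde d),\ \ \tilde d \sim \pi^{\mathrm{RND}}, & \text{otherwise} \end{cases} \qquad (7)$$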
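The roll-in for learning to insert can be sketched in a few lines. This is only an illustration under stated assumptions: the per-token drop probability `drop_prob` and the function names are hypothetical (the text does not say how the random dropping is parameterized), and the expert-deleted sequence is passed in precomputed rather than derived from an oracle.

```python
import random

def random_drop(y_star, drop_prob, rng):
    """Drop each inner token of the target independently; keep <s> and </s>."""
    inner = [tok for tok in y_star[1:-1] if rng.random() >= drop_prob]
    return [y_star[0]] + inner + [y_star[-1]]

def roll_in_for_insertion(y_expert_deleted, y_star, beta, drop_prob, rng):
    """Mix the expert-deleted input with a randomly corrupted target."""
    if rng.random() < beta:
        return list(y_expert_deleted)
    return random_drop(y_star, drop_prob, rng)

def is_subsequence(short, long):
    """True if `short` can be obtained from `long` by deletions only."""
    remaining = iter(long)
    return all(tok in remaining for tok in short)

rng = random.Random(0)
y_star = ["<s>", "the", "cat", "sat", "down", "</s>"]
y_expert_deleted = ["<s>", "the", "sat", "</s>"]  # hand-written stand-in
state = roll_in_for_insertion(y_expert_deleted, y_star,
                              beta=0.5, drop_prob=0.3, rng=rng)
print(state)
```

Either branch yields a sequence from which the target can be reached by insertions alone, which is what makes it a usable training state for the placeholder and token classifiers.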


Expert Policy

It is crucial to construct an expert policy in imitation learning that is neither too hard nor too weak to learn from. Specifically, we consider two types of experts:

  1. Oracle: One way is to build an oracle which has access to the ground-truth sequence. It returns the optimal actions $a^*$ (either oracle insertion $p^*, t^*$ or oracle deletion $d^*$) by:

    $$a^* = \arg\min_{a} \mathcal{D}\big(y^*, \mathcal{E}(y, a)\big)$$

    Here, we use the Levenshtein distance (Levenshtein, 1965) as $\mathcal{D}$, considering that it is possible to obtain the action suggestions efficiently by dynamic programming. (We only consider the variant that computes insertions and deletions; no substitution is considered.)

  2. Teacher Model: We also explore using another teacher model to provide the expert policy, which is known as knowledge distillation (Kim and Rush, 2016). This technique has been widely used in previous approaches for non-autoregressive generation (Gu et al., 2018). More precisely, we first train an autoregressive teacher model using the same datasets and then replace the ground-truth sequence $y^*$ by the beam-search result of this teacher model, $y^{\mathrm{AR}}$. We use the same mechanism to find the suggested option as when using the ground-truth oracle.
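With insertions and deletions only, the edit distance between two sequences is determined by their longest common subsequence (LCS), so the oracle actions can be read off an LCS alignment. The sketch below is one possible way to do this, not the authors' implementation: `lcs_table` and `oracle_actions` are hypothetical names, ties between equally good alignments are broken arbitrarily, and both sequences are assumed to be wrapped in <s> and </s>.

```python
def lcs_table(a, b):
    """table[i][j] = length of the LCS of a[i:] and b[j:]."""
    n, m = len(a), len(b)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    return table

def oracle_actions(y, y_star):
    """Return (d, p, t) for turning y into y_star.

    d[i] == 1 marks y[i] for deletion, p[k] is the number of tokens to
    insert after the k-th kept token, and t lists the inserted tokens
    from left to right.
    """
    table = lcs_table(y, y_star)
    d, p, t = [], [], []
    pending = 0   # insertions collected since the last kept token
    kept = 0      # number of kept tokens seen so far
    i = j = 0
    while i < len(y) and j < len(y_star):
        if y[i] == y_star[j]:                      # keep this token
            if kept > 0:
                p.append(pending)
            pending = 0
            kept += 1
            d.append(0)
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:   # delete y[i]
            d.append(1)
            i += 1
        else:                                      # insert y_star[j]
            t.append(y_star[j])
            pending += 1
            j += 1
    return d, p, t

d, p, t = oracle_actions(["<s>", "a", "x", "c", "</s>"],
                         ["<s>", "a", "b", "c", "</s>"])
print(d, p, t)  # [0, 0, 1, 0, 0] [0, 1, 0] ['b']
```

The number of ones in `d` plus the length of `t` equals the insertion/deletion-only edit distance, here 2. Starting from the empty sequence `["<s>", "</s>"]`, the same function returns no deletions and a single slot that receives every target token.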

3.3 Inference

Greedy Decoding

At inference time, we apply the trained model over the initial sequence $y^0$ for several iterations. We greedily pick the actions associated with the highest probabilities in Eq. (3), (4) and (5). Moreover, we find that using search (instead of greedy decoding) does not yield much gain in LevT. This observation is quite opposite to what has been widely observed in autoregressive decoding. We hypothesize that this is because the local optimum reached by greedy decoding in autoregressive models is often far from the global optimum. Search techniques resolve this issue with tabularization. In our case, however, because LevT inserts or deletes tokens dynamically, it can easily revoke tokens that are found to be sub-optimal and re-insert better ones.

Termination Condition

The decoding stops when one of the following conditions is fulfilled:

  1. Nothing to delete, nothing to insert: The policy chooses to keep all the current tokens and predicts “empty” placeholders everywhere.

  2. Direct-Loop: Unfortunately, our MDP assumption cannot avoid the situations where the agent gets stuck in an infinite loop; i.e. the insertion and deletion counter each other and keep looping. Although this is not common, we terminate the decoding once this is spotted.

  3. Timeout: We further set a maximum number of iterations (timeout) to guarantee a constant-time complexity in the worst case (Lee et al., 2018; Ghazvininejad et al., 2019).

Penalty for Empty Placeholders

Similar to Stern et al. (2019), we add a penalty for inserting “empty” placeholders in decoding. Overly inserting “empty” placeholders may result in shorter output. A penalty term $\gamma$ is subtracted from the logit of the “empty” (zero placeholders) class in Eq. (4).
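The effect of the penalty can be seen on a single slot. In this small sketch the logits and the value of the penalty are made-up numbers, and `penalize_empty` is a hypothetical helper; it assumes that class 0 of the placeholder classifier means "insert nothing".

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def penalize_empty(logits, gamma):
    """Subtract gamma from the logit of class 0 (zero placeholders)."""
    adjusted = list(logits)
    adjusted[0] -= gamma
    return adjusted

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

# Made-up placeholder logits for one slot: classes 0, 1 and 2 placeholders.
logits = [2.0, 1.9, 0.1]
plain = argmax(softmax(logits))                                 # picks 0
penalized = argmax(softmax(penalize_empty(logits, gamma=0.5)))  # picks 1
print(plain, penalized)
```

A larger penalty pushes the decoder toward inserting more placeholders and therefore toward longer outputs, so it acts as a length control at inference time.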

4 Experiments

We validate the efficiency, effectiveness, and flexibility of Levenshtein Transformer extensively across three different tasks — machine translation (MT), text summarization (TS) and automatic post-editing (APE) for machine translation, from both generation (§4.1) and refinement (§4.2) perspectives.

4.1 Sequence Generation

For the sequence generation perspective, we evaluate LevT model on MT and TS. As a special case, sequence generation assumes empty as input and no initial deletion is applied.

Data & Evaluation

We use three diversified language pairs for MT experiments: WMT’16 Romanian-English (Ro-En; http://www.statmt.org/wmt16/translation-task.html), WMT’14 English-German (En-De; http://www.statmt.org/wmt14/translation-task.html) and WAT2017 Small-NMT English-Japanese (En-Ja; http://lotus.kuee.kyoto-u.ac.jp/WAT/WAT2017/snmt/index.html). The TS experiments use preprocessed data from the Annotated English Gigaword (Rush et al., 2015; https://github.com/harvardnlp/sent-summary). We learn a byte-pair encoding (BPE, Sennrich et al., 2016) vocabulary on tokenized data. Detailed dataset statistics can be found in the Appendix. For evaluation metrics, we use BLEU (Papineni et al., 2002) for MT and ROUGE-1,2,L (Lin, 2004) for TS. Before computing the BLEU scores for Japanese output, we always segment Japanese words using KyTea (http://www.phontron.com/kytea/).

Models & Training

We adopt the model architecture of Transformer base (Vaswani et al., 2017) for the proposed LevT model and the autoregressive baseline. All the Transformer-based models are trained on Nvidia Volta GPUs with maximum steps and a total batch-size of around tokens per step (We leave more details to the Appendix).

Figure 3: An example of WAT’17 En-Ja translation with two decoder iterations by LevT. We present the inserted tokens in purple and deleted tokens with red strikethrough


| | Dataset | Metric | Transformer (greedy) | Transformer (beam4) | LevT (oracle) | LevT (teacher) |
|---|---|---|---|---|---|---|
| Quality | Ro-En | BLEU | 31.67 | 32.30 | 33.02 | - |
| | En-De | BLEU | 26.02 | 26.56 | 24.43 | 26.67 |
| | En-Ja | BLEU | 42.86 | 43.68 | 42.36 | 43.17 |
| | Gigaword | ROUGE-1 | 34.91 | 35.19 | 35.57 | 36.08 |
| | | ROUGE-2 | 17.05 | 17.58 | 17.11 | 18.33 |
| | | ROUGE-L | 32.66 | 32.98 | 33.55 | 33.81 |
| Speed | Ro-En | Latency (ms) / iterations | 326 / 27.1 | 349 / 27.1 | 97 / 2.19 | - |
| | En-De | Latency (ms) / iterations | 343 / 28.1 | 369 / 28.1 | 126 / 2.88 | 92 / 2.05 |
| | En-Ja | Latency (ms) / iterations | 261 / 22.6 | 306 / 22.6 | 112 / 2.61 | 106 / 1.97 |
| | Gigaword | Latency (ms) / iterations | 116 / 10.1 | 149 / 10.1 | 98 / 2.32 | 84 / 1.73 |

Table 1: Generation quality (BLEU, ROUGE-1/2/L) and latency (ms), as well as the average number of decoder iterations, on the standard test sets for LevT and the autoregressive baseline (with both greedy and beam-search outputs). We show the results of LevT trained from both the oracle and the autoregressive teacher model.

Overall results

We present our main results on the generation quality and decoding speed in Table 1. We measure the speed by the averaged latency of generating one sequence at a time on a single Nvidia V100 GPU. To remove the implementation bias, we also present the number of decoder iterations as a reference. It can be concluded that for both MT and summarization tasks, our proposed LevT achieves comparable and sometimes better generation quality compared to the strong autoregressive baseline, while LevT is much more efficient at decoding. A translation example is shown in Figure 3 and we leave more to the Appendix.

Oracle v.s. Teacher Model

As shown in Table 1, training the LevT with the teacher model achieves better results than the oracle counterpart in most cases. Another interesting observation is that the model trained with the teacher has lower latency. We conjecture that this is because the output of the teacher model possesses fewer modes and is much less noisy than the real data. Consequently, LevT needs fewer iterations to converge to this expert policy.

(a) Number of decoding iterations v.s. output length, measured on the test set of Ro-En (only a subset of the iterations is plotted for better visualization). Most of the time, LevT decodes with a much smaller number of iterations (generally 1 to 3).
(b) BLEU scores v.s. latency speed-up for LevT with different early-exit layers and for the autoregressive baselines on the test set of Ro-En.
Figure 4: Plots showing the decoding efficiency of the proposed Levenshtein Transformer.

Analysis of Efficiency

As shown in Figure 4 (a), our model learns to properly terminate the decoding and adjusts the number of iterations based on the length of the input. We also explore variants of “early exit”, where we denote LevT($m$-$n$) as a model with $m$ and $n$ blocks for deletion (Eq. (3)) and placeholder prediction (Eq. (4)), respectively. Figure 4 (b) shows that, although it compromises the generation quality a bit, our model with early exit achieves a considerable speed-up with on-par performance compared with a strong autoregressive Transformer using beam search.

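| Roll-in policy | BLEU | Deletion loss |
|---|---|---|
| Mixture roll-in (Eq. (6)) | 33.02 | 0.202 |
| DAE (no mixing) | 31.78 | 0.037 |

Table 2: Test BLEU and training loss for deletion using different roll-in policies on the WMT Ro-En dataset.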

Importance of mixture roll-in policy

We perform an ablation study on the learning algorithm. Specifically, we train a model without mixing in the $\pi_\theta$ outputs of Equation (6). We name this experiment DAE due to its resemblance to a denoising autoencoder. We closely follow a standard pipeline established by Lee et al. (2018). Table 2 shows this comparison. As we can see, the deletion loss from DAE is much smaller, while the generation BLEU score is inferior. We conjecture that this is caused by the mismatch between the states from the model and the roll-in policy in training the DAE.

4.2 Sequence Refinement

We evaluate LevT’s capability of refining sequence outputs on the APE task. In this setting, inputs are pairs of the source sequence and a black-box MT system generation. The ground-truth outputs are from real human edits with expansion using synthetic data.


We follow a normal protocol in the synthetic APE experiments (Grangier and Auli, 2017): we first train the input MT system on half of the dataset. Then we train a refinement model on the other half based on the output produced by the MT model trained in the previous phase. For the real APE tasks, we use the data from the WMT17 Automatic Post-Editing Shared Task (http://www.statmt.org/wmt17/ape-task.html) on En-De. It contains both real PE triples and a large-scale synthetic corpus.

Models & Evaluation

The baseline model is a standard Transformer encoding the concatenation of the source and the MT system’s output. For the MT system here, we want some imperfect systems that need to be refined. We consider a statistical phrase-based MT system (PBMT, Koehn et al., 2003) and an RNN-based NMT system (Bahdanau et al., 2015). Apart from BLEU scores, we additionally apply translation error rate (TER, Snover et al., 2006) as it is widely used in the APE literature.

| | Dataset | MT system | Do-Nothing | Transformer | LevT (Scratch) | LevT (Zero-shot) | LevT (Fine-tune) |
|---|---|---|---|---|---|---|---|
| Synthetic | Ro-En | PBMT | 27.5 / 52.6 | 28.9 / 52.8 | 29.1 / 50.4 | 30.1 / 51.7 | - |
| | Ro-En | NMT | 26.2 / 56.5 | 26.9 / 55.6 | 28.3 / 53.6 | 28.0 / 55.8 | - |
| | En-De | PBMT | 15.4 / 69.4 | 22.8 / 61.0 | 25.8 / 56.6 | 16.5 / 69.6 | - |
| | En-Ja | NMT | 37.7 / 48.0 | 41.0 / 44.9 | 42.2 / 44.3 | 39.4 / 47.5 | - |
| Real | En-De | PBMT | 62.5 / 24.5 | 67.2 / 22.1 | 66.9 / 21.9 | 59.6 / 28.7 | 70.1 / 19.2 |

Table 3: Performance (BLEU / case-sensitive TER) comparison on APE. “Do-Nothing” represents the results of the original MT system output; the autoregressive model uses beam search. For the proposed LevT, we use “Scratch” to denote training from scratch on the APE triple data, and “Zero-shot” to denote applying an MT pre-trained LevT model directly to post-editing tasks. The same model can be further fine-tuned. All scores with underlines are from the model trained with an autoregressive teacher model as the expert policy.

Overall results

We show the major comparison in Table 3. When training from scratch, LevT consistently improves the performance of the input MT system (either PBMT or NMT). It also achieves better performance than the autoregressive Transformer in most of the cases.

Pre-training on MT

Thanks to the generality of the LevT model, we show it is feasible to directly apply a LevT model trained for generation to refinement tasks, in this case from MT to APE. We name this a “zero-shot post-editing” setting. According to Table 3, the pre-trained MT models are always capable of improving the initial MT input in the synthetic tasks.

The real APE task, however, differs quite a bit from the synthetic tasks because human translators normally only fix a few spotted errors. This ends up with very high BLEU scores even for the “Do-nothing” column. However, the pre-trained MT model achieves the best results by fine-tuning on the PE data indicating that LevT is able to leverage the knowledge for generation and refinement.

(a) Test set BLEU scores for WMT Ro-En
(b) Test set TER scores for Real APE En-De
Figure 5: MT & PE Performance v.s. Timeout iterations w/o oracle instructions.

Collaborate with Oracle

Thanks to the separation of insertion and deletion operations, LevT has better interpretability and controllability. For example, we test LevT’s ability to adapt to oracle (e.g. human translator) instructions. As shown in Figure 5, both MT and PE tasks improve substantially if the oracle deletion is given at every step. The improvement goes even further if the oracle provides both the correct deletion and the number of placeholders to insert. It also sheds some light on computer-assisted text editing for human translators.

5 Related Work

Non-Autoregressive or Non-Monotonic Decoding

Breaking the autoregressive constraints and monotonic (left-to-right) decoding order in classic neural sequence generation systems has recently attracted much interest. Stern et al. (2018); Wang et al. (2018) designed partially parallel decoding schemes to output multiple tokens at each step. Gu et al. (2018) proposed a non-autoregressive framework using discrete latent variables, which was later adopted in Lee et al. (2018) as iterative refinement process. Ghazvininejad et al. (2019) introduced the masked language modeling objective from BERT (Devlin et al., 2018) to non-autoregressively predict and refine translations. Welleck et al. (2019); Stern et al. (2019); Gu et al. (2019) generate translations non-monotonically by adding words to the left or right of previous ones or by inserting words in arbitrary order to form a sequence.

Editing-Based Models

Novak et al. (2016) predict and apply token substitutions iteratively on phrase-based MT system outputs using a convolutional neural network. QuickEdit (Grangier and Auli, 2017) and deliberation network (Xia et al., 2017) both consist of two autoregressive decoders, where the second decoder refines the translation generated by the first decoder. Guu et al. (2018) propose a neural editor which learns language modeling by first retrieving a prototype and then editing it. Freitag et al. (2019) correct patterned errors in MT system outputs using transformer models trained on monolingual data.

6 Conclusion

We propose Levenshtein Transformer, a neural sequence generation model based on insertion and deletion. The resulting model achieves both competitive performance and decoding efficiency, and covers sequence generation and refinement in one model. The insertion and deletion operations are arguably more similar to how humans write or edit text. For future work, this model could be extended to human-in-the-loop generation.


We would like to thank Kyunghyun Cho, Marc’Aurelio Ranzato, Douwe Kiela, Qi Liu and our colleagues at Facebook AI Research for valuable feedback, discussions and technical assistance.


Appendix A Learning & Inference Algorithm

We present the detailed algorithms for learning and decoding from Levenshtein Transformer as follows. For simplicity, we always omit the source information in conditional sequence generation tasks such as machine translation, which is handled by cross-attention with an encoder on the source sequence.

The learning algorithm is shown in Algorithm 1. $\mathcal{E}$ is the environment and $\mathcal{D}$ denotes the Levenshtein distance, from which we can easily back-track the optimal insertion and deletion operations through dynamic programming. We only show the case with a batch size of one for convenience. We also present the inference algorithm in Algorithm 2. If the initial sequence is empty (<s></s>), the proposed model skips the first deletion and performs sequence generation. Otherwise, the model starts with deletion operations and refines the input sequence.

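  Initialize: Training set $\mathcal{T}$, expert policy $\pi^*$, model policy $\pi_\theta$, random deletion policy $\pi^{\mathrm{RND}}$, $\alpha \in [0, 1]$, $\beta \in [0, 1]$
  repeat
     Sample a training pair $(y^0, y^*) \sim \mathcal{T}$
     if expert is a teacher model then
        Set the teacher’s output as the target: $y^* = y^{\mathrm{AR}}$
     end if
     if $u \sim \mathrm{Uniform}[0, 1] < \beta$ then
        $y_{\mathrm{ins}} = \mathcal{E}(y^0, d^*)$, where $d^* \sim \pi^*$
     else
        $y_{\mathrm{ins}} = \mathcal{E}(y^*, \tilde d)$, where $\tilde d \sim \pi^{\mathrm{RND}}$
     end if
     $y'_{\mathrm{ins}} = \mathcal{E}(y_{\mathrm{ins}}, p^*)$, where $p^*, t^* \sim \pi^*$
     if $u \sim \mathrm{Uniform}[0, 1] < \alpha$ then
        $y_{\mathrm{del}} = y^0$
     else
        $y_{\mathrm{del}} = \mathcal{E}(y'_{\mathrm{ins}}, \tilde t)$, where $\tilde t \sim \pi_\theta$
     end if
     Update $\theta$ by a gradient step on the log-likelihood of $d^*$ (given $y_{\mathrm{del}}$), $p^*$ (given $y_{\mathrm{ins}}$) and $t^*$ (given $y'_{\mathrm{ins}}$), where $d^* \sim \pi^*$ is drawn for $y_{\mathrm{del}}$
  until Maximum training steps reached
Algorithm 1 Learning for Levenshtein Transformer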
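  Initialize: Input $y = y^0$, step $t = 0$, maximum step $T_{\max}$, model policy $\pi_\theta$.
  repeat
     if $y$ = <s></s> then
        Empty sequence, skip deletion: $y' = y$
     else
        Delete tokens: $y' = \mathcal{E}(y, \tilde d)$, where $\tilde d = \arg\max_d \pi_\theta^{\mathrm{del}}(d \mid y)$
     end if
     if ($t > 0$) & ($y' = y'_{\mathrm{prev}}$) then
        Termination condition satisfied: direct loop
     end if
     Assign deleted output for back-up: $y'_{\mathrm{prev}} = y'$
     Insert placeholders: $y'' = \mathcal{E}(y', \tilde p)$, where $\tilde p = \arg\max_p \pi_\theta^{\mathrm{plh}}(p \mid y')$
     if $y'' = y$ then
        Termination condition satisfied: nothing to delete, nothing to insert.
     end if
     if $y'' = y'$ then
        Nothing to insert, skip insertion: $y = y''$
     else
        Replace placeholders: $y = \mathcal{E}(y'', \tilde t)$, where $\tilde t = \arg\max_t \pi_\theta^{\mathrm{tok}}(t \mid y'')$
     end if
     Update steps: $t = t + 1$
  until the maximum step $T_{\max}$ is reached
Algorithm 2 Decoding for Levenshtein Transformer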
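The decoding loop and its three termination conditions can be sketched as ordinary Python. This is an illustrative reading of the procedure, not the released implementation: the three policies are passed in as plain callables, and the toy policies at the bottom (`no_delete`, `grow`, `fill`) are hypothetical stand-ins that insert a single token per iteration, whereas a trained model would typically insert many tokens in parallel.

```python
PLH = "<PLH>"

def decode(y, delete_fn, placeholder_fn, token_fn, max_steps=10):
    """Alternate deletion and insertion until a stopping condition holds.

    Returns (sequence, reason, steps), where reason is one of
    'converged', 'loop' or 'timeout'.
    """
    prev_deleted = None
    for step in range(max_steps):
        # Phase 1: deletion, skipped for the empty sequence <s></s>.
        if len(y) > 2:
            d = delete_fn(y)
            y_del = [tok for tok, flag in zip(y, d) if flag == 0]
        else:
            y_del = list(y)
        # Same post-deletion state as in the previous step: stop.
        if step > 0 and y_del == prev_deleted:
            return y_del, "loop", step
        prev_deleted = y_del
        # Phase 2: insert placeholders into the slots between tokens.
        p = placeholder_fn(y_del)
        y_plh = []
        for tok, count in zip(y_del, list(p) + [0]):
            y_plh.append(tok)
            y_plh.extend([PLH] * count)
        # Nothing deleted and nothing inserted: stop.
        if y_plh == y:
            return y, "converged", step
        # Phase 3: replace placeholders with predicted tokens.
        fills = iter(token_fn(y_plh))
        y = [next(fills) if tok == PLH else tok for tok in y_plh]
    return y, "timeout", max_steps

# Toy policies that spell out a fixed target, one token per iteration.
target = ["a", "b", "c"]

def no_delete(y):
    return [0] * len(y)

def grow(y):
    p = [0] * (len(y) - 1)
    if len(y) - 2 < len(target):
        p[-1] = 1   # one placeholder just before </s>
    return p

def fill(y_plh):
    return [target[i - 1] for i, tok in enumerate(y_plh) if tok == PLH]

output, reason, steps = decode(["<s>", "</s>"], no_delete, grow, fill)
print(output, reason, steps)
```

Because the loop check runs before the convergence check, a state that becomes stable right after a deletion-only step is reported as "loop" rather than "converged"; either way decoding stops with the same sequence.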

Appendix B Dataset and Preprocessing Details

Tables 4 and 5 list the statistics (number of sentences, vocabulary size) for all the datasets used in this work. We learn joint BPE vocabularies for WMT En-De, Gigaword and WMT Ro-En. For WAT En-Ja, we adopt the official BPE vocabularies learned separately on the source and target sides.

| | Dataset | Train | Valid | Test | Vocabulary |
|---|---|---|---|---|---|
| Translation | WMT’16 Ro-En | 608,319 | 1999 | 1999 | 34,983 |
| | WMT’14 En-De | 4,500,966 | 3000 | 3003 | 37,009 |
| | WAT’17 En-Ja | 2,000,000 | 1790 | 1812 | 17,952 / 17,801 |
| Summarization | English Gigaword | 3,803,957 | 189,651 | 1951 | 30,004 |

Table 4: Dataset statistics for sequence generation tasks (MT and TS).

| | Dataset | MT-Train | APE-Train | Valid | Test | Vocabulary |
|---|---|---|---|---|---|---|
| Synthetic | WMT’16 Ro-En | 300,000 | 308,319 | 1999 | 1999 | 34,983 |
| | WMT’14 En-De | 2,250,000 | 2,250,967 | 3000 | 3003 | 37,009 |
| | WAT’17 En-Ja | 1,000,000 | 1,000,000 | 1790 | 1812 | 17,952 / 17,801 |
| Real | En-De | | 526,368 (fake) + 24,000 (real) | 2000 | 2000 | 40,349 |

Table 5: Dataset statistics for sequence refinement tasks (APE).
Figure 6: Averaged number of decoding iterations v.s. length of the source sentences on Romanian (Ro) monolingual corpus.

Appendix C Model and Training Details

C.1 Sequence Generation Tasks

Transformer models are used for autoregressive baselines as well as teacher models (for the expert policy). By default, all models follow the Transformer base configuration (§4.1). Source and target sides share embeddings in all the training pairs except for WAT En-Ja, where the BPE vocabularies of the two sides are learned separately and are almost non-overlapping.

Since the training objectives for Levenshtein Transformer contain random terms (Eq. (6) and (7)), we use validation BLEU (for MT) or ROUGE-2 (for TS), rather than the loss, to select the best checkpoint. We do not average checkpoints in this work.

C.2 Sequence Refinement Tasks

For synthetic APE tasks, we keep the same training conditions for LevT as those for MT tasks (§C.1). As described earlier in §4.2, we build the baseline Transformer by concatenating the source and the MT system’s output as the input sequence for the encoder. Specifically, we restart the positional embeddings for the MT output and add an additional language embedding for each token of the input sequence to show its language type. The detailed hyperparameters are the same as for the standard Transformer.

As described in §4.2, we consider the following two different imperfect MT systems to provide the refinement inputs. Firstly, we consider the traditional statistical phrase-based machine translation system (PBMT). We follow the instructions to build the basic baseline model via Moses (http://www.statmt.org/moses/?n=Moses.Baseline). As for the NMT-based model, we use a single-layer attention-based model composed of LSTMs. We build this model on fairseq-py (https://github.com/pytorch/fairseq/blob/master/fairseq/models/lstm.py) with the default configuration.

For the real APE task, we follow the procedures introduced in Junczys-Dowmunt and Grundkiewicz (2016). Synthetic corpus has two subsets: a 500K one and a 4M one. We over-sample real data by 10 times and merge it with the 500K synthetic data to train APE models. Besides, we also train a LevT MT model on the bigger (4M) synthetic corpus where we only use the source and target pairs.

C.3 Implementation

Both the proposed Levenshtein Transformer and the baseline Transformer are implemented using PyTorch (https://pytorch.org/). The code will be released upon acceptance.

Appendix D Balanced Speed Test on All Lengths

We see from Figure 4 (a) that most of the translations are obtained within a few iterations. Long sentences (e.g. with over 100 tokens), however, are relatively underrepresented in the validation set and have sparser data points than shorter ones.

To mitigate this bias, we add extra data points for Ro-En by selecting and translating long sentences from a monolingual corpus based on News Crawl. Specifically, we group Ro sentences by length and randomly sample sentences from each group to decode. We show the averaged number of decoding iterations of each length group in Figure 6.

We see that the time complexity of the proposed LevT is approximately linear (not constant) in the input length, but with a much smaller ratio of iterations to tokens than standard autoregressive models, which need one iteration per token.

Appendix E More Decoding Examples

We present more examples from the proposed Levenshtein Transformer as follows.

Figure 7: Translation examples for WAT’17 Small-NMT En-Ja with the Levenshtein Transformer.
Figure 8: Translation examples for WMT’16 Ro-En with the Levenshtein Transformer.
Figure 9: Translation examples for WMT’14 En-De with the Levenshtein Transformer.
Figure 10: Translation examples for English Gigaword with the Levenshtein Transformer.
Figure 11: Post-editing examples for WMT’17-APE En-De with the Levenshtein Transformer.
Figure 12: An example of machine translation and zero-shot post-editing over a PBMT system’s output on WMT’16 Ro-En with the Levenshtein Transformer (LevT) trained for MT. It is clear that the pre-trained LevT can directly adapt to the PBMT output and produces different refinement results compared to translating from scratch.