Monotonic Infinite Lookback Attention for Simultaneous Machine Translation

by Naveen Arivazhagan, et al.

Simultaneous machine translation begins to translate each source sentence before the source speaker is finished speaking, with applications to live and streaming scenarios. Simultaneous systems must carefully schedule their reading of the source sentence to balance quality against latency. We present the first simultaneous translation system to learn an adaptive schedule jointly with a neural machine translation (NMT) model that attends over all source tokens read thus far. We do so by introducing Monotonic Infinite Lookback (MILk) attention, which maintains both a hard, monotonic attention head to schedule the reading of the source sentence, and a soft attention head that extends from the monotonic head back to the beginning of the source. We show that MILk's adaptive schedule allows it to arrive at latency-quality trade-offs that compare favorably to those of a recently proposed wait-k strategy for many latency values.



1 Introduction

Simultaneous machine translation (MT) addresses the problem of how to begin translating a source sentence before the source speaker has finished speaking. This capability is crucial for live or streaming translation scenarios, such as speech-to-speech translation, where waiting for one speaker to complete their sentence before beginning the translation would introduce an intolerable delay. In these scenarios, the MT engine must balance latency against quality: if it acts before the necessary source content arrives, translation quality degrades; but waiting for too much source content can introduce unnecessary delays. We refer to the strategy an MT engine uses to balance reading source tokens against writing target tokens as its schedule.

Recent work in simultaneous machine translation tends to fall into one of two bins:

  • The schedule is learned and/or adaptive to the current context, but assumes a fixed MT system trained on complete source sentences, as typified by wait-if-* Cho and Esipova (2016) and reinforcement learning approaches Grissom II et al. (2014); Gu et al. (2017).

  • The schedule is simple and fixed and can thus be easily integrated into MT training, as typified by wait-k approaches Dalvi et al. (2018); Ma et al. (2018).

Neither scenario is optimal. A fixed schedule may introduce too much delay for some sentences, and not enough for others. Meanwhile, a fixed MT system that was trained to expect complete sentences may impose a low ceiling on any adaptive schedule that uses it. Therefore, we propose to train an adaptive schedule jointly with the underlying neural machine translation (NMT) system.

Monotonic attention mechanisms Raffel et al. (2017); Chiu and Raffel (2018) are designed for integrated training in streaming scenarios and provide our starting point. They encourage streaming by confining the scope of attention to the most recently read tokens. This restriction, however, may hamper long-distance reorderings that can occur in MT. We develop an approach that removes this limitation while preserving the ability to stream.

We use their hard, monotonic attention head to determine how much of the source sentence is available. Before writing each target token, our learned model advances this head zero or more times based on the current context, with each advancement revealing an additional token of the source sentence. A secondary, soft attention head can then attend to any source words at or before that point, resulting in Monotonic Infinite Lookback (MILk) attention. This, however, removes the memory constraint that was encouraging the model to stream. To restore streaming behaviour, we propose to jointly minimize a latency loss. The entire system can efficiently be trained in expectation, as a drop-in replacement for the familiar soft attention.

Our contributions are as follows:

  1. We present MILk attention, which allows us to build the first simultaneous MT system to learn an adaptive schedule jointly with an NMT model that attends over all source tokens read thus far.

  2. We extend the recently-proposed Average Lagging latency metric Ma et al. (2018), making it differentiable and calculable in expectation, which allows it to be used as a training objective.

  3. We demonstrate trade-offs that compare favorably to those of wait-k strategies at many latency values, and provide evidence that MILk’s advantage stems from its ability to adapt based on source content.

(a) Soft attention.
(b) Monotonic attention.
(c) MILk attention.
Figure 1: Simplified diagrams of the attention mechanisms discussed in Sections 3.1 and 3.2. The shading of each node indicates the amount of attention weight the model assigns to a given encoder state (horizontal axis) at a given output timestep (vertical axis).

2 Background

Much of the earlier work on simultaneous MT took the form of strategies to chunk the source sentence into partial segments that can be translated safely. These segments could be triggered by prosody Fügen et al. (2007); Bangalore et al. (2012) or lexical cues Rangarajan Sridhar et al. (2013), or optimized directly for translation quality Oda et al. (2014). Segmentation decisions are surrogates for the core problem, which is deciding whether enough source content has been read to write the next target word correctly Grissom II et al. (2014). However, since doing so involves discrete decisions, learning via back-propagation is obstructed. Previous work on simultaneous NMT has thus far side-stepped this problem by making restrictive simplifications, either on the underlying NMT model or on the flexibility of the schedule.

Cho and Esipova (2016) apply heuristic measures to estimate and then threshold the confidence of an NMT model trained on full sentences, adapting it at inference time to the streaming scenario. Several others use reinforcement learning (RL) to develop an agent to predict read and write decisions Satija and Pineau (2016); Gu et al. (2017); Alinejad et al. (2018). However, due to computational challenges, they pre-train an NMT model on full sentences and then train an agent that sees the fixed NMT model as part of its environment.

Dalvi et al. (2018) and Ma et al. (2018) use fixed schedules and train their NMT systems accordingly. In particular, Ma et al. (2018) advocate for a wait-k strategy, wherein the system always waits for exactly $k$ tokens before beginning to translate, and then alternates between reading and writing at a constant pre-specified emission rate. Due to the deterministic nature of their schedule, they can easily train the NMT system with the schedule in place. This can allow the NMT system to learn to anticipate missing content using its inherent language modeling capabilities. On the downside, with a fixed schedule the model cannot speed up or slow down appropriately for particular inputs.

Press and Smith (2018) recently developed an attention-free model that aims to reduce computational and memory requirements. They achieve this by maintaining a single running context vector, and eagerly emitting target tokens based on it whenever possible. Their method is adaptive and uses integrated training, but the schedule itself is trained with external supervision provided by word alignments, while ours is latent and learned in service to the MT task.

3 Methods

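In sequence-to-sequence modeling, the goal is to transform an input sequence $\mathbf{x} = \{x_1, \ldots, x_{|\mathbf{x}|}\}$ into an output sequence $\mathbf{y} = \{y_1, \ldots, y_{|\mathbf{y}|}\}$. A sequence-to-sequence model consists of an encoder which maps the input sequence to a sequence of hidden states and a decoder which conditions on the encoder output and autoregressively produces the output sequence. In this work, we consider sequence-to-sequence models where the encoder and decoder are both recurrent neural networks (RNNs) and are updated as follows:

$$h_j = \mathrm{EncoderRNN}(x_j, h_{j-1}) \tag{1}$$
$$s_i = \mathrm{DecoderRNN}(y_{i-1}, s_{i-1}, c_i) \tag{2}$$
$$y_i = \mathrm{Output}(s_i, c_i) \tag{3}$$

where $h_j$ is the encoder state at input timestep $j$, $s_i$ is the decoder state at output timestep $i$, and $c_i$ is a context vector. The context vector is computed based on the encoder hidden states through the use of an attention mechanism Bahdanau et al. (2014). The function $\mathrm{Output}(\cdot)$ produces a distribution over output tokens $y_i$ given the current state $s_i$ and context vector $c_i$. In standard soft attention, the context vector is computed as follows:

$$e_{i,j} = \mathrm{Energy}(h_j, s_{i-1}) \tag{4}$$
$$\alpha_{i,j} = \frac{\exp(e_{i,j})}{\sum_{k=1}^{|\mathbf{x}|} \exp(e_{i,k})} \tag{5}$$
$$c_i = \sum_{j=1}^{|\mathbf{x}|} \alpha_{i,j} h_j \tag{6}$$

where $\mathrm{Energy}(\cdot)$ is a multi-layer perceptron.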
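As a minimal sketch of the soft attention step, the following pure-Python snippet normalizes a row of energies with a softmax and forms the context vector as a weighted sum of encoder states. The function names are hypothetical, and the energies are passed in directly rather than produced by a multi-layer perceptron, which is a simplifying assumption.

```python
import math


def softmax(energies):
    """Normalize a list of energies into attention weights."""
    peak = max(energies)  # subtract the max for numerical stability
    exps = [math.exp(e - peak) for e in energies]
    total = sum(exps)
    return [x / total for x in exps]


def soft_attention_context(energies, encoder_states):
    """Return (weights, context) for one output timestep.

    energies[j] plays the role of e_{i,j}; encoder_states[j] is h_j,
    represented here as a plain list of floats (an assumption).
    """
    weights = softmax(energies)
    dim = len(encoder_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context
```

Note that every encoder state contributes to the context, which is why the whole source must be available before the first output step.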

One issue with standard soft attention is that it computes $c_i$ based on the entire input sequence for all output timesteps; this prevents attention from being used in streaming settings since the entire input sequence needs to be ingested before generating any output. To enable streaming, we require a schedule in which the output at timestep $i$ is generated using just the first $t_i$ input tokens, where $1 \le t_i \le |\mathbf{x}|$.

3.1 Monotonic Attention

Raffel et al. (2017) proposed a monotonic attention mechanism that modifies standard soft attention to provide such a schedule of interleaved reads and writes, while also integrating training with the rest of the NMT model. Monotonic attention explicitly processes the input sequence in a left-to-right order and makes a hard assignment of $c_i$ to one particular encoder state denoted $h_{t_i}$. For output timestep $i$, the mechanism begins scanning the encoder states starting at $j = t_{i-1}$. For each encoder state, it produces a Bernoulli selection probability $p_{i,j}$, which corresponds to the probability of either stopping and setting $t_i = j$, or else moving on to the next input timestep, $j + 1$, which represents reading one more source token. This selection probability is computed through the use of an energy function that is passed through a logistic sigmoid to parameterize the Bernoulli random variable:

$$e_{i,j} = \mathrm{MonotonicEnergy}(s_{i-1}, h_j) \tag{7}$$
$$p_{i,j} = \sigma(e_{i,j}) \tag{8}$$
$$z_{i,j} \sim \mathrm{Bernoulli}(p_{i,j}) \tag{9}$$

If $z_{i,j} = 0$, $j$ is incremented and these steps are repeated; if $z_{i,j} = 1$, $t_i$ is set to $j$ and $c_i$ is set to $h_{t_i}$.

This approach involves sampling a discrete random variable and a hard assignment of $c_i = h_{t_i}$, which precludes backpropagation. Raffel et al. (2017) instead compute the probability that $c_i = h_j$ and use this to compute the expected value of $c_i$, which can be used as a drop-in replacement for standard soft attention, and which allows for training with backpropagation. The probability that the attention mechanism attends to state $h_j$ at output timestep $i$ is computed as

$$\alpha_{i,j} = p_{i,j} \left( (1 - p_{i,j-1}) \frac{\alpha_{i,j-1}}{p_{i,j-1}} + \alpha_{i-1,j} \right) \tag{10}$$

There is a solution to this recurrence relation which allows $\alpha_{i,j}$ to be computed for all $j$ in parallel using cumulative sum and cumulative product operations; see Raffel et al. (2017) for details.
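The recurrence for the expected monotonic attention can be sketched sequentially as below. This is an illustrative, unparallelized version rather than the cumulative sum and product formulation; the function name, the 0-based indexing, and the initial condition (the head starts on the first source token with probability 1) are assumptions made for the example. The helper variable `q` tracks $\alpha_{i,j} / p_{i,j}$ so that no division by a possibly zero probability is needed.

```python
def monotonic_alpha(p):
    """Return alpha[i][j], the probability that the monotonic head stops at
    source index j for output step i, given selection probabilities p[i][j]."""
    num_src = len(p[0])
    # Assumed initial condition: before the first output step, the head sits
    # on the first source token with probability 1.
    prev_row = [1.0] + [0.0] * (num_src - 1)
    alpha = []
    for p_i in p:
        row = []
        q = 0.0  # q equals alpha[i][j] / p[i][j]
        for j in range(num_src):
            # Mass that reached j-1 at this step and chose not to stop there.
            carried = (1.0 - p_i[j - 1]) * q if j > 0 else 0.0
            # Plus mass that was already parked on j at the previous step.
            q = carried + prev_row[j]
            row.append(p_i[j] * q)
        alpha.append(row)
        prev_row = row
    return alpha
```

With binary selection probabilities the result is one-hot and matches the hard schedule. With fractional probabilities a row may sum to less than 1, because some mass can run past the final source token.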

Note that when $p_{i,j}$ is either 0 or 1, the soft and hard approaches are the same. To encourage this, Raffel et al. (2017) use the common approach of adding zero-mean Gaussian noise to the logistic sigmoid function’s activations. Equation 8 becomes:

$$p_{i,j} = \sigma(e_{i,j} + \mathcal{N}(0, n)) \tag{11}$$

One can control the extent to which $p_{i,j}$ is drawn toward discrete values by adjusting the noise variance $n$. At run time, we forgo sampling in favor of simply setting $z_{i,j} = \mathbb{1}(e_{i,j} > 0)$.
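The run-time behaviour described above, where the head stops at the first position with positive energy, can be sketched as follows. The function name and the use of `None` to mark an output step whose head ran past the end of the source are illustrative assumptions; indices are 0-based.

```python
def hard_monotonic_schedule(energies):
    """Return the stopping index t_i for each output step.

    energies[i][j] plays the role of e_{i,j}. The head stops at the first
    position j >= t_{i-1} whose energy is positive.
    """
    num_src = len(energies[0])
    head = 0
    schedule = []
    for e_i in energies:
        j = head
        while j < num_src and e_i[j] <= 0:
            j += 1  # energy not positive: read one more source token
        if j == num_src:
            schedule.append(None)  # ran off the end of the source
        else:
            schedule.append(j)
        head = j  # the head never moves backwards
    return schedule
```

For example, `hard_monotonic_schedule([[-1, 2, -1], [-1, -1, 3], [5, 5, 5]])` yields a schedule that stops at index 1, then 2, then stays at 2, since the head only scans forward from its previous position.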

While the monotonic attention mechanism allows for streaming attention, it requires that the decoder attend only to a single encoder state, $h_{t_i}$. To address this issue, Chiu and Raffel (2018) proposed monotonic chunkwise attention (MoChA), which allows the model to perform soft attention over a small fixed-length chunk preceding $t_i$, i.e. over the encoder states $h_{t_i - cs + 1}, \ldots, h_{t_i}$, for some fixed chunk size $cs$.

3.2 Monotonic Infinite Lookback Attention

In this work, we take MoChA one step further, by allowing the model to perform soft attention over the encoder states $h_1, h_2, \ldots, h_{t_i}$. This gives the model “infinite lookback” over everything read thus far, so we dub this technique Monotonic Infinite Lookback (MILk) attention. The infinite lookback provides more flexibility and should improve the modeling of long-distance reorderings and dependencies. The increased computational cost, from linear to quadratic computation, is of little concern as our focus on the simultaneous scenario means that our largest source of latency will be waiting for source context.

Concretely, we maintain a full monotonic attention mechanism and also a soft attention mechanism. Assuming that the monotonic attention component chooses to stop at $t_i$, MILk first computes soft attention energies

$$u_{i,k} = \mathrm{SoftmaxEnergy}(h_k, s_{i-1}) \tag{12}$$

for $k \in 1, 2, \ldots, t_i$, where $\mathrm{SoftmaxEnergy}(\cdot)$ is an energy function similar to Equation (4). Then, MILk computes a context $c_i$ by

$$c_i = \sum_{j=1}^{t_i} \frac{\exp(u_{i,j})}{\sum_{l=1}^{t_i} \exp(u_{i,l})} h_j \tag{13}$$

Note that a potential issue with this approach is that the model can set the monotonic attention head $t_i = |\mathbf{x}|$ for all $i$, in which case the approach is equivalent to standard soft attention. We address this issue in the following subsection.

To train models using MILk, we compute the expected value of $c_i$ given the monotonic attention probabilities and soft attention energies. To do so, we must consider every possible path through which the model could assign attention to a given encoder state. Specifically, we can compute the attention distribution induced by MILk by

$$\beta_{i,j} = \sum_{k=j}^{|\mathbf{x}|} \left( \frac{\alpha_{i,k} \exp(u_{i,j})}{\sum_{l=1}^{k} \exp(u_{i,l})} \right) \tag{14}$$

The first summation reflects the fact that $h_j$ can influence $c_i$ as long as $k \ge j$, and the term inside the summation reflects the attention probability associated with some monotonic probability $\alpha_{i,k}$ and the soft attention distribution. This calculation can be computed efficiently using cumulative sum operations: the inner denominator $\sum_{l=1}^{k} \exp(u_{i,l})$ is a cumulative sum over $\exp(u_{i,\cdot})$, and the outer summation over $k \ge j$ is a cumulative sum taken over the reversed sequence. Once we have the $\beta_{i,j}$ distribution, calculating the expected context $c_i$ follows a familiar formula: $c_i = \sum_{j=1}^{|\mathbf{x}|} \beta_{i,j} h_j$.
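A small pure-Python sketch of the MILk attention distribution for one output step is given below, using a forward running sum for the softmax denominators and a backward running sum for the outer summation. The function name and the list-based representation are assumptions for illustration; a practical implementation would vectorize these sums.

```python
import math


def milk_beta(alpha_i, u_i):
    """Return beta[j] for one output step.

    alpha_i[k] is the probability that the monotonic head stops at k,
    and u_i[j] is the soft attention energy for encoder state j.
    """
    peak = max(u_i)  # shifting by the max leaves every ratio unchanged
    exp_u = [math.exp(u - peak) for u in u_i]

    # prefix[k] = sum of exp_u[0..k], the softmax denominator when the
    # head stops at k.
    prefix, running = [], 0.0
    for value in exp_u:
        running += value
        prefix.append(running)

    # Walk backwards so that suffix = sum over k >= j of alpha[k] / prefix[k].
    beta = [0.0] * len(u_i)
    suffix = 0.0
    for j in reversed(range(len(u_i))):
        suffix += alpha_i[j] / prefix[j]
        beta[j] = exp_u[j] * suffix
    return beta
```

When `alpha_i` is one-hot at position `k`, the result reduces to an ordinary softmax over positions up to and including `k`, with zero weight afterwards, and the total of `beta` always equals the total of `alpha_i`.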

3.3 Latency-augmented Training

By moving to an infinite lookback, we have gained the full power of a soft attention mechanism over any source tokens that have been revealed up to time $t_i$. However, while the original monotonic attention encouraged streaming behaviour implicitly due to the restriction on the system’s memory, MILk no longer has any incentive to do this. It can simply wait for all source tokens before writing the first target token. We address this problem by training with an objective that interpolates log likelihood with a latency metric.

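Sequence-to-sequence models are typically trained to minimize the negative log likelihood, which we can easily augment with a latency cost:

$$L(\theta) = -\sum_{(\mathbf{x}, \mathbf{y})} \log p(\mathbf{y} \mid \mathbf{x}; \theta) + \lambda\, C(\mathbf{g}) \tag{15}$$

where $\lambda$ is a user-defined latency weight, $\mathbf{g} = \{g_1, \ldots, g_{|\mathbf{y}|}\}$ is a vector that describes the delay incurred immediately before each target time step (see Section 4.1), and $C$ is a latency metric that transforms these delays into a cost.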

In the case of MILk, $g_i$ is equal to $t_i$, the position of the monotonic attention head. (We introduce $g_i$ to generalize beyond methods with hard attention heads and to unify notation with Ma et al. (2018).) Recall that during training, we never actually make a hard decision about $t_i$’s location. Instead, we can use $\alpha_{i,j}$, the probability that $t_i = j$, to get expected delay:

$$g_i = \sum_{j=1}^{|\mathbf{x}|} j\, \alpha_{i,j} \tag{16}$$

So long as our metric is differentiable and well-defined over fractional delays, Equation (15) can be used to guide MILk to low latencies.
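To make the expected delay and the augmented objective concrete, here is a minimal sketch for a single sentence pair. The function names are hypothetical, positions are counted from 1 as in the text, and the `latency_metric` argument is any callable that maps a list of delays to a scalar cost (a stand-in for $C$).

```python
def expected_delays(alpha):
    """Return the expected delay for each output step: sum over j of j * alpha[i][j],
    with source positions counted from 1."""
    return [
        sum((j + 1) * a for j, a in enumerate(row))
        for row in alpha
    ]


def latency_augmented_loss(neg_log_likelihood, delays, latency_weight, latency_metric):
    """Combine the negative log likelihood with a weighted latency cost."""
    return neg_log_likelihood + latency_weight * latency_metric(delays)
```

For instance, an attention row of `[0.0, 0.5, 0.5]` gives an expected delay of 2.5 source tokens, a fractional value that a hard schedule could never produce, which is why the metric must be well-defined over fractional delays.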

3.4 Preserving Monotonic Probability Mass

In the original formulations of monotonic attention (see Section 3.1), it is possible to choose not to stop the monotonic attention head, even at the end of the source sentence. In such cases, the attention returns an all-zero context vector.

In early experiments, we found that this creates an implicit incentive for low latencies: the MILk attention head would stop early to avoid running off the end of the sentence. This implicit incentive grows stronger as our selection probabilities come closer to being binary decisions. Meanwhile, we found it beneficial to have very-near-to-binary decisions in order to get accurate latency estimates for latency-augmented training. Taken all together, we found that MILk either destabilized, or settled into unhealthily-low-latency regions. We resolve this problem by forcing MILk’s monotonic attention head to stop once it reaches the EOS token, by setting $z_{i,|\mathbf{x}|} = 1$. (While training, we perform the equivalent operation of shifting any residual probability mass from overshooting the source sentence, $1 - \sum_{j=1}^{|\mathbf{x}|} \alpha_{i,j}$, to the final source token at position $|\mathbf{x}|$. This bypasses floating point errors introduced by the parallelized cumulative sum and cumulative product operations Raffel et al. (2017). This same numerical instability helps explain why the parameterized stopping probability does not learn to detect the end of the sentence without intervention.)
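The training-time version of this fix, which moves any residual probability mass onto the final source token, can be sketched for a single output step as follows. The function name is hypothetical and the input is assumed to be one row of monotonic attention probabilities whose total may fall short of 1.

```python
def preserve_monotonic_mass(alpha_i):
    """Shift the mass that overshot the source onto the final source position."""
    residual = 1.0 - sum(alpha_i)  # probability of never having stopped
    preserved = list(alpha_i)      # copy so the caller's row is untouched
    preserved[-1] += residual
    return preserved
```

After this adjustment each row sums to 1, so the expected context is never scaled down toward the all-zero vector.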

4 Measuring Latency

Our plan hinges on having a latency cost that is worth optimizing. To that end, we describe two candidates, and then modify the most promising one to accommodate our training scenario.

4.1 Previous Latency Metrics

Cho and Esipova (2016) introduced Average Proportion (AP), which averages the absolute delay incurred by each target token:

$$\mathrm{AP} = \frac{1}{|\mathbf{x}|\,|\mathbf{y}|} \sum_{i=1}^{|\mathbf{y}|} g_i \tag{17}$$

where $g_i$ is the delay at time $i$: the number of source tokens read by the agent before writing the $i$th target token. This metric has some nice properties, such as being bounded between 0 and 1, but it also has some issues. Ma et al. (2018) observe that their wait-k system with a fixed $k = 1$ incurs different AP values as sequence length $|\mathbf{x}| = |\mathbf{y}|$ ranges from 2 ($\mathrm{AP} = 0.75$) to $\infty$ ($\mathrm{AP} = 0.5$). Knowing that a very-low-latency wait-1 system incurs at best an AP of 0.5 also implies that much of the metric’s dynamic range is wasted; in fact, Alinejad et al. (2018) report that AP is not sufficiently sensitive to detect their improvements to simultaneous MT.

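Recently, Ma et al. (2018) introduced Average Lagging (AL), which measures the average rate by which the MT system lags behind an ideal, completely simultaneous translator:

$$\mathrm{AL} = \frac{1}{\tau} \sum_{i=1}^{\tau} \left( g_i - \frac{i-1}{\gamma} \right) \tag{18}$$

where $\tau$ is the earliest timestep where the MT system has consumed the entire source sequence:

$$\tau = \operatorname{argmin}_{i \,:\, g_i = |\mathbf{x}|} \; i \tag{19}$$

and $\gamma = |\mathbf{y}| / |\mathbf{x}|$ accounts for the source and target having different sequence lengths. This metric has the nice property that when $|\mathbf{x}| = |\mathbf{y}|$, a wait-k system will achieve an AL of $k$, which makes the metric very interpretable. It also has no issues with sentence length or sensitivity.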
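Both metrics are short enough to sketch directly. The function names below are hypothetical, delays are given as a list with one entry per target token, and the Average Lagging sketch assumes that some target step has read the whole source (otherwise the truncation point is undefined and the lookup raises `StopIteration`).

```python
def average_proportion(delays, src_len):
    """Average Proportion: mean delay normalized by source length."""
    return sum(delays) / (src_len * len(delays))


def average_lagging(delays, src_len):
    """Average Lagging: mean lag behind an ideal simultaneous translator,
    truncated at the first step that has read the whole source."""
    gamma = len(delays) / src_len
    # tau: earliest 1-based target step whose delay equals the source length.
    tau = next(i for i, g in enumerate(delays, start=1) if g == src_len)
    lags = [delays[i - 1] - (i - 1) / gamma for i in range(1, tau + 1)]
    return sum(lags) / tau
```

As a check against the text, a wait-1 system on a two-token sentence has delays `[1, 2]` and an AP of 0.75, while a wait-3 system on a four-token sentence has delays `[3, 4, 4, 4]` and an AL of 3.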

4.2 Differentiable Average Lagging

Average Proportion already works as a $C$ function, but we prefer Average Lagging for the reasons outlined above. Unfortunately, it is not differentiable, nor is it calculable in expectation, due to the $\operatorname{argmin}$ in Equation (19). We present Differentiable Average Lagging (DAL), which eliminates the $\operatorname{argmin}$ by making AL’s treatment of delay internally consistent.

AL’s $\operatorname{argmin}$ is used to calculate $\tau$, which is used in turn to truncate AL’s average at the point where all source tokens have been read. Why is this necessary? We can quickly see $\tau$’s purpose by reasoning about a simpler version of AL where $\tau = |\mathbf{y}|$.

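Statistics                                   Scores
$i$                              1  2  3  4   $\tau = 2$   $\tau = |\mathbf{y}|$
$g_i$                            3  4  4  4
$\ell_i = g_i - (i-1)/\gamma$    3  3  2  1   AL = 3       AL = 2.25
Table 1: Comparing AL with and without its truncated average, tracking time-indexed lag $\ell_i = g_i - \frac{i-1}{\gamma}$ when $|\mathbf{x}| = |\mathbf{y}| = 4$ for a wait-3 system.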

Table 1 shows the time-indexed lags that are averaged to calculate AL for a wait-3 system. The lags make the problem clear: each position beyond the point where all source tokens have been read ($g_i = |\mathbf{x}|$) has its lag reduced by 1, pulling the average lag below $k$. By stopping its average at $\tau$, AL maintains the property that a wait-k system receives an AL of $k$.

$\tau$ is necessary because the only way to incur delay is to read a source token. Once all source tokens have been read, all target tokens appear instantaneously, artificially dragging down the average lag. This is unsatisfying: the system lagged behind the source speaker while they were speaking. It should continue to do so after they finished.

AL solves this issue by truncating its average, enforcing an implicit and poorly defined delay for the excluded, problematic tokens. We propose instead to enforce a minimum delay for writing any target token. Specifically, we model each target token as taking at least $\frac{1}{\gamma}$ units of time to write, mirroring the speed of the ideal simultaneous translator in AL’s Equation (18). We wrap $g$ in a $g'$ that enforces our minimum delay:

$$g'_i = \begin{cases} g_i & i = 1 \\ \max\left(g_i,\; g'_{i-1} + \frac{1}{\gamma}\right) & i > 1 \end{cases} \tag{20}$$

Like $g_i$, $g'_i$ represents the amount of delay incurred just before writing the $i$th target token. Intuitively, the $\max$ enforces our minimum delay: $g'_i$ is either equal to $g_i$, the number of source tokens read, or to $g'_{i-1} + \frac{1}{\gamma}$, the delay incurred just before the previous token, plus the time spent writing that token. The recurrence ensures that we never lose track of earlier delays. With $g'$ in place, we can define our Differentiable Average Lagging:

$$\mathrm{DAL} = \frac{1}{|\mathbf{y}|} \sum_{i=1}^{|\mathbf{y}|} \left( g'_i - \frac{i-1}{\gamma} \right) \tag{21}$$

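Statistics                                      Scores
$i$                                1  2  3  4
$g'_i$                             3  4  5  6
$\ell'_i = g'_i - (i-1)/\gamma$    3  3  3  3   DAL = 3
Table 2: DAL’s time-indexed lag $\ell'_i = g'_i - \frac{i-1}{\gamma}$ when $|\mathbf{x}| = |\mathbf{y}| = 4$ for a wait-3 system.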

DAL is equal to AL in many cases; in particular, when measuring wait-k systems for sentences of equal length, both always return a lag of $k$. See Table 2 for its treatment of our wait-3 example. Having eliminated $\tau$, DAL is both differentiable and calculable in expectation. Cherry and Foster (2019) provide further motivation and analysis for DAL, alongside several examples of cases where DAL yields more intuitive results than AL.
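The minimum-delay wrapper and the resulting metric can be sketched in a few lines. The function names are hypothetical; delays may be integers from a hard schedule or fractional expected delays from training.

```python
def minimum_delay_wrap(delays, src_len):
    """Return the wrapped delays: each target token takes at least 1/gamma
    units of time to write after the previous one."""
    gamma = len(delays) / src_len
    wrapped = []
    for i, g in enumerate(delays):
        if i == 0:
            wrapped.append(float(g))
        else:
            wrapped.append(max(g, wrapped[-1] + 1.0 / gamma))
    return wrapped


def differentiable_average_lagging(delays, src_len):
    """Differentiable Average Lagging: mean lag over all target positions,
    computed from the wrapped delays with no truncation."""
    gamma = len(delays) / src_len
    wrapped = minimum_delay_wrap(delays, src_len)
    lags = [g - i / gamma for i, g in enumerate(wrapped)]
    return sum(lags) / len(lags)
```

For the wait-3 example with delays `[3, 4, 4, 4]` and four source tokens, the wrapped delays are 3, 4, 5, 6 and the resulting lag is 3 at every position. Because only `max`, addition and averaging are involved, the same computation can be applied to fractional expected delays.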

5 Experiments

We run our experiments on the standard WMT14 English-to-French (EnFr; 36.3M sentences) and WMT15 German-to-English (DeEn; 4.5M sentences) tasks. For EnFr we use a combination of newstest 2012 and newstest 2013 for development and report results on newstest 2014. For DeEn we validate on newstest 2013 and then report results on newstest 2015. Translation quality is measured using detokenized, cased BLEU Papineni et al. (2002). For each data set, we use BPE Sennrich et al. (2016) on the training data to construct a 32,000-type vocabulary that is shared between the source and target languages.

5.1 Model

Our model closely follows the RNMT+ architecture described by Chen et al. (2018) with modifications to support streaming translation. It consists of a 6-layer LSTM encoder and an 8-layer LSTM decoder with additive attention Bahdanau et al. (2014). All streaming models including wait-k, MoChA and MILk use unidirectional encoders, while offline translation models use a bidirectional encoder. Both encoder and decoder LSTMs have 512 hidden units, per-gate layer normalization Ba et al. (2016), and residual skip connections after the second layer. The models are regularized using dropout with probability 0.2 and label smoothing with an uncertainty of 0.1 Szegedy et al. (2016). Models are optimized until convergence using data parallelism over 32 P100s, using Adam Kingma and Ba (2015) with the learning rate schedule described in Chen et al. (2018) and a batch size of 4,096 sentence-pairs per GPU. Checkpoints are selected based on development loss. All streaming models use greedy decoding, while offline models use beam search with a beam size of 20.

We implement soft attention, monotonic attention, MoChA, MILk and wait-k as instantiations of an attention interface in a common code base, allowing us to isolate their contributions. By analyzing development sentence lengths, we determined that wait-k should employ an emission rate of 1 for DeEn, and 1.1 for EnFr.

5.2 Development

            unpreserved      preserved
$\lambda$   BLEU   DAL       BLEU   DAL
0.0         27.7   21.0      27.7   27.9
0.1         27.0   13.6      27.6   10.5
0.2         25.7   11.6      27.5   8.7
Table 3: Varying MILk’s $\lambda$ with and without mass preservation on the DeEn development set.

We tuned MILk on our DeEn development set. Two factors were crucial for good performance: the preservation of monotonic mass (Section 3.4), and the proper tuning of the noise parameter $n$ in Equation 11, which controls the discreteness of monotonic attention probabilities during training.

Table 3 contrasts MILk’s best configuration before mass preservation against our final system. Before preservation, MILk with a latency weight $\lambda = 0$ still showed a substantial reduction in latency from the maximum value of 27.9, indicating an intrinsic latency incentive. Furthermore, training quickly destabilized, resulting in very poor trade-offs for $\lambda$s as low as 0.2.

$n$   BLEU   DAL
0     3.4    24.2
1     10.8   12.9
2     24.6   12.3
3     27.5   10.4
4     27.5   8.7
6     26.3   7.2
Table 4: Varying MILk’s discreteness parameter $n$ with $\lambda$ fixed at 0.2 on the DeEn development set.

After modifying MILk to preserve mass, we then optimized noise with $\lambda$ fixed at a low but relevant value of 0.2, as shown in Table 4. We then proceeded to deploy the selected value of $n = 4$ for testing both DeEn and EnFr.

5.3 Comparison with the state-of-the-art

We compare MILk to wait-k, the current state-of-the-art in simultaneous NMT. We also include MILk’s predecessors, Monotonic Attention and MoChA, which have not previously been evaluated with latency metrics. We plot latency-quality curves for each system, reporting quality using BLEU, and latency using Differentiable Average Lagging (DAL), Average Lagging (AL) or Average Proportion (AP) (see Section 4). We focus our analysis on DAL unless stated otherwise. MILk curves are produced by varying the latency loss weight $\lambda$ (the values tested include 0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.01 and 0.0), wait-k curves by varying $k$, and MoChA curves by varying chunk size (1, which corresponds to Monotonic Attention, then 2, 4, 8, and 16). Both MILk and wait-k have settings ($\lambda = 0$ and a sufficiently large $k$) corresponding to full attention.

Figure 2: Quality-latency comparison for German-to-English WMT15 (DeEn) with DAL (upper), AL (lower-left), AP (lower-right).
Figure 3: Quality-latency comparison for English-to-French WMT14 (EnFr) with DAL (upper), AL (lower-left), AP (lower-right).

Results are shown in Figures 2 and 3. (Full-sized graphs for all latency metrics, along with the corresponding numeric scores, are available in Appendix A, included as supplementary material.) For DeEn, we begin by noting that MILk has a clear separation above its predecessors MoChA and Monotonic Attention, indicating that the infinite lookback is indeed a better fit for translation. Furthermore, MILk is consistently above wait-k for lags between 4 and 14 tokens. MILk is able to retain the quality of full attention (28.4 BLEU) up to a lag of 8.5 tokens, while wait-k begins to fall off for lags below 13.3 tokens. At the lowest comparable latency (4 tokens), MILk is 1.5 BLEU points ahead of wait-k.

EnFr is a much easier language pair: both MILk and wait-k maintain the BLEU of full attention at lags of 10 tokens. However, we were surprised to see that this does not mean we can safely deploy very low $k$s for wait-k; its quality drops off surprisingly quickly at $k = 8$ (DAL=8.4, BLEU=39.8). MILk extends the flat “safe” region of the curve out to a lag of 7.2 (BLEU=40.5). At the lowest comparable lag (4.5 tokens), MILk once again surpasses wait-k, this time by 2.3 BLEU points.

The $k = 2$ point for wait-k has been omitted from all graphs to improve clarity. The omitted BLEU/DAL pairs are 19.5/2.5 for DeEn and 28.9/2.9 for EnFr, both of which trade very large losses in BLEU for small gains in lag. However, wait-k’s ability to function at all at such low latencies is notable. The configuration of MILk tested here was unable to drop below lags of 4.

Despite MILk having been optimized for DAL, MILk’s separation above wait-k only grows as we move to the more established metrics AL and AP. DAL’s minimum delay for each target token makes it far more conservative than AL or AP. Unlike DAL, these metrics reward MILk and its predecessors for their tendency to make many consecutive writes in the middle of a sentence.

5.4 Characterizing MILk’s schedule

Figure 4: Two EnFr sentences constructed to contrast MILk’s handling of a short noun phrase John Smith against the longer John Smith’s lawyer. Translated by MILk with a fixed latency weight $\lambda$.

We begin with a qualitative characterization of MILk’s behavior by providing diagrams of MILk’s attention distributions. The shade of each circle indicates the strength of the soft alignment, while bold outlines indicate the location of the hard attention head, whose movement is tracked by connecting lines.

In general, the attention head seems to loosely follow noun- and verb-phrase boundaries, reading one or two tokens past the end of the phrase to ensure it is complete. This behavior and its benefits are shown in Figure 4, which contrasts the simple noun phrase John Smith against the more complex John Smith’s lawyer. By waiting until the end of both phrases, MILk is able to correctly re-order avocat (lawyer).

Figure 5: An example EnFr sentence drawn from our development set, as translated by MILk with a fixed latency weight $\lambda$.
Figure 6: An example EnFr sentence drawn from our development set, as translated by wait-6.

Figure 5 shows a more complex sentence drawn from our development set. MILk gets going after reading just 4 tokens, writing the relatively safe, En 2008. It does wait, but it saves its pauses for tokens with likely future dependencies. A particularly interesting pause occurs before the de in de la loi. This preposition could be either de la or du, depending on the phrase it modifies. We can see MILk pause long enough to read one token after law, allowing it to correctly choose de la to match the feminine loi (law).

Looking at the corresponding wait-6 run in Figure 6, we can see that wait-6’s fixed schedule does not read law before writing the same de. To its credit, wait-6 anticipates correctly, also choosing de la, likely due to the legal context provided by the nearby phrase, the constitutionality.

We can also perform a quantitative analysis of MILk’s adaptivity by monitoring its initial delays; that is, how many source tokens does it read before writing its first target token? We decode our EnFr development set with MILk as well as wait-6 and count the initial delays for each. (Wait-6 will have delays different from 6 only for source sentences with fewer than 6 tokens.) The resulting histogram is shown in Figure 7.

Figure 7: Histogram of initial delays for MILk (at a fixed latency weight $\lambda$) and wait-6 on the EnFr development set.

We can see that MILk has a lot of variance in its initial delays, especially when compared to the near-static wait-6. This is despite them having very similar DALs: 5.8 for MILk and 6.5 for wait-6.

6 Conclusion

We have presented Monotonic Infinite Lookback (MILk) attention, an attention mechanism that uses a hard, monotonic head to manage the reading of the source, and a soft traditional head to attend over whatever has been read. This allowed us to build a simultaneous NMT system that is trained jointly with its adaptive schedule. Along the way, we contributed latency-augmented training and a differentiable latency metric. We have shown MILk to have favorable quality-latency trade-offs compared both to wait-k and to earlier monotonic attention mechanisms. It is particularly useful for extending the length of the region on the latency curve where we do not yet incur a major reduction in BLEU.


Appendix A Expanded Results

We provide full-sized versions of our Quality-Latency curves from Section 5.3 in Figure 8. We also provide complete results in Tables 5 and 6. As in the main text, DAL is Differentiable Average Lagging, AL is Average Lagging and AP is Average Proportion. wait-k is parameterized by $k$, MoChA by its chunk size and MILk by its latency weight $\lambda$. Results for one EnFr MILk configuration are omitted, as it failed to converge.

(a) German-to-English WMT15 (DeEn) test set.
(b) English-to-French WMT14 (EnFr) test set.
Figure 8: BLEU versus latency (top: differentiable average lagging, middle: average lagging, bottom: average proportion) for our two language pairs (left: DeEn, right: EnFr).


wait-k
BLEU   DAL    AL     AP
19.5   2.5    1.5    0.56
23.8   4.2    3.1    0.63
25.3   6.1    4.9    0.70
26.7   8.1    6.8    0.75
27.3   9.9    8.8    0.80
27.6   11.7   10.6   0.84
28.1   13.3   12.3   0.87
28.3   14.9   14.0   0.89
28.6   17.6   16.9   0.93
28.5   19.9   19.3   0.95
28.4   27.9   27.9   1.00

MoChA
BLEU   DAL    AL     AP
26.0   6.9    4.7    0.68
25.6   6.2    4.0    0.66
26.0   6.4    4.2    0.67
26.4   7.4    5.1    0.70
26.6   8.4    5.9    0.72

MILk
BLEU   DAL    AL     AP
25.3   4.1    2.8    0.60
26.4   5.1    3.5    0.63
26.9   5.7    4.0    0.64
27.4   6.7    4.7    0.67
28.4   8.5    6.0    0.71
28.5   10.3   7.5    0.76
28.5   12.6   9.5    0.81
28.6   24.4   22.7   0.97
28.4   27.9   27.9   1.00
Table 5: Complete DeEn test set results, backing the curves in Figure 8(a).


wait-k
BLEU   DAL    AL     AP
28.9   2.9    2.1    0.57
35.6   4.5    3.7    0.63
38.4   6.5    5.5    0.70
39.8   8.4    7.5    0.75
40.5   10.3   9.4    0.80
40.6   12.1   11.3   0.84
40.9   13.9   13.0   0.87
40.7   15.5   14.7   0.89
41.1   18.3   17.7   0.93
41.1   20.8   20.3   0.96
40.6   28.8   28.8   1.00

MoChA
BLEU   DAL    AL     AP
37.7   5.5    3.4    0.63
37.3   5.4    3.3    0.62
37.1   5.6    3.6    0.63
38.1   6.8    4.5    0.66
38.6   7.9    5.1    0.69

MILk
BLEU   DAL    AL     AP
37.9   4.6    3.1    0.61
38.7   4.9    3.3    0.61
39.1   5.2    3.6    0.63
39.6   5.8    4.0    0.64
40.5   7.2    5.1    0.68
40.9   8.4    6.2    0.71
40.7   17.9   14.8   0.89
40.5   28.8   28.8   1.00
Table 6: Complete EnFr test set results, backing the curves in Figure 8(b).