Lossless Speedup of Autoregressive Translation with Generalized Aggressive Decoding

03/30/2022
by Heming Xia, et al. (Microsoft; Peking University)

In this paper, we propose Generalized Aggressive Decoding (GAD) – a novel decoding paradigm for speeding up autoregressive translation without quality loss, through the collaboration of autoregressive translation (AT) and non-autoregressive translation (NAT) of the Transformer. At each decoding iteration, GAD aggressively decodes a number of tokens in parallel as a draft with NAT and then verifies them in an autoregressive manner, where only the tokens that pass the verification are kept as decoded tokens. GAD can achieve the same performance as autoregressive translation but much more efficiently, because both NAT drafting and autoregressive verification are fast owing to parallel computing. We conduct experiments on the WMT14 English-German translation task and confirm that the vanilla GAD yields exactly the same results as greedy decoding with an around 3x speedup, and that its variant (GAD++) with an advanced verification strategy not only outperforms greedy translation and even achieves translation quality comparable to the beam search result, but also further improves the decoding speed, resulting in an around 5x speedup over autoregressive translation. Our models and code are available at https://github.com/hemingkx/Generalized-Aggressive-Decoding.


1 Introduction

Since the Transformer Vaswani et al. (2017) prevailed in natural language processing (NLP), autoregressive decoding has become the de facto approach to neural machine translation (NMT) as well as other text generation tasks, because it is easy to train and reliably generates high-quality translations. Despite these advantages, autoregressive translation (AT) has been widely criticized for its poor inference efficiency, which motivates non-autoregressive translation (NAT) Gu et al. (2018). Unlike AT, which sequentially decodes only one token at each iteration so that the next token prediction can condition on the previous decoding results, NAT decodes tokens in parallel without depending on the surface form of previous tokens, largely improving the inference efficiency.

Recent research in NAT mainly focuses on improving its translation quality to bridge the performance gap between NAT and AT Gu et al. (2018); Ghazvininejad et al. (2019); Gu et al. (2019); Saharia et al. (2020); Qian et al. (2021); Ran et al. (2021); Song et al. (2021); Geng et al. (2021); Savinov et al. (2021). Until now, however, NAT's performance has still been less reliable[1] than AT's, as NAT is a much harder task than AT given its unawareness of the conditional dependence between translated tokens.

[1] Although some NAT models' BLEU scores are reported to match AT on some benchmarks in previous work, the comparison (e.g., an elaborately trained NAT model distilled from a powerful teacher vs. a naive AT model trained from scratch) is usually not persuasive enough to conclude that NAT can really perform as well as AT.

Given the fact that AT translates better whereas NAT runs faster, we propose an approach named Generalized Aggressive Decoding (GAD), inspired by the success of Aggressive Decoding Sun et al. (2021b) in the lossless speedup of Grammatical Error Correction (GEC), to employ NAT to accelerate translation without quality loss compared with AT. GAD decomposes each decoding iteration into two substeps: draft and verify. At each decoding iteration, GAD first aggressively drafts (i.e., decodes) a fixed number of tokens[2] in parallel through NAT; then, the drafted tokens are verified in an AT manner to determine how many of them match AT's (top-1) results and can thus be accepted as translation results, as Figure 2 shows. In contrast to conventional AT, which decodes at a low speed, AT verification is highly efficient because it runs in parallel; more importantly, it guarantees that GAD's translation is as good as AT's, resulting in a desirable balance between translation speed and quality, as shown in Figure 1.

[2] We use "a block of drafted tokens" to denote them in the remainder of this paper.

In addition to the vanilla GAD, whose translation is required (strictly, by the top-1 matching criterion in AT verification) to be identical to greedy decoding of AT, we propose GAD++, an advanced variant of GAD that slightly loosens this rigid requirement during AT verification. GAD++ not only yields translations beyond greedy decoding, but also prevents good drafted tokens from being discarded just because they differ from greedy decoding results, leading to a higher inference speedup.

The experiments on the WMT14 English-German (EN-DE) translation benchmark show that GAD yields exactly the same translations as greedy decoding of AT with an around 3x speedup, and that its variant GAD++ outperforms greedy decoding and achieves translation quality comparable to the state-of-the-art beam search results with an around 5x speedup.

The contributions of this paper are two-fold:

  • We propose Generalized Aggressive Decoding (GAD) – a novel efficient decoding paradigm for lossless acceleration of autoregressive translation.

  • We propose GAD++ with an advanced strategy to slightly loosen the rigid AT verification in the vanilla GAD, yielding better translations than greedy decoding with further improved efficiency.

2 Background

2.1 Autoregressive Translation

Given a source sentence x = (x_1, ..., x_N) and the target sentence y = (y_1, ..., y_T), an autoregressive translation (AT) model models the target distribution as a product of conditional probabilities based on the chain rule:

P(y \mid x; \theta_{AT}) = \prod_{t=1}^{T} P(y_t \mid y_{<t}, x; \theta_{AT})    (1)

where y_{<t} denotes the target tokens before the t-th position. As Eq (1) shows, an AT model is trained via the teacher-forcing strategy that uses ground-truth target tokens as the previously decoded tokens, which is efficient during training because the probabilities at all positions can be calculated in parallel.

During inference, an AT model sequentially predicts output tokens, conditioning on the previously decoded tokens:

\hat{y}_t = \arg\max_{y_t} P(y_t \mid \hat{y}_{<t}, x; \theta_{AT})    (2)

where \hat{y} is the predicted target sentence.

Although AT offers desirable translation quality, its sequential decoding scheme with limited parallelism largely reduces its decoding speed, which is its main efficiency bottleneck.

2.2 Non-Autoregressive Translation

To improve inference efficiency, non-autoregressive translation (NAT) Gu et al. (2018); Ghazvininejad et al. (2019); Qian et al. (2021) removes the sequential dependence between target tokens with a conditional independence assumption for modeling the target sentence:

P(y \mid x; \theta_{NAT}) = \prod_{t=1}^{T} P(y_t \mid x; \theta_{NAT})    (3)

In contrast to AT, which does not start predicting y_t until \hat{y}_{<t} is completely decoded, NAT decodes[3] the output sentence in parallel, which is much more efficient than AT:

\tilde{y}_t = \arg\max_{y_t} P(y_t \mid x; \theta_{NAT})    (4)

[3] In this paper, we use \tilde{y} to denote NAT's translations, while we use \hat{y} to denote AT decoded/verified translation results.

However, the conditional independence assumption makes an NAT model hard to train well, leading to degraded translation quality despite the improvement in decoding speed.
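To make the contrast concrete, the following sketch shows greedy AT decoding (Eq (2): one forward pass per output token) next to one-shot NAT decoding (Eq (4): a single parallel argmax over all positions). The wrappers at_step and nat_forward are hypothetical stand-ins for the respective models, not a specific library API.

```python
def at_greedy_decode(src_ids, at_step, eos_id=2, max_len=256):
    """Sequential AT decoding (Eq 2): one forward pass per output token."""
    out = []
    while len(out) < max_len:
        tok = at_step(src_ids, out)  # argmax of P(y_t | decoded prefix, x)
        out.append(tok)
        if tok == eos_id:
            break
    return out

def nat_decode(src_ids, nat_forward):
    """Parallel NAT decoding (Eq 4): one forward pass for all positions."""
    log_probs = nat_forward(src_ids)  # shape [T, vocab]; positions independent
    return [int(row.argmax()) for row in log_probs]
```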

2.3 Aggressive Decoding

Aggressive Decoding Sun et al. (2021b) offered a unique perspective on improving the inference efficiency of Grammatical Error Correction (GEC). Given a source sentence x = (x_1, ..., x_N), it first assumes that the output \hat{y} (i.e., the corrected sentence) is identical to the source (i.e., no edit is needed), and aggressively decodes as many tokens as possible in parallel, conditioning on the source tokens as the previously decoded tokens:

\hat{y}_j = \arg\max_{y_j} P(y_j \mid \hat{y}_{<j}, x; \theta_{AT})

where \hat{y}_{<j} = x_{<j}. Then, it verifies whether \hat{y} is identical to x. If \hat{y} = x, the inference finishes, meaning the assumption is correct: there are no grammatical errors in the input and no edit is needed. Otherwise, we find the bifurcation index j such that \hat{y}_{<j} = x_{<j} and \hat{y}_j \neq x_j. We discard all the tokens after the bifurcation index j, and then sequentially re-decode them following conventional autoregressive decoding until we find the next opportunity (i.e., a unique suffix match) to switch back to aggressive decoding.

Since an output sentence is usually highly similar to its source sentence in GEC, aggressive decoding can significantly improve the inference efficiency with the guarantee that its outputs are identical to autoregressive greedy decoding.
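As a minimal illustration of this verification logic, the sketch below implements one draft-and-verify pass over token id lists; at_decode_parallel is a hypothetical helper that returns the AT model's greedy token at every position of a given draft in a single parallel forward pass (the autoregressive re-decoding and suffix-matching loop around it is omitted).

```python
def verify_draft(src_ids, draft_ids, at_decode_parallel):
    """One verification pass of Aggressive Decoding for GEC.

    draft_ids is the assumed output (for GEC, a copy of the source).
    at_decode_parallel(src_ids, draft_ids) is a hypothetical helper returning
    the AT model's greedy (top-1) token at every position of the draft,
    computed in one parallel forward pass.
    """
    verified = at_decode_parallel(src_ids, draft_ids)
    for j, (d, v) in enumerate(zip(draft_ids, verified)):
        if d != v:
            # Bifurcation: keep the matching prefix plus the AT token at j;
            # everything after j is discarded and re-decoded autoregressively.
            return draft_ids[:j] + [v]
    return list(draft_ids)  # draft fully verified: no edit is needed
```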

Figure 3: Illustration of GAD++. Compared to the vanilla GAD, which strictly requires the drafted tokens to match the top-1 result of the AT verifier, GAD++ slightly loosens the criterion to trust NAT's draft more, only requiring the drafted tokens to fall in the top-β of the AT verifier with a tolerable log-likelihood gap τ (not shown in this figure; see Eq (7)). As a result, GAD++ allows more drafted tokens to be accepted even if they are slightly different from the top-1 result of the AT verifier, leading to a higher inference speedup.

3 Generalized Aggressive Decoding

Inspired by the success of Aggressive Decoding in GEC, we propose Generalized Aggressive Decoding (GAD) for accelerating machine translation without quality loss (compared with autoregressive translation). Unlike the original Aggressive Decoding that regards the input sentence as the target sentence draft, which works only for tasks where the input and output are highly similar, GAD additionally employs NAT to generate the target translation draft, allowing it to be applied to any seq2seq task where NAT works well. Specifically, GAD decomposes every decoding iteration into two substeps – draft and verify:

Draft

At each decoding iteration, GAD first utilizes an NAT model to aggressively decode a block of k drafted tokens (denoted as [MASK] in its decoder input in Figure 2) in parallel, conditioning on the preceding translated tokens. Formally, given the source sentence x and the previously translated tokens \hat{y}_{\le i}, GAD decodes the next k (drafted) tokens as a block in parallel:

\tilde{y}_{i+1}, \ldots, \tilde{y}_{i+k} = \arg\max_{y_{i+1}, \ldots, y_{i+k}} \prod_{j=1}^{k} P(y_{i+j} \mid \hat{y}_{\le i}, x; \theta_{NAT})

Verify

Then, the drafted tokens are verified with an AT model in the autoregressive manner, which performs in parallel. As in the original Aggressive Decoding, we find the bifurcation position c by comparing the drafted tokens with the autoregressive decoding results conditioned on the draft, as Figure 2 shows:

c = \min\big\{\, j \;\big|\; \mathbb{1}\big(\tilde{y}_{i+j} \neq \hat{y}_{i+j}\big) = 1,\; 1 \le j \le k \,\big\}    (5)

where \mathbb{1} is the indicator function and \hat{y}_{i+j} = \arg\max_{y} P\big(y \mid \hat{y}_{\le i}, \tilde{y}_{i+1:i+j-1}, x; \theta_{AT}\big) is the top-1 result verified by the AT model conditioning on the previously translated tokens \hat{y}_{\le i} and the drafted tokens \tilde{y}_{i+1:i+j-1} (with c = k when all drafted tokens pass the verification). We only accept the verified tokens up to and including the bifurcation position as translated tokens, which ensures that GAD yields the same results as greedy decoding of AT:

\hat{y}_{\le i+c} = \big(\hat{y}_{\le i},\ \hat{y}_{i+1}, \ldots, \hat{y}_{i+c}\big)

We iterate the above two substeps until the termination condition is met, i.e., the [EOS] token is decoded or the sentence reaches the maximum length. As illustrated, GAD is highly efficient because both draft and verify run in parallel.
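Putting the two substeps together, here is a minimal sketch of the GAD loop, assuming hypothetical wrappers nat_draft_block (the NAT drafter: k drafted tokens in one parallel pass) and at_verify_parallel (the AT verifier: for each draft position j, the top-1 token conditioned on the accepted prefix plus the first j-1 drafted tokens, computed in one parallel forward pass). It illustrates the procedure under these assumptions, not the authors' released implementation.

```python
def gad_decode(src_ids, nat_draft_block, at_verify_parallel,
               block_size=25, eos_id=2, max_len=256):
    """Generalized Aggressive Decoding: draft k tokens with NAT, verify with AT."""
    prefix = []  # accepted translation so far (identical to AT greedy decoding)
    while len(prefix) < max_len and (not prefix or prefix[-1] != eos_id):
        draft = nat_draft_block(src_ids, prefix, block_size)   # substep 1: draft
        at_top1 = at_verify_parallel(src_ids, prefix, draft)   # substep 2: verify
        accepted = []
        for d, a in zip(draft, at_top1):
            accepted.append(a)  # a == d for every position before the bifurcation
            if d != a or a == eos_id:
                break           # bifurcation: keep the AT token, drop the rest
        prefix.extend(accepted)
    return prefix
```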

3.1 NAT drafter

As demonstrated above, an NAT model that can efficiently generate a block of drafted tokens in parallel is the key to generalizing Aggressive Decoding to machine translation. Our NAT drafter differs from other NAT models in two aspects: first, it only needs to decode a block (i.e., a fixed number k) of tokens at each decoding iteration, instead of the whole sequence; second, as illustrated in Figure 2, since we decode from left to right, the NAT drafter is required to decode tokens conditioning on the previously decoded tokens. Formally, given the source sentence x and a randomly sampled prefix y_{\le i} (0 \le i < T) of the target sentence y, the model is trained to predict the next k tokens, as shown in Figure 2:

\mathcal{L}_{NAT} = -\sum_{j=1}^{k} \log P\big(y_{i+j} \mid y_{\le i}, x; \theta_{NAT}\big)

In addition, we leverage the glancing strategy following Qian et al. (2021), which exploits curriculum learning during training for better performance. As in previous NAT work, we apply sequence-level knowledge distillation (Seq-KD) Kim and Rush (2016) with an autoregressive Transformer teacher model to train our NAT drafter.
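To illustrate how such training examples can be constructed (setting aside the glancing strategy and Seq-KD details), the sketch below samples a prefix and masks the next k positions; the [MASK] id, padding scheme, and helper name are illustrative assumptions rather than the paper's exact recipe.

```python
import random

MASK_ID = 4  # assumed id of the [MASK] placeholder token

def make_drafter_example(tgt_ids, k=25, pad_id=1):
    """Build one NAT-drafter training example from a (distilled) target sentence.

    The decoder input is a randomly sampled target prefix followed by k [MASK]
    tokens; the loss is only computed on the k masked positions.
    """
    i = random.randint(0, max(0, len(tgt_ids) - 1))  # prefix length (may be 0)
    prefix = tgt_ids[:i]
    block = tgt_ids[i:i + k]                      # gold labels for the block
    block = block + [pad_id] * (k - len(block))   # pad when near the sentence end
    decoder_input = prefix + [MASK_ID] * k        # prefix + k masked slots
    labels = [pad_id] * len(prefix) + block       # loss only on the masked slots
    return decoder_input, labels
```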

3.2 AT verifier

We use the conventional autoregressive Transformer (see Section 2.1) as our AT verifier, which is the key to guaranteeing the translation quality. Since we hope that as many of the NAT model's drafted tokens as possible are accepted by the AT verifier for a higher speedup, we also apply Seq-KD to the AT verifier with a teacher shared with the NAT drafter, which not only makes the NAT drafter and AT verifier behave similarly, but also helps improve the AT verifier's translation quality Furlanello et al. (2018).

4 GAD++

As shown in Figure 2 and discussed in Section 3, the vanilla GAD only accepts the drafted tokens that match the top-1 result of the AT verifier, which guarantees that GAD’s translation is identical to greedy decoding of AT. However, the top-1 results are not necessarily better than the drafted tokens. As a result, the strict verification criterion (i.e., top-1 matching) will result in many good drafted tokens being discarded just because they are different from the top-1 result of the AT verifier, which limits the speedup of GAD.

To overcome this limitation, we propose a variant of GAD named GAD++, which is illustrated in Figure 3. Instead of the rigid top-1 matching requirement of the vanilla GAD shown in Eq (5), GAD++ loosens the criterion to trust NAT's draft more, only requiring the drafted tokens to fall among the top-β candidates with a tolerable (log-likelihood) score gap τ away from the top-1 result. Concretely, a drafted token \tilde{y}_{i+j} passes GAD++'s verification if Eq (6) and Eq (7) are both true:

\tilde{y}_{i+j} \in \text{top-}\beta\ P\big(y \mid \hat{y}_{\le i}, \tilde{y}_{i+1:i+j-1}, x; \theta_{AT}\big)    (6)

\log P\big(\tilde{y}_{i+j} \mid \hat{y}_{\le i}, \tilde{y}_{i+1:i+j-1}, x; \theta_{AT}\big) \ge \log P\big(\hat{y}_{i+j} \mid \hat{y}_{\le i}, \tilde{y}_{i+1:i+j-1}, x; \theta_{AT}\big) - \tau    (7)

where \log P\big(\hat{y}_{i+j} \mid \hat{y}_{\le i}, \tilde{y}_{i+1:i+j-1}, x; \theta_{AT}\big) is the log-likelihood score of the top-1 ranked result of the AT verifier.

The advanced verification criterion with the hyperparameters top-β and tolerance τ not only allows more drafted tokens to be accepted for a higher speedup, but also enables GAD++ to yield translations beyond greedy decoding.
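As a minimal sketch of this acceptance test (Eq (6) and (7)), assuming log_probs is the AT verifier's log-softmax over the vocabulary at one draft position; the function name and tensor layout are our own illustrative choices:

```python
import torch

def gad_plus_plus_accept(log_probs: torch.Tensor, draft_tok: int,
                         beta: int = 3, tau: float = 1.0) -> bool:
    """Return True if the drafted token passes GAD++ verification (Eq 6-7).

    log_probs: the AT verifier's log-probabilities over the vocabulary at this
    position, conditioned on the accepted prefix and preceding draft tokens.
    """
    top_scores, top_ids = log_probs.topk(beta)   # top-beta candidates, Eq (6)
    if draft_tok not in top_ids.tolist():
        return False
    # Eq (7): the draft's score must be within tau of the top-1 score.
    return bool(log_probs[draft_tok] >= top_scores[0] - tau)
```

At the first position where this test fails, decoding falls back to the AT verifier's top-1 token, exactly as at the bifurcation position in the vanilla GAD.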

Models Iteration BLEU Speed
Teacher Transformer-big (beam=5) N 29.31 /
AT (w/o Seq-KD) Transformer-base (beam=5) N 27.76 1.0
Fully NAT NAT w/ Fertility Gu et al. (2018) 1 17.69 15.6
CTC Libovický and Helcl (2018) 1 16.56 /
Bag-of-ngrams Shao et al. (2020) 1 20.90 10.8
AXE Ghazvininejad et al. (2020a) 1 23.53 /
GLAT Qian et al. (2021) 1 25.21 15.3
AligNART Song et al. (2021) 1 26.40 13.4
DSLP Huang et al. (2021) 1 27.02 14.8
F-VAE Gu and Kong (2021) 1 27.49 16.5
Iterative NAT iNAT Lee et al. (2018) 10 21.61 /
CMLM Ghazvininejad et al. (2019) 10 27.03 1.7
LevT Gu et al. (2019) 2.1 27.27 4.0
SMART Ghazvininejad et al. (2020b) 10 27.65 1.7
DisCo Kasai et al. (2020) 4.8 27.34 3.5
Imputer Saharia et al. (2020) 8 28.20 3.9
Multi-Task NAT Hao et al. (2021) 10 27.98 1.7
RewriteNAT Geng et al. (2021) 2.7 27.83 3.1
SUNDAE Savinov et al. (2021) 16 28.46 1.4
Ours NAT drafter (b=25) 1.6 26.48 14.3
AT verifier (beam=5) N 28.89 1.0
AT verifier (greedy) N 28.73 1.1
GAD (b=25) 4.9 28.73 3.0
GAD++ (b=25, top-3, τ=1) 4.0 28.89 3.6
GAD++ (b=25, top-5, τ=3) 3.1 28.73 4.5
Table 1: Results of GAD on the WMT14 EN-DE benchmark. Transformer-base models w/ and w/o Seq-KD are used as the verifier of GAD and the AT baseline, respectively. b is the block size, top-β indicates the top-β selection in GAD++, and τ is the tolerance hyperparameter. We also report the average number of decoding iterations and the speedup for comparison. Speedup results of previous NAT approaches are those reported in the original papers, obtained by comparison with greedy decoding.

5 Experiments

5.1 Experimental Settings

Datasets and Evaluation

We evaluate our approach on the most recognized machine translation benchmark: WMT14 English-German translation (https://www.statmt.org/wmt14), which contains 4.5M translation pairs for training. Following prior work Ott et al. (2018), we adopt newstest-13 as our validation set for finding the best hyperparameters and model checkpoints, and test on newstest-14. We use 32K Byte Pair Encoding (BPE) Sennrich et al. (2016) subwords as the joint source-target dictionary. We use BLEU Papineni et al. (2002) to evaluate translation quality. For inference efficiency, we report both the number of decoding iterations and the speedup over the beam search baseline. Specifically, we test the inference speed with the fairseq implementation (https://github.com/pytorch/fairseq), using PyTorch 1.10 with CUDA 11 on 1 Nvidia P100 GPU. Beam search (beam=5) is our speed baseline (1.0x).

Model Configuration

We mainly study the most commonly used base-size Transformer Vaswani et al. (2017) architecture for NMT. The Transformer-base has a 6-layer encoder and a 6-layer decoder; its embedding dimension/FFN dimension/number of heads are 512/2,048/8. We use this architecture for both the drafter (NAT) and the verifier (AT). Model training details are included in the Appendix. We apply sequence-level knowledge distillation as discussed in Sections 3.1 and 3.2. Following recent iterative NAT work Ghazvininejad et al. (2019); Saharia et al. (2020); Savinov et al. (2021), the shared teacher we use is a Transformer-big model that is trained on the raw training set and generates the distilled training set with beam search (beam=5).

5.2 Main Results

The main results on the WMT14 English-German translation task are presented in Table 1. Unlike previous NAT approaches, which are inferior to AT with Seq-KD (i.e., our AT verifier), GAD achieves exactly the same translation quality as greedy decoding by our AT verifier but with an around 3x speedup. GAD++ further improves the results by loosening the strict top-1 matching criterion: slight loosening (top-3, τ=1) allows us to achieve better translation than greedy decoding – even approaching the beam search result – with a higher speedup (3.0x → 3.6x), while slightly more aggressive loosening (top-5, τ=3) further accelerates inference (3.6x → 4.5x) owing to the acceptance of more tokens, despite a marginal loss of translation quality.

Figure 4: Per-sentence speedup distribution of GAD++ (b=25, top-5, τ=3) on the WMT14 EN-DE test set, which has 3,003 sentences in total.

Looking into our results, we observe that our NAT drafter's translation quality is better than the majority of fully non-autoregressive models but inferior to most iterative NAT approaches. Compared with NAT models that include complicated mechanisms such as length prediction, length beam, reranking, and CTC, which slow down per-iteration efficiency, our NAT drafter is simple and straightforward. As a result, its decoding efficiency per iteration is much higher, leading to a speedup comparable to fully NAT despite taking 1.6 decoding iterations on average. The acceptable translation quality and high efficiency of our NAT drafter significantly help accelerate autoregressive decoding, playing a critical role in GAD's lossless speedup of AT.

To further understand the acceleration effects of GAD, we present the per-sentence speedup distribution on the test set in Figure 4, showing that most sentences are translated with a 3x~6x speedup over the beam search baseline, while a few rare cases achieve an even higher speedup.

5.3 Analysis

5.3.1 Hyperparameters

Block Size

We conduct experiments with various block sizes b on our development set and show the results in Table 2. As the block size increases, the mean number of accepted tokens, which highly correlates with the speedup and the number of decoding iterations, first increases and reaches its peak at b=25. Further increasing b has an adverse effect, because it becomes very hard for the model to learn to translate so many tokens simultaneously given its limited capacity, leading to a drop in both efficiency and quality.

Models b Tok. BLEU Speed
AT (beam=5) / 1.00 26.72 1.00
GAD++ 10 5.97 26.68 3.04
15 6.74 26.94 3.47
20 7.24 26.75 3.55
25 7.56 26.92 3.79
30 7.44 26.75 3.63
Table 2: The mean number of accepted tokens (Tok.), translation quality (BLEU), and efficiency (Speed) when decoding with various block sizes b on the development set. The results are obtained with GAD++ (top-3, τ=1).
Top-β and Tolerance τ in GAD++

We study the effects of the GAD++ hyperparameters top-β and tolerance τ, and show the results on the development set in Table 3. Moderately increasing β and τ not only increases the mean number of accepted tokens, since AT verification becomes less strict, but also improves the translation quality over greedy decoding. However, translation quality decreases if the constraints are loosened too much: the BLEU score degrades from the peak of 27.02 to 26.64 when decoding with top-5 selection (i.e., β=5) and τ=5. Based on the results on the development set, we conservatively select top-3, τ=1 for the best translation quality, and use top-5, τ=3 to pursue a higher speedup without substantial loss of translation quality.

5.3.2 Model Size

In addition to the base-size models, we also study larger models to test the effectiveness of GAD. We here use Transformer-big Vaswani et al. (2017) as the architecture for both the NAT drafter and the AT verifier in GAD/GAD++[6], and compare it with the conventional Transformer-big baseline as well as Blockwise Decoding Stern et al. (2018) – a state-of-the-art efficient Transformer-big variant that introduces additional heads on top of the Transformer decoder to generate the next block of tokens in parallel and then verifies them, which works in a similar way to ours. According to Table 4, our GAD/GAD++ substantially outperforms the baseline and Blockwise Decoding, with up to an around 5x speedup over Transformer-big beam search. Compared with Blockwise Decoding, which only uses lightweight heads to generate the next few tokens in parallel, our independent NAT drafter in GAD is much more powerful and generates more drafted tokens that can be accepted[7], resulting in a significantly higher speedup, despite introducing more parameters that account for only a negligible additional memory cost (see Table 8 in Appendix A).

[6] The hyperparameters (e.g., block size b, top-β, tolerance τ) in GAD/GAD++ (big) are re-tuned on the development set and may differ from those of the base-size models.

[7] Our vanilla GAD (base) accepts an average of 6.13 tokens at each decoding iteration on the development set, and GAD++ increases the number to 10.99 tokens with comparable quality, whereas Blockwise Decoding (base) is reported to accept only 1.90 tokens on average at each decoding iteration with no quality loss.

Models τ Top-3 (β=3) Top-5 (β=5)
GAD++ 1 7.56/27.02 7.58/27.02
2 8.64/26.92 8.77/26.92
3 9.46/26.84 9.72/26.84
4 10.04/26.78 10.50/26.74
5 10.38/26.70 10.99/26.64
Table 3: Performance on the development set with different GAD++ hyperparameters. Each cell lists the mean number of accepted tokens / BLEU score. The results are obtained with GAD++ (b=25). The BLEU score of greedy decoding of the AT verifier is 26.62.
Models Iteration BLEU Speed
Teacher Transformer-big (beam=5) N 29.31 1.0
Blockwise Stern et al. (2018) Blockwise decoding (k=2) / 28.95 1.7
Blockwise decoding (k=10) / 27.40 3.0
Ours NAT drafter 1.4 27.35 15.0
AT verifier (beam=5) N 29.25 1.0
AT verifier (greedy) N 29.18 1.1
GAD 4.8 29.18 3.0
GAD++ (top-3) 4.2 29.32 3.5
GAD++ (top-5) 2.6 29.15 5.0
Table 4: Results of GAD with the big-size model configuration on WMT14 EN-DE and comparison to the state-of-the-art Blockwise Decoding Stern et al. (2018). Speedup results of Blockwise Decoding are those reported in the original paper, obtained by comparison with greedy decoding.

Moreover, we observe that big-size models can use a larger block size b than the base-size models (b=25), since larger capacity equips the model with a more powerful ability to learn to decode more tokens well in parallel. To better demonstrate this point, we conduct a comparative study of the effect of the NAT drafter's size given the same block size (b=25) in the GAD-base setting. According to Table 5, the big-size NAT drafter largely outperforms its base-size counterpart: it generates drafted tokens more reliably (i.e., more drafted tokens are accepted by the AT verifier on average), resulting in fewer decoding iterations, which indicates that GAD can be further improved when equipped with a more powerful NAT drafter.

Models Tok. Iter. BLEU
AT-Base (greedy) 1 N 28.73
GAD-Base 5.53 5.0 28.73
w/ NAT-Big 5.90 4.7 28.73
GAD++-Base (top-3, ) 6.69 4.1 28.81
w/ NAT-Big 7.32 3.8 29.12
GAD++-Base (top-5, ) 8.71 3.2 28.58
w/ NAT-Big 9.73 2.8 28.98
Table 5: A comparative study on the WMT14 EN-DE test set, replacing NAT-base in GAD-base with the stronger NAT-big. The results are obtained with GAD (b=25).

5.3.3 Teacher Model(s)

We study the effects of the teacher on GAD by comparing the results of a single teacher with a teacher ensemble of 3 Transformer-big models in Table 6. Compared with a single teacher model, the teacher ensemble improves the NAT drafter, the AT verifier, and the end-to-end GAD/GAD++ results, indicating that our approach effectively benefits from a better teacher.

Models Single Teacher Teacher Ensemble
Iter. BLEU Speed Iter. BLEU Speed
Teacher Transformer-big (beam=5) N 29.31 / N 30.64 /
Base NAT drafter 1.6 26.48 14.3 1.6 27.16 14.0
AT verifier (beam=5) N 28.89 1.0 N 29.19 1.0
AT verifier (greedy) N 28.73 1.1 N 29.24 1.1
GAD (b=25) 4.9 28.73 3.0 4.5 29.24 3.3
GAD++ (b=25, top-3, τ=1) 4.0 28.89 3.6 4.0 29.31 3.4
GAD++ (b=25, top-5, τ=3) 3.1 28.73 4.5 3.1 29.52 4.3
Big NAT drafter 1.4 27.35 15.0 1.4 28.10 14.8
AT verifier (beam=5) N 29.25 1.0 N 29.74 1.0
AT verifier (greedy) N 29.18 1.1 N 29.69 1.1
GAD 4.8 29.18 3.0 4.2 29.69 3.3
GAD++ (top-3) 4.2 29.32 3.5 3.8 29.88 3.6
GAD++ (top-5) 2.6 29.15 5.0 2.5 29.81 5.1
Table 6: Performance comparison between a single teacher model and teacher ensemble (3 teacher models). We report the results of both base-size and big-size GAD. Transformer-base and Transformer-big with beam search are the AT baselines of GAD-base and GAD-big, respectively.
Modules Latency(ms) Percent(%)
AT Encoder 5.65 7.70
NAT Encoder 5.73 7.80
NAT Decoder 31.44 42.81
AT Decoder 27.33 37.21
Others 3.31 4.48
Total 73.46 100
Table 7: Profiling of GAD++ (base-size, b=25, top-5, τ=3) on the WMT14 EN-DE test set.

5.4 Discussion

Contrary to the stereotype that more models (parameters) tend to slow down inference, GAD's introduction of an additional NAT model can significantly speed up AT without quality loss, by increasing computational parallelism to better utilize (hardware) computing resources. We believe GAD is promising and will benefit further as processor hardware becomes increasingly powerful at parallel computing.

As a preliminary study, GAD is far from perfect and has much room for improvement. First, according to the experimental results above, GAD's translation quality mainly depends on the AT verifier, while its efficiency relies on the NAT drafter (whose capability determines how many drafted tokens can be accepted). We believe more powerful NAT/AT models (than the simple and naive ones used in this paper) will enable GAD to achieve better results.

Moreover, GAD's potential can be further exploited by optimizing its implementation in computing and memory access. For example, according to Table 7, which shows the time cost of GAD++ by module, our naive implementation spends approximately 16% of the overall time (sequentially) encoding the input for AT and NAT. Obviously, this part can be optimized by performing AT and NAT encoding in parallel, because they are independent, or by (partially) sharing AT's encoder with NAT, which we leave for future exploration. Also, the NAT decoder costs more than the AT decoder because it employs bidirectional attention and cannot reuse the computation of already decoded tokens as AT does; we believe this can be improved in the future with a better non-autoregressive decoding mechanism designed for GAD.

6 Related Work

Non-autoregressive Decoding

To address the decoding inefficiency of autoregressive translation (AT), Gu et al. (2018) first proposed Non-Autoregressive Translation (NAT), which decodes the output sentence in a single iteration at the cost of translation quality. Recent work has mainly focused on improving quality while maintaining competitive speedups, including applying various training objectives Ghazvininejad et al. (2020a); Saharia et al. (2020); Du et al. (2021); Huang et al. (2021), modeling dependencies between target tokens Ghazvininejad et al. (2019); Qian et al. (2021); Gu and Kong (2021), and refining the translation outputs with multi-pass iterations Ghazvininejad et al. (2020b); Kasai et al. (2020); Hao et al. (2021); Geng et al. (2021); Savinov et al. (2021). However, due to the inherent conditional independence assumption, NAT models' translation quality is generally less reliable than AT's.

Semi-autoregressive Decoding

There have also been attempts to combine autoregressive and non-autoregressive decoding: Wang et al. (2018) proposed to utilize non-autoregressive decoding locally while keeping the autoregressive property globally; on the contrary, Ran et al. (2020) introduced a locally autoregressive model that retains the non-autoregressive property globally. Similar ideas have also been proposed for GEC: Chen et al. (2020) proposed to use a sequence tagger to identify the spans of grammatical errors and then use autoregressive decoding to edit them into grammatically correct text; Sun et al. (2021b) proposed Aggressive Decoding, which aggressively decodes as many tokens as possible by assuming the output sentence is the same as the input and then verifies the result through greedy decoding of the (autoregressive) Transformer model, which inspires our work. The work most similar to ours is Blockwise Decoding Stern et al. (2018), which inserts additional non-autoregressive heads on top of the Transformer decoder to generate the next k positions in parallel and uses the original autoregressive head to verify these outputs. However, its underinvestment in non-autoregressive modeling seriously limits its performance, resulting in much lower efficiency than our approach.

Cascade Inference

In general, cascade inference includes model cascades Huang et al. (2018); Streeter (2018); Wang et al. (2020), which sequentially apply a series of lightweight models to handle various instances based on their difficulty, and dynamic early exiting Xin et al. (2020); Liu et al. (2020); Zhou et al. (2020); Sun et al. (2021a), which introduces internal classifiers within deep neural networks to allow early exiting when a prediction is confident enough. Both are designed for efficient inference. Broadly, GAD can also be considered a special case of cascade inference, where the NAT drafter conducts the first round of inference and the AT verifier is responsible for the second round to control quality; to the best of our knowledge, this is among the earliest explorations of this direction for seq2seq generation.

7 Conclusion and Future Work

We propose Generalized Aggressive Decoding (GAD) as well as its variant GAD++, which achieve a state-of-the-art speedup of autoregressive translation (AT) without quality loss through the collaboration of AT and NAT. GAD demonstrates a novel yet promising perspective for efficient seq2seq generation, and it is orthogonal to efforts to advance state-of-the-art NAT and AT models, which can further benefit GAD.

Despite the promising results, GAD still has great potential with much room for improvement, as discussed in Section 5.4, and is likely to benefit efficient lossless inference for other seq2seq tasks such as abstractive summarization, which we leave as future work. We look forward to more studies that evolve GAD into a mature and standard decoding algorithm for seq2seq generation in the near future.

References

Appendix

Appendix A Memory Analysis

Table 8 shows the comparison of peak GPU memory (footprint) utilization between GAD and the AT baselines during inference. Compared with Transformer-base, GAD-base only costs about an additional 400MB of GPU memory, which is negligible for a modern GPU. Of this additional cost, around 250MB is used to store the (static) weights of the NAT drafter; the remainder mainly stores the final encoder states of the NAT drafter, which are dynamic and depend on the shape (e.g., length) of the input tensor.

Models Memory Util. Percent(%)
AT (beam=5) 1237MiB 7.6
AT (greedy) 1211MiB 7.4
GAD 1621MiB 10.0
GAD++ 1613MiB 9.9
Full 16280MiB 100
Table 8: Comparison of GPU memory utilization between GAD and the AT baselines. The results are obtained with fp32 computation on a single Nvidia P100 GPU. The hyperparameters of GAD++ are b=25, top-5, τ=3.

Appendix B Word Repetitions

With the conditional independence assumption, NAT models show a serious weakness in modeling highly multimodal distributions. The token repetition ratio is often used as a proxy for this multi-modality problem, representing the degree of text inconsistency. However, the AT verifier guarantees that this problem does not exist in GAD: as shown in Table 9, the token repetition ratio of GAD/GAD++ is similar to that of our AT baseline, which is significantly lower than those of most relevant NAT models.
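The paper does not spell out the exact formula; a common instantiation, which we assume here, counts the fraction of tokens that repeat their immediate predecessor:

```python
def repetition_ratio(hypotheses):
    """Fraction of tokens identical to their immediate predecessor."""
    repeats = total = 0
    for sent in hypotheses:
        toks = sent.split()
        total += len(toks)
        repeats += sum(1 for a, b in zip(toks, toks[1:]) if a == b)
    return repeats / max(total, 1)

# e.g., the NAT draft "Nach den Angaben Angaben war ..." from Table 11
# contributes one consecutive repetition ("Angaben Angaben").
```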

Models BLEU Rep.
AT (greedy) 28.73 0.18%
CMLM (iter=4) 25.75 1.13%
CMLM (iter=10) 27.09 0.24%
GAD 28.73 0.18%
GAD++ 28.89 0.17%
Table 9: Token repetition ratio on WMT14 EN-DE. GAD-base is tested with hyperparameters b=25, top-3, τ=1. CMLM is tested with our implementation, with the length beam set to 3.

Appendix C Hyperparameters

The hyper-parameters for training GAD are listed in Table 10.

AT verifier
Hyperparameter Value
devices 8 Nvidia V100 GPUs
label smoothing 0.1
# max tokens 20000
update frequency 4
dropout rate [0.1, 0.2, 0.3]
max source positions 1000
max target positions 1000
Adam lr
Adam β1 0.9
Adam β2 0.99
lr-scheduler inverse sqrt
warm-up lr
weight decay 0.00001
clip norm 3.0
# warmup updates 4000
max updates 100K
max epoch 1000

NAT drafter
Hyperparameter Value
devices 8 Nvidia V100 GPUs
label smoothing 0.1
# max tokens 4096
update frequency 4
dropout rate 0.1
max source positions 1000
max target positions 1000
Adam lr
Adam β1 0.9
Adam β2 0.999
Adam ε
lr-scheduler inverse sqrt
warm-up lr
weight decay 0.01
clip norm 5.0
# warmup updates 10000
max updates 300K
Table 10: Hyper-parameters and settings for training the AT verifier (top) and the NAT drafter (bottom).

Appendix D Case Study

In Table 11, we present several examples to illustrate how GAD/GAD++ generates translations. Take Example 1 for illustration: in the first iteration, the drafter's output suffers from multi-modality problems such as the repeated "Angaben Angaben". The verifier accepts the tokens "Nach den" and replaces the inappropriate "Angaben" with "vorliegenden". In the second iteration, the vanilla GAD's verification finds the bifurcation at the first position, so all tokens after this position are discarded. After 4 iterations, decoding finishes once the [EOS] token is found.

Compared with the vanilla GAD, in the second iteration GAD++ finds that the top-β candidate "Angaben" meets the loosened requirement and accepts this token, showing that the loosened constraints do help accept more drafted tokens in our proposed GAD++.

Example 1- vanilla GAD
Source According to the details provided , the tunnel had not yet been put into use .
D Nach den Angaben Angaben war der Tunnel noch nicht in
V Nach den vorliegenden war war der Tunnel noch nicht in Betrieb
D Nach den vorliegenden Angaben war der Tunnel noch nicht in Betrieb genommen
V Nach den vorliegenden Einzelheiten war der Tunnel noch nicht in Betrieb genommen worden .
D Nach den vorliegenden Einzelheiten war der Tunnel noch nicht in Betrieb genommen [EOS]
V Nach den vorliegenden Einzelheiten war der Tunnel noch nicht in Betrieb genommen worden
D Nach den vorliegenden Einzelheiten war der Tunnel noch nicht in Betrieb genommen worden . [EOS]
V Nach den vorliegenden Einzelheiten war der Tunnel noch nicht in Betrieb genommen worden . [EOS]
Results Nach den vorliegenden Einzelheiten war der Tunnel noch nicht in Betrieb genommen worden .
Example 2-vanilla GAD
Source Yesterday , Gut@@ acht ’s Mayor gave a clear answer to this question .
D Gestern hat der Bürger@@ meister von Gut@@ acht eine klare
V Gestern hat Gut@@ Bürger@@ meister von Gut@@ acht eine klare Antwort
D Gestern hat Gut@@ acht@@ ts Bürger@@ meister eine klare Antwort auf diese Frage
V Gestern hat Gut@@ ach@@ s Bürger@@ meister eine klare Antwort auf diese Frage gegeben
D Gestern hat Gut@@ ach@@ ts Bürger@@ meister eine klare Antwort auf diese Frage gegeben
V Gestern hat Gut@@ ach@@ ts Bürger@@ meister eine klare Antwort auf diese Frage gegeben .
D Gestern hat Gut@@ ach@@ ts Bürger@@ meister eine klare Antwort auf diese Frage gegeben . [EOS]
V Gestern hat Gut@@ ach@@ ts Bürger@@ meister eine klare Antwort auf diese Frage gegeben . [EOS]
Results Gestern hat Gut@@ ach@@ ts Bürger@@ meister eine klare Antwort auf diese Frage gegeben .
Example 1-GAD++
Source According to the details provided , the tunnel had not yet been put into use .
D Nach den Angaben Angaben war der Tunnel noch nicht in
V Nach den vorliegenden war war der Tunnel noch nicht in Betrieb
D Nach den vorliegenden Angaben war der Tunnel noch nicht in Betrieb genommen [EOS]
V Nach den vorliegenden Angaben war der Tunnel noch nicht in Betrieb genommen worden
D Nach den vorliegenden Angaben war der Tunnel noch nicht in Betrieb genommen worden . [EOS]
V Nach den vorliegenden Angaben war der Tunnel noch nicht in Betrieb genommen worden . [EOS]
Results Nach den vorliegenden Angaben war der Tunnel noch nicht in Betrieb genommen worden .
Example 2-GAD++
Source Yesterday , Gut@@ acht ’s Mayor gave a clear answer to this question .
D Gestern hat der Bürger@@ meister von Gut@@ acht eine klare
V Gestern hat der Bürger@@ meister von Gut@@ acht eine klare Antwort
D Gestern hat der Bürger@@ meister von Gut@@ acht eine klare Antwort auf diese Frage gegeben . [EOS]
V Gestern hat der Bürger@@ meister von Gut@@ acht eine klare Antwort auf diese Frage gegeben . [EOS]
Results Gestern hat der Bürger@@ meister von Gut@@ acht eine klare Antwort auf diese Frage gegeben .
Table 11: Examples from the WMT14 English-German translation task. At each iteration, D and V are the outputs of the drafter and the verifier, respectively. Tokens within red blocks are at the bifurcation positions. The verified pieces after the bifurcation are annotated with strikethrough. The highlighted parts are the accepted translations from previous iterations. Tokens in blue blocks are top-β candidates that meet the GAD++ requirement. The hyperparameters are b=10, top-3, τ=1. "@@" marks BPE subwords, e.g., "Gut@@ acht" → "Gutacht". The output pieces after the [EOS] token are omitted in the table.