latent-GLAT: Glancing at Latent Variables for Parallel Text Generation

Recently, parallel text generation has received widespread attention due to its success in improving generation efficiency. Although many advanced techniques have been proposed to improve its generation quality, they still require an autoregressive model for training to overcome the one-to-many multi-modality phenomenon in the dataset, which limits their applications. In this paper, we propose latent-GLAT, which employs discrete latent variables to capture word categorical information and invokes an advanced curriculum learning technique, alleviating the multi-modality problem. Experiment results show that our method outperforms strong baselines without the help of an autoregressive model, which further broadens the application scenarios of the parallel decoding paradigm.


1 Introduction

The non-autoregressive Transformer (NAT, Gu et al., 2018) introduces a parallel decoding paradigm with substantially higher decoding efficiency than autoregressive models Bahdanau et al. (2015); Gehring et al. (2017); Vaswani et al. (2017). Unlike autoregressive models, NAT models impose a conditional independence assumption among target words to support parallel decoding of sentences during inference. This has attracted many researchers to explore NAT in machine translation Gu et al. (2018); Lee et al. (2018); Kaiser et al. (2018) and text-to-speech tasks Chen et al. (2019); Peng et al. (2020).

Many researchers have devoted themselves to improving NAT's inferior generation quality, e.g., modeling word inter-dependencies via curriculum learning Guo et al. (2020a); Liu et al. (2020) or iterative refinement mechanisms Ghazvininejad et al. (2019); Guo et al. (2020b), introducing latent variables to decompose target sentences and serve as a springboard for decoding Shu et al. (2019); Ma et al. (2019); Bao et al. (2021), and introducing inductive biases for model training Wei et al. (2019); Li et al. (2019). The most successful method is the glancing transformer (GLAT, Qian et al., 2021a), which trains the NAT model by sampling partial target words as inputs to predict the remaining target words, explicitly building dependencies between the observed and unobserved words. Qian et al. (2021b) employ GLAT to achieve impressive results on the WMT21 translation task (http://statmt.org/wmt21/), even outperforming many strong autoregressive translation systems in BLEU score Papineni et al. (2002).

Although existing NAT models achieve competitive results compared to autoregressive models on translation tasks, they still need the help of an autoregressive Transformer (AT, Vaswani et al., 2017) as a teacher for training, i.e., sequence-level knowledge distillation (Kim and Rush, 2016). A well-recognized explanation is the multi-modality problem Zhou et al. (2020); Sun and Yang (2020): each input may have multiple valid outputs in the dataset, which prevents NAT models from learning to organize consistent outputs. Training with the outputs of an AT directly bypasses the multi-modal phenomenon in the dataset, effectively improving the models' performances.

However, training NAT models with knowledge distillation has limitations. First, it requires training an extra AT model, which inevitably enlarges the training cost. Second, it is hard to guarantee that the teacher (AT) model is accurate enough in all text generation settings, and the teacher may become the bottleneck for its student NAT model. Therefore, training a model from scratch without the help of an AT model remains an open and interesting problem.

In this paper, we propose latent-GLAT, which can directly learn from the raw dataset. It alleviates the multi-modality problem in a divide-and-conquer spirit, introducing a small set of discrete latent variables to capture the target word categorical information and dividing the original goal into latent variable modeling and sentence reconstruction. First, the categorical information may exhibit fewer multi-modality phenomena than the original words and thus can be learned directly without the help of knowledge distillation. Second, the word categorical information is informative for sentence reconstruction. We can extend glancing training with these discrete latent variables for modeling the sentence, encouraging the model to build dependencies on word categorical information rather than on specific words, which works more robustly.

Experiment results on the WMT14, Quora, and DailyDialog datasets show that latent-GLAT achieves remarkable improvements over several strong baselines, verifying its effectiveness. More impressively, latent-GLAT even outperforms the autoregressive model on the Quora and DailyDialog datasets, further validating our motivation for removing knowledge distillation. In-depth analyses indicate that the introduced discrete latent variables help alleviate the multi-modality problem and are necessary for the performance improvement.

2 Background

For a sequence-to-sequence task of predicting a target sequence $Y = \{y_1, \dots, y_T\}$ given an input sequence $X = \{x_1, \dots, x_S\}$, the classical autoregressive factorization decomposes the joint probability into a series of conditional probabilities:

$$P(Y \mid X) = \prod_{t=1}^{T} P(y_t \mid y_{<t}, X) \qquad (1)$$

where $y_{<t} = \{y_1, \dots, y_{t-1}\}$ denotes the prefix words.

Although such a factorization achieved great success in previous studies Bahdanau et al. (2015); Gehring et al. (2017); Vaswani et al. (2017), these models predict each word (we use BPE segmentation in our experiments, so the units are strictly tokens; for clarity, we use words and tokens interchangeably in the paper) based on the prefix words, which may suffer from error accumulation and slow decoding during inference.

Non-autoregressive Transformer.

To tackle the above problems, Gu et al. (2018) first propose the non-autoregressive Transformer (NAT), introducing a non-autoregressive factorization:

$$P(Y \mid X) = \prod_{t=1}^{T} P(y_t \mid X) \qquad (2)$$

where each word $y_t$ is modeled independently. During inference, the NAT model can decode all words simultaneously by taking $\hat{y}_t = \arg\max_{y_t} P(y_t \mid X)$ for each position $t$, remarkably improving the efficiency (about 15x speedup over an autoregressive Transformer).
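To make the contrast with Eqn. (1) concrete, the toy sketch below (not from any released codebase; the model interfaces are placeholder assumptions) shows that autoregressive decoding needs one sequential forward pass per position, while decoding under the factorization of Eqn. (2) reduces to a single batched argmax over all positions.

```python
import torch

def autoregressive_decode(model, x, max_len, bos_id):
    """Eqn. (1): each step conditions on the previously generated prefix."""
    y = torch.full((x.size(0), 1), bos_id, dtype=torch.long)
    for _ in range(max_len):                        # T sequential steps
        logits = model(x, y)                        # (batch, prefix_len, vocab)
        next_tok = logits[:, -1].argmax(-1, keepdim=True)
        y = torch.cat([y, next_tok], dim=1)
    return y[:, 1:]

def non_autoregressive_decode(model, x, tgt_len):
    """Eqn. (2): all positions are predicted independently in one pass."""
    logits = model(x, tgt_len)                      # (batch, tgt_len, vocab)
    return logits.argmax(-1)                        # y_t = argmax P(y_t | X) for every t
```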

However, the independence assumption may prevent the NAT model from leveraging the inherent word dependencies to organize consistent outputs. As a result, the efficiency improvements of NAT come at the cost of quality, e.g., a performance degradation of more than 10.0 BLEU Papineni et al. (2002) points in machine translation tasks Gu et al. (2018). Besides, recent studies Zhou et al. (2020); Sun and Yang (2020) point out that the multi-modality phenomenon in the dataset aggravates the challenge for NAT models.

Glancing Transformer.

To mitigate the issue of missing word dependencies in NAT models, Qian et al. (2021a) propose the Glancing Transformer (GLAT), introducing glancing training (GLT), which samples partial target tokens for training the NAT model:

$$\mathcal{L}_{\mathrm{GLT}} = -\sum_{y_t \in \bar{\tilde{Y}}} \log P(y_t \mid \tilde{Y}, X) \qquad (3)$$

where $\tilde{Y}$ is the set of sampled partial target tokens and $\bar{\tilde{Y}} = Y \setminus \tilde{Y}$ is its complement. GLAT progressively decreases the sampling ratio and obtains better performance in machine translation tasks.

Nevertheless, we find in experiments that GLAT still suffers from the multi-modality problem (we include details of GLAT in Appendix A): First, its sampling ratio cannot be decreased to zero during training, which leaves an exposure-bias issue. Second, it still heavily relies on a teacher model for further improvements Qian et al. (2021a).

Latent Transformer.

To alleviate the multi-modality problem, Kaiser et al. (2018); Shu et al. (2019); Ma et al. (2019); Bao et al. (2021) propose the Latent Transformer (LT), introducing latent variables $z$ for NAT predictions:

$$P(Y \mid X) = \sum_{z} P(z \mid X) \prod_{t=1}^{T} P(y_t \mid z, X) \qquad (4)$$

where the latent prior $P(z \mid X)$ is typically trained by variational inference Ma et al. (2019) or discretization techniques Kaiser et al. (2018). Such latent variables are decomposed from the target sentence; they are informative for determining the mode of the sentence and thus alleviate the multi-modality problem.

Although Latent Transformer models improve performance in terms of BLEU score, the autoregressive predictor Kaiser et al. (2018); Bao et al. (2021) or deep iterative transformation Shu et al. (2019); Ma et al. (2019) they use for predicting latent variables unavoidably sacrifices overall decoding efficiency. Besides, they do not explicitly build the inter-dependencies among the outputs.

3 Proposed Method: latent-Glat

In this section, we present latent-GLAT. latent-GLAT follows the Latent Transformer models Kaiser et al. (2018); Bao et al. (2021) but introduces glancing training Qian et al. (2021a) with the discrete latent variables. Our intuitions are as follows:

First, compared to the words, the introduced discrete latent variables may have fewer modes while remaining informative enough to determine the modes of the sentences. In such a case, we can learn the discrete latent variables directly with glancing training Qian et al. (2021a), keeping a competitive inference efficiency. More importantly, we can employ the latent variables to invoke glancing training for modeling the target sentences, since they are informative enough to reduce the multi-modality of the original sentences. Besides, glancing at latent variables also works robustly because we can obtain the latent variables during inference.

3.1 Introducing Discrete Latent Variables for Modeling Target Categorical Information

In this part, we state the structure of latent-GLAT, which introduces a small set of discrete latent variables for a NAT model, basically following Kaiser et al. (2018); Roy et al. (2018); Bao et al. (2021).

Let $K$ be the size of the discrete latent space and let $[K]$ denote the set $\{1, 2, \dots, K\}$. For each target sentence $Y$, we use a same-length latent variable sequence $z = \{z_1, \dots, z_T\}$ to model it as:

$$P(Y \mid X; \theta) = \sum_{z \in [K]^{T}} P(z \mid X; \theta)\, P(Y \mid z, X; \theta) \qquad (5)$$

where each $z_t \in [K]$ and $\theta$ denotes the model parameters.

Discretization.

To discretize target sentences into latent variables, we use vector quantization Roy et al. (2018), which works by dividing a large set of original vector representations into a small number of groups. We assign each token $y_t$ to the group whose representation is nearest to its own:

$$z_t = \operatorname*{arg\,min}_{k \in [K]} \big\| \mathrm{rep}(y_t) - Q_k \big\|_2 \qquad (6)$$

where $Q \in \mathbb{R}^{K \times d_{\mathrm{model}}}$ is the maintained group representation (codebook) and $d_{\mathrm{model}}$ is its dimension. Following Bao et al. (2021), we use the token embedding as $\mathrm{rep}(y_t)$. Finally, the model is trained to minimize

$$\mathcal{L}_{\theta} = \mathcal{L}_{Y} + \mathcal{L}_{z} \qquad (7)$$

where $\mathcal{L}_{Y}$ and $\mathcal{L}_{z}$ are the prediction losses for the words $Y$ and the latent variables $z$, respectively.

The maintained representations are updated with an exponential moving average over a mini-batch of target tokens:

$$Q_k \leftarrow \gamma\, Q_k + (1 - \gamma)\, \frac{1}{c_k} \sum_{y_t:\, z_t = k} \mathrm{rep}(y_t) \qquad (8)$$

where $c_k$ is the number of tokens assigned to group $k$ in the mini-batch and $\gamma$ is the decay parameter fixed in our experiments.
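As a rough illustration of Eqns. (6)-(8), the sketch below (assuming PyTorch; the tensor shapes, helper names, and the decay value are ours, not taken from any released code) assigns each token representation to its nearest codebook entry and updates the codebook with an exponential moving average over a mini-batch.

```python
import torch

def vq_assign(reps, codebook):
    """Eqn. (6): assign each token representation to its nearest codebook entry.
    reps: (num_tokens, d_model), codebook: (K, d_model) -> z: (num_tokens,)"""
    dists = torch.cdist(reps, codebook)                    # pairwise L2 distances
    return dists.argmin(dim=-1)

def ema_update(codebook, reps, z, decay=0.99):
    """Eqn. (8): exponential-moving-average codebook update.
    `decay` is a placeholder value, not necessarily the paper's setting."""
    K = codebook.size(0)
    one_hot = torch.nn.functional.one_hot(z, K).float()    # (num_tokens, K)
    counts = one_hot.sum(0)                                 # c_k: tokens assigned per group
    sums = one_hot.t() @ reps                               # (K, d_model) summed representations
    means = sums / counts.clamp(min=1).unsqueeze(-1)        # mini-batch mean per group
    updated = decay * codebook + (1 - decay) * means
    # only move entries that actually received tokens in this mini-batch
    mask = (counts > 0).unsqueeze(-1)
    return torch.where(mask, updated, codebook)
```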

Figure 1: Model architecture of latent-GLAT. The decoder inputs are a position-wise mix of representations produced by a gated neural network.

Figure 2: Training the latent predictor (a) and the mixture decoder (b) by glancing at discrete latent variables.

Architecture.

As shown in Figure 1, latent-GLAT mainly consists of an encoder (NAT Encoder), a latent predictor (NAT Predictor), and a decoder (Mix. Decoder). We parameterize them with multi-head attention-based encoder and decoder blocks, similar to the Transformer Vaswani et al. (2017). The encoder maps the source sentence $X$ to hidden representations, the latent predictor predicts the latent variable sequence $z$ from the decoder inputs and the encoder outputs, and the mixture decoder predicts the target sentence $Y$ conditioned on the latent variables. We use an extra module to predict the target length and initialize the decoder inputs with the Softcopy mechanism Wei et al. (2019).

3.2 Glancing at Discrete Latent Variables for Parallel Sequence Decoding

A small number $K$ of discrete latent variables can capture high-level categorical information of the target words, supporting a better learning design for parallel sequence decoding.

Our first insight is that we can learn to non-autoregressively predict the discretized latent variables directly, without the help of distillation. Specifically, we parameterize the latent predictor in a non-autoregressive fashion and use the glancing training technique (GLT, Qian et al., 2021a) to optimize it, as shown in Figure 2(a):

$$\mathcal{L}_{z} = -\sum_{z_t \in \bar{\tilde{z}}} \log P(z_t \mid \tilde{z}, X) \qquad (9)$$

where the observed subset $\tilde{z}$ is uniformly sampled from the discretized sequence $z$ and $\bar{\tilde{z}}$ is its complement, following Qian et al. (2021a). We provide more training details of latent-GLAT in Appendix B.

Our next insight is to model the sentence based on the sampled latent variables rather than sampled target tokens, namely, glancing at $z$ for optimizing $\mathcal{L}_{Y}$:

$$\mathcal{L}_{Y} = -\sum_{y_t \in Y} \log P(y_t \mid \tilde{z}, X) \qquad (10)$$

We find that Eqn. (10) works robustly in experiments and analyze it in Section 4.3.

As shown in Figure 2(b), we eventually also employ words to invoke glancing training when minimizing $\mathcal{L}_{Y}$, namely, we optimize the decoder by minimizing

$$\mathcal{L}_{Y} = -\sum_{y_t \in \bar{\tilde{Y}}} \log P(y_t \mid \tilde{Y}, \tilde{z}, X) \qquad (11)$$

where $\tilde{Y}$ and $\tilde{z}$ are the sampled target tokens and discrete latent variables, respectively.

Overall Training Loss.

Our full-fledged loss includes the latent variable prediction, sentence reconstruction, and length prediction losses:

$$\mathcal{L} = \mathcal{L}_{z} + \mathcal{L}_{Y} + \alpha\, \mathcal{L}_{\mathrm{len}} \qquad (12)$$

where $\alpha$ is a hyper-parameter that adjusts the importance of the length prediction loss $\mathcal{L}_{\mathrm{len}}$.

3.3 Inference

In the inference phase, latent-GLAT predicts the target length, the latent variables, and the sentence in turn.

latent-GLAT first predicts the target length with the length predictor. To be robust to length prediction errors, it expands the predicted length to a small range of candidates (six candidates in total in our experiments).

Then, latent-GLAT predicts the latent variables with the latent predictor and the sentence with the decoder for each length candidate.

Similar to Ma et al. (2019), latent-GLAT ranks the candidates by itself (self-reranking) and chooses the highest-scoring output with:

$$Y^{*} = \operatorname*{arg\,max}_{Y \in \mathcal{C}} \; \frac{\log P(Y \mid z, X)}{|Y|^{\beta}} \qquad (13)$$

where $\mathcal{C}$ is the set of candidate outputs, $\beta$ is the length penalty ratio to avoid length bias, and $|Y|$ denotes the length of $Y$.
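The decoding procedure can be summarized as the following sketch (the interfaces of `encoder`, `length_predictor`, `latent_predictor`, and `decoder` are placeholders for the modules described above; the scoring follows Eqn. (13)).

```python
import math

def decode(encoder, length_predictor, latent_predictor, decoder, src,
           beta=1.0, num_candidates=6):
    """Length prediction -> candidate lengths -> parallel latent prediction ->
    parallel word prediction -> self-reranking as in Eqn. (13)."""
    enc_out = encoder(src)
    base_len = length_predictor(enc_out)               # predicted target length (an int here)
    offsets = range(-(num_candidates // 2), num_candidates - num_candidates // 2)
    best, best_score = None, -math.inf
    for delta in offsets:                               # a small range of length candidates
        tgt_len = max(1, base_len + delta)
        z = latent_predictor(enc_out, tgt_len)          # one parallel pass for latent variables
        tokens, log_probs = decoder(enc_out, z, tgt_len)  # one parallel pass for words
        score = sum(log_probs) / (tgt_len ** beta)      # length-penalised log-probability
        if score > best_score:
            best, best_score = tokens, score
    return best
```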

4 Experiments

We conduct experiments on several generation tasks, including machine translation, paraphrase generation, and dialog generation.

4.1 Experimental Setup

Dataset.

We chose the most popular benchmarks for each task:

  • Machine Translation (MT): We follow previous practices in NAT models and use the WMT14 English (EN) ↔ German (DE) corpus (4.5M sentence pairs) and the IWSLT14 German (DE) → English (EN) corpus (160K sentence pairs) to validate our proposed model. We obtain the datasets following the instructions open-sourced in fairseq (https://github.com/pytorch/fairseq). In detail, we first tokenize the datasets with the Moses script. Then, we use 37,000 and 10,000 operations to split the words into byte-pair encodings (BPE, Sennrich et al., 2016) for the WMT14 and IWSLT14 datasets, respectively. We also share subword embeddings between the source and target language for each dataset.

  • Paraphrase Generation (PG): We use the Quora dataset (https://www.kaggle.com/c/quora-question-pairs/data) to evaluate the paraphrase generation task. The Quora dataset contains around 135K labeled paraphrase pairs. Following the standard dataset split, we sample 100K sentence pairs from the labeled paraphrases as training data, hold out 30K pairs for testing, and use the remaining roughly 5K pairs for validation. As in the MT tasks, we tokenize the corpus with Moses scripts and split the words into BPE units with 32K operations in total.

  • Dialog Generation (DG): We conduct the dialog generation experiments on the DailyDialog dataset Li et al. (2017). We obtain the processed DailyDialog dataset from Bao et al. (2020) (https://github.com/gmftbyGMFTBY/MultiTurnDialogZoo). The training set contains 87,170 sentence pairs (11,118 dialogues). The validation and test sets contain 8,069 pairs (1,000 dialogues) and 7,740 pairs (1,000 dialogues), respectively.

Note that these tasks emphasize different aspects. The MT task aims to transfer bilingual sentences under a semantically invariant condition. The PG task differs from machine translation in that it works on mode transformation within the same language: the goal is to synthesize a sentence that differs from the original input but conveys the same meaning. The DG task is the most challenging due to its complex generation goal.

Implementations.

We compare latent-GLAT with Transformer (Vaswani et al., 2017), NAT (Gu et al., 2018), and GLAT Qian et al. (2021a) models. We implement them based on the open-source framework fairseq (Ott et al., 2019).

For machine translation tasks, we use the base setting of the Transformer Vaswani et al. (2017) for the WMT14 dataset and a smaller setting for the IWSLT14 dataset. The number of layers in the latent-GLAT decoder and the latent predictor is set to 4 in our experiments. We use inverse square root learning rate scheduling for WMT14 and a linearly annealed learning rate over 250K steps for IWSLT14. The models are optimized with the Adam (Kingma and Ba, 2015) optimizer for 300K steps on WMT14 and 250K steps on IWSLT14. As for the ratio used in glancing sampling, we linearly anneal it over the whole training process. The mini-batch in each step consists of 2K tokens for IWSLT14 and 64K tokens for WMT14.

Since the scales of the Quora and DailyDialog datasets are close to that of IWSLT14, we keep the same settings as for IWSLT14, such as the Adam optimizer, the linearly annealed learning rate, and the batch size (2K tokens).

Evaluation.

To validate the effectiveness of our proposed method, we evaluate it in terms of quality and efficiency. We use tokenized and cased BLEU scores Papineni et al. (2002) (computed with the fairseq_score script) to evaluate the generation quality of the MT and PG tasks. For dialog generation, we also include BLEU-1 and BLEU-2 scores for analysis. Following common practice Gu et al. (2018); Qian et al. (2021a), we measure the decoding latency of each model by decoding sentence by sentence and compute the speedup relative to the autoregressive Transformer (AT) model to reflect decoding efficiency. We highlight the best NAT result.

Models | WMT14 EN→DE | WMT14 DE→EN | IWSLT14 DE→EN | Quora BLEU | DailyDialog BLEU-1 | DailyDialog BLEU-2 | DailyDialog BLEU | Latency | Speedup
Transformer (AT) | 27.17 | 31.53 | 34.29 | 27.97 | 31.40 | 10.70 | 5.05 | 512.3 ms | 1.00x
NAT | 10.78 | 15.19 | 17.77 | 24.65 | 41.50 | 1.40 | 0.01 | 33.5 ms | 15.29x
GLAT | 16.71 | 24.78 | 29.07 | 27.01 | 39.50 | 26.20 | 26.13 | 33.5 ms | 15.29x
latent-GLAT | 24.71 | 29.16 | 32.31 | 29.11 | 41.00 | 28.30 | 27.50 | 45.3 ms | 11.31x
Table 1: Main results of different models on the test set of each dataset. We measure the decoding latency and speedup on the WMT14 EN→DE test set.

4.2 Main Results

We can see from Table 1 that latent-GLAT outperforms the NAT baselines (NAT and GLAT) in generation quality on almost all tasks while keeping a competitive decoding speedup relative to the autoregressive counterpart.

Machine Translation.

As seen, without the help of an AT model for training, the vanilla NAT and the advanced GLAT model only obtain inferior generation quality. In contrast, latent-GLAT achieves competitive generation quality in machine translation tasks, indicating that the introduced latent variables effectively reduce the multi-modality issue and support glancing training well. It narrows the performance gap between non-autoregressive and autoregressive decoding from 11.46 (GLAT vs. AT) to 2.34 (latent-GLAT vs. AT) BLEU points on the WMT14 EN→DE task while keeping a high decoding efficiency.

Paraphrasing.

Unlike the translation task, the performance gap between non-autoregressive and autoregressive decoding on the paraphrase generation task is minor (NAT vs. AT and GLAT vs. AT). Nevertheless, introducing discrete latent variables is still helpful for obtaining better performance: latent-GLAT realizes a non-autoregressive model that outperforms the autoregressive model on Quora (latent-GLAT vs. AT).

Dialog Generation.

We can see a different trend on the DailyDialog dataset: an AT model performs worse than NAT models. Both GLAT and latent-GLAT outperform the AT model in BLEU-1, BLEU-2, and BLEU scores, indicating that these models recall more reference tokens and organize the tokens well.

We conjecture that the weak and indirect association between the inputs and outputs of a dialogue causes this unusual phenomenon. Specifically, the weak connection may encourage the AT model to predict tokens by paying more attention to its history outputs, degenerating into a target-side language model. In contrast, NAT models do not have this fast track, which pushes them to pay more attention to the inputs and recall more target tokens. We further find so-called safe responses Li et al. (2016) in the AT's outputs, which supports our conjecture.

Models | WMT14 EN→DE | WMT14 DE→EN | IWSLT14 DE→EN | Speedup
CMLM | 10.88 | - | - | -
CMLM | 22.06 | - | - | 9.79x
CMLM | 24.65 | - | - | 3.77x
LevT | 24.43 | - | - | 2.93x
LV-NAR | 11.80 | - | - | 22.30x
SynST | 20.74 | 25.50 | 23.82 | 4.86x
Flowseq | 20.85 | 25.40 | - | 1.10x
CNAT | 21.30 | 25.73 | 29.81 | 10.37x
AT | 27.17 | 31.53 | 34.29 | 1.00x
NAT | 10.78 | 15.19 | 17.77 | 15.29x
GLAT | 16.71 | 24.78 | 29.07 | 15.29x
latent-GLAT | 24.71 | 29.16 | 32.31 | 11.31x
Table 2: BLEU scores and speedups of different models trained on raw datasets for machine translation. Some results are quoted from Ma et al. (2019), Guo et al. (2020b), Qian et al. (2021a), and the original papers. CMLM and LevT decode with multiple iterations during inference; the three CMLM rows use different iteration counts. "-": no corresponding result.

More Comparisons.

We further compare against advanced NAT models that build upon latent variables or iterative refinement on machine translation tasks:

  • NATs w/ latent variables: LV-NAR Shu et al. (2019), SynST Akoury et al. (2019), Flowseq Ma et al. (2019), and CNAT Bao et al. (2021).

  • Iterative NATs: CMLM Ghazvininejad et al. (2019) and LevT Gu et al. (2019).

Table 2 shows that introducing latent variables (LV-NAR, Flowseq, and CNAT) or decoding with multiple iterations (CMLM and LevT) both improve non-autoregressive decoding in translation quality. However, iterative refinements or deep transformations always sacrifice decoding efficiency. In contrast, the proposed latent-GLAT outperforms all NAT models at a relatively low cost, keeping a competitive speedup over the autoregressive Transformer (AT). Specifically, latent-GLAT with one-pass decoding narrows the performance gap to the AT from 5.87 BLEU points to 2.34 BLEU points on the WMT14 EN→DE test set.

Figure 3: BLEU scores and relative decoding speedups of different models on the WMT14 EN→DE test set. Note that we evaluate speedups on a single GTX 1080-Ti GPU and include only results evaluated on the same hardware for a fair comparison.

Decoding efficiency.

We can see there is a trade-off between the translation quality and decoding efficiency in Table 2. We thus present the scatter plot of different models in Figure 3, showing the trend of translation quality and decoding efficiency.

As seen, latent-GLAT is located on the top-right of the baselines. It outperforms the baselines in the BLEU score if decoding speedup is fixed and in decoding speedup if the BLEU score is fixed.

4.3 Analysis

We now turn to verify our intuition that latent-GLAT can alleviate the multi-modality problem.

Methods | WMT14 EN→DE | WMT14 DE→EN | IWSLT14 DE→EN | Avg. gain
NAT | 10.78 | 15.19 | 17.77 | +6.58
 w/ KD | 17.69 | 22.02 | 23.78 |
GLAT | 16.71 | 24.78 | 29.07 | +5.19
 w/ KD | 25.21 | 29.84 | 31.07 |
Flowseq | 20.85 | 25.40 | 24.75 | +2.87
 w/ KD | 23.72 | 28.39 | 27.55 |
CNAT | 21.30 | 25.73 | 29.81 | +3.08
 w/ KD | 25.56 | 29.36 | 31.15 |
latent-GLAT | 24.71 | 29.16 | 32.31 | +0.95
 w/ KD | 26.64 | 29.93 | 32.47 |
Table 3: BLEU scores of NAT models trained with (or without) knowledge distillation (KD) on translation tasks. The last column is the average gain from KD.

latent-GLAT largely alleviates the sentence-level multi-modal problem.

Previous studies Gu et al. (2018); Ma et al. (2019); Qian et al. (2021a); Bao et al. (2021) usually utilize a Transformer model as a teacher for training NAT models, namely sequence-level knowledge distillation Kim and Rush (2016), which directly reduces the sentence-level multi-modal phenomenon in datasets. Therefore, we use the average gain from knowledge distillation to reflect the ability of a NAT model to overcome this issue.

As seen in Table 3, the pure NAT models heavily rely on knowledge distillation. By introducing target information with latent variables (Flowseq and CNAT) or sampled tokens (GLAT), the NAT models improve their ability to overcome the multi-modality issue. Our proposed latent-GLAT combines the above two techniques well: it obtains an average gain of only 0.95 BLEU points from distillation, which validates our motivation.

Datasets | Configuration | Token-level complexity | Sentence-level complexity
WMT14 | Inputs → Raw outputs | 2.19 | 3.03
WMT14 | Inputs → AT outputs | 1.38 | 2.13
WMT14 | Inputs → Latent variables | 1.01 | 1.35
Quora | Inputs → Raw outputs | 0.86 | 1.48
DailyDialog | Inputs → Raw outputs | 1.19 | 4.23
Table 4: Token-level and sentence-level complexity of different text generation datasets. Higher values indicate more complexity.

Discrete latent variables have fewer modes than raw sentences.

To validate our intuition that the introduced latent variables are easier to predict than tokens, we follow Zhou et al. (2020) and compute complexity metrics on each dataset based on alignment relations. Specifically, we use the fast_align toolkit (https://github.com/clab/fast_align) to align the source inputs with the target outputs or with the discretized latent variable sequences. Then, we compute the token-level complexity and the sentence-level complexity according to Zhou et al. (2020). These metrics can roughly be understood as the number of valid candidates for each input.

As shown in Table 4, the latent variables have the lowest token-level and sentence-level complexity. In other words, predicting the latent variable sequences is easier than predicting the alternatives, which is consistent with our intuition. Although we can obtain a lower-complexity dataset by filtering the data with an autoregressive model (AT outputs versus Raw outputs), this may introduce model errors and requires extra training for the AT model. In contrast, the discrete latent variables are simple yet informative enough to serve as a springboard for modeling target sentences.
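As an intuition-level illustration only (this is not Zhou et al.'s exact formulation, merely a crude proxy we sketch here), one can parse fast_align's Pharaoh-format output and count, for each source token type, how many distinct target symbols (tokens or latent codes) it aligns to; a smaller average fan-out suggests fewer modes per input.

```python
from collections import defaultdict

def alignment_fanout(src_sents, tgt_sents, align_lines):
    """Average number of distinct aligned target symbols per source token type.
    `align_lines` are fast_align outputs in Pharaoh format, e.g. "0-0 1-2 2-1"."""
    targets_per_source = defaultdict(set)
    for src, tgt, align in zip(src_sents, tgt_sents, align_lines):
        src_toks, tgt_toks = src.split(), tgt.split()
        for pair in align.split():
            i, j = map(int, pair.split("-"))
            targets_per_source[src_toks[i]].add(tgt_toks[j])
    # average fan-out: a crude stand-in for token-level complexity
    return sum(len(v) for v in targets_per_source.values()) / len(targets_per_source)
```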

L# | Introduce latent variables z | Glancing w/ Y | Glancing w/ z | BLEU
1 | | | | 12.60
2 | ✓ | | | 13.43 (+0.83)
3 | | ✓ | | 17.11 (+4.51)
4 | ✓ | ✓ | | 18.88 (+6.20)
5 | ✓ | | ✓ | 22.35 (+9.75)
6 | ✓ | ✓ | ✓ | 23.64 (+11.04)
Table 5: BLEU scores of different latent-GLAT configurations on the WMT14 EN→DE valid set.

Glancing with latent variables improves the performance by a large margin.

We can see in Table 5 that introducing latent variables yields performance gains over the corresponding counterparts (L#2 vs. L#1 and L#4 vs. L#3). As expected, the gains grow substantially when glancing training is applied with the discrete latent variables (L#5 vs. L#1), which already outperforms glancing training with the reference tokens (L#5 vs. L#4). Finally, we jointly perform glancing training with the reference tokens and the discrete latent variables, achieving the best result (L#6 vs. L#1).

K | 8 | 16 | 32 | 64 | 128 | 256
BLEU (%) | 20.80 | 22.16 | 22.61 | 23.64 | 23.26 | 21.94
ACC (%) | 61.20 | 53.10 | 43.57 | 39.24 | 36.39 | 33.84
Table 6: Performance of latent-GLAT with different numbers of latent variables K on the WMT14 EN→DE valid set. We compute the accuracy (ACC) of latent prediction by taking the discretized latent variables as references.
Figure 4: BLEU scores of latent-GLAT with different length penalty ratios on the WMT14 EN→DE valid set. We search the length penalty ratio for latent-GLAT while fixing the number of latent variables.

Effects of the number of latent variables and the length penalty ratio.

As shown in Figure 4 and Table 6, we search the hyper-parameters of latent-GLAT, i.e., the number of discrete latent variables K and the length penalty ratio, according to validation performance. We notice that using more latent codes causes performance degradation during inference, as the latent variables may degenerate toward tokens and contain more prediction errors. latent-GLAT implemented with 64 latent variables and the searched length penalty ratio obtains the best result on the WMT14 EN→DE valid set.

5 Related Work

Gu et al. (2018) first propose the non-autoregressive Transformer (NAT) model for neural machine translation (NMT) and begin to explore parallel decoding. It abandons explicit modeling of word inter-dependencies to decode the tokens in parallel, significantly improving the inference speed. However, its translation quality is inferior to that of the Transformer Vaswani et al. (2017).

To alleviate this performance degradation, many researchers work on enhancing word dependency modeling, including imitation learning Wei et al. (2019); Li et al. (2019), curriculum learning Guo et al. (2020a); Liu et al. (2020), iterative refinement Lee et al. (2018); Ghazvininejad et al. (2019); Gu et al. (2019); Guo et al. (2020b); Huang et al. (2022), and a simplified autoregressive process Sun et al. (2019). The most representative method is the glancing transformer Qian et al. (2021a), which adaptively and progressively samples partial tokens as inputs and predicts the remaining tokens, effectively establishing dependencies between the sampled and remaining tokens. However, these models still rely on a teacher for training and cannot directly learn from raw datasets that contain the one-to-many multi-modality phenomenon.

Introducing latent variables Bao et al. (2019, 2021) to organize the target sentence is also a helpful route. Among them, our method is close to Kaiser et al. (2018); Shu et al. (2019); Ma et al. (2019); Akoury et al. (2019); Bao et al. (2021). These methods decompose the latent variables (hints) from the target sentence and divide the origin goal into two parts: modeling latent variables and modeling the target sentences based on latent variables. It implicitly overcomes the multi-modality phenomenon of target sentences because the latent variables can largely determine the mode of the sentence. However, these methods always model the latent variables with an autoregressive predictor, which naturally sacrifices the decoding efficiency.

Unlike them, our approach models the discrete latent variables in a non-autoregressive fashion and extends glancing training with the discrete latent variables. As a result, latent-GLAT accomplishes a competitive performance both in decoding efficiency and quality.

6 Conclusion

We propose latent-GLAT, which can be trained directly without the help of knowledge distillation. Specifically, we employ discrete latent variables to capture word categorical information and divide the original goal into latent variable modeling and word prediction tasks. Then, we learn each task with glancing training and encourage the model to build dependencies on the latent variables, which have fewer modes than the words and are informative enough for modeling the target sentences. Experimental results on machine translation, paraphrase generation, and dialogue generation tasks validate the effectiveness of latent-GLAT.

Acknowledgements

We would like to thank the anonymous reviewers for their insightful comments. Shujian Huang is the corresponding author. This work is supported by National Science Foundation of China (No. U1836221, 6217020152).

References

  • N. Akoury, K. Krishna, and M. Iyyer (2019) Syntactically supervised transformers for faster neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 1269–1281. External Links: Document, Link Cited by: 1st item, §5.
  • D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: §1, §2.
  • S. Bao, H. He, F. Wang, H. Wu, and H. Wang (2020) PLATO: pre-trained dialogue generation model with discrete latent variable. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 85–96. External Links: Document, Link Cited by: 3rd item.
  • Y. Bao, S. Huang, T. Xiao, D. Wang, X. Dai, and J. Chen (2021) Non-autoregressive translation by learning target categorical codes. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, pp. 5749–5759. External Links: Link, Document Cited by: §1, §2, §2, §3.1, §3.1, §3, 1st item, §4.3, §5.
  • Y. Bao, H. Zhou, J. Feng, M. Wang, S. Huang, J. Chen, and L. Li (2019) Non-autoregressive transformer by position learning. arXiv preprint arXiv:1911.10677. External Links: Link Cited by: §5.
  • N. Chen, S. Watanabe, J. Villalba, and N. Dehak (2019) Listen and fill in the missing letters: non-autoregressive transformer for speech recognition. arXiv preprint arXiv:1911.04908. External Links: Link Cited by: §1.
  • J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin (2017) Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, pp. 1243–1252. External Links: Link Cited by: §1, §2.
  • M. Ghazvininejad, O. Levy, Y. Liu, and L. Zettlemoyer (2019) Mask-predict: parallel decoding of conditional masked language models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 6112–6121. External Links: Document, Link Cited by: §1, 2nd item, §5.
  • J. Gu, J. Bradbury, C. Xiong, V. O. K. Li, and R. Socher (2018) Non-autoregressive neural machine translation. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, External Links: Link Cited by: §1, §2, §2, §4.1, §4.1, §4.3, §5.
  • J. Gu, C. Wang, and J. Zhao (2019) Levenshtein transformer. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 11179–11189. External Links: Link Cited by: 2nd item, §5.
  • J. Guo, X. Tan, L. Xu, T. Qin, E. Chen, and T. Liu (2020a) Fine-tuning by curriculum learning for non-autoregressive neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 7839–7846. Cited by: §1, §5.
  • J. Guo, L. Xu, and E. Chen (2020b) Jointly masked sequence-to-sequence model for non-autoregressive neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 376–385. External Links: Document, Link Cited by: §1, Table 2, §5.
  • R. W. Hamming (1950) Error detecting and error correcting codes. The Bell system technical journal 29 (2), pp. 147–160. External Links: Link Cited by: 2nd item.
  • C. Huang, H. Zhou, O. R. Zaïane, L. Mou, and L. Li (2022) Non-autoregressive translation with layer-wise prediction and deep supervision. In AAAI, External Links: Link Cited by: §5.
  • L. Kaiser, S. Bengio, A. Roy, A. Vaswani, N. Parmar, J. Uszkoreit, and N. Shazeer (2018) Fast decoding in sequence models using discrete latent variables. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, J. G. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 2395–2404. External Links: Link Cited by: §1, §2, §2, §3.1, §3, §5.
  • Y. Kim and A. M. Rush (2016) Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 1317–1327. External Links: Document, Link Cited by: §1, §4.3.
  • D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: §4.1.
  • J. Lee, E. Mansimov, and K. Cho (2018) Deterministic non-autoregressive neural sequence modeling by iterative refinement. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 1173–1182. External Links: Document, Link Cited by: §1, §5.
  • J. Li, M. Galley, C. Brockett, J. Gao, and W. B. Dolan (2016) A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 110–119. Cited by: §4.2.
  • Y. Li, H. Su, X. Shen, W. Li, Z. Cao, and S. Niu (2017) DailyDialog: a manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Taipei, Taiwan, pp. 986–995. External Links: Link Cited by: 3rd item.
  • Z. Li, D. He, F. Tian, T. Qin, L. Wang, and T. Liu (2019) Hint-based training for non-autoregressive translation. In NeurIPS, External Links: Link Cited by: Appendix B, §1, §5.
  • J. Liu, Y. Ren, C. Z. Xu Tan, T. Qin, Z. Zhao, and T. Liu (2020) Task-level curriculum learning for non-autoregressive neural machine translation. AAAI. Cited by: §1, §5.
  • X. Ma, C. Zhou, X. Li, G. Neubig, and E. Hovy (2019) FlowSeq: non-autoregressive conditional sequence generation with generative flow. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 4282–4292. External Links: Document, Link Cited by: §1, §2, §2, §3.3, 1st item, §4.3, Table 2, §5.
  • M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli (2019) Fairseq: a fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), Minneapolis, Minnesota, pp. 48–53. External Links: Document, Link Cited by: §4.1.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, pp. 311–318. External Links: Document, Link Cited by: §1, §2, §4.1.
  • K. Peng, W. Ping, Z. Song, and K. Zhao (2020) Non-autoregressive neural text-to-speech. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, Proceedings of Machine Learning Research, Vol. 119, pp. 7586–7598. External Links: Link Cited by: §1.
  • L. Qian, H. Zhou, Y. Bao, M. Wang, L. Qiu, W. Zhang, Y. Yu, and L. Li (2021a) Glancing transformer for non-autoregressive neural machine translation. In ACL, External Links: Link Cited by: Appendix B, §1, §2, §2, §3.2, §3, §3, §4.1, §4.1, §4.3, Table 2, §5.
  • L. Qian, Y. Zhou, Z. Zheng, Y. Zhu, Z. Lin, J. Feng, S. Cheng, L. Li, M. Wang, and H. Zhou (2021b) The volctrans glat system: non-autoregressive translation meets wmt21. arXiv preprint arXiv:2109.11247. Cited by: §1.
  • A. Roy, A. Vaswani, N. Parmar, and A. Neelakantan (2018) Towards a better understanding of vector quantized autoencoders. arXiv. External Links: Link Cited by: §3.1, §3.1.
  • R. Sennrich, B. Haddow, and A. Birch (2016) Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1715–1725. External Links: Document, Link Cited by: 1st item.
  • R. Shu, J. Lee, H. Nakayama, and K. Cho (2019) Latent-variable non-autoregressive neural machine translation with deterministic inference using a delta posterior. arXiv preprint arXiv:1908.07181. External Links: Link Cited by: §1, §2, §2, 1st item, §5.
  • Z. Sun, Z. Li, H. Wang, D. He, Z. Lin, and Z. Deng (2019) Fast structured decoding for sequence models. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 3011–3020. External Links: Link Cited by: §5.
  • Z. Sun and Y. Yang (2020) An em approach to non-autoregressive conditional sequence generation. In International Conference on Machine Learning, pp. 9249–9258. Cited by: §1, §2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008. External Links: Link Cited by: §1, §1, §2, §3.1, §4.1, §4.1, §5.
  • B. Wei, M. Wang, H. Zhou, J. Lin, and X. Sun (2019) Imitation learning for non-autoregressive neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 1304–1312. External Links: Document, Link Cited by: Appendix B, §1, §3.1, §5.
  • C. Zhou, J. Gu, and G. Neubig (2020) Understanding knowledge distillation in non-autoregressive machine translation. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, External Links: Link Cited by: §1, §2, §4.3.

Appendix A Details of GLAT

According to the performance shown in Figure 5(a), a GLAT model degenerates to a NAT model when using a small sampling ratio. In this case, introducing an autoregressive Transformer as a teacher for training the GLAT model alleviates the issue (Figure 5(b)), indicating that the GLAT model still needs the help of knowledge distillation to alleviate the multi-modality problem.

(a) GLAT w/ raw data.
(b) GLAT w/ distillation.
Figure 5: BLEU scores over training steps of GLAT trained with different glancing strategies (start → end ratio).

Appendix B Model Details of latent-Glat

Decoder Inputs.

Following common practice in NAT models Wei et al. (2019); Li et al. (2019), we use the Softcopy mechanism to initialize the decoder inputs $d_1, \dots, d_T$:

$$d_t = \sum_{s=1}^{S} w_{s,t}\, h_s \qquad (14)$$

where $h_s$ is the encoded representation of the source token $x_s$, $w_{s,t}$ is a weight that depends on the distance between source position $s$ and target position $t$, and $S$ and $T$ are the lengths of the source and target sentences, respectively.
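A minimal sketch of such a distance-based soft copy is given below (assuming PyTorch; the temperature `tau` is an assumption of this sketch, and the exact weighting of Wei et al. (2019) may differ).

```python
import torch

def softcopy(enc_out, tgt_len, tau=1.0):
    """Initialise decoder inputs as a distance-weighted mix of encoder states.
    enc_out: (src_len, d_model); returns (tgt_len, d_model)."""
    src_len = enc_out.size(0)
    src_pos = torch.arange(src_len).unsqueeze(0)     # (1, src_len)
    tgt_pos = torch.arange(tgt_len).unsqueeze(1)     # (tgt_len, 1)
    # closer source positions receive larger weights
    weights = torch.softmax(-(tgt_pos - src_pos).abs().float() / tau, dim=-1)
    return weights @ enc_out                         # (tgt_len, d_model)
```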

Training the Latent Predictor by glancing at sampled discrete latent variables.

Given the decoder inputs $d$ and the discretized latent variable sequence $z$, we adopt the glancing sampling technique to train the latent predictor in the following steps (a condensed sketch follows this list):

  • Predicting $\hat{z}$: latent-GLAT first predicts the latent variable sequence with its latent predictor.

  • Determining the sample number $N_z$: Given $z$ and $\hat{z}$, we compute the sampling number as:

    $$N_z = \lambda \cdot d(z, \hat{z}) \qquad (15)$$

    where $\lambda$ is the sampling ratio, which decreases over the training steps, and $d(\cdot, \cdot)$ is the Hamming distance Hamming (1950) used to measure the prediction quality.

  • Sampling observed latent variables $\tilde{z}$: Given the discretized latent variable sequence $z$ and the sample number $N_z$, we obtain $\tilde{z}$ by randomly selecting $N_z$ elements from $z$.

  • Re-constructing the inputs: We construct new decoder inputs by position-wise replacing the decoder input $d_t$ with the representation of $\tilde{z}_t$ at the sampled positions.

  • Updating the Latent Predictor: With the re-constructed inputs, we train the latent predictor to predict the unobserved references $\bar{\tilde{z}}$.
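The five steps above can be condensed into the following sketch (assuming PyTorch; `latent_predictor` and `latent_embed` are placeholder interfaces for the modules in Figure 1, not the released implementation). Training of the Mix. Decoder, described next, follows the same scheme with sampled target tokens and latent variables as additional inputs.

```python
import torch

def glancing_step_latent(latent_predictor, latent_embed, dec_inputs, z_ref, ratio):
    """One glancing-training step for the latent predictor (Appendix B).
    dec_inputs: (T, d_model); z_ref: (T,) discretised reference latent variables."""
    with torch.no_grad():
        z_hat = latent_predictor(dec_inputs).argmax(-1)      # 1) first-pass prediction of z
    n = int(ratio * (z_hat != z_ref).sum().item())           # 2) Eqn. (15): Hamming distance
    observed = torch.randperm(z_ref.size(0))[:n]             # 3) sample observed positions
    mixed = dec_inputs.clone()
    mixed[observed] = latent_embed(z_ref[observed])          # 4) replace inputs at observed positions
    logits = latent_predictor(mixed)                         # 5) predict the remaining positions
    mask = torch.ones_like(z_ref, dtype=torch.bool)
    mask[observed] = False                                   # loss only on unobserved positions
    return torch.nn.functional.cross_entropy(logits[mask], z_ref[mask])
```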

Training the Mix. Decoder with sampled discrete latent variables.

Training of the Mix. Decoder largely follows glancing training Qian et al. (2021a), except that extra latent variables are used as inputs. With the decoder inputs $d$, the reference sentence $Y$, and the sampled latent variables $\tilde{z}$, we train the Mix. Decoder in the following steps:

  • Predicting $\hat{Y}$: latent-GLAT predicts the target sentence with the Mix. Decoder.

  • Determining the sample number $N_y$: Given $Y$ and $\hat{Y}$, we compute the sampling number $N_y$ analogously to Eqn. (15).

  • Sampling target tokens $\tilde{Y}$: We obtain the glancing reference $\tilde{Y}$ by randomly selecting $N_y$ tokens from the reference sequence $Y$.

  • Re-constructing the inputs: The new inputs are constructed by position-wise replacing the decoder input $d_t$ with the embedding of $\tilde{y}_t$ at the sampled positions.

  • Updating the Mix. Decoder: We then train the Mix. Decoder to predict the unobserved references $\bar{\tilde{Y}}$, with the re-constructed inputs and the sampled latent variables $\tilde{z}$ as inputs.