Neural sequence-to-sequence (seq2seq) learning has been extensively used in various applications of natural language processing, since this network design matches many downstream tasks well (e.g., machine translation), namely mapping sequences from the source side to the target side (Sutskever et al., 2014; Gehring et al., 2017; Vaswani et al., 2017). Given a source domain and a target domain, seq2seq problems naturally give rise to a symmetric pair of directional tasks, i.e., a source-to-target task and a target-to-source task, and hence two directional learning signals. Given parallel data from the two domains, to learn the source-to-target mapping, for instance, a standard seq2seq neural network usually employs the encoder-decoder framework, which includes an encoder to acquire the representation from the source side, and a decoder to yield the target-side outputs from the encoded source representation. The target-to-source mapping can be modeled and learned in the same way.
We argue that the encoder-decoder framework cannot fully exploit the potential of the bidirectional learning signals offered by seq2seq problems. Let us take machine translation as an example. (a) Encoder-decoder based seq2seq models typically learn only one directional signal to perform the corresponding unidirectional translation (Figure 1(a)). (b) Although multi-task learning could help seq2seq models leverage both signals and perform bidirectional translation (Johnson et al., 2017a) by sharing one unidirectional network (Figure 1(b)), such models may suffer from parameter interference due to limited network capacity (Arivazhagan et al., 2019; Zhang et al., 2021).
From the viewpoint of telecommunication (in telecommunications and computer networking, simplex communication means the communication channel is unidirectional, while duplex communication is bidirectional), translation between languages resembles duplex communication, whereas seq2seq models within the encoder-decoder framework are simplex. We therefore speculate that this discrepancy causes the aforementioned limitations of the encoder-decoder framework, making it not necessarily the best paradigm for modeling seq2seq problems.
Therefore, intuitively, duplex seq2seq neural networks, which would leverage the duplex nature of seq2seq problems, could become better modeling alternatives. Conceptually, a duplex seq2seq neural network has two ends, each of which specializes in a language and can both take inputs and yield outputs in that language (Figure 1(c)). Given a duplex neural network with a source language and a target language, it is expected to have an inverse and to satisfy reversibility: running the network forward and then in reverse should recover the original input, and vice versa.
The resulting duplex seq2seq model can take a sentence in the source language from the source end, and output a sentence in the target language at the target end (forward translation). The same model is able to generate the reverse translation by taking a sentence in the target language from the target end and outputting at the source end (reverse translation). In this way, the bidirectional signals can be learned jointly by one duplex model, and bidirectional translation can be achieved as a reversible and unified process. Thus the two directions do not need to compete for the limited network capacity, but can learn together and boost each other.
However, building a duplex seq2seq neural network is under-studied and non-trivial. The intuition behind designing such a duplex network lies in making the network reversible, and in making the computational process homogeneous for the forward and reverse directions. This is very challenging to achieve within the existing encoder-decoder paradigm for the following reasons: (a) an encoder-decoder network is irreversible. The decoder's output end cannot take in input signals to exhibit the encoding functionality, and vice versa; (b) the encoder and decoder are heterogeneous. The decoder contains extra cross attention modules while the encoder does not; moreover, the typical decoder works autoregressively, while the encoder is non-autoregressive.
In this paper, we take the first step of building a duplex seq2seq neural network. We propose REDER (the model name is a palindrome, which implies that the model works from both ends), the Reversible Duplex Transformer, and apply it to reversible machine translation. To address the above problems, (a) for reversibility, we design REDER as a fully reversible Transformer inspired by Gomez et al. (2017); (b) for homogeneity, REDER has no division into encoder and decoder, and reads and yields sentences in a fully non-autoregressive fashion. Also, thanks to the reversibility inside the network, REDER can make use of cycle consistency to explicitly enhance intermediate layer representations.
Despite the challenges of non-autoregressive modeling and a non-encoder-decoder design, experiments show that enabling reversible machine translation, by jointly learning the two translation signals on the same parallel corpus, offers REDER significant accuracy gains (about 1.5 BLEU). REDER gives the top results among state-of-the-art non-autoregressive baselines, and outperforms multi-task autoregressive methods on bidirectional translation, which verifies our motivation. Meanwhile, REDER closely approaches typical autoregressive models in accuracy and is faster than them. To the best of our knowledge, REDER is the first duplex seq2seq network and enables the first success of reversible machine translation, which is a new paradigm for the machine translation community.
2 Related Work
Sequence-to-Sequence Models Exploiting Bidirectional Signals. Sequence-to-sequence problems naturally induce a symmetric pair of tasks in opposite directions, a source-to-target mapping and a target-to-source mapping. Several studies try to capture such bidirectionality as a constraint to improve sequence-to-sequence tasks such as machine translation (Cheng et al., 2016a, b). Additionally, dual learning (He et al., 2016; Xia et al., 2017) leverages reinforcement learning to achieve interaction between two separate directional translation models. Meanwhile, Xia et al. (2018) propose a partially model-level dual learning that shares components with similar underlying functionality between the models for the forward task and the reverse task. Zheng et al. (2020) propose to model the two directional translation models together with language models in a variational probabilistic framework. These approaches handle the two directional tasks by setting up two separate simplex models that account for the task bidirectionality. In contrast, REDER unifies the direction pair within one duplex model and directly models the bidirectionality entirely at the model level. In addition, another line of work unifies two directional tasks in a multilingual fashion (Johnson et al., 2017a; Chan et al., 2019) by sharing the same computational process of one simplex model, which inevitably needs to split the limited model capacity to learn to encode and decode two languages. REDER, by comparison, formulates the two directional tasks in one model by simply exchanging the input and output ends, each of which specializes in a language; thus bidirectional translation becomes a reversible process in which the two directions do not need to compete for limited model capacity.
Non-autoregressive Sequence Generation. Non-autoregressive translation (NAT) models (Gu et al., 2018) have attracted significant research interest because of the inference inefficiency of traditional autoregressive seq2seq models. Most research focuses on fully NAT models, which generate the whole sequence in parallel in a single shot but sacrifice performance (Gu et al., 2018; Ma et al., 2019; Shu et al., 2020; Bao et al., 2019; Wei et al., 2019; Li et al., 2019; Wang et al., 2019; Qian et al., 2020; Gu & Kong, 2020). Besides, semi-autoregressive models, which iteratively refine translations based on previous predictions, greatly improve the performance of NAT models (Lee et al., 2018; Ghazvininejad et al., 2019; Gu et al., 2019; Kasai et al., 2020; Ghazvininejad et al., 2020; Shu et al., 2020). In this work, REDER takes advantage of the probabilistic modeling of fully NAT models to resolve the design challenge of computational homogeneity for both translation directions.
Reversible Neural Architectures. Various reversible neural networks have been proposed for different purposes and based on different architectures. On one hand, reversible neural networks help model flexible probability distributions with tractable likelihoods (Dinh et al., 2014, 2017; Papamakarios et al., 2017; Kingma et al., 2016) by defining a mapping between a simple, known density and a complicated desired density. On the other hand, reversibility can also help develop memory-efficient algorithms. The most popular approach is the reversible residual network (revnet, Gomez et al., 2017), which modifies the residual network for image classification and allows the activations at any given layer to be recovered from the activations at the following layer. Layers can therefore be reversed one by one as back-propagation proceeds from the output of the network to its input. Follow-up work extends the idea of revnet to RNNs (MacKay et al., 2018) and the Transformer (Kitaev et al., 2020) for natural language processing. In this paper, we borrow the idea of revnet as the basic unit of our proposed REDER, but for a different purpose. The aim of revnet and its variants is to reduce memory consumption, while our purpose is to build a duplex seq2seq model that can govern two directional tasks reversibly. In this line, van der Ouderaa & Worrall (2019)
3 Sequence-to-Sequence Models as Communication Channels
The standard sequence-to-sequence models. Sequence-to-sequence (seq2seq) tasks (Sutskever et al., 2014) such as machine translation (Bahdanau et al., 2015) typically adopt neural encoder-decoder models, which aim to approximate the mapping function from the source domain to the target domain. An encoder-decoder neural network is analogous to a simplex communication system in the source-to-target direction (Figure 1(a)): the encoder reads the source sequence from the source side and the decoder generates the target sequence at the target side. Reverse travel from the target side to the source side is not allowed in such simplex models. In this paper, we focus on the most widely used simplex model, the Transformer (Vaswani et al., 2017). A Transformer-based model reads a source sequence and transforms it into encoded representations with its encoder. The encoder is composed of stacked Transformer layers, each of which contains a self-attention network (San) and a feed-forward network (Ffn), both wrapped in residual blocks.
The input to the first layer is the sequence of word embeddings of the source sequence; for ease of understanding, we package layer normalization (Ba et al., 2016) inside the residual blocks and omit its formulation details. Given the final encoded representation, each decoder layer computes its representations with self-attention, cross attention, and a feed-forward network.
Here the cross attention network (Can) fetches time-dependent context from the encoder, and the input to the first decoder layer is the sequence of target word embeddings. Finally, the model generates the corresponding target sequence autoregressively (see Vaswani et al. (2017) for more details). Given the same parallel data, we can also learn a reverse mapping in the target-to-source direction using another simplex model.
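For reference, the layer computations described above can be written out as follows. This is a sketch in assumed notation: the symbols $h^{(\ell)}$, $s^{(\ell)}$, $e(\cdot)$, the number of encoder layers $L$, and the intermediate tilde and hat variables are chosen here for illustration and are not taken from the original formulas. Layer normalization is folded into each residual block, as stated in the text.

```latex
\begin{aligned}
\text{Encoder:}\quad
  & h^{(0)} = e(x), \\
  & \tilde{h}^{(\ell)} = h^{(\ell-1)} + \mathrm{San}\big(h^{(\ell-1)}\big), \qquad
    h^{(\ell)} = \tilde{h}^{(\ell)} + \mathrm{Ffn}\big(\tilde{h}^{(\ell)}\big), \\[4pt]
\text{Decoder:}\quad
  & s^{(0)} = e(y), \\
  & \tilde{s}^{(\ell)} = s^{(\ell-1)} + \mathrm{San}\big(s^{(\ell-1)}\big), \qquad
    \hat{s}^{(\ell)} = \tilde{s}^{(\ell)} + \mathrm{Can}\big(\tilde{s}^{(\ell)}, h^{(L)}\big), \qquad
    s^{(\ell)} = \hat{s}^{(\ell)} + \mathrm{Ffn}\big(\hat{s}^{(\ell)}\big).
\end{aligned}
```

The extra Can term in the decoder is the structural difference between the two stacks that the later discussion of homogeneity refers to.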
To leverage bidirectional learning signals, another possible choice is to apply multi-task learning to a simplex model so that it jointly models both directions, where both directions share the same model parameters and run from the same input end to the same output end. However, such models may suffer from parameter interference due to limited network capacity (Arivazhagan et al., 2019; Zhang et al., 2021): the encoder and decoder are required to simultaneously understand and generate different languages, and the two directional tasks compete for the limited shared network capacity.
Duplex sequence-to-sequence models. As stated above, the standard sequence-to-sequence generation model resembles a simplex communication channel. Think of a scenario where one person in New York is making a phone call to another in Berlin. If the phone is simplex, it only takes voice input in New York and only outputs voice in Berlin, which certainly reduces the benefit of two-way communication. The everyday phone, by contrast, can take voice input and produce voice output at both ends, and transfers signals in both directions. By analogy, sequence generation tasks such as machine translation would benefit from a duplex sequence-to-sequence model.
Informally, a sequence-to-sequence model is duplex if it has two ends, both with sequence input and output capability, and shares the same architecture to map from one sequence space to the other and vice versa.
Definition 1. A sequence-to-sequence model with a single set of parameters is duplex if it satisfies the following: (a) its network has two ends, a source end and a target end; (b) both the source and target ends can take input sequences and output sequences; (c) the network defines a forward mapping function and a reverse mapping function between the sets of all possible sequences over the source and target vocabularies; (d) essentially, it simultaneously induces both a forward sequence generation function and its mathematical inverse, obtained by reversely executing the network. In addition, these two functions should be mutual inverses: applying one and then the other recovers the original sequence.
Notice that the forward function shares the same model parameters with the reverse function. A sequence generation model following this definition behaves similarly to a two-way communication channel.
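The reversibility requirement can be stated compactly. The notation below is assumed for illustration (a forward function $f_\theta^{\rightarrow}$ from source sequences $\mathcal{X}$ to target sequences $\mathcal{Y}$, and a reverse function $f_\theta^{\leftarrow}$ in the opposite direction, both with the same parameters $\theta$); it is a sketch of the condition rather than the original formula.

```latex
\forall x \in \mathcal{X}:\quad f_\theta^{\leftarrow}\big(f_\theta^{\rightarrow}(x)\big) = x,
\qquad
\forall y \in \mathcal{Y}:\quad f_\theta^{\rightarrow}\big(f_\theta^{\leftarrow}(y)\big) = y .
```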
We can apply this definition on machine translation to get a reversible translation model. Such a reversible model will be able to take a sentence in the source language from the source end and to output a sentence in the target language to the target end. With the same model it will be able to generate the reverse translation by taking a sentence in the target language from the target end to the source end.
A reversible machine translation model using duplex sequence-to-sequence is distinct from a multilingual MT model (e.g. multilingual Transformer). The multilingual Transformer takes input sentences in two (or more) languages only from the same end and outputs to the other end. Its output end cannot be used to receive input signals.
4 REDER for Reversible Machine Translation
As in the well-known saying by Richard Feynman, “What I cannot create, I do not understand”, reversible natural language processing (Franck, 1992; Strzalkowski, 1993) and its applications in machine translation (van Noord, 1990) were proposed for the purpose of building machine models that understand and generate natural languages as a reversible, unified process. Such a process resembles the mechanism that allows human beings to communicate with each other via natural languages (Franck, 1992). However, these attempts were not very successful, mainly because the computational models of the time were less powerful. With the great success of deep neural networks in machine translation, the idea of reversible machine translation is more likely to be realized via neural machine translation, and to bring further benefits to translation performance.
In this section, we introduce how to design a duplex neural seq2seq model, namely Reversible Duplex Transformer (REDER), that satisfies Definition 1, to realize reversible machine translation.
4.1 Challenges of Reversible Machine Translation
Designing neural architectures for reversible machine translation remains under-studied and poses the following challenges:
Reversibility. Typical encoder-decoder networks and their neural components, such as Transformer layers, are irreversible, i.e. one cannot just obtain its inverse function by flipping the same encoder-decoder network. To meet our expectation, an inverse function of the network should be derived from the network itself.
Homogeneity. Intuitively, a pair of forward and reverse translation directions should resemble a homogeneous process of understanding and generation. However, typical encoder-decoder networks do not meet such computational homogeneity, because of the extra cross attention layers in the decoder and because the decoder works autoregressively while the encoder works non-autoregressively. To meet our expectation, the division into encoder and decoder should no longer exist in the desired network.
4.2 The Architecture of REDER
To solve the above challenges, we include two corresponding solutions in REDER to address the reversibility and homogeneity issues respectively, i.e., the reversible duplex Transformer layers, and the symmetric network architecture without encoder-decoder framework.
Figure 2 shows the overall architecture of REDER. As illustrated, REDER has two ends: the source end (left) and the target end (right). The model parameters are shared by both directions. The architecture of REDER is composed of a series of identical Reversible Duplex Transformer layers, each of which contains a self-attention module and a feed-forward module. More concretely, when performing the source-to-target mapping, a source sentence (blue circles) 1) is first transformed into its embeddings and enters the source end; 2) then passes through the entire stack of layers and evolves into final representations, which are then normalized to probabilities; 3) finally, its target translation (orange circles) is generated from the target end. The generation process is fully non-autoregressive.
Likewise, the target-to-source mapping is achieved by reversely executing the architecture of REDER from target end to source end. We will dive into the details of the key components of REDER in the following parts.
Reversibility: Reversible Duplex Transformer layers. We adopt the idea of the reversible residual network (revnet, Gomez et al., 2017; Kitaev et al., 2020) in the design of the reversible duplex Transformer layer. Each layer is composed of a multi-head self-attention block and a feed-forward block with a special reversible design to ensure duplex behavior. The regular form of a layer adds the residuals computed by these two blocks,
where the input to the first layer is the concatenation of the embeddings of the input sequence. Accordingly, the reverse form of the layer can be computed by subtracting (rather than adding) the residuals.
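The regular and reverse forms can be illustrated with a small pure-Python sketch. It assumes the revnet-style two-half coupling of Gomez et al. (2017), which the text says the layer adopts; the exact coupling used in REDER is not stated in the surviving text. The `san` and `ffn` functions are toy stand-ins, not real attention or feed-forward blocks, and all names are hypothetical.

```python
import math

def san(h):
    # Toy stand-in for self-attention: each position also sees the mean of all positions.
    mean = sum(h) / len(h)
    return [math.tanh(0.5 * v + mean) for v in h]

def ffn(h):
    # Toy stand-in for the feed-forward block: a position-wise ReLU map.
    return [max(0.0, 0.7 * v - 0.1) for v in h]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def regular_form(h1, h2):
    # Assumed coupling: add the San residual to the first half,
    # then add the Ffn residual (computed from the updated first half) to the second.
    y1 = add(h1, san(h2))
    y2 = add(h2, ffn(y1))
    return y1, y2

def reverse_form(y1, y2):
    # Inverse of regular_form: subtract the same residuals in the opposite order.
    h2 = sub(y2, ffn(y1))
    h1 = sub(y1, san(h2))
    return h1, h2

# The layer input is the embedding duplicated into two halves (the "concatenation").
emb = [0.3, -1.2, 0.8, 0.05]
h1, h2 = list(emb), list(emb)
y1, y2 = regular_form(h1, h2)
r1, r2 = reverse_form(y1, y2)
max_err = max(abs(a - b) for a, b in zip(r1 + r2, h1 + h2))
print(max_err)  # tiny floating-point error only
```

Note that invertibility does not depend on `san` or `ffn` being invertible themselves: the inverse only re-evaluates them on quantities that are already known.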
Homogeneity: Symmetric network architecture without encoder-decoder framework. To meet our need to ensure homogeneous network computations for forward and reverse directional tasks, we therefore choose to discard the encoder-decoder paradigm.
Symmetric network. To achieve homogeneous computations, one solution is to make the network symmetric. Specifically, we let the layers in the first half of the stack take the reverse form, whereas the layers in the latter half take the regular form, with each layer connected to the next.
Thereby the forward and reverse computational operations of the whole model become homogeneous: the forward series of San and Ffn operations reads as a palindrome string, and so does the reverse series.
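The palindrome property can be checked mechanically. The sketch below assumes that a regular-form layer evaluates San and then Ffn, that a reverse-form layer evaluates Ffn and then San, and that running the network from the target end visits the layers last to first and swaps each layer's form. The function names are hypothetical; the 12-layer depth follows the implementation details reported later.

```python
def forward_ops(num_layers):
    # Source-to-target pass: first half in reverse form, second half in regular form.
    half = num_layers // 2
    ops = []
    for layer in range(num_layers):
        ops += ["Ffn", "San"] if layer < half else ["San", "Ffn"]
    return ops

def reverse_ops(num_layers):
    # Target-to-source pass: layers are visited last to first and each is inverted,
    # so a regular-form layer runs as reverse form and vice versa.
    half = num_layers // 2
    ops = []
    for layer in reversed(range(num_layers)):
        ops += ["San", "Ffn"] if layer < half else ["Ffn", "San"]
    return ops

fwd = forward_ops(12)
rev = reverse_ops(12)
print(fwd == fwd[::-1], rev == rev[::-1], fwd == rev)
```

Under these assumptions both directions execute the same sequence of operation types, which is the sense in which the two computations are homogeneous.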
Fully non-autoregressive modeling. Note that without the encoder-decoder division, the resulting model works in a fully non-autoregressive fashion for both the forward and reverse directions. Thus, the conditional probability of a target sequence given the source factorizes over target positions, due to the introduced conditional independence assumption among target tokens. Once a forward computation is done, the concatenation of the model outputs serves as the representation of the target translation. A softmax operation is then performed over the similarity between the model output and the concatenated embedding of the ground-truth reference to obtain the prediction probability.
We can likewise derive the procedure for the target-to-source direction.
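As a rough illustration of this output step, the sketch below scores each output position against a table of token embeddings with a dot product, applies a softmax, and sums per-position log-probabilities under the conditional independence assumption. The dot-product similarity, the toy vectors, and all function names are assumptions made for the example; in REDER the outputs and embeddings are concatenations of two halves, which is not modeled here.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def token_probabilities(outputs, embeddings):
    # outputs: one vector per position; embeddings: one vector per vocabulary entry.
    probs = []
    for h in outputs:
        scores = [sum(a * b for a, b in zip(h, e)) for e in embeddings]
        probs.append(softmax(scores))
    return probs

def sequence_log_prob(probs, token_ids):
    # Conditional independence: the sequence log-probability is a sum over positions.
    return sum(math.log(p[t]) for p, t in zip(probs, token_ids))

vocab_emb = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # toy 3-token vocabulary
outputs = [[2.0, 0.1], [0.0, 3.0]]                  # toy final-layer outputs
probs = token_probabilities(outputs, vocab_emb)
best = [max(range(len(p)), key=p.__getitem__) for p in probs]
print(best, sequence_log_prob(probs, best))
```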
Modeling variable-length input and output. Encoder-decoder models can easily handle the variable-length input and output of most seq2seq problems. However, discarding the encoder-decoder separation imposes a new challenge: the width of every layer of the network depends on the length of the input, so it is very difficult to allow input and output of different lengths, especially when the input is shorter than the output. To solve this problem we resort to Connectionist Temporal Classification (CTC) (Graves et al., 2006), a latent alignment approach with superior performance and the flexibility of variable-length prediction. Given the conditional independence assumption, CTC can efficiently consider all valid alignments, which derive from the target by allowing consecutive repetitions and inserting blank tokens, and marginalizes over them to obtain the log-likelihood.
Here the collapse function recovers the target sequence by collapsing consecutive repeated tokens and then removing all blank tokens. Note that CTC requires that the source input be no shorter than the target output, which does not always hold in machine translation. To deal with this, we follow previous practice and upsample the source tokens by a factor of 2 (Saharia et al., 2020; Libovický & Helcl, 2018), and filter out the examples whose targets are still longer than the upsampled source sentences.
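The collapse function, the source upsampling, and the length filter described above are simple enough to sketch directly. The blank symbol and the function names below are hypothetical choices for the example.

```python
BLANK = "<blank>"  # hypothetical blank symbol

def upsample(tokens, factor=2):
    # Repeat every source token `factor` times so the network is wide enough.
    return [tok for tok in tokens for _ in range(factor)]

def collapse(alignment, blank=BLANK):
    # First merge consecutive repeats, then drop blanks.
    out = []
    prev = None
    for tok in alignment:
        if tok != prev and tok != blank:
            out.append(tok)
        prev = tok
    return out

def keep_example(source, target, factor=2):
    # Keep a training pair only if the target fits into the upsampled source length.
    return len(target) <= factor * len(source)

print(upsample(["x", "y"]))
print(collapse(["a", "a", BLANK, "a", "b", "b"]))
print(keep_example(["x"], ["a", "b", "c"]))
```

A blank between two identical tokens is what lets an alignment encode a genuinely repeated target token, as in the second call above.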
Reversibility in REDER is assured at the level of continuous representations: REDER can recover the input embeddings (first layer) from the output representations (last layer), which is also the motivation and basis of an auxiliary learning signal introduced in the next section. Reversibility might not hold at the level of discrete tokens, because the irreversible argmax operation discretizes probabilities into tokens. Nevertheless, REDER still shows decent reconstruction capability in practice, as depicted in Figure 3.
Given a parallel corpus and a single model, REDER can be jointly supervised by source-to-target and target-to-source translation. Thus both translation directions can be handled by one REDER model. We refer to this as bidirectional training, as opposed to unidirectional training, where each translation direction needs a separate model. Moreover, the reversibility of REDER makes it possible to exploit consistency or agreement between the forward and reverse directions. We introduce two auxiliary learning signals as follows.
Cycle Consistency. The symmetry of a pair of sequence-to-sequence tasks enables the use of cycle consistency (He et al., 2016; Cheng et al., 2016a). Given a source sentence, we first obtain the forward prediction of REDER, and then apply the reverse model to this prediction to reconstruct the sentence in the source language.
Finally, we maximize the consistency, or agreement, between the original sentence and the reconstructed one, which defines the cycle-consistency loss.
We expect this to provide an auxiliary signal that a valid prediction should allow its source input to be faithfully reconstructed. Here we use the cross-entropy between the probabilistic prediction of the reverse model and the original source sentence as the distance that measures consistency.
Layer-wise Forward-Backward Agreement. Since REDER is fully reversible and consists of a series of computationally invertible intermediate layers, an interesting question arises: given the desired output (i.e., the target sentence), is it possible to derive the desired intermediate hidden representations by the backward target-to-source computation, and to encourage the forward source-to-target intermediate hidden representations to be as close as possible to these “optimal” representations?
Given a source sentence, we collect the inner representations of each layer in the forward direction.
Given its corresponding target sequence as the optimal desired output (for a CTC-based model, where the model predictions are alignments, we instead extract the token sequence of the best alignment predicted by the model that is associated with the ground truth), we likewise collect the inner representations of each layer in the reverse direction.
As we consider these reverse inner-layer representations “optimal”, we minimize the cosine distance between the corresponding forward and reverse inner-layer representations,
with a stop-gradient operation applied in the loss.
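A minimal sketch of the agreement term follows, assuming one flattened vector per layer and an average over layers (the normalization is an assumption). Plain Python has no gradients, so the stop-gradient is only noted in a comment: in a real training framework the reverse states would presumably be detached so that they act as fixed targets. All names are hypothetical.

```python
import math

def cosine_distance(u, v):
    # 1 - cosine similarity between two non-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def agreement_loss(forward_states, reverse_states):
    # forward_states[l] and reverse_states[l] are the layer-l representations
    # from the source-to-target and target-to-source passes.
    # Assumption: reverse_states are constants here (stop-gradient targets).
    dists = [cosine_distance(f, r) for f, r in zip(forward_states, reverse_states)]
    return sum(dists) / len(dists)

same = agreement_loss([[1.0, 0.0], [0.0, 2.0]], [[1.0, 0.0], [0.0, 2.0]])
opposite = agreement_loss([[1.0, 0.0]], [[-1.0, 0.0]])
print(same, opposite)
```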
A potential danger of both of the above auxiliary objectives is model collapse, where the model could cheat the task by simply learning an identity mapping. We solve this problem with a two-stage training scheme: we first train REDER without any auxiliary losses for a predefined number of updates, and then activate the additional losses and continue training the model until convergence.
Final Objective. Given a parallel dataset of i.i.d. observations, the final objective of REDER is to minimize the translation losses of both directions together with the two auxiliary losses,
where each auxiliary loss is weighted by its own coefficient.
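Combining the pieces above, the overall loss and the two-stage schedule can be sketched as follows. The additive form, the argument names, and the `warmup_updates` threshold are assumptions for illustration; the default coefficient of 0.1 follows the value reported in the implementation details.

```python
def auxiliary_weight(step, warmup_updates, coefficient):
    # Stage 1 (before warmup_updates): auxiliary losses are switched off.
    # Stage 2 (afterwards): they are weighted by their coefficient.
    return coefficient if step >= warmup_updates else 0.0

def total_loss(step, ctc_forward, ctc_reverse, agreement, cycle,
               warmup_updates, lam_agreement=0.1, lam_cycle=0.1):
    # Assumed form: both directional CTC losses plus the weighted auxiliary terms.
    w_agreement = auxiliary_weight(step, warmup_updates, lam_agreement)
    w_cycle = auxiliary_weight(step, warmup_updates, lam_cycle)
    return ctc_forward + ctc_reverse + w_agreement * agreement + w_cycle * cycle

early = total_loss(0, 1.0, 2.0, 0.5, 0.5, warmup_updates=100)
late = total_loss(100, 1.0, 2.0, 0.5, 0.5, warmup_updates=100)
print(early, late)
```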
Table 1:
|Group||Model||Params||Speedup||WMT14 En-De||WMT14 De-En||WMT16 En-Ro||WMT16 Ro-En|
|Simplex AT||Transformer base (KD teacher)||62M×2||1.0||27.60||31.50||33.85||33.70|
|Simplex AT||Reformer (Kitaev et al., 2020)||62M×2||1.0||27.60||-||-||-|
|Simplex models leveraging bidirectional signals||Model-level DL big (Xia et al., 2018)||-||-||28.90||31.90||-||-|
|Simplex models leveraging bidirectional signals||KERMIT (Chan et al., 2019)||-||-||25.60||27.40||-||-|
|Simplex models leveraging bidirectional signals||KERMIT + mono (Chan et al., 2019)||-||-||28.10||28.60||-||-|
|Simplex models leveraging bidirectional signals||MGNMT (Zheng et al., 2020)||-||-||27.70||31.40||32.70||33.90|
|Simplex NAT||vanilla enc-dec NAT (Gu et al., 2018)||-||15.6||17.69||21.47||27.29||29.06|
|Simplex NAT||CTC (Libovický & Helcl, 2018)||-||-||16.56||18.64||19.54||24.67|
|Simplex NAT||CTC (Saharia et al., 2020)||-||18.6||25.70||28.10||32.20||31.60|
|Simplex NAT||CTC-based Imputer (Saharia et al., 2020)|
|Simplex NAT||GLAT+NPD (Qian et al., 2020)||-||15.3||26.55||31.02||32.87||33.51|
|Simplex NAT||GLAT+CTC (Gu & Kong, 2020)||-||16.8||27.20||31.39||33.71||34.16|
|Our work||vanilla enc-dec NAT||62M×2||16.3||19.50||24.95||29.49||29.86|
|Our work||+ CTC||62M×2||15.6||26.11||30.24||33.25||33.68|
|Our work||REDER (unidirectional training)||58M||15.5||25.55||29.54||-||-|
|Our work||REDER (bidirectional training)||58M||15.5||26.70||30.68||33.10||33.23|
|Our work||+ beam20 + AT reranking||58M||5.5||27.36||31.10||33.60||34.03|
We conduct extensive experiments on standard machine translation benchmarks to inspect REDER's performance on sequence-to-sequence tasks. We demonstrate that REDER achieves competitive, if not better, results compared to strong autoregressive and non-autoregressive baselines. REDER is also the first approach that enables reversible machine translation in one unified model, where bidirectional training with paired translation directions surprisingly boosts each of them by a substantial margin.
5.1 Experimental Setup
Datasets. We evaluate our proposal on two standard translation benchmarks, i.e., WMT14 English-German (En-De, 4.5M training pairs) and WMT16 English-Romanian (En-Ro, 610K training pairs). We apply the same preprocessing steps as in prior work (En-De: Zhou et al., 2020; En-Ro: Lee et al., 2018). BLEU (Papineni et al., 2002) is used to evaluate the translation performance of all models.
Knowledge Distillation (KD). Following previous NAT studies (Gu et al., 2018; Zhou et al., 2020), REDERs are trained on distilled data generated from pre-trained auto-regressive Transformer models. The beam size is set to during generation.
Beam Search Decoding. We implement two kinds of inference policies. The first is the most straightforward policy, which picks the token with the highest probability at each position. For a fair comparison with other NAT studies, we also implement beam search for REDER with an efficient C++ library (https://github.com/parlance/ctcdecode). We adopt the first policy in the default setting.
Implementation Details. We design REDER based on the hyper-parameters of Transformer-base (Vaswani et al., 2017). The number of heads is 8, the embedding dimension is 512, and the dimension of Ffn is 2048. REDER consists of 12 stacked layers. For both AT and NAT models, we set the dropout rate for WMT14 En-De and WMT16 En-Ro. We adopt weight decay with a decay rate and label smoothing with . By default, we upsample the source input by a factor of 2 for CTC. We set both auxiliary loss coefficients to 0.1 for all experiments. All models are trained for K updates using Nvidia V100 GPUs with a batch size of approximately K tokens. Following prior studies (Vaswani et al., 2017), we compute tokenized case-sensitive BLEU. We measure the validation BLEU score every 2,000 updates, and average the best checkpoints to obtain the final model. As in previous NAT studies, we measure GPU latency by running the model with a single sentence per batch on a single Nvidia V100 GPU. All models are implemented in fairseq (Ott et al., 2019).
5.2 Main Results
We compare REDER with previous AT and NAT models, as well as simplex models leveraging bidirectional learning signals. As shown in Table 1, the proposed REDER achieves competitive results compared with these strong baselines.
The proposed duplex network has capability comparable to strong NAT models. Since REDER, unlike conventional encoder-decoder models, has no division into encoder and decoder, we need to inspect the generalization ability of the duplex architecture. It is surprising to see that the gap between REDER and traditional encoder-decoder NAT models (with CTC loss) is negligible, within about half a BLEU point on WMT14 En-De translation (see Table 1), verifying that REDER is reliable for further testing on bidirectional tasks.
REDER enables reversible machine translation and better accuracy. We then show that a unified REDER trained on the same parallel data can simultaneously work in two directions. With the auxiliary losses (i.e., cycle consistency and layer-wise forward-backward agreement) enabled by the duplex reversibility, REDER surprisingly achieves an improvement of more than 1 BLEU over its simplex version (see Table 1). Finally, with the help of beam search and re-ranking, the performance of REDER is extremely close to that of normal simplex AT models. These results verify our motivation that a duplex model can exploit the bidirectionality of sequence-to-sequence tasks in one unified model, in such a way that the two directions boost each other. To the best of our knowledge, REDER is the first duplex approach that enables reversible machine translation in a unified model rather than in separate simplex models.
Comparison with existing approaches. We first compare REDER with existing simplex approaches that exploit bidirectional signals (Xia et al., 2018; Zheng et al., 2020; Chan et al., 2019). Some of these approaches need to deploy two separate simplex models, one per direction (Xia et al., 2018; Zheng et al., 2020). REDER, in contrast, needs only one unified duplex model and coherently models the two directions. REDER achieves performance close to theirs despite the fact that non-autoregressive modeling is far more challenging to learn. Alternatively, Chan et al. (2019) use a single simplex network to achieve bidirectional translation via multi-task learning, which needs to split limited capacity between the two directions and underperforms REDER in the parallel-data setting. We present an in-depth discussion of such multi-task approaches later (§ 5.4).
As for non-autoregressive translation (NAT) approaches, CTC (Saharia et al., 2020) and GLAT (Qian et al., 2020) are rather helpful. Among them, Gu & Kong (2020) explore the best combination of techniques for NAT, in which GLAT+CTC achieves one of the best NAT accuracies so far. In contrast, this paper focuses on the quite different idea of developing a duplex seq2seq model that can perform reversible MT, for which non-autoregressive modeling is one of the technical solutions chosen to reach our ultimate goal. Nevertheless, REDER closely approaches the state-of-the-art results of Gu & Kong (2020), whose techniques could also be used to enhance REDER. We leave this for future exploration.
Example. We also show an example regarding forward prediction and reconstruction of REDER in Figure 3.
5.3 Effects of Decoding and Re-ranking
|Model||En-De||De-En||BP||Speed-up|
|Transformer (AT, teacher)||27.20||31.00||0.980||1.0|
|GLAT (Qian et al., 2020)||25.21||29.84||-||-|
|+ NPD=7 + AT reranking||26.55||31.02||-||-|
|REDER (w/ )||26.70||30.68||0.935||19.9|
|+ beam=20 + AT reranking||27.36||31.10||1.000||5.5|
|+ beam=100 + AT reranking||27.52||31.45||1.000||1.2|
The performance of REDER can be further boosted with additional techniques (beam search and re-ranking). For CTC beam search, we use the teacher model (AT base) to re-rank the translation candidates obtained by the beam search and pick the one with the best quality. As shown in Table 2, a larger beam size results in a smaller BP for AT models, meaning that they produce shorter translations (Stahlberg & Byrne, 2019; Eikema & Aziz, 2020). For pure NAT models using the advanced glancing strategy (GLAT), noisy parallel decoding and re-ranking provide significant improvements (about 1.3 BLEU for both En-De and De-En). In REDER, CTC beam search helps produce longer outputs (larger BPs) but yields only a small improvement. With beam search and AT re-ranking, REDER can generate better translations. These results imply that we need to find a better way to train REDER (and probably the NAT family) if we do not want to involve an extra AT model for such somewhat inconvenient re-ranking.
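The decoding steps discussed above (collapsing CTC alignments, re-ranking beam candidates with a teacher, and reading the brevity penalty as a length diagnostic) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in our experiments: the blank symbol `<blank>`, the function names, and the `teacher_score` callable (a stand-in for the AT teacher's score of a candidate) are hypothetical. The brevity penalty follows the standard BLEU definition (Papineni et al., 2002).

```python
import math
from typing import Callable, List, Sequence

BLANK = "<blank>"  # assumed name for the CTC blank symbol


def ctc_collapse(alignment: Sequence[str], blank: str = BLANK) -> List[str]:
    """Map a CTC alignment to an output: merge adjacent repeats, then drop blanks."""
    output: List[str] = []
    previous = None
    for token in alignment:
        if token != previous and token != blank:
            output.append(token)
        previous = token
    return output


def rerank(candidates: Sequence[List[str]],
           teacher_score: Callable[[List[str]], float]) -> List[str]:
    """Return the candidate that the (hypothetical) teacher scores highest."""
    return max(candidates, key=teacher_score)


def brevity_penalty(hyp_len: int, ref_len: int) -> float:
    """Standard BLEU brevity penalty; values below 1 indicate short outputs."""
    if hyp_len == 0:
        return 0.0
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)


# Toy usage: two beam candidates expressed as CTC alignments.
beams = [
    ["das", "das", BLANK, "ist", "gut", BLANK],
    ["das", BLANK, BLANK, "gut", "gut", BLANK],
]
hypotheses = [ctc_collapse(b) for b in beams]
# Toy stand-in for the teacher: it simply prefers longer hypotheses.
best = rerank(hypotheses, teacher_score=lambda hyp: float(len(hyp)))
```

In practice the toy scorer would be replaced by the AT teacher's log-probability of each candidate given the source sentence, which is what makes this re-ranking step depend on an extra AT model.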
5.4 Does Duplex Network Really Matter for Bidirectional Translation?
|NAT: GLAT+CTC (b=1)||✗||27.49||31.10|
|NAT: GLAT+CTC (b=20)||✗||62M||26.79||30.45|
|NAT-multi: GLAT+CTC (b=20)||✓||62M||25.50||29.49|
By definition, learning bidirectional translation with a duplex network results in reversible machine translation. Meanwhile, multi-task learning can also help to learn two or more translation directions in one simplex model, resulting in bi- or multi-lingual NMT models. Here we refer to them as multi-task simplex models in the investigated bilingual scenarios. Such multi-task simplex models share the encoder and decoder across all involved languages, which has been shown to help low-resource languages but hurt high-resource languages in multilingual machine translation (Johnson et al., 2017b; Arivazhagan et al., 2019; Zhang et al., 2020). Thus, one may ask: given such models, does a duplex model really matter for a bidirectional task? To answer this, we compare against multi-task approaches that share the encoder and decoder for both translation directions. As shown in Table 3, (1) multilingual-style models suffer from sharing capacity, so their performance drops ( vs , vs and vs ), which supports our concern about the limited capacity of multi-task models. (2) When using beam search, reversible machine translation makes REDER outperform multi-task AT on bidirectional translation ( vs ). REDER even achieves results very close to those of unidirectional AT ( vs /), while translating both directions in one model. In addition, all of the NAT models are faster than the AT models. This evidence shows the advantage and practical value of the proposed REDER. It also indicates that reversible machine translation is a better solution for bidirectional translation.
5.5 Ablation Study of Components
REDER is built on top of various components in terms of data (knowledge distillation), learning objective (CTC), architecture (RevNet, relative attention), and the auxiliary losses made possible by the reversibility of REDER. We analyze their effects through various combinations in Table 4. We first consider training REDER for a single direction to find the best practice for running REDER on sequence-to-sequence tasks. KD and CTC are essential to training REDER, as suggested by previous NAT studies (Saharia et al., 2020; Gu & Kong, 2020). Meanwhile, we notice the benefit of relative self-attention. We therefore use these three techniques by default for all of the proposed models. The duplex variants, which learn both directions simultaneously, further improve the translation accuracy by substantial margins. These results verify our motivation that the paired translation directions can be better learned in a unified reversible model. Reversibility also enables us to use layer-wise forward-backward agreement and cycle consistency, both of which are shown to improve accuracy considerably.
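To make the notion of reversibility in the ablation concrete, the sketch below shows the additive coupling of a reversible residual block in the style of Gomez et al. (2017), where the input of a layer can be recovered exactly (up to floating-point rounding) from its output. It is an illustration under assumptions rather than the exact REDER layer: the sub-layers `f` and `g` are toy stand-ins for the attention and feed-forward sub-layers, and all names are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def f(x: Vector) -> Vector:
    """Toy stand-in for one sub-layer (e.g. self-attention)."""
    return [math.tanh(v) for v in x]


def g(x: Vector) -> Vector:
    """Toy stand-in for the other sub-layer (e.g. feed-forward)."""
    return [0.5 * v * v for v in x]


def add(a: Vector, b: Vector) -> Vector:
    return [u + v for u, v in zip(a, b)]


def sub(a: Vector, b: Vector) -> Vector:
    return [u - v for u, v in zip(a, b)]


def forward(x1: Vector, x2: Vector) -> Tuple[Vector, Vector]:
    """Additive coupling: y1 = x1 + f(x2), y2 = x2 + g(y1)."""
    y1 = add(x1, f(x2))
    y2 = add(x2, g(y1))
    return y1, y2


def reverse(y1: Vector, y2: Vector) -> Tuple[Vector, Vector]:
    """Exact inverse of `forward`: x2 = y2 - g(y1), x1 = y1 - f(x2)."""
    x2 = sub(y2, g(y1))
    x1 = sub(y1, f(x2))
    return x1, x2


# Round trip: running the block forward and then in reverse restores the input.
x1, x2 = [0.1, -0.4, 2.0], [1.5, 0.0, -0.7]
y1, y2 = forward(x1, x2)
r1, r2 = reverse(y1, y2)
```

Because neither `f` nor `g` needs to be invertible itself, any sub-layer can be plugged in. Objectives such as cycle consistency then compare an input with its reconstruction after a forward and a reverse pass through the same parameters.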
6 Conclusion and Future Work
In this paper, we propose REDER, the reversible duplex Transformer for reversible sequence-to-sequence problems, and apply it to machine translation, showing for the first time the feasibility of a reversible machine translation system. REDER is a fully reversible model that can transform one sequence into the other, back and forth, by reading and generating through its two ends. We verify our motivation and the effectiveness of REDER on several widely used NMT benchmarks, where REDER shows appealing performance compared with strong baselines.
As for promising future directions, REDER can be applied to monolingual, multilingual and zero-shot settings, thanks to the fact that each “end” of REDER specializes in a language. For instance, given two trained REDERs, one with a German (De) end and one with a Japanese (Ja) end, we can combine the last half of the layers (the De end) of the former with the Ja end of the latter to obtain a zero-shot model that translates between German and Japanese. Likewise, composing an English end with its reverse results in a model that can learn from monolingual data like an autoencoder. This compositional fashion resembles LEGO, and it manipulates only a linear number of language ends. Therefore, when adding a new language to a multilingual REDER system (in the form of a composition of the ends of the involved languages), we would probably not need to re-train the whole system as we do for a current multilingual NMT system, which reduces the difficulty and cost of training, deploying and maintaining a large-scale multilingual translation system.
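The claim about a linear number of language ends can be checked with simple counting. The sketch below is only an illustration of that arithmetic under assumptions: the language codes, the function names, and the naming of ends are hypothetical, and no trained model is involved.

```python
from itertools import combinations, permutations
from typing import Dict, List, Tuple


def system_sizes(languages: List[str]) -> Dict[str, int]:
    """Count the components needed to cover every language pair."""
    return {
        # One reusable end per language in a compositional duplex system.
        "language_ends": len(languages),
        # One duplex model per unordered language pair if trained separately.
        "duplex_pairs": len(list(combinations(languages, 2))),
        # One simplex model per translation direction.
        "simplex_directions": len(list(permutations(languages, 2))),
    }


def compose(src: str, tgt: str) -> Tuple[str, str]:
    """Name the two ends that a composed src<->tgt model would reuse."""
    return (f"{src}-end", f"{tgt}-end")


four = system_sizes(["en", "de", "ja", "fr"])
five = system_sizes(["en", "de", "ja", "fr", "zh"])
zero_shot = compose("de", "ja")
```

Going from four to five languages adds a single end, whereas the number of separately trained directional simplex models grows from 12 to 20.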
- Arivazhagan et al. (2019) Arivazhagan, N., Bapna, A., Firat, O., Lepikhin, D., Johnson, M., Krikun, M., Chen, M. X., Cao, Y., Foster, G., Cherry, C., et al. Massively multilingual neural machine translation in the wild: Findings and challenges. arXiv preprint arXiv:1907.05019, 2019.
- Ba et al. (2016) Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- Bahdanau et al. (2015) Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. In Bengio, Y. and LeCun, Y. (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1409.0473.
- Bao et al. (2019) Bao, Y., Zhou, H., Feng, J., Wang, M., Huang, S., Chen, J., and Li, L. Non-autoregressive transformer by position learning. arXiv preprint arXiv:1911.10677, 2019.
- Chan et al. (2019) Chan, W., Kitaev, N., Guu, K., Stern, M., and Uszkoreit, J. Kermit: Generative insertion-based modeling for sequences. arXiv preprint arXiv:1906.01604, 2019.
- Cheng et al. (2016a) Cheng, Y., Shen, S., He, Z., He, W., Wu, H., Sun, M., and Liu, Y. Agreement-based joint training for bidirectional attention-based neural machine translation. In Kambhampati, S. (ed.), Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, pp. 2761–2767. IJCAI/AAAI Press, 2016a. URL http://www.ijcai.org/Abstract/16/392.
- Cheng et al. (2016b) Cheng, Y., Xu, W., He, Z., He, W., Wu, H., Sun, M., and Liu, Y. Semi-supervised learning for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1965–1974, 2016b.
- Dinh et al. (2014) Dinh, L., Krueger, D., and Bengio, Y. Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
- Dinh et al. (2017) Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using real NVP. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL https://openreview.net/forum?id=HkpbnH9lx.
- Eikema & Aziz (2020) Eikema, B. and Aziz, W. Is MAP decoding all you need? the inadequacy of the mode in neural machine translation. In Proceedings of the 28th International Conference on Computational Linguistics, pp. 4506–4520, Barcelona, Spain (Online), December 2020. International Committee on Computational Linguistics. doi: 10.18653/v1/2020.coling-main.398. URL https://www.aclweb.org/anthology/2020.coling-main.398.
- Franck (1992) Franck, G. Reversible grammars and natural language processing. In Proceedings of the 1992 ACM/SIGAPP Symposium on Applied Computing: Technological Challenges of the 1990’s, pp. 102–109, New York, NY, USA, 1992. Association for Computing Machinery. ISBN 089791502X. doi: 10.1145/143559.143597. URL https://doi.org/10.1145/143559.143597.
- Gehring et al. (2017) Gehring, J., Auli, M., Grangier, D., Yarats, D., and Dauphin, Y. N. Convolutional sequence to sequence learning. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pp. 1243–1252. PMLR, 2017. URL http://proceedings.mlr.press/v70/gehring17a.html.
- Ghazvininejad et al. (2019) Ghazvininejad, M., Levy, O., Liu, Y., and Zettlemoyer, L. Mask-predict: Parallel decoding of conditional masked language models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 6112–6121, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1633. URL https://www.aclweb.org/anthology/D19-1633.
- Ghazvininejad et al. (2020) Ghazvininejad, M., Levy, O., and Zettlemoyer, L. Semi-autoregressive training improves mask-predict decoding. arXiv preprint arXiv:2001.08785, 2020.
- Gomez et al. (2017) Gomez, A. N., Ren, M., Urtasun, R., and Grosse, R. B. The reversible residual network: Backpropagation without storing activations. In Guyon, I., von Luxburg, U., Bengio, S., Wallach, H. M., Fergus, R., Vishwanathan, S. V. N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 2214–2224, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/f9be311e65d81a9ad8150a60844bb94c-Abstract.html.
- Graves et al. (2006) Graves, A., Fernández, S., Gomez, F. J., and Schmidhuber, J. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Cohen, W. W. and Moore, A. W. (eds.), Machine Learning, Proceedings of the Twenty-Third International Conference (ICML 2006), Pittsburgh, Pennsylvania, USA, June 25-29, 2006, volume 148 of ACM International Conference Proceeding Series, pp. 369–376. ACM, 2006. doi: 10.1145/1143844.1143891. URL https://doi.org/10.1145/1143844.1143891.
- Gu & Kong (2020) Gu, J. and Kong, X. Fully non-autoregressive neural machine translation: Tricks of the trade. arXiv preprint arXiv:2012.15833, 2020.
- Gu et al. (2018) Gu, J., Bradbury, J., Xiong, C., Li, V. O. K., and Socher, R. Non-autoregressive neural machine translation. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018. URL https://openreview.net/forum?id=B1l8BtlCb.
- Gu et al. (2019) Gu, J., Wang, C., and Zhao, J. Levenshtein transformer. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E. B., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 11179–11189, 2019. URL https://proceedings.neurips.cc/paper/2019/hash/675f9820626f5bc0afb47b57890b466e-Abstract.html.
- He et al. (2016) He, D., Xia, Y., Qin, T., Wang, L., Yu, N., Liu, T., and Ma, W. Dual learning for machine translation. In Lee, D. D., Sugiyama, M., von Luxburg, U., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 820–828, 2016. URL https://proceedings.neurips.cc/paper/2016/hash/5b69b9cb83065d403869739ae7f0995e-Abstract.html.
- Johnson et al. (2017a) Johnson, M., Schuster, M., Le, Q. V., Krikun, M., Wu, Y., Chen, Z., Thorat, N., Viégas, F., Wattenberg, M., Corrado, G., Hughes, M., and Dean, J. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351, 2017a. doi: 10.1162/tacl_a_00065. URL https://www.aclweb.org/anthology/Q17-1024.
- Johnson et al. (2017b) Johnson, M., Schuster, M., Le, Q. V., Krikun, M., Wu, Y., Chen, Z., Thorat, N., Viégas, F., Wattenberg, M., Corrado, G., Hughes, M., and Dean, J. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351, 2017b. doi: 10.1162/tacl_a_00065. URL https://www.aclweb.org/anthology/Q17-1024.
- Kasai et al. (2020) Kasai, J., Cross, J., Ghazvininejad, M., and Gu, J. Non-autoregressive machine translation with disentangled context transformer. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pp. 5144–5155. PMLR, 2020. URL http://proceedings.mlr.press/v119/kasai20a.html.
- Kingma et al. (2016) Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improved variational inference with inverse autoregressive flow. Advances in Neural Information Processing Systems, 29:4743–4751, 2016.
- Kitaev et al. (2020) Kitaev, N., Kaiser, L., and Levskaya, A. Reformer: The efficient transformer. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=rkgNKkHtvB.
- Lee et al. (2018) Lee, J., Mansimov, E., and Cho, K. Deterministic non-autoregressive neural sequence modeling by iterative refinement. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1173–1182, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1149. URL https://www.aclweb.org/anthology/D18-1149.
- Li et al. (2019) Li, Z., Lin, Z., He, D., Tian, F., Qin, T., Wang, L., and Liu, T.-Y. Hint-based training for non-autoregressive machine translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5708–5713, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1573. URL https://www.aclweb.org/anthology/D19-1573.
- Libovický & Helcl (2018) Libovický, J. and Helcl, J. End-to-end non-autoregressive neural machine translation with connectionist temporal classification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3016–3021, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1336. URL https://www.aclweb.org/anthology/D18-1336.
- Ma et al. (2019) Ma, X., Zhou, C., Li, X., Neubig, G., and Hovy, E. FlowSeq: Non-autoregressive conditional sequence generation with generative flow. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4282–4292, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1437. URL https://www.aclweb.org/anthology/D19-1437.
- MacKay et al. (2018) MacKay, M., Vicol, P., Ba, J., and Grosse, R. B. Reversible recurrent neural networks. In Bengio, S., Wallach, H. M., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pp. 9043–9054, 2018. URL https://proceedings.neurips.cc/paper/2018/hash/4ff6fa96179cdc2838e8d8ce64cd10a7-Abstract.html.
- Ott et al. (2019) Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., Grangier, D., and Auli, M. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pp. 48–53, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-4009. URL https://www.aclweb.org/anthology/N19-4009.
- Papamakarios et al. (2017) Papamakarios, G., Murray, I., and Pavlakou, T. Masked autoregressive flow for density estimation. In Guyon, I., von Luxburg, U., Bengio, S., Wallach, H. M., Fergus, R., Vishwanathan, S. V. N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 2338–2347, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/6c1da886822c67822bcf3679d04369fa-Abstract.html.
- Papineni et al. (2002) Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Computational Linguistics. doi: 10.3115/1073083.1073135. URL https://www.aclweb.org/anthology/P02-1040.
- Qian et al. (2020) Qian, L., Zhou, H., Bao, Y., Wang, M., Qiu, L., Zhang, W., Yu, Y., and Li, L. Glancing transformer for non-autoregressive neural machine translation. arXiv preprint arXiv:2008.07905, 2020.
- Saharia et al. (2020) Saharia, C., Chan, W., Saxena, S., and Norouzi, M. Non-autoregressive machine translation with latent alignments. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1098–1108, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.83. URL https://www.aclweb.org/anthology/2020.emnlp-main.83.
- Shaw et al. (2018) Shaw, P., Uszkoreit, J., and Vaswani, A. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 464–468, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-2074. URL https://www.aclweb.org/anthology/N18-2074.
- Shu et al. (2020) Shu, R., Lee, J., Nakayama, H., and Cho, K. Latent-variable non-autoregressive neural machine translation with deterministic inference using a delta posterior. In EMNLP, 2020.
- Stahlberg & Byrne (2019) Stahlberg, F. and Byrne, B. On NMT search errors and model errors: Cat got your tongue? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3356–3362, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1331. URL https://www.aclweb.org/anthology/D19-1331.
- Strzalkowski (1993) Strzalkowski, T. Reversible Grammar in Natural Language Processing, volume 255. Springer Science & Business Media, 1993.
- Sutskever et al. (2014) Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pp. 3104–3112, 2014. URL https://proceedings.neurips.cc/paper/2014/hash/a14ac55a4f27472c5d894ec1c3c743d2-Abstract.html.
- van der Ouderaa & Worrall (2019) van der Ouderaa, T. F. and Worrall, D. E. Reversible gans for memory-efficient image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4720–4728, 2019.
- van Noord (1990) van Noord, G. Reversible unification based machine translation. In COLING 1990 Volume 2: Papers presented to the 13th International Conference on Computational Linguistics, 1990. URL https://www.aclweb.org/anthology/C90-2052.
- Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Guyon, I., von Luxburg, U., Bengio, S., Wallach, H. M., Fergus, R., Vishwanathan, S. V. N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 5998–6008, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
- Wang et al. (2019) Wang, Y., Tian, F., He, D., Qin, T., Zhai, C., and Liu, T.-Y. Non-autoregressive machine translation with auxiliary regularization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 5377–5384, 2019.
- Wei et al. (2019) Wei, B., Wang, M., Zhou, H., Lin, J., and Sun, X. Imitation learning for non-autoregressive neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1304–1312, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1125. URL https://www.aclweb.org/anthology/P19-1125.
- Xia et al. (2017) Xia, Y., Qin, T., Chen, W., Bian, J., Yu, N., and Liu, T.-Y. Dual supervised learning. In International Conference on Machine Learning, pp. 3789–3798. PMLR, 2017.
- Xia et al. (2018) Xia, Y., Tan, X., Tian, F., Qin, T., Yu, N., and Liu, T.-Y. Model-level dual learning. In International Conference on Machine Learning, pp. 5383–5392. PMLR, 2018.
- Zhang et al. (2020) Zhang, B., Williams, P., Titov, I., and Sennrich, R. Improving massively multilingual neural machine translation and zero-shot translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 1628–1639, 2020.
- Zhang et al. (2021) Zhang, B., Bapna, A., Sennrich, R., and Firat, O. Share or not? learning to schedule language-specific capacity for multilingual translation. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=Wj4ODo0uyCF.
- Zheng et al. (2020) Zheng, Z., Zhou, H., Huang, S., Li, L., Dai, X.-Y., and Chen, J. Mirror-generative neural machine translation. In International Conference on Learning Representations, 2020.
- Zhou et al. (2020) Zhou, C., Gu, J., and Neubig, G. Understanding knowledge distillation in non-autoregressive machine translation. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=BygFVAEKDH.
Appendix A Additional Empirical Results
A.1 Impact of Knowledge Distillation
|En-De KD (only De distilled)||25.49||26.57||/|
|De-En KD (only En distilled)||23.04||28.82||/|
|separate KD [final model]||26.70||30.68||/|
Like other NAT approaches, REDER relies heavily on knowledge distillation (KD). We report the performance of models trained on raw data and on distilled data generated by AT models in Table 5. As we can see, without KD, the accuracy of REDER drops significantly. We then explore the most suitable way to integrate KD data. We observe that if we only use KD data of one direction (only target-side data are distilled, e.g., German sentences in En-De), mostly that single direction benefits. These results imply that we need to provide KD data for both directions to train REDER, so we then investigate how to do so. We notice that mixing the KD data of both directions by directly concatenating them somewhat improves results compared with using single-direction KD data only. Finally, we find that the best way is to feed KD data separately according to direction, i.e., feeding En-De KD data when training the En-De direction and De-En KD data when training the reverse De-En direction.
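The "separate KD" policy can be sketched as a small data-routing step. This is a minimal sketch under assumptions, not our training pipeline: the function name, the dictionary keys, and the toy sentences are hypothetical, and the teacher outputs are assumed to be precomputed and aligned with the raw parallel data.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def build_kd_training_sets(raw_pairs: List[Pair],
                           distilled_de: List[str],
                           distilled_en: List[str]) -> Dict[str, List[Pair]]:
    """Route distilled targets to the direction that generates them.

    raw_pairs:    original (English, German) sentence pairs.
    distilled_de: En-De teacher outputs, one per English source.
    distilled_en: De-En teacher outputs, one per German source.
    """
    if not (len(raw_pairs) == len(distilled_de) == len(distilled_en)):
        raise ValueError("raw and distilled data must be aligned")
    # En-De direction: raw English source, distilled German target.
    en_de = [(en, de_hat) for (en, _), de_hat in zip(raw_pairs, distilled_de)]
    # De-En direction: raw German source, distilled English target.
    de_en = [(de, en_hat) for (_, de), en_hat in zip(raw_pairs, distilled_en)]
    return {"en-de": en_de, "de-en": de_en}


# Toy usage with one sentence pair and made-up teacher outputs.
raw = [("good morning", "guten Morgen")]
sets = build_kd_training_sets(
    raw,
    distilled_de=["guten Morgen !"],
    distilled_en=["good morning !"],
)
```

Each direction thus always sees a raw source and a distilled target, in contrast to the mixed policy, which concatenates the two distilled corpora regardless of the direction being trained.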