
Zero-Shot Translation using Diffusion Models

by   Eliya Nachmani, et al.

In this work, we show a novel method for neural machine translation (NMT), using a denoising diffusion probabilistic model (DDPM), adjusted for textual data, following recent advances in the field. We show that it's possible to translate sentences non-autoregressively using a diffusion model conditioned on the source sentence. We also show that our model is able to translate between pairs of languages unseen during training (zero-shot learning).





I Introduction

Deep generative models can be divided into the following categories: (i) flow-based models, such as Glow [kingma2018glow]; (ii) autoregressive models, e.g. the Transformer for language modeling [vaswani2017attention]; (iii) GAN-based [goodfellow2014generative] models, such as WaveGAN [donahue2018adversarial] for speech and StyleGAN [karras2020analyzing] for vision applications; (iv) VAE-based [kingma2013auto] models, e.g. VQ-VAE [razavi2019generating] and NVAE [vahdat2020nvae]; and (v) diffusion probabilistic models [sohl2015deep], such as ADM [dhariwal2021diffusion].

Diffusion probabilistic models achieve results comparable or superior to those of other deep generative models, for example WaveGrad for speech synthesis [chen2020wavegrad] and ADM for image generation [dhariwal2021diffusion].

The underlying architecture of diffusion probabilistic models is a Markov chain of latent variables. The data flows in two directions: (i) the diffusion process, and (ii) the denoising process. The denoising process is the inference process, which generates data starting from Gaussian noise. The diffusion process is used during training and gradually transforms data samples into Gaussian noise.

In the seminal work of Hoogeboom et al. [hoogeboom2021argmax], a diffusion model for categorical variables was introduced. The paper shows that the original diffusion process, which is suitable for continuous data such as speech and ordinal data such as images, can be adapted to model discrete categorical data. They trained a diffusion network on the language modeling task.

In this work we propose a diffusion model for neural machine translation. Furthermore, we show that the proposed model has some capabilities of zero-shot translation. To our knowledge, we are the first to perform conditional text generation using a diffusion model.

II Related Work

In [sohl2015deep], Sohl-Dickstein et al. introduce the diffusion process. The diffusion process takes the variational distribution $q(x_t \mid x_{t-1})$ and adds Gaussian noise at each time step $t \in \{1, \dots, T\}$, where $x_0$ is the original data point and $x_T$ is pure noise.

In this section we recap the multinomial diffusion process as defined by Hoogeboom et al. [hoogeboom2021argmax] for categorical data. We denote $x_t$ as a 1-hot vector with $K$ categories. $x_0$ is the data point, and $q(x_t \mid x_{t-1})$ is the diffusion model that gradually adds a small amount of noise at each step. At $t = T$, $x_T$ is almost completely noise. The opposite direction, $p_\theta(x_{t-1} \mid x_t)$, is a learnable distribution that denoises the data. The diffusion model is optimized with the variational bound on the negative log likelihood:

$$\log P(x_0) \geq \mathbb{E}_{q}\left[ \log p(x_T) + \sum_{t=1}^{T} \log \frac{p(x_{t-1} \mid x_t)}{q(x_t \mid x_{t-1})} \right] \tag{1}$$

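Sohl-Dickstein et al. [sohl2015deep] use $x_0$ as a condition and show that Eq. 1 becomes:

$$\log P(x_0) \geq \mathbb{E}_{q}\left[ \log p(x_0 \mid x_1) - \mathrm{KL}\big(q(x_T \mid x_0)\,|\,p(x_T)\big) - \sum_{t=2}^{T} \mathrm{KL}\big(q(x_{t-1} \mid x_t, x_0)\,|\,p(x_{t-1} \mid x_t)\big) \right] \tag{2}$$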

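where $\mathrm{KL}\big(q(x_T \mid x_0)\,|\,p(x_T)\big) \approx 0$ if the diffusion trajectory is defined well. The variational distribution is defined as follows:

$$q(x_t \mid x_{t-1}) = C\big(x_t \mid (1 - \beta_t)\, x_{t-1} + \beta_t / K\big)$$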

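$\beta_t$ is the probability to sample from the uniform distribution. Using the Markov chain property, one can get the closed form to sample $x_t$ from $x_0$:

$$q(x_t \mid x_0) = C\big(x_t \mid \bar{\alpha}_t\, x_0 + (1 - \bar{\alpha}_t) / K\big)$$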

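where $\alpha_t$ and $\bar{\alpha}_t$ are defined in the same manner as in the original DDPM [ho2020denoising], i.e. $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{\tau=1}^{t} \alpha_\tau$. One can further obtain the posterior in closed form:

$$q(x_{t-1} \mid x_t, x_0) = C\big(x_{t-1} \mid \theta_{\mathrm{post}}(x_t, x_0)\big), \qquad \theta_{\mathrm{post}}(x_t, x_0) = \tilde{\theta} \Big/ \sum_{k=1}^{K} \tilde{\theta}_k,$$

$$\tilde{\theta} = \big[\alpha_t\, x_t + (1 - \alpha_t) / K\big] \odot \big[\bar{\alpha}_{t-1}\, x_0 + (1 - \bar{\alpha}_{t-1}) / K\big]$$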

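Hoogeboom et al. [hoogeboom2021argmax] predict a probability vector for $\hat{x}_0$ from $x_t$. They parametrize $p(x_{t-1} \mid x_t)$ from $q(x_{t-1} \mid x_t, \hat{x}_0)$, where $\hat{x}_0$ is approximated with a neural network $\hat{x}_0 = \mu(x_t, t)$. Denote

$$p(x_0 \mid x_1) = C(x_0 \mid \hat{x}_0), \qquad p(x_{t-1} \mid x_t) = C\big(x_{t-1} \mid \theta_{\mathrm{post}}(x_t, \hat{x}_0)\big)$$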

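Then, the variational lower bound of Eq. 2 becomes:

$$\log P(x_0) \geq \mathbb{E}_{q}\left[ \sum_{k} x_{0,k} \log \hat{x}_{0,k} - \sum_{t=2}^{T} \mathrm{KL}\Big( C\big(\theta_{\mathrm{post}}(x_t, x_0)\big)\,\Big|\,C\big(\theta_{\mathrm{post}}(x_t, \hat{x}_0)\big) \Big) \right]$$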
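As a rough illustration of the KL term in this bound, the following sketch computes the posterior parameters $\theta_{\mathrm{post}}$ for a single token, once from the true $x_0$ and once from a predicted $\hat{x}_0$, and then the KL divergence between the two. It assumes the multinomial diffusion posterior of Hoogeboom et al. [hoogeboom2021argmax]; the function names, the toy vocabulary of four categories and the values chosen for $\alpha_t$ and $\bar{\alpha}_{t-1}$ are illustrative assumptions and are not taken from our experiments.

```python
import math


def theta_post(x_t, x_0, alpha_t, alpha_bar_prev):
    """Parameters of q(x_{t-1} | x_t, x_0) for one token (assumed form).

    x_t and x_0 are length-K probability vectors (one-hot for real data).
    """
    K = len(x_t)
    # Elementwise product of the two "mix with uniform" terms.
    unnormalized = [
        (alpha_t * a + (1.0 - alpha_t) / K)
        * (alpha_bar_prev * b + (1.0 - alpha_bar_prev) / K)
        for a, b in zip(x_t, x_0)
    ]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def categorical_kl(p, q, eps=1e-12):
    """KL(C(p) | C(q)) between two categorical distributions."""
    return sum(
        pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0
    )


# Toy example with K = 4 categories (all values are illustrative).
x_0 = [0.0, 0.0, 1.0, 0.0]      # true token, one-hot
x_t = [1.0, 0.0, 0.0, 0.0]      # noisy token at step t, one-hot
x_0_hat = [0.1, 0.1, 0.7, 0.1]  # hypothetical network prediction for x_0

true_post = theta_post(x_t, x_0, alpha_t=0.9, alpha_bar_prev=0.5)
pred_post = theta_post(x_t, x_0_hat, alpha_t=0.9, alpha_bar_prev=0.5)
kl_term = categorical_kl(true_post, pred_post)
```

The KL term is zero only when the predicted posterior matches the true one, so minimizing it pushes $\hat{x}_0$ towards $x_0$.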

It is worth mentioning the work by Austin et al. [austin2021structured], which improves on Hoogeboom et al. [hoogeboom2021argmax] by going beyond corruption processes with uniform transition probabilities. They use transition matrices that mimic Gaussian kernels in continuous space and show that using different transition matrices leads to improved results in text generation.

Fig. 1: High-level description of the proposed model. The encoder receives the source-language sentence and computes outputs that are fed into the decoder through a cross-attention mechanism, as customary. The decoder receives a noisy target-language sentence and outputs a slightly less noisy one. The time positional encoding module consists of a sinusoidal positional embedding and a linear layer; its output is added to each layer of the encoder and the decoder. This is in addition to the axial positional encoding that is built into the transformer.

III Method

We use a neural network which predicts a probability vector at each diffusion step, similar to the method used by Hoogeboom et al. [hoogeboom2021argmax]. The architecture is transformer-based, with an additional time-based positional encoding. Unlike Hoogeboom et al., who perform unconditional text generation, we are interested in sentence translation. We therefore add the sentence in the source language as a condition and predict probability vectors to create the sentence in the target language. Our architecture is inspired by the encoder-decoder approach: the source-language sentence is given as input to a transformer encoder and the noisy target sentence is given as input to a transformer decoder, such that the encoder’s outputs are used in a cross-attention mechanism in each layer of the decoder. Unlike standard encoder-decoder systems, our method does not predict output tokens one at a time (autoregressively), but rather predicts the probabilities of all tokens at each denoising step.
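To make the time-based positional encoding more concrete, here is a minimal sketch of a sinusoidal embedding of the diffusion step followed by a linear layer, as described in the caption of Fig. 1. The embedding dimension, the `max_period` constant, the random weights and the function names are assumptions made for illustration only; the actual module may differ.

```python
import math
import random


def sinusoidal_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of a scalar time step t (dim must be even)."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def linear(vec, weight, bias):
    """Apply a dense layer; weight is a list of rows, one per output unit."""
    return [
        sum(w * v for w, v in zip(row, vec)) + b
        for row, b in zip(weight, bias)
    ]


dim = 8  # illustrative size, far smaller than a real hidden dimension
rng = random.Random(0)
weight = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]
bias = [0.0] * dim

# Encoding for diffusion step t = 25; in the model, a vector like this
# would be added to the activations of each encoder and decoder layer.
time_encoding = linear(sinusoidal_embedding(t=25, dim=dim), weight, bias)
```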

During training, a time step $t$ is randomly sampled. Using the noise schedule, the posteriors are calculated and a noisy target sentence is created with the closed-form formula for the uniform noise (Eq. 5). Then, a forward pass of our neural network, which receives the noisy target sentence, the source sentence and the time step, predicts the clean target sentence, which in turn is used to calculate the loss function.
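The noising part of a training step can be sketched as follows. Sampling from the closed-form uniform-noise distribution amounts to keeping each target token with probability $\bar{\alpha}_t$ and otherwise resampling it uniformly from the vocabulary. The linear schedule, its endpoints, the number of steps and the token ids below are hypothetical choices for illustration, not the settings used in our experiments.

```python
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear noise schedule beta_1 .. beta_T."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bar(t, betas):
    """Cumulative product of (1 - beta_s) for s = 1 .. t."""
    prod = 1.0
    for beta in betas[:t]:
        prod *= 1.0 - beta
    return prod


def noise_tokens(tokens, t, betas, vocab_size, rng):
    """Sample x_t given x_0: keep each token w.p. alpha_bar_t, else resample."""
    keep_prob = alpha_bar(t, betas)
    return [
        tok if rng.random() < keep_prob else rng.randrange(vocab_size)
        for tok in tokens
    ]


rng = random.Random(0)
betas = linear_betas(num_steps=1000)
target = [17, 4, 256, 9, 31, 2]   # toy target token ids
t = rng.randint(1, len(betas))    # randomly sampled time step
noisy_target = noise_tokens(target, t, betas, vocab_size=4096, rng=rng)
```

The network would then receive `noisy_target`, the source sentence and `t`, and its prediction would enter the loss.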

During inference, we start from random uniform noise and iteratively apply the network to it $T$ times to obtain the predicted target sentence.
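The inference loop can be sketched as below, under several assumptions: the posterior follows the multinomial diffusion form of Hoogeboom et al. [hoogeboom2021argmax], the last step takes the most probable token instead of sampling, and `dummy_denoiser` is a hypothetical stand-in for the conditioned transformer (it simply puts most of the probability mass on the source token id plus one). None of these names come from our implementation.

```python
import random


def one_hot(index, K):
    return [1.0 if k == index else 0.0 for k in range(K)]


def theta_post(x_t, x_0, alpha_t, alpha_bar_prev):
    """Normalized posterior parameters of q(x_{t-1} | x_t, x_0)."""
    K = len(x_t)
    unnorm = [
        (alpha_t * a + (1.0 - alpha_t) / K)
        * (alpha_bar_prev * b + (1.0 - alpha_bar_prev) / K)
        for a, b in zip(x_t, x_0)
    ]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def dummy_denoiser(noisy_tokens, source_tokens, t, K):
    """Hypothetical model: one probability vector over K tokens per position."""
    probs = []
    for src in source_tokens:
        target = (src + 1) % K
        probs.append([0.9 if k == target else 0.1 / (K - 1) for k in range(K)])
    return probs


def sample(source_tokens, K, betas, rng, denoiser=dummy_denoiser):
    alphas = [1.0 - b for b in betas]
    alpha_bars = [1.0]  # alpha_bars[t] is the product of alphas up to step t
    for a in alphas:
        alpha_bars.append(alpha_bars[-1] * a)
    # Start from uniform noise over the vocabulary.
    x = [rng.randrange(K) for _ in source_tokens]
    for t in range(len(betas), 0, -1):
        x0_probs = denoiser(x, source_tokens, t, K)
        if t == 1:
            # Final step: take the most probable token (an assumption).
            x = [max(range(K), key=p.__getitem__) for p in x0_probs]
        else:
            x = [
                rng.choices(
                    range(K),
                    weights=theta_post(
                        one_hot(tok, K), p, alphas[t - 1], alpha_bars[t - 1]
                    ),
                )[0]
                for tok, p in zip(x, x0_probs)
            ]
    return x


translation = sample([3, 1, 4, 1], K=8, betas=[0.05] * 20, rng=random.Random(0))
```

Note that all positions are updated in parallel at each step, which is what makes the procedure non-autoregressive.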

Data Processing 

The diffusion model requires inputs of fixed length. Thus, we pad or truncate all sentences to a fixed length. Sentences are padded with a special padding token.

Two special language tokens are added to each input sentence: the first indicates the source language and the second indicates the target language. This is used to accelerate convergence and to allow for zero-shot learning on pairs of languages unseen during training.
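The data processing described above can be sketched in a few lines. The spelling of the padding and language tokens, their position at the start of the sentence, and the function name `prepare` are assumptions for illustration.

```python
PAD = "[PAD]"  # assumed spelling of the padding token


def prepare(tokens, src_lang, tgt_lang, max_len):
    """Prepend language tokens, then truncate or pad to exactly max_len."""
    tagged = [f"<{src_lang}>", f"<{tgt_lang}>"] + list(tokens)
    tagged = tagged[:max_len]
    return tagged + [PAD] * (max_len - len(tagged))


short = prepare(["je", "sais"], "fr", "en", max_len=6)
long = prepare(["a", "b", "c", "d", "e", "f"], "en", "de", max_len=6)
```

For a zero-shot pair such as German to French, the same function would be called with `src_lang="de"` and `tgt_lang="fr"`, a combination of language tokens that never appears together during training.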

IV Experiment Setup

IV-A Datasets

We used three datasets in total, all of which are from WMT. We trained our network on WMT14 [bojar2014findings] DE-EN and WMT14 FR-EN jointly and in both directions, meaning German to English, English to German, French to English and English to French. We downsampled the larger French dataset in each epoch in order to have the same number of German-English and French-English samples.
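A minimal sketch of how one balanced training epoch could be assembled is shown below: the larger corpus is randomly downsampled to the size of the smaller one, and every sentence pair is used in both directions. The function name, the tuple layout and the toy data are hypothetical.

```python
import random


def balanced_epoch(de_en, fr_en, rng):
    """Build one epoch with equally many DE-EN and FR-EN pairs, both directions."""
    n = min(len(de_en), len(fr_en))
    examples = []
    for (a_lang, b_lang), pairs in ((("de", "en"), de_en), (("fr", "en"), fr_en)):
        # A fresh random subset is drawn every time this function is called.
        for a_text, b_text in rng.sample(pairs, n):
            examples.append((a_lang, b_lang, a_text, b_text))
            examples.append((b_lang, a_lang, b_text, a_text))
    rng.shuffle(examples)
    return examples


de_en = [("hallo", "hello"), ("danke", "thanks")]
fr_en = [("bonjour", "hello"), ("merci", "thanks"), ("oui", "yes"),
         ("non", "no"), ("chat", "cat")]
epoch = balanced_epoch(de_en, fr_en, random.Random(0))
```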

Lastly, we used the WMT19 [barrault2019findings] DE-FR for evaluation only, in order to test the method’s zero-shot learning performance.

IV-B Evaluation Metrics

We used common machine translation evaluation metrics: corpus-level BLEU, SacreBLEU, TER and chrF. We used the official SacreBLEU implementation [post-2018-call] with default parameters.

IV-C Baselines

Current state-of-the-art methods for the WMT14 German-English and French-English translation tasks, evaluated with BLEU and SacreBLEU, all use some type of transformer with autoregressive decoding and very large models, and some use extra data.

IV-D Hyperparameter Settings

We used the ADAM optimizer [kingma2014adam], and tuned the learning rate, batch size and gamma parameters. Eventually, we used a learning rate of , gamma value of and batch size of . We also tried versus diffusion steps, and didn’t see a major difference. We also experimented with the number of transformer layers. Our best model has transformer layers. Other parameters are attention heads and hidden dimension of .

Tokenization  A tokenizer is trained on data from all three languages, using the WordPiece method with normalization similar to BERT [devlin2018bert] (NFD Unicode normalization, followed by Lowercase and StripAccents). Whitespace pre-tokenization is also used. The vocabulary size is a hyperparameter, which we tuned.
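The normalization and pre-tokenization steps can be approximated with the Python standard library alone, as in the sketch below. This is not the tokenizer we trained: the WordPiece step is omitted, and the regular expression is only an assumed approximation of whitespace pre-tokenization that also separates punctuation.

```python
import re
import unicodedata


def normalize(text):
    """NFD decomposition, lowercasing, then removal of combining accents."""
    decomposed = unicodedata.normalize("NFD", text)
    lowered = decomposed.lower()
    return "".join(ch for ch in lowered if unicodedata.category(ch) != "Mn")


def pre_tokenize(text):
    """Split into runs of word characters and runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)


words = pre_tokenize(normalize("Gemeinderat freut sich über soliden Haushalt."))
```

This kind of normalization explains why the samples in Table III appear lowercased and without accents or umlauts.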

Vocabulary Size  The vocabulary size is an important hyperparameter, since it determines the dimension of the space the diffusion model needs to predict, and it changes how the noise probability is distributed (the constant probability of change is spread over a different number of tokens). Hoogeboom et al. [hoogeboom2021argmax] worked with 27 and 256 categories for the text8 and enwik8 datasets, respectively. The fact that no tokenization was used hinted to us that perhaps a large vocabulary size does not work well with this method. However, Austin et al. [austin2021structured] were able to train a network with 8192 categories, using a slightly different method. Furthermore, tokenization is known to be important in complicated tasks such as translation. Therefore, we decided to try different vocabulary sizes and see how sensitive the method is to this choice.

| Task | BLEU | SacreBLEU | TER | chrF |
|---|---|---|---|---|
| DE→EN | 7.17 | 8.13 | 93.1 | 34.7 |
| EN→DE | 3.54 | 4.54 | 102.3 | 33.5 |
| FR→EN | 8.62 | 9.93 | 88.4 | 37.8 |
| EN→FR | 7.56 | 9.02 | 90.7 | 37.5 |
| DE→FR | 4.17 | 5.06 | 94.7 | 31.4 |
| FR→DE | 2.96 | 4.04 | 98.1 | 31.4 |

TABLE I: Results for the different translation tasks. The first four rows are supervised tasks, and the last two rows are zero-shot tasks, i.e. pairs of languages unseen during training.

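| Vocabulary size | DE→EN | EN→DE | FR→EN | EN→FR |
|---|---|---|---|---|
| 1024 | 5.60 | 3.18 | 7.23 | 6.73 |
| 2048 | 7.92 | 4.42 | 9.83 | 8.99 |
| 4096 | 8.13 | 4.54 | 9.93 | 9.02 |
| 8192 | 6.76 | 4.00 | 8.89 | 7.75 |

TABLE II: SacreBLEU results with different vocabulary sizes. We see that a vocabulary size of 4096 gives the best results, and this value is not on the edge of the range tested. This indicates that it is a sweet spot for the vocabulary size tradeoff.
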
| Sample | Sentence | Lang. | SacreBLEU |
|---|---|---|---|
| 1st Input | je sais qu’il voudrait une garantie de quatre ans. | FR | - |
| 1st Reference | i know he would like a four - year guarantee. | EN | - |
| 1st Prediction | i know he need a guarantee for four years. | EN | 17.47 |
| 2nd Input | the ecb’s sole mandate has always revolved around inflation, therefore mario draghi and his team have all the more reason to take action at their meeting next week. | EN | - |
| 2nd Reference | le mandat unique de la bce a toujours porte sur l’inflation, donc mario draghi et son equipe ont davantage de raisons d’agir lors de la reunion de la semaine prochaine. | FR | - |
| 2nd Prediction | lle systeme unique unique de la bce est derriere le trend en phoque, afin ou mario draghi et son mont sont plus justifies de prendre en contact a la session prochaine. | FR | 17.10 |
| 3rd Input | zwei kinder haben in uruguay den mord eines 11 - jahrigen eingestanden. | DE | - |
| 3rd Reference | two children have confessed to the murder of an 11 - year - old in uruguay. | EN | - |
| 3rd Prediction | they had spent a 118’old increased blood in uruguay about alleged murders abandone treating two children. | EN | 6.84 |
| 4th Input | town council delighted with solid budget | EN | - |
| 4th Reference | gemeinderat freut sich uber soliden haushalt | DE | - |
| 4th Prediction | der stadtrat erfullt einen beliebten haushalt | DE | 8.12 |

TABLE III: Randomly selected samples from our model. The 2nd sample shows a relatively good translation for a long sentence, and the 3rd sample shows a failed translation for a seemingly easier sentence.

V Results

Results for the different translation tasks are shown in Table I. The results are unsatisfactory, implying that the method is currently not suitable for the translation task. Results for the zero-shot translation tasks (WMT19) show that some generalization to unseen language pairs was possible, but because the overall performance of the system is low, it is hard to estimate whether the method transfers well to zero-shot learning.

Results for the vocabulary size tuning are shown in Table II, suggesting that a vocabulary of size 4096 is closest to the optimal value in this case.

Qualitatively speaking, translation quality varies, and overall we see the expected correlation between the difficulty of the input and the quality of the translation. Nonetheless, some observations are hard to explain, such as relatively good translations for seemingly hard sentences and relatively bad translations for seemingly easy sentences. Table III shows four randomly selected samples from the test set, one from each task (ordered pair of languages).

VI Discussion

VI-A Learning the Transition Matrix

One idea we had was to learn the transition matrices that determine the probabilities of noise changing one token to another. In the described "vanilla" implementation, all probabilities of change are uniform. Diffusion models for continuous or ordinal data use Gaussian noise, which gives higher probabilities to small changes, resulting in a much easier learning ground for the denoising optimization procedure. This advantage is lost when using the uniform distribution for categorical data. Austin et al. [austin2021structured] were able to improve on that by using non-uniform noise distributions.

Following this idea, we aimed to learn the noise distribution jointly with the diffusion model. Later we found that this was infeasible, since the learning procedure relies on pre-computed powers of the transition matrix to enable fast training. If the matrix is learned, these powers can no longer be pre-computed: each training iteration at some time step $t$ would require computing a power of the transition matrix from scratch. This makes the technique infeasible.
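A back-of-the-envelope estimate illustrates the cost. The sketch below assumes that the transition matrix is dense with one row and column per vocabulary entry, that the required power is the time step $t$, and that schoolbook matrix multiplication with binary exponentiation is used. The value of `t` is hypothetical, and the estimate ignores the additional cost of backpropagating through the products.

```python
def matmul_flops(k):
    """Approximate FLOPs of one dense k x k matrix product (schoolbook)."""
    return 2 * k ** 3


def num_matmuls_for_power(t):
    """Matrix products needed for the t-th power by binary exponentiation."""
    squarings = t.bit_length() - 1
    extra_multiplies = bin(t).count("1") - 1
    return squarings + extra_multiplies


K = 4096  # vocabulary size of the best setting in Table II
t = 1000  # hypothetical time step, chosen only for illustration
flops_per_iteration = num_matmuls_for_power(t) * matmul_flops(K)
```

Under these assumptions a single training iteration would spend on the order of $10^{12}$ floating point operations on the matrix power alone, before any forward pass of the network.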

VI-B Conclusions

In this work, we tried to solve a thoroughly researched NLP task, NMT, using a recent and very promising method, DDPMs, for the first time (to our knowledge). This method has the potential to generate text with high performance in a non-autoregressive way.

Although DDPMs achieve state-of-the-art results in generating both continuous and ordinal data, they have yet to show competitive results for categorical data such as text. We hoped to show that they can give reasonable results for non-autoregressive translation.