I Introduction
Deep generative models can be grouped into the following categories: (i) flow-based models, such as Glow [kingma2018glow]; (ii) autoregressive models, e.g., the Transformer for language modeling [vaswani2017attention]; (iii) GAN-based models [goodfellow2014generative], such as WaveGAN [donahue2018adversarial] for speech and StyleGAN [karras2020analyzing] for vision applications; (iv) VAE-based models [kingma2013auto], e.g., VQ-VAE [razavi2019generating] and NVAE [vahdat2020nvae]; and (v) diffusion probabilistic models [sohl2015deep], such as ADM [dhariwal2021diffusion].
Diffusion probabilistic models achieve results comparable or superior to other deep generative models; examples include WaveGrad for speech synthesis [chen2020wavegrad] and ADM for image generation [dhariwal2021diffusion].
The underlying architecture of diffusion probabilistic models is a chain of Markov latent variables. The data flows in two directions: (i) the diffusion process and (ii) the denoising process. The diffusion process, used during training, gradually transforms data samples into Gaussian noise. The denoising process, used at inference, learns to generate data starting from Gaussian noise.
In the seminal work of Hoogeboom et al. [hoogeboom2021argmax], a diffusion model for categorical variables was introduced. The paper shows that the original diffusion process, which is suitable for continuous data such as speech and ordinal data such as images, can be adapted to model discrete categorical data. They trained a diffusion network on the language modeling task.
In this work we propose a diffusion model for neural machine translation. Furthermore, we show that the proposed model exhibits some zero-shot translation capabilities. To our knowledge, we are the first to perform conditional text generation using a diffusion model.
II Related Work
In [sohl2015deep], Sohl-Dickstein et al. introduce the diffusion process. The diffusion process takes the variational distribution $q(x_t \mid x_{t-1})$ and adds Gaussian noise at each time step $t \in \{1, \dots, T\}$, where $x_0$ is the original data point and $x_T$ is completely noise.
In this section we recap the multinomial diffusion process as defined by Hoogeboom et al. [hoogeboom2021argmax] for categorical data. We denote $x_t$ as a one-hot vector with $K$ categories. $x_0$ is the data point, and $q(x_t \mid x_{t-1})$ is the diffusion model that gradually adds a small amount of noise at each step. At $t = T$, $x_T$ is almost completely noise. The opposite direction $p(x_{t-1} \mid x_t)$ is a learnable distribution that denoises the data. The diffusion model is optimized with the variational bound on the negative log likelihood:

$$\log P(x_0) \geq \mathbb{E}_{q(x_{1:T} \mid x_0)}\left[\log p(x_{0:T}) - \log q(x_{1:T} \mid x_0)\right] \tag{1}$$
Sohl-Dickstein et al. [sohl2015deep] condition on $x_0$ and show that Eq. (1) becomes:

$$\log P(x_0) \geq \mathbb{E}_q\left[\log p(x_0 \mid x_1) - \sum_{t=2}^{T} \mathrm{KL}\big(q(x_{t-1} \mid x_t, x_0) \,\|\, p(x_{t-1} \mid x_t)\big) - \mathrm{KL}\big(q(x_T \mid x_0) \,\|\, p(x_T)\big)\right] \tag{2}$$

where the last term $\mathrm{KL}\big(q(x_T \mid x_0) \,\|\, p(x_T)\big) \approx 0$ if the diffusion trajectory is defined well. The variational distribution is defined as follows:

$$q(x_t \mid x_{t-1}) = \mathcal{C}\big(x_t \mid (1 - \beta_t)\, x_{t-1} + \beta_t / K\big) \tag{3}$$
where $\beta_t$ is the probability to sample from the uniform distribution and $\mathcal{C}$ denotes a categorical distribution. Using the Markov chain property, one can get the closed form to sample $x_t$ directly from $x_0$:

$$q(x_t \mid x_0) = \mathcal{C}\big(x_t \mid \bar{\alpha}_t\, x_0 + (1 - \bar{\alpha}_t)/K\big) \tag{4}$$

where $\alpha_t$ and $\bar{\alpha}_t$ are defined in the same manner as in the original DDPM [ho2020denoising], i.e. $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{\tau=1}^{t} \alpha_\tau$. From this closed form, the posterior of the diffusion process is also categorical:

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{C}\big(x_{t-1} \mid \theta_{\text{post}}(x_t, x_0)\big) \tag{5}$$

where

$$\theta_{\text{post}}(x_t, x_0) = \tilde{\theta} \Big/ \sum_{k=1}^{K} \tilde{\theta}_k, \qquad \tilde{\theta} = \big[\alpha_t x_t + (1 - \alpha_t)/K\big] \odot \big[\bar{\alpha}_{t-1}\, x_0 + (1 - \bar{\alpha}_{t-1})/K\big] \tag{6}$$
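For concreteness, the following minimal NumPy sketch implements these closed forms (Eqs. 4-6). The function names, the number of categories and the linear beta schedule are illustrative choices of ours, not the settings used in the experiments.

```python
import numpy as np

def q_xt_given_x0(x0_onehot, alpha_bar_t, K):
    """Categorical parameters of q(x_t | x_0), Eq. (4):
    alpha_bar_t * x_0 + (1 - alpha_bar_t) / K."""
    return alpha_bar_t * x0_onehot + (1.0 - alpha_bar_t) / K

def theta_post(xt_onehot, x0_onehot, alpha_t, alpha_bar_prev, K):
    """Unnormalized posterior parameters theta_tilde, Eq. (6),
    normalized to give theta_post(x_t, x_0), Eq. (5)."""
    theta_tilde = (alpha_t * xt_onehot + (1.0 - alpha_t) / K) * \
                  (alpha_bar_prev * x0_onehot + (1.0 - alpha_bar_prev) / K)
    return theta_tilde / theta_tilde.sum(axis=-1, keepdims=True)

# illustrative example with K = 5 categories and a linear beta schedule
K, T = 5, 100
betas = np.linspace(1e-4, 0.1, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

x0 = np.eye(K)[2]                                   # one-hot data point
t = 50
probs = q_xt_given_x0(x0, alpha_bars[t], K)
xt = np.eye(K)[np.random.choice(K, p=probs)]        # sample x_t ~ C(probs)
post = theta_post(xt, x0, alphas[t], alpha_bars[t - 1], K)
```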
Hoogeboom et al. [hoogeboom2021argmax] predict a probability vector $\hat{x}_0$ for $x_0$ from $x_t$. They parametrize $p(x_{t-1} \mid x_t)$ from $\theta_{\text{post}}(x_t, \hat{x}_0)$, where $\hat{x}_0$ is approximated with a neural network $\mu(x_t, t)$. Denote

$$p(x_{t-1} \mid x_t) = \mathcal{C}\big(x_{t-1} \mid \theta_{\text{post}}(x_t, \hat{x}_0)\big), \qquad \hat{x}_0 = \mu(x_t, t) \tag{7}$$
Then, the variational lower bound of Eq. (2) becomes:

$$\log P(x_0) \geq \mathbb{E}_q\left[\sum_{k} x_{0,k} \log \hat{x}_{0,k} - \sum_{t=2}^{T} \mathrm{KL}\Big(\mathcal{C}\big(\theta_{\text{post}}(x_t, x_0)\big) \,\big\|\, \mathcal{C}\big(\theta_{\text{post}}(x_t, \hat{x}_0)\big)\Big)\right]$$
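To make the objective concrete, the following NumPy sketch computes one Monte-Carlo term of the (negative) bound for a single position, reusing the theta_post helper from the previous sketch. The function names and the epsilon smoothing are our own choices, not code from the original work.

```python
import numpy as np

def categorical_kl(p, q, eps=1e-12):
    """KL( C(p) || C(q) ) between two categorical distributions over K categories."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def bound_loss_term(x0, x_hat0, xt, t, alphas, alpha_bars, K):
    """One Monte-Carlo term of the negative variational bound:
    the cross-entropy term at t = 1, otherwise
    KL( C(theta_post(x_t, x_0)) || C(theta_post(x_t, x_hat_0)) )."""
    if t == 1:
        return -np.sum(x0 * np.log(x_hat0 + 1e-12), axis=-1)
    p = theta_post(xt, x0, alphas[t], alpha_bars[t - 1], K)       # true posterior
    q = theta_post(xt, x_hat0, alphas[t], alpha_bars[t - 1], K)   # predicted posterior
    return categorical_kl(p, q)
```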
It is worth mentioning the work by Austin et al. [austin2021structured], which improves on Hoogeboom et al. [hoogeboom2021argmax] by introducing corruption processes beyond uniform transition probabilities. For example, they use transition matrices that mimic Gaussian kernels in continuous space and show that using different transition matrices leads to improved results in text generation.

III Method
We use a neural network which predicts a probability vector at each diffusion step, similar to the method used by Hoogeboom et al. [hoogeboom2021argmax]. The architecture is transformer-based with an additional time-based positional encoding. Unlike Hoogeboom et al., who perform unconditional text generation, we are interested in sentence translation; thus we add the sentence in the source language as a condition and predict probability vectors to create the sentence in the target language. Our architecture is inspired by an encoder-decoder approach: the source-language sentence is given as input to a transformer encoder and the noisy target sentence is given as input to a transformer decoder, such that the encoder's outputs are used in a cross-attention mechanism in each layer of the decoder. Unlike standard encoder-decoder systems, our method does not predict output tokens one at a time (autoregressively), but rather predicts all tokens' probabilities at each denoising step.
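A minimal PyTorch sketch of this kind of conditional denoising network is shown below. The module layout, dimensions and the way the diffusion step is embedded are our assumptions for illustration, not the exact architecture used.

```python
import torch
import torch.nn as nn

class DiffusionTranslator(nn.Module):
    """Encoder encodes the source sentence; decoder attends to it via cross-attention
    and predicts probabilities for ALL target tokens at once (non-autoregressive)."""

    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6,
                 max_len=128, T=1000):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)    # positional encoding
        self.time_emb = nn.Embedding(T, d_model)         # diffusion-step encoding
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids, noisy_tgt_ids, t):
        def embed(ids):
            pos = torch.arange(ids.size(1), device=ids.device)
            return self.tok_emb(ids) + self.pos_emb(pos)
        memory = self.encoder(embed(src_ids))
        tgt = embed(noisy_tgt_ids) + self.time_emb(t)[:, None, :]
        h = self.decoder(tgt, memory)            # no causal mask: all positions at once
        return self.out(h).softmax(dim=-1)       # probability vector per target position
```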
During training, a time step $t$ is randomly sampled and, using the noise schedule $\beta_t$, the posterior parameters are calculated and a noisy target sentence $y_t$ is created using the closed-form formulas for uniform noise (Eqs. 4-6). Then, a forward pass is performed to predict $\hat{y}_0 = \mu(y_t, x, t)$ using our neural network, which in turn is used to calculate the loss function. In this notation, $x$ denotes source sentences and $y$ denotes target sentences. During inference, we start from random uniform noise $y_T$ and iteratively run the network on it $T$ times to get $y_0$.
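The PyTorch sketch below outlines this training step and the iterative denoising loop. It assumes the DiffusionTranslator sketch above and 1-D tensors alphas and alpha_bars holding the noise schedule; the cross-entropy against the clean target is a simple surrogate shown for brevity, whereas the actual objective is the variational bound of Section II.

```python
import torch
import torch.nn.functional as F

def sample_xt(y0_ids, t, alpha_bars, K):
    """Sample a noisy target y_t ~ C(alpha_bar_t * y_0 + (1 - alpha_bar_t)/K), Eq. (4)."""
    onehot = F.one_hot(y0_ids, K).float()
    probs = alpha_bars[t][:, None, None] * onehot + (1.0 - alpha_bars[t])[:, None, None] / K
    return torch.distributions.Categorical(probs=probs).sample()

def train_step(model, optimizer, src_ids, tgt_ids, alpha_bars, K, T):
    t = torch.randint(1, T, (src_ids.size(0),), device=src_ids.device)  # random time step
    noisy_tgt = sample_xt(tgt_ids, t, alpha_bars, K)
    y_hat0 = model(src_ids, noisy_tgt, t)            # predicted probabilities for y_0
    # surrogate loss for brevity; the paper optimizes the variational bound (Section II)
    loss = F.nll_loss(y_hat0.clamp_min(1e-12).log().transpose(1, 2), tgt_ids)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def translate(model, src_ids, alphas, alpha_bars, K, T):
    """Start from uniform noise and denoise T times (inference)."""
    y = torch.randint(0, K, src_ids.shape, device=src_ids.device)
    for t in reversed(range(1, T)):
        tt = torch.full((src_ids.size(0),), t, device=src_ids.device, dtype=torch.long)
        y_hat0 = model(src_ids, y, tt)
        # sample y_{t-1} from C(theta_post(y_t, y_hat0)), Eqs. (5)-(6)
        y_onehot = F.one_hot(y, K).float()
        theta = (alphas[t] * y_onehot + (1 - alphas[t]) / K) * \
                (alpha_bars[t - 1] * y_hat0 + (1 - alpha_bars[t - 1]) / K)
        theta = theta / theta.sum(-1, keepdim=True)
        y = torch.distributions.Categorical(probs=theta).sample()
    return y
```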
Data Processing
The diffusion model requires inputs to be of fixed length. Thus, we pad and truncate all sentences to a fixed length. Sentences are padded with a special padding token. Two special language tokens are added to each input sentence: the first indicates the source language and the second indicates the target language. This accelerates convergence and allows for zero-shot translation between pairs of languages not seen during training.
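The small helper below illustrates this preprocessing. The token strings ([PAD], [DE], [EN]) and the choice to prepend both language tokens to the source side are placeholders and assumptions of this sketch.

```python
def prepare_pair(src_tokens, tgt_tokens, src_lang, tgt_lang, max_len, pad_token="[PAD]"):
    """Pad/truncate to a fixed length and prepend source- and target-language tokens.
    Token names such as "[DE]" and "[PAD]" are illustrative placeholders."""
    def fix_len(tokens):
        tokens = tokens[:max_len]
        return tokens + [pad_token] * (max_len - len(tokens))

    src = [f"[{src_lang}]", f"[{tgt_lang}]"] + src_tokens   # e.g. [DE] [EN] guten morgen ...
    return fix_len(src), fix_len(tgt_tokens)

src, tgt = prepare_pair(["guten", "morgen"], ["good", "morning"], "DE", "EN", max_len=16)
```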
IV Experiment Setup
IV-A Datasets
We used three datasets in total, all of which are from WMT. We trained our network on WMT14 [bojar2014findings] DE-EN and WMT14 FR-EN jointly, and in both directions, i.e. German to English, English to German, French to English and English to French. We downsampled the larger French dataset in each epoch so that the numbers of German-English and French-English samples were equal.
Lastly, we used the WMT19 [barrault2019findings] DE-FR for evaluation only, in order to test the method’s zero-shot learning performance.
IV-B Evaluation Metrics
We used common machine translation evaluation metrics: corpus-level BLEU, SacreBLEU, TER and chrF. We used the official SacreBLEU implementation [post-2018-call] with default parameters.
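For reference, a minimal example of computing these metrics with the sacrebleu Python package (default parameters) is shown below; the hypothesis and reference strings are taken from the first sample in Table III and serve only as illustration.

```python
import sacrebleu

hyps = ["i know he need a guarantee for four years."]
refs = [["i know he would like a four - year guarantee."]]   # one reference stream

bleu = sacrebleu.corpus_bleu(hyps, refs)   # SacreBLEU with default parameters
chrf = sacrebleu.corpus_chrf(hyps, refs)
ter = sacrebleu.corpus_ter(hyps, refs)
print(bleu.score, chrf.score, ter.score)
```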
IV-C Baselines
Current state-of-the-art results for the WMT14 translation tasks achieve substantially higher BLEU and SacreBLEU scores, both for the German-English and for the French-English translation directions. All of these methods use some type of transformer with autoregressive decoding, rely on very large models, and some use extra data.
IV-D Hyperparameter Settings
We used the Adam optimizer [kingma2014adam] and tuned the learning rate, gamma value and batch size. We also compared two settings for the number of diffusion steps and did not see a major difference. We further experimented with the number of transformer layers, attention heads and the hidden dimension.
Tokenization A tokenizer is trained on data from all three languages using the WordPiece method, with normalization similar to BERT [devlin2018bert] (NFD Unicode normalization, followed by lowercasing and accent stripping). Whitespace pre-tokenization is also used. The vocabulary size is a hyperparameter which we tuned.
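A sketch of training such a tokenizer with the HuggingFace tokenizers library follows. The file path and special-token names are placeholders, and the vocabulary size shown is only one of the values considered in the tuning (Table II).

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
# BERT-like normalization: NFD Unicode, lowercasing, accent stripping
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()])
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.WordPieceTrainer(
    vocab_size=4096,                                             # one of the tuned values
    special_tokens=["[UNK]", "[PAD]", "[DE]", "[EN]", "[FR]"])   # placeholder token names
# "corpora.txt" is a placeholder path to the combined DE/EN/FR training text
tokenizer.train(["corpora.txt"], trainer)
```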
Vocabulary Size The vocabulary size is an important hyperparameter since it determines the dimension of the space the diffusion model needs to predict, and it changes how the noise probability is distributed (since the constant probability of change is spread over a different number of tokens). Hoogeboom et al. [hoogeboom2021argmax] worked with 27 and 256 categories for the text8 and enwik8 datasets, respectively. The fact that no tokenization was used hinted to us that a large vocabulary size might not work well with this method. However, Austin et al. [austin2021structured] were able to train a network with 8192 categories using a slightly different method. Furthermore, we know of the importance of tokenization in complicated tasks such as translation. Therefore, we decided to try different vocabulary sizes and see how sensitive the method is to them.
Table I: Translation results on the WMT14 (DE↔EN, FR↔EN) and WMT19 (DE↔FR) test sets.

Tasks | BLEU | SacreBLEU | TER | chrF
---|---|---|---|---
DE→EN | 7.17 | 8.13 | 93.1 | 34.7
EN→DE | 3.54 | 4.54 | 102.3 | 33.5
FR→EN | 8.62 | 9.93 | 88.4 | 37.8
EN→FR | 7.56 | 9.02 | 90.7 | 37.5
DE→FR | 4.17 | 5.06 | 94.7 | 31.4
FR→DE | 2.96 | 4.04 | 98.1 | 31.4
Table II: SacreBLEU scores on WMT14 for different vocabulary sizes.

Vocab. size | DE→EN | EN→DE | FR→EN | EN→FR
---|---|---|---|---
1024 | 5.60 | 3.18 | 7.23 | 6.73
2048 | 7.92 | 4.42 | 9.83 | 8.99
4096 | 8.13 | 4.54 | 9.93 | 9.02
8192 | 6.76 | 4.00 | 8.89 | 7.75
Table III: Randomly selected samples from the test set, one from each WMT14 translation direction.

Sample | Sentence | Lang. | SacreBLEU
---|---|---|---
1st Input | je sais qu’il voudrait une garantie de quatre ans. | FR | -
1st Reference | i know he would like a four - year guarantee. | EN | -
1st Prediction | i know he need a guarantee for four years. | EN | 17.47
2nd Input | the ecb’s sole mandate has always revolved around inflation, therefore mario draghi and his team have all the more reason to take action at their meeting next week. | EN | -
2nd Reference | le mandat unique de la bce a toujours porte sur l’inflation, donc mario draghi et son equipe ont davantage de raisons d’agir lors de la reunion de la semaine prochaine. | FR | -
2nd Prediction | lle systeme unique unique de la bce est derriere le trend en phoque, afin ou mario draghi et son mont sont plus justifies de prendre en contact a la session prochaine. | FR | 17.10
3rd Input | zwei kinder haben in uruguay den mord eines 11 - jahrigen eingestanden. | DE | -
3rd Reference | two children have confessed to the murder of an 11 - year - old in uruguay. | EN | -
3rd Prediction | they had spent a 118’old increased blood in uruguay about alleged murders abandone treating two children. | EN | 6.84
4th Input | town council delighted with solid budget | EN | -
4th Reference | gemeinderat freut sich uber soliden haushalt | DE | -
4th Prediction | der stadtrat erfullt einen beliebten haushalt | DE | 8.12
V Results
Results for the different translation tasks are presented in Table I. The results are unsatisfactory, implying that the method is currently not suitable for the translation task. Results for the zero-shot translation tasks (WMT19) show that some generalization to unseen pairs of languages is possible, but because the overall performance of the system is low, it is hard to estimate whether the method transfers well to the zero-shot setting.
Results for the vocabulary size tuning are presented in Table II, suggesting that a vocabulary size of 4096 is closest to the optimal value in this case.
Qualitatively, translation quality varies, and overall we see the expected correlation between the difficulty of the input and the quality of the translation. Nonetheless, some observations are hard to explain, such as relatively good translations of seemingly hard sentences and relatively bad translations of seemingly easy sentences. Table III shows four randomly selected samples from the test set, one from each task (ordered pair of languages).
VI Discussion
VI-A Learning the Transition Matrix
One idea we had was to learn the transition matrices that determine the probabilities with which noise changes one token into another. In the described "vanilla" implementation, all probabilities of change are uniform. Diffusion models for continuous or ordinal data use Gaussian noise, which assigns higher probabilities to small changes, providing a much easier learning ground for the denoising optimization procedure. This advantage is lost when using the uniform distribution for categorical data. Austin et al. [austin2021structured] were able to improve on this by using non-uniform noise distributions.
Following this idea, we aimed to learn the noise distribution jointly with the diffusion model. We later found this to be infeasible, since the training procedure relies on pre-computed powers of the transition matrix to remain fast. Specifically, with a learned transition matrix, each training iteration at some time step $t$ would require computing the $t$-th power of a $K \times K$ matrix, where $t \in \{1, \dots, T\}$ and $K$ is the vocabulary size. This makes the technique impractical.
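The sketch below illustrates the cost: with a fixed transition matrix Q the cumulative products Q^t can be precomputed once, but with a learned Q they would have to be recomputed after every parameter update. The matrix sizes here are toy values chosen for illustration.

```python
import numpy as np

def cumulative_transitions(Q, T):
    """Q_bar[t] = Q^t, needed to sample x_t directly from x_0 under a general
    (e.g. learned) transition matrix. Each step is a K x K matrix product, O(K^3)."""
    Q_bar = [np.eye(Q.shape[0])]
    for _ in range(T):
        Q_bar.append(Q_bar[-1] @ Q)
    return Q_bar

# Toy scale: feasible.
K, T = 32, 100
Q = np.full((K, K), 0.01 / K) + 0.99 * np.eye(K)   # a near-identity stochastic matrix
Q_bars = cumulative_transitions(Q, T)

# At realistic scale (K in the thousands of word pieces, T in the hundreds of steps),
# these products would have to be recomputed after every gradient update if Q were
# learned, which is what makes learning the transition matrix impractical here.
# The uniform-noise special case used in this work needs no matrix products at all,
# since Q^t has the closed form of Eq. (4).
```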
VI-B Conclusions
In this work, we attempted to solve a thoroughly researched NLP task, neural machine translation (NMT), using a recent and very promising method, DDPMs, for the first time (to our knowledge). This method has the potential to generate text with high performance in a non-autoregressive way.
Although DDPMs achieve state-of-the-art results in generating both continuous and ordinal data, they have yet to show competitive results for categorical data such as text. We hoped to show that they can give reasonable results for non-autoregressive translation.