Your Autoregressive Generative Model Can be Better If You Treat It as an Energy-Based One

by Yezhen Wang, et al.

Autoregressive generative models are commonly used, especially for tasks involving sequential data. They have, however, been plagued by a slew of inherent flaws due to the intrinsic characteristics of chain-style conditional modeling (e.g., exposure bias or lack of long-range coherence), severely limiting their ability to model distributions properly. In this paper, we propose a unique method termed E-ARM for training autoregressive generative models that takes advantage of a well-designed energy-based learning objective. By leveraging the extra degree of freedom of the softmax operation, we can make the autoregressive model itself an energy-based model for measuring the likelihood of the input without introducing any extra parameters. Furthermore, we show that E-ARM can be trained efficiently and is capable of alleviating the exposure bias problem and increasing temporal coherence for autoregressive generative models. Extensive empirical results, covering benchmarks such as language modeling, neural machine translation, and image generation, demonstrate the effectiveness of the proposed approach.



1 Introduction

By factorizing the joint distribution into the product of a series of conditional distributions, autoregressive generative models (abbr. ARGMs) (TransformerBase; TransformerXL; wavenet; pixelcnnOordKEKVG16; pixelcnn++SalimansK0K17; PixelSNAILChenMRA18) simplify the difficult challenge of modeling high-dimensional joint distributions. They can be trained efficiently via maximum likelihood and generate samples of exceptional quality, making this technique popular for modeling distributions, especially for sequential data. Nonetheless, despite their potency and flexibility, ARGMs still have inherent weaknesses due to the intrinsic characteristics of chain-style conditional modeling. For example, ARGMs usually suffer from a discrepancy between the input context distributions of the training and inference stages, which causes error propagation (i.e., exposure bias (SSRanzatoCAZ15; SSBengioVJS15)). Besides, because decoding relies on greedy selection within beam search approximations, the decoded results from ARGMs may also lack long-range coherence. We consider one approach by which ARGMs could be adapted to reduce these concerns.

Earlier work, both heuristic and theoretical, has already been proposed with those goals. For instance, the exposure bias problem of ARGMs can be alleviated to some extent with scheduled sampling (SSBengioVJS15; SSTransformerMihaylovaM19), which mixes input contexts from both real data and autoregressive generation during the training stage. However, this scheme suffers from an over-correcting problem (BridginfGapNMTijcai). In addition, at the inference stage, beam search makes it possible to choose more diverse candidates, improving the quality of generated sequences. Nevertheless, this results in only marginal improvements in temporal coherence, since ARGMs can only leverage previously decoded contexts without considering the information of the whole sequence. Moreover, setting aside the difficulty of training them, energy-based models (EBMs) have demonstrated their effectiveness in modeling high-dimensional distributions in a variety of machine learning applications (zhao2017energy; arbel2021generalized; gao2021learning), without requiring the transformation of the target distribution into a product of conditional distributions. As a result, several studies (residualDengBOSR20; ResidualBakhtinDGORS21; autoregressiveEnergyDurkanN19) attempt to combine EBMs with ARGMs, expecting to benefit from the strengths of both approaches. However, although some positive results were obtained, the existing works preferred a two-stage optimization, which first obtains a well-trained ARGM and then trains an additional EBM based on it. Such an optimization strategy does not enable the ARGM itself to benefit from the properties of the EBM in modeling the joint distribution in a temporally more coherent way, and it requires more training parameters to estimate energy scores, adding to the complexity of the learning task.

In this paper, we present a novel design, which seamlessly integrates Energy-based models into AutoRegressive Models (E-ARM) by utilizing the extra degree of freedom within the model’s final softmax layer. We will show that in this way the ARGM can be trained using an energy-based learning objective, which allows the ARGM not only to avoid intrinsic concerns such as exposure bias with the help of energy-based models, as former work did (residualDengBOSR20; ResidualBakhtinDGORS21), but also to do so without increasing the complexity of the learning model. This property makes E-ARM easy to apply to the training process of any autoregressive generative model for any specific task, as no structural changes are required.

Besides, we follow the predominant approach for training explicit density (mass) generative models, which minimizes the KL divergence between the (empirical) data distribution and the model distribution. This gives rise to the gradient-based contrastive divergence methods (Hinton02Training; DeepDirectedKimB16) for energy-based models. Typically, these methods require a Markov Chain Monte Carlo (MCMC) process to sample data from the EBM for the “negative phase” gradient estimation, which is extremely time-consuming and, moreover, inapplicable to discrete data such as text. To solve this, we present a way to estimate the “negative phase” gradients through samples generated with the network’s autoregressive view instead of the EBM view, which allows us to sidestep the use of MCMC, thus making the training both feasible and efficient.

Intuitively, the exposure bias in ARGMs is caused by the fact that the model is trained on real data rather than data generated by the model. On the other hand, in the EBM’s optimization process for modeling joint densities, the negative phase of contrastive divergence methods (Hinton02Training; DeepDirectedKimB16) requires sampling data from the EBM itself. Because our method combines the EBM and the ARGM seamlessly as a whole, E-ARM can reduce the discrepancy between the input data of the training and inference stages, which mitigates the exposure bias problem of the ARGM. In addition, unlike ARGMs, which factor the joint distribution into a product of conditional distributions, EBMs are able to model the joint distribution directly and score each input at the sequence level instead of at the token level, which makes them capable of modeling long-range coherence.

In summary, the following contributions are made in this paper: i) we introduce a novel scheme, E-ARM, to integrate the EBM view into autoregressive generative models seamlessly; ii) we attempt to reduce the intrinsic problems of autoregressive models, such as exposure bias and weak temporal coherence, by optimizing an energy-based learning objective that uses autoregressively generated samples; iii) we demonstrate how to efficiently optimize our model, constructed from a single network, using the contrastive divergence method without MCMC; iv) in a number of applications, such as language modeling and neural machine translation, our model achieves better results than relevant baselines.

2 Background

2.1 Energy-Based Models

Let $P_d$ and $P_\theta$ denote the data distribution and the model distribution, with density (mass) functions $p_d$ and $\pi_\theta$ with respect to a base measure $dx$ on the sample space $\mathcal{X}$, respectively. Energy-based models (lecun2006tutorial) are interested in learning an unnormalized energy function $E_\theta(x)$ that defines the density (mass) function $\pi_\theta(x)$ of the model distribution $P_\theta$,

$$\pi_\theta(x) = \frac{\exp(-E_\theta(x))}{Z_\theta}, \tag{1}$$

where $E_\theta: \mathcal{X} \to \mathbb{R}$ denotes an energy function which maps a data sample to an energy scalar, and $Z_\theta = \int_{\mathcal{X}} \exp(-E_\theta(x))\,dx$ denotes the normalizing constant, also known as the partition function. Any function can be used as an energy function to represent an EBM as long as it generates a single scalar given some input $x$ and the normalizing constant is finite. (Footnote 1: Without constraining the parametrization of $E_\theta$, this can be achieved by bounding the region of space in which $x$ takes its allowed values.) Contrastive divergence algorithms are commonly used to optimize EBMs (Hinton02Training; DeepDirectedKimB16; jemGrathwohlWJD0S20) via maximum log-likelihood. Correspondingly, the gradient of the log-likelihood, which needs to be maximized, with respect to $\theta$ can be expressed as

$$\nabla_\theta \mathbb{E}_{x \sim p_d}\big[\log \pi_\theta(x)\big] = \mathbb{E}_{x \sim \pi_\theta}\big[\nabla_\theta E_\theta(x)\big] - \mathbb{E}_{x \sim p_d}\big[\nabla_\theta E_\theta(x)\big]. \tag{2}$$

The first term on the right-hand side of Eq. 2 is usually called the “negative phase” term, while the second term is called the “positive phase” term. In general, it is non-trivial to sample from an EBM, which usually requires MCMC methods (Hinton02Training; SGLDWellingT11). Stochastic Gradient Langevin Dynamics (SGLDWellingT11) is the most common MCMC algorithm in continuous state spaces. However, these samplers are exceedingly time-consuming and not applicable when the input is discrete, as in text applications.
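As a minimal illustration of Eq. 2 (a sketch under stated assumptions, not part of the original method), the following Python snippet uses a hypothetical three-state sample space with one energy parameter per state, so that the partition function can be computed exactly and no MCMC is needed. The parameter values, the toy data, and all function names are assumptions made for this example. It compares the “negative phase minus positive phase” expression with a finite-difference gradient of the average log-likelihood.

```python
import math

# Toy EBM on the sample space {0, 1, 2}: E_theta(x) = theta[x].
# The parameter values and the data below are hypothetical.
theta = [0.5, -0.2, 1.0]
data = [0, 0, 1, 2, 1, 0]

def model_probs(params):
    """pi_theta(x) = exp(-E_theta(x)) / Z_theta, computed exactly."""
    unnormalized = [math.exp(-energy) for energy in params]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]

def avg_log_likelihood(params):
    probs = model_probs(params)
    return sum(math.log(probs[x]) for x in data) / len(data)

def phase_gradient(params):
    """Negative-phase term minus positive-phase term, as in Eq. 2."""
    probs = model_probs(params)
    grad = []
    for i in range(len(params)):
        # dE_theta(x)/dtheta_i equals 1 if x == i and 0 otherwise.
        negative_phase = probs[i]                   # expectation under pi_theta
        positive_phase = data.count(i) / len(data)  # expectation under the data
        grad.append(negative_phase - positive_phase)
    return grad

def finite_difference_gradient(params, eps=1e-6):
    grad = []
    for i in range(len(params)):
        up = list(params)
        up[i] += eps
        down = list(params)
        down[i] -= eps
        grad.append((avg_log_likelihood(up) - avg_log_likelihood(down)) / (2 * eps))
    return grad

analytic = phase_gradient(theta)
numeric = finite_difference_gradient(theta)
print(analytic)
print(numeric)
```

In this toy setting the expectation under the model is available in closed form; in realistic EBMs it is this expectation that requires sampling.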

2.2 Modeling Distributions Autoregressively

Autoregressive generative models can decompose any joint distribution into a product of conditional distributions using the product rule of probability, by ordering the random variables within the joint distribution and characterizing each random variable given all variables preceding it in that order. Formally, we use $\mathbf{x}_{<k}$ to denote the random vector covering all random variables before time step $k$, and $x_k$ to denote the random variable at time step $k$. Then we have

$$p(x_1, x_2, \ldots, x_K) = \prod_{k=1}^{K} p(x_k \mid \mathbf{x}_{<k}). \tag{3}$$

In recent years, remarkable accomplishments in numerous areas have been achieved by modeling distributions autoregressively (TransformerBase; GPT2; pixelrnn; pixelcnnOordKEKVG16; pixelcnn++SalimansK0K17), thanks to the ability of this approach to avoid the challenging goal of modeling joint high-dimensional distributions directly. In this paper, we primarily focus on autoregressive language models, but we also conduct experiments on image generation to validate the generality of our method.
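The product rule above can be checked on a small example. The sketch below is only illustrative: the binary alphabet, the sequence length of 3, and the rule inside `conditional` are hypothetical choices, not taken from the paper. It builds the joint probability of a sequence from its conditionals and verifies that the resulting joint distribution sums to one.

```python
import itertools

def conditional(x_k, prefix):
    """Hypothetical p(x_k | x_<k): a made-up rule that favours repeating the last symbol."""
    if not prefix:
        return 0.6 if x_k == 0 else 0.4
    return 0.7 if x_k == prefix[-1] else 0.3

def joint(sequence):
    """Joint probability as the product of conditionals, one factor per time step."""
    prob = 1.0
    for k, x_k in enumerate(sequence):
        prob *= conditional(x_k, sequence[:k])
    return prob

# Summing the joint over all 2^3 sequences should give 1.
total = sum(joint(seq) for seq in itertools.product([0, 1], repeat=3))
print(total)
```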

2.3 Exposure Bias and Incoherence Problems in Autoregressive Models

In the discussion of the defects of sequential autoregressive generative models, the exposure bias problem (SSBengioVJS15; SSRanzatoCAZ15) is an important issue that greatly affects the model’s deployment performance. During training, the autoregressive model is always conditioned on ground-truth token sequences. During inference, however, the deployed model has to rely on its own previously generated tokens to predict the next token. If an incorrect token is selected, the error can be amplified in the following steps, because the next prediction is made from an unusual input (one unlike those in the training set). Besides, for efficiency, autoregressive decoding usually selects the most probable token at each time step, given the ones previously selected. Such a scheme assumes that the largest joint probability of the whole sequence can be achieved by separately choosing the most probable next token (given its previous context) at every time step, which in general yields only a local optimum. Correspondingly, the chosen sequence is not always the model’s optimal result.
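The gap between token-level greedy selection and the sequence-level optimum can be seen in a two-step toy model. The sketch below is an illustration only: the vocabulary and the conditional probabilities in `next_token_probs` are hypothetical values chosen so that the greedy sequence differs from the most probable sequence.

```python
import itertools

VOCAB = ["a", "b"]

def next_token_probs(prefix):
    """Hypothetical conditionals p(x_k | x_<k) for a two-step toy model."""
    if prefix == ():
        return {"a": 0.6, "b": 0.4}
    if prefix == ("a",):
        return {"a": 0.55, "b": 0.45}
    return {"a": 0.9, "b": 0.1}  # prefix == ("b",)

def joint(sequence):
    """Joint probability of a sequence as the product of its conditionals."""
    prob = 1.0
    for k, token in enumerate(sequence):
        prob *= next_token_probs(tuple(sequence[:k]))[token]
    return prob

def greedy_decode(length):
    """Pick the most probable next token at every step."""
    prefix = ()
    for _ in range(length):
        probs = next_token_probs(prefix)
        prefix += (max(probs, key=probs.get),)
    return prefix

greedy = greedy_decode(2)
best = max(itertools.product(VOCAB, repeat=2), key=joint)
print(greedy, joint(greedy))
print(best, joint(best))
```

Here greedy decoding commits to the locally most probable first token and can no longer reach the sequence with the highest joint probability, which starts with the less probable token.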

3 Methodology

For a long time, as a result of compromises for improving training stability and efficiency (e.g., modeling a joint distribution by decomposing it and using a teacher-forcing training strategy), conventional autoregressive generative models have suffered from flaws such as the exposure bias and the lack of long-range coherence. To tackle these issues, we attempt to seamlessly integrate Energy-based models into AutoRegressive Models (E-ARM), and train ARGMs with an energy-based learning objective.

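Formally, given a distribution of length-$K$ sequences with density (mass) function $p_d$ on data space $\mathcal{X}^K$, we first introduce a parametric autoregressive model $Q_\theta$ with a density (mass) function $q_\theta$ with parameter $\theta$. Then, we define the energy-based autoregressive model $P_\theta$, which is a product of the autoregressive model and an EBM adhered within it, with density function $\pi_\theta$,

$$\pi_\theta(x_k, \mathbf{x}_{<k}) = q_\theta(\mathbf{x}_{<k}) \cdot \frac{e^{-\phi_\theta(x_k, \mathbf{x}_{<k})}}{Z_\theta}, \tag{4}$$

where $Z_\theta$ is the normalization term and equals $\mathbb{E}_{\mathbf{x}'_{<k} \sim q_\theta}\big[\sum_{x_k} e^{-\phi_\theta(x_k, \mathbf{x}'_{<k})}\big]$, and the energy function $\phi_\theta(x_k, \mathbf{x}_{<k})$ is defined as the negative of the component of the network’s output logits that corresponds to $x_k$, given the input context $\mathbf{x}_{<k}$ (e.g., given a sequence “This is Friday.” and assuming the corresponding index of the token “Friday” in the vocabulary is $i$, the value of $-\phi_\theta(\text{“Friday”}, \text{“This is”})$ is the $i$-th component of the output logit, namely, the input tensor of the final softmax layer).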

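One rationale behind such a design is the extra degree of freedom concealed inside the softmax operation. Specifically, a $V$-way softmax operation is a transformation $S_V: \mathbb{R}^V \to \mathbb{R}^V$, which converts an unnormalized vector into a normalized vector whose elements sum to 1,

$$S_V(\mathbf{z}) = \left[\frac{e^{z_1}}{\sum_{i=1}^{V} e^{z_i}}, \ldots, \frac{e^{z_V}}{\sum_{i=1}^{V} e^{z_i}}\right], \tag{5}$$

where $\mathbf{z} = [z_1, \ldots, z_V] \in \mathbb{R}^V$. It is easy to observe that the softmax operation is unaffected by a constant offset added to every component of the input vector, that is, $S_V(\mathbf{z}) = S_V(\mathbf{z} + c\mathbf{1})$ for any $c \in \mathbb{R}$, indicating an extra degree of freedom which can be used for modeling energy.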
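The shift invariance of the softmax can be verified numerically. The following sketch is illustrative only: the logit values and the offset of 3.0 are arbitrary. It shows that adding a constant to all logits leaves the softmax output unchanged while shifting the log-sum-exp of the logits by exactly that constant, which is the scalar left unconstrained by the conditional distribution.

```python
import math

def softmax(logits):
    """V-way softmax; subtracting the maximum is itself a use of the shift invariance."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def log_sum_exp(logits):
    """Numerically stable log of the sum of exponentiated logits."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))

logits = [2.0, -1.0, 0.5]          # arbitrary example values
shifted = [z + 3.0 for z in logits]

same_probs = all(
    abs(p - q) < 1e-12 for p, q in zip(softmax(logits), softmax(shifted))
)
free_scalar_change = log_sum_exp(shifted) - log_sum_exp(logits)
print(same_probs, free_scalar_change)
```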

Our primary goal is to make the parametric distribution $\pi_\theta(x_k, \mathbf{x}_{<k})$ approach the real data distribution $p_d(x_k, \mathbf{x}_{<k})$ as closely as possible at any time step $k \le K$, such that we can perform downstream inference (e.g., density evaluation and sampling) with the model. This can be achieved by minimizing the Kullback-Leibler (KL) divergence between the distributions with respect to all time steps of a sequence,

$$\theta^* = \arg\min_\theta \sum_{k=1}^{K} \lambda_k \, D_{KL}\big(p_d(x_k, \mathbf{x}_{<k}) \,\|\, \pi_\theta(x_k, \mathbf{x}_{<k})\big), \tag{6}$$

where $\lambda_k$ adjusts the weights of the objectives at different time steps, though we found that simply setting all $\lambda_k$ equal to each other works well. We resort to contrastive divergence methods (Hinton1995TheA; DeepDirectedKimB16) to minimize the objective in Eq. 6 by descending the gradient w.r.t. $\theta$ according to Eq. 2 for all time steps. (Footnote 2: Here we take a minimization version of Eq. 2, so the sign before each phase is reversed.) For a specific time step $k$, we have the gradient

$$\nabla_\theta D_{KL}\big(p_d(\mathbf{x}_k) \,\|\, \pi_\theta(\mathbf{x}_k)\big) = \mathbb{E}_{\mathbf{x}_k \sim p_d}\big[\nabla_\theta E_\theta(\mathbf{x}_k)\big] - \mathbb{E}_{\mathbf{x}_k \sim \pi_\theta}\big[\nabla_\theta E_\theta(\mathbf{x}_k)\big], \tag{7}$$

where $\mathbf{x}_k$ is shorthand for $(x_k, \mathbf{x}_{<k})$ and $E_\theta(\mathbf{x}_k) = \phi_\theta(x_k, \mathbf{x}_{<k}) - \log q_\theta(\mathbf{x}_{<k})$. Optimization via Eq. 7 involves sampling data from the model distribution and can thus lead to the discovery of non-data-like samples, whose likelihood is then explicitly reduced as the corresponding energy increases during training. E-ARM is therefore naturally not plagued by the exposure bias problem. Besides, because we model the joint distribution at each time step throughout the training process, E-ARM can assess the entire sequence as a whole and generate more coherent data using energy sampling (residualDengBOSR20).

4 Optimization

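The key obstacle in optimizing the objective in Eq. 6 via contrastive divergence methods is sampling data from the model distribution $\pi_\theta$ for estimating the “negative phase” gradient, since we can only access the estimated density (mass) function $\pi_\theta$. The common MCMC algorithms are not desirable for generating “negative” samples because they are rather time-consuming and not applicable to discrete data such as text. To make the optimization process both efficient and feasible, we propose a unique way to conduct the optimization by means of the importance sampling technique (importancesampling). To be specific, by replacing $E_\theta$ with the specific form in Eq. 8, $E_\theta(\mathbf{x}_k) = \phi_\theta(x_k, \mathbf{x}_{<k}) - \log q_\theta(\mathbf{x}_{<k})$, the “positive phase” gradient with respect to the parameter $\theta$ can be written as

$$\mathbb{E}_{\mathbf{x}_k \sim p_d}\big[\nabla_\theta E_\theta(\mathbf{x}_k)\big] = \mathbb{E}_{\mathbf{x}_k \sim p_d}\big[\nabla_\theta \phi_\theta(x_k, \mathbf{x}_{<k})\big] - \mathbb{E}_{\mathbf{x}_{<k} \sim p_d}\big[\nabla_\theta \log q_\theta(\mathbf{x}_{<k})\big], \tag{10}$$

and similarly, we can get the “negative phase” gradient

$$\mathbb{E}_{\mathbf{x}_k \sim \pi_\theta}\big[\nabla_\theta E_\theta(\mathbf{x}_k)\big] = \mathbb{E}_{\mathbf{x}_k \sim \pi_\theta}\big[\nabla_\theta \phi_\theta(x_k, \mathbf{x}_{<k})\big] - \mathbb{E}_{\mathbf{x}_{<k} \sim \pi_\theta}\big[\nabla_\theta \log q_\theta(\mathbf{x}_{<k})\big]. \tag{11}$$

We can observe that the term $-\mathbb{E}_{\mathbf{x}_{<k} \sim p_d}\big[\nabla_\theta \log q_\theta(\mathbf{x}_{<k})\big]$ in Eq. 10 is the negative gradient of the logarithm of the likelihood $q_\theta(\mathbf{x}_{<k})$, which is exactly the objective of maximizing the log-likelihood of the autoregressive generative model $q_\theta$, and can be easily handled via the cross-entropy loss. Besides, because sample estimation of the expectation over the data distribution $p_d$ is viable, and the score $\phi_\theta(x_k, \mathbf{x}_{<k})$ can be acquired by simply accessing the output logit of the ARGM (according to the definition of $\phi_\theta$ in Sec. 3), the “positive phase” gradient can likewise be readily estimated.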

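The negative phase gradient estimation, on the other hand, is more involved. In Eq. 11, sampling data from $\pi_\theta$ is required for estimating the expectation $\mathbb{E}_{\pi_\theta}$, whereas $\pi_\theta$ is the introduced energy-based autoregressive model, which is an explicit autoregressive generative model, and we can only access its modeled density (mass) function $\pi_\theta$. However, inspired by the idea of importance sampling, we substitute the troublesome estimation of the expectation over the distribution $\pi_\theta$ with the expectation over the distribution $q_\theta$, the underlying autoregressive model, which can generate samples considerably more easily in practice. Accordingly, the “negative phase” gradient has the following form (see the detailed derivation in Appendix A),

$$\mathbb{E}_{\mathbf{x}_k \sim \pi_\theta}\big[\nabla_\theta E_\theta(\mathbf{x}_k)\big] = \mathbb{E}_{\mathbf{x}_{<k} \sim q_\theta}\Big[w(\mathbf{x}_{<k}) \, \mathbb{E}_{x_k \sim q_\theta(\cdot \mid \mathbf{x}_{<k})}\big[\nabla_\theta \phi_\theta(x_k, \mathbf{x}_{<k})\big]\Big] - \mathbb{E}_{\mathbf{x}_{<k} \sim q_\theta}\big[w(\mathbf{x}_{<k}) \, \nabla_\theta \log q_\theta(\mathbf{x}_{<k})\big], \tag{12}$$

$$w(\mathbf{x}_{<k}) = \frac{\sum_{x_k} e^{-\phi_\theta(x_k, \mathbf{x}_{<k})}}{\mathbb{E}_{\mathbf{x}'_{<k} \sim q_\theta}\big[\sum_{x_k} e^{-\phi_\theta(x_k, \mathbf{x}'_{<k})}\big]}. \tag{13}$$
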
According to Eq. 12, all the estimated expectations only need sampling from the autoregressive model $q_\theta$ rather than from the distribution $\pi_\theta$, and the reweighting weight $w$ in Eq. 13 does not involve computing an expectation over the distribution $\pi_\theta$ either. Generally speaking, producing data from an autoregressive model is a simple ancestral sampling process and is naturally suitable for discrete data, as compared with sampling directly from an explicit generative density (mass) estimator, which needs MCMC approaches (autoregressiveEnergyDurkanN19). Besides, the term $\mathbb{E}_{\mathbf{x}_{<k} \sim q_\theta}\big[w(\mathbf{x}_{<k}) \, \nabla_\theta \log q_\theta(\mathbf{x}_{<k})\big]$ in Eq. 12 can be regarded as a re-weighted gradient of the information entropy of $q_\theta$ with respect to $\theta$. This term can be optimized similarly to the teacher-forcing training of an autoregressive model, with the “teacher” sequence generated autoregressively by the model itself. The scheduled sampling methods (SSBengioVJS15; SSRanzatoCAZ15; SSTransformerMihaylovaM19) are similar to this term but lack the re-weighting factor. Moreover, the reweighting weight $w$ of Eq. 13 can be further refined (see the derivation in the appendix), and we can observe that

$$w(\mathbf{x}_{<k}) = \frac{\mu(\mathbf{x}_{<k})}{\mathbb{E}_{\mathbf{x}'_{<k} \sim q_\theta}\big[\mu(\mathbf{x}'_{<k})\big]}, \tag{14}$$

where $\mu(\mathbf{x}_{<k}) = \pi_\theta(\mathbf{x}_{<k}) / q_\theta(\mathbf{x}_{<k})$ indicates which distribution ($\pi_\theta$ or $q_\theta$) the input context $\mathbf{x}_{<k}$ is more likely to come from. Correspondingly, $w(\mathbf{x}_{<k})$ reflects the magnitude of $\mu(\mathbf{x}_{<k})$ for the context $\mathbf{x}_{<k}$ relative to the average over all potential contexts: the larger the value of $w(\mathbf{x}_{<k})$, the more likely it is that the context comes from $\pi_\theta$, which is modeled by the product of autoregressive models and EBMs. During training, input sequences whose contexts are more likely to come from $\pi_\theta$ than from $q_\theta$ are assigned larger weights $w$, while the others are assigned smaller weights. A precise sample estimate of the denominator of $w$ is important; otherwise, the gradient estimate will be biased. Fortunately, the bias can be reduced by replacing the estimate with an exponential moving average, as done in MINE (mine). For small learning rates, this improved gradient estimate can be made to have arbitrarily small bias.
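The exponential moving average mentioned above can be sketched as follows. This is only an illustration of the smoothing step, under assumed settings: the decay of 0.99, the mini-batch size of 4, and the Gaussian noise standing in for per-batch estimates of the denominator of $w$ are all hypothetical, and `ema_update` is a name introduced for this example.

```python
import random

def ema_update(running, batch_value, decay=0.99):
    """One exponential-moving-average step; the decay value is a hypothetical setting."""
    if running is None:
        return batch_value
    return decay * running + (1.0 - decay) * batch_value

random.seed(0)
true_mean = 5.0   # stand-in for the exact denominator of w
running = None
for _ in range(2000):
    # Stand-in for a noisy mini-batch estimate of the denominator.
    batch = [random.gauss(true_mean, 2.0) for _ in range(4)]
    running = ema_update(running, sum(batch) / len(batch))

print(running)
```

A single mini-batch estimate fluctuates strongly from step to step, whereas the running average varies slowly, so dividing by it behaves more like dividing by a constant.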

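Ultimately, combining Eq. 10 and Eq. 12, at each time step $k$, we can optimize the model by descending the estimated gradient with respect to $\theta$ as follows,

$$\nabla_\theta D_{KL}\big(p_d(\mathbf{x}_k) \,\|\, \pi_\theta(\mathbf{x}_k)\big) = \Big(\mathbb{E}_{\mathbf{x}_k \sim p_d}\big[\nabla_\theta \phi_\theta(x_k, \mathbf{x}_{<k})\big] - \mathbb{E}_{\mathbf{x}_{<k} \sim q_\theta}\big[w(\mathbf{x}_{<k}) \, \mathbb{E}_{x_k \sim q_\theta(\cdot \mid \mathbf{x}_{<k})}[\nabla_\theta \phi_\theta(x_k, \mathbf{x}_{<k})]\big]\Big) - \Big(\mathbb{E}_{\mathbf{x}_{<k} \sim p_d}\big[\nabla_\theta \log q_\theta(\mathbf{x}_{<k})\big] - \mathbb{E}_{\mathbf{x}_{<k} \sim q_\theta}\big[w(\mathbf{x}_{<k}) \, \nabla_\theta \log q_\theta(\mathbf{x}_{<k})\big]\Big). \tag{15}$$

Eq. 15 has a rather symmetric form that can be easily estimated by using “positive” samples from the given dataset distribution $p_d$ and “negative” samples from the base autoregressive model $q_\theta$.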
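The importance-sampling step behind the “negative” samples can be checked exactly on a toy model. The sketch below rests on assumptions that should be read as such: it takes the energy-based model to have the form $\pi(x_2, x_1) \propto q(x_1)\, e^{-\phi(x_2, x_1)}$ with $\phi$ the negative logit, as described in Sec. 3, and it uses hypothetical logits for a two-step model over a binary vocabulary. The function `f` is an arbitrary stand-in for the gradient of the energy. The snippet verifies that an expectation under $\pi$ equals a reweighted expectation that only uses the autoregressive model.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical logits of a two-step model over a binary vocabulary.
first_step_logits = [0.3, -0.3]            # defines q(x_1)
second_step_logits = {0: [1.0, 0.0],       # logits given context x_1 = 0
                      1: [-0.5, 2.0]}      # logits given context x_1 = 1

q_context = softmax(first_step_logits)

def context_mass(x1):
    """Sum over x_2 of exp(-phi(x_2, x_1)), with phi defined as the negative logit."""
    return sum(math.exp(z) for z in second_step_logits[x1])

# Normalizer: expectation of the context mass under q(x_1).
Z = sum(q_context[x1] * context_mass(x1) for x1 in (0, 1))
# Reweighting weight of each context.
w = {x1: context_mass(x1) / Z for x1 in (0, 1)}

def f(x2, x1):
    """Arbitrary test function standing in for the gradient of the energy."""
    return 1.0 + 2.0 * x2 - 0.5 * x1

# Exact expectation under pi(x_2, x_1) = q(x_1) * exp(-phi(x_2, x_1)) / Z.
expect_pi = sum(
    q_context[x1] * math.exp(second_step_logits[x1][x2]) / Z * f(x2, x1)
    for x1 in (0, 1) for x2 in (0, 1)
)

# The same quantity using only the autoregressive model q and the weight w.
expect_q = sum(
    q_context[x1] * w[x1] * sum(
        softmax(second_step_logits[x1])[x2] * f(x2, x1) for x2 in (0, 1)
    )
    for x1 in (0, 1)
)
print(expect_pi, expect_q)
```

In practice the sums over contexts would be replaced by samples drawn autoregressively from the model; the exact enumeration here only serves to confirm that the two expectations coincide.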

In general, E-ARM can be viewed as a new training method for autoregressive models, which regards the ARGM as an EBM and can be optimized by a modified contrastive divergence method. Modeling the joint distribution instead of the conditional one at each time step ensures that the autoregressive network stays close to the real distribution while avoiding problems such as exposure bias and the lack of long-range coherence. However, in practice, training the model from scratch with the energy-based learning objective alone, by descending the gradient in Eq. 15, does not work well. The reason is that at the initial stage of training we only have a randomly initialized autoregressive network, which outputs sequences with random values given any context. This implies disjoint supports between the real sequence distribution and the distribution modeled by E-ARM. If we only use the energy-based learning objective derived from contrastive divergence methods, the gradient of the log-likelihood would be nearly zero due to the disjoint supports, which makes the optimization difficult. As a result, to make the optimization more feasible, we train the base autoregressive model with the cross-entropy loss for a few epochs before introducing the energy-based learning objective. We set the starting epoch of the E-ARM objective as a hyper-parameter; see more experimental details in Section 5.2.

Moreover, it is worth noting that for a sequence with total length $K$, the gradient $\nabla_\theta \log q_\theta(\mathbf{x}_{<k})$ has $K$ counterparts at different time steps, and the joint distribution $q_\theta(\mathbf{x}_{<k})$ modeled by an autoregressive model can be naturally broken up into conditionals. As a consequence, simply summing up these gradients results in the “negative phase” gradient having one term as follows,

$$\sum_{k=1}^{K} \nabla_\theta \log q_\theta(\mathbf{x}_{<k}) = \sum_{l=1}^{K-1} (K - l) \, \nabla_\theta \log q_\theta(x_l \mid \mathbf{x}_{<l}),$$

where $l$ indicates the specific index of the current token in the entire sequence. As a result, earlier time steps (smaller $l$) get stronger training signals (larger $K - l$, indicating more gradient terms), giving rise to imbalanced training across time steps. To solve this, when defining $\pi_\theta$ we evaluate the factors of $q_\theta(\mathbf{x}_{<k})$ that belong to older tokens with a fixed copy of the parameters $\theta$. Such a design constrains gradients to backpropagate only through the conditional distributions of a few recent tokens, which balances the training signals among different time steps. (Footnote 3: In practice, we find that using the 2 most recent tokens works best.)

5 Experiments

To empirically corroborate the effectiveness of E-ARM and show its broad applicability, we have conducted extensive experiments. In this section, we introduce the experimental setups and then analyze the results. We mainly focus on natural language processing applications, but also carry out some simple experiments on image generation to show that E-ARM is a general training method for autoregressive models. More experimental settings and analytical experiments are given in the appendix.

5.1 Application to Neural Machine Translation

| Model | Label Smoothing | Scheduled Sampling | Beam Searching | DE→EN | EN→DE | EN→IT | IT→EN | ES→EN | EN→ES | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Base | - | - | - | 32.44±0.06 | 26.64±0.10 | 27.92±0.03 | 30.48±0.08 | 38.61±0.11 | 35.42±0.09 | 31.92 |
| Base | - | - | 5 B | 33.62±0.07 | 27.41±0.08 | 28.72±0.04 | 31.39±0.05 | 39.55±0.12 | 36.38±0.07 | 32.85 |
| Base | ✔ | - | - | 33.68±0.03 | 27.62±0.04 | 28.81±0.07 | 31.42±0.07 | 39.85±0.13 | 36.71±0.09 | 33.02 |
| Base | ✔ | - | 5 B | 34.61±0.08 | 28.46±0.06 | 29.72±0.10 | 32.29±0.03 | 40.64±0.07 | 37.48±0.05 | 33.87 |
| Base | ✔ | ✔ | - | 34.23±0.06 | 27.96±0.03 | 29.26±0.11 | 31.93±0.08 | 40.16±0.03 | 37.21±0.04 | 33.46 |
| Base | ✔ | ✔ | 5 B | 35.10±0.04 | 28.73±0.04 | 29.97±0.07 | 32.64±0.12 | 40.91±0.06 | 37.93±0.10 | 34.21 |
| E-ARM | - | - | - | 32.99±0.10 | 27.15±0.03 | 28.33±0.12 | 31.13±0.04 | 39.56±0.01 | 36.07±0.02 | 32.54 |
| E-ARM | - | - | 5 B | 34.06±0.06 | 27.97±0.08 | 29.26±0.09 | 31.90±0.13 | 40.30±0.03 | 36.92±0.09 | 33.40 |
| E-ARM | ✔ | - | - | 33.97±0.08 | 28.03±0.04 | 29.13±0.02 | 31.84±0.11 | 40.32±0.03 | 36.96±0.07 | 33.38 |
| E-ARM | ✔ | - | 5 B | 34.93±0.05 | 28.91±0.12 | 30.04±0.11 | 32.56±0.04 | 41.01±0.06 | 37.73±0.12 | 34.20 |
| E-ARM | ✔ | ✔ | - | 34.58±0.09 | 28.38±0.12 | 29.56±0.10 | 32.11±0.03 | 40.93±0.03 | 37.56±0.07 | 33.85 |
| E-ARM | ✔ | ✔ | 5 B | 35.36±0.05 | 29.11±0.04 | 30.25±0.09 | 32.82±0.11 | 41.58±0.07 | 38.19±0.03 | 34.55 |

Table 1: Comparison of BLEU scores between our approach E-ARM and the base ARGM trained with only the cross-entropy loss on six translation pairs of the IWSLT14 datasets. We use “-” to denote that a training trick is not used, while “✔” indicates that it is used. “5 B” means that we use beam searching with 5 beams.

E-ARM is first evaluated in the context of neural machine translation (NMT), which is a conditional generation task and is important in the natural language processing (NLP) field. We first analyze E-ARM on the IWSLT14 dataset, which includes six different language pairs ({German, Spanish, Italian} → English and English → {German, Spanish, Italian}). In addition, we test E-ARM on the WMT16 (English → German) benchmark to evaluate it on a larger dataset. Hereafter we abbreviate English, German, Spanish, and Italian as “En”, “De”, “Es”, and “It”. We use one size of transformer (“Base-IWSLT”) for the IWSLT14 benchmark and two sizes of transformer (“Base-WMT”, “Large-WMT”) for the WMT16 benchmark. Scheduled sampling is carried out following SSTransformerMihaylovaM19. More experimental details are reported in the appendix.

The results of the IWSLT14 tasks are shown in Table 1. We test not only the pure performance of E-ARM but also its compatibility with other techniques. In detail, we can observe that (1) without any particular engineering, E-ARM outperforms the base autoregressive translation model trained with cross-entropy alone by 0.62 (31.92 → 32.54) BLEU points on average, especially on three translation pairs: 38.61 → 39.56 on Spanish-to-English, 30.48 → 31.13 on Italian-to-English, and 35.42 → 36.07 on English-to-Spanish. (2) E-ARM is compatible with other techniques like scheduled sampling, which can help alleviate the exposure bias problem to some extent. They are not mutually exclusive and can work together to further improve the performance of the base ARGM. (3) However, since scheduled sampling can reduce exposure bias and beam search can somewhat alleviate the flaws caused by greedy selection at each time step, the performance gain of E-ARM when all these tactics are combined is only 0.34 (34.21 → 34.55), which is lower than the 0.62 (31.92 → 32.54) obtained when the model is trained without these other techniques.
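The averages and gains quoted above follow directly from the first Base row and the first E-ARM row of Table 1 (no label smoothing, scheduled sampling, or beam search). The short check below simply redoes that arithmetic; the hyphenated pair labels and variable names are introduced for this example.

```python
pairs = ["DE-EN", "EN-DE", "EN-IT", "IT-EN", "ES-EN", "EN-ES"]
base = [32.44, 26.64, 27.92, 30.48, 38.61, 35.42]    # Base, no extra techniques
e_arm = [32.99, 27.15, 28.33, 31.13, 39.56, 36.07]   # E-ARM, no extra techniques

avg_base = round(sum(base) / len(base), 2)
avg_e_arm = round(sum(e_arm) / len(e_arm), 2)
avg_gain = round(avg_e_arm - avg_base, 2)

# Per-pair BLEU gain of E-ARM over the base model.
gains = {pair: round(e - b, 2) for pair, b, e in zip(pairs, base, e_arm)}
top_three = sorted(gains, key=gains.get, reverse=True)[:3]

print(avg_base, avg_e_arm, avg_gain)
print(gains)
print(top_three)
```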

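| Model | L.S. | S.S. | w/E-ARM | BLEU |
|---|---|---|---|---|
| Base-WMT | - | - | - | 27.56 |
| Base-WMT | ✔ | - | - | 28.04 |
| Base-WMT | ✔ | ✔ | - | 28.36 |
| Base-WMT | ✔ | ✔ | ✔ | 28.62 |
| Large-WMT | - | - | - | 28.70 |
| Large-WMT | ✔ | - | - | 29.05 |
| Large-WMT | ✔ | ✔ | - | 29.23 |
| Large-WMT | ✔ | ✔ | ✔ | 29.44 |

Table 2: Translation performance of the proposed E-ARM on WMT16 English→German, evaluated with BLEU. We uniformly use 5 beams when applying beam search. “L.S.” denotes Label Smoothing and “S.S.” denotes Scheduled Sampling.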

Additionally, Table 2 shows the performance of E-ARM on the WMT16 English → German task. For the two model sizes, enabling label smoothing (L.S.) improves model performance by 0.48 and 0.35 BLEU points, respectively. The performance of the base transformer model further increases to 28.36 BLEU points when scheduled sampling (S.S.) is used, while the larger model improves to 29.23 points. E-ARM paired with label smoothing and scheduled sampling yields the highest scores of 28.62 and 29.44, respectively. Overall, our training strategy outperforms the vanilla teacher-forcing training of ARGMs and has uniformly favorable impacts across different model and dataset sizes.

5.2 Application to Language Modeling

| Model | Energy Re-sampling | CC-News | Toronto Book Corpus | WikiText103 |
|---|---|---|---|---|
| Tr-Base | - | 18.29 | 17.57 | 30.56 |
| Residual EBM (Tr-Base) | ✔ | 15.57-15.58 | 16.98-17.00 | 29.88-29.93 |
| Tr-XL | - | - | - | 24.20 |
| Residual EBM (Tr-XL) | ✔ | - | - | 23.85-23.87 |
| E-ARM (Tr-Base) | - | 15.78 | 17.10 | 29.94 |
| E-ARM (Tr-Base) | ✔ | 15.63-15.67 | 16.89-16.93 | 29.81-29.84 |
| E-ARM (Tr-XL) | - | - | - | 23.90 |
| E-ARM (Tr-XL) | ✔ | - | - | 23.79-23.82 |

Table 3: Language modeling performance of different models on three benchmarks. Evaluation is conducted using perplexity (PPL). We test the performance of E-ARM with and without the energy resampling technique. The residual EBM (ResidualBakhtinDGORS21) requires an extra model with the same structure as the underlying autoregressive model to learn the energy, and therefore doubles the number of parameters.

To further demonstrate E-ARM’s consistency in reducing the flaws of autoregressive generative models, we also conduct experiments on language modeling tasks. Three different datasets, WikiText-103 (Wikitext103), Toronto Book Corpus (bookcorpus1; bookcorpus2), and CC-news (cc-news), are chosen as the testbed. WikiText-103 comprises 103 million training tokens from 28 thousand articles, with an average length of 3.6 thousand tokens per article; Toronto Book Corpus consists of fiction books in 16 different genres, totaling about half a billion words; and CC-news is a de-duplicated subset of the English portion of the CommonCrawl news dataset, which totals around 16 billion words. Two autoregressive network structures are used to evaluate our method’s effectiveness: the vanilla Transformer (TransformerBase) (“Tr-Base” for short), tested on all three benchmarks, and Transformer-XL (TransformerXL) (“Tr-XL” for short), which is a transformer equipped with a recurrent memory, tested on WikiText-103. In particular, the residual EBM (residualDengBOSR20) shares a similar energy-based idea for improving the quality of text generation for autoregressive models, though it treats the underlying autoregressive model and the EBM as two independent models and applies the distribution modeling to the entire sequence instead of to subsequences at each time step. We therefore implement it as a baseline for our method. Besides, the Top-K energy resampling post-processing technique, which is a critical module of the residual EBM, is also applicable to E-ARM, since we can estimate the energy of entire input sequences. (Footnote 4: It is worth noting that Top-K energy resampling cannot yield the PPL directly. ResidualBakhtinDGORS21 provides a way to approximate the PPL, which leads to an estimated interval of PPL.)

The final results are reported in Table 3. We can see from the results that E-ARM outperforms the two pure autoregressive models by clear margins on all three benchmarks. Specifically, on the WikiText-103 benchmark, E-ARM improves the performance of the Transformer-Base model and the Transformer-XL model by 0.62 PPL points (from 30.56 to 29.94) and 0.30 PPL points (from 24.20 to 23.90), respectively; on the CC-news and Toronto Book Corpus benchmarks, our method obtains 0.51 PPL and 0.47 PPL performance gains, respectively, and improves further once the energy resampling technique is applied. Besides, although the residual EBM has twice as many learnable parameters as ours and cannot directly benefit autoregressive models without energy resampling, E-ARM achieves results comparable to it, and even slightly better ones on the Toronto Book Corpus and WikiText-103 benchmarks.

| Model Structure | Start epoch 5 | Start epoch 10 | Start epoch 15 | Start epoch 20 | Start epoch 25 |
|---|---|---|---|---|---|
| Tr-Base | 30.38 | 30.12 | 29.94 | 30.05 | 30.29 |
| Tr-XL | 24.12 | 23.90 | 23.96 | 24.05 | 24.16 |

Table 4: Exploring the effect of different start epochs of E-ARM on the WikiText-103 benchmark. Performance is evaluated by perplexity (PPL).

In addition, we have studied the effect of different start epochs of E-ARM on language modeling performance, as shown in Table 4. From this, we may deduce that starting E-ARM training at the 15th and 10th epoch yields the best results for Transformer-Base and Transformer-XL, respectively, whereas starting earlier or later leads to a performance decline. This is reasonable because, if E-ARM is introduced too early, the autoregressive model may not yet have been optimized well. As a result, the quality of the generations used for the “negative phase” would be poor, making energy-based training unstable. On the other hand, the underlying autoregressive model can be modified only marginally if E-ARM is introduced when the ARGM training is virtually complete.

5.3 Application to Image Generation

In order to illustrate the effectiveness and generality of our method across modalities, we further show the results of applying E-ARM to image generation in this section. We apply E-ARM to Pixel-CNN (van2016pixel) and its variant Gated Pixel-CNN (oord2016conditional). Experiments are carried out on the MNIST and CIFAR-10 datasets.

| Model | MNIST | CIFAR-10 |
|---|---|---|
| Pixel-CNN | 0.17 (0.13) | 3.14 (3.08) |
| Pixel-CNN (w/E-ARM) | 0.15 (0.12) | 3.07 (2.98) |
| Gated Pixel-CNN | 0.14 (0.11) | 3.03 (2.90) |
| Gated Pixel-CNN (w/E-ARM) | 0.12 (0.10) | 2.97 (2.87) |