On the Anatomy of MCMC-based Maximum Likelihood Learning of Energy-Based Models

by   Erik Nijkamp, et al.

This study investigates the effects of Markov Chain Monte Carlo (MCMC) sampling in unsupervised Maximum Likelihood (ML) learning. Our attention is restricted to the family of unnormalized probability densities for which the negative log density (or energy function) is a ConvNet. In general, we find that the majority of techniques used to stabilize training in previous studies can have the opposite effect. Stable ML learning with a ConvNet potential can be achieved with only a few hyper-parameters and no regularization. With this minimal framework, we identify a variety of ML learning outcomes depending on the implementation of MCMC sampling. On one hand, we show that it is easy to train an energy-based model which can sample realistic images with short-run Langevin. ML can be effective and stable even when MCMC samples have much higher energy than true steady-state samples throughout training. Based on this insight, we introduce an ML method with noise initialization for MCMC, high-quality short-run synthesis, and the same budget as ML with informative MCMC initialization such as CD or PCD. Unlike previous models, this model can obtain realistic high-diversity samples from a noise signal after training with no auxiliary models. On the other hand, models learned with highly non-convergent MCMC do not have a valid steady-state and cannot be considered approximate unnormalized densities of the training data because long-run MCMC samples differ greatly from the data. We show that it is much harder to train an energy-based model where long-run and steady-state MCMC samples have realistic appearance. To our knowledge, long-run MCMC samples of all previous models result in unrealistic images. With correct tuning of Langevin noise, we train the first models for which long-run and steady-state MCMC samples are realistic images.






1 Introduction

1.1 Diagnosing Energy-based Models

Figure 1: Long-run MCMC samples from uniform noise for non-convergent ML learning with noise-initialized MCMC on the Oxford Flowers 102 dataset. The energy function learns realistic structure early in sampling before the oversaturated steady-state samples appear. 100 MCMC steps are used in training.

Statistical modeling of high-dimensional signals is a challenging task encountered in many academic disciplines and practical applications. We focus on image signals in this work. When images come without annotations or labels, the highly effective tools of deep supervised learning cannot be applied and unsupervised techniques must be used instead. The Variational Auto-Encoder (VAE) [20] and Generative Adversarial Networks (GANs) [10] are two widely-used unsupervised image models. The VAE is capable of accurate reconstruction and denoising while GANs can synthesize realistic images from a latent noise signal.

In this paper, we will focus on another unsupervised paradigm: the energy-based model in the form of an unnormalized Gibbs-Boltzmann density. Markov Chain Monte Carlo (MCMC) samples can be used to approximate the unknown and intractable log partition function while learning.

Figure 2: Illustration of the mixing of positive samples (yellow), negative samples (red), steady state samples (blue) and modes (gray) for non-convergent ML (left) and convergent ML (right).

Figure 3: Comparison of training samples and steady state samples. The top row displays MCMC samples used to update the model during training and the bottom row shows steady-state samples. Training method from left to right: 1) non-convergent ML using non-informative initialization and 100 Langevin steps, 2) convergent ML with non-informative initialization and 20,000 Langevin steps, and 3) convergent ML with persistent initialization and 500 Langevin steps.

Figure 4: Long-run Langevin paths from positive samples to metastable samples for the Oxford Flowers 102 dataset. We implement the MH correction for Langevin dynamics to ensure true sampling. Two variations of Algorithm 1 are presented. Top: Non-convergent ML using 100 MCMC steps from non-informative initialization. Bottom: Convergent ML using 500 MCMC steps from persistent initialization.

We highlight two important but unrecognized phenomena that occur when using Maximum Likelihood (ML) to train energy-based models with a ConvNet potential energy function: 1) improper MCMC implementation yields models with high-quality short-run samples and low-quality long-run samples, 2) high-quality long-run samples require careful tuning of the Langevin noise with respect to the spectral norm of the ConvNet potential.

Previous works studying ML training of ConvNet potentials, such as [39, 38, 8], use a relatively small number of Langevin MCMC updates to obtain samples in each model update. These works all make use of informative initialization, meaning that the initial states of MCMC sampling during ML learning are chosen to somehow reflect the current learned density (see Section 2.3). The authors universally find that after enough updates of the model, MCMC samples obtained by short-run Langevin from informative initialization are realistic images that resemble the data.

However, we find that energy functions learned by prior methods have major defects regardless of MCMC initialization, network structure, and other training parameters. In particular, the long-run and steady-state MCMC samples of energy functions from previous implementations are oversaturated images with significantly lower energy than the observed data (see Figure 2 left, and Figure 5). In this case it is not appropriate to describe the learned model as an approximate density for the training set because the model assigns disproportionately high probability to images which differ dramatically from observed data. The systematic difference between high-quality short-run samples and low-quality long-run samples is a crucial phenomenon that appears to have gone unnoticed in previous studies.

The convergence or non-convergence of MCMC samples to their steady-state during ML training is the first learning dimension that we investigate. As we can observe from previous implementations, approximate convergence of MCMC samples is not a necessary condition for obtaining realistic synthesized images from the learned model. In addition, informative initialization is not sufficient to ensure MCMC convergence without a proper implementation of the sampling phase.

We use this insight to show that informative MCMC initialization is not needed in non-convergent ML learning. We introduce the first ML learning implementation that initializes MCMC samples from a non-informative noise distribution throughout training. The method achieves the same results with the same computational budget as the non-convergent ML with informative MCMC initialization from previous studies (see Figure 1 and Figure 6). Unlike models that are learned with informative MCMC initialization, it is easy to obtain new and diverse samples from our model after training with only a noise signal.

Our quest to achieve ML training of ConvNet potentials with convergent MCMC reveals another dimension of ML learning. Overly aggressive energy minimization during MCMC sampling can lead to scenarios where convergence is easily achieved but network spectral norm decreases throughout training, greatly limiting the synthesis abilities of the network. Careful tuning of Langevin noise is needed to induce a learning phase with both convergent sampling and realistic synthesis via increasing spectral norm. Our implementations yield the first ConvNet potentials with realistic long-run and steady-state MCMC samples (see Figure 2 right, Figure 3 middle and right columns, and Figure 4 bottom).

1.2 Our Contributions

In this work, we address previously unrecognized complications that arise when learning ConvNet potential energy functions with MCMC-based ML and provide remedies. The main contributions of our paper are:

  • Identification of two distinct axes which characterize MCMC-based ML learning: 1) MCMC convergence and non-convergence, and 2) expansion or contraction of network spectral norm. Previous models are learned with non-convergent MCMC, resulting in invalid steady-state structure. See Figure 1, Figure 2 left, Figure 3 left column, Figure 4 top, and Figure 5.

  • The first ConvNet potentials trained using ML with purely noise-initialized MCMC. Our method has the same computational cost as the ML learning with informative MCMC initialization used in previous studies. Unlike models learned with informative initialization, our model can efficiently generate high-diversity and high-quality samples after training from a noise signal alone. See Figure 1 and Figure 6.

  • The first ConvNet potentials with realistic steady-state samples. To our knowledge, ConvNet potentials with stable MCMC sampling in the image space are unobtainable by previous training implementations. We refer to [21] for a discussion. See Figure 2 right, Figure 3 middle and right columns, and Figure 4 bottom.

  • Mapping the macroscopic structure of image space energy functions by the novel means of diffusion in a magnetized energy landscape for unsupervised cluster discovery. See Figure 7.

We hope these contributions pave the way towards establishing energy-based models as a central generative learning paradigm. In contrast to predominant generative models which recruit additional neural networks for approximate variational inference (VAE) or ancestral samplers in the form of a generator (GAN), our energy-based model can be trained without auxiliary models. The models learned by VAE and GAN are essentially deterministic feed-forward mappings while energy-based learning yields a true probabilistic representation of the data.

1.3 Related Work

1.3.1 Energy-Based Image Models

Energy-based models define an unnormalized probability density over a state space to represent the distribution of states in a given system. Early energy-based models, such as the Ising model, use hand-defined potentials that represent ideal physical systems. The Hopfield network [17] adapted these physical energy models into a model capable of representing arbitrary observed data. Hopfield also introduced the interpretation of energy-based models as a form of associative memory where energy descent is analogous to memory recall.

The Hopfield network is not expressive enough to capture the complex structure of real images. The RBM [16] and FRAME (Filters, Random field, And Maximum Entropy) [44, 36] models introduce energy functions with greater representational capacity. The RBM model uses hidden units which have a joint density with the observable image pixels, while the FRAME model uses convolutional filters and histogram matching to learn data features. [43] refers to energy-based models as descriptive models.

The pioneering work [15] studies the hierarchical energy-based model. [29] is an important early work proposing feedforward neural networks to model energy functions. The energy-based model in the form of (2) is introduced in [4]. Deep variants of the FRAME model [39, 24] are the first to achieve realistic synthesis with a ConvNet potential and Langevin sampling. [6] applies similar methods.

The Multi-grid model [8] learns an ensemble of ConvNet potentials for images of different scales with finite-budget Langevin sampling. Synthesized images from smaller scales are used as the informative initialization for MCMC sampling at larger scales during training.

Learning a ConvNet potential with the help of a generator network as an approximate direct sampler is explored in [19, 5, 37, 38, 12, 21].

The INN model [33] learns unnormalized densities in a discriminative framework. [18, 22] investigate a ConvNet parameterization of this model from the perspective of image classification and synthesis respectively. The W-GAN [1] framework is adapted to the INN method in the WINN model [23].

[41, 40] explore an adversarial interpretation of ML learning. These works show connections to W-GAN and herding [35].

Two common threads between these learning algorithms are the ML parameter update (10) and the Langevin image update (11). We emphasize that some of the above works do not use both.

Although many of these works claim to train the energy (2) to be an approximate unnormalized density for the observed images, the resulting energy functions do not have a steady-state that reflects the data (see Figure 5). Short-run Langevin samples from informative initialization are presented as approximate steady-state samples, but further investigation shows long-run Langevin consistently disrupts the realism of short-run images.

We emphasize that unrealistic image space steady-states are a central concern specifically when training ConvNet potentials. Earlier energy-based models such as RBM do not exhibit a dramatic difference in realism between short-run samples from informative initialization and steady-state images. Variational Walkback [11] is an energy-free method that can directly learn a MCMC sampling process with a realistic steady-state in the image space.

ConvNet potentials trained by prior methods cannot obtain diverse and realistic samples from noise initialization after training without auxiliary networks. Noise-initialized Langevin samples from previous implementations are significantly less diverse than samples from the informative initialization used in training.

Figure 5: Steady-state samples of recent energy-based models. From left to right: Wasserstein-GAN critic on Oxford flowers [1], WINN on Oxford flowers [23], conditional EBM on ImageNet [6]. The W-GAN critic is not trained to be an unnormalized density but we include samples for reference.

1.3.2 Energy Landscape Mapping

The full potential of the energy-based model lies in the structure of the energy landscape. Hopfield observed that the energy landscape is a model of associative memory [17]. Diffusion along the potential energy manifold is analogous to memory recall because the diffusion process will gradually refine a high-energy image (an incomplete or corrupted memory) until it reaches a low-energy metastable state, which corresponds to the revised memory.

Techniques for mapping and visualizing the energy landscape of non-convex functions first appear in the physical chemistry literature [2, 34]. Similar methods have been applied to map a latent space of Cooperative Networks [13] using an energy function that is related to the DGNAM model [30]. However, defects in the energy function (2) from previous implementations prevent these techniques from being applied directly in the image space. Our convergent models pave the way for image space mapping.

1.3.3 Activation Maximization and Adversarial Attacks

Energy-based models with the form (2) are related to the Activation Maximization (AM) branch of deep learning. AM applications study images which maximize or minimize the response of a neuron or channel in a trained network. Such images can be obtained by iterative backpropagation and gradient descent with respect to the network input. Finding the long-run MCMC samples and local modes of a ConvNet potential can be viewed as an AM application.

Previous AM research focuses on maximizing the response of pre-trained classifier networks [7, 25, 27, 30]. The high-activation images are found to be dream-like patterns which resemble natural image patterns that a neuron or channel has learned to detect, much like in Figure 5. In energy-based modeling, the steady-state samples should correspond to natural image patterns rather than dream-like patterns because the network is specifically trained to focus probability mass on realistic images.

AM research is closely connected with the study of Adversarial Attacks. Any input image can be modified by an imperceptible change that causes a classifier network to have disproportionately high activation for a given neuron or channel. Although the classifier learns generalizable decision boundaries between classes during training, these boundaries contain many “leaks” that can be exploited to give incorrect classification.

We believe that the sensitivity of deep networks to undesirable activations is related to the difficulties encountered when training the energy (2). In particular, if the distribution of synthesized images remains distant from the steady-state throughout training, then the gradient update (10) can create and propagate pathological low-energy regions throughout the training process. Extra care must be taken so that the energy basins of $U(x;\theta)$ do not contain “leaky” pathways to low-energy unrealistic images. We hope that energy-based models can be used to combat adversarial attacks because a well-formed energy should have strong activation only on realistic images.

2 Learning Energy-Based Models

In this section, we review the central principles of the MCMC-based ML learning used in previous studies such as [14, 44, 39].

2.1 Maximum Likelihood Estimation

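An energy-based model is a Gibbs-Boltzmann density

$$p_\theta(x) = \frac{1}{Z(\theta)} \exp\{-U(x;\theta)\}$$

over signals $x \in \mathcal{X} \subset \mathbb{R}^N$. The energy potential $U(x;\theta)$ belongs to a parametric family $\mathcal{U} = \{U(\cdot\,;\theta) : \theta \in \Theta\}$. The intractable constant $Z(\theta) = \int_{\mathcal{X}} \exp\{-U(x;\theta)\}\,dx$ is never used explicitly because the potential $U(x;\theta)$ provides sufficient information for MCMC sampling. In this paper we focus our attention on energy potentials with the form

$$U(x;\theta) = F(x;\theta), \tag{2}$$

where $F(x;\theta)$ is a convolutional neural network with weights $\theta$.

In ML learning, we seek to find $\theta \in \Theta$ such that the parametric model $p_\theta(x)$ is a close approximation of the data distribution $q(x)$. One measure of “closeness” is the Kullback-Leibler (KL) divergence $D_{KL}(q \,\|\, p_\theta)$. The loss function $\mathcal{L}(\theta)$ for training is then

$$\begin{aligned} \mathcal{L}(\theta) &= D_{KL}(q \,\|\, p_\theta) \\ &= E_q[\log q(X)] - E_q[\log p_\theta(X)] \\ &= \log Z(\theta) + E_q[U(X;\theta)] + \text{const}, \end{aligned} \tag{5}$$

where the constant $E_q[\log q(X)]$ does not depend on $\theta$. Equation (5) is equivalent to the traditional ML objective $\arg\max_\theta E_q[\log p_\theta(X)]$. We can minimize $\mathcal{L}(\theta)$ by finding the roots of the derivative

$$\frac{d}{d\theta}\mathcal{L}(\theta) = \frac{d}{d\theta}\log Z(\theta) + \frac{d}{d\theta}E_q[U(X;\theta)].$$

The term $\frac{d}{d\theta}\log Z(\theta)$ is intractable, but it can be expressed

$$\frac{d}{d\theta}\log Z(\theta) = -E_{p_\theta}\!\left[\frac{\partial}{\partial\theta}U(X;\theta)\right].$$

The gradient used to learn $\theta$ then becomes

$$\frac{d}{d\theta}\mathcal{L}(\theta) = \frac{d}{d\theta}E_q[U(X;\theta)] - E_{p_\theta}\!\left[\frac{\partial}{\partial\theta}U(X;\theta)\right] \tag{9}$$

$$\approx \frac{\partial}{\partial\theta}\left(\frac{1}{n}\sum_{i=1}^{n}U(X_i^+;\theta) - \frac{1}{m}\sum_{i=1}^{m}U(X_i^-;\theta)\right) \tag{10}$$

where $\{X_i^+\}_{i=1}^{n}$ are i.i.d. samples from the data distribution $q$ (called positive samples since probability is increased), and $\{X_i^-\}_{i=1}^{m}$ are i.i.d. samples from the current learned distribution $p_\theta$ (called negative samples since probability is decreased). In practice, the positive samples are a batch of training images and the negative samples are obtained after $L$ iterations of MCMC sampling.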
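As a minimal sketch of the sample-based gradient (10), the snippet below uses a one-dimensional toy energy $U(x;\theta) = \frac{1}{2}\theta x^2$ in place of a ConvNet. The toy energy, the function names, and the hard-coded sample lists are illustrative assumptions and are not taken from the paper. The gradient is the average of $\partial U / \partial\theta$ over positive samples minus the same average over negative samples.

```python
def energy_grad_theta(x, theta):
    """dU/dtheta for the toy energy U(x; theta) = 0.5 * theta * x**2 (an assumption)."""
    return 0.5 * x * x


def ml_gradient(positive, negative, theta):
    """Sample-based ML gradient: mean dU/dtheta over positive samples
    minus mean dU/dtheta over negative samples."""
    pos = sum(energy_grad_theta(x, theta) for x in positive) / len(positive)
    neg = sum(energy_grad_theta(x, theta) for x in negative) / len(negative)
    return pos - neg


positive = [1.0, -1.0, 1.0, -1.0]   # stand-in for a batch of training data
negative = [2.0, -2.0, 2.0, -2.0]   # stand-in for MCMC samples
grad = ml_gradient(positive, negative, theta=1.0)
theta_new = 1.0 - 0.1 * grad        # one plain gradient-descent step
```

Here the negative samples are more spread out than the data, so the step increases $\theta$, which raises the energy of the negative samples relative to the positive samples.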

2.2 MCMC Sampling with Langevin Dynamics

Obtaining the negative samples from the current distribution is a computationally intensive task which must be performed for each update of $\theta$. ML learning does not impose a specific MCMC algorithm. Early energy-based models such as the RBM and FRAME model use Gibbs sampling to obtain the MCMC updates. A Gibbs sampler visits and updates each dimension (one pixel of the image) individually. This is computationally infeasible when training an energy with the form (2) for standard image sizes.

Several works studying the energy (2) recruit Langevin Dynamics to obtain the negative samples [39, 24, 38, 8, 23]. The Langevin Equation

$$X_{\ell+1} = X_\ell - \frac{\varepsilon^2}{2}\frac{\partial}{\partial x}U(X_\ell;\theta) + \varepsilon Z_\ell, \tag{11}$$

where $Z_\ell \sim N(0, I_N)$ and $\varepsilon > 0$, has stationary distribution $p_\theta$ [9, 28]. A complete implementation of Langevin Dynamics requires a momentum update and Metropolis-Hastings update in addition to (11), but most authors find that these can be ignored in practice [3]. The gradient term in (11) helps Langevin sampling to converge more quickly than methods such as Gibbs sampling which do not incorporate landscape geometry.
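The Langevin update can be sketched in a few lines. The example below is an illustration under stated assumptions: it uses a one-dimensional toy energy $U(x) = \frac{1}{2}\theta x^2$ whose steady state is a Gaussian with variance $1/\theta$, and the step size, chain count, and helper names are hypothetical choices rather than settings from the paper.

```python
import random


def grad_energy(x, theta=1.0):
    """dU/dx for the toy energy U(x) = 0.5 * theta * x**2 (an assumption)."""
    return theta * x


def langevin_step(x, eps, rng):
    """One update: x - (eps^2 / 2) * dU/dx + eps * z, with z ~ N(0, 1)."""
    return x - 0.5 * eps ** 2 * grad_energy(x) + eps * rng.gauss(0.0, 1.0)


def run_chain(x0, n_steps, eps, rng):
    """Apply n_steps Langevin updates starting from x0."""
    x = x0
    for _ in range(n_steps):
        x = langevin_step(x, eps, rng)
    return x


rng = random.Random(0)
# Many independent chains started from uniform noise.
samples = [run_chain(rng.uniform(-1.0, 1.0), 200, 0.3, rng) for _ in range(2000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

With enough steps the sample variance should be close to $1/\theta = 1$, up to a small bias from the finite step size, since no Metropolis-Hastings correction is applied here.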

Like most MCMC methods, Langevin dynamics exhibits high auto-correlation and has difficulty mixing between separate modes. On the other hand, the consistent appearance of long-run MCMC samples can actually be a useful feature of a learned potential because a metastable representation is needed for mapping applications [13].

In general it is not appropriate to describe long-run Langevin samples from a fixed low-energy starting image as steady-state samples because the chains cannot mix between modes in computationally feasible time scales. Even so, long-run Langevin samples with a suitable initialization can still be considered approximate steady-state samples, as discussed in the next section.
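For reference, the Metropolis-Hastings update mentioned above can be added to the Langevin proposal as follows. This is a generic sketch of the Metropolis-adjusted Langevin algorithm on a toy standard-Gaussian energy, not the authors' implementation; all function names and the step size are assumptions made for illustration.

```python
import math
import random


def energy(x):
    """Toy energy U(x) = 0.5 * x**2, a standard Gaussian target (an assumption)."""
    return 0.5 * x * x


def grad_energy(x):
    return x


def log_proposal(x_to, x_from, eps):
    """Log density (up to a constant) of the Langevin proposal x_from -> x_to."""
    mean = x_from - 0.5 * eps ** 2 * grad_energy(x_from)
    return -((x_to - mean) ** 2) / (2.0 * eps ** 2)


def mala_step(x, eps, rng):
    """Langevin proposal followed by a Metropolis-Hastings accept/reject."""
    proposal = x - 0.5 * eps ** 2 * grad_energy(x) + eps * rng.gauss(0.0, 1.0)
    log_alpha = (energy(x) - energy(proposal)
                 + log_proposal(x, proposal, eps)
                 - log_proposal(proposal, x, eps))
    if rng.random() < math.exp(min(0.0, log_alpha)):
        return proposal, True
    return x, False


rng = random.Random(0)
x, accepted, samples = 0.0, 0, []
for _ in range(20000):
    x, ok = mala_step(x, 1.0, rng)
    accepted += ok
    samples.append(x)
accept_rate = accepted / len(samples)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

The accept/reject step removes the discretization bias of the plain Langevin update, at the cost of extra energy evaluations per step.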

2.3 MCMC Initialization

We distinguish two main branches of MCMC initialization: informative initialization, where the initial states are chosen to somehow reflect the target density, and non-informative initialization, where initial states are obtained from a distribution that is unrelated to the target density. In this work we use non-informative initialization to refer exclusively to initialization from a high-dimensional noise distribution such as uniform or Gaussian.

In the most extreme case, a Markov chain initialized from its steady-state distribution will still follow its steady-state distribution after a single update. In more general cases, a Markov chain initialized from an image that is very likely under the steady-state can converge much more quickly than a Markov chain initialized from a noise distribution. Because of this observation, informative initialization is frequently used to justify short-run MCMC in ML learning. Although the validity of short-run MCMC with informative initialization can be justified under restrictive assumptions [42], in general it is not consistent with theoretically correct ML.

Data-based initialization is one common method of informative initialization. The first RBM models are trained with Contrastive Divergence (CD) [14], which initializes MCMC samples from the training data. The Multigrid Model [8] learns an ensemble of energy functions (2) at different scales by using synthesized images from lower resolutions as the initial images for Langevin updates at higher resolutions. A single pixel of downsampled observed data is used as the original proposal. The learning process for the energy function at each individual scale is analogous to CD with downsampled initialization.

Persistent initialization is another widely-used method of informative initialization. The negative samples from the previous learning iteration are used as the initial states for MCMC sampling in the current iteration. This technique is used in works such as [44, 24, 39]. The initial images in these works are either the zero image or noise images. Persistent Contrastive Divergence (PCD) [32] uses persistent chains that are initialized from the training images. The authors of [23, 6] store a large set of persistent images. The Cooperative Learning model [38] learns a generator network alongside the energy function to propose initial images for Langevin updates. The generator network can be interpreted as a mechanism for storing and updating images in a manner that is analogous to persistent chains.

In this paper we consider long-run Langevin chains from both data-based initialization such as CD and persistent initialization such as PCD to be approximate steady-state samples, even when Langevin chains cannot mix between modes. Prior art indicates that both initialization types span the modes of the learned density, and long-run Langevin can obtain fair MCMC samples within each mode.

Informative MCMC initialization during ML training can limit the ability of the final model to generate new and diverse synthesized images after training. MCMC samples initialized from non-informative noise distributions after training tend to result in images with a similar type of appearance when informative initialization is used in training. Cooperative networks [38] can start from a noise signal and use the image synthesized by the generator network as initialization for the energy function. Multigrid models [8] can start from a single pixel distribution and generate full-size images using sequential proposals from energy functions modeling larger and larger scales. Models learned from all prior art are incapable of generating diverse samples from noise using only the final density $p_\theta$.

In the present work we find that informative initialization is not necessary for stable ML learning with realistic synthesis. We successfully implement ML with non-informative MCMC initialization throughout training using the same computational budget as ML with informative MCMC initialization. This allows us to train the first energy-based model capable of generating diverse and realistic images from a noise signal alone after training. We also find that short-run MCMC with informative initialization is not sufficient for learning an energy with a realistic steady-state when Langevin noise is not precisely tuned. Finally, we find that informative initialization can indeed dramatically speed up steady-state convergence of the negative samples with proper Langevin implementation.
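The three initialization schemes discussed in this section differ only in where the starting states of the chains come from. The sketch below makes that concrete with scalar states standing in for images; the function names, the `PersistentBank` class, and the uniform noise range are hypothetical and chosen purely for illustration.

```python
import random


def noise_init(n, rng):
    """Non-informative initialization: fresh uniform noise for every chain."""
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


def data_init(n, data, rng):
    """Data-based initialization (as in CD): start chains at training examples."""
    return [rng.choice(data) for _ in range(n)]


class PersistentBank:
    """Persistent initialization: reuse negative samples from earlier updates."""

    def __init__(self, size, rng):
        self.rng = rng
        self.states = noise_init(size, rng)  # the bank starts from noise here
        self.idx = []

    def draw(self, n):
        """Pick n stored states to use as starting points."""
        self.idx = self.rng.sample(range(len(self.states)), n)
        return [self.states[i] for i in self.idx]

    def store(self, new_states):
        """Write the updated chains back into the slots they were drawn from."""
        for i, x in zip(self.idx, new_states):
            self.states[i] = x


rng = random.Random(0)
data = [0.5, -0.5, 0.25]
from_noise = noise_init(4, rng)
from_data = data_init(4, data, rng)
bank = PersistentBank(size=8, rng=rng)
start = bank.draw(4)
updated = [0.9 * x for x in start]   # stand-in for the result of MCMC updates
bank.store(updated)
```

Only `noise_init` is independent of both the data and the training history, which is why a model trained with it can later be sampled from noise alone.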

3 Two Axes of ML Learning

Inspection of the gradient (10) reveals the central role of the average difference of the energy of negative and positive samples. Let

$$d_{s_t}(\theta) = E_q[U(X;\theta)] - E_{s_t}[U(X;\theta)],$$

where $s_t(x)$ is the distribution of negative samples given the finite-step MCMC sampler and initialization used at training step $t$. The difference $d_{s_t}(\theta)$ measures whether the positive samples from the data distribution $q$ or the negative samples from $s_t$ are more likely under the model $p_\theta$.

Although $d_{s_t}(\theta)$ is not equivalent to the ML objective (5), it bridges the gap between the theoretical ML and the behavior encountered when MCMC approximation is used. Two learning cases occur over the parameter trajectory $\{\theta_t\}$:

  1. Expansion: $d_{s_t}(\theta_t) < 0$. In the extreme case where expansion occurs for all $t$, the model learns that the loss can be minimized when $U$ diverges to $-\infty$ for positive samples and $+\infty$ for negative samples.

  2. Contraction: $d_{s_t}(\theta_t) > 0$. In the extreme case of contraction for all $t$, the model learns that the loss can be minimized by converging to a constant function $U(x;\theta) = c$.

Running MCMC sampling for too few steps can cause learning to stay in the expansion phase because the samples from $s_t$ are unlikely under the true steady state $p_{\theta_t}$. On the other hand, running MCMC sampling with too many steps and low noise can restrict learning to the contraction phase. A healthy mix of sign for $d_{s_t}(\theta_t)$ is an indicator of proper ML learning.

Guided by these observations, we characterize the behavior along two axes: 1) convergence or non-convergence of $s_t$ with respect to the steady-state $p_{\theta_t}$, and 2) the sign of $d_{s_t}(\theta_t)$ through the learning process.
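The energy difference between positive and negative samples is cheap to monitor during training. The sketch below computes it for a toy energy and labels each update by its sign. It assumes the convention that expansion corresponds to negative samples having higher average energy than positive samples (our reading of the description above); the toy energy and all names are illustrative.

```python
def energy(x, theta):
    """Toy energy U(x; theta) = 0.5 * theta * x**2 (an assumption)."""
    return 0.5 * theta * x * x


def energy_difference(positive, negative, theta):
    """Average energy of positive samples minus average energy of negative samples."""
    pos = sum(energy(x, theta) for x in positive) / len(positive)
    neg = sum(energy(x, theta) for x in negative) / len(negative)
    return pos - neg


def phase(d):
    """Label an update by the sign of the energy difference."""
    if d < 0:
        return "expansion"    # negative samples have higher energy than the data
    if d > 0:
        return "contraction"  # negative samples have lower energy than the data
    return "balanced"


d_short = energy_difference([1.0, -1.0], [2.0, -2.0], theta=1.0)
d_long = energy_difference([1.0, -1.0], [0.5, -0.5], theta=1.0)
labels = [phase(d_short), phase(d_long)]
```

Logging this label at every update gives a simple running check of whether training shows a mix of both signs or is stuck in one phase.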

3.1 MCMC Non-Convergence

The exploding gradients that can occur after many expansion updates suggest that MCMC convergence is needed for stable ML learning. If the negative samples cannot keep pace with the positive samples, an increasing gap in energy between the groups can cause learning to collapse. On the other hand, the convergence of the finite-step negative sample distribution $s_t$ to its steady state $p_{\theta_t}$ should be a sufficient condition to avoid instability from unchecked expansion. If $U$ has enough capacity, repeated updates of $\theta$ for which $d_{s_t}(\theta_t) < 0$ will eventually create new modes of $p_{\theta_t}$ with lower energy than the training data. In the literature, it is expected that the finite-step MCMC distribution $s_t$ must reach approximate steady-state convergence for learning to be effective.

On the contrary, we find that high-fidelity synthesis is possible, and actually easier to learn, when there is a drastic difference between the finite-step MCMC distribution $s_t$ and true steady-state MCMC samples of $p_{\theta_t}$. An examination of ConvNet potentials learned by existing methods shows that in all cases, running the MCMC sampler for significantly longer than the number of training steps results in samples with dramatically lower energy and unrealistic appearance. Although synthesis is still possible without convergence, it is not appropriate to describe the learned model as an approximate density of the training data because the model lacks a stable representation of learned images. To our knowledge, prior authors are unaware that non-convergent ML learning can still yield effective synthesis.
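One way to check a trained model for this kind of non-convergence is to compare the average energy of chains run for the training length with chains run far longer. The sketch below does this for a fixed toy quadratic energy; the energy, the value of `theta`, the step counts, and the helper names are assumptions chosen for illustration and do not reproduce the paper's experiments.

```python
import random


def energy(x, theta):
    """Toy energy U(x; theta) = 0.5 * theta * x**2 (an assumption)."""
    return 0.5 * theta * x * x


def langevin(x, n_steps, eps, theta, rng):
    """Langevin updates without Metropolis-Hastings correction."""
    for _ in range(n_steps):
        x = x - 0.5 * eps ** 2 * theta * x + eps * rng.gauss(0.0, 1.0)
    return x


def mean_energy(n_chains, n_steps, eps, theta, rng):
    """Average final energy over chains started from uniform noise."""
    total = 0.0
    for _ in range(n_chains):
        x = langevin(rng.uniform(-1.0, 1.0), n_steps, eps, theta, rng)
        total += energy(x, theta)
    return total / n_chains


rng = random.Random(0)
theta, eps = 5.0, 0.1
short_run = mean_energy(1000, 5, eps, theta, rng)    # training-length chains
long_run = mean_energy(1000, 400, eps, theta, rng)   # far longer than training
gap = short_run - long_run
```

A clearly positive `gap` indicates that training-length samples sit at higher energy than long-run samples, which is the signature of non-convergent sampling described above.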

Given the surprising observation that MCMC convergence is not needed for ML learning with high-fidelity synthesis, it is natural to ask why non-convergent ML learning works in the first place. One possibility is that consistent non-convergence of MCMC samples can act as a proxy for high-temperature convergence. We hypothesize that a sample from a non-convergent MCMC distribution can function as a steady-state sample of the tempered distribution

$$p_{\theta', T}(x) = \frac{1}{Z_T(\theta')}\exp\{-U(x;\theta')/T\}$$

for a related system with $\theta' \approx \theta_t$ and $T > 1$. In this case, the learning gradient (9) for $p_{\theta', T}$ is approximately a rescaling of the learning gradient (9) for $p_{\theta_t}$, and learning proceeds in a similar manner for both systems. Consistent initialization and consistent finite-step MCMC implementation can be a stand-in for steady-state sampling. Non-convergent learning yields high-quality synthesis with an efficient sampling phase, but high-fidelity and stable sampling from $p_{\theta_t}$ is not possible for non-convergent models.
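The rescaling claim can be checked with a short derivation. As an assumption for illustration, write the tempered density at a fixed temperature $T > 0$ as $p_{\theta,T}(x) = \exp\{-U(x;\theta)/T\} / Z_T(\theta)$, with $Z_T(\theta) = \int \exp\{-U(x;\theta)/T\}\,dx$; this notation is ours. Differentiating the log normalizer gives

$$\frac{d}{d\theta}\log Z_T(\theta) = -\frac{1}{T}\,E_{p_{\theta,T}}\!\left[\frac{\partial}{\partial\theta}U(X;\theta)\right],$$

so the gradient of the KL divergence between the data distribution $q$ and the tempered model is

$$\frac{d}{d\theta}D_{KL}(q \,\|\, p_{\theta,T}) = \frac{1}{T}\left(E_q\!\left[\frac{\partial}{\partial\theta}U(X;\theta)\right] - E_{p_{\theta,T}}\!\left[\frac{\partial}{\partial\theta}U(X;\theta)\right]\right).$$

If the negative samples happen to follow $p_{\theta,T}$, the usual difference of positive and negative energy gradients is therefore $T$ times the exact ML gradient of the tempered model, which is one way to read the statement that the two gradients differ by a rescaling.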

The simplest remedy for non-convergent MCMC is to increase the number of sampling steps. However, we must be careful not to sacrifice the fidelity of synthesized images when learning convergent models. Correct implementation of convergent ML requires a proper treatment of temperature. True sampling, as opposed to energy minimization via gradient descent, is needed.

3.2 Expansion, Contraction, and Temperature

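We observe that the expansion and contraction phases tend to have opposite effects on the Lipschitz norm of $U$. The extreme expansion case will cause $\|U\|_{Lip} \to \infty$ and the extreme contraction case will cause $\|U\|_{Lip} \to 0$. We can easily bound $\|U\|_{Lip}$ by writing $U$ in the compositional form

$$U(x;\theta) = (\phi_K \circ \sigma \circ \phi_{K-1} \circ \cdots \circ \sigma \circ \phi_1)(x),$$

where each $\phi_k$ is a linear layer and $\sigma$ is the activation function. When $\sigma$ is a rectifier unit, we obtain

$$\|U\|_{Lip} \le \prod_{k=1}^{K} \|W_k\|_2,$$

where $W_k$ are the weights of $\phi_k$ and $\|\cdot\|_2$ is the spectral norm [26]. In practice, $\|W_k\|_2$ can be obtained by means of power iteration. This shows that we can analyze the Lipschitz behavior of $U$ by measuring the network spectral norm. In the ideal case, we expect an increase of spectral norm for the early updates of $\theta$ followed by the convergence of $\|U\|_{Lip}$ to some constant, as explained below.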
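Power iteration for the spectral norm can be sketched as follows. The example treats each layer as a small dense matrix stored as a list of rows, which is a simplification: for convolutional layers the same iteration would be applied to the convolution operator. The helper names, iteration count, and example matrices are assumptions for illustration.

```python
import math
import random


def matvec(a, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def spectral_norm(w, n_iter=100, seed=0):
    """Largest singular value of w, estimated by power iteration on w^T w."""
    rng = random.Random(seed)
    wt = transpose(w)
    v = [rng.gauss(0.0, 1.0) for _ in range(len(w[0]))]
    for _ in range(n_iter):
        v = matvec(wt, matvec(w, v))
        scale = norm(v)
        v = [x / scale for x in v]
    return norm(matvec(w, v))


def lipschitz_bound(weights):
    """Product of layer spectral norms (valid for 1-Lipschitz activations such as ReLU)."""
    bound = 1.0
    for w in weights:
        bound *= spectral_norm(w)
    return bound


w1 = [[3.0, 0.0], [0.0, 1.0]]   # first layer, spectral norm 3
w2 = [[0.5, 0.0]]               # output layer, spectral norm 0.5
bound = lipschitz_bound([w1, w2])
```

Tracking `bound` over training updates is one way to see whether the network is in a regime of growing or shrinking gradient strength.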

Following prior art, we use the Langevin Equation (11) to obtain MCMC samples. Since the gradient appears directly in the Langevin equation, the maximum gradient strength plays a central role in sampling. Sampling at a constant step size (which is equivalent to a constant level of noise) will lead to very different behavior depending on the Lipschitz norm. Since the norm can vary drastically over training, balancing Langevin noise and gradient strength is a crucial challenge.

Non-convergent learning can easily become unstable when $\theta$ is updated in the expansion phase for many consecutive iterations. However, the tendency of $U$ to increase Lipschitz norm in the expansion phase provides a natural correction mechanism which ensures that non-convergent learning can still achieve a healthy mix of expansion and contraction. Consecutive updates in the expansion phase will increase $\|U\|_{Lip}$ so that the gradient term can better overcome noise and more quickly reach low-energy regions. Moreover, we find that ML learning can easily learn to tune $\|U\|_{Lip}$ so that the gradient is strong enough to obtain high-fidelity samples from virtually any initial distribution in a small number of steps. This insight is the foundation of our ML learning method with non-informative MCMC initialization.

A pitfall when learning convergent models is the possibility of becoming stuck in the contraction phase. We find that it is possible to learn both non-convergent and convergent models while remaining in the contraction phase throughout training. One can easily induce the contraction phase by removing noise and scaling the gradient coefficient in (11) with the inverse of the gradient magnitude (i.e. dividing the coefficient by $\|\nabla_x U(x; \theta)\|$) to impose a fixed displacement for each image update. Models which are repeatedly updated in the contraction phase exhibit rapid steady-state convergence, but the network spectral norm shrinks to 0 and realistic synthesis is never achieved. This is consistent with the claim that an increase in spectral norm is needed to learn high-frequency features [31].
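The noise-free, normalized update described above can be sketched as follows. This is an illustration under stated assumptions: the function name `contraction_step`, the displacement `delta`, and the two quadratic toy potentials are hypothetical.

```python
import numpy as np

def contraction_step(x, grad_U, delta):
    """Noise-free update with the gradient rescaled to unit length."""
    g = grad_U(x)
    return x - delta * g / (np.linalg.norm(g) + 1e-12)

x = np.array([3.0, -4.0])
delta = 0.05                  # hypothetical fixed displacement per update
weak = lambda v: 0.01 * v     # gradient of a flat quadratic potential
strong = lambda v: 100.0 * v  # gradient of a steep quadratic potential

move_weak = np.linalg.norm(contraction_step(x, weak, delta) - x)
move_strong = np.linalg.norm(contraction_step(x, strong, delta) - x)
print(move_weak, move_strong)
```

Both potentials move the image by the same distance `delta`, which is the sense in which the displacement is fixed regardless of gradient strength.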

When Langevin noise is tuned correctly, the learning dynamics naturally balance between the expansion and contraction phases. The expansion phase will cause the gradients of the potential $U$ to increase in strength, and negative samples from the next iteration will tend to have lower energy. The contraction phase has the opposite effect: the model learns to decrease gradient strength so that the noise corrects the energy of the negative samples to a range that is on par with the positive samples. Introducing enough noise ensures that an arbitrary number of MCMC steps can be used in training without becoming trapped in the contraction phase. In the ideal case, the network spectral norm should eventually level out at a constant value, indicating that $U$ has learned the sampling temperature via gradient strength.

3.3 Learning Algorithm

We now present our algorithm for ML learning. The learning algorithm is essentially the same as earlier work such as [39] that investigates the potential (2). Our experiments show that various auxiliary techniques such as prior distributions, ADAM, weight decay, network normalization, and informative MCMC initializations used in previous studies can be safely removed without sacrificing training stability. We find that training can be more stable in the absence of additional techniques because the parameter gradient is a more direct representation of the loss dynamics. Our intention is not to introduce a novel algorithm but to demonstrate the range of phenomena that can occur with the ML objective based on changes to MCMC sampling.

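input : ConvNet potential $U(x; \theta)$, number of training steps $T$, initial weight $\theta_1$, training images $\{x_i^+\}_{i=1}^{N}$, Langevin step size $\varepsilon$, Langevin noise indicator $\tau \in \{0, 1\}$, Langevin steps $L$, learning rate $\gamma$.
output : Weights $\theta_{T+1}$ for energy $U(x; \theta)$.
for $t = 1, \dots, T$ do
       1. Draw batch images $\{x_i^+\}_{i=1}^{n}$ from training set. Draw initial negative samples $\{Y_i^{(0)}\}_{i=1}^{n}$ from MCMC initialization method (non-informative/noise initialization or informative initialization, see Section 2.3).
       2. Update $\{Y_i\}_{i=1}^{n}$ with
              $$Y_i^{(\ell)} = Y_i^{(\ell-1)} - \frac{\varepsilon^2}{2} \nabla_y U(Y_i^{(\ell-1)}; \theta_t) + \tau \varepsilon Z_{i,\ell},$$
          where $Z_{i,\ell} \sim N(0, I)$, for $L$ steps to obtain negative samples $\{x_i^-\}_{i=1}^{n}$.
       3. Update $\theta_t$ by
              $$\theta_{t+1} = \theta_t - g(\Delta\theta_t, \gamma),$$
          where $\Delta\theta_t$ is the approximate gradient in (10) and $g$ is the SGD or ADAM optimizer.
Algorithm 1 ML Learning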
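The three steps of the algorithm can be sketched on a toy problem. The code below is not a ConvNet implementation: it assumes a one-dimensional potential $U(x; \theta) = \theta x^2 / 2$ so that both gradients are available in closed form, uses uniform noise initialization and plain SGD, and every numerical value (step size, number of steps, learning rate, batch size) is a hypothetical placeholder.

```python
import numpy as np

def grad_x_U(x, theta):
    return theta * x   # d/dx of the toy potential U(x; theta) = theta * x^2 / 2

def grad_theta_U(x):
    return 0.5 * x**2  # d/dtheta of the toy potential

def ml_learning(data, theta=1.0, n_iters=300, batch=256,
                eps=0.3, tau=1, n_langevin=50, lr=2.0, seed=0):
    rng = np.random.default_rng(seed)
    for _t in range(n_iters):
        # 1. Positive batch from the data, negative batch from uniform noise.
        x_pos = rng.choice(data, size=batch)
        x_neg = rng.uniform(-1.0, 1.0, size=batch)
        # 2. Langevin updates; tau switches the noise term on or off.
        for _k in range(n_langevin):
            z = rng.standard_normal(batch)
            x_neg = x_neg - 0.5 * eps**2 * grad_x_U(x_neg, theta) + tau * eps * z
        # 3. SGD step on the difference of average parameter gradients.
        delta_theta = grad_theta_U(x_pos).mean() - grad_theta_U(x_neg).mean()
        theta = theta - lr * delta_theta
    return theta

rng = np.random.default_rng(42)
data = rng.normal(0.0, 0.5, size=2000)  # toy data with variance 0.25
theta_hat = ml_learning(data)
print("learned theta:", theta_hat)
```

For this toy potential the exact ML solution is the inverse data variance, $1/\sigma^2 = 4$. The learned value should land in that neighborhood, with some bias from the finite step size and sampling error.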

Beyond the choice of MCMC initialization and number of updates, our learning method has only four hyper-parameters: 1) the step size/magnitude of noise $\varepsilon$, 2) the indicator $\tau$ to include or exclude Langevin noise, 3) the number of Langevin steps $L$, and 4) the learning rate $\gamma$. Below we give some guidelines and insights for our streamlined algorithm.

  • Step Size and Noise Indicator: We fix the Langevin step size $\varepsilon$ throughout training. Changing $\varepsilon$ can easily lead to becoming trapped in the contraction phase or to unstable dynamics in the expansion phase.

    • For non-tempered dynamics with the noise indicator switched off ($\tau = 0$), MCMC “sampling” reduces to gradient descent with fixed step size. We find that in the absence of auxiliary training techniques, the model will learn to increase gradient strength in proportion to the Euclidean distance between initialized and synthesized images. Once this is achieved the network Lipschitz norm automatically converges. ConvNet potentials appear naturally resistant to the unbounded gradients that can occur if the average negative sample energy cannot stay close to the average positive sample energy.

    • For tempered dynamics with the noise indicator switched on ($\tau = 1$), fixing the step size $\varepsilon$ also fixes the magnitude of noise. We find that the network naturally learns sampling temperature by adjusting gradient strength to approximately match the noise level. The temperature of our convergent ML models is quite cold in the sense that long-run MCMC samples have a consistent appearance. Stable MCMC representation is one of our main contributions.

  • Number of Steps: Higher values of the number of Langevin steps $L$ lead to convergent learning and lower values of $L$ lead to non-convergent learning when the same MCMC initialization method is used.

  • Noise for Non-Convergent ML: For non-convergent training we find noise and step-size have little effect and that only learning rate needs tuning. In this case we set and . Non-convergence itself functions as a proxy for noise/temperature. The implicit temperature of negative samples for a non-convergent model is much higher than the steady state temperature regardless of the level of training noise.

  • Noise for Convergent ML: For convergent training, we find that it is essential to include noise ($\tau = 1$) and precisely tune the step size $\varepsilon$ so that the network learns a true tempered dynamics through the gradient strength. For a given level of noise, we can calibrate the learning rate such that the spectral norm converges to a constant, indicating that the network has learned to balance the gradient strength and noise. An effective noise magnitude for convergent training seems to lie around 0.015.

  • Informative Initialization: Informative MCMC initialization is not needed for stable non-convergent training even with as few as Langevin updates. The model can naturally learn fast pathways to realistic negative samples from an arbitrary initial distribution. On the other hand, informative initialization can be very helpful for convergent learning. We use persistent initialization starting from noise.

  • Optimization: Previous studies employ a variety of auxiliary training techniques such as prior distributions (e.g. Gaussian), weight regularization, batch normalization, layer normalization, and spectral normalization to stabilize sampling and weight updates. We find that additional techniques can actually destabilize training and obscure the true learning dynamics. ML learning with both convergent and non-convergent MCMC works quite naturally for a ConvNet potential with no additional tricks.

  • ADAM: We find that ADAM improves training speed and image quality for non-convergent ML. For convergent ML, ADAM can make steady-state convergence much more difficult and we use SGD instead.

  • Network structure: For the first convolutional layer, we observe that a

    convolution with stride

    helps to avoid checkerboard patterns or other artifacts. Increasing the number of filters and residual connections improves fidelity of synthesis. We do not use any normalization between layers. We strongly believe that batch normalization should not be used when learning energy-based models because energy evaluation becomes non-deterministic.
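One way to read the remark on batch normalization is that, with training-mode batch statistics, the energy assigned to a fixed image depends on which other images share its batch. The sketch below illustrates this with a hypothetical one-layer energy; the functions `energy_with_batchnorm` and `energy_without_norm` and all shapes are assumptions for illustration only.

```python
import numpy as np

def energy_with_batchnorm(batch, w):
    # Training-mode batch norm: features are normalized with the
    # statistics of whatever batch the image happens to be in.
    h = (batch - batch.mean(axis=0)) / (batch.std(axis=0) + 1e-5)
    return np.tanh(h) @ w

def energy_without_norm(batch, w):
    return np.tanh(batch) @ w

rng = np.random.default_rng(0)
w = rng.standard_normal(4)
x = rng.standard_normal(4)  # the image whose energy we track

# The same image placed in two different batches.
batch_a = np.vstack([x, rng.standard_normal((7, 4))])
batch_b = np.vstack([x, 3.0 * rng.standard_normal((7, 4))])

e_bn_a = energy_with_batchnorm(batch_a, w)[0]
e_bn_b = energy_with_batchnorm(batch_b, w)[0]
e_a = energy_without_norm(batch_a, w)[0]
e_b = energy_without_norm(batch_b, w)[0]
print("with batch norm:", e_bn_a, e_bn_b)
print("without normalization:", e_a, e_b)
```

Without normalization the two evaluations of `x` agree, while the batch-normalized energy changes with the batch composition, so the potential is no longer a fixed function of a single image.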

4 Experiments

Figure 6: Short-run samples obtained from a non-convergent energy after 100 Langevin updates from non-informative (uniform noise) initialization. Although the non-convergent energy does not have a valid steady-state, it is capable of high-quality synthesis from a noise signal. From left to right: MNIST, Oxford Flowers 102, CelebA, CIFAR-10.

In this section, we demonstrate the principles of ML learning discussed above and show that we can learn models with new sampling capabilities. We start by training a model with non-informative MCMC initialization throughout training and the same budget as previous ML methods with informative MCMC initialization. We show that the learned model is able to generate realistic and highly diverse samples from a noise signal after training. We then present an efficient recipe for learning convergent models with high-fidelity steady-state distributions. The section is concluded with an experiment demonstrating the unsupervised clustering capabilities of a convergent learned energy.

Figure 7: Illustration of the disconnectivity graph depicting the basin structure of the learned energy function for the Oxford Flowers 102 dataset. Each column displays up to 12 randomly selected basin members ordered by energy. Circles indicate the total number of basin members. Vertical lines encode minima depth in terms of energy and horizontal lines depict the lowest known barrier at which two basins merge in the landscape.

4.1 Non-Convergent ML Learning with Synthesis from Noise

In this experiment, we learn an energy function (2) using ML learning with non-informative noise initialization and short-run MCMC. We apply our ML algorithm with Langevin steps starting from uniform noise images for each update of . We use ADAM with , noise indicator and step size . Noise is not needed because non-convergence already plays the role of temperature.

Previous authors argued that informative MCMC initialization is a key element for successful ML learning, but our learning method can sample from scratch with the same number of Langevin updates. Informative MCMC initialization severely limits the diversity of samples that can be drawn from a noise signal after training. Unlike the models learned by previous authors, our models can generate high-fidelity and diverse images from a noise signal. Our results are shown in Figure 1, Figure 4 (top), and Figure 6.

4.2 Convergent ML Learning

With the correct Langevin noise, one can run an arbitrary number of MCMC updates when generating negative samples during training without becoming trapped in the contraction phase. However, many updates of are needed () before high-quality images can be learned for both convergent and non-convergent models. One can implement stable ML training with non-informative initialization and update the chains until the MCMC samples converge for each parameter update, but we find that this requires steps. The inner loop of Langevin sampling lies at the borderline of computational feasibility when such a large number of MCMC updates are required.

Informative initialization can dramatically reduce the number of MCMC steps needed for convergent learning. By using SGD with a low learning rate , noise indicator and step size , we were able to train convergent models using persistent initialization and sampling steps. We store a large number of persistent images () but only update 100 images for each training batch. Our results are shown in Figure 3 (middle/right) and Figure 4 (bottom).
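Persistent initialization with a large image bank can be sketched as follows. This is a minimal sketch: the class `PersistentBank`, the bank size, and the image shape are hypothetical placeholders, and a small random perturbation stands in for the Langevin updates. Only the choice of updating 100 images per training batch is taken from the text.

```python
import numpy as np

class PersistentBank:
    """Stores persistent MCMC states and hands out a subset per training batch."""

    def __init__(self, bank_size, image_shape, rng):
        self.rng = rng
        # Chains start from uniform noise and then persist across weight updates.
        self.images = rng.uniform(-1.0, 1.0, size=(bank_size, *image_shape))

    def draw(self, batch_size):
        idx = self.rng.choice(len(self.images), size=batch_size, replace=False)
        return idx, self.images[idx].copy()

    def store(self, idx, images):
        self.images[idx] = images

rng = np.random.default_rng(0)
bank = PersistentBank(bank_size=1000, image_shape=(3, 8, 8), rng=rng)  # placeholder sizes
before = bank.images.copy()

idx, x_neg = bank.draw(batch_size=100)
x_neg = x_neg + 0.01 * rng.standard_normal(x_neg.shape)  # stand-in for Langevin updates
bank.store(idx, x_neg)

n_changed = int(np.any(bank.images != before, axis=(1, 2, 3)).sum())
print("bank images updated this iteration:", n_changed)
```

Because only a small subset is refreshed at each iteration, every stored chain accumulates MCMC steps across many weight updates, which is what lets a modest number of steps per iteration suffice.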

4.3 Mapping the Structure of the Image Space after Convergent Learning

A well-formed energy function partitions the image space into meaningful Hopfield basins of attraction. Following [13], we map the structure of a convergent energy. We first identify many metastable MCMC samples. We then sort the metastable samples from lowest energy to highest energy and sequentially group images if travel between samples is possible in a magnetized energy landscape. This process is continued until all minima have been clustered. Our mappings show that the convergent energy has meaningful metastable structures encoding recognizable concepts (Figure 7).

5 Conclusion and Future Work

Our experiments on energy-based models with the form (2) reveal two distinct axes of ML learning. The first axis relates to the convergence of MCMC samples used in learning and the second axis relates to the expansion, contraction, or convergence of the Lipschitz norm of the potential $U$. We use our insights to train models with sampling capabilities that are not obtainable by previous implementations. The informative MCMC initializations used by previous authors are not necessary for high-quality synthesis. By removing this technique we train the first energy functions capable of high-diversity and realistic synthesis from noise initialization after training. We identify a severe defect in the steady-state distributions of prior implementations and introduce the first ConvNet potentials of the form (2) for which steady-state samples have realistic appearance. We hope that our work paves the way for future unsupervised and weakly supervised applications with energy-based models.


The work is supported by DARPA XAI project N66001-17-2-4029; ARO project W911NF1810296; ONR MURI project N00014-16-1-2007; and Extreme Science and Engineering Discovery Environment (XSEDE) grant ASC170063. We thank Prafulla Dhariwal and Anirudh Goyal for helpful discussions.


  • [1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.
  • [2] O. M. Becker and M. Karplus. The topology of multidimensional potential energy surfaces: Theory and application to peptide structure and kinetics. Journal of Chemical Physics, 106(4), 1997.
  • [3] T. Chen, E. Fox, and G. C. Stochastic gradient hamiltonian monte carlo. ICML, 2014.
  • [4] J. Dai, Y. Lu, and Y.-N. Wu. Generative modeling of convolutional neural networks. arXiv preprint arXiv:1412.6296, 2014.
  • [5] Z. Dai, A. Almahairi, P. Bachman, E. Hovy, and A. Courville. Calibrating energy-based generative adversarial networks. arXiv preprint arXiv:1702.01691, 2017.
  • [6] Y. Du and I. Mordatch. Implicit generation and generalization in energy-based models. arXiv preprint arXiv:1903.08689, 2019.
  • [7] D. Erhan, Y. Bengio, A. Courville, and P. Vincent. Visualizing higher-layer features of a deep network. Technical Report, Université de Montréal, 2009.
  • [8] R. Gao, Y. Lu, J. Zhou, S.-C. Zhu, and Y. N. Wu. Learning generative convnets via multi-grid modeling and sampling. CVPR, 2018.
  • [9] S. Geman and D. Geman. Stochastic relaxation, gibbs distribution, and the bayesian restoration of images. IEEE Trans. PAMI, 6:721–741, 1984.
  • [10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
  • [11] A. G. A. P. Goyal, N. R. Ke, S. Ganguli, and Y. Bengio. Variational walkback: Learning a transition operator as a stochastic recurrent net. In Advances in Neural Information Processing Systems, pages 4392–4402, 2017.
  • [12] T. Han, E. Nijkamp, X. Fang, M. Hill, S.-C. Zhu, and Y. N. Wu. Divergence triangle for joint training of generator model, energy-based model, and inference model. arXiv preprint arXiv:1812.10907, 2018.
  • [13] M. Hill, E. Nijkamp, and S.-C. Zhu. Building a telescope to look into high-dimensional image spaces. QAM, 2019.
  • [14] G. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, pages 1771–1800, 2002.
  • [15] G. Hinton, S. Osindero, M. Welling, and Y.-W. Teh. Unsupervised discovery of nonlinear structure using contrastive backpropagation. Cognitive science, 30(4):725–731, 2006.
  • [16] G. E. Hinton. A practical guide to training restricted boltzmann machines. Tech. Rep. UTML TR 2010-003, Dept. Comp. Sci., Univ. Toronto, 2010.
  • [17] J. J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8):2554–2558, 1982.
  • [18] L. Jin, J. Lazarow, and Z. Tu. Introspective learning for discriminative classification. In Advances in Neural Information Processing Systems, 2017.
  • [19] T. Kim and Y. Bengio. Deep directed generative models with energy-based probability estimation. arXiv preprint arXiv:1606.03439, 2016.
  • [20] D. P. Kingma and M. Welling. Auto-encoding variational bayes. ICLR, 2014.
  • [21] R. Kumar, A. Goyal, A. Courville, and Y. Bengio. Maximum entropy generators for energy-based models. arXiv preprint arXiv:1901.08508, 2019.
  • [22] J. Lazarow, L. Jin, and Z. Tu. Introspective neural networks for generative modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2774–2783, 2017.
  • [23] K. Lee, W. Xu, F. Fan, and Z. Tu. Wasserstein introspective neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [24] Y. Lu, S. C. Zhu, and Y. N. Wu. Learning frame models using cnn filters. Thirtieth AAAI Conference on Artificial Intelligence, 2016.
  • [25] A. Mahendran and A. Vedaldi. Visualizing deep convolutional neural networks using natural pre-images. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5188–5196, 2015.
  • [26] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
  • [27] A. Mordvintsev, C. Olah, and M. Tyka. Inceptionism: Going deeper into neural networks. Google Research Blog, 2015.
  • [28] R. M. Neal. Mcmc using hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, Chapter 5, 2011.
  • [29] J. Ngiam, Z. Chen, P. W. Koh, and A. Y. Ng. Learning deep energy models. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 1105–1112, 2011.
  • [30] A. Nguyen, A. Dosovitskiy, J. Yosinski, T. Brox, and J. Clune. Synthesizing the preferred inputs for neuron in neural networks via deep generator networks. NIPS, 2016.
  • [31] N. Rahaman, A. Baratin, D. Arpit, F. Draxler, M. Lin, F. Hamprecht, Y. Bengio, and A. Courville. On the spectral bias of neural networks. 2018.
  • [32] T. Tieleman. Training restricted boltzmann machines using approximations to the likelihood gradient. ICML, pages 1064–1071, 2008.
  • [33] Z. Tu. Learning generative models via discriminative approaches. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2007.
  • [34] D. J. Wales. The energy landscape as a unifying theme in molecular science. Phil. Trans. R. Soc. A, 363:357–377, 2005.
  • [35] M. Welling. Herding dynamical weights to learn. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1121–1128. ACM, 2009.
  • [36] Y. N. Wu, S. C. Zhu, and X. Liu. Equivalence of julesz ensembles and frame models. International Journal of Computer Vision, 38(3):247–265, 2000.
  • [37] J. Xie, Y. Lu, R. Gao, S.-C. Zhu, and Y. N. Wu. Cooperative training of descriptor and generator networks. IEEE transactions on pattern analysis and machine intelligence (PAMI), 2018.
  • [38] J. Xie, Y. Lu, and Y. N. Wu. Cooperative learning of energy-based model and latent variable model via mcmc teaching. AAAI, 2018.
  • [39] J. Xie, Y. Lu, S. C. Zhu, and Y. N. Wu. A theory of generative convnet. International Conference on Machine Learning, 2016.
  • [40] J. Xie, Z. Zheng, R. Gao, W. Wang, S.-C. Zhu, and Y. Nian Wu. Learning descriptor networks for 3d shape synthesis and analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8629–8638, 2018.
  • [41] J. Xie, S.-C. Zhu, and Y. Nian Wu. Synthesizing dynamic patterns by spatial-temporal generative convnet. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7093–7101, 2017.
  • [42] A. L. Yuille. The convergence of contrastive divergences. NIPS, pages 1593–1600, 2004.
  • [43] S.-C. Zhu. Statistical modeling and conceptualization of visual patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(6):691–712, 2003.
  • [44] S.-C. Zhu, Y. N. Wu, and D. Mumford. Filters, random fields and maximum entropy (frame): Toward a unified theory for texture modeling. International Journal of Computer Vision, 27(2):107–126, 1998.