Neural Variational Inference and Learning in Undirected Graphical Models

11/07/2017 ∙ by Volodymyr Kuleshov, et al. ∙ Stanford University

Many problems in machine learning are naturally expressed in the language of undirected graphical models. Here, we propose black-box learning and inference algorithms for undirected models that optimize a variational approximation to the log-likelihood of the model. Central to our approach is an upper bound on the log-partition function parametrized by a function q that we express as a flexible neural network. Our bound makes it possible to track the partition function during learning, to speed up sampling, and to train a broad class of hybrid directed/undirected models via a unified variational inference framework. We empirically demonstrate the effectiveness of our method on several popular generative modeling datasets.

1 Introduction

Many problems in machine learning are naturally expressed in the language of undirected graphical models. Undirected models are used in computer vision (Zhang et al., 2001), speech recognition (Bilmes, 2004), social science (Scott, 2012), deep learning (Salakhutdinov and Hinton, 2009), and other fields. Many fundamental machine learning problems center on undirected models (Wainwright et al., 2008); however, inference and learning in this class of distributions give rise to significant computational challenges.

Here, we attempt to tackle these challenges via new variational inference and learning techniques aimed at undirected probabilistic graphical models $p$. Central to our approach is an upper bound on the log-partition function of $p$ parametrized by an approximating distribution $q$ that we express as a flexible neural network (Mnih and Gregor, 2014). Our bound is tight when $q = p$ and is convex in the parameters of $q$ for interesting classes of $q$. Most interestingly, it leads to a lower bound on the log-likelihood function $\log p$, which enables us to fit undirected models in a variational framework similar to black-box variational inference (Ranganath et al., 2014).

Our approach offers a number of advantages over previous methods. First, it enables training undirected models in a black-box manner, i.e., we do not need to know the structure of the model to compute gradient estimators (as is required, e.g., in Gibbs sampling); rather, our estimators only require evaluating a model's unnormalized probability. When optimized jointly over $q$ and $p$, our bound also offers a way to track the partition function during learning (Desjardins et al., 2011). At inference time, the learned approximating distribution $q$ may be used to speed up sampling from the undirected model by initializing an MCMC chain (or it may itself provide samples). Furthermore, our approach naturally integrates with recent variational inference methods (Mnih and Gregor, 2014; Mnih and Rezende, 2016) for directed graphical models. We anticipate our approach will be most useful in automated probabilistic inference systems (Tran et al., 2016).

As a practical example of how our methods can be used, we study a broad class of hybrid directed/undirected models and show how they can be trained in a unified black-box neural variational inference framework. Hybrid models like the ones we consider were popular in the early deep learning literature (Hinton et al., 2006; Salakhutdinov and Hinton, 2009) and take inspiration from the principles of neuroscience (Hinton et al., 1995). They also possess a higher modeling capacity for the same number of variables; quite interestingly, we identify settings in which such models are also easier to train.

2 Background

Undirected graphical models.

Undirected models form one of the two main classes of probabilistic graphical models (Koller and Friedman, 2009). Unlike directed Bayesian networks, they may express relationships between variables more compactly when the directionality of a relationship cannot be clearly defined (e.g., between neighboring image pixels).

In this paper, we mainly focus on Markov random fields (MRFs), a type of undirected model corresponding to a probability distribution of the form

$$p_\theta(x) = \frac{\tilde p_\theta(x)}{Z(\theta)},$$

where $\tilde p_\theta(x)$ is an unnormalized probability (also known as energy function) with parameters $\theta$, and $Z(\theta) = \int \tilde p_\theta(x)\,dx$ is the partition function, which is essentially a normalizing constant. Our approach also admits natural extensions to conditional random field (CRF) undirected models.

Importance sampling.

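In general, the partition function of an MRF is often an intractable integral over $\tilde p(x)$. We may, however, rewrite it as

$$I := \int \tilde p_\theta(x)\,dx = \int \frac{\tilde p_\theta(x)}{q(x)}\,q(x)\,dx = \int w(x)\,q(x)\,dx, \tag{1}$$

where $q$ is a proposal distribution and $w(x) := \tilde p_\theta(x)/q(x)$. Integral $I$ can in turn be approximated by a Monte Carlo estimate $\hat I := \frac{1}{n}\sum_{i=1}^n w(x_i)$, where $x_i \sim q$. This approach, called importance sampling (Srinivasan, 2013), may reduce the variance of an estimator and help compute intractable integrals. The variance of an importance sampling estimate $\hat I$ has a closed-form expression: $\frac{1}{n}\left(\mathbb{E}_{q(x)}[w(x)^2] - I^2\right)$. By Jensen's inequality, it equals $0$ when $p = q$.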
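As an illustration of the estimator above, the following minimal sketch compares a uniform proposal with the ideal proposal $q = p$. The toy model $\tilde p_\theta(x) = \exp(\theta \cdot x)$ over binary vectors, the parameter values, and the function names are assumptions chosen only because the exact partition function is then available in closed form; none of them come from the paper.

```python
import numpy as np

def log_p_tilde(x, theta):
    """Unnormalized log-probability theta . x of a toy binary model (assumed)."""
    return x @ theta

def importance_estimate(theta, q_probs, n, rng):
    """Estimate Z = sum_x exp(theta . x) using a factorized Bernoulli proposal q."""
    x = (rng.random((n, theta.size)) < q_probs).astype(float)
    log_q = (x * np.log(q_probs) + (1 - x) * np.log1p(-q_probs)).sum(axis=1)
    w = np.exp(log_p_tilde(x, theta) - log_q)  # importance weights p~(x) / q(x)
    return w.mean(), w.var() / n               # estimate and its estimated variance

rng = np.random.default_rng(0)
theta = np.array([0.5, -1.0, 2.0])
exact_Z = np.prod(1.0 + np.exp(theta))         # closed form for this toy model

# Uniform proposal: unbiased, but with nonzero variance.
z_unif, var_unif = importance_estimate(theta, np.full(3, 0.5), 5000, rng)

# Ideal proposal q = p: every weight equals Z, so the variance vanishes.
p_marginals = 1.0 / (1.0 + np.exp(-theta))
z_ideal, var_ideal = importance_estimate(theta, p_marginals, 5000, rng)
```

With $q = p$ every importance weight equals $Z$, which is the zero-variance case mentioned in the text.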

Variational inference.

Inference in undirected models is often intractable. Variational approaches approximate this process by optimizing the evidence lower bound

$$\log Z(\theta) \geq \max_{q}\ \mathbb{E}_{q(x)}\left[\log \tilde p_\theta(x) - \log q(x)\right]$$

over a distribution $q(x)$; this amounts to finding a $q$ that approximates $p$ in terms of $\mathrm{KL}(q \,\|\, p)$. Ideal $q$'s should be expressive, easy to optimize over, and admit tractable inference procedures. Recent work has shown that neural network-based models possess many of these qualities (Kingma and Welling, 2013; Rezende et al., 2014; Burda et al., 2015).

Auxiliary-variable deep generative models.

Several families of $q$ have been proposed to ensure that the approximating distribution is sufficiently flexible to fit $p$. This work makes use of a class of distributions $q(x, a) = q(x|a)\,q(a)$ that contain auxiliary variables $a$ (Maaløe et al., 2016; Ranganath et al., 2016); these are latent variables that make the marginal $q(x)$ multimodal, which in turn enables it to approximate more closely a multimodal target distribution $p(x)$.

3 Variational Bounds on the Partition Function

This section introduces a variational upper bound on the partition function of an undirected graphical model. We analyze its properties and discuss optimization strategies. In the next section, we use this bound as an objective for learning undirected models.

3.1 A Variational Upper Bound on $Z(\theta)$

We start with the simple observation that the variance of an importance sampling estimator (1) of the partition function naturally yields an upper bound on $Z(\theta)^2$:

$$\mathbb{E}_{q(x)}\left[\frac{\tilde p(x)^2}{q(x)^2}\right] \geq Z(\theta)^2. \tag{2}$$

As mentioned above, this bound is tight when $q = p$. Hence, it implies a natural algorithm for computing $Z(\theta)$: minimize (2) over $q$ in some family $\mathcal{Q}$.
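The bound and its tightness can be checked exactly on a model small enough to enumerate. The sketch below is illustrative only: the pairwise binary model, its random weights, and the helper names are assumptions, and the expectation $\mathbb{E}_q[\tilde p^2/q^2]$ is computed as the exact sum $\sum_x \tilde p(x)^2/q(x)$ rather than by sampling.

```python
import itertools
import numpy as np

def enumerate_states(d):
    """All 2^d binary configurations, one per row."""
    return np.array(list(itertools.product([0.0, 1.0], repeat=d)))

def chi2_bound(log_p_tilde, q):
    """E_q[p~(x)^2 / q(x)^2] = sum_x p~(x)^2 / q(x), computed exactly."""
    return np.sum(np.exp(2.0 * log_p_tilde) / q)

rng = np.random.default_rng(1)
d = 4
X = enumerate_states(d)
W = rng.normal(scale=0.5, size=(d, d))
log_pt = np.einsum("ni,ij,nj->n", X, W, X)   # pairwise (Ising-like) log-potentials
Z = np.exp(log_pt).sum()
p = np.exp(log_pt) / Z

q_random = rng.dirichlet(np.ones(len(X)))    # an arbitrary proposal over the 16 states
bound_random = chi2_bound(log_pt, q_random)  # upper-bounds Z^2
bound_tight = chi2_bound(log_pt, p)          # equals Z^2 when q = p
```

Because $\sum_x \tilde p(x)^2/p(x) = Z \sum_x \tilde p(x) = Z^2$, the second value matches $Z^2$ up to floating-point error.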

We immediately want to emphasize that this algorithm will not be directly applicable to highly peaked and multimodal distributions $\tilde p$ (such as an Ising model near its critical point). If $q$ is initially very far from $\tilde p$, Monte Carlo estimates will tend to under-estimate the partition function.

However, in the context of learning $p$, we may expect a random initialization of $\tilde p$ to be approximately uniform; we may thus fit an initial $q$ to this well-behaved distribution, and as we gradually learn or anneal $p$, $q$ should be able to track $p$ and produce useful estimates of the gradients of $\tilde p$ and of $Z(\theta)$. Most importantly, these estimates are black-box and do not require knowing the structure of $\tilde p$ to compute. We will later confirm that our intuition is correct via experiments.

3.2 Properties of the Bound

Convexity properties.

A notable feature of our objective is that if $\tilde p$ is an exponential family with parameters $\theta$, the bound is jointly log-convex in $\theta$ and $q$. This lends additional credibility to the bound as an optimization objective. If we choose to further parametrize $q$ by a neural net, the resulting non-convexity will originate solely from the network, and not from our choice of loss function.

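To establish log-convexity, it suffices to look at $\tilde p_\theta(x)^2/q(x)$ for one $x$, since the sum of log-convex functions is log-convex. Note that $\log \frac{\tilde p_\theta(x)^2}{q(x)} = 2\log \tilde p_\theta(x) - \log q(x)$. One can easily check that a non-negative concave function is also log-concave, so the second term $-\log q(x)$ is convex in $q$; since $\tilde p_\theta$ is in the exponential family, the first term is linear in $\theta$, and our claim follows.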

Importance sampling.

Minimizing the bound on $Z(\theta)$ may be seen as a form of adaptive importance sampling, where the proposal distribution $q$ is gradually adjusted as more samples are taken (Srinivasan, 2013; Ryu and Boyd, 2014). This provides another explanation for why we need $q \approx p$; note that when $q = p$, the variance is zero, and a single sample computes the partition function, demonstrating that the bound is indeed tight. This also suggests the possibility of taking $\frac{1}{n}\sum_{i=1}^n w(x_i)$, with importance weights $w(x_i) = \tilde p(x_i)/q(x_i)$, as an estimate of the partition function, with the $x_i$ being all the samples that have been collected during the optimization of $q$.

$\chi^2$-divergence minimization.

Observe that optimizing (2) is equivalent to minimizing $\mathbb{E}_{q}\left[\frac{(\tilde p - q)^2}{q^2}\right]$, which is the $\chi^2$-divergence, a type of $\alpha$-divergence with $\alpha = 2$ (Minka et al., 2005; Dieng et al., 2017). This connection highlights the variational nature of our approach and potentially suggests generalizations to other divergences. Moreover, many interesting properties of the bound can be easily established from this interpretation, such as convexity (in functional space).
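The equivalence stated above follows from expanding the square; the short derivation below is a worked step added for illustration, using only $Z = \mathbb{E}_{q}[\tilde p/q]$ and the fact that $Z$ does not depend on $q$.

```latex
\begin{align*}
\mathbb{E}_{q}\!\left[\frac{(\tilde p(x) - q(x))^2}{q(x)^2}\right]
  &= \mathbb{E}_{q}\!\left[\frac{\tilde p(x)^2}{q(x)^2}\right]
     - 2\,\mathbb{E}_{q}\!\left[\frac{\tilde p(x)}{q(x)}\right] + 1 \\
  &= \mathbb{E}_{q}\!\left[\frac{\tilde p(x)^2}{q(x)^2}\right] - 2Z + 1 .
\end{align*}
```

Since $-2Z + 1$ is constant in $q$, both objectives share the same minimizers over $q$.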

3.3 Auxiliary-Variable Approximating Distributions

A key part of our approach is the choice of approximating family $\mathcal{Q}$: it needs to be expressive, easy to optimize over, and admit tractable inference procedures. In particular, since $\tilde p(x)$ may be highly multi-modal and peaked, $q(x)$ should ideally be equally complex. Note that unlike earlier methods that parametrized conditional distributions over hidden variables (e.g., variational autoencoders (Kingma and Welling, 2013)), our setting does not admit a natural conditioning variable, making the task considerably more challenging.

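Here, we propose to address these challenges via an approach based on auxiliary-variable approximations (Maaløe et al., 2016): we introduce a set of latent variables $a$ into $q(x, a) = q(x|a)\,q(a)$, making the marginal $q(x)$ multi-modal. Computing the marginal $q(x)$ may no longer be tractable; we therefore apply the variational principle one more time and introduce an additional relaxation of the form

$$\mathbb{E}_{q(a,x)}\left[\frac{\tilde p(x)^2\, r(a|x)^2}{q(x|a)^2\, q(a)^2}\right] \geq \mathbb{E}_{q(x)}\left[\frac{\tilde p(x)^2}{q(x)^2}\right] \geq Z(\theta)^2, \tag{3}$$

where $r(a|x)$ is a probability distribution over $a$ that lifts $\tilde p$ to the joint space of $(x, a)$. To establish the first inequality, observe that

$$\mathbb{E}_{q(a,x)}\left[\frac{\tilde p(x)^2\, r(a|x)^2}{q(x|a)^2\, q(a)^2}\right] = \mathbb{E}_{q(x)\,q(a|x)}\left[\frac{\tilde p(x)^2\, r(a|x)^2}{q(x)^2\, q(a|x)^2}\right] = \mathbb{E}_{q(x)}\left[\frac{\tilde p(x)^2}{q(x)^2}\; \mathbb{E}_{q(a|x)}\left[\frac{r(a|x)^2}{q(a|x)^2}\right]\right].$$

The factor $\mathbb{E}_{q(a|x)}\left[\frac{r(a|x)^2}{q(a|x)^2}\right]$ is an instantiation of bound (2) for the distribution $r(a|x)$, and is therefore lower-bounded by $1$.

This derivation also sheds light on the role of $r(a|x)$: it is an approximating distribution for the intractable posterior $q(a|x)$. When $r(a|x) = q(a|x)$, the first inequality in (3) is tight, and we are optimizing our initial bound.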
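The chain of inequalities in (3) can be verified numerically on a small discrete example where all expectations are exact sums. This is only a sanity-check sketch: the state-space sizes, the random distributions, and the name `lifted_bound` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_x, n_a = 6, 3

p_tilde = rng.uniform(0.1, 2.0, size=n_x)            # unnormalized target over x
Z = p_tilde.sum()

q_a = rng.dirichlet(np.ones(n_a))                    # q(a)
q_x_given_a = rng.dirichlet(np.ones(n_x), size=n_a)  # row a holds q(x | a)
q_joint = q_a[:, None] * q_x_given_a                 # q(a, x), shape (n_a, n_x)
q_x = q_joint.sum(axis=0)                            # marginal q(x)
q_a_given_x = q_joint / q_x                          # exact posterior q(a | x)

def lifted_bound(r_a_given_x):
    """E_{q(a,x)}[p~(x)^2 r(a|x)^2 / q(a,x)^2], written as an exact sum."""
    return np.sum((p_tilde * r_a_given_x) ** 2 / q_joint)

marginal_bound = np.sum(p_tilde ** 2 / q_x)          # middle term of the chain
r_random = rng.dirichlet(np.ones(n_a), size=n_x).T   # arbitrary r(a | x); columns sum to 1
loose = lifted_bound(r_random)                       # left-hand term with a generic r
tight = lifted_bound(q_a_given_x)                    # left-hand term with r = q(a | x)
```

Setting `r` to the exact posterior recovers the marginal bound, which mirrors the tightness condition discussed above.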

3.3.1 Instantiations of the Auxiliary-Variable Framework

The above formulation is sufficiently general to encompass several different variational inference approaches. Any of them could be used to optimize our objective, although we focus on the last one, as it admits the most flexible approximators for $q$.

Non-parametric variational inference.

First, as suggested by Gershman et al. (2012), we may take $q$ to be a uniform mixture of $K$ exponential families:

$$q(x) = \frac{1}{K}\sum_{k=1}^{K} q_k(x; \phi_k).$$

This is equivalent to letting $a$ be a categorical random variable with a fixed, uniform prior. The $q_k$ may be either Gaussians or Bernoullis, depending on whether $x$ is continuous or discrete. This choice of $q$ lets us potentially model arbitrarily complex $p$ given enough components. Note that for distributions of this form it is easy to compute the marginal $q(x)$ (for small $K$), and the bound in (3) may not be needed.
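For binary $x$, a uniform mixture of factorized Bernoullis can be evaluated and sampled in a few lines. The sketch below assumes fully factorized components and hypothetical function names; it illustrates why the marginal $q(x)$ is cheap for small $K$, and is not the paper's implementation.

```python
import itertools
import numpy as np

def mixture_log_prob(x, probs):
    """log q(x) for a uniform mixture of K factorized Bernoullis.

    x: (n, d) binary array; probs: (K, d) Bernoulli means of the components.
    """
    # log q_k(x) for every sample/component pair, shape (n, K)
    log_qk = x @ np.log(probs).T + (1.0 - x) @ np.log1p(-probs).T
    m = log_qk.max(axis=1, keepdims=True)            # stabilized log-mean-exp over k
    return (m + np.log(np.exp(log_qk - m).mean(axis=1, keepdims=True))).ravel()

def mixture_sample(probs, n, rng):
    """Draw a component index uniformly, then sample x from that component."""
    k = rng.integers(probs.shape[0], size=n)
    return (rng.random((n, probs.shape[1])) < probs[k]).astype(float)

rng = np.random.default_rng(3)
probs = rng.uniform(0.05, 0.95, size=(4, 3))         # K = 4 components over d = 3 bits
all_x = np.array(list(itertools.product([0.0, 1.0], repeat=3)))
total = np.exp(mixture_log_prob(all_x, probs)).sum() # q sums to one over all states
samples = mixture_sample(probs, 1000, rng)
```

The cost of evaluating $\log q(x)$ grows linearly in $K$, which is why the auxiliary bound (3) is only needed when marginalizing over $a$ becomes expensive.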

MCMC-based variational inference.

Alternatively, we may set $q(x|a)$ to be an MCMC transition operator $T(x'|x)$ (or a sequence of operators) as in Salimans et al. (2015). The prior $q(a)$ may be set to a flexible distribution, such as normalizing flows (Rezende and Mohamed, 2015) or another mixture distribution. This gives a distribution of the form

$$q(x, a) = T(x|a)\,q(a). \tag{4}$$

For example, if $\tilde p(x)$ is a Restricted Boltzmann Machine (RBM; Smolensky (1986)), the Gibbs sampling operator $T(x'|x)$ has a closed form that can be used to compute importance samples. This is in contrast to vanilla Gibbs sampling, where there is no closed-form density for weighting samples.

The above approach also has similarities to persistent contrastive divergence (PCD; Tieleman and Hinton (2009)), a popular approach for training RBM models, in which samples are taken from a Gibbs chain that is not reset during learning. The distribution $q(a)$ may be thought of as a parametric way of representing a persistent distribution from which samples are taken throughout learning; like the PCD Gibbs chain, it too tracks the target probability $p$ during learning.

Auxiliary-variable neural networks.

Lastly, we may also parametrize $q$ by a flexible function approximator such as a neural network (Maaløe et al., 2016). More concretely, we set $q(a)$ to a simple continuous prior (e.g., normal or uniform) and set $q_\phi(x|a)$ to an exponential family distribution whose natural parameters are parametrized by a neural net. For example, if $x$ is continuous, we may set $q(x|a) = \mathcal{N}(\mu(a), \sigma(a) I)$, as in a variational auto-encoder. Since the marginal $q(x)$ is intractable, we use the variational bound (3) and parametrize the approximate posterior $r(a|x)$ with a neural network. For example, if $a \sim \mathcal{N}(0, I)$, we may again set $r(a|x) = \mathcal{N}(\mu(x), \sigma(x) I)$.

3.4 Optimization

In the rest of the paper, we focus on the auxiliary-variable neural network approach for optimizing bound (3). This approach affords us the greatest modeling flexibility and allows us to build on previous neural variational inference approaches.

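The key challenge with this choice of representation is optimizing (3) with respect to the parameters of $q$ and $r$. Here, we follow previous work on black-box variational inference (Ranganath et al., 2014; Mnih and Gregor, 2014) and compute Monte Carlo estimates of the gradient of our neural network architecture.

The gradient with respect to the parameters of $r$ has the form $2\,\mathbb{E}_{q(x,a)}\left[\frac{\tilde p(x)^2\, r(a|x)}{q(x,a)^2}\nabla r(a|x)\right]$ and can be estimated directly via Monte Carlo. We use the score function estimator to compute the gradient with respect to the parameters $\phi$ of $q$, which can be written as $-\mathbb{E}_{q(x,a)}\left[\frac{\tilde p(x,a)^2}{q(x,a)^2}\nabla_\phi \log q(x,a)\right]$, with $\tilde p(x,a) := \tilde p(x)\,r(a|x)$, and estimated again using Monte Carlo samples. In the case of a non-parametric variational approximation $\frac{1}{K}\sum_{k=1}^K q_k(x;\phi_k)$, the gradient has a simple expression $\nabla_{\phi_k}\mathbb{E}_{q}\left[\frac{\tilde p(x)^2}{q(x)^2}\right] = -\frac{1}{K}\mathbb{E}_{q_k}\left[\frac{\tilde p(x)^2}{q(x)^2}\,d_k(x)\right]$, where $d_k(x)$ is the difference of $x$ and its expectation under $q_k$.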

Note also that if our goal is to compute the partition function, we may collect all intermediary samples for computing the gradient and use them as regular importance samples. This may be interpreted as a form of adaptive sampling (Ryu and Boyd, 2014).
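The score function estimator described above can be sketched for the simplest case of a single factorized Bernoulli $q$ with logit parameters $\phi$ and no auxiliary variables. Everything specific here is an assumption for illustration: the toy target $\tilde p(x) = \exp(\theta \cdot x)$, the parameter values, and the function names. The Monte Carlo estimate is compared against a finite-difference gradient of the exact objective $\sum_x \tilde p(x)^2/q(x)$.

```python
import itertools
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def log_q(x, phi):
    """Log-density of a factorized Bernoulli with logits phi."""
    return x @ phi - np.log1p(np.exp(phi)).sum()

def score(x, phi):
    """grad_phi log q(x) = x - E_q[x] for logit (natural) parameters."""
    return x - sigmoid(phi)

def objective(phi, X, log_pt):
    """Exact bound sum_x p~(x)^2 / q(x) by enumeration (tiny models only)."""
    return np.sum(np.exp(2.0 * log_pt - log_q(X, phi)))

def score_function_gradient(phi, log_p_tilde_fn, n, rng):
    """Monte Carlo estimate of -E_q[(p~(x)/q(x))^2 grad_phi log q(x)]."""
    x = (rng.random((n, phi.size)) < sigmoid(phi)).astype(float)
    w2 = np.exp(2.0 * (log_p_tilde_fn(x) - log_q(x, phi)))
    return -(w2[:, None] * score(x, phi)).mean(axis=0)

rng = np.random.default_rng(4)
theta = np.array([0.3, -0.2, 0.5])                   # toy target parameters (assumed)
phi = np.array([0.1, 0.0, -0.1])
X = np.array(list(itertools.product([0.0, 1.0], repeat=3)))

grad_mc = score_function_gradient(phi, lambda x: x @ theta, 200_000, rng)
eps = 1e-5
grad_fd = np.array([
    (objective(phi + eps * e, X, X @ theta) - objective(phi - eps * e, X, X @ theta)) / (2 * eps)
    for e in np.eye(3)
])
```

Only evaluations of $\log \tilde p$ are needed, which is what makes the estimator black-box; the price is the high variance addressed next.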

Variance reduction.

A well-known shortcoming of the score function gradient estimator is its high variance, which significantly slows down optimization. We follow previous work (Mnih and Gregor, 2014) and introduce two variance reduction techniques to mitigate this problem.

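We first use a moving average $b$ of $\tilde p(x)^2/q(x)^2$ to center the learning signal. This leads to a gradient estimate of the form $-\mathbb{E}_{q(x)}\left[\left(\frac{\tilde p(x)^2}{q(x)^2} - b\right)\nabla_\phi \log q(x)\right]$; this yields the correct gradient by well-known properties of the score function (Ranganath et al., 2014). Furthermore, we use variance normalization, a form of adaptive step size. More specifically, we keep a running average $\sigma^2$ of the variance of the weights $w(x_i) = \tilde p(x_i)/q(x_i)$ and use a normalized form $g' = g/\max(1, \sigma^2)$ of the original gradient $g$.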
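A minimal sketch of the two variance reduction steps follows. The class name, the decay rate, the initial values, and the choice to normalize by $\max(1, \text{running variance})$ of the signal passed in are assumptions of this sketch rather than details confirmed by the text. The statistics are updated only after they are used, so the baseline stays independent of the current samples.

```python
import numpy as np

class CenteredNormalizedSignal:
    """Running baseline and variance for a scalar learning signal (hypothetical helper)."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.baseline = 0.0   # moving average of the signal
        self.variance = 1.0   # moving average of the per-batch variance

    def __call__(self, signal):
        """Return the centered, variance-normalized signal for one minibatch."""
        centered = signal - self.baseline
        normalized = centered / max(1.0, self.variance)   # assumed normalization rule
        # Update running statistics after use, keeping the baseline sample-independent.
        self.baseline = self.decay * self.baseline + (1 - self.decay) * signal.mean()
        self.variance = self.decay * self.variance + (1 - self.decay) * signal.var()
        return normalized

tracker = CenteredNormalizedSignal(decay=0.9)
first = tracker(np.array([1.0, 3.0]))    # baseline 0 and variance 1 leave the signal unchanged
second = tracker(np.array([2.0, 2.0]))   # now centered by the updated baseline
```

The returned values would multiply the score $\nabla_\phi \log q(x)$ in place of the raw signal.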

Note that unlike the standard evidence lower bound, we cannot define a sample-dependent baseline, as we are not conditioning on any sample. Likewise, many advanced gradient estimators (Mnih and Rezende, 2016) do not apply in our setting. Developing better variance reduction techniques for this setting is likely to help scale the method to larger datasets.

4 Neural Variational Learning of Undirected Models

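Next, we turn our attention to the problem of learning the parameters of an MRF. Given data $D = \{x^{(i)}\}_{i=1}^n$, our training objective is the log-likelihood

$$\log p(D \mid \theta) := \frac{1}{n}\sum_{i=1}^n \log p_\theta(x^{(i)}) = \frac{1}{n}\sum_{i=1}^n \log \tilde p_\theta(x^{(i)}) - \log Z(\theta). \tag{5}$$

We can use our earlier bound to upper bound the log-partition function by $\frac{1}{2}\log \mathbb{E}_{x\sim q}\left[\frac{\tilde p_\theta(x)^2}{q(x)^2}\right]$. By our previous discussion, this expression is convex in $\theta, q$ if $\tilde p_\theta$ is an exponential family distribution. The resulting lower bound on the log-likelihood may be optimized jointly over $\theta, q$; as discussed earlier, by training $p$ and $q$ jointly, the two distributions may help each other. In particular, we may start learning at an easy $\theta$ (where $p$ is not too peaked) and use $q$ to slowly track $p$, thus controlling the variance in the gradient.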

Linearizing the logarithm.

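Since the log-likelihood contains the logarithm of the bound (2), our Monte Carlo samples will produce biased estimates of the gradient. We did not find this to pose problems in practice; however, to ensure unbiased gradients one may further linearize the log using the identity $\log(x) \leq ax - \log(a) - 1$, which is tight for $a = 1/x$. Together with our bound on the log-partition function, this yields

$$\log p(D \mid \theta) \geq \frac{1}{n}\sum_{i=1}^n \log \tilde p_\theta(x^{(i)}) - \frac{1}{2}\left(a\,\mathbb{E}_{x\sim q}\left[\frac{\tilde p_\theta(x)^2}{q(x)^2}\right] - \log(a) - 1\right), \tag{6}$$

which holds for every $q$ and every $a > 0$ and is maximized jointly over $\theta$, $q$, and $a$.

This expression is convex in each of $(\theta, q)$ and $a$, but is not jointly convex. However, it is straightforward to show that equation (6) and its unlinearized version have a unique point satisfying first-order stationarity conditions. This may be done by writing out the KKT conditions of both problems and using the fact that $a = 1/Z(\theta)^2$ at the optimum. See Gopal and Yang (2013) for more details.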
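The linearization identity used above is easy to check numerically. The sketch below evaluates the gap $ax - \log a - 1 - \log x$ on a grid and at the tight choice $a = 1/x$; the grid ranges and the function name are arbitrary choices made for illustration.

```python
import numpy as np

def linearized_log(x, a):
    """Upper bound a*x - log(a) - 1 on log(x), valid for any a > 0."""
    return a * x - np.log(a) - 1.0

x = np.linspace(0.1, 50.0, 200)
a_grid = np.logspace(-3, 2, 300)

# Gap between the linear upper bound and log(x) for every (x, a) pair.
gap = linearized_log(x[:, None], a_grid[None, :]) - np.log(x)[:, None]

# At a = 1/x the bound touches log(x) exactly.
tight_gap = linearized_log(x, 1.0 / x) - np.log(x)
```

Because the bound is linear in $x$, an expectation substituted for $x$ can be replaced by an unbiased Monte Carlo average without introducing bias in the gradient.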

4.1 Variational Inference and Learning in Hybrid Directed/Undirected Models

We apply our framework to a broad class of hybrid directed/undirected models and show how they can be trained in a unified variational inference framework.

The models we consider are best described as variational autoencoders with a Restricted Boltzmann Machine (RBM; Smolensky (1986)) prior. More formally, they are latent-variable distributions of the form $p(x, z) = p(x|z)\,p(z)$, where $p(x|z)$ is an exponential family whose natural parameters are parametrized by a neural network as a function of $z$, and $p(z)$ is an RBM. The latter is an undirected latent variable model with hidden variables $h$ and unnormalized log-probability $\log \tilde p(z, h) = z^\top W h + b^\top z + c^\top h$, where $W, b, c$ are parameters.

We train the model using two applications of the variational principle: first, we apply the standard evidence lower bound with an approximate posterior $r(z|x)$; then, we apply our lower bound on the RBM log-likelihood $\log p(z)$, which yields the objective

$$\log p(x) \geq \mathbb{E}_{r(z|x)}\left[\log p(x|z) + \log p(z) - \log r(z|x)\right] \geq \mathbb{E}_{r(z|x)}\left[\log p(x|z) + \log \tilde p(z) - \log \mathcal{B}(\tilde p, q) - \log r(z|x)\right]. \tag{7}$$

Here, $\mathcal{B}(\tilde p, q)$ denotes the upper bound on the partition function of $p(z)$ implied by (3) (i.e., the square root of its left-hand side), parametrized with $q$. Equation (7) may be optimized using standard variational inference techniques; the terms $r(z|x)$ and $p(x|z)$ do not appear in $\mathcal{B}$ and their gradients may be estimated using REINFORCE and standard Monte Carlo, respectively. The gradients of $\tilde p(z)$ and $q$ are obtained using methods described above. Note also that our approach naturally extends to models with multiple layers of latent directed variables.
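Evaluating the objective requires $\log \tilde p(z)$ with the RBM hidden units summed out, which has a standard closed form for binary hidden units. The sketch below assumes the common parametrization $\log \tilde p(z, h) = z^\top W h + b^\top z + c^\top h$ and hypothetical function names, and checks the closed form against brute-force marginalization on a tiny model.

```python
import itertools
import numpy as np

def rbm_log_p_tilde(z, W, b, c):
    """log sum_h exp(z^T W h + b^T z + c^T h) for binary h, in closed form."""
    return z @ b + np.logaddexp(0.0, z @ W + c).sum(axis=-1)

def rbm_log_p_tilde_bruteforce(z, W, b, c):
    """Same quantity by enumerating all hidden configurations (tiny models only)."""
    hs = np.array(list(itertools.product([0.0, 1.0], repeat=c.size)))
    log_joint = (z @ W) @ hs.T + z @ b + hs @ c      # one entry per configuration of h
    m = log_joint.max()
    return m + np.log(np.exp(log_joint - m).sum())

rng = np.random.default_rng(5)
n_visible, n_hidden = 6, 4
W = rng.normal(scale=0.5, size=(n_visible, n_hidden))
b = rng.normal(size=n_visible)
c = rng.normal(size=n_hidden)
z = (rng.random(n_visible) < 0.5).astype(float)

closed_form = rbm_log_p_tilde(z, W, b, c)
brute_force = rbm_log_p_tilde_bruteforce(z, W, b, c)
```

The closed form costs one matrix-vector product per evaluation, so the unnormalized prior term in the objective stays cheap even though the RBM partition function itself is intractable.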

Such hybrid models are similar in spirit to deep belief networks (Hinton et al., 2006). From a statistical point of view, a latent variable prior makes the model more flexible and allows it to better fit the data distribution. Such models may also learn structured feature representations: previous work has shown that undirected modules may learn classes of digits, while lower, directed layers may learn to represent finer variation (Rolfe, 2016). Finally, undirected models like the ones we study are loosely inspired by the brain and have been studied from that perspective (Hinton et al., 1995). In particular, the undirected prior has been previously interpreted as an associative memory module (Hinton et al., 2006).

5 Experiments

5.1 Tracking the Partition Function

We start with an experiment aimed at visualizing the importance of tracking the target distribution $p$ using $q$ during learning.

We use Equation 6 to optimize the likelihood of an Ising MRF with a coupling factor and randomly chosen unaries. We sampled 1000 examples from the model and fit another Ising model to this data. We followed a non-parametric inference approach with a mixture of Bernoullis. We optimized (6) using SGD and alternated between ten steps over the parameters of $q$ and one step over the parameters of $p$. We drew 100 Monte Carlo samples per mixture component. Our method converged in about 25 steps over $\theta$. At each iteration we computed $\log Z$ via importance sampling.

The adjacent figure shows the evolution of $\log Z$ during learning. It also plots $\log Z$ computed by exact inference, loopy BP, and Gibbs sampling (using the same number of samples). Our method accurately tracks the partition function after about 10 iterations. In particular, our method fares better than the others when the Ising model is entering its phase transition.

5.2 Learning Restricted Boltzmann Machines

Next, we use our method to train Restricted Boltzmann Machines (RBMs) on the UCI digits dataset (Alimoglu et al., 1996), which contains 10,992 images of handwritten digits; we augment this data by moving each image 1px to the left, right, up, and down. We train an RBM with 100 hidden units using ADAM (Kingma and Ba, 2014) with batch size 100; we choose $q$ to be a uniform mixture of Bernoulli distributions. We alternate between training $p$ and $q$, performing either 2 or 10 gradient steps on $q$ for each step on $p$ and taking 30 samples from $q$ per step; the gradients of $p$ are estimated via adaptive importance sampling.

We compare our method against persistent contrastive divergence (PCD; Tieleman and Hinton (2009)), a standard method for training RBMs. The same ADAM settings were used to optimize the model with the PCD gradient. We used 3 Gibbs steps (PCD-3) and 100 persistent chains. Both PCD and our method were implemented in Theano (Bastien et al., 2012).

Figure 1: Learning curves for an RBM trained with PCD-3 and with neural variational inference on the UCI digits dataset. Log-likelihood was computed using annealed importance sampling.

In Figure 1, we plot the true log-likelihood of the model (computed with annealed importance sampling) as a function of the epoch; we use 10 gradient steps on $q$ for each step on $p$. Both PCD and our method achieve comparable performance. Interestingly, we may use our approximating distribution $q$ to estimate the log-likelihood via importance sampling. Figure 1 (right) shows that this estimate closely tracks the true log-likelihood; thus, users may periodically query the model for reasonably accurate estimates of the log-likelihood. In our implementation, neural variational inference was approximately eight times slower than PCD; when performing two gradient steps on $q$, our method was only 50% slower with similar samples and pseudo-likelihood; however, log-likelihood estimates were noisier. Annealed importance sampling was always more than an order of magnitude slower than neural variational inference.

Visualizing the approximating distribution.

Next, we trained another RBM model performing two gradient steps for $q$ for each step of $p$. The adjacent figure shows the mean distribution of each component of the mixture of Bernoullis $q$; one may distinguish in them the shapes of various digits. This confirms that $q$ indeed approximates $p$.

Speeding up sampling from undirected models.

After the model has finished training, we can use the approximating $q$ to initialize an MCMC sampling chain. Since $q$ is a rough approximation of $p$, the resulting chain should mix faster. To confirm this intuition, we plot in the adjacent figure samples from a Gibbs sampling chain that has been initialized randomly (top), as well as from a chain that was initialized with a sample from $q$ (bottom). The latter method reaches a plausible-looking digit in a few steps, while the former produces blurry samples.
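The initialization strategy can be sketched with a standard block Gibbs sampler for a binary RBM. The parameter shapes, the random weights, and the placeholder `q_sample` (standing in for a draw from a learned approximating distribution) are assumptions for illustration; the sketch only shows where the initial state enters the chain.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gibbs_chain(x0, W, b, c, n_steps, rng):
    """Block Gibbs sampling for an RBM with log p~(x, h) = x^T W h + b^T x + c^T h."""
    x = x0.copy()
    for _ in range(n_steps):
        h = (rng.random(c.size) < sigmoid(x @ W + c)).astype(float)   # sample h | x
        x = (rng.random(b.size) < sigmoid(W @ h + b)).astype(float)   # sample x | h
    return x

rng = np.random.default_rng(6)
n_visible, n_hidden = 5, 4
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
b = np.zeros(n_visible)
c = np.zeros(n_hidden)

x_random = (rng.random(n_visible) < 0.5).astype(float)   # random initialization
q_sample = np.ones(n_visible)                            # placeholder for a sample from q

from_random = gibbs_chain(x_random, W, b, c, 10, rng)
from_q = gibbs_chain(q_sample, W, b, c, 10, rng)
```

Both chains target the same distribution; the starting point only affects how many steps are needed before samples look typical of the model.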

5.3 Learning Hybrid Directed/Undirected Models


| Model | Binarized MNIST, Ber(200) | Binarized MNIST, RBM(64,8) | Binarized MNIST, RBM(64,64) | Omniglot, Ber(200) | Omniglot, RBM(64,8) | Omniglot, RBM(64,64) |
|---|---|---|---|---|---|---|
| VAE | 111.9 | 105.4 | 102.3 | 135.1 | 130.2 | 128.5 |
| ADGM | 107.9 | 104.3 | 100.7 | 136.8 | 134.4 | 131.1 |

Table 1: Test set negative log-likelihood on binarized MNIST and Omniglot for VAE and ADGM models with Bernoulli (200 vars) and RBM priors with 64 visible and either 8 or 64 hidden variables.

Next, we use the variational objective (7) to learn two types of hybrid directed/undirected models: a variational autoencoder (VAE) and an auxiliary variable deep generative model (ADGM) (Maaløe et al., 2016). We consider three types of priors: a standard set of 200 uniform Bernoulli variables, an RBM with 64 visible and 8 hidden units, and an RBM with 64 visible and 64 hidden units. In the ADGM, the approximate posterior includes auxiliary variables. All the conditional probabilities are parametrized with dense neural networks with one hidden layer of size 500.

We train all neural networks for 200 epochs with ADAM (same parameters as above) and neural variational inference (NVIL) with control variates as described in Mnih and Rezende (2016). We parametrize $q$ with a neural network mapping 10-dimensional auxiliary variables $a$ to the variables of the RBM prior via one hidden layer of size 32. We show in Table 1 the test set negative log-likelihoods on the binarized MNIST (Larochelle and Murray, 2011) and Omniglot (Burda et al., 2015) datasets; we compute these using Monte Carlo samples and using annealed importance sampling for the RBM.

Overall, adding an RBM prior with as few as 8 hidden variables results in significant log-likelihood improvements. Most interestingly, this prior greatly improves sample quality over the discrete latent variable VAE (Figure 2). Whereas the VAE failed to generate correct digits, replacing the prior with a small RBM resulted in smooth MNIST images. We note that both methods were trained with exactly the same gradient estimator (NVIL). We observed similar behavior for the ADGM model. This suggests that introducing the undirected component made the models more expressive and easier to train.

Figure 2: Samples from a deep generative model using different priors over the discrete latent variables $z$. On the left, the prior $p(z)$ is a Bernoulli distribution (200 vars); on the right, $p(z)$ is an RBM (64 visible and 8 hidden vars). All other parts of the model are held fixed.

6 Related Work and Discussion

Our work is inspired by black-box variational inference (Ranganath et al., 2014) for variational autoencoders and related models (Kingma and Welling, 2013), which involve fitting approximate posteriors parametrized by neural networks. Our work presents analogous methods for undirected models. Popular classes of undirected models include Restricted and Deep Boltzmann Machines (Smolensky, 1986; Salakhutdinov and Hinton, 2009) as well as Deep Belief Networks (Hinton et al., 2006). Closest to our work is the discrete VAE model; however, Rolfe (2016) seeks to efficiently optimize $p(x|z)$, while the RBM prior $p(z)$ is optimized using PCD; our work optimizes $p(x|z)$ using standard techniques and focuses on $p(z)$. Our bound has also been independently studied in directed models (Dieng et al., 2017).

More generally, our work proposes an alternative to sampling-based learning methods; most variational methods for undirected models center on inference. Our approach scales to small and medium-sized datasets, and is most useful within hybrid directed-undirected generative models. It approaches the speed of the PCD method and offers additional benefits, such as partition function tracking and accelerated sampling. Most importantly, our algorithms are black-box, and do not require knowing the structure of the model to derive gradient or partition function estimators. We anticipate that our methods will be most useful in automated inference systems such as Edward (Tran et al., 2016).

The scalability of our approach is primarily limited by the high variance of the Monte Carlo estimates of the gradients and the partition function when $q$ does not fit $p$ sufficiently well. In practice, we found that simple metrics such as pseudo-likelihood were effective at diagnosing this problem. When training deep generative models with RBM priors, we noticed that weak $q$'s introduced mode collapse (but training would still converge). Increasing the complexity of $q$ and using more samples resolved these problems. Finally, we also found that the score function estimator of the gradient of $q$ does not scale well to higher dimensions. Better gradient estimators are likely to further improve our method.

7 Conclusion

In summary, we have proposed new variational learning and inference algorithms for undirected models that optimize an upper bound on the partition function derived from the perspective of importance sampling and $\chi^2$-divergence minimization. Our methods allow training undirected models in a black-box manner and will be useful in automated inference systems (Tran et al., 2016).

Our framework is competitive with sampling methods in terms of speed and offers additional benefits such as partition function tracking and accelerated sampling. Our approach can also be used to train hybrid directed/undirected models using a unified variational framework. Most interestingly, it makes generative models with discrete latent variables both more expressive and easier to train.

Acknowledgements.

This work is supported by the Intel Corporation, Toyota, NSF (grants 1651565, 1649208, 1522054) and by the Future of Life Institute (grant 2016-158687).

References