In this paper, we are interested in deep generative models. There is a family of directed deep generative models which can be trained by back-propagation (e.g., Kingma & Welling, 2013; Goodfellow et al., 2014). The other family is the deep energy-based models, including the deep belief network (Hinton et al., 2006) and the deep Boltzmann machine (Salakhutdinov & Hinton, 2009). The building block of deep energy-based models is a bipartite graphical model called the restricted Boltzmann machine (RBM). The RBM consists of two layers, visible and hidden, and can model higher-order correlations among the visible units (visible layer) using the hidden units (hidden layer). Inference is also easier because there are no interactions between the variables within each layer.
The conventional RBM uses Bernoulli units for both the hidden and visible units (Smolensky, 1986). One extension is using Gaussian visible units to model general natural images (Freund & Haussler, 1994). For hidden units, we can also generalize Bernoulli units to the exponential family (Welling et al., 2004; Ravanbakhsh et al., 2016).
Nair & Hinton (2010) propose one special case that uses the Rectified Linear Unit (ReLU) for the hidden layer together with a heuristic sampling procedure, which has promising performance in terms of reconstruction error and classification accuracy. Unfortunately, due to its lack of strict monotonicity, ReLU RBM does not fit within the framework of exponential family RBMs (Ravanbakhsh et al., 2016). Instead, we study leaky-ReLU RBM (leaky RBM) in this work and address two important issues: i) a better training (sampling) algorithm for ReLU RBM; and ii) a better quantification of leaky RBM, i.e., its performance in terms of likelihood.
We study some of the fundamental properties of leaky RBM, including its joint and marginal distributions (Section 2). By analyzing these distributions, we show that the leaky RBM is a union of truncated Gaussian distributions. We further show that training leaky RBM involves underlying positive definite constraints. Because of this, the training can diverge if these constraints are not satisfied. This is an issue that was previously ignored in ReLU RBM, as it was mainly used for pre-training rather than generative modeling. Our contribution in this paper is three-fold: i) we systematically identify and address model constraints in leaky RBM (Section 3); ii) for the training of leaky RBM, we propose a meta algorithm for sampling, which anneals leakiness during the Gibbs sampling procedure (Section 3), and empirically show that it can boost contrastive divergence with faster mixing (Section 5); iii) we demonstrate the power of the proposed sampling algorithm on estimating the partition function. In particular, comparison on several benchmark datasets shows that the proposed method outperforms the conventional AIS (Salakhutdinov & Murray, 2008) (Section 4). Moreover, we provide an incentive for using leaky RBM by showing that the leaky ReLU hidden units perform better than the Bernoulli units in terms of the model log-likelihood (Section 4).
2 Restricted Boltzmann Machine and ReLU
The Boltzmann distribution is defined as $p(x) = e^{-E(x)}/Z$, where $Z = \sum_x e^{-E(x)}$ is the partition function. A restricted Boltzmann machine (RBM) is a Boltzmann distribution with a bipartite structure. It is also the building block for many deep models (e.g., Hinton et al., 2006; Salakhutdinov & Hinton, 2009; Lee et al., 2009), which are widely used in numerous applications (Bengio, 2009). The conventional Bernoulli RBM models the joint probability $p(v, h)$ for the visible units $v$ and the hidden units $h$ as $p(v, h) \propto \exp(-E(v, h))$, where
The parameters are the visible biases, the hidden biases, and the weight matrix $W$. We can derive the conditional probabilities as
where $\sigma(x) = (1 + e^{-x})^{-1}$ is the sigmoid function.
One extension of Bernoulli RBM is replacing the binary visible units by linear units with independent Gaussian noise. The energy function in this case is given by
To simplify the notation, we eliminate and in this paper, and then the energy function is simplified to be . Note that the elimination does not influence the discussion and one can easily extend all the results in this paper to the model that includes and .
The conditional distributions are as follows:
where $\mathcal{N}(\mu, V)$ is a Gaussian distribution with mean $\mu$ and variance $V$.
2.1 ReLU RBM with Continuous Visible Units
From (1) and (2), we can see that the mean of the hidden-unit conditional is the evaluation of a sigmoid function at the response, which is the non-linearity of the hidden units. From this perspective, we can extend the sigmoid function to other functions and thus allow RBM to have more expressive power (Ravanbakhsh et al., 2016). Nair & Hinton (2010) propose to use the rectified linear unit (ReLU) to replace conventional sigmoid hidden units. The activation function is defined as the one-sided function $\max(0, x)$.
However, as shown in Ravanbakhsh et al. (2016), only strictly monotonic activation functions can derive feasible joint and conditional distributions (Nair & Hinton (2010) use the heuristic noisy ReLU for sampling). Therefore, we consider the leaky ReLU (Maas et al., 2013) in this paper. The activation function of leaky ReLU is defined as $\max(cx, x)$, where $c \in (0, 1)$ is the leakiness parameter.
To simplify the notation, we define . By Ravanbakhsh et al. (2016), the conditional probability of the activation is defined as , where is a Bregman divergence and is the base measure. The Bregman divergence of is given by , where with is the anti-derivative of and is the anti-derivative of . We then get the conditional distributions of leaky RBM as
Note that the conditional distribution of the visible unit is
which can also be written as , where and . By having these two conditional distributions, we can train and do inference on a leaky RBM model by using contrastive divergence (Hinton, 2002) or other algorithms (Tieleman, 2008; Tieleman & Hinton, 2009).
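The alternation between the two conditional distributions can be sketched in NumPy. This is a minimal illustration rather than the authors' implementation, and it rests on several assumptions: the simplified model without visible biases or variances, a hidden input of the form `eta = W^T v + b` (the sign convention of `b` is a guess), a hidden conditional that is Gaussian with mean `max(c * eta, eta)` and variance 1 for `eta > 0` or `c` otherwise, and a unit-variance Gaussian visible conditional centered at `W h`. All function names are hypothetical.

```python
import numpy as np


def leaky_relu(eta, c):
    """Leaky ReLU max(c * eta, eta) for a leakiness c in (0, 1]."""
    return np.maximum(c * eta, eta)


def sample_hidden(v, W, b, c, rng):
    """Draw h ~ p(h | v) under the assumed Gaussian conditional."""
    eta = v @ W + b                      # hidden inputs, shape (n_chains, n_hidden)
    var = np.where(eta > 0, 1.0, c)      # assumed variance: 1 on the positive side, c otherwise
    return leaky_relu(eta, c) + np.sqrt(var) * rng.standard_normal(eta.shape)


def sample_visible(h, W, rng):
    """Draw v ~ p(v | h), assumed to be N(W h, I)."""
    mean = h @ W.T
    return mean + rng.standard_normal(mean.shape)


def gibbs_step(v, W, b, c, rng):
    """One full Gibbs sweep v -> h -> v."""
    return sample_visible(sample_hidden(v, W, b, c, rng), W, rng)


rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((4, 3))    # 4 visible units, 3 hidden units
b = np.zeros(3)
v = rng.standard_normal((5, 4))          # 5 parallel chains
v_next = gibbs_step(v, W, b, c=0.1, rng=rng)
```

Each row of `v` is an independent chain, so one call advances all chains by a single sweep.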
3 Training and Sampling from leaky RBM
First, we explore the joint and marginal distributions of the leaky RBM. Given the conditional distributions $p(v \mid h)$ and $p(h \mid v)$, the joint distribution $p(v, h)$ from the general treatment for MRF models given by Yang et al. (2012) is
By (5), we can derive the joint distribution of the leaky-ReLU RBM as
and the marginal distribution as
where $w_j$ is the $j$-th column of $W$.
3.1 Leaky RBM as Union of Truncated Gaussian Distributions
From (6), the marginal probability is determined by the affine constraints or for all . By combinatorics, these constraints divide into at most convex regions . An example with and is shown in Figure 3. If , then we have at most regions.
We discuss the two types of these regions. For bounded regions, such as in Figure 3, the integration of (6) is also bounded, which results in a valid distribution. Before we discuss the unbounded cases, we define , where . For the unbounded region, if is a positive definite (PD) matrix, then the probability density is proportional to a multivariate Gaussian distribution with mean and precision matrix (covariance matrix ) but over an affine-constrained region. Therefore, the distribution of each unbounded region can be treated as a truncated Gaussian distribution.
On the other hand, if is not PD and the region is unbounded, the integration of (6) over the region is divergent (infinite), which cannot result in a valid probability distribution. In practice, with this type of parameter, Gibbs sampling on the conditional distributions will diverge. However, it is infeasible to check exponentially many regions for each gradient update.
Theorem 3.1. If is positive definite, then is also positive definite, for all . The proof is shown in Appendix A. From Theorem 3.1 we can see that if the constraint is PD, then one can guarantee that the distribution of every region is a valid truncated Gaussian distribution. Therefore, we introduce the following projection step for each after the gradient update.
The above projection step (7) can be done by shrinking the singular values to be less than 1. The proof is shown in Appendix B. The training algorithm of the leaky RBM is shown in Algorithm 1. By using the projection step (7), we can treat the leaky RBM as a union of truncated Gaussian distributions, which uses weight vectors to divide the space of visible units into several regions and uses a truncated Gaussian distribution to model each region. Note that the leaky RBM model is different from Su et al. (2016), which uses a truncated Gaussian distribution to model the conditional distribution instead of the marginal distribution. The empirical study of the divergent values and the necessity of the projection step is shown in Appendix C.
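The projection by singular-value shrinkage can be illustrated with a short NumPy sketch. It assumes that the constraint in question is that `I - W W^T` be positive definite, which holds whenever every singular value of `W` is below 1. The function name `project_weights` and the threshold `max_sv = 0.99` are illustrative choices, not values from the paper. The last lines spot-check, on one random choice of per-column factors in [0, 1], the property we read Theorem 3.1 as stating.

```python
import numpy as np


def project_weights(W, max_sv=0.99):
    """Clip the singular values of W so that none exceeds max_sv (< 1)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * np.minimum(s, max_sv)) @ Vt


rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))          # 6 visible units, 4 hidden units
W_proj = project_weights(W)

# After projection, I - W W^T has strictly positive eigenvalues.
min_eig = np.linalg.eigvalsh(np.eye(6) - W_proj @ W_proj.T).min()

# Spot-check: scaling each column's contribution w_j w_j^T by alpha_j in [0, 1]
# cannot reduce the smallest eigenvalue, so the matrix stays positive definite.
alpha = rng.uniform(0.0, 1.0, size=4)
region_matrix = np.eye(6) - (W_proj * alpha) @ W_proj.T
min_eig_region = np.linalg.eigvalsh(region_matrix).min()
```

Clipping leaves singular values already below the threshold untouched, so a weight matrix that satisfies the constraint is returned unchanged up to floating-point error.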
3.2 Sampling from Leaky-ReLU RBM
Gibbs sampling is the core procedure for RBM, including training, inference, and estimating the partition function (Fischer & Igel, 2012; Tieleman, 2008; Salakhutdinov & Murray, 2008). For every task, we start by randomly initializing from an arbitrary distribution , and iteratively sample from the conditional distributions. Gibbs sampling guarantees that the procedure results in the stationary distribution in the long run for any initial distribution . However, if is close to the target distribution , it can significantly reduce the number of iterations needed to reach the stationary distribution.
If we set the leakiness to be 1, then (6) becomes a simple multivariate Gaussian distribution , which can be easily sampled without Gibbs sampling. Also, the projection step (7) guarantees it is a valid Gaussian distribution. Then we decrease the leakiness by a small amount, and use samples from the multivariate Gaussian distribution (leakiness 1) as the initialization to do Gibbs sampling. Note that the distribution of each region is a truncated Gaussian distribution. When we decrease the leakiness by only a small amount, the resulting distribution is a “similar” truncated Gaussian distribution with more concentrated density. From this observation, we can expect the original multivariate Gaussian distribution to serve as a good initialization. The one-dimensional example is shown in Figure 3. We then repeat this procedure until we reach the target leakiness. The algorithm can be seen as annealing the leakiness during the Gibbs sampling procedure. The meta algorithm is shown in Algorithm 2. Next, we show that the proposed sampling algorithm can help both the partition function estimation and the training of leaky RBM.
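The annealing procedure described above can be sketched as follows. This is a hedged illustration, not Algorithm 2 itself. It assumes the same conditionals as elsewhere in our sketches (hidden input `eta = W^T v + b`, hidden conditional with mean `max(c * eta, eta)` and variance 1 or `c`, visible conditional `N(W h, I)`). Under those assumptions the leakiness-1 model is a Gaussian with precision `I - W W^T` and mean `(I - W W^T)^{-1} W b`. The linear leakiness schedule, one Gibbs sweep per leakiness value, and all names are illustrative choices.

```python
import numpy as np


def gibbs_step(v, W, b, c, rng):
    """One Gibbs sweep v -> h -> v at leakiness c (assumed conditionals)."""
    eta = v @ W + b
    mean_h = np.maximum(c * eta, eta)
    var_h = np.where(eta > 0, 1.0, c)
    h = mean_h + np.sqrt(var_h) * rng.standard_normal(eta.shape)
    return h @ W.T + rng.standard_normal(v.shape)


def sample_annealed_leakiness(W, b, c_target, n_anneal, n_chains, rng):
    """Start from the leakiness-1 Gaussian, then anneal the leakiness to c_target."""
    n_visible = W.shape[0]
    precision = np.eye(n_visible) - W @ W.T     # must be positive definite
    cov = np.linalg.inv(precision)
    cov = 0.5 * (cov + cov.T)                   # remove round-off asymmetry
    mean = cov @ (W @ b)
    chol = np.linalg.cholesky(cov)
    v = mean + rng.standard_normal((n_chains, n_visible)) @ chol.T
    # Linear schedule from 1 down to c_target; the first value (1) is skipped
    # because the initial samples are already exact at leakiness 1.
    for c in np.linspace(1.0, c_target, n_anneal + 1)[1:]:
        v = gibbs_step(v, W, b, c, rng)
    return v


rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
W = 0.5 * W / np.linalg.norm(W, 2)              # spectral norm 0.5, so I - W W^T is PD
b = 0.1 * rng.standard_normal(3)
samples = sample_annealed_leakiness(W, b, c_target=0.01, n_anneal=100,
                                    n_chains=1000, rng=rng)
```

The example rescales `W` to spectral norm 0.5 so that the initial Gaussian is guaranteed to be valid without a separate projection step.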
4 Partition Function Estimation
It is known that estimating the partition function of RBM is intractable (Salakhutdinov & Murray, 2008). Existing approaches, including Salakhutdinov & Murray (2008); Grosse et al. (2013); Liu et al. (2015); Carlson et al. (2016) focus on using sampling to approximate the partition function of the conventional Bernoulli RBM instead of the RBM with Gaussian visible units and non-Bernoulli hidden units. In this paper, we focus on extending the classic annealed importance sampling (AIS) algorithm (Salakhutdinov & Murray, 2008) to leaky RBM.
Assuming that we want to estimate the partition function of with and , Salakhutdinov & Murray (2008) start from an initial distribution , where computing the partition function of is tractable and we can draw samples from . They then use the “geometric path” to anneal the intermediate distribution as , where they grid from to . If we let , we can draw samples from by using samples from for via Gibbs sampling. The partition function is then estimated via = , where
Salakhutdinov & Murray (2008) use the initial distribution with independent visible units and without hidden units. Therefore, we extend Salakhutdinov & Murray (2008) to the leaky-ReLU case with , which results in a multivariate Gaussian distribution . Compared with the meta algorithm shown in Algorithm 2 which anneals between leakiness, the extension of Salakhutdinov & Murray (2008) anneals between energy functions.
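To make the AIS estimator with a geometric path concrete, the sketch below runs it on a one-dimensional Gaussian toy problem where the answer is known in closed form. This is not the leaky RBM setting: the target `exp(-x^2 / (2 s^2))`, the standard normal initial distribution, and the use of exact sampling from each intermediate distribution in place of Gibbs transitions are all simplifying assumptions made for illustration, and the function name is hypothetical.

```python
import numpy as np


def ais_log_z_gaussian_toy(s=2.0, n_particles=2000, n_betas=200, seed=0):
    """AIS estimate of log Z for the unnormalized density exp(-x^2 / (2 s^2))."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(0.0, 1.0, n_betas + 1)
    # Geometric path between exp(-x^2 / 2) and exp(-x^2 / (2 s^2)):
    # every intermediate density is exp(-prec_k * x^2 / 2).
    prec = (1.0 - betas) + betas / s**2
    x = rng.standard_normal(n_particles)        # exact samples from p_0
    log_w = np.zeros(n_particles)
    for k in range(1, n_betas + 1):
        # Importance-weight update: log p*_k(x) - log p*_{k-1}(x).
        log_w += -0.5 * (prec[k] - prec[k - 1]) * x**2
        # Transition that leaves p_k invariant (here: an exact draw from p_k).
        x = rng.standard_normal(n_particles) / np.sqrt(prec[k])
    log_z0 = 0.5 * np.log(2.0 * np.pi)          # log partition function of p_0
    shift = log_w.max()                         # log-mean-exp for numerical stability
    return log_z0 + shift + np.log(np.mean(np.exp(log_w - shift)))


estimate = ais_log_z_gaussian_toy()
exact = 0.5 * np.log(2.0 * np.pi) + np.log(2.0)   # log of sqrt(2 * pi) * s with s = 2
```

In an RBM the exact draw would be replaced by a few Gibbs sweeps at each intermediate distribution, which is where the quality of the initial distribution and of the annealing path starts to matter.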
4.1 Study on Toy Examples
As we discussed in Section 3.1, leaky RBM with hidden units is a union of truncated Gaussian distributions. Here we perform a study on the leaky RBM with a small number of hidden units. Since in this example the number of hidden units is small, we can integrate out all possible configurations of . However, integrating a truncated Gaussian distribution with general affine constraints does not have analytical solutions, and several approximations have been developed (e.g., Pakman & Paninski, 2014). To compare our results with the exact partition function, we consider a special case that has the following form:
Compared to (6), it is equivalent to the setting where . Geometrically, every passes through the origin. We further put the additional constraint . Therefore, we divide the whole space into equally-sized regions. A three dimensional example is shown in Figure 3. Then the partition function of this special case has the analytical form
We randomly initialize and use SVD to make the columns mutually orthogonal. Also, we scale to satisfy . The leakiness parameter is set to be . For Salakhutdinov & Murray (2008) (AIS-Energy), we use particles with intermediate distributions. For the proposed method (AIS-Leaky), we use only particles with intermediate distributions. In this small problem we study the cases when the model has and hidden units and visible units. The true log partition function is shown in Table 1 and the difference between and the estimates given by the two algorithms is shown in Table 2.
Table 1: Log partition function.
Table 2: The difference between the true partition function and the estimates of the two algorithms, with standard deviation.
From Table 2, we observe that AIS-Leaky has significantly better and more stable estimates than AIS-Energy, especially when is large. For example, when we increase from to , the bias (difference) of AIS-Leaky only increases from to ; however, the bias of AIS-Energy increases from to . Moreover, we note that AIS-Leaky uses fewer particles and fewer intermediate distributions, and therefore is more computationally efficient than AIS-Energy. We further study the implicit connection between the proposed AIS-Leaky and AIS-Energy in Appendix D, which shows AIS-Leaky is a special case of AIS-Energy under certain conditions.
4.2 Comparison between leaky-ReLU RBM and Bernoulli-Gaussian RBM
It is known that the reconstruction error is not a proper approximation of the likelihood (Hinton, 2012). By having an accurate estimate of the partition function, we can study the power of leaky RBM when our goal is to use the likelihood function as our objective instead of the reconstruction error.
We compare leaky RBM with the Bernoulli-Gaussian RBM, which has Bernoulli hidden units and Gaussian visible units (our GPU implementation with gnumpy and cudamat can reproduce the results of http://www.cs.toronto.edu/ tang/code/GaussianRBM.m). We trained both models with CD-20 and momentum, where CD-n means that contrastive divergence was run for n steps. For both models, we used hidden units. We initialized by sampling from , , and . The momentum parameter was and the batch size was set to . We tuned the learning rate between and . We studied two benchmark data sets, CIFAR10 and SVHN. The data was normalized to have zero mean and standard deviation of for each pixel. The results of the log-likelihood values are reported in Table 3.
From Table 3, leaky RBM outperforms Bernoulli-Gaussian RBM significantly. The unsatisfactory performance of Bernoulli-Gaussian RBM may be in part due to the optimization procedure. If we tune the decay schedule of the learning rate for each dataset in an ad-hoc way, we observe that the performance of Bernoulli-Gaussian RBM can be improved by nats for both datasets. Also, increasing the number of CD steps brings slight improvement. The other possibility is bad mixing during the CD iterations; more advanced algorithms (Tieleman, 2008; Tieleman & Hinton, 2009) may help. Although Nair & Hinton (2010) demonstrate the power of ReLU in terms of reconstruction error and classification accuracy, this does not imply superior generative capability. Our study confirms that leaky RBM could have a much better generative performance compared to Bernoulli-Gaussian RBM.
5 Better Mixing by Annealing Leakiness
In this section, we show that the idea of annealing between leakiness benefits the mixing in Gibbs sampling in other settings. A common procedure for comparing sampling methods for RBM is visualization. Here, we are interested in more quantitative metrics and the practical benefits of improved sampling. For this, we consider optimization performance as the evaluation metric.
The gradient of the log-likelihood function of general RBM models is
In this section, we compare two gradient approximation procedures. The first one is the conventional contrastive divergence (CD) (Hinton, 2002). The second method is using Algorithm 2 (Leaky) with the same number of mixing steps as CD. The experiment setup is the same as that of Section 4.
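As a rough illustration of the first procedure, the sketch below computes a CD-k estimate of the weight gradient for a leaky RBM. It assumes that the gradient takes the usual form of a data-side expectation minus a model-side expectation of `v_i E[h_j | v]`, with `E[h_j | v] = max(c * eta_j, eta_j)` and `eta = W^T v + b`; these forms, the conditionals used in the Gibbs sweep, and the function names are our assumptions rather than expressions taken from the paper.

```python
import numpy as np


def leaky_relu(eta, c):
    """Leaky ReLU max(c * eta, eta), used here as the assumed E[h | v]."""
    return np.maximum(c * eta, eta)


def gibbs_step(v, W, b, c, rng):
    """One Gibbs sweep v -> h -> v at leakiness c (assumed conditionals)."""
    eta = v @ W + b
    var_h = np.where(eta > 0, 1.0, c)
    h = leaky_relu(eta, c) + np.sqrt(var_h) * rng.standard_normal(eta.shape)
    return h @ W.T + rng.standard_normal(v.shape)


def cd_weight_gradient(v_data, W, b, c, k, rng):
    """CD-k estimate of the log-likelihood gradient with respect to W."""
    v_model = v_data.copy()
    for _ in range(k):                          # negative-phase chains start at the data
        v_model = gibbs_step(v_model, W, b, c, rng)
    h_data = leaky_relu(v_data @ W + b, c)      # positive-phase hidden means
    h_model = leaky_relu(v_model @ W + b, c)    # negative-phase hidden means
    return (v_data.T @ h_data - v_model.T @ h_model) / v_data.shape[0]


rng = np.random.default_rng(0)
W = 0.05 * rng.standard_normal((8, 5))          # 8 visible units, 5 hidden units
b = np.zeros(5)
v_batch = rng.standard_normal((32, 8))          # stand-in for a normalized data batch
grad_W = cd_weight_gradient(v_batch, W, b, c=0.01, k=20, rng=rng)
```

Replacing the loop that starts the chains at the data with chains initialized from the leakiness-1 Gaussian and annealed to the target leakiness would give the second procedure compared in this section.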
The results are shown in Figure 4. The proposed sampling procedure is slightly better than typical CD steps. The reason is that we only anneal the leakiness for steps. An accurate estimate requires thousands of steps, as shown in Section 4 when we estimate the partition function. Therefore, the estimated gradient is still inaccurate. However, it still outperforms the conventional CD algorithm, which demonstrates the better mixing power of the proposed sampling algorithm, as we expect.
The drawback of using Algorithm 2 is that sampling from requires computing the mean, the covariance, and the Cholesky decomposition of the covariance matrix in every iteration, which is computationally expensive. We study a mixture algorithm that combines CD with the idea of annealing leakiness. The mixture algorithm replaces the sampling from with sampling from the empirical data distribution. The resulting mix algorithm is almost the same as the CD algorithm, except that it anneals the leakiness over the iterations as in Algorithm 2. The results of the mix algorithm are also shown in Figure 4.
The mix algorithm is slightly worse than the original leaky algorithm, but outperforms the conventional CD algorithm. Starting from the data distribution is biased to , which causes the mix algorithm to perform worse than Algorithm 2. However, by sampling from the data distribution, it is as efficient as the CD algorithm (without additional computation cost). Annealing the leakiness helps the mix algorithm explore different modes of the distribution, which benefits the training. The idea could also be combined with more advanced algorithms (Tieleman, 2008; Tieleman & Hinton, 2009). We also studied the PCD extension of the proposed sampling algorithm; however, its performance is not as stable as CD.
6 Conclusion
In this paper, we study the properties of the distributions of leaky RBM. The study links the leaky RBM model and truncated Gaussian distributions. Also, our study shows and addresses an underlying positive definite constraint of training leaky RBM. Based on our study, we further propose a meta sampling algorithm, which anneals between leakiness during the Gibbs sampling procedure. We first demonstrate that the proposed sampling algorithm is more effective and more efficient in estimating the partition function than the conventional AIS algorithm. Second, we show that the proposed sampling algorithm has a better mixing property under the evaluation via optimization.
A few directions are worth further study. For example, one is how to speed up the naive projection step. One potential direction is to use the barrier function as shown in Hsieh et al. (2011) to avoid the projection step.
References
- Bengio (2009) Y. Bengio. Learning deep architectures for ai. Found. Trends Mach. Learn., 2009.
- Burda et al. (2015) Y. Burda, R. B. Grosse, and R. Salakhutdinov. Accurate and conservative estimates of mrf log-likelihood using reverse annealing. In AISTATS, 2015.
- Carlson et al. (2016) D. E. Carlson, P. Stinson, A. Pakman, and L. Paninski. Partition functions from rao-blackwellized tempered sampling. In ICML, 2016.
- Fischer & Igel (2012) A. Fischer and C. Igel. An introduction to restricted boltzmann machines. In CIARP, 2012.
- Freund & Haussler (1994) Y. Freund and D. Haussler. Unsupervised learning of distributions on binary vectors using two layer networks. Technical report, 1994.
- Goodfellow et al. (2014) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
- Grosse et al. (2013) R. B. Grosse, C. J. Maddison, and R. Salakhutdinov. Annealing between distributions by averaging moments. In NIPS, 2013.
- Hinton (2002) G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 2002.
- Hinton (2012) G. E. Hinton. A practical guide to training restricted boltzmann machines. In Neural Networks: Tricks of the Trade (2nd ed.). 2012.
- Hinton et al. (2006) G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 2006.
- Hsieh et al. (2011) C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, and P. Ravikumar. Sparse inverse covariance matrix estimation using quadratic approximation. In NIPS, 2011.
- Kingma & Welling (2013) D. P. Kingma and M. Welling. Auto-encoding variational bayes. CoRR, 2013.
- Lee et al. (2009) H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 2009.
- Liu et al. (2015) Q. Liu, J. Peng, A. Ihler, and J. Fisher III. Estimating the partition function by discriminance sampling. In UAI, 2015.
- Maas et al. (2013) A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech, and Language Processing, 2013.
- Nair & Hinton (2010) V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, 2010.
- Pakman & Paninski (2014) A. Pakman and L. Paninski. Exact hamiltonian monte carlo for truncated multivariate gaussians. Journal of Computational and Graphical Statistics, 2014.
- Parikh & Boyd (2014) N. Parikh and S. Boyd. Proximal algorithms. Found. Trends Optim., 2014.
- Ravanbakhsh et al. (2016) S. Ravanbakhsh, B. Póczos, J. G. Schneider, D. Schuurmans, and R. Greiner. Stochastic neural networks with monotonic activation functions. In AISTATS, 2016.
- Salakhutdinov & Hinton (2009) R. Salakhutdinov and G. Hinton. Deep Boltzmann machines. In AISTATS, 2009.
- Salakhutdinov & Murray (2008) R. Salakhutdinov and I. Murray. On the quantitative analysis of Deep Belief Networks. In ICML, 2008.
- Smolensky (1986) P. Smolensky. Parallel distributed processing: Explorations in the microstructure of cognition, vol. 1. 1986.
- Su et al. (2016) Q. Su, X. Liao, C. Chen, and L. Carin. Nonlinear statistical learning with truncated gaussian graphical models. In ICML, 2016.
- Theis et al. (2016) L. Theis, A. van den Oord, and M. Bethge. A note on the evaluation of generative models. In ICLR, 2016.
- Tieleman (2008) T. Tieleman. Training restricted boltzmann machines using approximations to the likelihood gradient. In ICML, 2008.
- Tieleman & Hinton (2009) T. Tieleman and G.E. Hinton. Using Fast Weights to Improve Persistent Contrastive Divergence. In ICML, 2009.
- Welling et al. (2004) M. Welling, M. Rosen-Zvi, and G. E. Hinton. Exponential family harmoniums with an application to information retrieval. In NIPS, 2004.
- Yang et al. (2012) E. Yang, P. Ravikumar, G. I. Allen, and Z. Liu. Graphical models via generalized linear models. In NIPS, 2012.
Appendix A Proof of Theorem 3.1
Since , we have . Therefore, . ∎
Appendix B Proof of Theorem 3.1
Appendix C Necessity of the Projection Step
We conduct a short comparison to demonstrate that the projection step is necessary for the leaky RBM on generative tasks. We train two leaky RBMs as follows. The first model is trained with the same setting as in Section 4. We use the convergence of the log-likelihood as the stopping criterion. The second model is trained by CD-1 with weight decay and without the projection step. We stop the training when the reconstruction error is less than . After we train these two models, we run Gibbs sampling with 1000 independent chains for several steps and output the average value of the visible units. Note that the visible units are normalized to zero mean. The results on SVHN and CIFAR10 are shown in Figure 5.
From Figure 5, the model trained with weight decay and without the projection step suffers from the problem of diverging values. This confirms the study shown in Section 3.1. It also implies that we cannot train leaky RBM with larger CD steps when we do not do projection; otherwise, we would have diverging gradients. Therefore, the projection is necessary for training leaky RBM for generative purposes. However, we also observe that the projection step is not necessary for the classification and reconstruction tasks. The reason may be the independence of different evaluation criteria (Hinton, 2012; Theis et al., 2016) or other implicit reasons to be studied.
Appendix D Equivalence between Annealing the Energy and Leakiness
We analyze the performance gap between AIS-Leaky and AIS-Energy. One major difference is the initial distribution. The intermediate marginal distribution of AIS-Energy has the following form:
Here we eliminated the bias terms for simplicity. Compared with Algorithm 2, (11) not only anneals the leakiness when , but also in the case when , which brings more bias to the estimation. In other words, AIS-Leaky is a one-sided leakiness annealing while AIS-Energy is a two-sided leakiness annealing method.
To address the higher bias problem of AIS-Energy, we replace the initial distribution with the one used in Algorithm 2. By elementary calculation, the marginal distribution becomes
which recovers the proposed Algorithm 2. From this analysis, we understand AIS-Leaky is a special case of conventional AIS-Energy with better initialization inspired by the study in Section 3. Also, by this connection between AIS-Energy and AIS-Leaky, we note that AIS-Leaky can be combined with other extensions of AIS (Grosse et al., 2013; Burda et al., 2015) as well.