Estimating the log-likelihood gradient with respect to the parameters of an undirected graphical model, such as a Restricted Boltzmann Machine (RBM, Smolensky, 1986; Hinton, 2002; Fischer & Igel, 2014), is a challenging task. As the analytic calculation of the gradient is infeasible, it is often approximated using Markov Chain Monte Carlo (MCMC) techniques. Getting unbiased samples from the model distribution requires to run the Markov chain until convergence, but in practice the chain is only iterated for a fixed amount of steps and the quality of the samples is often unknown.
Arguably, -step Contrastive Divergence (CD-, Hinton, 2002) is the algorithm most often used for RBM training. In each training iteration of CD- a new Markov chain is initialized with a sample from the dataset, and steps of block-Gibbs-sampling are performed. In practice, is often set to one, which obviously results in a considerable bias of the gradient approximation. Even though the bias is bounded (Fischer & Igel, 2011), it was reported to cause divergence in the long run of the optimization (Fischer & Igel, 2010)
, which is difficult to detect by heuristics(Schulz et al., 2010). Moreover, it has been shown that the gradient field of the CD approximation does not belong to any objective function (Sutskever & Tieleman, 2010).
After the introduction of CD, other sampling schemes were proposed for the gradient approximation, the most prominent ones being Persistent Contrastive Divergence (PCD, Tieleman, 2008) and Parallel Tempering (PT, Salakhutdinov, 2009; Desjardins et al., 2010). The former uses a persistent Monte Carlo Chain during training. While initially started from a training example as in CD, the chain is not discarded after each gradient update but successively reused in following learning iterations. This is done with the hope that the chain stays close to the model distribution, while performing PCD-based gradient ascent. However, this requires a very small learning rate to guarantee that the model distribution changes slowly enough to compensate the small mixing rate of Gibbs-sampling. To solve this problem, PT was suggested for RBM training (Salakhutdinov, 2009; Desjardins et al., 2010). This sampling method runs multiple tempered replica chains. A Metropolis-based swapping operator allows samples to swap between the chains in order to achieve faster mixing. This is paid for by a higher computational cost. While CD is inherently a parallel algorithm that can be applied simultaneously to the full data set, PT performs a step of all parallel Markov chains for every sample. This makes PT a serial algorithm that is difficult to transfer to GPUs, whereas implementing CD on a GPU is straightforward.
This paper introduces a different direction for RBM training that is based on the Population Monte Carlo (PMC) method (Cappé et al., 2004). In PMC sampling, a set of samples is re-weighted using importance sampling after each application of the transition operator so that the weighted samples are unbiased. Furthermore, the weights can be used to re-sample the points. We will present a new algorithm combining importance sampling as in PMC and CD learning. It can be implemented as efficiently as CD without suffering from a high bias in RBMs with a small number of hidden neurons.
2 Rbms and Contrastive Divergence
An RBM is an undirected graphical model with a bipartite structure. The standard binary RBM consists of visible variables taking states and hidden variables taking states
. The joint distribution is a Gibbs distributionwith energy , where
are the weight matrix and the visible and hidden bias vectors, respectively. The normalization constant(also referred to as partition function) is typically unknown, because it is calculated by summing over all possible states of hidden and visible units, which is exponential in .
In this paper, we focus on the problem of maximum log-likelihood fitting of the RBM distribution to a data set . The gradient of w.r.t. the model parameters is given by
While the first term of the derivative can be computed analytically, the calculation of the second term requires to sum over all , which is intractable. Instead, samples are used to obtain an estimate of the expectation under the RBM distribution. The RBM training algorithm most often used in literature is CD, which approximates the expectation over by a sample gained after steps of Gibbs-sampling starting from sample .
2.1 Population Monte Carlo
Being a Markov chain Monte Carlo (MCMC) method, Population Monte Carlo (PMC, Cappé et al., 2004) aims at estimating expectations of the form
When samples from are available, this expectation can be approximated by
Often it is not possible to generate samples from directly. In this case, MCMC methods provide a way to sample from by running steps of a Markov chain that has as its stationary distribution.
In practice, it is computationally too demanding to choose large enough to guarantee convergence. Consequently, the samples generated by the Markov chain are not drawn from but from another distribution which we will refer to as . Thus, using the samples directly in (3) leads to a biased estimate for (2).
In this case, importance sampling, where
is regarded as proposal distribution, could allow for getting an unbiased estimate of (2) based on samples from :
However, the distribution that results from running a Markov chain for -steps is usually unknown and therefore a weighting scheme as in equation (4) can not be applied. Instead one can use any known conditional distribution as proposal distribution for importance sampling since
This re-weighting scheme does not require the knowledge of the probabilitiesbut just the ability to sample from .
Equation (5) requires that we know , but in probabilistic modelling often only the unnormalized probability is available. Thus, needs to be estimated as well. Since for all , we get a biased importance sampling estimate by
where , , and . The estimate becomes unbiased in the limit of and thus the estimator is consistent.
A PMC method makes use of equations (5) and (6) to create a Markov chain, where in step states , are sampled. For the states importance weights are estimated and the states are then re-sampled accordingly. The re-sampling procedure results in the samples which corresponds to the -th state of the chain. The chain thus only depends on the choice of which should be chosen to have heavy tails as to reduce the variance of the estimate.
In this section, we introduce the main contribution of this paper, the Population-Contrastive-Divergence (pop-CD) algorithm for RBM training. This algorithm exploits the features of CD and Population-MCMC by combining the efficient implementation of CD- with the low bias re-weighting scheme introduced by PMC.
Let us rewrite the log-likelihood gradient in equation (1) by setting and estimating based on the re-weighting scheme of equation (5). For this, we choose to be the distribution of the hidden variables generated by -steps of block-wise Gibbs-sampling when starting from the -th training example , which we will call in the following. The proposal distribution is set to the conditional distribution of the visible variables given the state of the hidden, i.e., to . This yields
In the case of we can write and for the distribution can be written as a recursive relation over the distribution of :
Now, based on the considerations in the previous section, it follows from (7) that we can estimate the second term of the gradient given the training set by (6), where , and . This results in the following estimate of the log-likelihood gradient
which we will refer to as pop-CD-. Setting or leads to CD-.
In an actual implementation, it is advisable to work with for increased numerical stability. Algorithm 1 in the Appendix describes pop-CD in pseudo code.
In the following, we will analyse the properties of pop-CD- in more detail. We will analyse the runtime costs as well as the bias of the method, take a closer look at the behavior of the weights, and discuss alternative weighting schemes.
Equation (8) does not produce a noteworthy computational overhead compared to CD, because the only entities to compute additionally are the weights . The weights in turn are inexpensive to compute as the distribution is already computed (see Algorithm 1, line 1) and calculating can reuse , which has already been computed for determining for the gradient update. Thus the cost to compute is negligible compared to the sampling cost.
It is known that equation (6) is biased for two reasons. The first reason is that the same samples are used to estimate the gradient as for the estimation of . The second one is that, even assuming an unbiased estimator for , this does not lead to an unbiased estimate of . It is easy to remove the bias from using the same samples by rewriting (6) as
that is, by simply removing the -th sample from the estimator of when estimating . This yields an estimator which converges in expectation over all possible sets of samples to
which is still biased in the regime of finite sample size since . Applying this result to the problem of estimating the log-likelihood gradient (1), the pop-CD gradient converges to
Thus, using(9), only the length of the right term of the gradient will be biased and not its direction.
In practice, it is not advisable to employ (9) instead of (6) as usually only a small number of samples is used. In this case, the estimate of will have a high variance and the length of the gradient obtained by can vary by several orders of magnitude. This variance leads to a small effective sample size in (6).
3.3 Analysis of Weights
This section gives a deeper insight into the weights of the samples. First, we rewrite the weight formula as
Now it becomes clear that the weight of a sample is large if also has a high probability under and that the weight is small otherwise. Therefore, when puts more probability mass on samples less likely under , this is partly counterbalanced by the weighting scheme. The only case where a low probability sample could lead to a large weight of a sample is when is large for some . However, this would require a much smaller probability of the sample under than under , and, thus, the probability of drawing such a from would be small.
3.4 Alternative formulations
In our approach, the weights solely depend on the current samples . Therefore, only little information about the distribution enters the learning process. Now, one could ask the question whether it is beneficial to use the full sample from the sampling distribution instead and defined in equation (7) the proposal distribution to be . In this case, the weights become . As the typical choice for is the block-wise Gibbs-sampler, we get and , which again depends on a single .
A different approach is to tie several samples of together in a mixture proposal distribution . The advantage of this approach is that the proposal distribution takes distribution information of into account. In fact, . The latter is a good proposal distribution assuming , however, the computational cost of computing the importance weights rises linearly with the number of samples in the mixture.
|Bars & Stripes||1||16||5.34874e-05||0.0242761||0.000527033||0.000156001|
We compared the new pop-CD-
algorithm experimentally to previously proposed methods. We implemented the algorithms in the machine learning libraryShark (Igel et al., 2008) and will make the code of the experiments available in the case of acceptance of the manuscript.
4.1 Experimental Setup
We chose two small artificial datasets, Bars & Stripes (MacKay, 2002) and the data set used by Desjardins et al. (2010) and Brügge et al. (2013) which we will refer to as Artificial Modes, as well as two bigger real world datasets, MNIST (LeCun et al., 1998) and Letters111http://ai.stanford.edu/btaskar/ocr/. Artificial Modes was created to generate distributions for which Gibbs-sampling has problems to mix by setting is control parameter to . The datasets Bars & Stripes and Artificial Modes are known to lead to a bias causing the log-likelihood to diverge for CD and PCD (Fischer & Igel, 2010; Desjardins et al., 2010). This makes them good benchmarks for analysing the bias of different gradient estimation strategies.
In all experiments, except the ones on Bars & Stripes, we performed mini batch learning. We a priori set the size of the batches, and thus also the number of samples generated for the estimation of the model mean, to be a divisor of the full dataset size. The batch sizes were 32 for Bars & Stripes, 500 for Artificial Modes, 246 for Letters, and 500 for MNIST. We did not use a momentum term in all our experiments.
The first set of experiments considered the behaviour of the algorithm in settings in which the normalisation constant of the RBM can be computed analytically. We selected the number of hidden neurons to be 16 and we compared the algorithms CD-1, CD-10, PCD-1, pop-CD-1, pop-CD-10, and PT with 10 parallel chains (and inverse temperatures uniformly distributed in). We choose the learning rates to be and for all datasets. We let the algorithm run for 50000 and 10000 iterations (where one iteration corresponds to one step of gradient ascent) on Bars & Stripes and Artificial Modes, respectively. On MNIST we trained for 5000 iterations with and for 20000 with , and on Letters we trained for 10000 iterations with and 30000 iterations with . We evaluated the log-likelihood every 100 iterations. All experiments were repeated 25 times and the reported log-likelihood values are the means over all trials.
In our second set of experiments, we trained RBMs with a large number of hidden neurons. Here the normalisation constant was estimated using Annealed Importance Sampling (AIS, Neal, 2001) with 512 samples and 50000 intermediate distributions. The learning parameters were with iterations and with iterations for Letters and for MNIST with iterations and with iterations. Due to the long running times we do not consider PCD and PT and only show one trial for MNIST. For Letters, the results are the average over 10 trials.
In a third set of experiments, we investigated the bias and the variance of the methods. We first trained an RBM using CD-1 with and 16 hidden neurons on all four problems. The number of training iterations was 10000 for Bars & Stripes, Artificial Modes and Letters while 5000 on MNIST. Furthermore, we repeated the large experiments by training an RBM with and 500 hidden units for 15000 iterations on MNIST and with , 100 hidden units and iterations on Letters. After that, we calculated a “ground truth “ estimate of the true gradient. This was done by PT with 100 intermediate chains. The chain was given 50000 iterations burn-in time and afterwards 200000 samples were taken to compute the estimate. Then, we generated 50000 gradient estimates of CD- and pop-CD- for . Finally, we estimated the bias of the methods as the squared distance between the empirical average of the methods and the estimated ground truth and the variance as the expectation of the squared distance between the single samples and their mean.
The results of the first set of experiments can be seen in Figure 1 for the artificial datasets and in Figure 2 for MNIST and Letters. Figure 1 shows that pop-CD-1 as well as pop-CD-10 outperformed CD-1 and CD-10. The proposed algorithm reached significantly higher log-likelihood values and, while CD and PCD diverged, pop-CD did not show any sign of divergence. Using and sampling steps lead to almost the same performance of pop-CD. The efficient pop-CD-1 performed on a par or better than the computationally more expensive PT-10. On the two real datasets CD did not cause any divergence. However, while all algorithms performed almost the same on Letters, pop-CD-1 still reaches higher likelihood values than CD-1 on MNIST, which stopped showing improvement after 3000 iterations in Figure 2(a). Furthermore, one can observe that pop-CD performed slightly better with a smaller learning rate of than with a learning rate of .
The results for the second set of experiments are given in Figure 3. The plots for the Letters data set (Figures 3(a) and 3(b)) look qualitatively the same, aside from the scaling of the abscissa. In both cases, CD-1 and CD-10 performed better than pop-CD-1 and pop-CD-10. On MNIST (Figures 3(c) and 3(d)), both CD-1 and CD-10 diverged after the initial learning phase. For pop-CD showed a very slow learning progress, while for all four algorithms exhibit similar behaviour. While all algorithms diverge in the long run, divergence was less pronounced for pop-CD- than CD-.
Table 1 shows the measured bias and variance for the third experimental setup. For the experiments with 16 hidden neurons, CD- always had a smaller variance than pop-CD-, often by an order of magnitude or more. The bias of pop-CD- was in contrast much smaller, which explains the experimental results we observed. In the experiments with 500 hidden units, pop-CD- had a larger bias as well.
In most experiments, the new method performed on a par or even better compared to computational much more expensive sampling techniques such as PT. However, for problems where the bias of CD does not notably impair the learning, pop-CD can be slower than CD, because it may require smaller learning rates for stochastic gradient descent.
The fact that pop-CD fares better with smaller learning rates can be explained by a larger variance, as measured in all our experiments, see Table 1. This is due to the bias-variance-trade-off, which is a well known issue for importance sampling based estimators. While having potentially low bias, the variance of importance sampling based estimators depends on how well the proposal function approximates the target distribution. Thus, our results indicate that the used proposal function can lead to a high variance and may thus not be the optimal choice for complex target distributions. When working with a small sample size compared to the difficulty of the estimation problem or with a bad proposal distribution, the estimate of the normalisation constant can have a large variance, which leads to a biased estimate of the inverse normalisation constant. This is supported by our results in the experiments with a large number of hidden units as well as by the bias estimates, see Table 1.
The observation that more steps of Gibbs-sampling help when using pop-CD for larger models can be explained by the analysis in Section 3.3. The variance of the estimator depends on how well the sampled —and, thus, the proposal distribution —approximates the RBM distribution. The higher the closer gets to . This reduces the variance of the weights and affects the estimator of the gradient as well as the estimator of the normalisation constant. This is supported by the results in Table 1 for the different values of .
Improving the sample quality of the Markov chain (by increasing the number of sampling steps or employing advanced sampling techniques like PT) is not the only direction leading to low bias estimates—importance sampling can also reduce the bias efficiently.
We have proposed a new variant of the Contrastive Divergence (CD) algorithm inspired by the Population Monte Carlo method. The new learning algorithm, termed Population-CD (pop-CD), incorporates importance weights into the sampling scheme of CD. This makes the bias of the log-likelihood gradient estimate independent of the Markov chain and sampling operators used, with negligible computational overhead compared to CD. However, this comes at the cost of an increased variance of the estimate. Further the bias of the method depends on how well the empirical mean of the importance weights estimates the normalisation constant of the distribution.
In contrast to CD, pop-CD is consistent. For problems with a small number of hidden neurons it clearly outperforms CD with the same computational cost, leading to higher likelihood values. However, for RBMs with many hidden neurons, pop-CD may require a large number of samples to achieve a smaller bias than CD—therefore, CD is still recommended in that case. The reason for this is that our current estimator relies on a set of simple proposal distributions, , which do not use information about the distribution of the samples . Thus, the estimator can be seen as replacing the unknown distribution of by a known distribution, which is worse than the distribution of the samples that is available through the Markov Chain. Hence, future work must investigate whether there exists a better choice for the proposal distribution bridging the gap between and the true distribution .
Christian and Oswin acknowledge support from the Danish National Advanced Technology Foundation through project “Personalized breast cancer screening”.
- Brügge et al. (2013) Brügge, K, Fischer, A., and Igel, C. The flip-the-state transition operator for restricted Boltzmann machines. Machine Learning, 13:53–69, 2013.
- Cappé et al. (2004) Cappé, O., Guillin, A., Marin, J.-M., and Robert, C. P. Population Monte Carlo. Journal of Computational and Graphical Statistics, 13(4), 2004.
Desjardins et al. (2010)
Desjardins, G., Courville, A. C., Bengio, Y., Vincent, P., and Delalleau, O.
Tempered Markov chain Monte Carlo for training of restricted
Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS 2010), volume 9 of JMLR: W&C, pp. 145–152, 2010.
Fischer & Igel (2010)
Fischer, A. and Igel, C.
Empirical analysis of the divergence of Gibbs sampling based
learning algorithms for Restricted Boltzmann Machines.
In Diamantaras, K., Duch, W., and Iliadis, L. S. (eds.),
International Conference on Artificial Neural Networks (ICANN 2010), volume 6354 of LNCS, pp. 208–217. Springer-Verlag, 2010.
- Fischer & Igel (2011) Fischer, A. and Igel, C. Bounding the bias of contrastive divergence learning. Neural Computation, 23(3):664–673, 2011.
- Fischer & Igel (2014) Fischer, A. and Igel, C. Training restricted Boltzmann machines: An introduction. Pattern Recognition, 47:25–39, 2014.
- Hinton (2002) Hinton, G. E. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.
- Igel et al. (2008) Igel, C., Glasmachers, T., and Heidrich-Meisner, V. Shark. Journal of Machine Learning Research, 9:993–996, 2008.
- LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- MacKay (2002) MacKay, D. J. C. Information theory, inference, and learning algorithms, volume 7. Cambridge University Press, 2002.
- Neal (2001) Neal, R. M. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 2001.
- Salakhutdinov (2009) Salakhutdinov, R. Learning in Markov random fields using tempered transitions. In Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C. K. I., and Culotta, A. (eds.), Advances in Neural Information Processing Systems 22 (NIPS), pp. 1598–1606, 2009.
Schulz et al. (2010)
Schulz, H., Müller, A., and Behnke, S.
Investigating convergence of restricted Boltzmann machine learning.
In Lee, H., Ranzato, M., Bengio, Y., Hinton, G. E, LeCun, Y., and Ng,
A. Y. (eds.),
NIPS 2010 Workshop on Deep Learning and Unsupervised Feature Learning, 2010.
- Smolensky (1986) Smolensky, P. Information processing in dynamical systems: Foundations of harmony theory. In Rumelhart, D. E. and McClelland, J. L. (eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1: Foundations, pp. 194–281. MIT Press, 1986.
- Sutskever & Tieleman (2010) Sutskever, I. and Tieleman, T. On the convergence properties of contrastive divergence. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS 2010), volume 9 of JMLR: W&C, pp. 789–795, 2010.
- Tieleman (2008) Tieleman, T. Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th International Conference on Machine Learning (ICML 2008), pp. 1064–1071. ACM, 2008.