Maximum likelihood estimation is one of the standard methods for learning the parameters of an RBM. However, evaluating the gradient of the log-likelihood (w.r.t. the model parameters) is computationally expensive (exponential in the minimum of the number of visible and hidden units in the model) since it contains an expectation term w.r.t. the model distribution. This expectation is approximated using samples from the model distribution, obtained using Markov Chain Monte Carlo (MCMC) methods, which are efficient for RBMs due to their bipartite connectivity structure. The popular Contrastive Divergence (CD) algorithm uses samples obtained through such an MCMC procedure. However, the resulting gradient estimate may be poor when the RBM is high dimensional, and this can make simple stochastic gradient descent (SGD) based algorithms such as CD(k) diverge in some cases (Fischer & Igel, 2010).
There are two approaches to making RBM learning more efficient. The first is to design an efficient MCMC method that yields good representative samples from the model distribution and thereby reduces the variance of the estimated gradient (Desjardins et al., 2010; Tieleman & Hinton, 2009). However, advanced MCMC methods are, in general, computationally intensive. The second approach is to design better optimization strategies that are robust to the noise in the estimated gradient (Martens, 2010; Desjardins et al., 2013; Carlson et al., 2015). Most such optimization methods for learning RBMs are second order techniques that need either an approximate Hessian inverse or an estimate of the inverse Fisher matrix (the two differ for the RBM since it contains hidden units). The AdaGrad method (Duchi et al., 2011) uses a diagonal approximation of the Hessian matrix, while TONGA (Roux et al., 2008) assumes a block diagonal structure. The Hessian-Free (H-F) method (Martens, 2010) is an iterative procedure which approximately solves a linear system, obtaining curvature information through matrix-vector products. In Desjardins et al. (2013), the H-F method is used to design a natural gradient descent for learning Boltzmann machines. A sparse Gaussian graphical model is proposed in Grosse & Salakhutdinov (2015) to estimate the inverse Fisher matrix and devise a factorized natural gradient descent procedure. All these methods either need additional computation to solve an auxiliary linear system or are computationally intensive methods that directly estimate the inverse Fisher matrix.
Recently, an algorithm called stochastic difference-of-convex-functions programming (S-DCP) (Upadhya & Sastry, 2017) was proposed for training RBMs. The S-DCP approach, which uses only first order derivatives of the log-likelihood, essentially solves a sequence of convex optimization problems and is shown to perform well compared to other algorithms. More importantly, the computational cost of the S-DCP algorithm can be made identical to that of CD based algorithms with a proper choice of hyperparameters.
Motivated by the simplicity and efficiency of the S-DCP algorithm, in this work we modify it by using a diagonal approximation of the Hessian of the log-likelihood to obtain parameter-specific adaptive learning rates for the gradient descent. We show that the diagonal terms of the Hessian can be expressed in terms of the covariances of the visible and hidden units and can be estimated using the same MCMC samples used for the gradient estimates; the additional computational cost incurred is therefore negligible. Thus the main contribution of the paper is a well-motivated method that automatically adapts the step size to improve the efficiency of learning an RBM. Through extensive empirical investigations we show the effectiveness of the proposed algorithm.
The rest of the paper is organized as follows. In section 2, we first briefly describe the RBM model and the maximum likelihood (ML) learning approach for RBM. We explain the proposed algorithm, the diagonally scaled S-DCP, in section 3. In section 4, we describe the simulation setting and then present the results of our study. Finally, we conclude the paper in section 5.
2.1 Restricted Boltzmann Machines
The Restricted Boltzmann Machine (RBM) is an energy based model with a two layer architecture, in which visible stochastic units $\mathbf{v}$ in one layer are connected to hidden stochastic units $\mathbf{h}$ in the other layer (Smolensky, 1986; Freund & Haussler, 1994; Hinton, 2002). There are no visible-to-visible or hidden-to-hidden connections and the connections between the layers are undirected. The RBM with parameters $\theta$ represents the probability distribution
$$p(\mathbf{v}, \mathbf{h}; \theta) = \frac{e^{-E(\mathbf{v}, \mathbf{h}; \theta)}}{Z(\theta)},$$
where $Z(\theta) = \sum_{\mathbf{v}, \mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h}; \theta)}$ is the normalizing constant, called the partition function, and $E(\mathbf{v}, \mathbf{h}; \theta)$ is the energy function, whose form depends on the type of units (discrete or continuous). In this work, we consider binary units, i.e., $\mathbf{v} \in \{0, 1\}^m$ and $\mathbf{h} \in \{0, 1\}^n$, for which the energy function is defined as
$$E(\mathbf{v}, \mathbf{h}; \theta) = -\sum_{i=1}^{n} \sum_{j=1}^{m} w_{ij} h_i v_j - \sum_{j=1}^{m} b_j v_j - \sum_{i=1}^{n} c_i h_i,$$
where $\theta = \{\mathbf{w}, \mathbf{b}, \mathbf{c}\}$ is the set of model parameters. Here $w_{ij}$, the $(i, j)$-th element of $\mathbf{w}$, is the weight of the connection between the $i$-th hidden unit and the $j$-th visible unit, and $c_i$ and $b_j$ denote the biases of the $i$-th hidden unit and the $j$-th visible unit respectively.
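As a concrete illustration, the energy and the unnormalized probability of a joint configuration can be computed as follows (a minimal sketch; the array names and layout are ours, not the paper's notation):

```python
import numpy as np

def rbm_energy(v, h, W, b, c):
    """Energy of a joint configuration (v, h) of a binary RBM.

    W[i, j] is the weight between hidden unit i and visible unit j;
    b and c are the visible and hidden biases (names are illustrative).
    """
    return -(h @ W @ v + b @ v + c @ h)

def unnormalized_prob(v, h, W, b, c):
    # p(v, h) is proportional to exp(-E(v, h)); the partition function
    # Z(theta), a sum over all joint configurations, normalizes it.
    return np.exp(-rbm_energy(v, h, W, b, c))
```

Note that evaluating $Z(\theta)$ itself requires summing over all $2^{m+n}$ configurations, which is exactly the intractability discussed below.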
2.2 Maximum Likelihood Learning
One of the methods to learn the RBM parameters, $\theta$, is through maximization of the log-likelihood over the training samples. The log-likelihood for a given training sample $\mathbf{v}$ is given by
$$\mathcal{L}(\theta; \mathbf{v}) = \log p(\mathbf{v}; \theta) = \log \sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h}; \theta)} - \log Z(\theta).$$
The optimal RBM parameters can be found by solving the following optimization problem.
Since there is no closed form solution for the above optimization problem, an iterative gradient ascent procedure is used, i.e., the parameters are updated along the gradient of the log-likelihood with a learning rate $\eta$. The gradient contains two expectation terms: one w.r.t. the conditional distribution $p(\mathbf{h} \mid \mathbf{v}; \theta)$ and one w.r.t. the model distribution $p(\mathbf{v}, \mathbf{h}; \theta)$. The expectation under the conditional distribution, for a given $\mathbf{v}$, has a closed form expression and hence is easily evaluated analytically. However, the expectation under the joint density is computationally intractable since the number of terms in the summation grows exponentially with the minimum of the number of hidden and visible units in the model. Hence, sampling methods are used to estimate it.
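The tractable conditional expectations are just sigmoids of affine functions of the other layer, thanks to the bipartite structure. A minimal sketch (variable names ours, assuming the binary RBM defined above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_means(v, W, c):
    # E[h | v] factorizes over hidden units due to the bipartite
    # structure: p(h_i = 1 | v) = sigmoid((W v)_i + c_i).
    return sigmoid(W @ v + c)

def visible_means(h, W, b):
    # E[v | h] = sigmoid(W^T h + b), by symmetry.
    return sigmoid(W.T @ h + b)
```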
Contrastive divergence (Hinton, 2002), a popular algorithm for learning RBMs, uses a single sample, obtained after running a Markov chain for $K$ steps, to approximate this expectation.
Here the sample is obtained after $K$ transitions of a Markov chain (defined by the current parameter values $\theta$) initialized with the training sample $\mathbf{v}$. There exist many variations of the CD algorithm in the literature, such as persistent CD (PCD) (Tieleman, 2008), fast persistent CD (FPCD) (Tieleman & Hinton, 2009), population CD (pop-CD) (Oswin Krause, 2015), and average contrastive divergence (ACD) (Ma & Wang, 2016). Another popular algorithm, parallel tempering (PT) (Desjardins et al., 2010), is also based on MCMC. All these algorithms differ in the way they obtain representative samples from the model distribution for estimating the gradient.
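A CD-$k$ style gradient estimate can be sketched as follows (an illustrative implementation using the standard binary-RBM conditionals, not the authors' exact code):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_gradient(v0, W, b, c, k=1):
    """One CD-k gradient estimate for a binary RBM (illustrative).

    Runs k steps of block Gibbs sampling from the training sample v0
    and returns the difference between the data-dependent term and the
    model term for the weights and both biases.
    """
    v = v0
    for _ in range(k):
        h = (rng.random(c.shape) < sigmoid(W @ v + c)).astype(float)
        v = (rng.random(b.shape) < sigmoid(W.T @ h + b)).astype(float)
    ph0 = sigmoid(W @ v0 + c)   # exact conditional mean at the data
    phk = sigmoid(W @ v + c)    # conditional mean at the k-step sample
    dW = np.outer(ph0, v0) - np.outer(phk, v)
    db = v0 - v
    dc = ph0 - phk
    return dW, db, dc
```

The variants listed above differ mainly in how the chain producing `v` is initialized and maintained across updates.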
The centered gradient (CG) algorithm (Montavon & Müller, 2012) uses the same principle as the CD algorithm to obtain samples; however, while estimating the gradient it subtracts the mean of the training data from the visible variables and the mean of the hidden activations from the hidden variables. This has been seen to improve the conditioning of the underlying optimization problem (Montavon & Müller, 2012).
Another recent algorithm for learning RBMs is the S-DCP (Upadhya & Sastry, 2017). This approach exploits the fact that the RBM log-likelihood is a difference of two convex functions and formulates a learning algorithm using the difference of convex functions program (DCP) method. The S-DCP approach is advantageous since a non-convex problem is solved by iteratively solving a sequence of convex optimization problems. In this work, we propose a diagonally scaled version of S-DCP as explained in the next section.
3 Diagonally scaled S-DCP
The negative log-likelihood can be written as a difference of convex functions, $f(\theta) = f_1(\theta) - f_2(\theta)$, where both $f_1$ and $f_2$ are convex and smooth but $f$ itself is non-convex. The DCP is an iterative procedure defined by
$$\theta^{(t+1)} = \arg\min_{\theta} \left[ f_1(\theta) - \theta^{\top} \nabla f_2(\theta^{(t)}) \right]. \qquad (8)$$
In the RBM setting, $f$ corresponds to the negative log-likelihood function and the functions $f_1$ and $f_2$ are as defined in (3).
In the S-DCP algorithm, the convex optimization problem given by the RHS of (8) is (approximately) solved using a few iterations of gradient descent, for which the gradient of $f_1$ is estimated using samples obtained through MCMC (as in contrastive divergence). Thus it is a stochastic gradient descent on this convex objective for a fixed number of iterations (denoted as $d$). A detailed description of this S-DCP algorithm is given as Algorithm 1. Note that it is possible to choose the hyperparameters ($d$ and the number of MCMC steps per inner iteration) such that the amount of computation required is identical to that of the CD algorithm with $K$ steps (Upadhya & Sastry, 2017).
The S-DCP algorithm can be viewed as two loops. The outer loop is the iteration given by (8). Each iteration here involves a convex optimization which is (approximately) solved by the inner loop of S-DCP through stochastic gradient descent on this convex function. In this paper we propose a scaling of this stochastic gradient descent by using the diagonal elements of the Hessian of this convex function.
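Before the proposed scaling, the baseline two-loop structure can be sketched as follows (a schematic under our reading of S-DCP; names and signatures are ours, not the paper's Algorithm 1):

```python
import numpy as np

def s_dcp_sketch(theta, minibatches, n_outer, d, eta, grad_f1, grad_f2):
    """Schematic two-loop structure of S-DCP (our reading, not the
    authors' exact Algorithm 1).

    grad_f2(theta, batch) is the gradient of the convex term f2,
    linearized at the current iterate (it stays fixed over one inner
    loop); grad_f1(theta) estimates the gradient of the convex term
    f1, in practice via MCMC samples.
    """
    for t in range(n_outer):                  # outer DCP iteration
        batch = minibatches[t % len(minibatches)]
        g2 = grad_f2(theta, batch)            # linearization point of f2
        for _ in range(d):                    # inner convex problem,
            theta = theta - eta * (grad_f1(theta) - g2)  # solved by SGD
    return theta
```

The diagonal scaling proposed below modifies only the inner-loop update.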
The Hessian of the inner objective function equals $\nabla^2 f_1(\theta)$, since the second term in (8) is linear in $\theta$. By substituting the gradient expression from eq. (5) and differentiating once more, $\nabla^2 f_1(\theta)$ is seen to be the covariance matrix, under the model distribution, of the derivatives of the energy function w.r.t. the parameters.
Note that a typical element of $\nabla^2 f_1(\theta)$ is the covariance of a pair of energy derivatives $\frac{\partial E}{\partial \theta_k}$ and $\frac{\partial E}{\partial \theta_l}$, where $\theta_k$ refers to one of the parameters of the RBM, namely one of the $w_{ij}$, $b_j$, $c_i$. The diagonal element corresponding to $w_{ij}$ is
$$\mathrm{Var}(h_i v_j) = \mathbb{E}\left[(h_i v_j)^2\right] - \left(\mathbb{E}[h_i v_j]\right)^2 = \mathbb{E}[h_i v_j] - \left(\mathbb{E}[h_i v_j]\right)^2.$$
We have used the property that $h_i^2 = h_i$ and $v_j^2 = v_j$ (since $h_i$ and $v_j$ are binary random variables) in the above derivation. Similarly, the diagonal terms corresponding to the bias terms are given by
$$\mathbb{E}[v_j] - \left(\mathbb{E}[v_j]\right)^2 \quad \text{and} \quad \mathbb{E}[h_i] - \left(\mathbb{E}[h_i]\right)^2.$$
Therefore, in general, the diagonal terms of $\nabla^2 f_1(\theta)$ are given by
$$\mathrm{Diag}\left(\nabla^2 f_1(\theta)\right) = \mathbf{s} \circ (\mathbf{1} - \mathbf{s}),$$
where $\mathbf{s}$ is the vector of expectations of the energy derivatives (i.e., the $\mathbb{E}[h_i v_j]$, $\mathbb{E}[v_j]$ and $\mathbb{E}[h_i]$), $\circ$ represents element-wise multiplication, $\mathbf{1}$ represents the vector of all ones and $\mathrm{Diag}(A)$ represents the vector consisting of the diagonal elements of the matrix $A$.
By using the above equations, the diagonal elements of the Hessian of $f_1$ can be estimated simply from the corresponding gradient estimates, since the same expectations appear in both. These estimates are used to scale the gradient updates (in the inner loop of S-DCP) as
$$\theta_k \leftarrow \theta_k - \eta \, \frac{[\nabla \tilde{f}(\theta)]_k}{[\mathrm{Diag}(\nabla^2 f_1(\theta))]_k + \lambda},$$
where $\tilde{f}$ denotes the inner convex objective, $[\cdot]_k$ represents the $k$-th element of a vector and $\lambda$ is a small constant added for numerical stability. A detailed description of the proposed algorithm is given as Algorithm 2.
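A single diagonally scaled inner-loop step can be sketched as follows (illustrative; `s` stands for the MCMC estimates of the expectations $\mathbb{E}[h_i v_j]$, $\mathbb{E}[v_j]$, $\mathbb{E}[h_i]$, and `lam` is the stability constant):

```python
import numpy as np

def scaled_step(theta, grad, s, eta, lam=1e-8):
    """One diagonally scaled gradient step on the inner convex
    objective (illustrative sketch).

    For binary units the diagonal Hessian terms are the Bernoulli
    variances s * (1 - s) of the corresponding expectations, so they
    come for free from the same MCMC estimates as the gradient; lam
    guards against division by near-zero curvature.
    """
    diag_hessian = s * (1.0 - s)
    return theta - eta * grad / (diag_hessian + lam)
```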
The inverse of the diagonal approximation of the Hessian provides parameter-specific learning rates for the gradient descent. In the case of the S-DCP algorithm, the inner objective function is convex, so the diagonal terms of its Hessian are nonnegative. We show through empirical analysis on benchmark datasets that this diagonal scaling aids in learning better models.
4 Experiments and Discussions
In this section, we give a detailed comparison between the diagonally scaled S-DCP and other algorithms, namely centered gradient (CG) (Melchior et al., 2016), S-DCP and CS-DCP (centered S-DCP). We do not consider CD, PCD or other variants since CG and the S-DCP based algorithms have already been shown to perform better than them (Melchior et al., 2016; Upadhya & Sastry, 2017). We compare the algorithms while keeping the computational complexity per mini-batch roughly the same, achieved through the choice of the hyperparameters $d$ and the number of MCMC steps in S-DCP and diagonally scaled S-DCP. This ensures that the learning speed in terms of actual time is proportional to the speed in terms of iterations.
4.1 The Experimental Set-up
We consider three benchmark datasets in our analysis, namely Bars & Stripes (MacKay, 2003), MNIST (LeCun et al., 1998), statistically binarized as in Salakhutdinov & Murray (2008), and the CalTech 101 Silhouettes dataset (Marlin, 2009). The Bars & Stripes dataset, with patterns of size $d \times d$, is generated using a two-step procedure as follows. In the first step, all the pixels in each row are set to zero or one with equal probability, and then the pattern is rotated by $90$ degrees with probability $0.5$ in the second step. For a given $d$ this procedure yields $2^{d+1} - 2$ distinct patterns. Both the MNIST and the CalTech 101 Silhouettes datasets have data dimension $28 \times 28$.
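The two-step recipe above can be sketched as follows (a minimal generator; the function name and use of NumPy are ours, and the transpose is equivalent to a 90 degree rotation for these row-constant patterns):

```python
import numpy as np

rng = np.random.default_rng(0)

def bars_and_stripes(d):
    """One d x d Bars & Stripes pattern: each row is set to all-zeros
    or all-ones with equal probability, then the pattern is transposed
    (equivalent here to a 90 degree rotation) with probability 1/2."""
    rows = rng.integers(0, 2, size=d)              # step 1: random rows
    pattern = np.repeat(rows[:, None], d, axis=1)  # constant rows
    if rng.random() < 0.5:                         # step 2: rotate
        pattern = pattern.T
    return pattern
```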
For the Bars & Stripes dataset we consider RBMs with a small number of hidden units, and for the large datasets we consider RBMs with a larger number of hidden units. We evaluate the algorithms using performance measures obtained from multiple trials, where each trial fixes the initial configuration of the weights and biases. The biases of the visible and hidden units are initialized to the inverse sigmoid of the training sample mean and to zero respectively. The weights are initialized to samples drawn from a zero-mean Gaussian distribution with a small standard deviation. We learn the RBM for a fixed number of epochs for each dataset and avoid using any stopping criterion. A mini-batch learning procedure is used and the training dataset is shuffled after every epoch. However, for the Bars & Stripes dataset a full-batch training procedure is used.
We compare the performance of diagonally scaled S-DCP with centered gradient (CG), S-DCP and CS-DCP (centered S-DCP). We keep the computational complexity of S-DCP/CS-DCP roughly the same as that of CG by choosing $d$ and the number of MCMC steps per inner iteration as prescribed in Upadhya & Sastry (2017). Since previous works stressed the necessity of using a large $K$ in CD based algorithms to obtain a sensible generative model (Carlson et al., 2015; Salakhutdinov & Murray, 2008), we use a larger $K$ in CG (with the matching setting for S-DCP/CS-DCP) for the large datasets and a smaller $K$ for the Bars & Stripes dataset. In order to get an unbiased comparison, we did not use momentum or weight decay for any of the algorithms.
For comparison with the centered gradient method, we use the algorithm given in Melchior et al. (2016). The sliding-factor hyperparameters of that algorithm are kept fixed, the visible offset is initialized to the mean of the training data, and the hidden offset is initialized to a constant. The CS-DCP algorithm also uses the same hyperparameter settings.
4.2 Evaluation Criterion
The performance comparison is based on the log-likelihood achieved on the training and test samples. For comparing the speed of learning of different algorithms, the average train log-likelihood is a reasonable measure, while the average test log-likelihood also indicates how well the learnt model generalizes. We show the maximum (over all trials) of the average train and test log-likelihood. The average test log-likelihood (denoted as ATLL) is evaluated as
$$\mathrm{ATLL} = \frac{1}{N} \sum_{i=1}^{N} \log p(\mathbf{v}^{(i)}; \theta),$$
where the $\mathbf{v}^{(i)}$ are the $N$ test samples.
We evaluate the average train log-likelihood similarly, using the training samples rather than the test samples. For small RBMs the above expression can be evaluated exactly. However, for large RBMs we estimate the ATLL with annealed importance sampling (Neal, 2001), with the intermediate distributions chosen according to a linear temperature scale between $0$ and $1$.
It is possible that the learnt model achieves a high log-likelihood even though the learnt distribution is not close to the empirical distribution of the training samples. Therefore, for the Bars & Stripes dataset we evaluate the probability that the model assigns to the training samples and to the samples which are one Hamming distance away from them. As the learning progresses, we ideally expect the model to assign a higher probability to the training samples than to other binary vectors. However, the distribution an RBM represents is smooth, since it is the exponential of an energy function. Therefore, we also evaluate the probability assigned to all the samples which are one Hamming distance away from the training samples. Intuitively, the combined probability assigned to these two sets should approach $1$ as the learning progresses. Therefore, we provide plots for
the total probability assigned by the model to the training samples, and
the total probability assigned by the model to the training samples together with the samples that are one Hamming distance away from them.
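The Hamming-distance-one neighbourhood used in the second measure can be enumerated directly; a minimal sketch (function name is ours):

```python
import numpy as np

def hamming_one_neighbours(samples):
    """All distinct binary vectors at Hamming distance exactly one from
    some vector in `samples` (excluding the samples themselves); used
    here to probe how sharply the model concentrates probability mass
    around the training set."""
    originals = {tuple(int(x) for x in s) for s in samples}
    neighbours = set()
    for s in originals:
        for j in range(len(s)):      # flip each bit once
            t = list(s)
            t[j] = 1 - t[j]
            neighbours.add(tuple(t))
    return [np.array(t) for t in sorted(neighbours - originals)]
```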
However, evaluating the above measures is intractable for the models learnt on the large datasets. Therefore we provide the traditional average test log-likelihood score to evaluate those learnt models. In addition, we also provide samples generated by the learnt models, obtained by randomly initializing the states of the visible units and running an alternating Gibbs sampler for 5000 steps.
4.3 Performance Comparison
In this section, we present experimental results illustrating the performance of diagonally scaled S-DCP (denoted as S-DCP-D) in comparison with the other methods (CG, S-DCP and CS-DCP). We report only the best performance measures for each algorithm, obtained by cross-validating its hyperparameters. We observe that the ATLL achieved by diagonally scaled S-DCP is equal to or greater than that of S-DCP.
Figure 1(b) shows the evolution of the ATLL for the Bars & Stripes dataset and figure 1(a) plots the measures discussed in section 4.2. We observe that the probability assigned to valid samples is higher for S-DCP-D than for S-DCP and CG across epochs. However, the total probability assigned to valid samples and the samples that are one Hamming distance away approaches almost the same value for all the S-DCP based algorithms, and it is better than that achieved by the CG algorithm. From this we can infer that the proposed algorithms are able to minimize the probability assigned to non-valid but similar (Hamming distance wise) examples. We observe similar behaviour in terms of the log-likelihood measure, i.e., all the S-DCP based algorithms achieve almost the same ATLL, which is higher than that of the CG algorithm, as shown in figure 1(b).
Figure 2 shows the evolution of the maximum average log-likelihood on the test and training sets for the MNIST and CalTech datasets. In both cases, the proposed S-DCP-D performs better than the S-DCP and CG algorithms. Also, we observe in figure 2(a) that the S-DCP-D evolution is smoother than that of S-DCP, which indicates the effectiveness of the parameter-specific learning rates. Further, the ATLL evolution in figure 2(a) indicates that the models learnt using S-DCP-D generalize better than those learnt by the other algorithms. The maximum ATLL achieved by S-DCP-D is significantly higher than that of the other methods. The maximum ATLL scores reported for S-DCP and CS-DCP match the earlier study in Upadhya & Sastry (2017). Also, the ATLL achieved by the learnt models is comparable to that of the VAE (Variational Autoencoder) and IWAE (Importance Weighted Autoencoder) models (Burda et al., 2015).
We observe similar behaviour for the CalTech dataset, as shown in figure 2(b). The performance of S-DCP-D is similar to that of CS-DCP and significantly better than that of the S-DCP and CG algorithms on the training dataset. However, in terms of generalization, the models learnt using S-DCP-D are similar to those learnt using S-DCP, as seen from figure 2(b).
The samples generated by the learnt models (on the MNIST dataset) are given in figure 3. The samples generated by S-DCP-D are sharper than those produced by the CG-based model. Also, it can be observed that the samples generated by CG and S-DCP-D are more diverse than those produced by S-DCP and CS-DCP. It is important to note that there exists no precise measure to evaluate a generative model based on generated samples.
5 Conclusions

Learning an RBM is difficult due to the noisy gradient estimates of the log-likelihood obtained through the MCMC procedure. Several advanced optimization techniques have been proposed to achieve fast and stable learning despite this noise. In this work, we proposed a way to estimate the diagonal approximation of the Hessian in order to scale the S-DCP gradient estimates, so as to achieve higher learning efficiency than S-DCP and CG, the currently popular methods for learning RBMs.
The divergence or instability of the log-likelihood during learning may be due to the fact that the estimate of the Hessian of the log-likelihood of an RBM is poor. We exploited the fact that the function S-DCP optimizes in its inner loop is convex and hence has a positive semi-definite Hessian. The resulting algorithm is very similar to S-DCP and incurs negligible additional computational cost. Moreover, its computational complexity can be made identical to that of standard CD based algorithms. Through empirical studies, we illustrated the advantages of diagonally scaled S-DCP over CG on three benchmark datasets.
The main attraction of diagonally scaled S-DCP, in our opinion, is its simplicity compared to other sophisticated optimization techniques which use computationally intensive methods to estimate the Hessian. It is possible to learn an RBM more efficiently if the learning rate is reduced over iterations using a heuristically devised schedule; however, such a schedule often has to be fixed through cross-validation. The proposed approach automatically provides parameter-specific learning rates, which makes the learning procedure both stable and efficient without requiring any cross-validation to fix learning rate schedules. The only additional hyperparameter of the proposed method is the stability constant $\lambda$, which does not affect the learning dynamics much and is there only to control numerical underflow.
- An & Tao  Le Thi Hoai An and Pham Dinh Tao. The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems. Annals of Operations Research, 133(1):23–46, 2005. ISSN 1572-9338.
- Burda et al.  Yuri Burda, Roger B. Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. CoRR, abs/1509.00519, 2015. URL http://arxiv.org/abs/1509.00519.
- Carlson et al.  David Carlson, Volkan Cevher, and Lawrence Carin. Stochastic spectral descent for restricted Boltzmann machines. Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, pp. 111–119, 2015.
- Desjardins et al.  Guillaume Desjardins, Aaron Courville, and Yoshua Bengio. Adaptive parallel tempering for stochastic maximum likelihood learning of RBMs. arXiv preprint arXiv:1012.3476, 2010.
- Desjardins et al.  Guillaume Desjardins, Razvan Pascanu, Aaron C. Courville, and Yoshua Bengio. Metric-free natural gradient for joint-training of Boltzmann machines. CoRR, abs/1301.3545, 2013.
- Duchi et al.  John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
- Fischer & Igel  Asja Fischer and Christian Igel. Empirical analysis of the divergence of Gibbs sampling based learning algorithms for restricted Boltzmann machines. Artificial Neural Networks–ICANN 2010, pp. 208–217. Springer, 2010.
- Fischer & Igel  Asja Fischer and Christian Igel. An introduction to restricted Boltzmann machines. In Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications (CIARP 2012), pp. 14–36. Springer, 2012.
- Freund & Haussler  Yoav Freund and David Haussler. Unsupervised learning of distributions of binary vectors using two layer networks. Computer Research Laboratory [University of California, Santa Cruz], 1994.
- Grosse & Salakhutdinov  Roger B. Grosse and Ruslan Salakhutdinov. Scaling up natural gradient by sparsely factorizing the inverse fisher matrix. In Proceedings of the 32Nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, pp. 2304–2313. JMLR.org, 2015. URL http://dl.acm.org/citation.cfm?id=3045118.3045363.
- Hinton  Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural computation, 14(8):1771–1800, 2002.
- Hinton et al.  Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554, 2006.
- LeCun et al.  Yann LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Effiicient backprop. In Neural Networks: Tricks of the Trade, This Book is an Outgrowth of a 1996 NIPS Workshop, pp. 9–50, London, UK, UK, 1998. Springer-Verlag. ISBN 3-540-65311-2.
- Ma & Wang  Xuesi Ma and Xiaojie Wang. Average contrastive divergence for training restricted Boltzmann machines. Entropy, 18(1):35, 2016.
- MacKay  David JC MacKay. Information theory, inference, and learning algorithms, volume 7. Cambridge university press Cambridge, 2003.
- Marlin  Benjamin M. Marlin. CalTech 101 Silhouettes Data Set. https://people.cs.umass.edu/~marlin/data.shtml, 2009.
- Martens  James Martens. Deep learning via Hessian-free optimization. In ICML, 2010.
- Melchior et al.  Jan Melchior, Asja Fischer, and Laurenz Wiskott. How to center deep Boltzmann machines. Journal of Machine Learning Research, 17(99):1–61, 2016.
- Montavon & Müller  Grégoire Montavon and Klaus-Robert Müller. Deep Boltzmann Machines and the Centering Trick, pp. 621–637. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012. ISBN 978-3-642-35289-8.
- Neal  Radford M Neal. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 2001.
- Oswin Krause  Oswin Krause, Asja Fischer, and Christian Igel. Population-contrastive-divergence: Does consistency help with RBM training? CoRR, abs/1510.01624, 2015.
- Roux et al.  Nicolas L. Roux, Pierre-antoine Manzagol, and Yoshua Bengio. Topmoumoute online natural gradient algorithm. In Advances in Neural Information Processing Systems 20, pp. 849–856. Curran Associates, Inc., 2008.
- Salakhutdinov & Hinton  Ruslan Salakhutdinov and Geoffrey E Hinton. Deep Boltzmann machines. In AISTATS, volume 1, pp. 3, 2009.
- Salakhutdinov & Murray  Ruslan Salakhutdinov and Iain Murray. On the quantitative analysis of deep belief networks. In Proceedings of the 25th international conference on Machine learning, pp. 872–879. ACM, 2008.
- Smolensky  Paul Smolensky. Information processing in dynamical systems: Foundations of harmony theory. 1986.
- Tieleman  Tijmen Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th international conference on Machine learning, pp. 1064–1071. ACM, 2008.
- Tieleman & Hinton  Tijmen Tieleman and Geoffrey Hinton. Using fast weights to improve persistent contrastive divergence. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1033–1040. ACM, 2009.
- Upadhya & Sastry  Vidyadhar Upadhya and P. S. Sastry. Learning RBM with a DC programming approach. In Proceedings of the Ninth Asian Conference on Machine Learning, volume 77 of Proceedings of Machine Learning Research, pp. 498–513. PMLR, 15–17 Nov 2017.
- Yuille et al.  Alan L. Yuille and Anand Rangarajan. The concave-convex procedure (CCCP). Advances in Neural Information Processing Systems, 2:1033–1040, 2002.