1 Introduction
The Restricted Boltzmann Machine (RBM) is an energy-based generative model (Smolensky, 1986; Freund & Haussler, 1994; Hinton, 2002)
which is among the basic building blocks of several deep learning models, including the Deep Boltzmann Machine (DBM) and the Deep Belief Network (DBN)
(Salakhutdinov & Hinton, 2009; Hinton et al., 2006). It can also be used as a discriminative model with suitable modifications. Maximum likelihood estimation is one of the standard methods to learn the parameters of an RBM. However, evaluating the gradient of the log-likelihood (w.r.t. the parameters of the model) is computationally expensive (exponential in the minimum of the number of visible and hidden units in the model) since it contains an expectation term w.r.t.
the model distribution. This expectation is approximated using samples from the model distribution, obtained via Markov Chain Monte Carlo (MCMC) methods, which are efficient for RBMs due to their bipartite connectivity structure. The popular Contrastive Divergence (CD) algorithm uses samples obtained through such an MCMC procedure. However, the resulting gradient estimate may be poor when the RBM model is high dimensional, and this poor estimate can even make simple stochastic gradient descent (SGD) based algorithms such as CD(k) diverge in some cases
(Fischer & Igel, 2010). There are two approaches to making the learning of RBMs more efficient. The first is to design an efficient MCMC method that yields good representative samples from the model distribution and thereby reduces the variance of the estimated gradient
(Desjardins et al., 2010; Tieleman & Hinton, 2009). However, advanced MCMC methods are, in general, computationally intensive. The second approach is to design better optimization strategies which are robust to the noise in the estimated gradient (Martens, 2010; Desjardins et al., 2013; Carlson et al., 2015). Most approaches to designing better optimization methods for learning RBMs are second-order optimization techniques that need either an approximate Hessian inverse or an estimate of the inverse Fisher matrix. (The two differ for the RBM since it contains hidden units.) The AdaGrad (Duchi et al., 2011) method uses a diagonal approximation of the Hessian matrix, while TONGA (Roux et al., 2008) assumes a block-diagonal structure. The Hessian-Free (HF) method (Martens, 2010) is an iterative procedure which approximately solves a linear system to obtain the curvature through matrix-vector products. In
Desjardins et al. (2013), the HF method is used to design a natural gradient descent algorithm for learning Boltzmann machines. A sparse Gaussian graphical model is proposed in Grosse & Salakhutdinov (2015) to estimate the inverse Fisher matrix and devise a factorized natural gradient descent procedure. All these methods either need additional computation to solve an auxiliary linear system or rely on computationally intensive methods to directly estimate the inverse Fisher matrix. Recently, an algorithm called stochastic difference of convex functions programming (S-DCP) (Upadhya & Sastry, 2017)
was proposed for training RBMs. The S-DCP approach, which uses only first-order derivatives of the log-likelihood, essentially solves a series of convex optimization problems and is shown to perform well compared to other algorithms. More importantly, the computational cost of the S-DCP algorithm can be made identical to that of CD-based algorithms with a proper choice of hyperparameters.
Motivated by the simplicity and the efficiency of the S-DCP algorithm, in this work we modify S-DCP by using the diagonal approximation of the Hessian of the log-likelihood to obtain parameter-specific adaptive learning rates for the gradient descent process. We show that the diagonal terms of the Hessian can be expressed in terms of the covariances of visible and hidden units and can be estimated using the same MCMC samples used to get the gradient estimates. Therefore, the additional computational cost incurred is negligible. Thus the main contribution of the paper is a well-motivated method that can automatically adapt the step-size to improve the efficiency of learning an RBM. Through extensive empirical investigations we show the effectiveness of the proposed algorithms.
The rest of the paper is organized as follows. In section 2, we briefly describe the RBM model and the maximum likelihood (ML) learning approach for RBMs. We explain the proposed algorithm, the diagonally scaled S-DCP, in section 3. In section 4, we describe the simulation settings and then present the results of our study. Finally, we conclude the paper in section 5.
2 Background
2.1 Restricted Boltzmann Machines
The Restricted Boltzmann Machine (RBM) is an energy-based model with a two-layer architecture, in which
visible stochastic units in one layer are connected to hidden stochastic units in the other layer (Smolensky, 1986; Freund & Haussler, 1994; Hinton, 2002). There are no visible-to-visible or hidden-to-hidden connections, and the connections between the layers are undirected. The RBM with parameters $\theta$ represents the probability distribution

$$p(\mathbf{v}, \mathbf{h}; \theta) = \frac{e^{-E(\mathbf{v}, \mathbf{h}; \theta)}}{Z(\theta)} \qquad (1)$$

where $Z(\theta) = \sum_{\mathbf{v}, \mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h}; \theta)}$ is the normalizing constant, called the partition function, and $E(\mathbf{v}, \mathbf{h}; \theta)$ is the energy function. The energy function is defined depending on the type of units (discrete or continuous). In this work, we consider binary units, i.e., $\mathbf{v} \in \{0,1\}^n$ and $\mathbf{h} \in \{0,1\}^m$, for which the energy function is defined as

$$E(\mathbf{v}, \mathbf{h}; \theta) = -\sum_{i=1}^{m} \sum_{j=1}^{n} w_{ij} h_i v_j - \sum_{j=1}^{n} b_j v_j - \sum_{i=1}^{m} c_i h_i$$

where $\theta = \{\mathbf{w} \in \mathbb{R}^{m \times n}, \mathbf{b} \in \mathbb{R}^{n}, \mathbf{c} \in \mathbb{R}^{m}\}$ is the set of model parameters. Here $w_{ij}$, the $(i,j)$-th element of $\mathbf{w}$, is the weight of the connection between the $i$-th hidden unit and the $j$-th visible unit, while $c_i$ and $b_j$ denote the biases of the $i$-th hidden unit and the $j$-th visible unit respectively.
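As a concrete illustration, the energy of a joint configuration can be computed directly. The sketch below assumes NumPy arrays with `W` of shape (m, n) (hidden by visible), visible bias `b` and hidden bias `c`, following the notation of this section; the function names are ours, not from the paper.

```python
import numpy as np

def rbm_energy(v, h, W, b, c):
    """E(v, h) = -h^T W v - b^T v - c^T h for binary vectors v and h."""
    return -(h @ W @ v + b @ v + c @ h)

# Tiny example: n = 3 visible units, m = 2 hidden units, all parameters one.
W = np.ones((2, 3))
b = np.ones(3)
c = np.ones(2)
v = np.array([1.0, 0.0, 1.0])
h = np.array([1.0, 1.0])
print(rbm_energy(v, h, W, b, c))  # -8.0
```

Lower energy corresponds to higher probability under (1), since $p \propto e^{-E}$.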
2.2 Maximum Likelihood Learning
One way to learn the RBM parameters $\theta$ is through maximization of the log-likelihood over the training samples. The log-likelihood for a given training sample $\mathbf{v}$ is given by

$$\mathcal{L}(\theta | \mathbf{v}) = \log p(\mathbf{v}; \theta) = g(\theta, \mathbf{v}) - f(\theta) \qquad (2)$$

where

$$g(\theta, \mathbf{v}) \triangleq \log \sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h}; \theta)}, \qquad f(\theta) \triangleq \log \sum_{\mathbf{v}', \mathbf{h}} e^{-E(\mathbf{v}', \mathbf{h}; \theta)} = \log Z(\theta). \qquad (3)$$

The optimal RBM parameters can be found by solving the following optimization problem:

$$\theta^{*} = \arg\max_{\theta} \; \mathcal{L}(\theta | \mathbf{v}). \qquad (4)$$

Since there is no closed-form solution to the above optimization problem, an iterative gradient ascent procedure is used, i.e.,

$$\theta^{(t+1)} = \theta^{(t)} + \eta \, \nabla_{\theta} \mathcal{L}(\theta | \mathbf{v}) \big|_{\theta = \theta^{(t)}}.$$
The gradients of $g$ and $f$ are given as (Hinton, 2002; Fischer & Igel, 2012)

$$\nabla_{\theta}\, g = -\mathbb{E}_{p(\mathbf{h} | \mathbf{v}; \theta)} \left[ \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \right], \qquad \nabla_{\theta} f = -\mathbb{E}_{p(\mathbf{v}, \mathbf{h}; \theta)} \left[ \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \right] \qquad (5)$$

where $\mathbb{E}_{p}[\cdot]$ denotes the expectation w.r.t. the distribution $p$. The expectation under the conditional distribution $p(\mathbf{h} | \mathbf{v}; \theta)$, for a given $\mathbf{v}$, has a closed-form expression and hence is easily evaluated analytically. However, the expectation under the joint density $p(\mathbf{v}, \mathbf{h}; \theta)$ is computationally intractable, since the number of terms in the expectation summation grows exponentially with the (minimum of the) number of hidden and visible units present in the model. Hence, sampling methods are used to estimate $\nabla_{\theta} f$.
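The closed-form conditional expectations mentioned above are logistic sigmoids of each unit's total input. A minimal sketch (same parameter shapes as before; function names are ours):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, c):
    """P(h_i = 1 | v) = sigmoid(sum_j w_ij v_j + c_i), computed analytically."""
    return sigmoid(W @ v + c)

def p_v_given_h(h, W, b):
    """P(v_j = 1 | h) = sigmoid(sum_i w_ij h_i + b_j)."""
    return sigmoid(W.T @ h + b)

# With all-zero parameters every conditional probability is exactly 0.5.
W = np.zeros((2, 3))
print(p_h_given_v(np.array([1.0, 0.0, 1.0]), W, np.zeros(2)))  # [0.5 0.5]
```

The factorization of these conditionals over units is exactly what makes block Gibbs sampling efficient for the bipartite RBM graph.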
Contrastive divergence (Hinton, 2002), a popular algorithm to learn an RBM, uses a single sample, obtained after running a Markov chain for $K$ steps, to approximate the expectation as

$$\nabla_{\theta} f \approx -\mathbb{E}_{p(\mathbf{h} | \tilde{\mathbf{v}}^{(K)}; \theta)} \left[ \frac{\partial E(\tilde{\mathbf{v}}^{(K)}, \mathbf{h})}{\partial \theta} \right]. \qquad (6)$$

Here $\tilde{\mathbf{v}}^{(K)}$ is the sample obtained after $K$ transitions of the Markov chain (defined by the current parameter values $\theta$) initialized with the training sample $\mathbf{v}$. There exist many variations of this CD algorithm in the literature, such as persistent CD (PCD) (Tieleman, 2008), fast persistent CD (FPCD) (Tieleman & Hinton, 2009), population CD (pop-CD) (Krause et al., 2015), and average contrastive divergence (ACD) (Ma & Wang, 2016). Another popular algorithm, parallel tempering (PT) (Desjardins et al., 2010), is also based on MCMC. All these algorithms differ in how they obtain representative samples from the model distribution for estimating the gradient.
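A sketch of the CD-style sampling step described above: $K$ alternating Gibbs transitions starting from a training vector (with actual sampling at each step, not just taking means). Names and interface are ours, not the paper's.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_sample(v0, W, b, c, k, rng):
    """Return the visible sample after k alternating Gibbs steps from v0."""
    v = v0.astype(float).copy()
    for _ in range(k):
        ph = sigmoid(W @ v + c)               # P(h = 1 | v)
        h = (rng.random(ph.shape) < ph).astype(float)
        pv = sigmoid(W.T @ h + b)             # P(v = 1 | h)
        v = (rng.random(pv.shape) < pv).astype(float)
    return v

rng = np.random.default_rng(0)
v_k = cd_k_sample(np.array([1.0, 0.0, 1.0]), np.zeros((2, 3)),
                  np.zeros(3), np.zeros(2), k=3, rng=rng)
print(v_k.shape, set(v_k.tolist()) <= {0.0, 1.0})  # (3,) True
```

In CD(K) the returned $\tilde{\mathbf{v}}^{(K)}$ is plugged into (6) to estimate the model-side expectation.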
The centered gradient (CG) algorithm (Montavon & Müller, 2012) uses the same principle as the CD algorithm to obtain samples; however, while estimating the gradient, it removes the mean of the training data and the mean of the hidden activations from the visible and hidden variables respectively. This approach has been observed to improve the conditioning of the underlying optimization problem (Montavon & Müller, 2012).
Another recent algorithm for learning RBMs is S-DCP (Upadhya & Sastry, 2017). This approach exploits the fact that the RBM log-likelihood is a difference of two convex functions and formulates a learning algorithm using the difference of convex functions programming (DCP) method. The S-DCP approach is advantageous in that a non-convex problem is solved by iteratively solving a sequence of convex optimization problems. In this work, we propose a diagonally scaled version of S-DCP, as explained in the next section.
3 Diagonally scaled S-DCP
The DCP (Yuille et al., 2002; An & Tao, 2005) is a method for solving optimization problems of the form

$$\min_{\theta} \; F(\theta) = f(\theta) - g(\theta) \qquad (7)$$

where both $f$ and $g$ are convex and smooth but $F$ is non-convex. It is an iterative procedure defined by

$$\theta^{(t+1)} = \arg\min_{\theta} \; f(\theta) - \theta^{T} \nabla g(\theta^{(t)}). \qquad (8)$$

In the RBM setting, $F$ corresponds to the negative log-likelihood function and the functions $f$ and $g$ are as defined in (3).
In the S-DCP algorithm, the convex optimization problem given by the RHS of (8) is (approximately) solved using a few iterations of gradient descent, for which $\nabla f$ is estimated using samples obtained through MCMC (as in contrastive divergence). Thus it is a stochastic gradient descent on the convex objective function for a fixed number of inner iterations. A detailed description of the S-DCP method is given as Algorithm 1. Note that it is possible to choose the hyperparameters (the number of inner-loop iterations and the number of MCMC steps per gradient estimate) such that the amount of computation required is identical to that of the CD algorithm with $K$ steps (Upadhya & Sastry, 2017).
The S-DCP algorithm can thus be viewed as two loops. The outer loop is the iteration given by (8). Each outer iteration involves a convex optimization problem which is (approximately) solved by the inner loop of S-DCP through stochastic gradient descent on this convex function. In this paper, we propose scaling this stochastic gradient descent by the diagonal elements of the Hessian of this convex function.
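The two-loop structure can be illustrated on a toy (non-RBM) difference-of-convex problem; `grad_f` and `grad_g` below stand in for the MCMC-estimated and closed-form gradients respectively (hypothetical names, and a deterministic toy objective rather than the stochastic RBM one).

```python
def sdcp_iterate(theta, grad_f, grad_g, eta, d, num_outer):
    """Outer loop linearizes the concave part (-g) at the current point;
    inner loop takes d gradient steps on the resulting convex surrogate."""
    for _ in range(num_outer):
        g_lin = grad_g(theta)          # held fixed over the inner loop
        for _ in range(d):
            theta = theta - eta * (grad_f(theta) - g_lin)
    return theta

# Toy DC objective F(x) = f(x) - g(x) with f(x) = x^2 and g(x) = 2x;
# the minimizer of F(x) = x^2 - 2x is x = 1.
x = sdcp_iterate(5.0, lambda t: 2.0 * t, lambda t: 2.0,
                 eta=0.1, d=5, num_outer=50)
print(round(x, 6))  # 1.0
```

Each inner step moves toward the minimizer of the current convex surrogate, and the outer re-linearization drives the iterates to a stationary point of $F$.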
The Hessian of the inner-loop objective function $f(\theta) - \theta^{T} \nabla g(\theta^{(t)})$ is simply $\nabla^{2}_{\theta} f(\theta)$, since the linearized term is affine in $\theta$. By substituting from eq. (5) in the above expression, and using the fact that $E(\mathbf{v}, \mathbf{h}; \theta)$ is linear in $\theta$, we get

$$\nabla^{2}_{\theta} f = \mathbb{E}_{p}\!\left[ \frac{\partial E}{\partial \theta} \frac{\partial E}{\partial \theta}^{T} \right] - \mathbb{E}_{p}\!\left[ \frac{\partial E}{\partial \theta} \right] \mathbb{E}_{p}\!\left[ \frac{\partial E}{\partial \theta} \right]^{T} \qquad (9)$$

where $p = p(\mathbf{v}, \mathbf{h}; \theta)$ denotes the model distribution.
Note that a typical diagonal element of $\nabla^{2}_{\theta} f$ is $\partial^{2} f / \partial \theta_{k}^{2}$, where $\theta$ refers to the parameters of the RBM, namely all the $w_{ij}$, $b_j$ and $c_i$. The diagonal element corresponding to $w_{ij}$ is

$$\frac{\partial^{2} f}{\partial w_{ij}^{2}} = \mathbb{E}_{p}[h_i^2 v_j^2] - \left( \mathbb{E}_{p}[h_i v_j] \right)^2 = \mathbb{E}_{p}[h_i v_j]\left( 1 - \mathbb{E}_{p}[h_i v_j] \right).$$

We have used the property that $v_j^2 = v_j$ and $h_i^2 = h_i$ (since $v_j$ and $h_i$
are binary random variables) in the above derivation. Similarly, the diagonal terms corresponding to the bias terms are given by

$$\frac{\partial^{2} f}{\partial b_{j}^{2}} = \mathbb{E}_{p}[v_j]\left( 1 - \mathbb{E}_{p}[v_j] \right), \qquad \frac{\partial^{2} f}{\partial c_{i}^{2}} = \mathbb{E}_{p}[h_i]\left( 1 - \mathbb{E}_{p}[h_i] \right).$$

Therefore, in general, the diagonal terms of $\nabla^{2}_{\theta} f$ are given by

$$\mathrm{Diag}(\nabla^{2}_{\theta} f) = \nabla_{\theta} f \circ \left( \mathbf{1} - \nabla_{\theta} f \right)$$

where $\circ$ represents element-wise multiplication, $\mathbf{1}$ represents the vector of all ones, and $\mathrm{Diag}(A)$ represents the vector consisting of the diagonal elements of the matrix $A$.
Using the above equations, the diagonal elements of the Hessian of $f$ can be estimated simply from the corresponding gradient estimates. These estimates are used to obtain the scaled gradient descent updates (in the inner loop of S-DCP) as

$$\theta_{k} \leftarrow \theta_{k} - \frac{\eta}{\left[ \mathrm{Diag}(\nabla^{2}_{\theta} f) \right]_{k} + \epsilon} \left[ \nabla_{\theta} f(\theta) - \nabla_{\theta}\, g(\theta^{(t)}) \right]_{k} \qquad (10)$$

where $\theta_k$ represents the $k$-th element of the vector $\theta$ and $\epsilon > 0$ is a small constant added for numerical stability. A detailed description of the proposed algorithm is given as Algorithm 2.
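Putting the pieces together, here is a sketch of one scaled inner-loop step for a parameter block: the model-side moment `model_term` (e.g. the estimate of $\mathbb{E}_p[h_i v_j]$ from the MCMC samples) supplies both the gradient term of $f$ and, via $s(1-s)$, the Hessian diagonal. The function name and interface are ours.

```python
import numpy as np

def diag_scaled_step(theta, data_term, model_term, eta, eps=1e-8):
    """One inner-loop update theta <- theta - eta/(D + eps) * grad, where
    grad = model_term - data_term approximates the gradient of the convex
    surrogate and D = model_term * (1 - model_term) is the Hessian diagonal
    (valid because products of binary units are Bernoulli variables)."""
    D = model_term * (1.0 - model_term)
    return theta - eta * (model_term - data_term) / (D + eps)

# Scalar illustration: s = 0.5 gives D = 0.25, so the raw gradient of 0.25
# is rescaled by 1/0.25 = 4, yielding a step of -eta.
print(diag_scaled_step(np.array([0.0]), np.array([0.25]),
                       np.array([0.5]), eta=0.1, eps=0.0))  # [-0.1]
```

Note how the scaling shrinks the step where the variance term is large and enlarges it where a unit's expectation saturates near 0 or 1, which is exactly the parameter-specific learning-rate effect described above.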
The inverse of the diagonal approximation of the Hessian provides parameter-specific learning rates for the gradient descent process. In the case of the S-DCP algorithm, the objective function for the gradient descent is convex, and hence the diagonal terms of its Hessian are non-negative. We show through empirical analysis on benchmark datasets that the diagonal scaling aids in learning better models.
4 Experiments and Discussions
In this section, we give a detailed comparison between the diagonally scaled S-DCP and other algorithms, namely the centered gradient (CG) algorithm (Melchior et al., 2016), S-DCP and CS-DCP (centered S-DCP). We do not consider CD, PCD or their other variants, since the CG and S-DCP based algorithms have already been shown to perform well in comparison to them (Melchior et al., 2016; Upadhya & Sastry, 2017). We compare the algorithms while keeping the computational complexity roughly the same (for each mini-batch), which is achieved through the choice of the inner-loop hyperparameters in S-DCP and diagonally scaled S-DCP. This setup ensures that the learning speed in terms of actual time is proportional to the speed in terms of iterations.
4.1 The Experimental Setup
We consider three benchmark datasets in our analysis, namely Bars & Stripes (MacKay, 2003), MNIST (LeCun et al., 1998), binarized as in Salakhutdinov & Murray (2008), and the CalTech 101 Silhouettes dataset (Marlin, 2009). The Bars & Stripes dataset is generated using a two-step procedure as follows. In the first step, all the pixels in each row of a square binary pattern are set to zero or one with equal probability, and in the second step the pattern is rotated by 90 degrees with probability 1/2. Both the MNIST and the CalTech 101 Silhouettes datasets have a data dimension of 784. For the Bars & Stripes dataset we consider small RBMs, and for the large datasets we consider RBMs with a larger number of
hidden units. We evaluate the algorithms using performance measures obtained from multiple trials, where each trial fixes the initial configuration of the weights and biases. The biases of the visible units are initialized to the inverse sigmoid of the training-sample mean, and the biases of the hidden units are initialized to zero. The weights are initialized with samples drawn from a zero-mean Gaussian distribution with a small standard deviation. We run multiple trials for each dataset, learn the RBM for a fixed number of epochs, and avoid using any stopping criterion. The mini-batch learning procedure is used and the training dataset is shuffled after every epoch; however, for the Bars & Stripes dataset a full-batch training procedure is used. We compare the performance of diagonally scaled S-DCP with the centered gradient (CG) algorithm, S-DCP and CS-DCP (centered S-DCP). We keep the computational complexity of S-DCP/CS-DCP roughly the same as that of CG by choosing the inner-loop hyperparameters appropriately (Upadhya & Sastry, 2017). Since previous works stressed the necessity of using a large number of MCMC steps $K$ in CD-based algorithms to obtain a sensible generative model (Carlson et al., 2015; Salakhutdinov & Murray, 2008), we use correspondingly chosen values of $K$ in CG, with matched settings for S-DCP/CS-DCP. In order to get an unbiased comparison, we did not use momentum or weight decay for any of the algorithms.
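The initialization scheme described in this section can be sketched as follows. The weight standard deviation `sigma` is a hypothetical placeholder, since its exact value is not fixed here; the clipping constant `eps` guards the logit against empty or full pixel columns.

```python
import numpy as np

def init_rbm(train_data, n_hidden, sigma=0.01, seed=0, eps=1e-6):
    """Visible biases: inverse sigmoid (logit) of the training-sample mean;
    hidden biases: zero; weights: zero-mean Gaussian with std sigma."""
    rng = np.random.default_rng(seed)
    mean = np.clip(train_data.mean(axis=0), eps, 1.0 - eps)  # avoid logit(0)
    b = np.log(mean / (1.0 - mean))
    c = np.zeros(n_hidden)
    W = rng.normal(0.0, sigma, size=(n_hidden, train_data.shape[1]))
    return W, b, c

data = np.array([[0.0, 1.0], [1.0, 1.0]])  # column means 0.5 and 1.0
W, b, c = init_rbm(data, n_hidden=3)
print(W.shape, b[0], c)  # (3, 2) 0.0 [0. 0. 0.]
```

Initializing the visible biases to the logit of the data mean makes the zero-weight model reproduce the marginal pixel statistics from the very first epoch.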
For comparison with the centered gradient method, we use the algorithm given in Melchior et al. (2016) (a specific centering variant in their notation), with the CD step size chosen to match the settings above. The hyperparameters controlling the moving averages of the offsets in that algorithm are set as in the original work, and the initial offset of the visible units is set to the mean of the training data. The CS-DCP algorithm also uses the same hyperparameter settings.
4.2 Evaluation Criterion
The performance comparison is based on the log-likelihood achieved on the training and test samples. For comparing the speed of learning of different algorithms, the average train log-likelihood is a reasonable measure, while the average test log-likelihood also indicates how well the learnt model generalizes. We report the maximum (over all trials) of the average train and test log-likelihood. The average test log-likelihood (denoted ATLL) is evaluated as

$$\mathrm{ATLL} = \frac{1}{N_{\text{test}}} \sum_{l=1}^{N_{\text{test}}} \log p\big(\mathbf{v}_{\text{test}}^{(l)}; \theta\big). \qquad (11)$$

We evaluate the average train log-likelihood similarly, using the training samples rather than the test samples. For small RBMs the above expression can be evaluated exactly. However, for large RBMs we estimate the ATLL with annealed importance sampling (Neal, 2001), using a large number of particles and intermediate distributions spaced on a linear temperature scale between 0 and 1.
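For small RBMs, the exact evaluation mentioned above can be carried out by summing out the hidden units analytically (the free energy) and enumerating all visible states for the partition function. A sketch under our notation (function names are ours):

```python
import numpy as np
from itertools import product

def free_energy(v, W, b, c):
    """F(v) = -b.v - sum_i log(1 + exp(c_i + (W v)_i)); p(v) ~ exp(-F(v))."""
    return -(b @ v) - np.sum(np.logaddexp(0.0, W @ v + c))

def exact_avg_log_likelihood(samples, W, b, c):
    """Average log p(v) over `samples`, exact via enumeration of all 2^n
    visible states (feasible only for small n)."""
    n = len(b)
    all_v = np.array(list(product([0.0, 1.0], repeat=n)))
    log_z = np.logaddexp.reduce([-free_energy(v, W, b, c) for v in all_v])
    return float(np.mean([-free_energy(v, W, b, c) - log_z for v in samples]))

# Zero-parameter RBM: every visible state has probability 2^{-n}.
W, b, c = np.zeros((2, 3)), np.zeros(3), np.zeros(2)
print(exact_avg_log_likelihood(np.array([[1.0, 0.0, 1.0]]), W, b, c))
# -3 log 2, about -2.0794
```

The `logaddexp` accumulation keeps the partition-function sum numerically stable, which matters once the energies grow during training.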
It is possible that the learnt model achieves a high log-likelihood even though the learnt distribution is not close to the empirical distribution of the training samples. Therefore, for the Bars & Stripes dataset we also evaluate the probability that the model assigns to the training samples and to the samples that are one Hamming distance away from them. As the learning progresses, ideally we expect the model to assign higher probability to the training samples than to other binary vectors. However, the distribution an RBM represents is smooth, since it is the exponential of an energy function. Therefore, we also evaluate the probability assigned to all the samples that are one Hamming distance away from the training samples. Intuitively, the combined probability assigned to these two sets should approach 1 as the learning progresses. Therefore, we provide plots for (i) the total probability assigned by the model to the training samples (denoted here as P1), and (ii) the total probability assigned by the model to the training samples and the samples that are one Hamming distance away from them (denoted here as P2).
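The two totals can be computed as below, given any routine returning the exact model probability of a binary vector; `prob_fn` is a placeholder for such a routine, and the function name is ours.

```python
import numpy as np

def hamming_mass(train_vectors, prob_fn):
    """Return (P1, P2): total model probability of the training patterns, and
    of the training patterns plus all Hamming-distance-1 neighbours."""
    train_set = {tuple(v) for v in train_vectors}
    neighbours = set()
    for v in train_set:
        for j in range(len(v)):
            nb = list(v)
            nb[j] = 1 - nb[j]          # flip one bit
            neighbours.add(tuple(nb))
    p1 = sum(prob_fn(np.array(v)) for v in train_set)
    p2 = p1 + sum(prob_fn(np.array(v)) for v in neighbours - train_set)
    return p1, p2

# Sanity check with a uniform model over 3-bit vectors (each has prob 1/8):
p1, p2 = hamming_mass([[0, 0, 0]], lambda v: 1.0 / 8.0)
print(p1, p2)  # 0.125 0.5
```

Deduplicating via sets ensures that patterns at Hamming distance 1 from several training samples are counted only once in P2.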
However, evaluating the above measures is intractable for the models learnt on the large datasets. Therefore, we use the traditional average test log-likelihood score to evaluate those learnt models. In addition, we also provide samples generated by the learnt models, obtained by randomly initializing the states of the visible units and running an alternating Gibbs sampler for 5000 steps.
4.3 Performance Comparison
In this section, we present experimental results to illustrate the performance of diagonally scaled S-DCP (denoted S-DCP-D) in comparison with the other methods (CG, S-DCP and CS-DCP). We report the best performance measures for all the algorithms, cross-validating the hyperparameters involved for each of them. We observe that the ATLL achieved by diagonally scaled S-DCP is equal to or greater than that of S-DCP.
Figure 1(b) shows the evolution of the ATLL for the Bars & Stripes dataset, and figure 1(a) plots the measures discussed in section 4.2, namely P1 and P2. We observe that the probability assigned to the valid samples (P1) is higher for S-DCP-D than for S-DCP and CG across epochs. However, the total probability assigned to the valid samples and the samples that are one Hamming distance away (P2) approaches almost the same value for all the S-DCP based algorithms, and it is better than that achieved by the CG algorithm. From this we can infer that the proposed algorithms are able to minimize the probability assigned to non-valid but similar (in Hamming distance) examples. We observe similar behaviour in terms of the log-likelihood measure, i.e., all the S-DCP based algorithms achieve almost the same ATLL, which is higher than that of the CG algorithm, as shown in figure 1(b).
Figure 2 shows the evolution of the maximum average log-likelihood on the test and training sets for the MNIST and CalTech datasets. In both cases, the proposed S-DCP-D performs better than the S-DCP and CG algorithms. Also, we observe in figure 2(a) that the S-DCP-D evolution is smoother than that of S-DCP, which indicates the effectiveness of using parameter-specific learning rates. Further, the ATLL evolution in figure 2(a) indicates that the models learnt using S-DCP-D generalize better than those learnt by the other algorithms. The maximum ATLL achieved by S-DCP-D is significantly higher than that of the other methods, and the maximum ATLL scores for S-DCP and CS-DCP match the earlier study in Upadhya & Sastry (2017). Also, the ATLL achieved by the learnt models is comparable to that of the VAE (Variational Autoencoder) and IWAE (Importance Weighted Autoencoder) models (Burda et al., 2015). We observe similar behaviour for the CalTech dataset, as shown in figure 2(b). The performance of S-DCP-D is similar to that of CS-DCP and significantly better than that of the S-DCP and CG algorithms on the training dataset. However, in terms of generalization, the models learnt using S-DCP-D are similar to those learnt using S-DCP.


The samples generated by the learnt models (on the MNIST dataset) are shown in figure 3. The samples generated by S-DCP-D are sharper than those produced by the CG-based model. Also, it can be observed that the samples generated by CG and S-DCP-D are more diverse than those produced by S-DCP and CS-DCP. It is important to note that there exists no precise measure for evaluating a generative model based on the generated samples.


5 Conclusions
Learning an RBM is difficult due to the noisy gradient estimates of the log-likelihood obtained through the MCMC procedure. Several advanced optimization techniques have been proposed that achieve fast and stable learning even with these noisy gradient estimates. In this work, we proposed a way to estimate the diagonal approximation of the Hessian in order to scale the S-DCP gradient estimates, so as to achieve higher learning efficiency than the S-DCP and CG algorithms, the current popular methods for learning RBMs.
The divergence or instability of the log-likelihood during learning may be due to the fact that the estimate of the Hessian of the log-likelihood of an RBM is poor. We exploited the fact that the function S-DCP optimizes is convex and hence its Hessian is positive semi-definite. The resulting algorithm is very similar to S-DCP and incurs negligible additional computational cost. Moreover, its computational complexity can be made identical to that of standard CD-based algorithms. Through empirical studies, we illustrated the advantages of diagonally scaled S-DCP over CG on three benchmark datasets.
The main attraction of diagonally scaled S-DCP, in our opinion, is its simplicity compared to other sophisticated optimization techniques, which use computationally intensive methods to estimate the Hessian. It is possible to learn an RBM more efficiently if the learning rate is reduced with iterations using a heuristically devised schedule; however, such a schedule often has to be fixed through cross-validation. The proposed approach automatically provides parameter-specific learning rates, which makes the learning procedure both stable and efficient, and it does not need any cross-validation procedure to fix learning rate schedules. The only hyperparameter of the proposed method is $\epsilon$, which does not affect the learning dynamics much and is there only to control numerical underflow.

References
 An & Tao [2005] Le Thi Hoai An and Pham Dinh Tao. The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems. Annals of Operations Research, 133(1):23–46, 2005. ISSN 1572-9338.
 Burda et al. [2015] Yuri Burda, Roger B. Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. CoRR, abs/1509.00519, 2015. URL http://arxiv.org/abs/1509.00519.

 Carlson et al. [2015] David Carlson, Volkan Cevher, and Lawrence Carin. Stochastic spectral descent for restricted Boltzmann machines. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, pp. 111–119, 2015.
 Desjardins et al. [2010] Guillaume Desjardins, Aaron Courville, and Yoshua Bengio. Adaptive parallel tempering for stochastic maximum likelihood learning of RBMs. arXiv preprint arXiv:1012.3476, 2010.
 Desjardins et al. [2013] Guillaume Desjardins, Razvan Pascanu, Aaron C. Courville, and Yoshua Bengio. Metric-free natural gradient for joint-training of Boltzmann machines. CoRR, abs/1301.3545, 2013.

 Duchi et al. [2011] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
 Fischer & Igel [2010] Asja Fischer and Christian Igel. Empirical analysis of the divergence of Gibbs sampling based learning algorithms for restricted Boltzmann machines. In Artificial Neural Networks – ICANN 2010, pp. 208–217. Springer, 2010.
 Fischer & Igel [2012] Asja Fischer and Christian Igel. An introduction to restricted Boltzmann machines. In Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, pp. 14–36. Springer, 2012.
 Freund & Haussler [1994] Yoav Freund and David Haussler. Unsupervised learning of distributions of binary vectors using two layer networks. Computer Research Laboratory [University of California, Santa Cruz], 1994.
 Grosse & Salakhutdinov [2015] Roger B. Grosse and Ruslan Salakhutdinov. Scaling up natural gradient by sparsely factorizing the inverse Fisher matrix. In Proceedings of the 32nd International Conference on Machine Learning, ICML'15, pp. 2304–2313. JMLR.org, 2015. URL http://dl.acm.org/citation.cfm?id=3045118.3045363.
 Hinton [2002] Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural computation, 14(8):1771–1800, 2002.
 Hinton et al. [2006] Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554, 2006.
 LeCun et al. [1998] Yann LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient BackProp. In Neural Networks: Tricks of the Trade, pp. 9–50. Springer-Verlag, London, UK, 1998. ISBN 3-540-65311-2.
 Ma & Wang [2016] Xuesi Ma and Xiaojie Wang. Average contrastive divergence for training restricted Boltzmann machines. Entropy, 18(1):35, 2016.
 MacKay [2003] David J.C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.
 Marlin [2009] Benjamin M. Marlin. CalTech 101 Silhouettes Data Set. https://people.cs.umass.edu/~marlin/data.shtml, 2009.
 Martens [2010] James Martens. Deep learning via Hessian-free optimization. In ICML, 2010.
 Melchior et al. [2016] Jan Melchior, Asja Fischer, and Laurenz Wiskott. How to center deep Boltzmann machines. Journal of Machine Learning Research, 17(99):1–61, 2016.
 Montavon & Müller [2012] Grégoire Montavon and Klaus-Robert Müller. Deep Boltzmann Machines and the Centering Trick, pp. 621–637. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012. ISBN 978-3-642-35289-8.
 Neal [2001] Radford M Neal. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 2001.
 Krause et al. [2015] Oswin Krause, Asja Fischer, and Christian Igel. Population-Contrastive-Divergence: Does consistency help with RBM training? CoRR, abs/1510.01624, 2015.
 Roux et al. [2008] Nicolas Le Roux, Pierre-Antoine Manzagol, and Yoshua Bengio. Topmoumoute online natural gradient algorithm. In Advances in Neural Information Processing Systems 20, pp. 849–856. Curran Associates, Inc., 2008.
 Salakhutdinov & Hinton [2009] Ruslan Salakhutdinov and Geoffrey E Hinton. Deep Boltzmann machines. In AISTATS, volume 1, pp. 3, 2009.
 Salakhutdinov & Murray [2008] Ruslan Salakhutdinov and Iain Murray. On the quantitative analysis of deep belief networks. In Proceedings of the 25th international conference on Machine learning, pp. 872–879. ACM, 2008.
 Smolensky [1986] Paul Smolensky. Information processing in dynamical systems: Foundations of harmony theory. 1986.
 Tieleman [2008] Tijmen Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th international conference on Machine learning, pp. 1064–1071. ACM, 2008.
 Tieleman & Hinton [2009] Tijmen Tieleman and Geoffrey Hinton. Using fast weights to improve persistent contrastive divergence. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1033–1040. ACM, 2009.
 Upadhya & Sastry [2017] Vidyadhar Upadhya and P. S. Sastry. Learning RBM with a DC programming approach. In Proceedings of the Ninth Asian Conference on Machine Learning, volume 77 of Proceedings of Machine Learning Research, pp. 498–513. PMLR, 15–17 Nov 2017.
 Yuille et al. [2002] Alan L. Yuille and Anand Rangarajan. The concave-convex procedure (CCCP). Advances in Neural Information Processing Systems, 2:1033–1040, 2002.