1 Problem Formulation
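Restricted Boltzmann machines (RBMs) with low-precision discrete synapses are appealing because of their high energy efficiency. However, they are harder to train than full-precision RBMs, because training is essentially a discrete optimization problem. In a recent paper, Huang (2019) addressed the problem of training RBMs with binary synaptic connections. The problem is formulated as follows. Consider an RBM whose visible variables $\mathbf{v} = \{v_i\}_{i=1}^{N}$ and hidden variables $\mathbf{h} = \{h_j\}_{j=1}^{M}$ take only binary values, $v_i, h_j \in \{-1, +1\}$. The joint distribution of this RBM is given by the Gibbs distribution
$$P(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} e^{-\beta E(\mathbf{v}, \mathbf{h})},$$
where $Z$ is the normalization constant, $\beta$ is the inverse temperature, and $E(\mathbf{v}, \mathbf{h})$ is the energy function defined as
$$E(\mathbf{v}, \mathbf{h}) = -\frac{1}{\sqrt{N}} \sum_{j=1}^{M} \sum_{i=1}^{N} w_{ji} h_j v_i - \sum_{i=1}^{N} a_i v_i - \sum_{j=1}^{M} b_j h_j.$$
For simplicity and without loss of generality, assume that the biases vanish, $a_i = 0$ and $b_j = 0$. The marginal distribution of $\mathbf{v}$ is obtained by marginalizing out the hidden states:
$$P(\mathbf{v} \mid \mathbf{W}) = \sum_{\mathbf{h}} P(\mathbf{v}, \mathbf{h}) = \frac{1}{Z(\mathbf{W})} \prod_{j=1}^{M} 2\cosh(\beta X_j),$$
where $\mathbf{w}_j$ is the $j$-th row of the synaptic connection matrix $\mathbf{W}$, $X_j = \frac{1}{\sqrt{N}} \mathbf{w}_j^{T} \mathbf{v}$ is the receptive field of the $j$-th hidden neuron, and $Z(\mathbf{W}) = \sum_{\mathbf{v}} \prod_{j=1}^{M} 2\cosh(\beta X_j)$ is the partition function, which depends on the synaptic connection matrix $\mathbf{W}$.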
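The marginalization over hidden states can be checked numerically on a tiny RBM. The sketch below assumes zero biases, $\pm 1$ units, and a $1/\sqrt{N}$ scaling of the couplings; the function names are illustrative and not from the original work. It compares a brute-force sum over all hidden configurations with the product of $2\cosh(\beta X_j)$ factors (both unnormalized, i.e. without the factor $1/Z$).

```python
import itertools
import numpy as np

def brute_force_unnormalized(W, v, beta):
    """Sum exp(-beta * E(v, h)) over all h in {-1, +1}^M, with zero biases."""
    M, N = W.shape
    X = W @ v / np.sqrt(N)  # receptive fields X_j (assumed 1/sqrt(N) scaling)
    total = 0.0
    for h in itertools.product([-1.0, 1.0], repeat=M):
        total += np.exp(beta * np.dot(h, X))
    return total

def closed_form_unnormalized(W, v, beta):
    """Product over hidden neurons of 2*cosh(beta * X_j)."""
    N = W.shape[1]
    X = W @ v / np.sqrt(N)
    return float(np.prod(2.0 * np.cosh(beta * X)))

rng = np.random.default_rng(0)
W = rng.choice([-1.0, 1.0], size=(3, 5))  # binary synapses, M = 3, N = 5
v = rng.choice([-1.0, 1.0], size=5)
beta = 0.7

lhs = brute_force_unnormalized(W, v, beta)
rhs = closed_form_unnormalized(W, v, beta)
print(lhs, rhs)
```

The two numbers should agree up to floating-point error, because the sum over each $h_j$ factorizes once the biases are set to zero.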
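Given $D$ weakly correlated input data samples $\mathcal{D} = \{\mathbf{v}^{d}\}_{d=1}^{D}$, the likelihood of the data can be written as
$$P(\mathcal{D} \mid \mathbf{W}) = \prod_{d=1}^{D} P(\mathbf{v}^{d} \mid \mathbf{W}) = \frac{1}{Z(\mathbf{W})^{D}} \prod_{d=1}^{D} \prod_{j=1}^{M} 2\cosh(\beta X_j^{d}),$$
where $X_j^{d} = \frac{1}{\sqrt{N}} \mathbf{w}_j^{T} \mathbf{v}^{d}$ is the receptive field of the $j$-th hidden neuron for the $d$-th data sample. From the Bayesian perspective, if the prior distribution of $\mathbf{W}$ is $P_0(\mathbf{W})$, then by Bayes' rule the posterior distribution is
$$P(\mathbf{W} \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \mathbf{W}) P_0(\mathbf{W})}{P(\mathcal{D})}, \tag{6}$$
where $P(\mathcal{D}) = \sum_{\mathbf{W}} P(\mathcal{D} \mid \mathbf{W}) P_0(\mathbf{W})$ is the partition function of the posterior, also known as the marginal data likelihood.
The goal of training RBMs with binary synapses is to learn the synaptic connection matrix $\mathbf{W}$ from the observed data samples $\mathcal{D}$, subject to the discrete constraint that each element of $\mathbf{W}$ also takes a binary value, i.e., $w_{ji} \in \{-1, +1\}$. If the posterior distribution $P(\mathbf{W} \mid \mathcal{D})$ could be computed, the learning problem would be solved. However, exact computation of $P(\mathbf{W} \mid \mathcal{D})$ is intractable.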
For RBMs with full-precision synaptic connections, classical training methods such as the contrastive divergence (CD) algorithm Hinton (2002) have been proposed. With binary synaptic connections, however, training is a challenging discrete optimization problem, so full-precision learning algorithms such as CD cannot be applied directly.
2 Review of Huang’s Method in Huang (2019)
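Recently, Huang (2019) addressed this challenging problem by combining gradient ascent (equivalently, gradient descent (GD) on the negative ELBO) with the message passing algorithm under the variational inference (VI) framework. Specifically, instead of computing the posterior $P(\mathbf{W} \mid \mathcal{D})$ directly, VI seeks an approximate distribution $q(\mathbf{W})$ that maximizes a lower bound of the log marginal likelihood $\log P(\mathcal{D})$, called the evidence lower bound (ELBO), i.e.,
$$\mathcal{L}(q) = \mathbb{E}_{q(\mathbf{W})}\left[\log P(\mathcal{D} \mid \mathbf{W})\right] - \mathrm{KL}\left(q(\mathbf{W}) \,\|\, P_0(\mathbf{W})\right), \tag{7}$$
where $\mathrm{KL}(\cdot \,\|\, \cdot)$ is the Kullback-Leibler (KL) divergence and $P_0(\mathbf{W})$ is the prior distribution, which is assumed to factorize as
$$P_0(\mathbf{W}) = \prod_{j,i} \left[\frac{1 + m_{0,ji}}{2}\, \delta(w_{ji} - 1) + \frac{1 - m_{0,ji}}{2}\, \delta(w_{ji} + 1)\right], \tag{8}$$
where $m_{0,ji}$ is the prior mean of $w_{ji}$ and also controls the probability that $w_{ji}$ equals $+1$ or $-1$. In practice, $m_{0,ji} = 0$ is usually assumed when no informative prior knowledge about the synapses is available. Alternatively, $\mathcal{L}(q)$ in (7) can be rewritten as
$$\mathcal{L}(q) = \log P(\mathcal{D}) - \mathrm{KL}\left(q(\mathbf{W}) \,\|\, P(\mathbf{W} \mid \mathcal{D})\right),$$
so that $\mathcal{L}(q) \le \log P(\mathcal{D})$, and maximizing $\mathcal{L}(q)$ is equivalent to minimizing the KL divergence $\mathrm{KL}(q(\mathbf{W}) \,\|\, P(\mathbf{W} \mid \mathcal{D}))$. Hence, the posterior inference problem in (6) is transformed into the optimization of $\mathcal{L}(q)$ with respect to (w.r.t.) the variational parameters of $q(\mathbf{W})$, which is the core of VI.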
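The relation between the ELBO, the log marginal likelihood, and the KL divergence to the posterior can be verified on a toy problem. The sketch below assumes a hypothetical model with a single binary weight $w \in \{-1, +1\}$ and made-up likelihood values; it is only meant to illustrate the identity, not the RBM setting itself.

```python
import numpy as np

# Hypothetical toy model: a single binary weight w in {-1, +1}.
lik = np.array([0.2, 0.6])      # assumed values of P(D | w) for w = -1, +1
prior = np.array([0.5, 0.5])    # uniform prior P_0(w)
evidence = float(np.sum(lik * prior))   # marginal likelihood P(D)
posterior = lik * prior / evidence      # exact posterior P(w | D)

def kl(p, r):
    """KL divergence between two discrete distributions on {-1, +1}."""
    return float(np.sum(p * np.log(p / r)))

def q_from_mean(m):
    """Symmetric Bernoulli with mean m: probabilities of w = -1 and w = +1."""
    return np.array([(1.0 - m) / 2.0, (1.0 + m) / 2.0])

def elbo(m):
    q = q_from_mean(m)
    return float(np.sum(q * np.log(lik)) - kl(q, prior))

m = 0.3
gap = np.log(evidence) - elbo(m)          # should equal KL(q || posterior)
m_star = posterior[1] - posterior[0]      # mean of the exact posterior
print(gap, kl(q_from_mean(m), posterior), elbo(m_star), np.log(evidence))
```

In this toy case the variational family contains the exact posterior, so the bound becomes tight at `m_star`; in the RBM setting the factorized family generally cannot match the posterior exactly.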
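To model the binary synaptic connection weights $\mathbf{W}$, the variational distribution $q(\mathbf{W})$ in Huang (2019) is chosen to be a mean-field symmetric Bernoulli distribution
$$q(\mathbf{W}) = \prod_{j,i} \left[\frac{1 + m_{ji}}{2}\, \delta(w_{ji} - 1) + \frac{1 - m_{ji}}{2}\, \delta(w_{ji} + 1)\right], \tag{10}$$
where $m_{ji}$ is the posterior mean of $w_{ji}$ and controls the probability of the value of the binary synaptic connection $w_{ji}$, i.e., the probability of $w_{ji} = +1$ is $\frac{1 + m_{ji}}{2}$ while the probability of $w_{ji} = -1$ is $\frac{1 - m_{ji}}{2}$.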
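Then, Huang (2019) uses gradient ascent to update the variational parameters $m_{ji}$, i.e., in the $t$-th iteration, each parameter is updated as
$$m_{ji}^{t+1} = m_{ji}^{t} + \eta \left.\frac{\partial \mathcal{L}}{\partial m_{ji}}\right|_{m_{ji} = m_{ji}^{t}}, \tag{11}$$
where $\eta$ is the learning rate. This seems easy to implement as long as the gradient term $\partial \mathcal{L} / \partial m_{ji}$ is obtained. However, in contrast to the case of supervised learning, obtaining this gradient is far from trivial. According to (7), the gradient consists of two terms:
$$\frac{\partial \mathcal{L}}{\partial m_{ji}} = \frac{\partial}{\partial m_{ji}} \mathbb{E}_{q(\mathbf{W})}\left[\log P(\mathcal{D} \mid \mathbf{W})\right] - \frac{\partial}{\partial m_{ji}} \mathrm{KL}\left(q(\mathbf{W}) \,\|\, P_0(\mathbf{W})\right).$$
The gradient of the KL regularization term is easily computed as
$$\frac{\partial}{\partial m_{ji}} \mathrm{KL}\left(q(\mathbf{W}) \,\|\, P_0(\mathbf{W})\right) = \frac{1}{2} \log\frac{1 + m_{ji}}{1 - m_{ji}} - \frac{1}{2} \log\frac{1 + m_{0,ji}}{1 - m_{0,ji}}.$$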
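The closed form of the KL gradient can be sanity-checked with a finite difference. The sketch below treats a single synapse, with posterior mean `m` and prior mean `m0` (both names are assumptions made for this illustration), and uses the fact that $\frac{1}{2}\log\frac{1+m}{1-m} = \operatorname{arctanh}(m)$.

```python
import numpy as np

def kl_sym_bernoulli(m, m0):
    """KL between symmetric Bernoulli distributions with means m and m0."""
    p, p0 = (1.0 + m) / 2.0, (1.0 + m0) / 2.0
    return p * np.log(p / p0) + (1.0 - p) * np.log((1.0 - p) / (1.0 - p0))

def kl_grad(m, m0):
    """Closed-form derivative of the KL term w.r.t. the posterior mean m."""
    return np.arctanh(m) - np.arctanh(m0)

m, m0, h = 0.3, -0.2, 1e-6
# Central finite difference of the KL divergence w.r.t. m.
numeric = (kl_sym_bernoulli(m + h, m0) - kl_sym_bernoulli(m - h, m0)) / (2.0 * h)
analytic = kl_grad(m, m0)
print(numeric, analytic)
```

Note that the gradient diverges as $m \to \pm 1$, which is one reason a plain gradient step on the mean needs care near the boundary.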
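However, the gradient of the expected log-likelihood term is intractable, as it involves the computation of another log partition function $\log Z(\mathbf{W})$, i.e.,
$$\mathbb{E}_{q(\mathbf{W})}\left[\log P(\mathcal{D} \mid \mathbf{W})\right] = \sum_{d=1}^{D} \sum_{j=1}^{M} \mathbb{E}_{q(\mathbf{W})}\left[\log 2\cosh(\beta X_j^{d})\right] - D\, \mathbb{E}_{q(\mathbf{W})}\left[\log Z(\mathbf{W})\right].$$
Under the factorized $q(\mathbf{W})$ in (10), the receptive field $X_j^{d}$ is a sum of many independent terms and, by the central limit theorem, is approximately Gaussian, $X_j^{d} \sim \mathcal{N}(G_j^{d}, \Xi_j^{2})$, where the mean and variance are defined as
$$G_j^{d} = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} m_{ji} v_i^{d}, \qquad \Xi_j^{2} = \frac{1}{N} \sum_{i=1}^{N} \left(1 - m_{ji}^{2}\right).$$
As a result, similar to the local reparameterization trick Kingma et al. (2015), the expected log-likelihood can be approximated by the Monte Carlo (MC) estimate
$$\mathbb{E}_{q(\mathbf{W})}\left[\log P(\mathcal{D} \mid \mathbf{W})\right] \approx \sum_{d=1}^{D} \sum_{j=1}^{M} \frac{1}{S_1} \sum_{s=1}^{S_1} \log 2\cosh\left(\beta G_j^{d} + \beta \Xi_j \epsilon_j^{d,s}\right) - \frac{D}{S_2} \sum_{s=1}^{S_2} \log Z^{s},$$
$$Z^{s} = \sum_{\mathbf{v}} \prod_{j=1}^{M} 2\cosh\left(\frac{\beta}{\sqrt{N}} \sum_{i=1}^{N} m_{ji} v_i + \beta \Xi_j \epsilon_j^{s}\right),$$
where $\epsilon_j^{d,s}$ and $\epsilon_j^{s}$ are samples drawn from the standard normal distribution, and $S_1$ and $S_2$ are the numbers of samples used to estimate the two terms of the expected log-likelihood, respectively. However, even with MC sampling, the computation of the expected log-likelihood is still difficult due to the term $\log Z^{s}$. Interestingly, as pointed out in Huang (2019), the term $\log Z^{s}$ corresponds to the log partition function of an equivalent RBM whose synaptic connections are $m_{ji}$ and whose hidden-neuron biases are $\Xi_j \epsilon_j^{s}$. As a result, $\log Z^{s}$ can be computed efficiently by resorting to the message passing algorithm, which exchanges messages from each visible neuron $i$ to each hidden neuron $j$ and from each hidden neuron $j$ to each visible neuron $i$; the explicit message passing equations are given in Huang (2019).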
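The Gaussian approximation of the receptive field that underlies the local reparameterization step can be illustrated numerically. The sketch below is a hedged check, not the procedure of Huang (2019): it assumes a single hidden neuron, $\pm 1$ inputs, a $1/\sqrt{N}$ scaling, and illustrative names (`G`, `Xi2`). It compares samples of the receptive field obtained by drawing binary weights from the symmetric Bernoulli distribution against the reparameterized Gaussian surrogate.

```python
import numpy as np

rng = np.random.default_rng(0)
N, S, beta = 200, 20000, 1.0
m = rng.uniform(-0.9, 0.9, size=N)      # assumed posterior means of one weight row
v = rng.choice([-1.0, 1.0], size=N)     # one binary data sample

# Mean and variance of the receptive field under the factorized distribution.
G = m @ v / np.sqrt(N)
Xi2 = np.mean(1.0 - m**2)

# Direct sampling of binary weights: P(w_i = +1) = (1 + m_i) / 2.
W = np.where(rng.random((S, N)) < (1.0 + m) / 2.0, 1.0, -1.0)
X = W @ v / np.sqrt(N)

# Reparameterized surrogate: X ~ G + sqrt(Xi2) * eps with eps ~ N(0, 1).
eps = rng.standard_normal(S)
mc_gauss = np.mean(np.log(2.0 * np.cosh(beta * (G + np.sqrt(Xi2) * eps))))
mc_direct = np.mean(np.log(2.0 * np.cosh(beta * X)))
print(X.mean(), G, X.var(), Xi2, mc_gauss, mc_direct)
```

Sampling one scalar noise variable per receptive field, rather than $N$ binary weights, is what makes the estimate cheap and differentiable w.r.t. the means.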
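Finally, the update equation in Huang (2019) for the variational parameters is
$$m_{ji}^{t+1} = m_{ji}^{t} + \eta \left[\left.\frac{\partial}{\partial m_{ji}} \mathbb{E}_{q(\mathbf{W})}\left[\log P(\mathcal{D} \mid \mathbf{W})\right]\right|_{m_{ji} = m_{ji}^{t}} - \frac{1}{2} \log\frac{1 + m_{ji}^{t}}{1 - m_{ji}^{t}} + \frac{1}{2} \log\frac{1 + m_{0,ji}}{1 - m_{0,ji}}\right], \tag{27}$$
where the gradient of the expected log-likelihood is the one obtained in (22). The means must satisfy $m_{ji} \in [-1, 1]$, but the update in (27) does not guarantee this constraint. As a result, similar to Baldassi et al. (2018), a heuristic clipping operation is introduced in Huang (2019), which sets $m_{ji} = 1$ when $m_{ji} > 1$ and $m_{ji} = -1$ when $m_{ji} < -1$. This trick is heuristic but works well empirically. Two natural questions arise: is there a principled explanation for the heuristic clipping operation? And are there other algorithms that do not need such clipping?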
3 Training RBMs with Binary Synapses using the Bayesian Learning Rule
In this section, we propose an alternative method to train RBMs with binary synaptic connections using the Bayesian learning rule Khan and Lin (2017), which is obtained by optimizing the variational objective with natural gradient descent Amari (1998); Hoffman et al. (2013); Khan and Lin (2017). As demonstrated in Khan and Rue (2019), the Bayesian learning rule can be used to derive and justify many existing learning algorithms in fields such as optimization, Bayesian statistics, machine learning, and deep learning. The Bayesian learning rule has recently been applied in Meng et al. (2020) to train binary neural networks for supervised learning. Therefore, this note can be viewed as an extension of Meng et al. (2020) to the case of unsupervised learning. (Despite using the same Bayesian learning rule, however, the resulting algorithm for unsupervised learning in this note is quite different from that in Meng et al. (2020) for supervised learning.)
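Consider a posterior approximation $q(\mathbf{W})$ from a minimal exponential family,
$$q(\mathbf{W}) = h(\mathbf{W}) \exp\left[\boldsymbol{\lambda}^{T} \boldsymbol{\phi}(\mathbf{W}) - A(\boldsymbol{\lambda})\right], \tag{28}$$
where $\boldsymbol{\lambda}$ is the natural parameter, $\boldsymbol{\phi}(\mathbf{W})$ is the vector of sufficient statistics, $A(\boldsymbol{\lambda})$ is the log-partition function, and $h(\mathbf{W})$ is the base measure. When the prior distribution $P_0(\mathbf{W})$ has the same form as $q(\mathbf{W})$ in (28) and the base measure is $h(\mathbf{W}) = 1$, the Bayesian learning rule uses the following update of the natural parameter Khan and Rue (2019):
$$\boldsymbol{\lambda}^{t+1} = (1 - \eta)\, \boldsymbol{\lambda}^{t} + \eta \left[\left.\nabla_{\boldsymbol{\mu}} \mathbb{E}_{q(\mathbf{W})}\left[\log P(\mathcal{D} \mid \mathbf{W})\right]\right|_{\boldsymbol{\mu} = \boldsymbol{\mu}^{t}} + \boldsymbol{\lambda}_0\right], \tag{29}$$
where $\eta$ is the learning rate, $\boldsymbol{\mu} = \mathbb{E}_{q(\mathbf{W})}[\boldsymbol{\phi}(\mathbf{W})]$ is the expectation parameter of $q(\mathbf{W})$, and $\boldsymbol{\lambda}_0$ is the natural parameter of the prior distribution $P_0(\mathbf{W})$. The main idea is to update the natural parameters using the natural gradient. The appendix briefly shows how to obtain the Bayesian learning rule; for more details, please refer to Khan and Rue (2019); Khan and Lin (2017).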
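To apply the Bayesian learning rule, the posterior approximation $q(\mathbf{W})$ is again chosen to be the fully factorized symmetric Bernoulli distribution in (10), which belongs to the minimal exponential family. In particular, $q(\mathbf{W})$ in (10) can be reformulated as
$$q(\mathbf{W}) = \prod_{j,i} \exp\left[\lambda_{ji}\, \phi(w_{ji}) - A(\lambda_{ji})\right],$$
where the natural parameter $\lambda_{ji}$, sufficient statistic $\phi(w_{ji})$, log-partition function $A(\lambda_{ji})$, and the associated expectation parameter $\mu_{ji}$ are as follows:
$$\lambda_{ji} = \frac{1}{2} \log\frac{1 + m_{ji}}{1 - m_{ji}}, \qquad \phi(w_{ji}) = w_{ji}, \qquad A(\lambda_{ji}) = \log\left(e^{\lambda_{ji}} + e^{-\lambda_{ji}}\right), \qquad \mu_{ji} = \tanh(\lambda_{ji}) = m_{ji}.$$
As a result, instead of optimizing the expectation parameters $m_{ji}$ by gradient ascent as in (11), as done in Huang (2019), we can update the natural parameters $\lambda_{ji}$ using the Bayesian learning rule in (29). Interestingly, as shown in (29), although the natural parameters are updated, the gradient is computed w.r.t. the expectation parameters $\mu_{ji} = m_{ji}$, which is already obtained in (22). When the prior takes the form in (8), its natural parameter is $\lambda_{0,ji} = \frac{1}{2} \log\frac{1 + m_{0,ji}}{1 - m_{0,ji}}$, and the update of each element of the natural parameters can be written as
$$\lambda_{ji}^{t+1} = (1 - \eta)\, \lambda_{ji}^{t} + \eta \left[\left.\frac{\partial}{\partial m_{ji}} \mathbb{E}_{q(\mathbf{W})}\left[\log P(\mathcal{D} \mid \mathbf{W})\right]\right|_{m_{ji} = m_{ji}^{t}} + \lambda_{0,ji}\right]. \tag{38}$$
It is easy to verify that
$$\lambda_{ji}^{t} = \frac{1}{2} \log\frac{1 + m_{ji}^{t}}{1 - m_{ji}^{t}}. \tag{39}$$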
Note that, unlike in (27), there is no need in (38) to explicitly compute the right-hand side of (39). The resulting algorithm for training RBMs with binary synaptic connections using (38) is termed Bayesian Binary RBMs (BayesBRBM). In BayesBRBM, the update formula (38) is similar to (27) used in Huang (2019). However, there are two fundamental differences. First, BayesBRBM updates the natural parameters $\lambda_{ji}$ of the symmetric Bernoulli distribution, while Huang (2019) updates the expectation parameters $m_{ji}$. One direct advantage is that, since $\lambda_{ji} \in (-\infty, +\infty)$ is unconstrained, no additional clipping operation is needed, unlike in Huang (2019). Second, although the update equations (38) and (27) look alike, they correspond to two fundamentally different optimization methods: the former uses natural gradient ascent while the latter uses gradient ascent.
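The difference between the two updates can be seen on a single synapse. The sketch below is a toy illustration under strong assumptions: the gradient of the expected log-likelihood w.r.t. the mean is replaced by a hypothetical constant `g` (in the actual method it would come from message passing), and the prior is uniform so its natural parameter is zero. The function names are illustrative.

```python
import numpy as np

def gradient_ascent_step(m, grad_ll, eta, lam0=0.0):
    """Raw gradient-ascent step on the mean m, before any clipping."""
    return m + eta * (grad_ll - np.arctanh(m) + lam0)

def natural_param_step(lam, grad_ll, eta, lam0=0.0):
    """Bayesian-learning-rule step on the natural parameter lam."""
    return (1.0 - eta) * lam + eta * (grad_ll + lam0)

m, eta, g = 0.9, 0.5, 5.0   # g: hypothetical constant log-likelihood gradient

# Gradient ascent on the mean can leave [-1, 1] and then needs clipping.
m_raw = gradient_ascent_step(m, g, eta)
m_clipped = float(np.clip(m_raw, -1.0, 1.0))

# The natural-parameter step is unconstrained; the mean tanh(lam) stays in (-1, 1).
lam_new = natural_param_step(np.arctanh(m), g, eta)
m_from_lam = float(np.tanh(lam_new))

# With a constant gradient and zero prior parameter, lam converges to g.
lam = 0.0
for _ in range(200):
    lam = natural_param_step(lam, g, eta)
print(m_raw, m_clipped, m_from_lam, lam)
```

Mapping back through `tanh` is what keeps the implied mean feasible without any projection step.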
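Interestingly, the algorithm in Huang (2019) can be viewed as a first-order approximation of BayesBRBM. Specifically, using a first-order Taylor expansion, the expectation parameters $m_{ji}$ can be approximated as
$$m_{ji} = \tanh(\lambda_{ji}) \approx \lambda_{ji}. \tag{40}$$
Substituting (40) into (38) leads to
$$\lambda_{ji}^{t+1} \approx \lambda_{ji}^{t} + \eta \left[\left.\frac{\partial}{\partial m_{ji}} \mathbb{E}_{q(\mathbf{W})}\left[\log P(\mathcal{D} \mid \mathbf{W})\right]\right|_{m_{ji} = \lambda_{ji}^{t}} - \frac{1}{2} \log\frac{1 + \lambda_{ji}^{t}}{1 - \lambda_{ji}^{t}} + \lambda_{0,ji}\right], \tag{41}$$
where the relation in (39) is explicitly substituted for ease of comparison. The update formula in (41) has exactly the same form as (27), up to the exchange of variables between $\lambda_{ji}$ and $m_{ji}$. Since $m_{ji} \in [-1, 1]$, under the first-order approximation (40) the values $\lambda_{ji}$ should also be constrained to the range $[-1, 1]$ by clipping, which is exactly the algorithm in Huang (2019). As a result, the proposed algorithm provides a different perspective on Huang (2019) that justifies its efficacy with heuristic clipping.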
4 Conclusion
In this technical note, building on the work in Huang (2019), we propose an optimization method called BayesBRBM (Bayesian Binary RBM) to train RBMs with binary synapses using the Bayesian learning rule. As opposed to Huang (2019), no additional clipping operation is needed for BayesBRBM. Interestingly, the method in Huang (2019) can be viewed as a first-order approximation of BayesBRBM, which provides an alternative perspective and justifies its efficacy with heuristic clipping. Possible future work includes extending the method to deep RBMs with binary synapses and making a detailed comparison of the two algorithms.
Acknowledgements
X. Meng would like to thank Haiping Huang (Sun Yat-sen University) for helpful discussions, and Mohammad Emtiyaz Khan (RIKEN AIP) for explanations on the Bayesian learning rule.
References
- Amari (1998). Natural gradient works efficiently in learning. Neural Computation 10 (2), pp. 251–276.
- Baldassi et al. (2018). Role of synaptic stochasticity in training low-precision neural networks. Physical Review Letters 120 (26), pp. 268103.
- Hinton (2002). Training products of experts by minimizing contrastive divergence. Neural Computation 14 (8), pp. 1771–1800.
- Hoffman et al. (2013). Stochastic variational inference. The Journal of Machine Learning Research 14 (1), pp. 1303–1347.
- Huang (2019). How data, synapses and neurons interact with each other: a variational principle marrying gradient ascent and message passing. arXiv preprint arXiv:1911.07662.
- Khan and Lin (2017). Conjugate-computation variational inference: converting variational inference in non-conjugate models to inferences in conjugate models. AISTATS.
- Khan and Rue (2019). Learning-algorithms from Bayesian principles. Available online: https://emtiyaz.github.io/papers/learning_from_bayes.pdf
- Kingma et al. (2015). Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pp. 2575–2583.
- Meng et al. (2020). Training binary neural networks using the Bayesian learning rule. In International Conference on Machine Learning.
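Appendix
In this appendix, we briefly introduce the Bayesian learning rule. Please refer to Khan and Rue (2019); Khan and Lin (2017) for more details. According to the definition of natural gradient ascent, the update equation follows
$$\boldsymbol{\lambda}^{t+1} = \boldsymbol{\lambda}^{t} + \eta\, \tilde{\nabla}_{\boldsymbol{\lambda}} \mathcal{L}(\boldsymbol{\lambda}^{t}), \tag{42}$$
where $\tilde{\nabla}_{\boldsymbol{\lambda}} \mathcal{L}(\boldsymbol{\lambda}^{t}) = \mathbf{F}(\boldsymbol{\lambda}^{t})^{-1} \nabla_{\boldsymbol{\lambda}} \mathcal{L}(\boldsymbol{\lambda}^{t})$ denotes the natural gradient of $\mathcal{L}$ with respect to (w.r.t.) $\boldsymbol{\lambda}$ at $\boldsymbol{\lambda}^{t}$, $\nabla_{\boldsymbol{\lambda}} \mathcal{L}(\boldsymbol{\lambda}^{t})$ is the gradient of $\mathcal{L}$ w.r.t. $\boldsymbol{\lambda}$ at $\boldsymbol{\lambda}^{t}$, and $\mathbf{F}(\boldsymbol{\lambda})$ is the Fisher information matrix (FIM)
$$\mathbf{F}(\boldsymbol{\lambda}) = \mathbb{E}_{q(\mathbf{W})}\left[\nabla_{\boldsymbol{\lambda}} \log q(\mathbf{W})\, \nabla_{\boldsymbol{\lambda}} \log q(\mathbf{W})^{T}\right].$$
As a result, to update the natural parameters using the natural gradient we need to compute the inverse FIM, which is intractable in general. Fortunately, for the minimal exponential family distribution in (28), there exists a concise result, since $\mathbf{F}(\boldsymbol{\lambda}) = \nabla_{\boldsymbol{\lambda}} \boldsymbol{\mu}^{T}$, where $\boldsymbol{\mu} = \mathbb{E}_{q(\mathbf{W})}[\boldsymbol{\phi}(\mathbf{W})] = \nabla_{\boldsymbol{\lambda}} A(\boldsymbol{\lambda})$ is the expectation parameter of the exponential family distribution $q(\mathbf{W})$. As a result, by the chain rule, $\nabla_{\boldsymbol{\lambda}} \mathcal{L} = \mathbf{F}(\boldsymbol{\lambda}) \nabla_{\boldsymbol{\mu}} \mathcal{L}$, so that $\mathbf{F}(\boldsymbol{\lambda})^{-1} \nabla_{\boldsymbol{\lambda}} \mathcal{L} = \nabla_{\boldsymbol{\mu}} \mathcal{L}$ and the natural gradient update in (42) can be equivalently written as
$$\boldsymbol{\lambda}^{t+1} = \boldsymbol{\lambda}^{t} + \eta \left.\nabla_{\boldsymbol{\mu}} \mathcal{L}\right|_{\boldsymbol{\mu} = \boldsymbol{\mu}^{t}},$$
where, from the definition of $\mathcal{L}$ in (7) and for a prior in the same exponential family with natural parameter $\boldsymbol{\lambda}_0$, there is
$$\nabla_{\boldsymbol{\mu}} \mathcal{L} = \nabla_{\boldsymbol{\mu}} \mathbb{E}_{q(\mathbf{W})}\left[\log P(\mathcal{D} \mid \mathbf{W})\right] - \left(\boldsymbol{\lambda} - \boldsymbol{\lambda}_0\right).$$
Substituting this gradient into the update above recovers the Bayesian learning rule in (29).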
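The identity used above, that the natural gradient w.r.t. the natural parameter equals the ordinary gradient w.r.t. the expectation parameter, can be checked numerically for a single symmetric Bernoulli variable. The sketch below is only an illustration: the objective `objective_mu` is a hypothetical smooth function of the mean, standing in for the ELBO, and the function names are not from the original work.

```python
import numpy as np

def fisher(lam):
    """Fisher information of q(w) = exp(lam*w - log(2*cosh(lam))), w in {-1, +1}."""
    ws = np.array([-1.0, 1.0])
    probs = np.exp(lam * ws) / (2.0 * np.cosh(lam))
    score = ws - np.tanh(lam)            # d/dlam of log q(w)
    return float(np.sum(probs * score**2))

def objective_mu(mu):
    """Hypothetical smooth objective expressed in the expectation parameter."""
    return np.sin(mu)

def grad_mu(mu):
    return np.cos(mu)

lam, h = 0.8, 1e-6
# Finite-difference derivative of the expectation parameter mu = tanh(lam).
dmu_dlam = (np.tanh(lam + h) - np.tanh(lam - h)) / (2.0 * h)
# Finite-difference gradient of the objective w.r.t. the natural parameter.
dL_dlam = (objective_mu(np.tanh(lam + h)) - objective_mu(np.tanh(lam - h))) / (2.0 * h)
natural_grad = dL_dlam / fisher(lam)     # F^{-1} times the ordinary gradient
print(fisher(lam), dmu_dlam, natural_grad, grad_mu(np.tanh(lam)))
```

For this one-dimensional family the Fisher information is $1 - \tanh^{2}(\lambda)$, so no matrix inversion is ever needed when the gradient w.r.t. the mean is available.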