I Training RBMs with MCMC
RBMs are undirected graphical models with a bipartite structure that differentiates between visible nodes, and a set of latent, or ‘hidden’ nodes, , not directly constrained by the data Goodfellow et al. (2016). These states are usually taken to be binary but can be generalized. Each state of the machine corresponds to an energy of the form
where the biases , , and weights are the learnable parameters. Note that an RBM does not allow connections within a layer. This defines a distribution over states given by a Boltzmann-Gibbs distribution,
The normalizing factor, , is the formidable partition function that involves the sum over an exponentially scaling number of states, thus making the exact computation of its value infeasible. Additionally, the bipartite structure of the RBM connectivity implies that the hidden nodes are conditionally independent given any visible nodes (and vice versa), with a closed form conditional distribution given by Hinton (2002) , where .
We indicate the unique elements of the dataset for training and testing of the network as , where is the space of all binary sequences of length . We can then write the data distribution as
where is the delta function that evaluates to if and otherwise. This effectively defines a probability mass function (PMF) over with non-zero values only for values . We then call the support of .
Let us assume further that all data points have equal amplitude over the support, i.e., . Since most real world datasets consist of unique elements with no exact repeats, this class of distributions includes all relevant ones. However, we will see in Sec. III
that our method seems to work equally well also for non-uniform distributions.
Training an RBM then amounts to a search for the appropriate weights and biases, , that will minimize the quantity
This is known as the Kullback-Leibler (KL) divergence between the data distribution, , and the model distribution of the RBM over the visible layer, , with hidden nodes traced out. The latter can be written as,
The first term on the rhs of Eq. (6) is an expectation with the hidden nodes driven by the data, and hence is referred to as the data term. Since the conditional distributions across the hidden nodes are factorial and known in closed form, this inference problem is easy in the RBM case. The second term on the rhs of Eq. (6), instead, is an expectation over the model distribution with no nodes fixed, and called the model
term. The exact calculation of this term requires computing the partition function of the RBM, which is proved to be hard even to estimateLong and Servedio (2010). It is this term that MCMC algorithms (including CD) attempt to approximate.
i.1 Contrastive Divergence
A popular method to training RBMs is CD, which is a special case of an MCMC method known as Gibbs sampling Hinton (2002). The Markov chain is initialized from a point in the dataset, , then the hidden and visible nodes are sequentially re-sampled a number of times. A distorted model expectation is then computed from the reconstructed . In practice, choosing some finite introduces a bias, but empirically it is found that using gives a sufficient signal for learning Carreira-Perpinan and Hinton (2005). Since the CD chain starts from a point in the dataset (i.e., a sample from the data distribution), difficulties arise when the model distribution represented by the RBM contains modes at points where the data distribution has negligible probability. CD will have a hard time finding and hence pushing down these spurious modes. This, coupled with the prohibitively slow mixing of this MCMC method due to random-walk exploration, and typical high dimensionality of the problem, renders CD not a particularly effective method for unsupervised learning.
Ii Mode Informed Weight Updates
After these preliminaries we can now describe our mode-training method. In a nutshell, it consists in replacing the average in the model term of Eq. (6) with the mode of at appropriate steps of the training procedure. However, is very cumbersome to compute (see Eq. (5)), thereby adding a considerable computational burden. Instead, we sample the mode of , the model distribution of the RBM.
The rationale for replacing the mode, , of with the visible configuration of the mode, , of is because the two modes are equivalent with high probability under scenarios typical for different stages of the RBM pre-training. We prove this rigorously in the appendix, while here we provide numerical evidence of this fact.
ii.1 Mode Correspondence of Joint and Marginal PMF
To illustrate the equivalence between the modes of and , let us begin by expressing the joint PMF in terms of the product of the marginal PMF over the visible layer and the conditional PMF over the hidden layer . For any given visible configuration , we then have . We can then define the hidden “activation” of to be
which allows us to write . Note that we can interpret as a measure of the “certainty” that the hidden nodes acquire the value or .
It is then clear that we can write the probability amplitude of the mode of the joint PMF as , where we have used the fact that and is the mode of the marginal PMF, . If then we have modal equivalence of the joint and marginal PMFs.
In Fig. 1, we plot the evolution of as a function of the number of CD-1 training iterations for a shifting bar synthetic dataset, which is small enough that we can compute the exact mode of at any iteration. The figure indeed shows that approaches rather quickly as pre-training proceeds. The activation of a random visible configuration is being used as comparison.
In the appendix we also prove that the condition of being close to 1 is not necessary for establishing modal equivalence. In fact, we prove that it is still possible for the two modes to be equal even when the weights are small (thus a smaller value). Additionally, we show in the appendix that mode training is more effective in exploring the PMF of the model distribution for RBM instances of greater frustration. The latter is a measure of the degeneracy of the low-energy states of an RBM, and thus the difficulty of finding the ground state configuration. Since it was shown that the frustration of the RBM increases as pre-training proceeds Pei et al. (2019), in order to effectively utilize the power of mode training, the frequency of mode updates should be higher at the later stages of the training than the earlier stages.
ii.2 Optimal Mode-Training Schedule
The results from the previous subsection then suggest a schedule for the mode training routine that performs mode updates more frequently the longer the pre-training routine has elapsed.
To realize this, we use a sigmoid, , to calculate the probability of replacing the data driven hidden CD term with a mode driven term at the iteration step
Here, is the maximum probability of employing a mode update, and and are parameters that control how the mode updates are introduced into the pre-training. They are chosen such that the frequency of mode updates becomes dominant only when both the conditions of large weights and frustration are met (see Sec. III for the value of these parameters). Initially, will be small, since the joint- and marginal-distribution modes are unequal, and gradually rises to its maximal value when the modes are of equal magnitude. Note that one may employ different functions to quantify the degree to which the joint- and marginal-distribution modes equalize during training. However, we have found that the sigmoid works well enough in practice.
ii.3 Combining MCMC with mode updates
We are now ready to outline the full procedure of mode training, that combines a MCMC method with the mode updates following the schedule (8). Although one may choose any variation of the MCMC method to train RBMs, for definiteness of discussion, we consider here the standard training method, CD Hinton (2002). In this case, weight updates follow the modified KL gradient. As discussed in Section I, it evaluates to a difference of two expectations called the data term and model term which we can write as
where is the CD update learning rate, and the expectation in the second term is taken over the reconstructed distribution over a Markov chain initialized from the dataset after Gibbs samples ( in most cases). When driving the weights with samples of the RBM ground state with the schedule (8), we use instead the following update,
where is the mode of the joint RBM model distribution. Note that the mode update learning rate, , may be different from the CD learning rate, .
We also stress that the updates in Eq. (10) are in an off-gradient direction. As we show now, this is the reason for the increased stability of the training over MCMC approaches, and its convergence to arbitrarily small KL divergences.
ii.4 Stability and Convergence
The data term, which is identical in both Eq. (9) and Eq. (10), tends to increase the weights associated with the visible node configurations in the dataset, thereby increasing their relative probabilities compared to states not in the support set, . Instead, the model term decreases the weights/probability corresponding to highly-probable model states. CD does this poorly and often diverges, while mode training achieves this with better stability and faster convergence (see Fig 2). We provide here an intuitive explanation of this phenomenon, while a formal treatment on this topic will be provided in the appendix.
The pre-training routine can be broken down in two phases. In the first phase, the training procedure attempts to discover the support of the data distribution . We call this phase the discovery phase. To better see this, consider a randomly initialized RBM with small weights. These small and uncorrelated weights give rise to RBM energies close to zero for all nodal states, or for all and , see Eq. (1). This results in the model distribution being almost uniform.
Therefore, we see that in the discovery phase of training, the model term plays little role in the training as it simply pushes down on the weights in a practically uniform manner, with . On the other hand, the data term drives the initial phase of the training by increasing the marginal probability of the visible states in the support, . We can then employ a large learning rate (say, ) in the beginning of the training, driving the visible layer configurations in the dataset, , to high probability versus configurations outside the support. Empirically, we find that CD training performs in the discovery phase reasonably well, and is quickly able to “find” the visible states in the support.
Now, having discovered the support, we arrive at the second phase of the training where we have to bring the model distribution as close to uniformity as possible over the support in order to minimize the KL-divergence. We call this phase the matching phase of the training, where we bring the model distribution as close to the data distribution as possible. CD usually performs poorly in this phase (see Fig. 2). To see this most directly, we simply have to consider a visible state with a slightly larger probability than the other states. It should then be necessary for the model term to locate and “push down” on this state to increase the uniformity of the distribution over the support. However, for any CD approximation of the model term, this rarely happens in a timely manner as the mixing rate of the MCMC chain is far too slow to locate this state before the training diverges.
This is where samples of the mode are most effective, and can assist in the correction of the states’ amplitudes. As we have discussed in Sec. II.1, finding the modal state, , of the model distribution, allows us to immediately locate the mode, , of the marginal probability, , and “push” down on this state through an iteration of weight updates. This “push” may result in another state “popping” up and becoming the new modal state; however, often times the probability amplitude of this new state will be less than that of the previous mode (see also the appendix). This results in a training routine that “cycles” through any dominant state that emerges at each iteration, and the probability amplitude of the mode decreases as training proceeds until the probability amplitudes of all the states in the support become equal (see the formal demonstration of this in the appendix), which results in the desired uniform distribution over the support. This can be visualized as a “seesaw” between the dominant states, with the oscillation amplitude of this seesaw decaying to zero in time.
As it should now be clearer, these mode-driven updates are deviations from the gradient direction, since in general the mode over the model distribution is different from the expected value. This makes the mode-training algorithm, which mixes mode driven samples and data-driven ones, distinct from gradient descent. This is also supported by the fact that our method tends toward a particular class of distributions (uniform), when gradient descent would settle in some local minima or saddle points in the KL landscape.
The free parameters in this method are the schedules of the mode sample using (defined by , and in Eq. (8)) and the CD learning rate, . With fixed, we set , where , with being the ground state of the corresponding RBM with nodal values . This particular choice of
is an upper bound to the learning rate which minimizes the RBM energy variance over all states (see the appendix for the proof of this statement).
Finally, we find that the mode training method is not very sensitive to the parameters chosen. In fact, as long as the mode samples are incorporated after the joint and marginal mode equilibration, the training is stabilized and the learned distribution will tend to uniformity (see also the appendix). This result reinforces the intuitive notion that the pushes on the mode provide a stabilizing quality to the training over CD (or any other MCMC approach), which can otherwise diverge when mixing rates grow too large at later times during training.
ii.5 Importance of Representability
Note that since mode training is driven to distributions of a particular form, instead of local minima as in the case of CD or other gradient approaches, the representability of the RBM becomes important. The ability of a RBM to represent a given data distribution is given by the amount of hidden nodes, where one is guaranteed universal representability with hidden nodes Le Roux and Bengio (2008). In other words, one more hidden node than the number of visible configurations with non-zero probability is sufficient (but perhaps not necessary) for general representability. In practice, this bound is found to be very conservative and typically much fewer nodes are needed for a reasonable solution.
Representability can become an issue in mode training when the parameter space of the RBM does not include the uniform distribution over the support (or a reasonable approximation). Since the mode training is generally in a non-gradient direction, this means that it may settle to a worse solution than a local optimum obtainable by CD. This is a signal that more hidden nodes are required for an optimal solution.
Since most natural datasets live on a very small dimensional manifold of the total visible phase space, , the amount of hidden nodes required typically scales polynomially with the size of the problem, versus the exponential scaling of the visible phase space. This makes representability not an insurmountable problem for mode training, even in full size problems. To this end, the examples of Fig. 2 and Fig. 3 show that mode training does not necessarily fail if the number of hidden nodes is less than that needed to guarantee representability.
As examples of our method, we have computed the log-likelihoods achieved with mode training across two synthetic and one realistic (MNIST) datasets, and compared the results against the best achieved log-likelihoods with CD-1, PCD-1 and PT on standard RBMs, E-RBMs, and C-RBMs Melchior et al. (2016). For the small synthetic datasets we could also compute the exact log-likelihoods, thus providing an even stronger comparison. For the larger MNIST case, mode sampling was done via simulation of a digital memcomputing machine based on Ref. Bearden et al., 2019. The specific details of our implementation can be found in the appendix.
For synthetic data, we use the commonly employed binary shifting bar and bars and stripes datasets, both described in Ref. MacKay, 2003
. The former is defined by two parameters: the total length of the vector,, and the amount of consecutive elements (with periodic boundary conditions), , set to one, with the rest set to zero. This results in unique elements in the dataset with uniform probability, giving a maximum likelihood of . The inverted shifting bar set is obtained by swapping ones and zeros. The bars and stripes dataset is constructed by setting each row of a binary pattern to one with probability 1/2, and then rotating the resulting pattern with probability 1/2. This produces elements, with the all-zero and all-one patterns being twice as likely as the others.
For a direct comparison to previous work, we followed the same setup as Ref. Melchior et al., 2016. A 94 RBM was tested on a shifting bar dataset with , and a bars and stripes dataset. Both synthetic sets were trained for 50,000 parameter updates, with no mini-batching, and a constant . For the MNIST dataset, a
sized model was trained for 100 epochs, with batch sizes of 100. The mode samples in both cases are slowly incorporated into training in a probabilistic way following Eq. (8), initially with and driven to for the shifting bar and MNIST datasets, and for the bars and stripes dataset. In both cases, we chose and , where is the total number of parameter updates.
We plot an example of training progress in a moderately large synthetic problem in Fig. 3. Reported is the KL divergence (which differs from the log-likelihood by a constant factor independent of the RBM parameters Goodfellow et al. (2016)) of a slightly bigger RBM as a function of number of parameter updates on a , shifting bar set, for both CD-1 and mode training. We consider two learning rate schedules, constant and exponential decay . Additionally, every time a mode sample is taken, CD is allowed to run with , a number scaled to the equivalent computational cost of taking a mode sample. The details of the computational equivalence between a mode sample using memcomputing and iterations of CD are discussed in greater detail in the appendix. In both cases, even when computational cost is factored in, mode training leads to better solutions and proceeds in a much more stable way across runs (lower KL variance at convergence). Importantly, mode training never diverges while CD oftentimes does. Following our intuition about mode training established in Sec. II, using larger learning rates in the CD-dominated phase accelerates the convergence of mode training.
It is known that using CD to train RBMs can result in poor models for the data distribution Hinton (2010), for which PCD and PT are recommended. We note that for the mode training employed in this paper, CD-1 was employed as the gradient approximation (except in the case for MNIST where PCD-1 was used). Impressively, in all cases tested, the mode samples were able to stabilize the CD algorithm sufficiently to overcome the other, more involved approximations (PT) and model enhancements (centering).
|S. Bar||Inv. S. Bar||Bars & Stripes||MNIST|
In addition, it is clear that mode training exhibits several desirable properties over CD (or other gradient approaches). Most significantly, it seems to perform better with larger learning rates during the gradient dominated phase, and smaller learning rates when using mode samples. CD and other gradient methods generally perform better with smaller learning rates, as their approximation to the exact gradient gets better. Irrespective, even in this regime, the mode training eventually drives the system to the uniform solution compared to the local optimum of CD. The main advantage is that with mode training, one can (and often should) use larger learning rates, resulting in fewer required iterations.
For further comparison, we report in Table 1 results for the shifting and inverted bar, bars and stripes, and MNIST datasets obtained with mode training and those reported in Ref. Melchior et al., 2016. The results show mode training with a standard RBM always converges to models with log-likelihoods higher than E-RBMs, and C-RBMs trained with CD-1, PCD-1, or PT. Furthermore, the mode training log-likelihood increases with an increasing number of hidden nodes (better representability). Empirically, we also find the incredible result that with sufficient representability and the proper learning rate, mode training can find solutions arbitrarily close to the exact distribution.
In this paper we have introduced an unsupervised non-gradient training method that stabilizes gradient based methods by utilizing samples of the ground state of the RBM, and is empirically seen to get as close as desired to a given uniform data distribution. It relies on the realization that as training proceeds, the RBM becomes increasingly frustrated, leading to the modes of the visible layer distribution and joint model distribution becoming effectively equal. As a consequence, by using the mode (or ground state) of the RBM during training, our approach is able to “flatten” all modes in the model distribution that do not correspond to modes in the data distribution, reaching a steady state only when all modes are of equal magnitude. In this sense, the ground state of the RBM can be thought of as ‘supervising’ the gradient approximation during training, preventing any pathological evolution of the model distribution.
Our results are valid if the representability of the RBM is enough to include good approximations of the data distribution. Once the representability is sufficient, a properly annealed learning-rate schedule will take the KL divergence as low as desired. Increasing the number of hidden nodes increases the non-convexity of the KL-divergence landscape, easily trapping standard algorithms in sub-optimal states. In practice, after some point, increasing the number of hidden nodes will not decrease the KL divergence that a pre-training procedure actually converges to, as the trade-off between effective gradient update and representation quality is reached. We here claim that this point of tradeoff for our mode-assisted procedure is reached at far greater number of nodes than standard procedures, thus allowing us to find representations with far smaller KL-divergence. The mode training we suggest then provides an extremely powerful tool for unsupervised learning, which i) prevents a divergence of the model, ii) promotes a more stable learning, and iii) for data distributions uniform across some support, can lead to solutions with arbitrary high quality.
To scale our approach, one would need an efficient way to sample low-energy configurations of the RBM, a difficult optimization problem on its own. This is equivalent to a weighted MAX-SAT problem, for which there are several industrial-scale solvers available. Also, the recent successes of memcomputing on these kind of energy landscapes in large cases (million of variables) are fodder for optimism Traversa et al. (2018); Sheldon et al. (2018).
Finally, fitting general discrete distributions (with modes of different height) with this technique seems also within reach. In this respect, we can point to our results on the bars and stripes dataset (a non-uniform ) for inspiration. We have found the best log-likelihood on that set with mode training with a lower frequency of the mode sampling, , compared to the shifting bar (a uniform set). This suggests that a general update, which properly weighs the mode sample in combination with the dataset samples, may extend this technique to general non-uniform probabilities, with the weight analogous to a tunable demand for uniformity.
Our method is useful from a number of perspectives. First, from the unsupervised learning point of view, it opens the door to the training of RBMs with unprecedented accuracy in a novel, non-gradient approach. Second, many unsupervised models are used as ‘feature learners’ in a downstream supervised training task (e.g., classification), where the unsupervised learning is referred to as pre-training. We suspect that starting backpropagation from an initial condition obtained through mode training would be highly advantageous. Third, the mode training we suggest can be done on models with any kind of pairwise connectivity, which include deep, convolutional, and fully-connected Boltzmann machines. We leave the analysis of these types of networks for future work.
Work supported by DARPA under grant No. HR00111990069. H.M. acknowledges support from a DoD-SMART fellowship. M.D. and Y.R.P. acknowledge partial support from the Center for Memory and Recording Research at the University of California, San Diego. All memcomputing simulations reported in this paper have been done on a single core of an AMD EPYC server.
- Ackley et al. (1985) D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, Cognitive science 9, 147 (1985).
- Goodfellow et al. (2016) I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep learning, Vol. 1 (MIT press Cambridge, 2016).
- Le Roux and Bengio (2008) N. Le Roux and Y. Bengio, Neural computation 20, 1631 (2008).
- Bengio et al. (2009) Y. Bengio et al., Foundations and Trends in Machine Learning 2, 1 (2009).
- Carleo and Troyer (2017) G. Carleo and M. Troyer, Science 355, 602 (2017).
- Gao and Duan (2017) X. Gao and L.-M. Duan, Nature communications 8, 662 (2017).
- Hinton (2002) G. E. Hinton, Neural computation 14, 1771 (2002).
- Szegedy et al. (2013) C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, arXiv preprint arXiv:1312.6199 (2013).
- Galloway et al. (2019) A. Galloway, A. Golubeva, T. Tanay, M. Moussa, and G. W. Taylor, arXiv preprint arXiv:1905.02161 (2019).
- Erhan et al. (2010) D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio, Journal of Machine Learning Research 11, 625 (2010).
- Manukian et al. (2019) H. Manukian, F. L. Traversa, and M. Di Ventra, Neural Networks 110, 1 (2019).
- Di Ventra and Pershin (2013) M. Di Ventra and Y. V. Pershin, Nature Physics 9, 200 (2013).
- Traversa and Di Ventra (2017a) F. L. Traversa and M. Di Ventra, Chaos: An Interdisciplinary Journal of Nonlinear Science 27, 023107 (2017a).
- Di Ventra and Traversa (2018) M. Di Ventra and F. L. Traversa, J. Appl. Phys. 123, 180901 (2018).
- Sminchisescu and Welling (2007) C. Sminchisescu and M. Welling, in Artificial Intelligence and Statistics (2007) pp. 516–523.
- Lan et al. (2014) S. Lan, J. Streets, and B. Shahbaba, in AAAI (2014) pp. 1953–1959.
- Arora and Barak (2009) S. Arora and B. Barak, Computational Complexity: A Modern Approach (Cambridge University Press, 2009).
- Traversa et al. (2018) F. L. Traversa, P. Cicotti, F. Sheldon, and M. Di Ventra, Complexity 2018 (2018).
- Sheldon et al. (2018) F. Sheldon, F. L. Traversa, and M. Di Ventra, arXiv preprint arXiv:1810.03712 (2018).
- Melchior et al. (2016) J. Melchior, A. Fischer, and L. Wiskott, The Journal of Machine Learning Research 17, 3387 (2016).
Fischer and Igel (2012)
A. Fischer and C. Igel, in
iberoamerican congress on pattern recognition(Springer, 2012) pp. 14–36.
- Long and Servedio (2010) P. M. Long and R. A. Servedio, (2010).
- Carreira-Perpinan and Hinton (2005) M. A. Carreira-Perpinan and G. E. Hinton, in Aistats, Vol. 10 (Citeseer, 2005) pp. 33–40.
- Pei et al. (2019) Y. R. Pei, H. Manukian, and M. Di Ventra, arXiv preprint arXiv:1905.05334 (2019).
- Bearden et al. (2019) S. R. B. Bearden, F. Sheldon, and M. Di Ventra, EPL (Europhysics Letters) 127, 30005 (2019).
- MacKay (2003) D. J. MacKay, Information theory, inference and learning algorithms (Cambridge university press, 2003).
- Hinton (2010) G. Hinton, Momentum 9, 926 (2010).
- Traversa and Di Ventra (2017b) F. L. Traversa and M. Di Ventra, Chaos: An Interdisciplinary Journal of Nonlinear Science 27, 023107 (2017b).
- (29) This is a justified assumption if the system is highly frustrated, as the energy gaps near the ground state are generally very small for such system.
- (30) From here on, ‘iid’ will serve as the abbreviation of ‘independent and identically distributed’.
Appendix A Sampling with Memcomputing
The mode-training method introduced in the main text requires sampling the mode of the model distribution of a given RBM. This task can be transformed to sampling the optimum of an equivalent weighted, mixed maximum satisfiability (MAX-2-SAT) optimization problem Manukian et al. (2019). To obtain high-quality samples for large models, we employ the memcomputing approach Di Ventra and Pershin (2013); Traversa and Di Ventra (2017b); Di Ventra and Traversa (2018), a novel computing paradigm that employs memory to both store and process information.
a.1 Memory Dynamics
Our implementation is based on the approach used in Ref. Bearden et al., 2019 for the satisfiability (SAT) problem, appropriately modified for the MAX-2-SAT optimization problem. For a MAX-2-SAT with variables, 1-SAT clauses, and 2-SAT clauses we have and . In this case, the equations used to simulate a digital memcomputing machine read
The voltages, , are continuous representations of the Boolean variables of the problem, , with a false assignment represented as , a true assignment represented as , and is ambiguous. Rather than thresholding the voltages to check the clause states, we use the clause function directly. A 2-SAT clause in Boolean form is comprised of two literals, , where a literal in the -th clause, , is either a negated, , or unnegated, , variable. The Boolean clause is represented as a continuous clause function,
The factor contains the information about the relation between the literal in the -th clause, , and its associated variable, ; it evaluates to if , and if . The function is bounded, , and we consider a clause to be satisfied when . By thresholding the clause function we also avoid the ambiguity associated with .
Each clause has a “fast”, , and a “slow”, , memory variable that serve as indicators of the history of the state of . The memory is “fast” in the sense that it contains information of the recent history of , and “slow” in the sense that it contains information on the entire history of . Both memory variables are bounded, and . The offset in Eq. (12) is used to remove spurious steady-state solutions.
The gradient-like term in Eq. (11) is if variable is not associated with any literal in clause . Otherwise,
where is the value of the voltage corresponding to the other literal in the clause. The “rigidity” term in Eq. (11) is
This term only influences the voltage that is closest to the satisfying assignment in the clause.
The weight of each 2-SAT clause, , is incorporated in the dynamics of the slow memory variable and the dynamics of voltages. The weights of the 1-SAT clauses are used to bias the voltage dynamics in Eq. (11) as , where is the weight of the 1-SAT with a literal that is equivalent to variable and is the weight of the 1-SAT with a literal that is the negation of variable . The weight is zero if no corresponding 1-SAT exists.
The parameter values used for the simulations reported in the main text are , , . At , voltages are randomly initialized with and . The equations are then numerically integrated with the forward Euler method using an adaptive time step, , until a total integration time of is reached. Then, we take the configuration with the lowest number of unsatisfied clauses as the sample.
a.2 Computational Cost of Sampling with Dynamics
For simplicity, let us assume the form of an RBM, resulting in voltage variables and clauses as a MAX-2-SAT instance. The time complexity of a forward Euler integration step is dominated by the sparse matrix-vector multiplication of a sparse matrix. Since each node connects to other nodes this matrix contains non-zero elements encoding the connectivity structure of the problem. This matrix multiplies the gradient vector of length , for a total of floating point operations per second per time step. If the maximum number of timesteps is independent of system size, the total time complexity is then . Memory complexity also scales as , since the algorithm requires the storage of 4 length floating point elements, and a sparse matrix with non-zero elements.
a.3 Performance Comparison Between CD and Memcomputing
Here, we want to show that the RBM energy sampled by the memcomputing approach is consistently better than the one found by CD independently of the size of the RBM. Even though memcomputers are ideally realized as physical devices in hardware Di Ventra and Pershin (2013); Traversa and Di Ventra (2017b), here we compare their performance as numerically integrated dynamical systems versus traditional algorithmic methods (e.g., CD).
We then first compute an “exchange rate” between one iteration of the numerical integration and steps of CD, such that the resulting computational complexity (i.e., wall time on the same processor) will be essentially identical. We discover empirically, across a large range of system sizes, that this exchange rate is about 30 steps of CD per iteration of the dynamics described by Eqs. 11, 12, and 13.
We choose as our test problems a set of randomly initialized
RBMs, with all weights sampled from a normal distribution withand . The system sizes ranged from to 1000, which we chose to be large enough to observe the scaling in time. We then compute the relative energy differences, in percentage, , between the energy obtained with the memcomputing ODEs described above, as compared to the energy obtained using CD-. For a direct comparison, we have run the memcomputing solver for integration steps, scaled with the system size. Contrastive divergence was then run using the empirical exchange rate, , resulting in the same computational cost seen in Fig. 4 (right panel).
The energy results are plotted in Fig. 4 (left panel), where the memcomputing dynamics perform very favorably in terms of energies obtained compared CD-, consistently above 400%, often showing an improvement of more than 1000%. In terms of time complexity (right plot), both algorithms follow the same linear trend on a log-log plot, indicating a polynomial scaling. Indeed, the best fit asymptotic behavior of both algorithms is almost cubic. This is consistent with our complexity analysis in the last section. Since both algorithms have a leading order scaling of for a fixed number of iterations, they would scale cubically if we allowed the number of iterations to grow as , the system size. Finally, we want to stress that the set of equations used in the present work are only an example of how to implement a memcomputing solver and have not been optimized in terms of both speed and performance.
Appendix B Stability of Mode-Assisted Training
The stability of a pre-training procedure to training neural networks is a very desirable feature. This is because the KL divergence cannot be monitored during the pre-training process for a realistically sized RBM, so it is crucial for us to ensure that the KL divergence does not diverge. In this section, we show that using the mode in the model update term will guarantee convergence to a uniform distribution, and there is an optimal learning rate that provides the largest rate of convergence, with the learning rate being easily computable.
Note that in this work, the data term of a mode-assisted update is the same as traditional CD algorithms, so the difference is entirely in the way that the model term is approximated. Therefore, we only have to focus on the model term (which is “approximated” by the mode of the joint distribution), and point out some of its key properties, in particular those pertaining to the stability of the pre-training procedure.
b.1 Gauging the RBM
For the sake of simplicity, we consider an unbiased RBM with nodal values of and , then the RBM energy is given by:
Note that an RBM with nodal values can be trivially transformed into one with nodal values . For the analysis in this appendix, we will always assume (unless specifically mentioned) that an RBM is unbiased and equipped with nodal values .
Since in our work, we are interested in particular to the mode of the joint distribution, which is equivalently the nodal configuration that minimizes the RBM energy, we give a special denotation to this configuration, , and name it the ground state of the RBM energy.
Definition B.1 (Ground State Energy).
Given an RBM with weights , we denote the ground state of this RBM to be
Furthermore, we denote the ground state energy to be
Note that in practice, the ground state of an RBM can be thought of as being unique. In fact, for randomly initialized weights, the probability of having two or more minimal energy states, or degenerate ground states, is of measure zero. In theory, if there were to be multiple ground states, we can randomly select one of them to be , and our analysis will not be affected at all.
Note that for any RBM, we can always map it to an equivalent RBM such that the ground state is . This is called a gauge operation, which we formally define as follows
Definition B.2 (Gauged RBM).
Given an RBM with weights and ground state , we define the gauge mapping such that satisfies the following condition:
Then we call a gauged RBM.
Remark. Note that by this definition, it is easy to see that the ground state of any gauged RBM must be . This means that the ground state energy of a gauged RBM is simply the sum of its weights
Furthermore, note that the form of the weight update equation is invariant under conjugation. In other words, if we let denote one iteration of weight update, then it is clear that
This means that the dynamics of can be analyzed in terms of the dynamics of . In this section, we will always assume that the RBM is gauged.
For a gauged RBM, the change of the weight elements (under unit learning rate) as a result of an iteration of mode-informed update is:
Therefore, we see that every weight element is decremented by 1 uniformly across the entire weight matrix, and the energy change of the ground state energy is:
In order to investigate how the joint probability mass function (joint PMF), or , evolves under mode training, we have to look at how the energy changes for all nodal states. To do so, it is useful to define a distance measure between two states, to have a sense of how “far apart” the two states are. We then propose the following distance measure.
Definition B.3 (Metric).
We define a spin state to be an ordered -tuple given by . Given two spin states, and , we let and , we define the distance to be
Note that simply counts the number of visible nodes that are different between the two states, and counts the number of different hidden nodes that are different. Note that the space of with this distance definition is a pseudometric space, in the sense that it is possible for the distance between two distinct points to be zero, in particular states that are related by symmetry (or global spin flips). This can be easily verified by letting , giving us and , and . In this pseudometric space, the distance is a measure of how “different” two spin states are up to a symmetry. A formal discussion of this metric, including a proof of triangle inequality, is provided in Appendix C of our related work Pei et al. (2019). The usefulness of defining the metric this way will be apparent in proposition B.1.
Remark. It is important to note that is not uniquely determined by . To see this clearly, we rewrite Eq. (19) in terms of the following Diophantine equation
solving for integers and . It is easy to see that this equation is over-determined by realizing that it is possible for the RHS to have multiple prime factors.
b.3 Energy Change
Equation (18) gives the change in the energy of the ground state under a mode-assisted update iteration. However, to analyze the stability of the training procedure, it is necessary to look at the energy change of all states. To simplify our discussion, instead of looking at the energy of each individual state, let us consider the average energy of all the states distance from the modal configuration, which we denote as . Note that the average is not the expected value over the joint PMF . Rather, it is an unweighted average (or the expected value over a uniform probability measure). It is interesting to note that this average energy only depends on the ground state energy and the distance from the ground state.
Proposition B.1 (Average Energy).
The average energy of states distance from the ground state is:
Given some distance , there can be multiple assignments of that correspond to this distance. However, if given a particular tuple , we show that the average energy of all states with spins differing from the ground state by is only dependent on the distance corresponding to the tuple, then the average energy of states of distance from the ground state is simply the average energy of states of with spins differing from the mode by .
The average energy of states with spins differing from the ground state by can be expressed as
where in the last equality, we used the linearity of the expected value and the symmetry of the RBM. We easily see that the marginal probability distribution of a single spin is given by (with the underlying joint distribution being uniform)
which gives us
Therefore, the average energy of states distance from the mode is also . ∎
Since the average energy distance from the ground state is only dependent on , we expect this to be true also for the change in energy for a state at a distance from the ground state, under the weight update routine given in Eq. (17).
Proposition B.2 (Energy Change).
Given any state distance from the ground state, the change in the energy of that state is given by
Again, we only have to focus on one particular assignment of the tuple which corresponds to the distance , and show that the change in energy of a state corresponding to that tuple depends only on . Without loss of generality (WLOG), we assume that the first visible nodes are of value , and the first hidden nodes are of value . Then the change in energy is given by:
where we have used the fact that from Eq. (17). ∎
Remark. Note that the energy change is only dependent on the size of the RBM and the distance from the ground state, so all the states at distance experience the same energy change. Under a given learning rate , the actual energy change is then
At this point, it is necessary to take an intermission to look at the role that the mode update term plays in the pre-training procedure. From Eq. (20), we see that the energy change of a state distance from the ground state is proportional to the average energy of the states at the same distance . In the context of the entire pre-training procedure, this energy change can be interpreted as a constant drift term that pulls the energy back to zero with strength proportional to the average energy of all the states of the same distance. Loosely speaking, the joint distribution will become more uniform under an iteration of mode-assisted update.
Note that this behavior can also be achieved with standard regularization procedures such as an exponential weight decay term like . However, such regularization techniques are usually undesirable as they do not induce an effective sampling of a multi-modal distribution. Our procedure, however, does not suffer from such drawbacks, and in fact promotes the effective sampling of a multi-modal distribution (see section C).
b.4 Approaching Uniformity
In this section, we formalize the argument that the RBM energies over all states become more uniform under a mode-assisted update iteration. To do so, we mainly focus on the energy variance across all states, and show that it must decrease under a suitable learning rate. This statement can be made more precise as follows.
Theorem B.3 (Decrease in Energy Variance).
If , then the variance of the energies over all spin states decreases. The largest decrease in variance occurs when .
We reiterate the fact that the underlying PMF for the states is assumed to be uniform, or for every nodal configuration
. We can then define a random variablewith its PMF being:
which can be interpreted as the probability of a randomly chosen state to be a distance from the ground state. From this PMF expression, we can easily derive the expected value and the variance of the distance of two randomly chosen states
where we see that the variance is small relative to the expectation value for a large system. We then use the law of total variance to write the variance of the energies over all states as
We first begin by focusing on the first term. Note that the term is the conditional variance of energies of the states distance from the mode. If we update the energies according to Eq. (20), then the new variance can be written as . The term is dependent only on but not the specific nodal configuration , so it is just a constant offset in the context of the conditional variance, and the variance will remain constant. Therefore, the first term of the variance decomposition is constant, and we only have to focus on the second term, which can be conveniently written as:
After a weight update, this variance becomes
In this form, it is easy to see that the variance decreases when the learning rate satisfies
with the largest decrease being , which occurs at the learning rate . This is then our optimal learning rate. ∎
Remark. To avoid confusion, note that is negative, so is positive, so the learning rate is bounded in some positive interval. Note that the two biases for the visible and hidden spins can be expressed as two ghost spins Pei et al. (2019), thereby effectively adding one more spin to each layer. By taking into account the biases, we see that the largest decrease of the variance occurs when
which is what we use in the main text.
There are two important things to note here. First, the learning rate, as presented in Eq. (25) is generally very large and is only optimal in the sense that it provides the fastest convergence to a uniform joint PMF, which is desirable for a stable pre-training routine, but not necessarily optimal for minimizing the KL divergence. The practical usefulness of Eq. (25) is to mainly provide an upper bound to the learning rate that ensures stability. It should be noted that the analysis ignores the presence of the data term (see Eq. (6) in the main text) and is only carried out over a single iteration; in other words, it may be possible that a large learning rate will force the system into a local minimum in the KL divergence rather quickly. Therefore, in the practical setting a smaller learning rate would be more beneficial. In the main paper, we then normalized this learning rate with the learning rate of CD, which results in (as ).
The second thing to note is that Eq. (25) is not exact as the ghost spins are fixed nodes that cannot be “flipped”, so theorem B.1 no longer applies, meaning that the average energy of states distance from the ground state can no longer be uniquely determined by and alone. Nonetheless, for large RBMs, the contribution from biases are relatively small, and the approximation is close to exact.
b.5 Suboptimal Updates
Before we conclude this section, we make two final remarks concerning suboptimal updates, or updates that are not informed by the global mode directly. The first remark pertains to a practical setting where locating the global mode is difficult or too computationally expensive, and only an approximate mode can be obtained, or a state with energy close to the ground state. We discuss how an update informed by this state still ensures stability. The second remark compares a mode-assisted update with an update with the model term sampled by some form of stochastic algorithm (such as CD), and we show that the latter update procedure does not ensure stability.
Note that in Eq. (B.1), we transformed the weight elements such that the ground state is and . However, this procedure is general and can be done for any given state. Given any two states, and with some associated energy , it is always possible to gauge the RBM in a way such that and . The previous proofs will still carry through for as long as . This means that the mode training procedure does not hinge on the fact that the weight update has to be informed by the exact ground state, and any state sufficiently close to the ground state should suffice. However, it should be noted that using the ground state to inform the weight update provides the greatest decrease in energy variance since the maximum of scales quadratically with (see Eq. (23)).
Note that in theorem B.3, the argument that the conditional variance of the energies conditioned on some distance from the ground state does not change is based on the fact that the weights are updated uniformly across the RBM according to Eq. (17). However, for a stochastic algorithm, the weight updates are clearly not uniform (or even deterministic for that matter), so nothing can be said about the change of the conditional variance. It is possible for the conditional variance to increase under a stochastic update, thus pulling the energies away from uniformity if the magnitude of the increase overcomes the decrease in the second term in Eq. (22) (the ground state variance).
To conclude this subsection, we discuss briefly the contribution of the data term in updating the weight matrix. Clearly, if we look at the gauged RBM matrix, the change in each element generated by the data term is bounded above by , meaning that its contribution cannot overcome the guaranteed decrease generated by the mode update term. This means that it is impossible for the ground state energy to decrease even in the presence of the data term, so the mode of the joint distribution must not increase, thus the training never diverges. This effectively ensures the global stability of our mode-assisted training method.
Appendix C Efficient Sampling of Multi-modal Distributions
So far, we have shown that our update procedure guarantees stability. However, as briefly mentioned at the end of section B.3, stability is also guaranteed by standard regularization terms such as the weight decay term, . In this section, we make the crucial distinction between our procedure and standard weight regularization procedures by pointing out the key phenomenon that our procedure is capable of efficiently exploring the landscape of a multi-modal PMF.
This property of the mode-training method is most readily analyzed from the perspective of the frustration index of the RBM instance. The frustration index can be interpreted as a measure of the difficulty of discovering the nodal ground state of a given RBM instance, and interestingly, an increase in the frustration index is correlated with an increased rate of exploration of the multi-modal distribution. Therefore, in some sense, for a given iteration of weight updates, the difficulty of finding the mode of that distribution is “compensated” by an increased efficiency of PMF exploration.
We begin by formally defining the frustration index, followed by a brief explanation of how the mode-training algorithm explores efficiently the PMF. Finally, we relate the two concepts in a cohesive manner. We provide an extensive analysis on the frustration of the RBM and its practical applications in our related work Pei et al. (2019).
c.1 Frustration Index
The frustration index is the ratio between the sum of unsatisfied couplings at the ground state and the sum of all coupling strengths. Formally, for a gauged RBM, it can be defined as follows
This index is closely related with the degeneracy of the low-energy states. In other words, with an increase in the frustration index, the excited states will be spaced closer to the ground state in energy. Furthermore, for a highly frustrated system, the transition from the ground state to the excited states usually involves flipping a large cluster of nodes. This gives rise to a large population of local minima in the energy landscape spaced far apart in distance but close together in energy (in terms of the metric discussed in section B.2), and this property of a highly frustrated system makes it difficult for local search algorithms to locate the global minimum. This motivates the need for an algorithm that is able to learn the long-range correlations of the RBM spins, and a possible candidate of this algorithm is presented in section A.
c.2 Inefficiency of Weight Decay
In this section, we discuss briefly why the standard weight decay algorithm (where is some learning rate) is not efficient in assisting local algorithms in sampling a multi-modal distribution. To begin with, we first recall that the joint distribution of the RBM is
where for a gauged RBM. Note that the weight decay update is a contracting affine transformation of the energies of all states, or simply a rescaling of the energies by some constant , meaning that the joint distribution transforms as
where the normalization condition is ignored.
Of course, the distribution does become more uniform under this transformation; however, the ordering of the states with respect to their energies will not change, meaning that the ordering of the dominant modes remains invariant under this transformation. In other words, a poorly initialized Markov chain trapped under a dominant mode will still remain trapped unless becomes sufficiently small; this means that a large learning rate, , is required to free the Markov chain and allow efficient exploration of the joint distribution. However, a large learning rate in this context is undesirable, as it brings the RBM to uniformity in a drastic manner, which voids much of the information gained from the previous iterations of pre-training.
The inefficiency of this approach boils down to the indiscriminate update of the weight matrix that is ignorant of the energy ordering of the states or the distance between them (see definition B.3).
Our mode-assisted update, on the other hand, updates the weight matrix based on the ground state configuration of the RBM, resulting in a maximal increase in energy for the ground state, and the energy change is “propagated” to the other states based on their distances from the ground state (see proposition B.2). An entirely different energy landscape will then emerge under this update procedure even under a small learning rate, and it is likely that a new ground state at a faraway distance will “pop” up. The next update iteration is then based on this new found mode, and the process is repeated. Effectively, we are dynamically sampling the energy landscape by making large leaps between dominant states without resorting to forcing uniformity on the energies.
c.3 Global Mode Cycling
For the sake of simplicity, consider a gauged RBM with a joint distribution having three dominant states, with their RBM energies being
. The heuristic analysis in this section can be easily generalized to multi-modal distributions with arbitrary number of dominant states. We can also assume that the pairwise distances between the three modes are the same (meaning that the three modes form an equilateral triangle under the metric defined in definitionB.3), which we can then denote simply as .
If we assume that the learning rate is , then from Eq. (20), we see that the new energies of the three states will become
where we are assuming that the magnitude of the learning rate is much larger than the energy gaps111This is a justified assumption if the system is highly frustrated, as the energy gaps near the ground state are generally very small for such system., or more precisely
where we note that the lower bound of gamma is proportional to the energy difference between the first excited state and the ground state. This guarantees that after one update, the ordering of the new energies of the states will become
which means that is the new ground state energy, and the next iteration of weight update will be based on state , resulting in the following new energies
The energies are then reordered as
so becomes the new ground state energy. And similarly, the third iteration will recover the original energy ordering .
Therefore, we see that in general, whenever we perform a weight update, the energy ordering of the modal states will experience a left circular shift, so we are, in some sense, sampling the multiple modes in a cyclic fashion, which allows us to effectively cover a large volume of the probability measure.
c.4 Relationship between Frustration and Mode Sampling
Now, we discuss how an increase in the frustration index is conducive to an efficient sampling of the multi-modal distribution. We here consider simply a gauged RBM, with ground state energy . We denote the average energy of states distance from the ground state as (see proposition B.1). Under an iteration of mode update, the new energies are (see Eq. (20))
c.4.1 Small Frustration
For the sake of simplicity, consider the case where the frustration index of the RBM is zero, then all the weights can be assumed positive. Furthermore, we make the simplifying assumption that the weights are iid222From here on, ‘iid’ will serve as the abbreviation of ‘independent and identically distributed’. random variables with uniform distribution in . We then make the following claim.
If we update the weight matrix continuously with the mode-assisted update procedure, then the new ground state will differ from the original ground state by distance almost surely.
More formally put, if we let be the minimum energy of states distance from the ground state, and . Then the smallest learning rate for which a new ground state can emerge is
If we let be the distance such that , then
Given any state distance from the ground state, we have
WLOG, we can also assume that is a prime number, then the number of states distance from the ground state (where ) is , which gives us (denoting and ).