Self-regularizing restricted Boltzmann machines

by   Orestis Loukas, et al.
SBA Research

Focusing on the grand-canonical extension of the ordinary restricted Boltzmann machine, we suggest an energy-based model for feature extraction that uses a layer of hidden units with varying size. By an appropriate choice of the chemical potential and given a sufficiently large number of hidden resources the generative model is able to efficiently deduce the optimal number of hidden units required to learn the target data with exceedingly small generalization error. The formal simplicity of the grand-canonical ensemble combined with a rapidly converging ansatz in mean-field theory enable us to recycle well-established numerical algothhtims during training, like contrastive divergence, with only minor changes. As a proof of principle and to demonstrate the novel features of grand-canonical Boltzmann machines, we train our generative models on data from the Ising theory and MNIST.


page 16

page 21

page 22


Restricted Boltzmann Machine with Multivalued Hidden Variables: a model suppressing over-fitting

Generalization is one of the most important issues in machine learning p...

Training Restricted Boltzmann Machines via the Thouless-Anderson-Palmer Free Energy

Restricted Boltzmann machines are undirected neural networks which have ...

A Novel Feature Extraction Method for Scene Recognition Based on Centered Convolutional Restricted Boltzmann Machines

Scene recognition is an important research topic in computer vision, whi...

On an enhancement of RNA probing data using Information Theory

Identifying the secondary structure of an RNA is crucial for understandi...

On better training the infinite restricted Boltzmann machines

The infinite restricted Boltzmann machine (iRBM) is an extension of the ...

Understanding Boltzmann Machine and Deep Learning via A Confident Information First Principle

Typical dimensionality reduction methods focus on directly reducing the ...

Matrix Product Operator Restricted Boltzmann Machines

A restricted Boltzmann machine (RBM) learns a probability distribution o...

1 Introduction

In the past decades, artificial intelligence has increasingly become a major key-player in a vastly wide range of fields. Training a machine to recognize patterns through versatile data, perform classification tasks and make decisions has been proven most of the times particularly successful, quite often outperforming hard-coded programs and human cognition. Also within physics, the implementation of machine learning (

ml) proves beneficial. The related literature ranges from (un)supervised leaning on statistical systems (for a concise introductory review see [2018arXiv180308823M]), many-body problems [Carleo602, 2019arXiv190506034V] and quantum entanglement [2017arXiv170401552L, PhysRevX.7.021021] up to high-energy applications in Particle Phenomenology (e.g. [komiske2017deep, roe2005boosted, baldi2014searching]), String Theory (e.g. [ruehle2017evolving, he2017deep, he2018calabi]) and holography [hashimoto2018deep, gan2017holography, you2018machine, inproceedings].

Yet, there are situations where the machine either after seemingly appropriate training unexpectedly fails to produce an adequate output or, to begin with, cannot learn the given data, at all. The often unpredicted failure of the intelligent algorithms as well as their surprising success at specific tasks signify our lack of a concrete understanding of the theory underlying most of ml

applications. At this point, input from theoretical physics can be proven beneficial. Among the various ideas invoked in the interface between the theoretical description of physical systems and machine learning to interpret and improve (deep) learning algorithms geometrization 

[2019arXiv190102354D], variational approaches [unknown, 2014arXiv1410.3831M, 2016PhRvB..94p5134T] and classical thermodynamics [2017NJPh...19k3001G, Tubiana:2016zpw] have been proposed. In [2017arXiv170804622M, Cossu:2018pxj, Iso:2018yqu, Funai:2018esm, 2018arXiv180909632L, 2018NatPh..14..578K] ideas from the Renormalization Group flow are used to comprehend the flow of configurations triggered by generative models after training on systems from condensed matter physics.

Besides the concrete model and type of task performed (classification vs. generative), a failure of the ml algorithm to recognize the desired features from the target data is intimately related to the existence of various so called hyperparameters

which require fine-tuning before or even during training. Most of those hyperparameters concern the architecture of the


model. This comprises the depth of the neural network, the number of units at each hidden layer and the activation function(s) used. In contrast to hyperparameters, other parameters of the model like the weights and biases are determined during training by extremizing an appropriate information-theoretic metric 

[1957PhRv..106..620J, renyi1961measures]

like cross entropy which conventionally measures how well the model can classify or reproduce the given data.

Generally, deeper networks tend to extract features from a target system with a higher level of sophistication. Similarly, hidden layers of bigger size can approximate functions with increasing accuracy and thus help to learn better a provided data set. However, opting for larger architectures comes at a price. Besides computational efficiency deeper networks sometimes bring no advantage over shallow models [2017arXiv170804622M, Funai:2018esm] or could even lead to instabilities (which come under the name of vanishing and exploding gradient [pascanu2013difficulty, DBLP:journals/corr/IoffeS15]). At the same time, hidden layers with more units tend to overlearn specifics of the concrete (practically finite) data set they are exposed to, while overshadowing the typical traits of the given target system from which the training subset descends. This overlearning (also called overfitting) decreases the ability of our ml model to generalize the “knowledge” acquired during training about a target system to new unseen data.

Evidently, the question arises about optimal architectural choices that keep a balance between learning the desired features of the target data at a satisfactory level and overlearning irrelevant details from the training sample. A priori, efficiently fine-tuning such hyperparameters requires experience and a good understanding of the target system. At a more pragmatic level, to address this issue one usually scans over the hyperparameter space after imposing various constraints based on intuition and/or rules of thumb, in the spirit of e.g. [hinton2012practical]. Another practical approach [domhan2015speeding] is to train one simpler ml routine to detect the most optimal values for the hyperparameters of the ml model which tries to learn the target data. At a more formal level, there exists mainly the widely used method of  – regularization, where a ml model consisting of bigger hidden layer(s) is implemented that imposes a penalty for using a growing number of hidden resources [doi:10.1162/neco.1995.7.2.219]. Despite its practical applicability,  –  and  – regularization still requires a certain amount of fine-tuning to control the severity of penalization for using additional hidden nodes and to adjust the consequent interference with training.

In this paper, we aim at trying to eliminate the hyperparameter related to the number of hidden units altogether, in a dynamic fashion, i.e. as a solution to the extremization problem constituting the training procedure. To this end, we concentrate on energy-based generative models which are trained to reproduce a target distribution by assigning a higher probability and lower energy to physically occurring configurations (see 

[Goodfellow-et-al-2016] for a pedagogical introduction). Specifically, the familiar restricted Boltzmann machine (rbm), originally formulated in [hinton1986learning] in the canonical ensemble, is reviewed and extended within the grand-canonical ensemble of statistical systems. Most naturally, this grand-canonical extension to accommodate a varying number of hidden units can be thought of as encompassing (theoretically infinitely) many restricted Boltzmann machines with hidden layers of all possible lengths. This concept is schematically presented in Figure 1. Notice that rbm of various sizes that are used to model the target data share hidden units.

In the language of statistical mechanics, the theory is at finite chemical potential , which now controls the strength of regularization, i.e. the most optimal size(s) of the hidden layer to be used. In principle, the ml model examined as a statistical system on its own right exhibits different phases depending on the value of . By an appropriate choice of the form of the chemical potential though, as a function of the other parameters that already exist in the Boltzmann machine (i.e. weights and biases), its value can be dynamically determined during training to favour networks of smaller sizes. In other words, the solution to the extremization problem posed during training in the grand-canonical formulation automatically ensures that our ml model learns the target distribution by promoting networks of the smallest possible hidden layer, avoiding thus overlearning.

Figure 1: The concept of the grand-canonical extension to the rbm. Hidden units are ordered from left to right into layers of different lengths (dashed rectangles). Each hidden layer is proportionally penalized according to by a chemical potential which is dynamically determined during training. Provided an observable distribution hidden layers of the appropriate size are invoked to model the target data.

In practice, we have to impose a cutoff to the maximal size of hidden layers that the grand-canonical model could use. For a sufficiently high cutoff though, not only we anticipate that the theory effectively becomes independent of the concrete cutoff implemented, but furthermore that most designer choices concerning the precise functional form of the chemical potential converge to the same regularizing effect.

The method of regularization presented in this paper fundamentally differs from the familiar  – regularization w.r.t. two aspects. On the one hand, once the chemical potential is appropriately chosen as a function of the rest of the rbm

parameters, there should remain no adjustable parameter – discrete or continuous – related to the strength of regularization. On the other side, this regularization scheme naturally treats target data in a local fashion. This means that networks with a different number of hidden units will be invoked for different subsets of the target data depending on concrete features of each subset. In the language of mathematical optimization, training an

rbm under standard  – regularization poses a hardly constrained problem, while training the suggested grand-canonical extension with an appropriately chosen chemical potential results into a softly constrained problem.

Overview of the paper

This paper is structured into a theoretical (Section 2) and applied (Section 3) part. Specifically, we review in Section 2.1 the necessary theoretical framework of grand-canonical Boltzmann distributions and lay out the model we wish to investigate. Subsequently, we set up in Section 2.2 the stage for training, by revisiting the minimization of the cross entropy between target and model distribution, while explaining necessary modifications at the level of a numerical solution. Next, we discuss in Section 2.3 working assumptions and justify our concrete choice of the chemical potential as a function of the weights and biases that penalizes larger hidden layers. Ultimately, we apply the developed techniques to two well-known data sets: two-dimensional Ising configurations and the mnist set of handwritten digits, in Sections 3.1 and 3.2, respectively. We discuss and compare the learning outcome among the two paradigms as well as to the standard rbm (without regularization).

2 Varying number of hidden units

A restricted (i.e. in absence of intra-layer interactions) Boltzmann machine consists of an observable layer with units and a hidden layer with units . In this picture, the observed interactions among the units v are modelled via their connection to the hidden (or latent) units . Generically, Boltzmann machines are characterized by the weights (also called connections) among observable and hidden layer together with the hidden and observable biases, and respectively. In the following, we collectively denote the trainable rbm parameters by . In contrast to those parameters which are expected to be fixed during training by extremizing the appropriate information-theoretic metric, the number of hidden units is a so-called hyperparameter which needs to be fine-tuned beforehand.

In what follows, the question we are going to answer is how to eliminate this hyperparameter from training or in other words, leaving the size of the hidden layer unconstrained if and how the machine can “select” by itself optimum values for to explain the provided data. For brevity, we shall refer to a restricted Boltzmann machine invoking a varying number of hidden units as vrbm.

2.1 Boltzmann machine at finite chemical potential

Henceforth, we focus for concreteness on a binary domain where and Bernoulli Boltzmann machines also with ; the generalization of our discussion to Gaussian or other multimodal models being straight-forward. As we are interested to work in this paper with varying number of hidden units , we need to switch to the grand-canonical ensemble. In this picture, the proper energy-based model at finite chemical potential is given by the grand-canonical Boltzmann distribution


being the (generally intractable) partition function. Indices are contracted in Euclidean space and summations are explicitly recorded, for clarity. We work in units where . Including terms up to quadratic order in the units and helps efficiently solve the extremization problem posed during training, as outlined in Section 2.2. Setting in Eq. (2.1) recovers the ordinary rbm dictated by the canonical Boltzmann distribution.

Introducing the shorthand notations


it is useful to define the free energies in the grand-canonical ensemble for a hidden layer of length via


by integrating out hidden and observable variables, respectively. -independent terms like can be dropped. At most, we can absorb an irrelevant for our purposes (since it leads to a -suppression irrespective of the training procedure) term like by redefining the chemical potential. Integrating out the hidden variables and summing over all possible lengths of the hidden layer leads to the observable free energy of the vrbm,


in terms of which the partition function of the model compactly reads


Evidently, the associated probabilities we are going to use in the following section are generically given by


At this stage, the partition function of the vrbm depends on the ordinary rbm parameters as well as the chemical potential . This model can be viewed as a collection of ordinary Boltzmann machines with a hidden layer of different lengths . Hence, rbm with different number of hidden units contribute to grand-canonical expectation values weighted by the introduced chemical-potential term. From the point of view of statistical systems, our vrbm is expected to exhibit different phases: For smaller hidden layers are favoured prohibiting learning the desired distribution, while in the opposite limit networks with larger number of hidden units prevail and overfitting occurs.

Working assumptions.

For the intended ml application, we can only sum over a finite number of sizes of the hidden layer. We thus have to use the free energy


whose limiting case is Eq. (2.5). This model has one more parameter than the partition function Eq. (2.6) of the idealized vrbm: the maximum number of hidden units that the model has at its disposal. In the spirit of eliminating hyperparameters and assuming that only a finite number of hidden units is needed to explain the target data, we take the number of available hidden units to be very large until the model “thinks” it always has a sufficient number of resources. Formally, we shall work in the regime of large . To quantify this, needs to be sufficiently larger than the natural scale of the problem set by the number of observable units , so that effects induced by the finite amount of hidden resources are suppressed by powers of . In the following, we shall always work in the approximation . In practice, as we are going to see in Section 3 this can be relaxed to .

Even under the large- regime, the model described by partition function Eq. (2.6) still seems to have annoyingly many adjustable parameters, for sure not less than its canonical counterpart. Most crucially, we are not interested to merely swap the discrete number of hidden units with a continuous chemical potential , but to eliminate the hyper-parameter determining the size of the hidden layer from the training altogether. Hence, we shall take the chemical potential to be some function of the other rbm parameters, . Generically, there are a few desired properties this function is expected to possess. First of all, at the formal level, the chemical potential being an intensive variable will be treated as a global model parameter, i.e. it will not exhibit any explicit or implicit dependance on the hidden-layer number . At the practical level, dropping this constraint222For such a treatment of vrbm with a local -dependent chemical potential see [2015arXiv150202476C, nalisnick2017infinite, peng2018better]. leads to certain instabilities in the learning process, at least for the systems tested in this work. In addition, to ensure that has a regularizing effect at all, it should be independent from the sign of the other parameters, weights and biases . Given those formal requirements one is entitled to test which ansatz for best suits the training on a particular target system.

For the purposes of this paper, we adopt a full agnostic approach concerning the target system and refrain from imposing any designer biases. Specifically, as the biases solely characterize observable units, the chemical potential – being a feature that is intimately related to the length of the hidden layer – is not expected to explicitly depend upon those. In the same spirit, we further assume that is unbiased towards observable units, i.e. it does not exhibit any implicit dependance on them, as well. Thus, our chemical potential is at most a function of the row p-norm of the weight matrix and hidden biases,


In what follows, we take all model quantities to ultimately depend only on the trainable parameters that are determined by appropriate training, as outlined in the next paragraph.

The extremization condition.

A generative model is primarily used to learn to reproduce a given data distribution by extracting its characteristic features. In the following, we always take the domain of to coincide with the domain of both the observable and hidden units of our generative model. Training our Boltzmann machine on a given target probability is performed by extremizing w.r.t. model parameters (recall that we take ) an appropriately chosen target function. As a target function we conventionally take the opposite of the cross entropy between model and target distribution, i.e. the expectation value under target distribution of the logarithm of probability given in Eq. (2.7):


It is straight-forward to show that maximizing

is equivalent to minimizing the Kullback-Leibler divergence which gives the relative entropy between target

and model distribution.

First, note that deriving free energy Eq. (2.5) w.r.t. some trainable parameter we get


where the conditional probability associated to Eq. (2.3) reads


Next, we can express the extremization condition on in terms of a derivative of the free energy Eq. (2.3) at level :


Using the concrete form (2.1) of grand-canonical Boltzmann distribution the extremization condition (2.13) results into a set of equations,


taking respectively. Notice in particular, the appearance of Heaviside step function with for , as a consequence of varying number of hidden units as well as the additional term in the first two equations due to the derivative of chemical potential Eq. (2.9).

In the literature, such extremization conditions are often compactly written as


where expectation values are understood to be taken w.r.t. the distribution of the provided training data

and the probability distribution

generated by our model. Due to our inability to generically evaluate the partition function the derived set of conditions cannot be solved in any closed form.

2.2 Contrastive Divergence revisited.

To circumvent this we shall make a gradient-descend inspired approximation. Employing contrastive divergence (cd) has proven to be a particularly efficient way to numerically find the maximum, Eq. (2.13), by updating our model parameters in the direction of steepest descend according to


until convergence is achieved. The learning rate is a tuneable parameter of the extremization process itself. Too large values of could drive away from the desired extremum, while too small learning rates slow down the training process. The cd derivative at -th step is defined by


where with . The momentum acts as a “memory” of previous updates to ensure stability of the gradient-descent algorithm.

The mean-field ansatz.

For each cd update in Eq. (2.16) we need to compute the expectation values in Eq. (2.1) that appear when taking the derivative of the target function w.r.t. model parameters , and . For that, we use mean-field theory to iteratively compute the vev of , and using consistency equations and substituting the one into the other. In our vrbm setting though, a slight modification of the mean-field theoretic consistency conditions which are invoked by the cd method in the standard rbm construction is required. Schematically, starting from the distribution v of observable units the length of hidden layer and subsequently the hidden units

are to be inferred which in turn are about to give a new estimation for


In detail, provided the distribution of observable units it is straight-forward to compute the multimodal conditional probability given in Eq. (2.12). As in ordinary Boltzmann machines, we calculate the expectation value of the hidden units provided a sample v from the observable distribution from the (grand-canonical in our case) free energy Eq. (2.5) by


using Eq. (2.11

). The complementary cumulative distribution function (ccdf) or survival function of the probability



ensures that only layers including the -th hidden unit contribute to its conditional expectation value, cf. Eq. (2.1). Next, we need to determine some optimum value for the size of the hidden layer being sufficient to extract the features from the provided observable distribution. In principle, the number of hidden units has to be determined by sampling from . In practice, it is computationally cheaper while converging faster to either compute the vev of provided v,


and round it upwards to the nearest integer, or take the largest most probable value for given its conditional distribution,


After sufficient number of training epochs and at large

, all methods effectively lead to the same outcome. Ultimately, given the derived value of together with we are in a position to deduce a new estimation for . Such an estimation has to be extracted from a Boltzmann machine with hidden units taking values via


where the necessary conditional probability is deduced from Eq. (2.4),


Hence, hidden units contribute to the feature extraction from the desired data set leading in turn, to the generation of a new estimation for . As our estimation for , either Eq. (2.20) or Eq. (2.21), depends on the initial configuration v that we sample from input distribution , we observe that different configurations from would generically be explained by a different number of hidden units. In other words, the vrbm model has the freedom to adjust the size of its hidden layer depending on the complexity level of each subset in . This observation constitutes one of the fundamental departures from standard  – regularization [bishop2006pattern] (the other main difference being the absence of a continuous parameter conventionally controlling the strength of ordinary regularization schemes). When regularization is globally applied to the input set the local features of each example configuration v are detected by the same fixed number of hidden units encompassing the danger of over-learning for some of the subsets in .

In total, the -th iteration in mean-field theory for reads


under the initial configuration described by . For well-defined extrema, mean-field theory is formally expected to converge towards , the very latest as . In practice, the mean-field ansatz converges for physical data beyond numerical accuracy already when .

Summary of the numerical method.

All in all, putting everything together the -th step of the cd method in the vrbm dictated by Eq. (2.16) under a mean-field theoretic ansatz after steps in Eq. (2.24) consists of


Kronecker delta is used to raise and lower indices. Compared to the standard rbm its grand-canonical extension adds a complexity in computing conditional probability Eq. (2.12) which grows linearly in the maximum number of hidden units the model has at its disposal, cf. Eq. (2.8). Apart from computing this multimodal distribution, the grand-canonical cd algorithm shares everything with its canonical counterpart. Setting in the equations (2.2) and taking in Eq. (2.18) – (2.23) reproduces the ordinary cd algorithm of the rbm in the canonical ensemble with fixed number of hidden units.

2.3 A penalizing chemical potential

So far, we have avoided to concretely specify the form of the chemical potential, besides a more generic discussion towards the end of Section 2.1. In particular, we have explained our motivation in taking and discussed about the anticipated functional form of concluding to Eq. (2.9) in order to regulate the length of the hidden layer. Of course, being an intensive quantity could well implicitly or explicitly depend on global non-trainable parameters like and , but not on itself.

Given that the additional term in the grand-canonical ensemble is expected to naturally act as a regularizer for the number of hidden units and by naive dimensional analysis, the chemical potential should have the form of a non-negative energy density:


where the number of active hidden units equals the number of participating in explaining the target data, i.e. have non-vanishing weights for at least one . Conceptually, our ansatz (2.26) for the chemical potential describes some notion of total non-negative – to achieve a regularizing effect – vacuum energy (which could well be infinite when ) that is equally partitioned over each hidden unit actively participating in extracting the features of the target data. That way, we uniformly define a penalty the model needs to pay for using additional hidden units. Networks with larger hidden layers are then proportionally penalized to their length by a factor that equals the aforementioned “vacuum-energy” density.

Even under the working assumptions of paragraph 2.1, there is still a certain freedom in specifying the form of the total “vacuum energy” in Eq. (2.26). Inspired mainly from the regularization procedure performed [hinton2012practical, wang2017regularization] in ordinary rbm it seems plausible to take as a definition of the “vacuum energy” entering the chemical potential,


introducing the fundamental “vacuum-energy” quantum


characterizing each hidden unit. Notice that depends on the -th row  – norm of the weight matrix and the respective hidden bias. In a similar spirit, the matrix norm of appearing in definition (2.27) avoids placing any bias over some specific observable or hidden unit that could falsify a fair learning of the necessary connections .

Concerning the denominator of Eq. (2.26) that should count the number of hidden units actively used by the vrbm to model the given data it is straight-forward to approximate it by a smooth (since asymptotically when ) counter,


In fact, to precisely count the number of active units we would have to replace by ReLu function. However, the infinitely steep derivative of the latter function when its argument becomes zero renders the numerical solution presented in Section 2.2 inadequate, as the cd method gets stack either to the regime or depending on initial conditions. Hence in total, a candidate chemical potential depending only on the row-norm of and hidden biases which is unbiased towards any hidden or observable unit reads


Alternatively, depending on the input data to ensure numerical stability one could further smoothen this definition by taking the of Eq. (2.27).

Figure 2: In the self-regularizing rbm the interactions among observable units with biases are modelled by connecting via weights to hidden units . To explain an observable configuration some of the hidden units might not be used in spite of having weights (dashed lines) which have been trained on the full data set. Additionally, there are hidden units which do not participate in feature extraction, at all, while their biases still contribute to the regularization procedure.

At this point, a few remarks are in order: Evidently, our energy quantum is solely controlled by the respective bias, when the hidden unit is inactive in the sense of for all . Consequently, the model has the ability to recursively adjust and by merely administrating the hidden biases (and hence ) of the latter inactive units without interfering much with the actual training of the first units . To control the aforementioned interference between regularization and learning those first units the -th row -norm of the weight-matrix entering definition (2.28) was normalized by the observable scale . For finite values of , this seems to be the most sensible normalization in order to avoid naively introducing any hierarchy in among the two summands in Eq. (2.28). At large (i.e. ) however, we anticipate (and indeed find) that any generic number in front of leads to the same outcome after sufficient number of epochs. The same observation holds for any generic rescaling of the hidden biases entering Eq. (2.28). Despite that such redefinitions of the chemical potential could delay the convergence of the training algorithm, they do not crucially change the regularizing effect of Eq. (2.30) which is determined by our conceptual choice of the functional form Eq. (2.26).

In Figure 2, we schematically draw how the vrbm is expected to model interactions among five observable units by using their connection (continuous lines) to hidden units via the trained weights . Not all weights that the vrbm has learned during training are expected to participate in the feature detection of each example from the target distribution . In our schematic depiction, one hidden unit remains inactive (dashed connecting lines) for the given observable configuration v as dictated by . The size of the biggest hidden network (wrapped by a dashed rectangular) encompassing all units connecting to the observable layer determines . Finally, there should exist hidden units for whose weights are effectively regularized to zero. In fact, weights with quickly get exponentially suppressed by the chemical-potential term in beyond any meaningful numerical accuracy. The role of those hidden units is to ensure the regularizing effect of the vrbm through their biases , as we are going to explicitly verify in the following two sections.

To summarize, the grand-canonical extension of the rbm is expected to dynamically determine during training the most optimal value of a penalizing chemical potential in order to extract the typical characteristics of the target distribution at a satisfactory level while avoiding over-learning. To do so, it employs –  provided a sufficiently large number of hidden units – hidden layers of different lengths depending on the input configuration v with a probability .

3 Training Boltzmann machines at finite chemical potential

In this section, we aim at applying the grand-canonical extension of the rbm developed in the previous paragraphs to learn target distributions that act as a reference in their respective fields. For this purpose, two error functions which measure the deviation of the ml-generated data from the actual data set come in handy. So far, the prediction of our model derived in Section 2.2 is an expectation value of the mean-field theoretic ansatz Eq. (2.24) for each observable unit. A priori, is not expected to belong to the domain of , though. As those error measures are concerned with the crucial ability of the generative model to learn and reproduce the target data, one should stochastically replace333In information-theoretic context and the computer-scientific literature [MacKay2003], this procedure appears as Gibbs sampling. the expectation value with the actual value as sampled from distribution (2.23). For binary distributions, given the expectation value that is computationally more efficient to obtain, this replacement is simply done with a probability .

Loss functions.

In most applications of interest, training the ml model on the full distribution , which is usually not fully known or very expensive to compute, is not a feasible task. In practice, we train our generative model on a smaller number of selected points sampled from . For ease of notation, we summarize the training subset via


in terms of one-dimensional delta functions (or Kronecker delta for discrete distributions). To express the ability of the generative model to accurately learn to reproduce

we introduce the quadratic reconstruction error on training data (also called train loss function) by


For the expectation value we substitute the prediction of our model after steps in mean-field theory according to Eq. (2.24) to closely follow the convergence of our algorithms. To benchmark the quality of learning, i.e. to which extend our generative model has correctly identified important features of the target distribution to faithfully reconstruct new – unseen during training – data points from , summarized by


we invoke the quadratic reconstruction error on test data (or test loss function)


Also here, we substitute our estimation of the expectation value deduced by iteratively applying Eq. (2.24) during training to monitor the learning quality.

In the same spirit, one could define absolute learning errors by taking the absolute value of the difference between target data and mean-field theoretic outcome. This makes sense especially for continuous distributions (where the quadratic loss function could underestimate the learning error in the domain

) or when outliers in the training data being overweighted by the quadratic loss erroneously lead to big training but still reasonable test errors. Evidently,

means that our ml algorithm is able to accurately reproduce the provided points from target distribution , while signifies the ability of our model to generalize into unseen data after correctly extracting the characteristic traits of during training.

At first sight, the outlined way to use these loss functions seems to depart from their general objective, as described in the beginning, which is to judge the actual prediction on the domain of the generative model (i.e. the integer value for discrete distributions). However, after a sufficient number of epochs for the binary systems under consideration, we observe that indeed converges towards the domain values . Consequently the probability to find the -th unit with value sharply peaks at 0 or 1 depending on the expectation value . In turn, this signals that effectively coincides with the sampled value within the domain of our model. For this reason, after a desired accuracy has been achieved, it is allowed to simply round to the nearest integer. As we are primarily interested in this work in benchmarking the convergence rate and learning quality of the formally developed grand-canonical extension to the rbm, we shall not further discuss sampling options and evaluate the loss functions on the mean-field theoretic ansatz.

Computational simplications.

In developing the theoretical framework for the vrbm we have been appealing to the large order of hidden units, , to render certain designer choices in the form of the chemical potential equivalent. The regime of large is a key feature to ensure the desired elimination of the length of the hidden layer as a hyperparameter. One might wonder whether such a regime is in practice feasible, beyond a mere theoretical playground. On the one hand, it turns out that does not need to be that large for the vrbm to work in a satisfactory self-regularizing manner. At least for the systems we have considered, finite- effects appear way beyond , for very reasonable values of , not influencing thus the quality of learning. On the other hand, larger networks become quickly – already from the first cd steps – exponentially suppressed in , thus setting hidden expectation values for larger effectively to zero (see Eq. (2.18)). Hence, one needs to calculate the cd algorithm (2.2) to update the vrbm parameters only for hidden units. This computational simplification considerably speeds up the training while allowing us to take the maximum number of hidden units even larger and verify the independence of the learning procedure from (to the desired level of accuracy).

3.1 The Ising model

First, we choose to train our vrbm on a system from statistical physics where the target distribution is extracted by sampling spin configurations on a lattice at certain temperatures

. Depending on the number of space-time dimensions and the amount of relevant symmetries the physical system can be under a (partial at least) analytic control. In the physics literature, a paradigm statistical system is the Ising model with nearest neighbour interactions. The first non-trivial behaviour of the Ising theory that exhibits a phase transition at a critical temperature

from a ferromagnetic to a paramagnetic phase in infinite volume manifests [onsager1944crystal] in two space-time dimensions. In absence of external magnetic fields, the partition function of this system up to two space-time dimensions can be calculated exactly [baxter2016exactly].

For concreteness, we take the Ising theory to live on a square lattice of length described by a spin matrix


with periodic boundary conditions and . The nearest-neighbour Hamiltonian reads


where the sum at each lattice point is taken over the four – in case of square lattices – directly neighbouring sites. is the Ising coupling whose sign dictates the (anti)ferromagnetic structure of spin configurations. In the thermodynamic limit , boundary effects become negligible and universality ensures the manifestation of the transition from the ordered to a disordered phase independently of the microscopic details. In practice, we observe most important features (like a peak in heat capacity at ) signalling the aforementioned phase transition already for . The aim of this section is to train our vrbm on this statistical system to test whether our self-regularizing ml algorithm can distinguish the physical interactions (leading to cluster formations as depicted in Figure 3) from mere thermodynamic fluctuations (cf. noisy high-temperature data in Figure 3).

Figure 3: Ising configurations on an lattice sampled via mcmc simulations. Spins at neighbouring sites try to align along the same direction. The order parameter, the net magnetization, which is given as the sum of all spins is non-vanishing in the ordered phase below the critical temperature and (abruptly in infinite volume) vanishes when the system enters the disordered phase at higher temperatures.

To extract the relevant physics of the Ising system the vrbm should be exposed to spin configurations sampled at various temperatures below and above the critical region. As noticed in [Iso:2018yqu] (a behaviour also verified for our vrbm), even without training an rbm in the vicinity of

, but only below and above, the generative model is still able to capture the physics signalling a phase transition. Via Markov chain Monte Carlo (

mcmc) simulations we thus produce a large number of spin configurations at various temperatures which we split444This way of splitting is to facilitate comparison with the handwritten dataset used for training the vrbm in the following paragraph. In fact, already training samples suffice so that the vrbm extracts the physics of the target system at a satisfactory level. into a training and test set of and samples, respectively. Our target distribution thus extends over the simulated Ising domain with .

At this stage, revealing essentially the final outcome is in order. In [Iso:2018yqu, 2017arXiv170804622M] the learning capacity of the ordinary rbm on the Ising model has been extensively studied showing that there exist three distinct learning phases depending on the relation between the number of hidden units and the total number of spins . For the rbm does not have enough resources to fully learn the target distribution from the Ising system. In this regime with less hidden than observable units, an appropriately trained generative model has still learned important features of the underlying Ising theory. In particular, the rbm flow seems to trigger a flow of spins very reminiscent of the Renormalization Group [Funai:2018esm]. We plan to come back to this tantalizing finding in conjunction with varying number of hidden units and different training metrics in a later work. When the rbm fully learns to reproduce the Ising theory of nearest-neighbour interactions at any temperature. Finally, over-learning starts to occur for and the rbm increasingly learns irrelevant noice of thermodynamic fluctuations with increasing number of hidden units.

Specifics of the training.

For our purposes, we have trained various generative models at a chemical potential of slightly different designs which fall within the class described in Section 2.3. Since such designer choices do not influence the final outcome for sufficiently large number of available hidden units, we concentrate here, for concreteness and clarity of presentation, on the ansatz (2.9) with . Our mean-field theoretic ansatz (2.24) to train our model by numerically solving the extremization problem posed in paragraph 2.2 converges to a satisfactory level already after . Once exposed to various temperatures, any self-regularizing generative model should be able to track down the clearly defined regime of optimal training, , detecting the physical interactions of nearest-neighbouring spins.

Figure 4: The train (3.32) and test (3.34) error as well as the average of the expectation value of the number of hidden units (3.37) are plotted while training the vrbm with for epochs.

Indeed, as seen in the upper Plot 4, our vrbm with hidden units appropriately learns after a couple of epochs the two-dimensional Ising system. The exceedingly small training and test quadratic error, as calculated without rounding by Eq. (3.32) and (3.34), respectively, effectively coincide throughout most of the training. To comprehend this order of magnitude, note that an ordinary rbm with (fixed) number of hidden units already of order leads for the same Ising configurations of total lattice size to a considerable test error . In fact, one does not even need to take that large to observe this self-regularizing character. A vrbm with only, effectively demonstrates the same learning curve as in Figure 4 once trained over the same data set. Sampling from those trained vrbm leads to perfect reconstructions of the original Ising data. The chemical potential Eq. (2.26) which is dynamically determined throughout training remains an order one quantity; specifically for the Ising model we get .

To understand what the vrbm has actually learned and how it did that, we plot in the lower part of Figure 4 the average of the expectation for the number of hidden units (see Eq. (2.20)) conditioned on the Isisng data,


over the course of learning epochs, together with the learning error. We clearly see that the model starts learning when becomes . Along similar lines, Figure 5 depicts the conditional probability defined in Eq. (2.12), averaged over the dataset,


for different number of available resources, , ranging from towards the formally desired regime . On the one hand, it is clear that networks of hidden size between and receive a significant probability to participate into feature extraction. Thus, the vrbm still slightly over-learns due to finite –  effects, with a test error though, which is essentially zero for any practical purpose. On the other hand, we observe that the curve of becomes narrower around the critical value with increasing

. This is nothing but a manifestation of “the law of large numbers”: at the theoretical limit

we anticipate the vrbm to precisely pick as the most optimal size of the hidden layer to learn the provided Ising configurations.

Figure 5: The average of the conditional probability Eq. (3.38) over the Ising data from the square lattice of length is plotted for different lengths of the hidden layer in the model with maximally hidden units. The probability clearly exhibits a peak around . For clarity, the plot concentrates on the region around given that networks of sizes away from it quickly get exponentially suppressed.

To further comprehend the behaviour of the grand-canonical generalization of rbm under training, we examine the value of the trained parameters from the perspective of the hidden units. The meaningful quantities to look at are the hidden biases and the average


entering the chemical potential via Eq. (2.28). For the first hidden units of the trained vrbm with these are plotted in Figure 6. As theoretically anticipated, we recognize that the vrbm has effectively set to zero all weights for in accordance with its self-regularizing character. In the language of paragraph 2.3 thus, . The latter hidden units not depicted in Figure 6 follow an evident pattern for and decouple from the feature detector (see schematic depiction in Figure 2). Incidentally, due to the aforementioned regularizing character also at smaller , the depicted profile in Figure 6 looks effectively the same also when and even . Most interestingly, the value of the first hidden biases is smaller than the corresponding weight scale set by Eq. (3.39). Thus, these do not participate much in the modelling of Ising interactions performed by the first weights (in the sense of Figure 7). The model then uses the remnant biases to adjust the value of the chemical potential and its regularizing effect without crucially interfering with the actual feature detection.

Figure 6: The scatter diagram depicts the row-average over the absolute value of weights defined in Eq. (3.39) together with the absolute value of hidden biases for the first after training a vrbm with for epochs.

Another quantity of interest that is meaningful to examine for energy-based generative models is the free energy defined in Eq. (2.5). An estimation of its value is given by the grand-canonical expectation value of the free energy introduced in Eq. (2.3),


under the conditional probability Eq. (2.12) of our model. For binary systems it is straight-forward [2018arXiv180308823M] to expand in powers of and resum using that . In [Cossu:2018pxj] the leading terms in the spin expansion were computed,


in terms of the spin current and the correlation matrix of a network with hidden units. For reasonably small values of the trained parameters the two scale as


and , since gives an irrelevant constant energy shift.

Figure 7: The heat map chart represents the interaction matrix defined in Eq. (3.42), normalized over its largest absolute value, for two networks with hidden sizes using the parameters of the vrbm with trained over epochs on the Ising data from the lattice.

Already in this crude approximation the current evaluated on the trained parameters appears subleading to the spin-spin interaction term (cf. also Figure 6), as anticipated given that the Ising data was originally sampled at zero external magnetic field. In Figure 7, we draw the correlation matrix for different hidden-layer sizes . (Note that the heat map chart of for will not look different from , as for .) After our preceding discussion, there is no surprise that networks with cannot satisfactory learn the input data. Their contribution to the approximation (3.40) to grand-canonical free energy is exponentially suppressed by as seen in Figure 5. In contrast, approaching the nearest-neighbour structure of the Ising data becomes apparent, at least to quadratic order in the spin expansion. The latter are precisely the networks which get significantly selected by Eq. (3.40) to participate in forming our estimation of the free energy of the vrbm model. In a future work, we plan to come back to the intriguing question of the order-by-order equivalence among the energies from the trained vrbm and the Ising model by formal resummation of the appropriate free energy Eq. (2.5) in the spirit of Eq. (3.41).

3.2 The dataset of handwritten digits

Figure 8: Examples from the mnist data set downsampled to

pixel size and binarized.

As a first step in benchmarking ai models it is customary to draw samples from the nist database of “Handprinted Forms and Characters”. By now, a typical list of examples used throughout ml literature is the mnist data set [726791] of handwritten digits from 0 to 9. It consists of training and test preprocessed images given in rgb format with an original pixel resolution. Let such a sample image be described by an integer matrix of rgb pixel intensities with .

For our demonstrative purposes, it suffices to consider a reduced version of the mnist data by downsampling to a image version. This can be done via so called max or mean pooling [masci2011stacked, sohn2012learning]. Concretely, each block of pixels in the original pixel matrix is replaced by either their average or their maximum leading for example to


and similarly for their average. In our case, . To smoothen the resulting image

and to reduce its size while preserving important information for the feature detector it is possible to apply additional space-convolution filters (adding a padding frame where necessary) like


which captures an average pixel intensity in overlapping blocks. Throughout the down-sampling and convolutional process we always make sure that the pixel center of mass, defined by


coincides with the geometrical center of the image located at in order to filter out translational symmetry. Furthermore, to make contact with the preceding paragraph on the Ising model we turn to black and white images via binarizing all pixels simply by rounding each normalized pixel to its nearest integer (i.e. 0 or 1). Hence, our training input is given by with


where we again arrange for consistent with the conventions introduced in Section 2.1. An example of the target images we are going to use in the following paragraph to train our vrbm is given in Figure 8.

Specifics of the training.

Figure 9: The train (3.32) and test (3.34) error as well as the average of the expectation value of the number of hidden units (3.37) are plotted while training the vrbm with for epochs.

Similar to the Ising model, we train vrbm of different designs ( in the chemical potential Eq. (2.26)) on the downsampled mnist data of observable size and recover the law of large numbers. With increasing number of available hidden resources the outcome stabilizes and effectively becomes independent of . The train and test loss errors, Eq. (3.32) and (3.34), quickly become and , respectively, signalling a very good convergence and a small over-learning. As a reference, an ordinary rbm with trained on the exact same data without any form of regularization has a test error of . In Figure 9, the loss errors are plotted over the learning epochs as well as the average of the expectation value for the size of hidden layer Eq. (3.37). Again, our ml model starts learning when networks of a given size around become more and more favourable. The chemical potential is dynamically determined during training to .

Figure 10: Continuous lines in the upper plot depict the average of conditional probability Eq. (2.12) over the mnist data for each handwritten digit 0,…,9 (the color pattern applies also to the scatter plot) of a trained vrbm with hidden units. Networks with bigger are exponentially suppressed and are not presented in the graph for clarity, . Dashed lines depict the corresponding ccdf in Eq. (2.19). The lower scatter diagram presents the expectation value of hidden units as deduced from Eq. (2.18) digit-wise conditioned on the input data.

To obtain a better feeling of what the vrbm has learned from the dataset of handwritten digits it is sensible to look at the average of conditional probability Eq (3.38) over each digit from mnist, separately. From the upper part of Figure 10 it becomes clear that for all digits the most probable size of the hidden layer coincides with the (rounded) average . Still, for an “easy” digit like zero there appears a lower pump already before approaches the region of . Such a profile for is typical for a system coming from everyday life and is comparatively richer to the Ising paradigm of the previous section, where all data came from the same microscopic Hamiltonian Eq. (3.36).

To avoid any confusion, the scatter diagram in the lower part of Figure 10 emphasizes that the probability according to which the size of the hidden layer gets selected is a different concept from the actual feature detection happening at the level of . Depending on the digit they track, hidden units are more or less likely to get activated with a given sign. Of course, the two concepts are intimately connected via Eq. (2.18): the complementary cumulative distribution associated to modulates the profile of . The hidden units with of larger hidden networks, whose probability to be selected becomes exponentially suppressed, remain deactivated. Stochastically, the equivalent statement would be that those receive an equal probability to be , behaving like free units, as they have decoupled from connected Boltzmann machine in Figure 2.

Figure 11: The scatter diagram depicts the row-average over the absolute value of weights defined in Eq. (3.39) together with the absolute value of hidden biases after training a vrbm with for epochs.

Indeed, the weights for have been effectively regularized to zero, as deduced from Figure 11. Hence, the corresponding hidden units to the far right of the plot decouple in the sense of the schematic depiction in Figure 2. The scatter plot 11 follows the same regularizing concept as the Ising plot 6 and we refer the reader to the corresponding paragraph in Section 3.1 about the Ising model. An obvious difference observed in the two plots for the first hidden units lies in the different character of the systems: Concerning the mnist dataset, hidden biases actively participate in feature detection in contrast to the Ising scenario in absence of external magnetic fields.

4 Conclusions and outlook

In this paper, we have considered (shallow) restricted Boltzmann machines at a finite chemical potential . In principle, such a grand-canonical extension of the rbm performs feature extraction from a target distribution by invoking hidden layers of various length to model interactions among observables units , where in principle . We have concentrated on an intuitive choice of the chemical potential as a function of a “vacuum” energy, which essentially measures the added norms of the weight matrix

and bias vector

per unit actively participating in feature extraction. The appropriately trained vrbm at such chemical potential is able to track down (up to –suppressed effects) the optimal length of the hidden layer to model provided data points. To do so, the vrbm mainly uses the biases of disconnected () hidden units to regulate the number of hidden units which actively participate in feature extraction, manifesting the self-regularizing character of the model.

Incorporating this regularizing character in the form of a chemical potential has many advantages, besides the formal simplicity of Legendre-transforming to the grand-canonical ensemble, which allows us to keep most of the techniques implemented to train ordinary Boltzmann machines intact. By maximizing the expectation value of -probability of the modelled data under the value of is dynamically fixed during training, already from the very first epochs. Thus, the probability to use unnecessary long hidden layers is quickly regularized to zero so that the cd algorithm only needs to update at most the parameters corresponding to the relevant hidden units. As the probability to use a hidden layer of a certain length is conditioned on the obserbable data v, the grand-canonical theory makes sure to always assign enough hidden resources to model a given subset of the target distribution