A central open question in machine learning is the effective handling of unlabeled data [1, 2]. The construction of balanced, representative datasets for supervised machine learning still requires, for the most part, very close and time-consuming human direction, so the development of efficient algorithms that learn from data in an unsupervised fashion is a very active area of research [1, 2]. A general framework for dealing with unlabeled data is the Boltzmann machine paradigm, in which one attempts to learn a probability distribution for the patterns in the data without any previous identification of input and output variables. In its most general setups, however, the training of Boltzmann machines is computationally intractable [2, 3, 4]. In this contribution a relation is established, which to the best of my knowledge was previously unknown, between Mixed Integer Programming (MIP) and the full Boltzmann machine in binary variables. It is hoped that this novel formulation opens the road to more efficient learning algorithms by taking advantage of the great variety of techniques available for MIP.
2 Full Boltzmann machine with data as constraints
Consider a network of units with binary state space. Each unit depends on all the others through a logistic-type response function,

x_i = round( σ( Σ_{j≠i} w_{ij} x_j + b_i ) ),   (1)

where “round” indicates the nearest-integer function, σ is the logistic function, the w_{ij}'s are pairwise interactions between units and the b_i's are shift parameters. As will become clear later, the proposed model supports both supervised and unsupervised learning and leads to a full Boltzmann machine in its classical sense.
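As a minimal sketch of this response function, assuming illustrative names W for the pairwise interaction matrix and b for the shift vector (the symbols are not fixed by the text):

```python
import numpy as np

def unit_update(x, W, b):
    """One synchronous update of all binary units: each unit takes
    the nearest integer of a logistic function of its net input,
    as described for Eq. (1)."""
    z = W @ x + b                            # net input to every unit
    return np.rint(1.0 / (1.0 + np.exp(-z))).astype(int)
```

Note that rounding the logistic is equivalent to a sign test on the net input, except in the measure-zero case of a net input of exactly zero.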
Suppose a data set of visible binary vectors, each with the same number of components, together with a collection of hidden units; the total number of units in the system is the number of visible components plus the number of hidden units. If the connectivity and shift parameters are given, each sample fixes the corresponding binary vector of unit states, and therefore the data set imposes the following constraints:
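The constraint check itself is straightforward to state in code. A sketch, assuming the response of Eq. (1) reduces to a sign test on the net input (W and b are illustrative names, not taken from the text):

```python
import numpy as np

def constraints_satisfied(samples, W, b):
    """Check the data-imposed inequalities (2): under a thresholded
    response, a unit in state 1 requires a positive net input and a
    unit in state 0 requires a negative net input, for every sample."""
    for x in samples:
        z = W @ x + b
        if not np.all(np.where(x == 1, z > 0, z < 0)):
            return False
    return True
```

With fixed unit values the inequalities are linear in the parameters, and with fixed parameters they are linear in the binary unit values, which is the structure the MIP formulation exploits.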
A posterior distribution for the parameters can be constructed by the maximum entropy principle, which gives the least biased distribution that is consistent with a set of constraints [5, 6]. This is done by minimizing the Lagrangian
where the brackets represent averages under the posterior, the parameter vector collects the connectivity and shift parameters, and the Lagrange multipliers are positive. Due to the linearity of the system of inequalities (2), the average of the constraints under the posterior with fixed unit values is simply given by the same set of inequalities with the connectivity and shift coefficients substituted by their averages. The maximum entropy distribution for the parameters is therefore given by
where the prefactor is a normalization constant. So, due to the linearity of the constraints, the posterior is a tractable (i.e., easy-to-sample) product of independent two-parameter exponential distributions. Therefore, a necessary and sufficient condition for the existence of the above distribution is the existence of the parameter averages, which is determined by the satisfaction of the inequalities (2).
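For reference, and with names γ (location) and β (scale) assumed here rather than taken from the text, a two-parameter exponential density has the standard form

```latex
f(\theta \mid \gamma, \beta) \;=\; \frac{1}{\beta}\,
  \exp\!\left(-\,\frac{\theta-\gamma}{\beta}\right),
  \qquad \theta \ge \gamma,\ \beta > 0,
```

with mean γ + β and standard deviation β, which is why the existence of the parameter averages certifies the existence of the whole product distribution.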
The representation of the posterior by its two-parameter exponential form Eq. (5) gives a codification of the training data in terms of a tractable distribution over the parameters that, in conjunction with Eq. (1), is in fact a distribution for new unlabeled binary strings of data. For fully connected topologies, this is what is usually understood as the equilibrium distribution of a full Boltzmann machine.
2.1 Illustrative example 1: Supervised XOR
The theoretical soundness of the proposed approach is now shown through the XOR logical table. Let us first consider a restricted architecture with only two directed arcs, connecting two input units with an output unit, as represented in Figure 2A. The inequalities (2) in this case read,
There are no parameter values that satisfy all of these inequalities simultaneously. This is reflected in the maximum entropy distribution,
which, to be a properly normalized product of two-parameter exponential distributions, must satisfy mutually contradictory sign conditions on its parameters. A valid model is, however, attainable by the addition of a single hidden unit. Consider the architecture represented in Figure 2B. This leads to a two-stage constraint satisfaction problem. The first stage is given by the data evaluated on the visible units,
for which solutions certainly exist; an explicit instance is easy to construct.
The second stage is consequently given by,
for which solutions exist under sign conditions on the parameters involving a positive constant. Therefore, the maximum entropy distribution for the parameters of the model represented in Figure 2B exists. Equivalently, this result shows that the classical XOR supervised learning problem can be solved by the proposed MIP feasibility formulation.
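The two-stage feasibility argument can be checked numerically. The sketch below brute-forces a small integer grid of parameters (the grid range and the thresholded response y = [net input > 0] are assumptions of this sketch, not the paper's algorithm) and confirms that no direct two-input architecture reproduces XOR, while a single hidden unit makes the system feasible:

```python
import itertools

# XOR truth table: (input1, input2, output)
XOR = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

def feasible_direct(grid=range(-2, 3)):
    """Search for weights (w1, w2) and shift b such that the thresholded
    response y = [w1*x1 + w2*x2 + b > 0] reproduces XOR directly
    (the Figure 2A architecture). True if any grid point works."""
    for w1, w2, b in itertools.product(grid, repeat=3):
        if all((w1*x1 + w2*x2 + b > 0) == bool(y) for x1, x2, y in XOR):
            return True
    return False

def feasible_hidden(grid=range(-2, 3)):
    """Same search with one hidden unit (the Figure 2B architecture):
    first compute h from the inputs, then the output from x1, x2, h."""
    for u1, u2, c in itertools.product(grid, repeat=3):
        h = {(x1, x2): int(u1*x1 + u2*x2 + c > 0) for x1, x2, _ in XOR}
        for w1, w2, wh, b in itertools.product(grid, repeat=4):
            if all((w1*x1 + w2*x2 + wh*h[x1, x2] + b > 0) == bool(y)
                   for x1, x2, y in XOR):
                return True
    return False
```

Running both functions reflects the text's conclusion: the direct search fails, since XOR is not linearly separable, while the hidden-unit search succeeds.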
2.2 Illustrative example 2: Unsupervised XOR
A model capable of unsupervised learning is sketched in Figure 2C. The system of inequalities should now be extended to consider inputs to the former input nodes as well,
which has indeed solutions, as discussed in the following section.
3 Sampling from the posterior distribution
The equilibrium posterior distribution of patterns can be sampled by taking an arbitrary solution of the MIP feasibility problem and using it to define the averages of the parameter distributions. The standard deviation of each two-parameter exponential distribution is given by its scale parameter, which can at first instance be assigned some positive value related to the constant introduced above. If u is a uniform random deviate in the interval (0, 1], then an inverse transform of u yields a deviate from the associated two-parameter exponential distribution. In this way, the vector of visible units can be sampled in a computation time that is quadratic in the total number of (visible and hidden) units.
This step of the algorithm above is carried out by starting with an initial random binary vector at each draw. The self-consistent system Eq. (1) is then iterated; only a small number of iterations is needed to achieve convergence.
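Putting the pieces together, a hypothetical sketch of the sampling procedure follows; the function names, the upward orientation of every exponential, and the fixed iteration cap are all assumptions of this sketch rather than details from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_shifted_exponential(mean, scale):
    """Inverse-transform deviate from a two-parameter exponential with
    the given mean and standard deviation: location gamma = mean - scale,
    scale beta = scale, and x = gamma - beta * log(u) for uniform u."""
    u = rng.uniform(1e-12, 1.0)
    return (mean - scale) - scale * np.log(u)

def sample_pattern(W_mean, b_mean, scale, max_iter=20):
    """Draw one binary pattern: sample every parameter around its
    MIP-feasible value, then iterate the thresholded response of
    Eq. (1) from a random binary start until it stops changing."""
    n = len(b_mean)
    W = np.array([[sample_shifted_exponential(m, scale) for m in row]
                  for row in W_mean])
    b = np.array([sample_shifted_exponential(m, scale) for m in b_mean])
    x = rng.integers(0, 2, size=n)
    for _ in range(max_iter):
        x_new = (W @ x + b > 0).astype(int)
        if np.array_equal(x_new, x):
            break
        x = x_new
    return x
```

Each parameter draw and each iteration of Eq. (1) costs a matrix-vector product, which is the source of the quadratic cost in the total number of units mentioned above.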
The sampling procedure, Algorithm 1, is now illustrated through the XOR example. Take an arbitrary solution of the MIP feasibility problem. Due to the rounding operator in Eq. (1), any positive value of the standard deviation can work; in the following experiments a fixed value is used with large sample sizes. Some of the samples drawn for each case are shown, together with the resulting ratios of XOR patterns relative to non-XOR patterns over the entire samples.
In the author’s view, this paper presents a formalism that has the potential not only to yield more efficient learning algorithms but also to improve the understanding of learning from data itself. In particular, datasets explicitly constrain the parameters of the learning model through a set of mixed-binary feasibility inequalities. For fixed binary values, the system is linear and continuous; for fixed model parameters, it is a linear constraint satisfaction problem in binary variables. The author, together with collaborators, is now working on different ways to exploit these structures in order to scale the framework to realistic large-scale unsupervised learning problems. In such problems, a measure proportional to the number of satisfied constraints might be used to guide the learning procedure and to assign sensible values to the hyperparameter.
The author acknowledges partial financial support from UANL and CONACyT.
Conflict of interest
The author declares that he has no conflict of interest.
References

Oliver, A., Odena, A., Raffel, C. A., Cubuk, E. D., & Goodfellow, I., Realistic evaluation of deep semi-supervised learning algorithms. Advances in Neural Information Processing Systems, 3235-3246, (2018).
Goodfellow, I., Bengio, Y., & Courville, A., Deep learning. MIT Press, (2016).
Li, R. Y., Albash, T., & Lidar, D. A., Improved Boltzmann machines with error corrected quantum annealing. arXiv preprint arXiv:1910.01283, (2019).
Jaynes, E. T., Information theory and statistical mechanics. Physical Review, 106(4), 620, (1957).
Shore, J., & Johnson, R., Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Transactions on Information Theory, 26(1), 26-37, (1980).
Kececioglu, D., Reliability engineering handbook (Vol. 1). DEStech Publications, Inc., (2002).