Mixed integer programming formulation of unsupervised learning

01/20/2020 ∙ by Arturo Berrones-Santos

A novel formulation and training procedure for full Boltzmann machines in terms of a mixed binary quadratic feasibility problem is given. As a proof of concept, the theory is analytically and numerically tested on XOR patterns.


1 Introduction

A central open question in machine learning is the effective handling of unlabeled data [1, 2]. The construction of balanced, representative datasets for supervised machine learning still requires, for the most part, very close and time-consuming human direction, so the development of efficient algorithms that learn from data in an unsupervised fashion is a very active area of research [1, 2]. A general framework for dealing with unlabeled data is the Boltzmann machine paradigm, in which one attempts to learn a probability distribution for the patterns in the data without any previous identification of input and output variables. In its most general setups, however, the training of Boltzmann machines is computationally intractable [2, 3, 4]. In this contribution a relation, to the best of my knowledge previously unknown, is established between Mixed Integer Programming (MIP) and the full Boltzmann machine in binary variables. It is hoped that this novel formulation opens the road to more efficient learning algorithms that take advantage of the great variety of techniques available for MIP.

2 Full Boltzmann machine with data as constraints

Consider a network of units with binary state space. Each unit depends on all the others through a logistic-type response function,

\[
x_i \;=\; \mathrm{round}\!\left[\frac{1}{1 + e^{-\left(\sum_{j \neq i} w_{i,j}\, x_j + b_i\right)}}\right], \qquad x_i \in \{0, 1\}, \tag{1}
\]

where "round" indicates the nearest-integer function, the $w_{i,j}$'s are pairwise interactions between units and the $b_i$'s are shift parameters. As will become clear below, the proposed model supports both supervised and unsupervised learning and leads to a full Boltzmann machine in its classical sense.
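A minimal sketch of the response function as reconstructed in Eq. (1); it is not taken from the paper, and the function name, array layout, and use of NumPy are illustrative assumptions.

```python
import numpy as np

def unit_response(x, W, b, i):
    """Eq. (1): rounded logistic response of unit i given the states of the other units.

    x : binary state vector of all units (0/1)
    W : matrix of pairwise interactions w_{i,j}, with zero diagonal (no self-interaction)
    b : vector of shift parameters b_i
    """
    field = W[i] @ x + b[i]
    # Rounding the logistic to the nearest integer is the same as thresholding the
    # field at zero (a logistic value of exactly 0.5 is resolved upward here).
    return int(1.0 / (1.0 + np.exp(-field)) >= 0.5)
```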

Figure 1: Full Boltzmann machine with five units.

Suppose a data set of $D$ visible binary vectors, $\vec{x}_1, \dots, \vec{x}_D$, with $V$ components each, and a collection of $H$ hidden units. The total number of units in the system is $N = V + H$. If the connectivity and shift parameters are given, each sample fixes the binary vector of unit values, and therefore the data set imposes the following constraints:

\[
\sum_{j \neq i} w_{i,j}\, x_{j,d} + b_i \;\ge\; 0 \ \ \text{if } x_{i,d} = 1,
\qquad
\sum_{j \neq i} w_{i,j}\, x_{j,d} + b_i \;<\; 0 \ \ \text{if } x_{i,d} = 0,
\qquad i = 1, \dots, N, \quad d = 1, \dots, D . \tag{2}
\]

The visible components of each $\vec{x}_d$ are fixed by the data, while the hidden components are free binary variables, which makes (2) a mixed binary quadratic feasibility problem in the unknowns.
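As a sketch only, and assuming the reconstructed form of the constraints (2), the following snippet checks whether a complete assignment of unit values together with candidate parameters satisfies every inequality; all names are illustrative and not from the paper.

```python
import numpy as np

def satisfies_constraints(X, W, b):
    """Check the feasibility conditions (2) for every sample and every unit.

    X : (D, N) array of complete binary unit assignments, one row per sample
        (visible components fixed by the data, hidden components chosen freely)
    W : (N, N) interaction matrix w_{i,j} with zero diagonal
    b : (N,) vector of shift parameters
    """
    fields = X @ W.T + b                    # field of every unit in every sample
    on_ok  = (X == 1) & (fields >= 0.0)     # active units need a non-negative field
    off_ok = (X == 0) & (fields < 0.0)      # inactive units need a negative field
    return bool(np.all(on_ok | off_ok))
```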

A posterior distribution for the parameters $w$, $b$ can be constructed by the maximum entropy principle, which gives the least biased distribution that is consistent with a set of constraints [5, 6]. This is done by the minimization of the Lagrangian

(3)

where the brackets $\langle \cdot \rangle$ represent the average under the posterior, $\theta$ is a vector that contains the connectivity and shift parameters, and the $\lambda$'s are positive Lagrange multipliers. Due to the linearity of the system of inequalities (2), the average of the constraints under the posterior with fixed unit values is simply given by the same set of inequalities with the coefficients $w_{i,j}$'s and $b_i$'s substituted by their averages $\langle w_{i,j} \rangle$'s, $\langle b_i \rangle$'s. The maximum entropy distribution for the parameters is therefore given by

(4)

where $Z$ is a normalization factor. Thus, due to the linearity of the constraints, the posterior $P(\theta)$ is a tractable (i.e., easy-to-sample) product of independent two-parameter exponential distributions:

(5)

where each factor has the two-parameter exponential form [7]. Therefore, a necessary and sufficient condition for the existence of the above distribution is the existence of the averages $\langle w_{i,j} \rangle$, $\langle b_i \rangle$, which is determined by the satisfaction of the inequalities (2).
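For reference, a standard parameterization of the two-parameter (shifted) exponential distribution is recalled below; the exact notation of Eq. (5) and of [7] may differ.

\[
p(\theta) \;=\; \frac{1}{\sigma}\, e^{-(\theta - \mu)/\sigma}, \quad \theta \ge \mu,
\qquad \langle \theta \rangle \;=\; \mu + \sigma, \qquad \mathrm{std}(\theta) \;=\; \sigma .
\]

Only the mean and the standard deviation of each factor are needed for the sampling procedure of Section 3.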

The representation of the posterior by its two-parameter exponential form Eq. (5) encodes the training data in terms of a tractable distribution for the parameters which, in conjunction with Eq. (1), in fact defines a distribution over new unlabeled binary strings of data. For fully connected topologies, this is what is usually understood by an equilibrium distribution of a full Boltzmann machine [2].

Figure 2: Three different architectures for the XOR problem.

2.1 Illustrative example 1: Supervised XOR

The theoretical soundness of the proposed approach is now shown through the XOR logical table, $(x_1, x_2, x_3 = x_1 \oplus x_2) \in \{(0,0,0),\, (0,1,1),\, (1,0,1),\, (1,1,0)\}$. Let's consider first a restricted architecture with only two directed arcs that connect two inputs with an output unit, as represented in Figure 2A. The inequalities (2) in this case read,

\[
b_3 < 0, \qquad w_{3,2} + b_3 \ge 0, \qquad w_{3,1} + b_3 \ge 0, \qquad w_{3,1} + w_{3,2} + b_3 < 0 . \tag{6}
\]

There are no values of $w_{3,1}$, $w_{3,2}$ and $b_3$ that satisfy all the inequalities: summing the second and the third gives $w_{3,1} + w_{3,2} + 2 b_3 \ge 0$, which together with the first contradicts the fourth. This is reflected in the maximum entropy distribution,

(7)

which, in order to be a properly normalized product of two-parameter exponential distributions, would have to satisfy mutually contradictory conditions on the parameter averages. A valid model is, however, attainable by the addition of a single hidden unit. Consider the architecture represented in Figure 2B. This leads to a two-stage constraint satisfaction problem. The first stage is given by the data evaluated on the visible units,

(8)

for which solutions certainly exist. Take for instance,

(9)

The second stage is consequently given by,

(10)

for which solutions exist under conditions on the remaining parameters that are expressed in terms of a positive constant. Therefore, the maximum entropy distribution for the parameters of the model represented in Figure 2B exists. Equivalently, this result shows that the classical XOR supervised learning problem can be solved by the proposed MIP feasibility formulation.
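The two-stage argument can also be checked numerically. The sketch below is not the author's code: it assumes the reconstructed form of the constraints (2), a single hidden unit labeled 4 feeding the output unit 3 (a loose reading of Figure 2B), and it replaces the strict inequalities by a small margin. Stage one enumerates candidate hidden values per pattern; stage two asks SciPy's linprog whether continuous parameters satisfying the resulting linear inequalities exist.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

XOR = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]   # (x1, x2, x3 = x1 XOR x2)
EPS = 1e-3                                            # margin standing in for strict "< 0"

def params_feasible(hidden=None):
    """Stage two: with the binary unit values fixed, the constraints (2) are linear in
    the continuous parameters, so feasibility can be checked with an LP solver.
    hidden : one value of the hidden unit per pattern, or None for the net of Fig. 2A.
    Parameters: theta = (w31, w32, w34, b3, w41, w42, b4)."""
    A, ub = [], []
    for k, (x1, x2, x3) in enumerate(XOR):
        h = 0 if hidden is None else hidden[k]
        rows = [([x1, x2, h, 1, 0, 0, 0], x3)]            # output unit 3
        if hidden is not None:
            rows.append(([0, 0, 0, 0, x1, x2, 1], h))     # hidden unit 4
        for row, value in rows:
            if value == 1:                                # field >= 0  ->  -field <= 0
                A.append([-a for a in row]); ub.append(0.0)
            else:                                         # field < 0   ->   field <= -EPS
                A.append(row); ub.append(-EPS)
    res = linprog(c=np.zeros(7), A_ub=np.array(A, float), b_ub=np.array(ub),
                  bounds=[(-10, 10)] * 7, method="highs")
    return res.success

print("Fig. 2A (no hidden unit):", params_feasible())
print("Fig. 2B (one hidden unit):",
      any(params_feasible(h) for h in itertools.product((0, 1), repeat=4)))
```

Under these reconstructed constraints the first call reports infeasibility and the second reports feasibility, in agreement with the argument above.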

2.2 Illustrative example 2: Unsupervised XOR

A model capable of unsupervised learning is sketched in Figure 2C. The system of inequalities should now be extended to consider inputs to the two input nodes as well,

(11)

which has indeed solutions, as discussed in the following section.
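Assuming that the extended system (11) takes the same form as the reconstructed constraints (2), now imposed on every unit of the network rather than only on the output (the detailed connectivity of Figure 2C is not reproduced here), it reads:

\[
\sum_{j \neq i} w_{i,j}\, x_{j,d} + b_i \;\ge\; 0 \ \ \text{if } x_{i,d} = 1,
\qquad
\sum_{j \neq i} w_{i,j}\, x_{j,d} + b_i \;<\; 0 \ \ \text{if } x_{i,d} = 0,
\qquad i = 1, \dots, N, \quad d = 1, \dots, 4,
\]

with the hidden components of each pattern left as free binary variables.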

3 Sampling from the posterior distribution

The equilibrium posterior distribution of patterns can be sampled by taking an arbitrary solution of the MIP feasibility problem and using it to define the averages $\langle w_{i,j} \rangle$, $\langle b_i \rangle$. The standard deviation of each two-parameter exponential distribution is given by its scale parameter, which can at first instance be assigned to some positive value related to the positive constant introduced above. If $u$ is a uniform random deviate in the interval $(0, 1]$, then a deviate from the two-parameter exponential distribution associated to each parameter is obtained by a simple transformation of $u$, as shown below. In this way, the vector of visible units can be sampled in a computation time that is quadratic in the total number of (visible and hidden) units.
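In the parameterization recalled in Section 2 (an assumption about the exact form used in Eq. (5)), the transformation is the standard inverse-CDF one:

\[
u \sim \mathcal{U}(0, 1] \;\;\Longrightarrow\;\; \theta \;=\; \mu - \sigma \ln u \;=\; \langle \theta \rangle - \sigma\,(1 + \ln u),
\]

so that only the average $\langle \theta \rangle$ taken from the MIP solution and the chosen standard deviation $\sigma$ are required.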

1:  Initialize: the parameter averages $\langle w_{i,j} \rangle$, $\langle b_i \rangle$ from a solution of the MIP feasibility problem and a standard deviation $\sigma$ (an arbitrary positive real number).
2:  Assign a value to size (the desired number of samples).
3:  Generate the required uniform and independent random deviates $u$ in the interval $(0, 1]$.
4:  for s = 1 to size do
5:     Draw each parameter $w_{i,j}$, $b_i$ from its two-parameter exponential distribution with mean $\langle w_{i,j} \rangle$ or $\langle b_i \rangle$ and standard deviation $\sigma$.
6:     Generate the binary pattern $\vec{x}_s$ by inserting the sampled parameters in Eq. (1).
7:  end for
Algorithm 1 (Pseudo-code for sampling from the maximum entropy posterior.)

Step 6 of the algorithm above is carried out by starting from an initial random binary vector at each $s$. The self-consistent system Eq. (1) is then iterated; convergence is reached after a small number of iterations.
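A minimal Python sketch of Algorithm 1, under the reconstructions above (response function (1), shifted-exponential posterior factors, inverse-CDF sampling); the names, the synchronous update, and the fixed iteration budget are illustrative choices, not taken from the paper.

```python
import numpy as np

def sample_patterns(W_mean, b_mean, sigma, size, n_iter=20, rng=None):
    """Algorithm 1: draw `size` binary patterns from the maximum entropy posterior.

    W_mean, b_mean : parameter averages taken from a solution of the MIP feasibility problem
    sigma          : standard deviation assigned to every two-parameter exponential factor
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(b_mean)
    samples = []
    for _ in range(size):
        # One shifted-exponential deviate per parameter: theta = <theta> - sigma*(1 + ln u).
        u_w = 1.0 - rng.uniform(size=(n, n))          # uniform deviates in (0, 1]
        u_b = 1.0 - rng.uniform(size=n)
        W = W_mean - sigma * (1.0 + np.log(u_w))
        b = b_mean - sigma * (1.0 + np.log(u_b))
        np.fill_diagonal(W, 0.0)                      # no self-interactions
        x = rng.integers(0, 2, size=n)                # random initial binary vector
        for _ in range(n_iter):                       # iterate the self-consistent system (1)
            x_new = (W @ x + b >= 0.0).astype(int)    # round(logistic(field)) == (field >= 0)
            if np.array_equal(x_new, x):
                break
            x = x_new
        samples.append(x)
    return np.array(samples)
```

The averages can be filled in from any feasible solution of the constraints (2), while sigma is the hyperparameter discussed in the text.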

The sampling procedure of Algorithm 1 is now shown through the XOR example. Take an arbitrary solution of the MIP feasibility problem, say

(12)

Due to the rounding operator in Eq. (1), any such choice can work. In the following experiments a fixed value of the standard deviation and fixed sample sizes are used, and some of the samples drawn for each architecture are shown.

The resulting ratios of XOR patterns relative to non-XOR patterns over the entire samples were computed for each of the three cases.

4 Discussion

In the author's view, this paper presents a formalism that has the potential not only to give more efficient learning algorithms but also to improve the understanding of learning from data itself. In particular, datasets explicitly constrain the parameters of the learning model by a set of mixed binary feasibility inequalities. For fixed binary values, the system is linear and continuous. For fixed model parameters, it is a linear constraint satisfaction problem in binary variables. The author, together with collaborators, is now working on different ways to exploit these structures in order to scale the framework to realistic large-scale unsupervised learning problems. In such problems, a measure proportional to the number of satisfied constraints might be used to guide the learning procedure and to assign sensible values to the hyperparameter.
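In the reconstructed notation of Eq. (2), the structure referred to here is that each constraint

\[
\sum_{j \neq i} w_{i,j}\, x_{j,d} + b_i \;\ge\; 0 \ \ \text{if } x_{i,d} = 1,
\qquad
\sum_{j \neq i} w_{i,j}\, x_{j,d} + b_i \;<\; 0 \ \ \text{if } x_{i,d} = 0,
\]

is linear in the continuous parameters $(w, b)$ once the binary values are fixed, while for fixed parameters the same conditions become threshold constraints on the unknown binary values; the only coupling between the two groups of variables is through the bilinear terms $w_{i,j}\, x_{j,d}$.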

Acknowledgements

The author acknowledges partial financial support from UANL and CONACyT.

Conflict of interest

The author declares that he has no conflict of interest.

References