Minimal model of permutation symmetry in unsupervised learning

04/30/2019 · by Tianqi Hou, et al. · Sun Yat-sen University

Permutation of any two hidden units yields invariant properties in typical deep generative neural networks. This permutation symmetry plays an important role in understanding the computational performance of a broad class of neural networks with two or more hidden units. However, a theoretical study of the permutation symmetry is still lacking. Here, we propose a minimal model with only two hidden units in a restricted Boltzmann machine, which aims to address how the permutation symmetry affects the critical learning data size at which concept-formation (or spontaneous symmetry breaking in physics language) starts, and moreover to semi-rigorously prove a conjecture that the critical data size is independent of the number of hidden units as long as this number remains finite. Remarkably, we find that the embedded correlation between the two receptive fields of the hidden units reduces the critical data size. In particular, weakly correlated receptive fields significantly reduce the minimal data size that triggers the transition, provided the data are not too noisy. Inspired by the theory, we also propose an efficient fully-distributed algorithm to infer the receptive fields of the hidden units. Overall, our results demonstrate that the permutation symmetry is an interesting property that affects the critical data size and thus the computational performance of related learning algorithms. All of these effects can be probed analytically within the minimal model, providing theoretical insights towards understanding unsupervised learning in a more general context.


1 Introduction

Unsupervised learning is defined as the process of searching for latent features in raw (unlabeled) data, and thus serves as a fundamental property of the cerebral cortex of the brain [1, 2]. To understand unsupervised learning from a neural network perspective, the restricted Boltzmann machine (RBM) has been proposed. An RBM is a two-layered neural network with one layer called the visible layer and the other called the hidden layer; no lateral connections exist within each layer. The connections between the visible and hidden layers are called synaptic weights, and they encode latent features of the observed data. The process of learning the synaptic weights from the unlabeled data (also called training) mimics unsupervised learning. The RBM has therefore received substantial research interest from both the machine learning and statistical physics communities [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13].

Training of an RBM relies on the maximum-likelihood principle via a gradient ascent procedure. The mean activity of each neuron and the correlations between visible and hidden neurons can be estimated either by truncated Gibbs sampling [3, 14] or by advanced mean-field methods [6, 7]. However, the gradient ascent method is difficult to analyze and thus not amenable to a theoretical model. Therefore, based on the probabilistic graphical model framework, a one-bit RBM, in which only one hidden neuron is considered, was proposed to address a fundamental issue of unsupervised learning [15], i.e., how many data samples are needed for successful learning. This work revealed a continuous spontaneous symmetry breaking transition separating a random-guess phase from a concept-formation phase at a critical value of the number of provided samples (the data size) [16]. This conclusion was later generalized to RBMs with generic priors [17] and to synapses of ternary values [18].

However, it is still challenging to handle the case of multiple hidden neurons from the perspective of understanding the learning process as a phase transition. In the presence of multiple hidden neurons, permutation symmetry appears, i.e., the model of the observed data is invariant with respect to the exchange of any two hidden neurons. In addition, permutation symmetry is a common feature of many modern neural network architectures [19]. Therefore, understanding how the permutation symmetry affects the concept-formation process is important, as it may reveal core mechanisms of unsupervised learning.

Here, we propose a minimal model of the permutation symmetry in unsupervised learning, based on mean-field approximations. We show that it is possible to understand the permutation symmetry theoretically using physics approximations. To be more precise, we consider an RBM with two hidden neurons, and embed a latent feature that generates a certain number of data samples through Gibbs sampling of the original model [15]. The data samples are then learned by a theory-inspired algorithm, and finally the learned synaptic weights (a latent feature vector) are compared with the embedded ones, to test whether spontaneous symmetry breaking applies to the minimal model, to identify the key factors affecting the critical data size for learning, and to clarify how the symmetry affects the learning process. We first apply the cavity approximation from the statistical mechanics of disordered systems [20] to derive the learning algorithm from a Bayesian inference perspective; its computational performance on single instances of the model is then predicted by a replica theory. This theory introduces many copies of the original model, and the interaction between any two copies is characterized by a set of self-consistent mean-field equations, from which the critical data size for learning is determined.

Figure 1: (Color online) A schematic illustration of the minimal model. (Left panel) The original model with only two hidden neurons. (Right panel) The corresponding factor graph, where a data node is indicated by a square and the paired synapses (feature vector) by a circle. The circle is an augmented version of the single synapse considered in the one-bit RBM [15].

2 Model definition and mean-field methods

2.1 Minimal model of permutation symmetry

In this study, we use the RBM defined above with two hidden neurons (Fig. 1) to learn features embedded in the input data samples, which are raw unlabeled data. Each data sample is specified by an Ising-like spin configuration $\boldsymbol{\sigma}\in\{-1,+1\}^{N}$, where $N$ is the input dimensionality. A collection of $M$ samples is denoted as $\{\boldsymbol{\sigma}^{a}\}_{a=1}^{M}$. Synaptic values connecting visible and hidden neurons are characterized by $\boldsymbol{\xi}$, where each component takes a binary value ($\pm1$) as well. Because there are two hidden neurons, $\boldsymbol{\xi}=\{\boldsymbol{\xi}^{1},\boldsymbol{\xi}^{2}\}$, where the superscript indicates the hidden neuron's index. $\boldsymbol{\xi}^{1}$ and $\boldsymbol{\xi}^{2}$ are also called the receptive fields of the first and second hidden neurons, respectively. The statistical properties of this RBM are thus described by the Boltzmann distribution [6]

$P(\boldsymbol{\sigma}|\boldsymbol{\xi})=\frac{\cosh(\beta X_{1})\cosh(\beta X_{2})}{Z(\boldsymbol{\xi})},$ (1)

where $X_{k}=\frac{1}{\sqrt{N}}\sum_{i}\xi_{i}^{k}\sigma_{i}$ for $k=1,2$, $\beta$ is the inverse temperature, and $Z(\boldsymbol{\xi})$ is the partition function depending on the feature $\boldsymbol{\xi}$. Note that the activities of the two hidden neurons have been marginalized out. The scaling factor $1/\sqrt{N}$ ensures that the argument of the hyperbolic cosine function is of the order of unity. $\boldsymbol{\sigma}$ can be an arbitrary one of the samples. When the embedded feature is randomly generated, the inverse temperature $\beta$ tunes the noise level of the data samples generated from the feature. Clearly, the data distribution is invariant with respect to (w.r.t.) the exchange of the two hidden neurons, which is called the permutation symmetry in this paper. At least two hidden neurons are required to yield this symmetry; therefore, this setup defines a minimal model for studying the permutation symmetry in unsupervised learning.
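To make the generative side concrete, the sketch below draws data samples from the marginalized distribution (1) by block Gibbs sampling over the two (reinstated) hidden units. It is a minimal sketch in Python/NumPy, assuming the standard ±1-unit RBM energy with the $1/\sqrt{N}$ scaling described above; the function and variable names (gibbs_sample_data, xi, n_sweeps) are ours, not the paper's.

```python
import numpy as np

def gibbs_sample_data(xi, beta, n_samples, n_sweeps=200, rng=None):
    """Draw samples from the two-hidden-unit RBM of Eq. (1) by block Gibbs sampling.

    xi   : (2, N) array of +/-1 feature (receptive-field) components.
    beta : inverse temperature, i.e., the noise level of the generated data.
    A sketch only; the sampler used in the paper may differ in its details.
    """
    rng = np.random.default_rng(rng)
    K, N = xi.shape                       # K = 2 hidden neurons
    sigma = rng.choice([-1, 1], size=N)   # random initial visible configuration
    samples = []
    for t in range(n_samples * n_sweeps):
        # Hidden units given visible units: P(h_k = +1 | sigma) = sigmoid(2*beta/sqrt(N) * xi_k . sigma)
        fields_h = beta / np.sqrt(N) * (xi @ sigma)
        h = np.where(rng.random(K) < 1.0 / (1.0 + np.exp(-2.0 * fields_h)), 1, -1)
        # Visible units given hidden units: P(sigma_i = +1 | h) = sigmoid(2*beta/sqrt(N) * sum_k h_k xi_{k,i})
        fields_v = beta / np.sqrt(N) * (h @ xi)
        sigma = np.where(rng.random(N) < 1.0 / (1.0 + np.exp(-2.0 * fields_v)), 1, -1)
        if (t + 1) % n_sweeps == 0:       # keep one configuration every n_sweeps sweeps
            samples.append(sigma.copy())
    return np.array(samples)              # shape (n_samples, N)
```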

In this model, the embedded feature is generated randomly from a factorized prior over components, in which a correlation parameter $q_{0}$ controls the fraction of components taking different values in the two feature maps associated with the two hidden neurons (equivalently, the typical overlap between the two maps).
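For illustration, correlated feature maps of this kind can be generated as follows; this is a sketch of one natural parameterization consistent with the description above (uniform ±1 components for the first map, independent component-wise flips for the second), and the function name sample_correlated_features and the use of q0 for the overlap are our own choices.

```python
import numpy as np

def sample_correlated_features(N, q0, rng=None):
    """Generate two +/-1 feature maps whose typical overlap (1/N) xi1.xi2 equals q0.

    Each component of xi2 disagrees with xi1 with probability (1 - q0)/2,
    so the expected overlap is q0 (a sketch; the paper's prior may differ).
    """
    rng = np.random.default_rng(rng)
    xi1 = rng.choice([-1, 1], size=N)
    flip = rng.random(N) < (1.0 - q0) / 2.0   # fraction of differing components
    xi2 = np.where(flip, -xi1, xi1)
    return xi1, xi2

# Example: N = 1000 and q0 = 0.2 give an empirical overlap close to 0.2.
xi1, xi2 = sample_correlated_features(1000, 0.2, rng=0)
print(xi1 @ xi2 / 1000.0)
```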

Given the $M$ data samples, one obtains the posterior probability of the embedded feature according to Bayes' rule:

$P(\boldsymbol{\xi}|\{\boldsymbol{\sigma}^{a}\})=\frac{1}{\Omega}\prod_{a=1}^{M}\frac{\cosh(\beta X_{1}^{a})\cosh(\beta X_{2}^{a})}{Z(\boldsymbol{\xi})},$ (2)

where $X_{k}^{a}=\frac{1}{\sqrt{N}}\sum_{i}\xi_{i}^{k}\sigma_{i}^{a}$ and $\Omega$ is the partition function of the minimal model. For simplicity, a uniform prior for $\boldsymbol{\xi}$ is assumed, i.e., we have no prior knowledge about $\boldsymbol{\xi}$, although there may exist correlations between the two feature maps. In addition, we use the same temperature as that used to generate the data. Because we do not use the true prior, the current setting does not require the value of $q_{0}$, and it is therefore not the Bayes-optimal setting, which corresponds to the Nishimori condition in physics [16]. Using the uniform prior is thus more computationally challenging. We leave a detailed analysis of the Bayes-optimal setting to future work.

One obstacle to computing the posterior probability is the nested partition function $Z(\boldsymbol{\xi})$. Fortunately, this partition function can be simplified in the large-$N$ limit. More precisely,

$Z(\boldsymbol{\xi})=\sum_{\boldsymbol{\sigma}}\cosh(\beta X_{1})\cosh(\beta X_{2})\simeq 2^{N}e^{\beta^{2}}\cosh(\beta^{2}q),$ (3)

where we have used $\ln\cosh x\simeq x^{2}/2$ for small $x$ to arrive at the final equality, and $q\equiv\frac{1}{N}\sum_{i}\xi_{i}^{1}\xi_{i}^{2}$, which is exactly the overlap between the two feature maps. To sum up, moving all the irrelevant constants into the partition function $\Omega$, the posterior probability can be rewritten as

$P(\boldsymbol{\xi}|\{\boldsymbol{\sigma}^{a}\})=\frac{1}{\Omega}\prod_{a=1}^{M}\frac{\cosh(\beta X_{1}^{a})\cosh(\beta X_{2}^{a})}{\cosh(\beta^{2}q)},$ (4)

which forms the Boltzmann distribution of our minimal model. In this paper, we consider the case of $M=\alpha N$, where $\alpha$ specifies the data (constraint) density.
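As a concrete illustration of Eq. (4), the unnormalized log-posterior of a candidate pair of feature maps can be evaluated directly. The sketch below assumes the form of Eq. (4) as reconstructed above and uses our own naming (log_posterior_unnormalized); it is only a consistency check, since the actual inference is carried out by the message-passing algorithm derived next.

```python
import numpy as np

def log_cosh(x):
    """Numerically stable log(cosh(x))."""
    ax = np.abs(x)
    return ax + np.log1p(np.exp(-2.0 * ax)) - np.log(2.0)

def log_posterior_unnormalized(xi1, xi2, data, beta):
    """Unnormalized log of Eq. (4): sum_a [log cosh(beta*X1^a) + log cosh(beta*X2^a)] - M*log cosh(beta^2*q).

    data : (M, N) array of +/-1 spins; xi1, xi2 : (N,) arrays of +/-1 synapses.
    """
    N = xi1.size
    q = xi1 @ xi2 / N                 # overlap between the two feature maps
    X1 = data @ xi1 / np.sqrt(N)      # per-sample field of the first hidden unit
    X2 = data @ xi2 / np.sqrt(N)      # per-sample field of the second hidden unit
    return np.sum(log_cosh(beta * X1) + log_cosh(beta * X2)) - data.shape[0] * log_cosh(beta**2 * q)
```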

2.2 Cavity approximation to handle the posterior probability

In what follows, we compute the maximizer of the posterior marginals (MPM) estimator [16], in which the feature maps of the two hidden neurons are combined, so the prediction is an augmented version of the inferred feature vector of the one-bit RBM [15]. Hence, the task is to compute marginal probabilities, which is intractable due to the interaction among data constraints (the product over data samples in Eq. (4)). However, by mapping the original model (Eq. (4)) onto a graphical model (Fig. 1), in which data constraints and paired synapses are treated as factor (data) nodes and variable nodes, respectively, one can estimate the marginal probabilities by running a message-passing iteration between factor and variable nodes, as we shall explain below. The key assumption is that the paired synapses on the graphical model are weakly correlated, which is called the Bethe approximation [21] in physics.

We first define a cavity probability with the data node removed. Under the weak correlation assumption, obeys a self-consistent equation:

(5a)
(5b)

where is a normalization constant, denotes neighbors of the feature node except the data node , denotes neighbors of the data node except the feature node , and the auxiliary quantity denotes the contribution from data node to feature node given the value of  [6, 15]. Products in Eq. (5) result from the weak correlation assumption. In addition, , , and the cavity version of is defined as , which can be further replaced by its typical value obtained by the average over the cavity probability (to be shown below). Although this is a crude approximation, it works quite well in practice.

Still, the above self-consistent equation is intractable due to the summation needed to estimate the message. Nevertheless, a careful inspection reveals that the two quantities involved are approximately correlated Gaussian random variables due to the central limit theorem. As a result, the intractable summation can be replaced by an integral that is easy to calculate in this model. We just need to compute the following means, variances, and covariance of these random variables.

(6a)
(6b)
(6c)
(6d)
(6e)

where the first four quantities denote the means and variances of the two Gaussian random variables, respectively, and the last quantity denotes the covariance between them. The cavity magnetization and the cavity correlation are defined as the corresponding averages under the cavity probability. Finally, using the above parameters of the correlated Gaussian distribution, we rewrite the message as

(7)

where , , and stemming from in Eq. (5b) replaced by its cavity mean. The above integral representation of can be analytically estimated; for convenience, we define . It is easy to show that

(8)

To close the iteration equation, we need to compute the cavity magnetization and correlation as follows:

(9a)
(9b)
(9c)

can be interpreted as the message passing from feature node to data node ( is also similarly interpreted), while can be interpreted as the message passing from data node to feature node .

If the weak correlation assumption is self-consistent, starting from randomly initialized messages, the learning equations will converge to a fixed point corresponding to a thermodynamically dominant minimum of the Bethe free energy function [22], which is given by . The free energy contributions of variable node and data node are given respectively by:

(10a)
(10b)

The forms of these quantities are similar to their cavity counterparts (e.g., in Eq. (8)), the only difference being that the contribution of the node under consideration is not excluded. Once the iteration converges, the MPM estimator predicts each feature component as the sign of its full (non-cavity) magnetization, which is computed by taking into account the contributions of all adjacent data nodes to the node in question (see Eq. (9); the cavity exclusion is thus removed).
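To make the read-out explicit, the sketch below shows only the final MPM step and the overlap used later for comparison with the theory. It assumes the converged marginals of the paired synapses are available as an (N, 4) array over the four joint states, a data layout and naming (mpm_estimate, PAIR_STATES) of our own; the message-passing updates of Eqs. (5)-(9) themselves are omitted.

```python
import numpy as np

# The four joint states of a paired synapse (xi_i^1, xi_i^2).
PAIR_STATES = np.array([[+1, +1], [+1, -1], [-1, +1], [-1, -1]])

def mpm_estimate(marginals):
    """MPM read-out from converged marginals.

    marginals : (N, 4) array with marginals[i, s] = P(xi_i^1, xi_i^2 = PAIR_STATES[s]).
    Returns the sign estimates of the two feature maps (ties broken towards +1).
    """
    m1 = marginals @ PAIR_STATES[:, 0]   # full magnetization of the first synapse
    m2 = marginals @ PAIR_STATES[:, 1]   # full magnetization of the second synapse
    return np.where(m1 >= 0, 1, -1), np.where(m2 >= 0, 1, -1)

def overlap(xi_hat, xi_true):
    """Overlap between an inferred feature map and its ground truth (1 means perfect recovery)."""
    return float(np.mean(xi_hat * xi_true))
```

Note that, because of the permutation symmetry, the inferred pair of feature maps may match the true pair only up to an exchange of the two hidden units, so in practice the overlap can be evaluated for both assignments.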

2.3 Replica theory of the minimal model

To have an analytic argument about the critical threshold for spontaneous symmetry breaking, we calculate the free energy in the thermodynamic limit using the replica method. Instead of calculating the disorder average of the logarithm of the partition function directly, the replica method computes the disorder average of an integer power $n$ of the partition function. In physics, this corresponds to preparing $n$ replicas of the original system; the rescaled free energy density can then be obtained as [16]

(11)

where the limits $n\to 0$ and $N\to\infty$ have been exchanged, such that the thermodynamic limit can be taken first in order to apply Laplace's method (saddle-point analysis) [23]; the disorder average is taken over all possible samplings (data) and over the random realizations of the true feature vector. The explicit form of the replicated partition function reads

(12)

where the superscript indicates the replica index. Note that the correlation level $q_{0}$ is pre-determined and used to generate the random true feature maps, as defined in section 2.1. We leave the technical details to Appendix A and give the final result here. The free energy function reads

(13)

where the angular brackets denote an average w.r.t. a standard Gaussian measure vector (as defined below Eq. (7)), and a second average is defined similarly. The final result is obtained under the replica-symmetry (RS) assumption, which implies that the order parameters (various kinds of overlaps, explicitly defined below) do not depend on their specific replica indices. We adopt this assumption because, as we shall show, it leads to predictions consistent with the algorithmic results on single instances. The auxiliary quantities are defined as follows,

(14a)
(14b)
(14c)
(14d)
(14e)
(14f)
(14g)

where , , and .

The associated (non-conjugated) saddle-point equations are expressed as

(15a)
(15b)
(15c)
(15d)
(15e)
(15f)
(15g)
(15h)

Note that the average w.r.t. the true features can be written explicitly by definition: it assigns one weight to the case where the two true components take different values, and the complementary weight otherwise, with this weight determined by the correlation level $q_{0}$. The outer average also includes the disorder average over the true features. The inner average indicates the thermal average under the partition function of a two-spin-interaction Hamiltonian. This average is analytically tractable, and the remaining averages can be computed similarly.

We further comment on the meaning of the order parameters: one pair characterizes the typical overlap between the inferred value of the first feature and its true counterpart, and likewise between the second feature and its ground truth; another pair characterizes the sizes of the first and second feature spaces, respectively; a further pair characterizes the correlation of the two features within the same replica and across different replicas; and the remaining pair characterizes the typical correlation between the first (second) true feature and the inferred value of the second (first) feature in an arbitrary replica. This last pair is thus responsible for the permutation-symmetry effect. Taken together, these quantities form the order-parameter set of our model. Their exact mathematical definitions are given in Appendix A.

Finally, the conjugated order parameters can also be derived from a saddle point analysis of the free energy function, and they obey the following equations:

(16a)
(16b)
(16c)
(16d)
(16e)
(16f)
(16g)
(16h)

where the average , , and . The auxiliary quantities are defined as follows,

(17a)
(17b)
(17c)

To sum up, Eqs. (15) and (16) constitute a closed set of iterative equations (detailed derivations are given in Appendix B), whose fixed point gives an approximate evaluation of the free energy. Meanwhile, the obtained order parameters, especially the overlaps with the ground truth, can be compared with the algorithmic results, and can also be used to derive analytically the critical threshold for unsupervised learning in this permutation-symmetric model. When all order parameters vanish, the free energy takes a simple analytic value consistent with the expected result in the same limit. We emphasize that in the correlation-free limit, these saddle-point equations reduce to the result of Ref. [16] for unsupervised feature learning in a one-bit RBM. When the data size is not sufficient, we expect the order parameters to vanish; expanding the saddle-point equations in the small-order-parameter limit yields four linearized equations, from which it is easy to show that the critical learning threshold is given by

(18)

As expected, the critical value for our model does not depend on the sign of $q_{0}$. A detailed derivation of the threshold is given in Appendix C. In the correlation-free limit $q_{0}=0$, the known result of the one-bit RBM [15, 16] is recovered. Thus, we give the first theoretical proof of the conjecture, made from empirical observations in Ref. [17], that once the number of hidden neurons is finite (not proportional to $N$), the critical learning threshold does not change with this finite number. However, Ref. [17] overlooks the effect of potential correlations across the true hidden features, which indeed affect the learning threshold. Remarkably, this effect, more natural than the ideal correlation-free case, is clearly captured by our theory (Eq. (18)). We show this effect in Fig. 2. We can see that, for a fixed noise level, increasing $q_{0}$ decreases the critical data size. That is, if the data set is created by strongly correlated receptive fields (feature maps), then learning is relatively easy, i.e., fewer samples are sufficient to trigger the phase transition of concept-formation. Conversely, if $q_{0}$ is zero, the data are generated from independent feature maps, and successful learning requires a much larger dataset. We also observe that for a large $\beta$ (less noisy data), a small correlation level can already significantly reduce the minimal data size that triggers learning, as reflected by the fact that, in the small-$q_{0}$ region, the larger $\beta$ is, the sharper the surface becomes.
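Numerically, the threshold and the order-parameter curves behind Figs. 2 and 3 follow from iterating the closed saddle-point system to a fixed point. The sketch below shows only the generic damped iteration scheme; update_map is a placeholder of ours standing in for the right-hand sides of Eqs. (15)-(16), and the dummy map in the example is for illustration only.

```python
import numpy as np

def damped_fixed_point(update_map, x0, damping=0.5, tol=1e-8, max_iter=10000):
    """Solve x = update_map(x) by damped iteration: x <- (1 - d)*x + d*update_map(x).

    update_map should implement the right-hand sides of the saddle-point equations
    (Eqs. (15)-(16)); here it is just an abstract callable acting on the order parameters.
    """
    x = np.asarray(x0, dtype=float)
    for it in range(max_iter):
        x_new = (1.0 - damping) * x + damping * np.asarray(update_map(x), dtype=float)
        if np.max(np.abs(x_new - x)) < tol:
            return x_new, it
        x = x_new
    return x, max_iter

# Illustration with a dummy contraction map (NOT the saddle-point equations):
order_params, n_iter = damped_fixed_point(lambda x: 0.5 * np.tanh(x) + 0.1, np.zeros(8))
print(order_params[0], n_iter)
```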

Figure 2: (Color online) The critical value of data size (Eq. (18)) for learning in our minimal model, as a function of the noise level as well as the correlation level .
Figure 3: (Color online) The critical value of data size (Eq. (18)) as a function of the noise level . We consider the weak-feature-correlation limit at different orders of magnitude compared with . We use the value of for an example. The dashed line indicates the third case of Eq. (19).

Next, we analyze two interesting limits implied by the critical threshold equation (Eq. (18)). In the first limit, the threshold takes a simple form provided that $\beta$ is relatively large. The second case is the opposite limit, in which $q_{0}$ takes a small but nonzero value, implying that a weak correlation among the feature maps is maintained. Depending on the order of magnitude of $q_{0}$, we have the following result for a relatively large $\beta$:

(19)

Note that the limit here means any sufficiently large value of $\beta$ rather than a definite value of infinity. Eq. (19) reveals that once the two feature maps are weakly correlated, the minimal data size for a learning transition can be further (and even significantly) reduced compared with the correlation-free case, particularly when $q_{0}$ is not very small but still larger than the order of magnitude set by the noise level. We show this result in Fig. 3.

We thereby arrive at a significant hypothesis for the triggering of concept-formation: a correlation level that is small yet not too small (compared with the scale identified above) is highly favorable for unsupervised learning from a dataset of smaller size than in the correlation-free case. Regularization techniques such as locally enforcing feature orthogonality [24] and dropping some connections during training [25] have been introduced into deep learning. Weakly correlated receptive fields are also favored from the perspective of neural computation, since the redundancy among synaptic weights is reduced, and different feature detectors inside the network can thus efficiently encode stimulus features rather than capturing noise in the data. A similar decorrelation of hidden activities was recently analyzed theoretically in feedforward neural networks [26]. We hope that our theoretical prediction can be verified in specific machine learning tasks, and even in neuroscience experiments, where the relationship among the minimal data size for learning, the correlation level of synapses (or receptive fields), and the noise level of stimuli can be jointly established. Therefore, from the Bayesian learning perspective, the correlated-feature-map case yields a much lower threshold for the phase transition towards concept formation than the correlation-free case [15, 16, 17].

Overall, the prediction quantitatively captures the learning behavior in both uncorrelated and correlated settings. In the next section, we further verify this conclusion by extensive numerical simulations on single instances of the minimal model.

3 Results and discussion

In this section, we study how the permutation symmetry between two hidden neurons affects the learning process, i.e., the spontaneous symmetry-breaking transition of concept-formation during unsupervised learning. We focus on whether the replica theory correctly predicts the learning threshold associated with the continuous transition. The learning threshold can be estimated from the message-passing algorithmic results on single instances of the minimal model. We first randomly generate true feature maps with a pre-determined correlation level $q_{0}$. The true feature maps are then used to generate Monte Carlo samples through a Gibbs sampling procedure [15] according to Eq. (1). Finally, these samples are used as quenched disorder for the Bayesian inference of the true feature maps [16]. The overlap with the ground truth is computed and compared with the replica prediction. For this comparison, we define the overlap [16], e.g.,