Statistical physics of unsupervised learning with prior knowledge in neural networks

11/06/2019 ∙ by Tianqi Hou, et al.

Integrating sensory inputs with prior beliefs from past experiences in unsupervised learning is a common and fundamental characteristic of brain and artificial neural computation. However, the quantitative role of prior knowledge in unsupervised learning remains unclear, hindering a scientific understanding of unsupervised learning. Here, we propose a statistical physics model of unsupervised learning with prior knowledge, revealing that the sensory inputs drive a series of continuous phase transitions related to spontaneous intrinsic-symmetry breaking. The intrinsic symmetry includes both reverse symmetry and permutation symmetry, commonly observed in most artificial neural networks. Compared to the prior-free scenario, the prior further reduces the minimal data size that triggers the reverse-symmetry-breaking transition, and moreover, the prior merges, rather than separates, the permutation-symmetry-breaking phases. We claim that the prior can be learned from the data samples, which in physics corresponds to a two-parameter Nishimori-plane constraint. This work thus reveals mechanisms underlying the influence of the prior on unsupervised learning.


Supplemental Material

Appendix A Message passing algorithms for unsupervised learning with prior information

In our current setting, we assume that the statistical inference of synaptic weights from the raw data uses the correct prior information, parameterized by the correlation level between the two receptive fields. According to Bayes' rule, the posterior probability of the synaptic weights is given by

(S1)

where the first quantity represents the overlap of the two RFs, and the normalization constant is the so-called partition function in statistical physics.
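
Since the explicit form of Eq. (S1) is model-specific, it is worth recalling the generic structure of such a Bayesian posterior; in the schematic formula below, the data D, the prior P_0 and the weight vector w are placeholder symbols rather than the notation of the main text:

P(\boldsymbol{w}\mid \mathcal{D}) = \frac{P(\mathcal{D}\mid \boldsymbol{w})\,P_0(\boldsymbol{w})}{Z}, \qquad Z = \sum_{\boldsymbol{w}} P(\mathcal{D}\mid \boldsymbol{w})\,P_0(\boldsymbol{w}),

with Z playing the role of the partition function.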

Using the Bethe approximation Hou et al. (2019), we can easily write down the belief propagation equations as follows,

(S2a)
(S2b)

where we define auxiliary variables for the cavity messages, and the neighborhood of a data node is understood to exclude the synaptic weight under consideration. In our model, all synaptic weights are used to explain each data sample. Belief propagation is commonly defined on a factor graph representation, where the synaptic-weight pair acts as the variable node, while the data sample acts as the factor node (or constraint to be satisfied) Mézard and Montanari (2009). Learning can then be interpreted as the process of synaptic-weight inference based on the data constraints. The cavity probability is defined as the probability of the weight pair without considering the contribution of one data node, and the accompanying constant is thus a normalization constant for this cavity probability. The cavity probability can then be parameterized by the cavity magnetizations and the cavity correlation. The remaining factor represents the contribution of one data node given the value of the weight pair. Due to the central limit theorem, the two data-dependent sums entering this contribution can be considered as two correlated Gaussian random variables. We thus define the means and variances of these two variables, together with their covariance. Moreover, the weight-dependent quantity entering these moments is approximated by its cavity mean. As a result, the intractable summation in Eq. (S2b) can be replaced by a jointly-correlated Gaussian integral,

(S3)

where the integration is over the standard Gaussian measure, with the arguments expressed through the cavity means and (co)variances defined above.
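
As a generic illustration of this step (not the paper's exact Eq. (S3): the function f, the means \mu_{1,2}, the variances \sigma_{1,2}^2 and the covariance c below are placeholder symbols), a pair of sums that the central limit theorem renders jointly Gaussian can be averaged as

\langle f(u_1,u_2)\rangle \simeq \int \mathcal{D}z_1\,\mathcal{D}z_2\; f\!\left(\mu_1+\sigma_1 z_1,\ \mu_2+\frac{c}{\sigma_1}z_1+\sqrt{\sigma_2^2-c^2/\sigma_1^2}\,z_2\right), \qquad \mathcal{D}z \equiv \frac{dz}{\sqrt{2\pi}}\,e^{-z^2/2},

where the second argument is a Cholesky-type parameterization that reproduces the prescribed variance \sigma_2^2 and covariance c.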

We further define the cavity bias as

(S4)

Using Eq. (S2a), the cavity magnetizations and the cavity correlation can be computed as follows,

(S5)

Starting from random initial values of the cavity magnetizations and correlations, the above belief propagation is iterated until convergence. To carry out the inference of synaptic weights (the so-called learning), one only needs to compute the full magnetizations, obtained from Eq. (S5) by including the contributions of all data nodes. The free energy can also be estimated under the Bethe approximation, where the single synaptic-weight-pair contribution and the single-data-sample contribution are given as follows,

(S6a)
(S6b)

where the remaining auxiliary quantities are defined in analogy with those introduced above.
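
To make the overall cavity machinery concrete, the following minimal Python sketch implements the same ingredients (random initialization, damped message updates until convergence, full magnetizations assembled from cavity messages) on a toy one-dimensional Ising chain. It is an illustration of the generic belief-propagation structure only, not an implementation of Eqs. (S2)-(S6), whose model-specific updates are not reproduced here.

import numpy as np

# Toy illustration only (not the model of this paper): sum-product / cavity
# message passing on a 1D Ising chain with couplings J and local fields h,
# iterated with damping until convergence. Messages are cavity fields
# u_right[i] (spin i -> spin i+1) and u_left[i] (spin i+1 -> spin i).

N = 20
rng = np.random.default_rng(0)
J = rng.normal(0.0, 1.0, size=N - 1)   # coupling on edge (i, i+1)
h = rng.normal(0.0, 0.5, size=N)       # local field on spin i
beta = 1.0                             # inverse temperature
damping = 0.5

u_right = np.zeros(N - 1)
u_left = np.zeros(N - 1)

def bp_update(J_edge, cavity_field):
    # Standard Ising BP rule: u = atanh(tanh(beta*J) * tanh(beta*H)) / beta
    return np.arctanh(np.tanh(beta * J_edge) * np.tanh(beta * cavity_field)) / beta

for sweep in range(1000):
    new_right = np.empty_like(u_right)
    new_left = np.empty_like(u_left)
    for i in range(N - 1):
        left_in = u_right[i - 1] if i > 0 else 0.0       # message arriving at spin i from its left
        new_right[i] = bp_update(J[i], h[i] + left_in)
        right_in = u_left[i + 1] if i < N - 2 else 0.0   # message arriving at spin i+1 from its right
        new_left[i] = bp_update(J[i], h[i + 1] + right_in)
    diff = max(np.abs(new_right - u_right).max(), np.abs(new_left - u_left).max())
    u_right = damping * u_right + (1.0 - damping) * new_right
    u_left = damping * u_left + (1.0 - damping) * new_left
    if diff < 1e-10:
        break

# Full magnetizations: combine the local field with both incoming cavity fields.
m = np.array([np.tanh(beta * (h[i]
                              + (u_right[i - 1] if i > 0 else 0.0)
                              + (u_left[i] if i < N - 1 else 0.0))) for i in range(N)])
print(np.round(m, 4))

On a chain (a tree), this message passing is exact; in the present model the analogous updates involve the Gaussian-approximated factor contributions of Eqs. (S3)-(S5).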

Appendix B Replica analysis of the model

For the replica analysis, we need to evaluate the disorder average of an integer power of the partition function, where the disorder average is taken over the true RF distribution, which factorizes over components, and over the corresponding data distribution, as

(S7)

where the superscript denotes the replica index. The typical free energy can then be obtained from the replica trick by taking the limit of a vanishing number of replicas (the standard identity is recalled after Eq. (S8b) below). To compute the disorder-averaged replicated partition function explicitly, we need to specify the order parameters as follows:

(S8a)
(S8b)
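
For reference, the standard replica identity underlying the step above reads (the normalization by the system size used in the main text is omitted here)

\overline{\ln Z} = \lim_{n\to 0}\frac{\overline{Z^n}-1}{n} = \lim_{n\to 0}\frac{\ln \overline{Z^n}}{n},

so that the quenched free energy is obtained by computing the disorder average of Z^n for integer n and analytically continuing the result to n going to zero.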

Inserting these definitions in the form of delta functions, together with their integral representations, one can decompose the computation of the replicated partition function into entropic and energetic parts. To further simplify the computation, we make a simple Ansatz, namely that the order parameters are invariant under permutations of the replica indices. This is the so-called replica-symmetric (RS) Ansatz, which reads,

(S9a)
(S9b)

for any replica index, and

(S10a)
(S10b)

for any pair of distinct replica indices. Note that the conjugated order parameters are those introduced when using the integral representation of the delta functions.
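
Schematically, a replica-symmetric Ansatz for an overlap-type order parameter q_{ab} between replicas a and b (the letters Q and q below are generic placeholders, not the notation of Eqs. (S9) and (S10)) takes the form

q_{ab} = Q\,\delta_{ab} + q\,(1-\delta_{ab}), \qquad a,b = 1,\dots,n,

i.e., one value on the diagonal and a single common value off the diagonal, and likewise for the conjugated order parameters.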

Then we can reorganize the disorder-averaged replicated partition function as

(S11)

where the two sets of integration variables denote, respectively, all non-conjugated and conjugated order parameters. In the limit of large system size, the integral is dominated by an equilibrium action:

(S12)

where the first part is the entropic term, and the second part is the energetic term.
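
The dominance of the integral by a stationary action is the standard Laplace (saddle-point) argument. Schematically, writing N for the large system size, S for the action of Eq. (S12), and Omega, hat-Omega for the non-conjugated and conjugated order parameters,

\overline{Z^n} = \int d\Omega\, d\hat{\Omega}\; e^{N S(\Omega,\hat{\Omega})} \asymp e^{N S(\Omega^{\ast},\hat{\Omega}^{\ast})}, \qquad \left.\frac{\partial S}{\partial \Omega}\right|_{\ast} = \left.\frac{\partial S}{\partial \hat{\Omega}}\right|_{\ast} = 0,

up to subleading corrections in N.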

We first compute the entropic term as follows,

(S13)

After somewhat lengthy algebraic manipulations using the techniques developed in our previous work Hou et al. (2019), we arrive at the final result for the entropic term as

(S14)

where the definition involves three independent standard Gaussian random variables, and a disorder average is taken with respect to the true prior. From this expression, an effective two-spin interaction Hamiltonian can be extracted, determining the effective partition function in the main text. The effective fields and coupling are given as follows,

(S15a)
(S15b)
(S15c)

The entropic term therefore follows from these effective fields and coupling.
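
The role of the three independent standard Gaussian variables can be illustrated by the generic construction of two correlated effective fields sharing a common component (the symbols h_{1,2}, z_{0,1,2}, v and c below are illustrative and do not reproduce Eq. (S15)):

h_1 = \sqrt{c}\, z_0 + \sqrt{v-c}\, z_1, \qquad h_2 = \sqrt{c}\, z_0 + \sqrt{v-c}\, z_2, \qquad z_0,z_1,z_2 \sim \mathcal{N}(0,1),

which gives \mathrm{Var}(h_1)=\mathrm{Var}(h_2)=v and \mathrm{Cov}(h_1,h_2)=c for 0 \le c \le v.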

Next, we compute the energetic term given by

(S16)

where the average is taken over the disorder. Four data-dependent quantities appear, each involving a typical data sample; due to the central limit theorem, they are correlated Gaussian random variables. To satisfy their covariance structure, determined by the order parameters, these random variables are parameterized by six standard Gaussian variables of zero mean and unit variance as follows,

(S17a)
(S17b)
(S17c)
(S17d)

where the coefficients entering Eq. (S17) are fixed by the order parameters. Therefore, the energetic term can be calculated by a standard Gaussian integration given by

(S18)

where the remaining shorthand follows from the definitions above.
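
More generally, any prescribed (positive-definite) covariance structure among correlated Gaussian variables, such as the one encoded in Eq. (S17), can be realized through linear combinations of independent standard Gaussians. The paper uses six standard Gaussians with shared components; the short Python sketch below only illustrates the general principle with a Cholesky parameterization and an arbitrary example covariance (all numerical values are illustrative).

import numpy as np

# Illustration: realize four correlated Gaussian variables with a prescribed
# covariance matrix through independent standard Gaussians (x = L z, L L^T = cov).
rng = np.random.default_rng(1)

A = rng.normal(size=(4, 4))
cov = A @ A.T + 4.0 * np.eye(4)              # arbitrary symmetric positive-definite target

L = np.linalg.cholesky(cov)                  # lower-triangular factor
z = rng.standard_normal(size=(4, 200_000))   # independent standard Gaussians
x = L @ z                                    # correlated variables with Cov(x) = cov

print(np.round(np.cov(x), 2))                # empirical covariance, close to the target
print(np.round(cov, 2))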

By introducing the auxiliary variables as follows,

(S19a)
(S19b)
(S19c)

we finally arrive at the free energy as

(S20)

where the remaining shorthand is defined accordingly. The saddle-point analysis in Eq. (S11) requires that the order parameters be a stationary point of the free energy. All conjugated and non-conjugated order parameters therefore obey saddle-point equations, obtained by setting the corresponding derivatives of the free energy with respect to the order parameters to zero. Here we skip the technical details of deriving the saddle-point equations and refer the interested reader to our previous work Hou et al. (2019).

The saddle-point equations for non-conjugated order parameters are given by

(S21a)
(S21b)
(S21c)
(S21d)
(S21e)
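
In practice, coupled saddle-point equations of this type are solved numerically by damped fixed-point iteration. The following minimal Python sketch applies this generic procedure to the textbook mean-field (Curie-Weiss) self-consistency equation m = tanh(beta*(J*m + h)), as a stand-in for the coupled updates of Eqs. (S21); it is not an implementation of those equations.

import numpy as np

# Generic damped fixed-point iteration for a self-consistency (saddle-point)
# equation, illustrated on the Curie-Weiss magnetization m = tanh(beta*(J*m + h)).
beta, J, h = 1.2, 1.0, 0.05
damping = 0.5
m = 0.1                                  # initial guess

for it in range(10_000):
    m_new = np.tanh(beta * (J * m + h))  # right-hand side of the saddle-point equation
    if abs(m_new - m) < 1e-12:
        m = m_new
        break
    m = damping * m + (1.0 - damping) * m_new

print(f"converged magnetization m = {m:.6f} after {it + 1} iterations")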