Markov Chain Monte Carlo (MCMC) algorithms for combinatorial optimization problems are designed to converge to a stationary distribution that is a monotone decreasing function of the objective function one needs to minimize. A fictitious temperature is usually introduced to make the distribution more and more focused on the optima. Depending on the form of the stationary distribution (i.e. on the temperature) the sampling process can converge fast or it can get trapped in local minima. There is typically a tradeoff between optimality of the sampled solutions and the form of
: smooth and close to uniform distributions are the easiest to sample but the sampled configuration are most often far from optimal. On the contrary hard to sample distributions are characterized by a so called glassy landscape where the number of metastable minima that can trap the MCMC and brake ergodicity are typically exponentially numerouskrzakala2007gibbs ; mezard_information_2009 .
Random constraint satisfaction problems (CSPs) offer an ideal framework for understanding these type of questions in that they allow to analyze the geometry of the space solutions of hard to sample problems and at the same time design and test novel algorithms mezard2002analytic . In many random CSPs, the computational hardness is associated to the existence of optimal and metastable states that are grouped into different clusters of nearby solutions with different sizes and properties. Finer geometrical properties of the space of solutions have been investigated in the literature Zhou2009 ; Li2009 ; Mezard2005 , but a general scenario has yet to be established.
Large deviation analysis allows to describe in some detail the structures of such clusters, ranging from the dominant clusters (those in which one would fall by choosing uniformly at random a solution) to the subdominant ones. Note that in general, algorithms are by no means bounded to sample solutions uniformly at random.
Very recently it has been shown baldassi-subdominant
that problems that were believed to be intractable due to their glassy nature, namely the learning problems in neural networks with discrete synaptic weights, possess in fact a richer structure of the space of solutions than what is suggested by the standard equilibrium analysis. As it turns out, there exist sub-dominant and extremely dense clusters of solutions that can be analytically unveiled by defining a probabilistic weight based on the local entropy, i.e. on the number of solutions within a given radius from a reference solution. Despite being sub-dominant, these states turn out to be accessible by extremely simple heuristic algorithms.
Here, we move forward and study the same structures without enforcing the constraint that the reference configuration is itself a solution. This apparent simplification actually requires higher levels of replica-symmetry breaking in the analysis. The resulting measure defines an objective function, namely the local entropy, that we then use to devise a novel MCMC, which we call Entropy-driven Monte Carlo (EdMC). When applied to the binary perceptron learning problem, EdMC yields algorithmic results that are in very good agreement with the theoretical computations, and by far outperform standard Simulated Annealing in a direct comparison.
We also applied this approach to the -SAT problem, showing that even when some of the computations can not be carried out exactly the practical performance is indeed very good even in hard regions. Further heuristic algorithmic improvements are also discussed.
Here, we focus on the representative case of zero temperature and Hamming distance, but the technique could be easily extended to go beyond these simple assumptions.
The rest of the paper is organized as follows: in Sec. I.1 we present the results of the analysis of the sub-dominant clusters of solutions and define the novel EdMC algorithm, in Sec. II we report extensive numerical results, comparing EdMC to the theoretical results and to standard Simulated Annealing, while in Sec. III we discuss our findings. Two Appendices follow the main text: in the first, Appendix A, we provide details about the Belief Propagation algorithm; in the second, Appendix B, we report a detailed self-contained description of the analytical computations.
i.1 Sub-dominant clusters analysis and Entropy-driven Monte Carlo
In most random combinatorial optimization problems the so called dominating states of the equilibrium Gibbs measure at zero temperature (the ground states) are not relevant in the analysis of practical optimization algorithm and their dynamics. The algorithmically accessible states are typically sub-dominant states characterized by a high internal entropy dall2008entropy . The structure of such sub-dominant states can be investigated by means of the Replica Method or the Cavity Method, at least in the average case.
The approach we will present in the following extends our recent findings in the discrete Perceptron Learning Problem baldassi-subdominant , where we carried out a large deviations analysis by introducing a reweighting of the solutions by the local entropy
, i.e. by the number of other solutions surrounding a given one. We show that even if we remove the explicit constraint that the reference configuration is a solution, optimizing the local entropy has approximately the same effect with high probability: in other words we find that if we can estimate the local entropy for any given configuration, and then seek the configuration for which it is maximum, in all likelihood we will end up in a solution.
This naturally raises the possibility to translate such theoretical predictions into a general practical solver, because even though computing the local entropy may be more difficult than computing the energy, the resulting landscape may be radically different. A simple and direct way to show that this is the case is to compare two MCMC implementations, one optimizing the energy and one optimizing the local entropy (we call the latter Entropy-driven Monte Carlo, or EdMC). Indeed, the local entropy landscape proves to be much smoother, allowing to reach ground states inaccessible to the energetic MCMC.
i.1.1 Large deviations analysis of CSPs
A generic Constraint Satisfaction Problem (CSP) can be defined in terms of configurations of variables , subject to constraints . Each constraint involves a subset of the variables, which we collectively represent as , and we define if the constraint is satisfied, otherwise. For concreteness, let us focus on the case of binary spin variables , the generalization to multi-valued variables being straightforward baldassi-subdominant-multivalued . We may define an energy function of the system simply as the number of violated constraints, namely:
A solution of a CSP is then a zero-energy configuration. The standard zero-temperature Gibbs measure for CSPs, which we will call equilibrium measure, assumes uniform weight over the space of solutions and it is the one associated to the partition function
which just counts the solutions.
Suppose now one wants to analyze the local structure of the solution space by counting the number of solution vectorsaround a given planted vector . To this end, we define the local free entropy, as a function of the planted configuration, as:
counts all solutions to the CSP with a weight that depends on the distance from and is modulated by the parameter . Solving a CSP amounts to finding a solution vector such that : we shall show below that this can be achieved by optimizing the cost function over , which naturally guides the system in a region with a high density of solutions. In order to determine , one needs to study a slightly different system then the one defined by , in which the variables ’s are coupled to some external fields , with and is the coupling strength. Thus the directions of the external fields ’s are considered as external control variables. The parameter sets the magnitude of the external fields, and in the limit of large it effectively fixes the Hamming distance of the solutions from . The local free entropy is then obtained as the zero-temperature limit of the free energy of the system described by , and can be computed by Belief Propagation (see Sec. I.1.4 and Appendix A).
We then study a system defined by the following free energy:
where has formally the role of an inverse temperature and has formally the role of an energy. Throughout the paper the term temperature will always refer to except where otherwise stated. In the limit of large , this system is dominated by the ground states for which the local free entropy is maximum; if the number of such ground states is not exponentially large in , the local entropy can then be recovered by computing the Legendre transform
where is the typical value of the overlap . This quantity thus allows us to compute
i.e. to count the number of solutions at normalized Hamming distance from the ground states (note that the soft constraint of eq. 4 is equivalent to a hard constraint in the limit of large , but that the opposite is not true in general — see Sec. B.1 for more details on this point).
Informally speaking, if the distance is small enough (i.e. at large ), then is going to be roughly at the center of a dense cluster of solutions, if such cluster exists, which means that it is likely going to be a solution itself (indeed, that is surely the case when ). Furthermore, because of the reweighting term, these dense solution regions will typically have different statistical properties with respect to the set of thermodynamically locally stable states. We therefore refer to these solutions as sub-dominant clusters, since they are not normally part of the equilibrium description. However, we have reasons to believe that these kind of sub-dominant clusters play a crucial role in the algorithmic properties of practical learning problems in CSPs: the analysis of the local free entropy in the case of the binary perceptron learning problem (see below) has shown that indeed such sub-dominant ultra-dense regions exist at least up to a critical value of the parameter , that these kind of solutions exhibit different properties with respect to typical equilibrium solutions, and that heuristic solvers typically find a solution in such regions baldassi-subdominant .
i.1.2 Two prototypical CSPs
In this paper we consider two prototypical CSPs, which are both computationally hard but have very different characteristics and structure. The first one — the main focus of this paper — is the binary Perceptron Learning Problem, a fully connected problem that originally motivated our investigation. The second is an example of a general diluted problem, the Random -SAT, which has a long tradition in the Statistical Mechanics of Optimization Problems.
Let us consider the problem of classifyinginput patterns , that is associating to each of them a prescribed output . The perceptron is defined as a simple linear threshold unit that implements the mapping , with representing the vector of synaptic weights. In what follows we will consider the classification problem, which consists in finding a vector that correctly classifies all inputs, i.e. , given a set of random i.i.d. unbiased . We will focus on the case of spin-like weights, i.e. (the generalization to more states does not pose significant additional difficulties). The corresponding energy function is the sum of wrongly classified patterns, namely:
where is the Heaviside step function. This problem has been extensively analyzed in the limit of large by means of Replica gardner-derrida ; krauth-mezard and Cavity mezard-gardner-cavity methods, finding that in the typical case there is an exponential (in ) number of solutions up to a critical capacity , above which no solution typically exists. Despite the exponential number of solutions, the energetic landscape is riddled with local minima, and energetic local search algorithms are typically ineffective at finding them for any obuchi2009weight ; huang2013entropy ; huang2014origin . Moreover typical solutions are known to be isolated huang2014origin .
The satisfiability problem, in its ‘random -SAT’ instantiation, consists in finding an assignment for truth values that satisfies random logical clauses, each one involving exactly different variables. Let us then consider Boolean variables , with the common identification . A given clause is the logical OR of its variables, whose indices are , and which can appear negated. Let us work in a spin representation and introduce the couplings , where if the variable appears negated in clause , and otherwise. The graph structure is random, in that each clause involves
variables extracted uniformly at random, and the couplings are also unbiased i.i.d. random binary variables. With these definitions, the solutions to a-SAT problem are the zero energy configurations of the following Hamiltonian:
which counts the number of violated clauses.
Random -SAT has been central in the development of the statistical mechanical approach to CSPs. For more details, we refer the reader to comprehensive reviews martin2001statistical ; mezard_information_2009 . Here we just point out that, when varying the number of constraints per variable
, the problem undergoes a sequence of phase transitions, related to the fragmentation of the phase space in a huge number of disconnected clusters of solutions. This rich phenomenology, observed well below the UNSAT threshold (above which no solution typically exists at large), can be analyzed by the cavity method in the framework of -step Replica Symmetry Breaking (-RSB), and is reflected in the exponential slowing down of greedy algorithms as well as sampling strategies.
i.1.3 1-Step Replica-Symmetry-Broken solution in the binary perceptron
The existence of dense sub-dominant regions of solutions with radically different properties than those described by the usual equilibrium analysis was first discovered for the perceptron learning problem in baldassi-subdominant , as a result of extensive numerical simulations and of the study of a free energy function very similar to eq. (5).
The free energy used in baldassi-subdominant differs from the one of eq. (5) because in the latter we do not impose any constraint on the reference configurations . In baldassi-subdominant , we analyzed the free energy using the Replica Method at the level of a Replica Symmetric (RS) solution, and found that there was a maximum value of for which the external entropy was non-negative. The external entropy is the logarithm of the number of configurations divided by , and should not take finite negative values: when it does, it means that there is a problem in the RS assumption. Therefore, in that analysis, we used the value that led to a zero complexity.
We repeated the RS computation for the unconstrained version, eq. (5), and found that as increases the results become patently unphysical. For example, the RS solution at would predict a positive local entropy even beyond the critical value . Therefore, the RS assumption needs to be abandoned and at least a -step of Replica Symmetry Breaking (-RSB) must be studied. Specifically, we assumed that RSB occurred at the level of the variables, while we kept the RS Ansatz for the variables, and computed the result in the limit . This appears as a geometrically consistent assumption, in that the clusters we are analyzing are dense and we do not expect any internal fragmentation of their geometrical structure. All the details of the calculation are in the Appendix B. The result is a system of coupled equations, with and as control parameters (using as control parameter instead of its conjugate which was used in eq. 5 is advantageous for the theoretical analysis at high , see below).
Solving these equations still yields a negative external entropy for all values of and , and studying the finite case is numerically much more challenging in this case, to the point of being presently unfeasible. However, the magnitude of the external entropy is greatly reduced with respect to the analogous RS solution at ; furthermore, its value tends to zero when , and all the other unphysical results of the RS solution were fixed at this step of RSB. Additionally, the qualitative behavior of the solution is the same as for the constrained RS version of baldassi-subdominant . Finally, simulation results, where available, are in remarkable agreement with the predictions of this computation (see below, Sec. II.1.1). We therefore speculate that this solution is a reasonable approximation to the correct solution at .
In particular, these qualitative facts hold (see Fig. 1):
For all below the critical value , the local entropy in the region of tends to the curve corresponding to , implying that for small enough distances the region around the ground states is extremely dense (almost all points are solutions).
There is a transition at after which the local entropy curves are no longer monotonic; in fact, we observe the appearance of a gap in where the system of equations has no solution. We speculatively interpret this fact as signaling a transition between two regimes: one for low in which the ultra-dense regions are immersed in a huge connected structure, and one at high in which the structure of the sub-dominant solutions fragments into separate regions.
Two additional results are worth noting about the properties of the reference configurations (see Appendix B.3 for the complete details):
In the limit , the the local entropy takes exactly the same value as for the constrained case in which the are required to be solutions, and the same is true for the parameters that are common to both cases. The external entropy, however, is different. This is true both in the RS and the -RSB scenario.
For the unconstrained case, we can compute the probability that the reference configuration makes an error on any one of the patterns (see Fig. 2). It turns out that this probability is a decreasing function of (going exponentially to as ) and an increasing function of . For low values of , this probability is extremely low, such that at finite values of the probability that is a solution to the full pattern set is almost .
i.1.4 EdMC: Local entropy as an alternative objective function in optimization
The case of the binary perceptron suggests that it is possible to exploit the results of the sub-dominant analysis and devise a simple and fairly general scheme for solving CSPs in a very efficient and controlled way.
For the binary perceptron, Fig. 1B shows that up to we can use as a control parameter to determine , while Fig. 2 shows that a solution to the learning problem can be found by maximizing the local free entropy (eq. (3)) as a function of at sufficiently large .
The reason to follow this strategy, as opposed to directly trying to minimize the energy (eq. (1)), is that it turns out that the landscape of the two objective functions is radically different: while the energy landscape can be riddled with local minima that trap local search algorithms, the local entropy landscape is much smoother. Therefore, a simple Monte Carlo algorithm on the local free entropy, or “Entropy-driven Monte Carlo” (EdMC) for short, is able to effectively find solutions that are very hard to find for energy-based simulated annealing.
Furthermore, the behavior of this algorithm can — at least in principle — be described in the typical case with the tools of Statistical Mechanics, to the contrary of what is currently possible for the other efficient solvers, which all, to some extent, resort to heuristics (e.g. decimation, soft decimation a.k.a. reinforcement, etc.).
Indeed, the main practical difficulty in implementing the EdMC algorithm sketched above is estimating the local free entropy . We use the Belief Propagation (BP) algorithm for this, a cavity method algorithm that computes the answer in the context of the Bethe Approximation. The BP algorithm is briefly explained in the Appendix A, where we also reported the specific BP equations used for the particular CSPs that we studied in this paper.
More specifically: is initialized at random; at each step is computed by the BP algorithm; random local updates (spin flips) of are accepted or rejected using a standard Metropolis rule at fixed temperature . In practice, we found that in many regimes it suffices to use the simple greedy strategy of a zero temperature Monte Carlo (). From a practical standpoint, it seems more important instead to start from a relatively low and increase it gradually, as one would do in a classical annealing procedure. We call such procedure ‘scoping’, as it progressively narrows the focus of the local entropy computation to smaller and smaller regions. The reason to adopt such a strategy is easily understood by looking at the theoretical error probability curves in Fig. 2.
Ii EdMC results
In what follows, we will describe EdMC in more detail for the two prototypical examples introduced in Sec. I.1.2.
ii.1.1 Comparison with theoretical results
We tested the theoretical results of Sec. I.1.3 by running EdMC on many samples at and , at various values of (we used , varying in steps of ). In this case, since we sought the optimum value of the free local entropy at each , we did not stop the algorithm when a solution was found. We ran the algorithm both directly at and using a cooling procedure, in which was initialized at and increased by a factor of for every accepted moves. The search was stopped after consecutive rejected moves. For each sample and each polarization level, we recorded the value of the overlap , of the local entropy (see eq. 6 above and eqs. (20) and (21) in the Appendix A) and of the error probability per pattern. Then, we binned the results over the values of , using bins of width , and averaged the results of the local entropy and the error rate in each bin. Fig. 3 shows that both values are in reasonable agreement with the theoretical curve: the qualitative behavior is the same (in particular: the error rate goes to zero at and the entropy is positive until , confirming the existence of dense clusters) and the more accurate version is closer to the theoretical values. The remaining discrepancy could be ascribed to several factors: 1) finite size effects, since is rather small 2) inaccuracy of the Monte Carlo sampling, which would be fixed by lowering the cooling rate 3) inaccuracy of the theoretical curve due to RSB effects, since we know that our solution is only an approximation.
Note that, with these settings, the average number of errors per pattern set is almost always less than for all points plotted in Fig. 3A. Also note that, for all values of , the mode and the median of the error distribution is at , and that the average is computed from the tails of the distribution (which explains the noise in the graphs). Finally note that, for all samples, points corresponding to errors were found during the Monte Carlo procedure.
ii.1.2 Comparison with standard Simulated Annealing
We tested our method with various problem sizes and different values of , and compared its performance with a standard MCMC. The most remarkable feature of EdMC is its ability to retrieve a solution at zero temperature in a relative small number of steps. We found that zero temperature MCMC (Glauber dynamics) immediately gets trapped in local minima at zero temperature, even at small . In order to find a solution with MCMC we used a simulated annealing (SA) approach, with initial inverse temperature , and we increased by a factor for every accepted moves. The factor is a cooling rate parameter that we optimized for each problem instance (see below). Fig. 4 shows a comparison between a typical trajectory of SA versus EdMC on the very same instance (EdMC is run at with ): at first glance, it exemplifies the typical difference in the average number of steps required to reach a solution between SA and EdMC, which is of or orders of magnitude for small . Also note the smoothness of the EdMC trajectory in contrast to SA: the local entropy landscape is far smoother and ensures a rapid convergence to the region with highest density of solutions.
We studied the scaling properties of EdMC in contrast to SA, at and , varying between and and measuring the number of iterations needed to reach a solution to the learning problem.
For the SA tests, we used the following procedure: for each instance of the problem, we tried to find a solution at some value of the cooling rate ; after consecutive rejected moves, we started over with a reduced , and repeated this until a solution was eventually found. The values of that we used were , , , , , , , , . This allowed us to measure the least number of iterations required by SA to solve the problem (we only report the number of iterations for the last attempted value of ). At , all tested instances were solved, up to . At , however, this procedure failed to achieve 100% success rate even for in reasonable times; at , the success rate was 0% even with ; at , using did not seem to yield better results than . Therefore, no data is available for SA at .
For the EdMC tests, we used the following procedure: we started the algorithm at with , and ran the Monte Carlo procedure directly at ; if a solution was not found, we increased by a step of and continued from the last accepted configuration, up to if needed. Since we worked at , we only needed to test each spin flip at most once before abandoning the search and increasing . This procedure allowed us to find a solution in all cases.
The results are shown in Fig. 5, in log-log scale. For SA (panel A in the figure), the behavior is clearly exponential at , and missing altogether for . For EdMC (panel B in the figure), the data is well fit by polynomial curves, giving a scaling for and for . Also note the difference of several orders of magnitude in the ranges of the axes in the two panels.
From the theoretical analysis, and the results shown in Figs. 1 and 3, it could be expected that EdMC should be able to find a solution at least until when the entropy curves lose their monotonicity, and therefore be on par with reinforced Belief Propagation braunstein-zecchina and reinforced Max-Sum baldassi2015max in terms of algorithmic capacity (though being much slower in terms of absolute solving times), if following a cooling procedure on . Assessing this in detail however would require even more extensive testing and goes beyond the scope of the present work.
Finally, we also tested SA using a simple variant of the energy function, in which a rectified linear function is used instead of the step function in eq. (8), and verified that, while the performance indeed improves, the qualitative picture remains unchanged.
ii.2 Extensions: the -SAT case and heuristic improvements
The results we have displayed for the Binary Perceptron rely on a general scheme that in principle can be applied to other CSPs (or optimization problems). The main bottleneck in implementing such extensions resides in the possibility of computing the local entropy efficiently, e.g. by BP or some other sampling technique. As proof of concept we apply EdMC to the very well studied case of random -SAT monasson1999determining ; mezard2002random ; mezard2002analytic focusing on the non trivial case , at various and . Random -SAT is characterized by three different regimesmontanari2008clusters ; krzakala-csp : For the phase is RS and the solution space is dominated by a connected cluster of solutions with vanishing correlations among far apart variables. For the dominant part of the solution space brakes into an exponential number of clusters that have an extensive internal entropy. Long range correlations do not vanish. For the solution space is dominated by a sub-exponential number of clusters. Eventually for the problem becomes unsatisfiable. The hard region for random 4-SAT is , i.e. where long range correlations do not vanish. In such region SA are expected to get stuck in glassy states and most of the heuristics are known to fail.
In the RS regime, EdMC succeeds in finding a solution in a small number of steps, confirming the smooth nature of the objective function. Typical trajectories for different values of are depicted in Figure 6A. Figure 6B shows the scaling behavior with , which is polynomial () as in the case of the Perceptron.
In the hard phase the method suffers from the lack of convergence of BP. Even if BP (technically the -RSB solution with ) would converge up to the condensation point , the addition of the external fields prevent BP from converging even below such point. These are expected results that could be solved by resorting to a -RSB cavity algorithm to compute the local entropy. While certainly interesting, this is beyond the scope of this paper. For the sake of simplicity we decided to adopt simple heuristic solutions just to show that the overall method is effective.
In cases in which BP does not converge, we take as a proxy for its average over a sufficient number of BP iterations. While this trick typically leads to a solution of the problem, it has the drawback of making the overall Monte Carlo procedure slow. To overcome this difficulty, we simply improve the choice of the flipping step by using the information contained in the cavity marginals in order to effectively guide the system into a state of high local density of solutions. The heuristic turns out to be much faster and capable of solving hard instances up to values of close to the SAT/UNSAT threshold (see Fig. 7). The same heuristic has also been tested in the Perceptron learning problem with excellent results at high values of .
The main step of the heuristic method consists in performing an extensive number of flipping steps at each iteration in the direction of maximum free energy, choosing the variables to flip from the set of ’s whose cavity marginals in absence of the external field are not in agreement with the direction of the field itself (see Appendix A for a precise definition of the marginals ). Let us call the set of such variables, in a ranked order with respect to the difference between the external and the cavity fields, such that the ones with the largest difference come first. In the spirit of MCMCs, we propose a collective flip of all the variables and compute the new value of . The collective flip is always accepted if there is an increase in free energy, otherwise it is accepted with probability , where is the free energy difference. When a collective flip is accepted, a new set is computed. If, on the contrary, the flips are rejected, a new collective flip is proposed, simply eliminating the last variable in the ranked set , and the procedure is repeated until the set is empty. In the general case this procedure quickly leads to a solution with zero energy. If all the collective flips in are rejected, the procedure is terminated, and a standard EdMC is started from the last accepted values of .
As it turns out, most of these collective moves are accepted immediately. The interpretation is that these moves try to maximize, at each step, the local contributions ’s associated to each variable in the Bethe free energy, in presence of an external field .
As for standard EdMC, we can additionally employ an annealing strategy, by which is progressively increased during the iteration, and a ‘scoping’ strategy, i.e. a gradual increase of the external field as well. The resulting algorithm is detailed in Algorithm 1.
Of the two strategies, annealing and scoping, the latter seems to be far more important in practice, and in many cases crucial to produce a quick solution: as is increased, smaller regions are progressively observed, and this focus can eventually lead the search into a given compact cluster of solutions, a region sufficiently dense so that Replica Symmetry locally holds. Indeed, in the typical observed trajectory of this algorithm in the presence of the scoping dynamics, even when BP suffers from convergence issues in the first steps, convergence is restored when the external fields vector gets close to a region of high solution density. Both strategies, scoping and annealing, have been used to maximize the probability of success of the algorithm in the hard phase of the -SAT problem in the simulations showed in Fig. 7.
We recently demonstrated the relevance of a probability measure based on the local entropy in the theoretical analysis of sub-dominant clusters in Constraint Satisfaction Problems baldassi-subdominant . In the present work, we extended on the previous analysis and introduced EdMC, a novel method for efficiently sampling over solutions in single instances of CSPs.
At variance with standard Monte Carlo methods, we never make use directly of the energy information: minimization of energy naturally emerges from the maximization of local entropy. What we propose here is thus a radically different perspective for optimization problems, based on the possibility of estimating the local entropy, a quantity that can effectively guide an MCMC straight into a region with a high density of solutions, thus providing a solver. The effectiveness of our Entropy-driven Monte Carlo may be understood in terms of a high level of smoothing in the local entropy landscape. This is evident if we compare a zero temperature EdMC to an energy guided SA with low cooling rate: the former is orders of magnitude faster in terms of attempted moves, and does not suffer from trapping in local minima. The procedure can even be made largely more efficient by a very simple heuristic.
In order to estimate the local entropy, we rely on BP, which itself can often be heuristically modified to become a solver (e.g. by introducing decimation or reinforcement schemes). However, those heuristics are not under control, while the scheme we propose here has a clear theoretical interpretation. BP is constructed on a cavity approximation scheme that strongly depends on the local factorization properties of the CSPs as well as on the global phase space structure, which is in general dependent on the number of constraints. The validity of cavity approximation has to be assessed for each problem at hand. It is very intriguing to think of simpler estimations for local entropy that could not rely on the convergence of BP equations. On the one hand, we have shown that even when convergence is not guaranteed at all, a simple averaging of messages could give useful information and lead to a solution in a small number of steps, implying that even an imprecise estimate of the local entropy is sufficient in these cases. Besides, the kinetic accessibility of sub-domintant clusters by simple learning protocols suggests that one could consider the general problem of engineering simple dynamical processes governed by the measure that we introduced in Eqs. (3-5). This subject is currently under investigation.
Acknowledgements.CB, CL and RZ acknowledge the European Research Council for grant n° 267915.
Appendix A Belief Propagation
In this section, we briefly introduce the Belief Propagation (BP) algorithm in general, define the quantities used by EdMC, and explicitly write the relevant expressions for the cases of the Binary Perceptron and the -SAT problems. We will make use of the notation introduced in Sec. I.1.
a.1 General BP scheme
One common method for representing an instance of a CSP is to draw a bipartite Factor Graph, where each factor node stands for a constraint , each variable node stands for a variable , and each edge connects a factor node with a variable node if , or equivalently . This kind of graphical model representation is very helpful for understanding the basic dynamics of message passing methods such as BP.
Belief Propagation can be considered as a tool for an efficient marginalization of a complicated probability distribution that can be written as a product of local interaction terms. This is obviously the case of the Gibbs distribution associated to the Hamiltonian (1), which reads:
In full generality, BP equations are a set of coupled nonlinear equations for the cavity messages , which can be viewed as messages associated to each link in the Factor Graph. For a Hamiltonian of the form (1), the BP equations are the following:
A common way of solving these equations is by iteration, until convergence to a fixed point is established. Fixed point messages may then be used to compute local joint marginals and other interesting macroscopic quantities such as the average energy and the free energy of the system. As explained in Sec. II.2, one can also extract useful information from cavity messages themselves, and rely on them to construct a very efficient optimization procedure for the free energy. Of course, approximations for the true marginals and free energy can be obtained by means of instantaneous values at a given computing time , or by averaging them over the course of the BP iterations: we exploit this fact when dealing with regimes where the convergence of BP is hard if not impossible (see Sec. II.2).
Non-cavity marginals over variables can be computed at the fixed point as:
In the heuristic version of EdMC presented in Sec. II.2 (Algorithm 1), we consider the case of binary spins and we use the cavity magnetization fields in absence of the external fields. These are defined from expression (12) as:
The local free entropy of eq. (3) can be computed from the fixed point messages in the zero-temperature limit in terms of purely local contributions from variables, edges and factor nodes:
a.2 BP for the binary perceptron
The BP equations for a given instance are most easily written in terms of cavity magnetizations and . Note that it is always possibile to set without loss of generality, by means of the simple gauge transformation . With these simplification, equations (10,11) become:
is the convolution of the all cavity messages impinging on the pattern node , except for : it is thus the (cavity) distribution of the total synaptic input for pattern
, in absence of the synapse. The complexity of the second update is at most with an appropriate pre-computation of cavity convolutions. When one deals with the case of and an extensive number of patterns, a common and simple strategy is to adopt a Gaussian approximation for the distribution , where
denotes the normal distribution. It suffices then to compute the meanand variance of the distribution , whose dependence on cavity messages
is easily determined from the central limit theorem:
(analogous non-cavity quantities and are computed by summing over all indices ). By doing so, equation (23) becomes:
and we used the function .
The free-entropy can be easily obtained by eq. (16) putting . In our Gaussian approximation, the expression can be written as:
The total number of solutions for a given instance can be determined by means of the entropy:
Consistently with the theoretical predictions, BP equations always converge for . Indeed, Replica Symmetry holds with the identification of states as the dominant isolated configurations, this being evident in the proportionality relation between RS and RSB free-energy krauth-mezard , with an intra-state overlap (RSB parameter) equal to . The entropy decreases monotonically with and, provided is high enough, it vanishes at the critical threshold krauth-mezard .
Very interestingly, if one slightly modifies the original equations with the introduction of a ‘reinforcement term’, much similar to a sort of smooth decimation procedure, BP becomes a very efficient solver that is able to find a solution with probability up to braunstein-zecchina . Reinforced BP equations can be further simplified, this leading to a dramatic reduction in update complexity. The resulting on-line algorithms, SBPI baldassi-et-all-pnas and CP+R baldassi-2009 are able to achieve a slightly lower capacity (). As we pointed out in the Introduction, Sec. I, the performances of all the cited algorithms seem to be directly affected by large sub-dominant clusters: these high entropy states happen to be easily accessible, while the dominant isolated solutions are very hard (if not impossible) to find efficiently by simple learning methods.
and similarly for the total magnetization: . The local free entropy is also simply given by the expression using eqs. (30) and (20) with the modified magnetizations. The cavity magnetization fields in absence of the external fields, eq. (15), are:
a.3 BP for -Sat
The Belief Propagation equations for the -SAT problem are most easily written with a parametrization of the BP messages in terms of the quantities , where is the probability that does not satisfy clause in absence of this clause, and is the probability that all the variables in clause except violate the clause. With this choice, and calling the set of variables in the constraint , equation (11) simply becomes:
Equation (10) for the variable node update is slightly more involved:
where (resp. ) is the set of clauses in which variable is involved with a coupling (resp. ).
As for the perceptron, the free-entropy is obtained from expression (16) at :
where (resp. ) is the set of all clauses in which variable is involved with (resp. ). Analogously, the entropy is obtained from the following expression, which only depends upon the messages :
In the chosen parametrization, the external fields may be easily introduced by means of additional single-variable ‘soft clauses’ , which send fixed cavity messages to their respective variables, taking the value:
with the restriction that (resp. ) if (resp. ).
If we call , the new set of clauses, enlarged so as to contain the , the total magnetization is given by:
The cavity magnetization fields in absence of the external fields, eq. (15), are simply given by:
Convergence properties of equations (33,34) on single instances are deeply related to the Replica Symmetry Breaking scenario briefly discussed in the preceding section. The onset of long range correlations in clustered RSB phase prevents BP from converging.
While RSB limits the usefulness of BP (as well as reinforced BP) at high , ideas from the -RSB cavity method, when applied to the single case without the averaging process, have led to the introduction of a powerful heuristic algorithm, Survey Propagation (SP) mezard2002analytic ; braunstein2005survey , which is able to solve -SAT instances in the hard phase, almost up to the UNSAT threshold.
Appendix B Details of the large deviations analysis for the binary Perceptron Learning problem
In this section, we provide all technical details of the analysis of the reweighted free energy function of eq. (5) of Sec. I.1 for the case of the Perceptron Learning problem with binary synapses of Sec. I.1.2. The results of this analysis are presented in Sec. II.1.1.
However, the notation in this section will differ at times from the one in the main text, to make it more similar to the one used in baldassi-subdominant , and so that this section is mostly self-contained.
b.1 Setting the problem and the notation
We generate patterns by drawing the inputs as random i.i.d. variables with distribution . Without loss of generality, we assume that the desired output of all patterns is .
We consider the learning problem of correctly classifying a set of patterns (where ). We define, for any vector of synaptic weights with the quantity
(where we used to represent the Heaviside step function: if , otherwise) such that the solutions of the learning problem are described by . The factor has been added to account for the scaling of the effective fields.
Note that throughout this section we will use the index for the synapses and for the patterns. In all sums and products, the summation ranges are assumed implicitly, e.g. . Also, all integrals are assumed to be taken over unless otherwise specified. The letters , , and will be reserved for replica indices (see below). Finally, when we will make the -RSB Ansatz, we will also use , , and for the replica indices; it should be clear from the context that these do not refer to the capacity or the inverse temperature in those cases.
We write the number of solutions as:
We want to study the solution space of the problem in the following setting: we consider a set of reference configurations, each of which has an associated set of solutions that are constrained to have a certain overlap with it.
In the following, we indicate with the reference configurations, and with the other solutions. In general we use a tilde for all the quantities that refer to the reference configurations.
Let us define then:
(where is the Dirac-delta distribution111we should have used a Kronecker delta symbol here, but this abuse of notation comes in handy since in the following we will use integrals instead of sums for the weights), i.e. the number of solutions that have overlap with (or equivalently, distance from) a reference configuration . We then introduce the following quenched free energy:
where is the partition function and has the role of an inverse temperature. This is the free energy density of a system where the configuration is described by and the energy is given by minus the entropy of the other solutions with overlap from it.
This expression is almost equivalent to eq. 5 in the main text, except that we used the overlap as a control parameter instead of the coupling , by using a hard constraint (the delta distribution in eq. 42) rather than a soft constraint (the exponential in eq. 4). The two parameters are conjugates. The main advantage of using is that it is easier to implement using the BP algorithm, which is why we used it for the EdMC algorithm. The advantage of using is that it provides a more general description: while in large portions of the phase space the relationship between and is bijective (and thus the two systems are equivalent), some regions of the phase space at large can only be fully explored with by constraining , and thus we have used this system for the theoretical analysis.
The main goal is that of studying the ground states of the system, i.e. taking the limit of . This limit allows us to seek the reference configuration for which the number of solutions at overlap with it is maximal, and to derive an expression for the entropy of the surrounding solutions, which we call local entropy. As we shall see, we find that is always positive for when , indicating the presence of dense clusters of solutions.
In the remainder, we will generally use the customary notation or (in stead of or ) to denote the integral over possible values of the weights; since we assume binary weights, we will have:
b.2 Entropy and Complexity
b.2.1 Replica trick
In order to compute the quantity of eq. (43), we use the replica trick. We will use to d