It is generally believed that learning and memory in neural systems take place via plastic changes to the connections between neurons (synapses) in response to external stimuli, either by creating or destroying connections or by modifying their efficacy (also called the synaptic weight) hebb1949organization . Although the details of how these processes occur in neural tissue are largely unknown, owing to technical difficulties in experiments, this idea has inspired advances in machine learning which in recent years have proven highly successful in many complex tasks, such as image and speech recognition, natural language processing and many more, reaching performance comparable to that of humans collobert2008unified ; krizhevsky2012imagenet .
The currently dominant paradigm in the machine learning field consists of employing very large feed-forward multi-layer networks, trained on a large number of labeled examples through variants of the stochastic gradient descent algorithm. The learning task is thus framed as an optimization problem: the network implements a complex non-linear function of the inputs, parametrized by the synaptic weights, and the cost function to be minimized by tuning the parameters is the sum, over the whole training set, of the distance between the actual output and the desired output.
Despite the practical success of these techniques, it is rather unlikely that actual brains employ the same gradient-based approach: real synapses are generally very noisy, and the precision with which they can store information, although very difficult to assess conclusively, is estimated to be of the order of a few bits per synapse o2005graded ; bartol2015 . In other words, real synaptic efficacies might be better described by discrete quantities rather than continuous ones. Conversely, the gradient descent algorithm is well suited to continuous optimization problems, and in fact machine learning techniques generally employ synapses with at least 32 bits of precision. Moreover, theoretical arguments show that even in the simplest possible architecture, the one-layer network also known as the perceptron, the properties of the training problem change drastically when using discrete rather than continuous synapses: optimizing the weights is a convex (and therefore easy to solve) problem in the latter case, while it is in general algorithmically hard in the former (indeed NP-complete amaldi1991complexity ).
The additional computational difficulties associated with discrete synapses are not, however, insurmountable: while most traditional approaches (e.g. simulated annealing) fail, scaling exponentially with the problem size (see e.g. horner1992dynamics ; baldassi_local_2016 ), a number of heuristic algorithms are known to achieve very good performance even in the extreme case of binary synapses braunstein-zecchina ; baldassi-et-all-pnas ; baldassi-2009 ; baldassi2015max ; some of them are even sufficiently simple and robust to be conceivably implementable in real neurons baldassi-et-all-pnas ; baldassi-2009 . Why this is at all possible has been an open problem for some time, because the theoretical analyses of networks with binary synapses consistently described a situation in which even typical instances should be hard, not only the worst-case ones huang2014origin . In Ref. baldassi_subdominant_2015 we showed that those analyses were incomplete: the training problem has a huge number of effectively inaccessible solutions which dominate the statistical measure, but also dense subdominant regions of solutions which are easily accessed by the aforementioned heuristic algorithms.
More precisely, the standard statistical analyses of the typical properties of the training problem are concerned with all possible solutions, and therefore use a flat measure over all the configurations of synaptic weights that satisfy the training set. In the binary-synapse case, the resulting picture is one in which, with overwhelming probability, a solution is isolated from the other solutions, and embedded in a landscape riddled with local minima that trap local search algorithms such as hill climbing or simulated annealing. In contrast, in our large deviation analysis of Ref. baldassi_subdominant_2015 , we showed that by reweighting the solutions according to the number of other solutions in their neighborhood we could uncover the existence of extensive regions with a very high density of solutions, and that those regions are the ones found by the efficient heuristic algorithms.
In this work, we extend those results from the binary case to the general discrete case. Our main goal is to show that the above picture remains relevant in a more biologically plausible scenario in which the synapses can assume more than two states, the sign of the synaptic weights cannot change (a constraint known as Dale's principle), and the inputs and outputs are generally sparse, as is known to be the case in real neural networks. To this end, we first extend the standard analysis (the so-called equilibrium analysis, i.e. using a flat measure) and then repeat the large deviation analysis (in which solutions are reweighted to enhance those of high local density). While this generalization poses some additional technical difficulties, we find that, in both cases, the qualitative picture is essentially unaltered: when the synapses are constrained to assume a limited number of discrete states, most solutions to the training problem are isolated, but there exist dense subdominant regions of solutions that are easily accessible. We also show that the capacity of the network saturates rather quickly with the number of internal synaptic states, and that the aforementioned accessible regions exist up to a value of the training set size very close to the network capacity, suggesting that it may be convenient in a practical implementation (be it biological or not) to reduce the number of states and instead exploit the geometrical properties of these dense clusters.
The paper is organized as follows: in Section II we introduce the discrete perceptron model; in Section III we compute its capacity as a function of the number of states and analyze the geometrical properties of typical solutions; in Section IV we describe our large deviations formalism and present the main results of this paper; in Section V we apply a proof-of-concept Monte Carlo algorithm driven by the local entropy baldassi_local_2016 to our perceptron model; in the Conclusions section we discuss the scenario emerging from our analysis; in the Appendices we provide details on the calculations.
We consider a single-layer neural network with N inputs and one output. The network is parametrized by a vector of N synaptic weights W = (W_1, ..., W_N), where each weight can only assume values from a finite discrete set of L elements. For simplicity, throughout this paper we assume that W_i ∈ {0, 1, ..., L − 1}, but our derivation is general, and holds for arbitrary sets. The network output is given by

σ_out(ξ) = Θ( Σ_{i=1}^{N} W_i ξ_i − θ )    (1)

where ξ is a vector of inputs of length N, θ is a neuronal firing threshold, and Θ is the Heaviside step function, returning 1 if its argument is positive and 0 otherwise.
We consider training sets composed of αN pairs of input/output associations (ξ^μ, σ^μ), μ = 1, ..., αN, where ξ^μ ∈ {0,1}^N and σ^μ ∈ {0,1}. We assume all the inputs and outputs to be drawn at random from biased independent identical distributions, with a probability distribution for each entry x given by P(x) = (1 − f) δ(x) + f δ(x − 1), where δ is the Dirac distribution. The bias parameter f is often also called the coding rate in biological contexts. For simplicity, in the following we will fix the coding rate to be the same for the inputs and the outputs, while in principle we could have chosen two distinct values.
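As a concrete illustration of the model and of the pattern statistics, the following sketch (in Python; the function names and the integer weight set are our own illustrative choices, not part of the model specification) samples a biased training set and evaluates the network output:

```python
import numpy as np

def perceptron_output(W, xi, theta):
    """Heaviside output of the single-layer network:
    1 if the weighted input sum exceeds the threshold theta, else 0."""
    return int(np.dot(W, xi) - theta > 0)

def sample_patterns(P, N, f, rng):
    """Draw P input patterns in {0,1}^N and P desired outputs in {0,1},
    all entries i.i.d. with coding rate f = P(entry = 1)."""
    xi = (rng.random((P, N)) < f).astype(int)
    sigma = (rng.random(P) < f).astype(int)
    return xi, sigma

rng = np.random.default_rng(0)
xi, sigma = sample_patterns(P=100, N=20, f=0.3, rng=rng)
W = rng.integers(0, 3, size=20)          # e.g. L = 3 states, values {0, 1, 2}
out = perceptron_output(W, xi[0], theta=2.5)
```

At realistic sizes one would vectorize the output over all patterns at once, but the scalar form above matches the definitions most directly.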
For any input pattern ξ^μ, we can define the error function

E^μ(W) = ( σ_out(ξ^μ) − σ^μ )²    (2)

such that E^μ(W) = 0 if the network correctly classifies the pattern, and E^μ(W) = 1 otherwise. Therefore, the training problem of finding an assignment of the weights W such that the number of misclassified patterns is minimized reduces to the optimization of the cost function:

E(W) = Σ_{μ=1}^{αN} E^μ(W)    (3)
Finding a configuration for which E(W) = 0 is a constraint satisfaction problem: we denote by

X_{ξ,σ}(W) = Π_{μ=1}^{αN} ( 1 − E^μ(W) )    (4)
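The cost function and the zero-error condition translate directly into code; this is a minimal sketch (helper names are ours) that counts misclassified patterns and tests whether a weight assignment satisfies the whole training set:

```python
import numpy as np

def pattern_error(W, xi_mu, sigma_mu, theta):
    """Per-pattern error: 0 if the network output matches the desired
    output sigma_mu on input xi_mu, 1 otherwise."""
    out = int(np.dot(W, xi_mu) - theta > 0)
    return (out - sigma_mu) ** 2

def cost(W, xi, sigma, theta):
    """Total cost: number of misclassified training patterns."""
    return sum(pattern_error(W, x, s, theta) for x, s in zip(xi, sigma))

def is_solution(W, xi, sigma, theta):
    """Indicator of the constraint satisfaction problem:
    True iff W classifies every pattern correctly."""
    return cost(W, xi, sigma, theta) == 0
```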
the corresponding indicator function. It is generally the case in these models that, in the limit of large N, there is a sharp transition at a certain value of α, called the critical capacity α_c, such that the probability (over the distribution of the patterns) that the problem can be satisfied (E(W) = 0) tends to 1 if α < α_c and tends to 0 if α > α_c (this has not been proved rigorously, but it is the result of non-rigorous replica theory analyses and of numerical simulations supporting the statement; for rigorous works providing bounds on the transition, see kim_covering_1998 ; talagrand_intersecting_1999 ). Some well-known values of α_c in similar models are α_c = 2 in the case of unbounded continuous weights and unbiased inputs in {−1, +1} gardner-derrida ; α_c = 1 in the same situation but with positive continuous weights and inputs in {0, 1} (this also corresponds to the limiting case L → ∞ of the model of eq. (1)) Brunel2004 ; α_c ≈ 0.83 in the case of both inputs and weights taking values in {−1, +1} with unbiased inputs krauth-mezard ; and α_c ≈ 0.59 for the model of eq. (1) in the binary case L = 2 with unbiased inputs gutfreund1990capacity . In all these cases, the neuronal threshold θ needs to be chosen optimally in order to maximize the capacity; for example, in the latter case of L = 2 and f = 1/2 the capacity is only attained at a specific non-zero value of θ. In the next section we will show the values of α_c for general L and f.
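At very small N the satisfiability transition can be probed directly by exhaustive enumeration of all weight assignments; the following sketch (our own construction; at such small sizes the transition is of course heavily smoothed by finite-size effects) estimates the probability that a random training set of αN patterns is learnable:

```python
import itertools
import numpy as np

def satisfiable(xi, sigma, values, theta):
    """Exhaustively test whether any weight vector with entries in
    `values` classifies all patterns correctly (feasible only for tiny N)."""
    N = xi.shape[1]
    for W in itertools.product(values, repeat=N):
        out = (xi @ np.array(W) - theta > 0).astype(int)
        if np.array_equal(out, sigma):
            return True
    return False

def p_sat(alpha, N, values, theta, f, trials, seed=0):
    """Fraction of random samples with P = alpha*N patterns that admit
    a zero-error weight assignment."""
    rng = np.random.default_rng(seed)
    P = max(1, int(round(alpha * N)))
    hits = 0
    for _ in range(trials):
        xi = (rng.random((P, N)) < f).astype(int)
        sigma = (rng.random(P) < f).astype(int)
        hits += satisfiable(xi, sigma, values, theta)
    return hits / trials
```

As N grows, the curve p_sat(α) is expected to sharpen into a step at α_c.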
The choice of using uncorrelated inputs and outputs is arguably not very realistic, both from the point of view of biological modeling and for machine learning applications. This simple scenario is also known as a classification task; it is possible to consider instead the case where the outputs are modeled as being produced by some underlying rule which the device has to discover from the training set, the so-called generalization task. The latter is certainly the more relevant scenario for many applications. Nevertheless, in the binary case of our previous work baldassi_subdominant_2015 we showed that these assumptions – which were made in order to simplify the theoretical analysis – seem to leave the resulting qualitative picture unaltered, and we therefore argue that the situation is likely to be similar for the multi-valued model studied in this paper. We will come back to this issue in the discussion of Section IV.
III Equilibrium analysis
III.1 Critical capacity as a function of the number of synaptic states
As a first step towards extending our large deviation analysis to the model described by eq. (1), we performed a standard equilibrium analysis and verified that the scenario is the same as in other similar models. We could also use this analysis to compute the theoretical critical capacity of the system as a function of the number of states per synapse L and of the coding rate f.
This kind of analysis, often called à la Gardner gardner-derrida , consists in studying the typical thermodynamical properties of a system described by the Boltzmann measure

P(W) = e^{−β E(W)} / Z    (5)

where E(W) is defined in eq. (3) and Z (also known as the partition function) is a normalization constant, in the zero-temperature limit β → ∞. This is therefore a flat measure over the ground states of the system. When perfect learning is possible, i.e. min_W E(W) = 0, we have (see eq. (4)):

P_f(W) = X_{ξ,σ}(W) / Z    (6)

where the subscript "f" stands for "flat". In order to describe the typical behavior of the system we need to compute the average over the patterns of the entropy density:

S_f = lim_{N→∞} (1/N) ⟨ log Z ⟩_{ξ,σ}    (7)

where ⟨·⟩_{ξ,σ} denotes the average over the distribution of the patterns. This computation is accomplished by the so-called replica trick and, although not rigorous, is believed to provide the correct values for some relevant quantities, such as the optimal value of the neuronal threshold θ and the critical capacity, which in this case is derived as the value of α at which S_f = 0.
The details of the computation follow standard steps (they can also be obtained from the computation presented in Appendix B by setting y = 0). Fig. 1A shows the resulting value of α_c as a function of the number of states L, for different values of the coding rate f. As expected, α_c increases with the number of values a synaptic variable can assume and with the sparsity of the coding. Fig. 1B shows the same curve for the dense (unbiased) case f = 1/2 over a wider range of L: it is expected that in this case α_c tends, as L → ∞, to the capacity of the case of continuous positive synapses Brunel2004 , and therefore we also show the results of a tentative power-law fit of the form α_c(L) ≈ α_∞ − c L^{−γ}, which estimates the rate of convergence to the continuous case; the fit recovers the expected limiting value α_∞, together with an estimate of the exponent γ. From the results in Fig. 1A, it can be seen that the qualitative behavior is no different in the sparser cases. Qualitatively similar results were also obtained in a slightly different setting in gutfreund1990capacity .
One interesting general observation about these results is that the gain in capacity with each additional synaptic state decreases fairly rapidly after the first few values of L. This observation by itself is not conclusive, since even when solutions exist they may be hard to find algorithmically (see the next section). As we shall see in Section IV.3, however, accessible solutions exist, in all the cases we tested, up to values of α very close to the capacity. From the point of view of the implementation cost (whether biological or in silico), it seems therefore that increasing the synaptic precision is not a sensible strategy, as it leads to a very small advantage in terms of computational or representational power. This is consistent with the general idea that biological networks would only need to implement synapses with a few bits of precision.
III.2 Typical solutions are isolated
To explore the structure of the solution space of a perceptron learning problem, the general idea is to select a reference solution W̃ sampled from the flat measure of eq. (6), and to count how many other solutions can be found at a given distance from it. This technique is known as the Franz-Parisi potential franz1995recipes .
First, we define the local entropy density for a given reference configuration W̃ at a given distance d as

S(W̃, d) = (1/N) log Σ_W X_{ξ,σ}(W) δ( d(W, W̃), d )    (8)

i.e. as the (normalized) logarithm of the number of solutions at distance d from W̃, having defined d(·, ·) as a normalized distance function:

d(W, W̃) = (1/(4N)) Σ_{i=1}^{N} ( W_i − W̃_i )²    (9)

We introduced the factor 1/4 for consistency with the computation in baldassi_subdominant_2015 : in the case where the weights take values in {−1, +1}, this reduces to the (normalized) Hamming distance.
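For tiny systems, the local entropy can be evaluated by brute force; the sketch below is our own illustration, in which the 1/(4N) distance normalization is an assumption chosen so that for ±1 weights the distance reduces to the Hamming distance, consistently with the remark above:

```python
import itertools
import math
import numpy as np

def distance(W, Wt, N):
    """Normalized distance between two weight vectors;
    the 1/(4N) factor is a convention (see the lead-in)."""
    return np.sum((np.asarray(W) - np.asarray(Wt)) ** 2) / (4.0 * N)

def local_entropy(Wt, xi, sigma, values, theta, d, tol=1e-9):
    """Brute-force local entropy density: (1/N) log of the number of
    solutions at normalized distance d from the reference Wt.
    Only feasible for very small N (enumerates all len(values)**N vectors)."""
    N = len(Wt)
    count = 0
    for W in itertools.product(values, repeat=N):
        if abs(distance(W, Wt, N) - d) > tol:
            continue
        out = (xi @ np.array(W) - theta > 0).astype(int)
        if np.array_equal(out, sigma):
            count += 1
    return -math.inf if count == 0 else math.log(count) / N
```

This exhaustive count is what the replica computation estimates analytically in the large-N limit.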
Sampling W̃ from the flat Boltzmann distribution means looking at the typical structure of the solution space, i.e. computing the typical local entropy density:

S_FP(d) = ⟨ Σ_{W̃} P_f(W̃) S(W̃, d) ⟩_{ξ,σ}    (10)

where the subscript "FP" stands for "Franz-Parisi", W̃ represents the reference equilibrium solution, and ⟨·⟩_{ξ,σ} is, as usual, the average over the disorder (the patterns).
Again, the computation can be performed with the replica method, and is detailed in Appendix A. The results of this typical-case analysis are shown as the black lines in Fig. 2 and are qualitatively the same as those already obtained in models with binary synapses, regardless of the number of synaptic states or the coding rate: namely, for all values of the parameters – and in particular, for all α > 0 – there exists a value d_min such that for d < d_min we obtain S_FP(d) < 0. This (unphysical) result is assumed to signal the onset of replica symmetry breaking effects, and that the actual value of the local entropy is −∞, which means that typical solutions are isolated in the space of configurations, when considering neighborhoods whose diameter is of order N.
Isolated solutions are very hard to find algorithmically. As we have shown in baldassi_subdominant_2015 , the efficient algorithms, i.e. those which experimentally exhibit sub-exponential scaling of the computational cost with N, find non-isolated solutions, which are therefore not typical. In the next section, we will show that these subdominant solutions also exist in the multi-valued model considered in the present work.
IV Large deviations analysis
Following baldassi_subdominant_2015 , we introduce a large deviation measure in order to describe the regions of the configuration space where the solutions to the training set are maximally locally dense. To this end, we modify the flat distribution of eq. (6) by increasing the relative weight of the solutions with a higher local entropy (eq. (8)), as follows:

P_rc(W̃) = X_{ξ,σ}(W̃) e^{y N S(W̃, d)} / Z_rc(d, y)    (11)

where the subscript "rc" stands for "reweighted, constrained". The parameter y has the role of an inverse temperature: by taking the limit y → ∞ this distribution describes the solutions of maximal local density.
Alternatively, we can just use W̃ as a reference configuration, without enforcing the constraint that it is itself a solution, and obtain:

P_ru(W̃) = e^{y N S(W̃, d)} / Z_ru(d, y)    (12)

where the subscript "ru" stands for "reweighted, unconstrained".
We can study the typical behavior of these modified measures as usual within replica theory, by computing their corresponding average free entropy densities:

Φ_rc(d, y) = lim_{N→∞} (1/N) ⟨ log Z_rc(d, y) ⟩_{ξ,σ}    (13)

Φ_ru(d, y) = lim_{N→∞} (1/N) ⟨ log Z_ru(d, y) ⟩_{ξ,σ}    (14)

With these, we can compute the typical values of the local entropy density

S_rc(d, y) = ∂Φ_rc(d, y) / ∂y    (15)

and of the external entropy density

Σ_rc(d, y) = Φ_rc(d, y) − y S_rc(d, y)    (16)

(analogous relations hold for the ru case). The latter quantity measures the logarithm of the number of reference configurations W̃ that realize the given parameters d and y, divided by N. Since the systems are discrete, both these quantities need to be non-negative in order to represent typical instances; otherwise, they can only be interpreted in terms of rare events rivoire_properties_2004 .
There are several reasons for studying both the rc and the ru cases. The rc case is more directly comparable with the typical case: as such, it is the most straightforward way to demonstrate that the large deviations analysis paints a radically different picture of the nature of the solutions than the equilibrium case. Furthermore, when studying the problem at finite y, only the constrained case gives reasonable results when assuming replica symmetry (see below). The ru case, on the other hand, can be exploited in designing search algorithms (Sec. V); moreover, as explained below, the ru case reduces to the rc case in the limit y → ∞. Finally, since both cases are problematic, due to the numerical difficulties in solving the saddle point equations and to the possible presence of further levels of replica symmetry breaking, the accuracy of the results may be questioned. Their comparison, however, shows that the results of the different analyses are in quite good agreement: this observation, complemented by numerical experiments, indicates that the results are reasonably accurate.
IV.1 Reweighted Constrained distribution, RS analysis
For the computation of Φ_rc, eq. (13), we performed the analysis using a replica-symmetric (RS) Ansatz. The details are provided in Appendix B. Since we are interested in the configurations of highest density, we want to take the parameter y as high as possible; in principle, we wish to study the limit y → ∞. However, in this limit the external entropy Σ_rc is negative for all values of the parameters. This signals a problem with the RS Ansatz, and implies that we should instead consider replica-symmetry-broken solutions. In geometrical terms, the interpretation is as follows: the RS solution at y → ∞ implies that the typical overlap between two different reference solutions, as computed from Φ_rc, tends to 1 (see Appendix C), and therefore that there should be a single solution of maximal local entropy density. The fact that the RS assumption is wrong implies that the structure of the configurations of maximal density is more complex and that, at least beyond a certain y, the set of reference configurations breaks into several clusters (see the comments at the end of the following section).
Because of technical issues in solving the equations at the 1RSB level (namely, the resulting system of equations is too large, and some of the equations involve multiple nested integrals that are too expensive to compute in reasonable times for arbitrary y), we used instead the maximum value of y for which the RS results are physically meaningful, as we already did in baldassi_subdominant_2015 . Therefore, for any given d, we computed the value y*(d) such that Σ_rc(d, y*) = 0.
The 1RSB equations simplify in the limit y → ∞, but this still does not solve the problem of the negative external entropy, suggesting that the correct solution requires further levels of replica symmetry breaking. We come back to this point in the next section (IV.2), where we also comment on the results of the analysis, shown in Fig. 2.
IV.2 Reweighted Unconstrained distribution, 1RSB analysis
The unconstrained case, Φ_ru of eq. (14), is considerably simpler. However, replica symmetry breaking effects are also stronger, leading to clearly unphysical results at the RS level even when using the value y*(d) such that Σ_ru(d, y*) = 0 (for example, this solution would predict a positive local entropy for some values of the parameters even beyond the critical capacity α_c, which does not make sense).
Therefore, this case needs to be studied at least at the 1RSB level. The details are provided in Appendix D. Again, the resulting equations are computationally very heavy, and we could not explore the whole range of parameters at finite y. In the limiting case y → ∞ the equations simplify, and the computational complexity becomes comparable to that of the constrained case at finite y in the RS Ansatz. Interestingly, in this limit the thermodynamic quantities are identical in the constrained and unconstrained cases.
This solution does not cure the problem of the negative external entropy, implying that further levels of replica symmetry breaking are required, but the situation improves considerably: the unphysical branches beyond α_c disappear, and the modulus of the external entropy is very small and tends to 0 as d → 0. Furthermore, the results of the 1RSB analysis at y → ∞ and of the RS analysis of the constrained case at y*(d) are qualitatively essentially the same and quantitatively very close, which suggests that these results provide a good approximation to the description of the regions of highest local entropy density in the configuration space. Finally, all the results are completely analogous to the ones obtained in the binary, balanced, unbiased case (the constrained RS analysis was shown in baldassi_subdominant_2015 and the 1RSB analysis in baldassi_local_2016 ), where it was also shown that numerical experiments, where available, match the theoretical predictions very closely.
In Fig. 2 we show the predictions for the local entropy as a function of the distance in one representative case, for fixed L and f and different values of α, for the three cases: S_FP (eq. (10)), S_rc (derived from eq. (13)) and S_ru (derived from eq. (14)). The latter two cases give results which, where a direct comparison was possible, are quantitatively so close that the difference cannot be appreciated at the resolution of the figure; we thus treat the two cases as equivalent for the purposes of describing the results. In both of those cases, the solution of the equations becomes numerically extremely challenging around the transition point α_U (see below for the definition), and thus we could not complete all the curves in that region. The most notable features that emerge from this figure are:
Typical solutions are isolated: S_FP(d) becomes negative in a neighborhood of d = 0.
Up to a certain value α_U, there exist dense regions of solutions: in this phase, there exist non-typical solutions that are surrounded by an exponential (in N) number of other solutions, and at small distances the local entropy curves tend to collapse onto the α = 0 curve, which corresponds to the upper bound in which every configuration is a solution.
Between α_U and α_c, there are ranges of d where either there is no solution to the equations or the solution leads to a negative local entropy; in both cases, we interpret these facts as indicating a change in the structure of the dense clusters of solutions, which either disappear or break into small disconnected and isolated components.
The significance of the phase transition at α_U is related to the accessibility of the dense regions of solutions and to the existence of efficient algorithms that are able to solve the training task. In the binary, balanced and unbiased case studied in baldassi_subdominant_2015 , our best estimate of α_U was very close to the capacity measured experimentally for the best available heuristic algorithms (Belief Propagation with reinforcement braunstein-zecchina , Max-Sum with reinforcement baldassi2015max ), while the theoretical critical capacity is believed to be α_c ≈ 0.83 krauth-mezard . Another (simpler but faster) heuristic algorithm, called SBPI, was measured to achieve a slightly lower capacity baldassi-et-all-pnas . A very similar situation occurs with the same model in the generalization scenario, where the value of α_U baldassi_subdominant_2015 is very close to the maximum value reached by the best heuristic solvers baldassi2015max , leaving a region where the heuristics fail before the theoretical transition to perfect learning sompolinsky1990learning . For the dense binary case with weights in {0, 1}, i.e. the model considered in this paper with L = 2 and f = 1/2, SBPI was measured to achieve a capacity baldassi-et-all-pnas slightly below the theoretical maximum α_c ≈ 0.59 gutfreund1990capacity , and the large deviation analysis gives a compatible value of α_U. It was also shown by direct numerical experiments in baldassi_subdominant_2015 that all the solutions found by the heuristic algorithms at sufficiently large α are part of a dense region that is well described by the large deviation analysis.
All these results thus strongly suggest that α_U signals a transition between an "easy" phase and a "hard" phase. This situation bears some clear similarities with other constraint satisfaction problems, like random K-satisfiability (K-SAT), where in particular there can be a "frozen" phase in which solutions are isolated and no efficient algorithms are known to work krzakala-csp . Contrary to the K-SAT case, however, in the case of neural networks this transition does not appear in the equilibrium analysis – which would predict the problem to be intractable at all values of α – but only in a large deviations study. This observation is presumably linked to the complex geometrical structure of the dense regions, which are not "states" in the sense usually given to the word in the statistical physics of complex systems, i.e. they are not clearly separated clusters of configurations: had they been, it would not have been necessary to perform the large deviation analysis in order to observe them in the first place. Our analysis (theoretical and numerical) is not sufficient to completely characterize this geometrical structure. It does tell us that the structure must be extensive, that the density seems to vary in a continuous fashion (i.e. the local entropy landscape is rather smooth, so that it is algorithmically easy to find a path towards a solution, see Sec. V), and that there are several (but fewer than exponentially many) regions of highest density, due to replica symmetry breaking effects not related to any obvious symmetry of the problem. These highest-density regions are, to the leading exponential order, all equivalent; thus a local search algorithm designed to exploit their existence needs to be able to spontaneously break the symmetry among them. It would be very interesting to further refine this description.
IV.3 Transition point as a function of the number of states
Determining the value of α_U, where the dense regions seem to disappear (or at least are no longer easily accessible), is computationally extremely challenging, not only because of the time-consuming task of solving the system of equations that results from the replica analysis (which requires repeated nested numerical integrations), but especially because of purely numerical issues related to the finite machine precision available and to the trade-offs between computational time and increased precision. These issues are exacerbated near the transition point.
However, despite the fact that the RS analysis in the limit y → ∞ (performed in Appendix C) gives some unphysical results that need to be corrected at higher levels of replica symmetry breaking, it still provides an estimate of α_U which can be computed reasonably efficiently, and which is not dramatically affected by the RSB corrections: both in the binary, balanced and unbiased case of baldassi_subdominant_2015 and in the multi-valued model of this paper with the parameters used for Fig. 2, the value of α_U obtained from the RS analysis at y → ∞ is very close to (and compatible with) the 1RSB estimate.
Therefore, we have used the RS analysis at y → ∞ (note that in this limit there is no difference between the constrained free entropy Φ_rc and the unconstrained free entropy Φ_ru, see the discussion in Appendix C) to explore the behavior of α_U when varying the number of states L and the coding level f of the patterns. This is most easily achieved by studying the derivative of the local entropy with respect to the distance: Fig. 3A shows an example of the behavior of the local entropy as a function of the distance for various values of α in the dense ternary case (L = 3, f = 1/2).
As one can notice, there are three types of behavior: (i) below a certain value of α the local entropy curves are concave; (ii) between that value and α_U there appear intermediate ranges of d where the curves become convex; (iii) between α_U and α_c a gap opens, i.e. a range of distances where no solutions exist. The appearance of the gap (and thus α_U) is signaled by the fact that it is the lowest value of α for which there exists a distance at which the derivative of the local entropy diverges (Fig. 3B). These qualitative observations remain unchanged in the 1RSB case and across all the neural network models we have studied.
The appearance of the gap at α_U seems to be related to a breaking apart of the structure of the dense regions, which we also observed numerically at relatively small N. It is not completely clear whether the branch after the break is physical or an artifact of the replica analysis, since for the time being we are unable to find such regions numerically at large N, and thus to confirm their existence. If it is physical, then it depicts a situation in which dense regions still exist, but are broken apart into several separated clusters and are no longer as easily accessible as they are for α < α_U.
The behavior of α_U as a function of the number of states L, for various values of the coding level f, is shown in Fig. 4A. The behavior closely parallels that of the critical capacity, cf. Fig. 1. We expect the two quantities to converge to the same value in the limit L → ∞, where the device should behave as in the case of continuous synapses; this indeed seems to be the case, see Fig. 4B.
V Proof of concept: generalizing Entropy-driven Monte Carlo
The existence of subdominant dense clusters of solutions not only serves to provide a plausible explanation for the observed behavior of existing heuristic algorithms: it can also be exploited to design new algorithms. As a proof of concept, in baldassi_local_2016 , we have presented an algorithm called “Entropy-driven Monte Carlo” (EdMC), that exploits the fact that the landscape of the local entropy can be radically different from that of the energy.
The basic idea is to run a Simulated Annealing (SA) algorithm using the local entropy, rather than the energy, as the objective function, as follows: at any given configuration W̃, we consider a nearby configuration W̃′ (obtained by picking uniformly at random a synaptic index i and then randomly increasing or decreasing W̃_i by one), estimate the resulting shift in local entropy (see eq. (8)), and accept or reject the move according to the Metropolis rule at an inverse temperature y. After a number of accepted moves, we increase y and reduce the distance d by a fixed amount, until we eventually find a solution. We call the process of gradually reducing d "scoping", in analogy with the "annealing" process of increasing y.
The estimation of the local entropy is performed using Belief Propagation (BP) mackay2003information ; yedidia2005constructing ; for simplicity, instead of imposing a hard constraint on the distance d, we fix the value of its Legendre conjugate parameter by introducing a collection of fixed external fields (of a given intensity γ) in the direction of W̃, as described in detail in baldassi_local_2016 . The scoping process is thus obtained by gradually increasing γ.
The tests performed with this algorithm show that, while standard Simulated Annealing using the energy (the number of misclassified patterns, see eq. (3)) as the objective function gets immediately trapped by the exponentially large number of local minima, EdMC does not, and can reach a solution even in the greedy case in which it is run directly at zero temperature.
While this algorithm is certainly slower than other efficient heuristic solvers, it is still interesting for these reasons: (i) it is generic, since it can in principle be generalized to any model where reasonable estimates of the local entropy can be achieved; (ii) it is more “under control” than the heuristic alternatives, since its behavior actually closely matches the theoretical prediction of the large deviation analysis; (iii) it proves that the local entropy landscape is very different from the energy landscape (and EdMC could obviously be used directly to explore such a landscape, if run as a simple Monte Carlo algorithm without scoping or annealing).
In any case, it is also easy to heuristically improve this algorithm dramatically, by using the BP fixed-point messages to propose the moves, rather than performing them at random (but still using the Metropolis rule on the local entropy shift to decide whether to actually accept the moves or not). Also, instead of starting from a random configuration, we can use the BP marginals in the absence of any distance constraint, and clip them to determine a good starting point.
Fig. 5 shows the results of a test on one sample. Although the search space is considerably larger, the behavior of the algorithm is very similar to what was observed in baldassi_local_2016 for the binary, balanced and unbiased case: while EdMC reaches zero errors in a few iterations, standard SA plateaus and only eventually finds a solution, after several orders of magnitude more iterations (as is typical for these glassy systems, the time during which SA is trapped in a plateau increases exponentially with the system size). The heuristic enhancements further improve EdMC performance.
More specifically, in this test all the variants of EdMC were run in the greedy zero-temperature regime; when the initial configuration was chosen at random, we started with external fields of low intensity and progressively increased them after each greedy optimization procedure, thus avoiding inconsistencies in the messages and guaranteeing the convergence of BP even in the early stages, when the reference configuration is very far away from any solution. When starting from the clipped BP marginals, the fields could be set directly to their target intensity.
On the other hand, for SA we observed that the chosen problem was hard enough to trap the standard Monte Carlo even with very slow cooling rates, so we had to resort to a different definition of the energy function,
$E(W) = \sum_\mu R(\Delta^\mu)$, where $R(x) = -x$ if $x < 0$ and $R(x) = 0$ otherwise; i.e. this energy function measures the negative of the sum of the so-called stabilities of the violated patterns. The annealing was carried out with a multiplicative cooling schedule, in which the inverse temperature is multiplied by a fixed rate after every given number of accepted moves, starting from a given initial inverse temperature.
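As a sketch of what such an energy looks like in code (the stability definition $\Delta^\mu = \sigma^\mu (W \cdot \xi^\mu / \sqrt{N} - \theta)$ is our assumption, consistent with the text's description; variable names are illustrative):

```python
import numpy as np

def stability_energy(W, xi, sigma, theta=0.0):
    """Negative sum of the stabilities of the violated patterns.

    The stability of pattern mu is assumed to be
    Delta^mu = sigma^mu * (W . xi^mu / sqrt(N) - theta):
    positive iff the pattern is classified correctly. Correct patterns
    contribute 0 to the energy; violated ones contribute -Delta^mu > 0,
    so the energy decreases smoothly as patterns get closer to correct.
    """
    delta = sigma * (xi @ W / np.sqrt(len(W)) - theta)
    return float(np.sum(np.where(delta < 0.0, -delta, 0.0)))

# Example: two patterns, the first one violated
W = np.array([1.0, 1.0])
xi = np.array([[1.0, 1.0], [1.0, -1.0]])
sigma = np.array([-1.0, 1.0])
print(stability_energy(W, xi, sigma))   # sqrt(2): only pattern 0 is violated
```

Unlike the bare count of misclassified patterns, this energy is graded, so the Monte Carlo dynamics receives a signal even when a move does not change the number of errors.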
In both the EdMC and the SA the firing threshold was set to its optimal value, which was determined analytically via replica calculations.
VI Conclusions
In this work, we extended the large deviation analysis of the solution space of a single-layer neural network from the purely binary and balanced case baldassi_subdominant_2015 to the general discrete case. Despite some technical challenges in solving the equations, the results clearly indicate that the general qualitative picture is unchanged with respect to the binary case. In particular, for all values of the parameters and regardless of the number of synaptic states, we observe the existence of two distinct phases: one in which most solutions are isolated and hard to find, but there exists a dense and accessible cluster of solutions whose presence can be exploited, directly or indirectly, by heuristic solvers (e.g. by the EdMC algorithm); and one in which this dense cluster has broken apart. In all our tested settings the transition point between these two phases was fairly close to the maximal theoretical capacity. Both the transition point and the critical capacity grow with the number of synaptic states; however, the increase rapidly becomes very slow after the first few states: if there is a cost (metabolic or hardware) associated with adding more states to the synapses, this analysis suggests that the overall benefit of doing so would rapidly vanish. In other words, synapses may only need a few bits of precision, both in the sense that efficient learning is still possible (contrary to what previous analyses suggested) and in the sense that the marginal advantage in capacity from increasing the precision decreases quite rapidly.
Our main drive for performing this analysis was to make the model more biologically plausible with respect to the binary case, while still keeping it simple enough so that the theoretical analysis can be performed (albeit with great difficulty, for computational and numerical reasons). Indeed, as we already mentioned, our model neurons are very crude simplifications of biological neurons; also, using uncorrelated inputs and outputs is hardly realistic, or at the very least there certainly are settings in which we would rather consider some kind of correlations. Despite these shortcomings, we believe that this analysis, together with the previous one for the binary case, bears a rather clear general message, namely that the qualitative picture is the same regardless of the finer detail. In particular, this picture was not affected in our analysis by any of the parameters (number of synaptic states, sparsity of the patterns). Also, we had already shown for the binary case that numerical tests performed on a handwritten-digit image-recognition benchmark indicate that even when using more “natural” (highly correlated and structured) patterns the heuristic learning algorithms invariably end up in a dense region of solutions such as those described by the theoretical analysis of the uncorrelated-inputs case.
Therefore, despite the inevitable shortcomings of the model, this analysis provides a plausible picture in which to frame the study of the synaptic precision of biological neurons in relation to their computational and representational power. In a nutshell, it suggests that low-precision synapses are convenient for concrete implementations because the solution space has regions that can be exploited for efficient learning, consistent with experimental biological results. The learning mechanism must indeed be different from what is usually employed in machine learning applications (stochastic gradient descent), but simple effective algorithms exist thanks to the peculiar structure of the solution space, with ample freedom in the implementation details. This is clearly exemplified by the effectiveness of the Entropy-driven Monte Carlo technique that we introduced in baldassi_local_2016 and extended here to the more general case. Establishing whether this geometrical picture is also present in the learning of discrete deep feed-forward and recurrent neural networks looks like a promising direction for future investigations.
Acknowledgements. C.B., C.L. and R.Z. acknowledge the European Research Council for grant n° 267915.
Appendix A Franz-Parisi potential
In order to describe the geometrical properties of the solution space of the generalized perceptron, it is possible to carry out a mean-field analysis based on the computation of the Franz-Parisi potential. This method is conceptually divided into two stages: first, we select a reference configuration from the equilibrium Boltzmann measure at a certain inverse temperature; then, we evaluate the free energy of a coupled model in which the configurations, at a possibly different inverse temperature, are constrained to be exactly at a given distance from the reference point.
Since we are interested in the constraint satisfaction problem, in our case both temperatures are set to zero. It is important to notice that the sampling of the reference configuration is not affected by the coupling, so it is extracted at random from a flat distribution over all possible solutions, and represents the typical case (i.e., the numerically dominant one in this measure).
The Franz-Parisi potential can thus be interpreted as a typical local entropy density:
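In the notation we adopt here (our assumption: $\tilde{W}$ denotes the reference solution, $\mathbb{X}$ the indicator function of eq. (4), $N$ the number of synapses, and $d(\cdot,\cdot)$ the distance of eq. (9)), a standard way of writing this typical local entropy density is:

```latex
\mathcal{F}(d) = \frac{1}{N} \left\langle \log \sum_{\{W\}}
    \mathbb{X}_{\xi,\sigma}(W)\,
    \delta\big( d(W, \tilde{W}) - d \big)
\right\rangle_{\tilde{W},\, \xi, \sigma}
```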
where the indicator function is that defined in equation (4), and the averaging is performed over the flat measure on all solutions to the problem.
It is possible to introduce a robustness parameter to stabilize the learned patterns, so that each association is considered learned only if the output of the device is correct and the modulus of the activation is above this threshold. The indicator function can then be redefined as:
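With the robustness parameter denoted by $\kappa$ and the firing threshold by $\theta$ (the symbol names here are our assumption), a standard form of this redefinition is:

```latex
\mathbb{X}_{\xi,\sigma}(W) = \prod_{\mu}
    \Theta\!\left( \tilde{\sigma}^{\mu}
        \left( \frac{1}{\sqrt{N}} \sum_{i} W_i\, \xi_i^{\mu} - \theta \right)
        - \kappa \right)
```

where $\Theta$ is the Heaviside step function.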
where we omitted the indication of the index ranges for simplicity of notation, and we defined a transformed output variable in order to convert between the device output and a more convenient $\pm 1$ representation. Note that with this definition the average over the output of any function is defined as:
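Assuming outputs $\sigma \in \{0,1\}$ with the conversion $\tilde{\sigma} = 2\sigma - 1$ (our reading of the transformation mentioned above), and denoting the output coding rate by $f_{\mathrm{out}}$, this average reads:

```latex
\left\langle g(\sigma) \right\rangle_{\sigma}
    = f_{\mathrm{out}}\, g(1) + \left( 1 - f_{\mathrm{out}} \right) g(0)
```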
i.e. we use a dedicated parameter to denote the output coding rate, which in principle can be distinguished from the input coding rate.
In order to perform the average over the measure of solutions and over the quenched disorder, we employ the replica trick: we introduce non-interacting copies of the reference solution, and single out the index of the replica appearing in the distance constraint; we also introduce replicas of the coupled solutions. Throughout this section, we will use two separate sets of indices for the replicas of the reference and of the coupled solutions, and we will omit the specification of the index ranges in sums and products, for notational simplicity. In the end, both replica numbers will be sent to zero:
where we used the definition of eq. (9) for the distance function and introduced the measure over the possible values of the weights:
(In our experiments, we always used , but the derivation is general.) In the last line of eq. (A) we also defined the replicated volume .
As a first step we can introduce some auxiliary variables to substitute the arguments of the indicator functions:
We can now perform the average over the pattern distribution :
where we introduced the average of the inputs and their variance.
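For instance, assuming binary inputs $\xi \in \{0,1\}$ drawn i.i.d. with coding level $f$ (an assumption consistent with the sparse-coding setup of the main text), these two quantities would be:

```latex
\langle \xi \rangle = f, \qquad
\operatorname{Var}(\xi) = \langle \xi^2 \rangle - \langle \xi \rangle^2 = f(1-f)
```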
All the overlaps can now be replaced with order parameters via Dirac distributions. In the case of the generalized perceptron we also need to introduce two specific parameters, one for the $L_1$-norm and one for the $L_2$-norm of the weights.
Maximum capacity with biased patterns can be achieved if the mean value of the synaptic weights is on the threshold given by the ratio:
and because of the unbalanced distribution of the outputs we also need to introduce an additional correction, controlled by a dedicated order parameter:
We can define the following:
After these substitutions in the expression of the replicated volume, we use the integral representation of the Dirac distributions, introducing the required conjugate parameters, and rearrange the integrals so that it becomes possible to factorize over the synaptic and pattern indices:
where we have singled out a first term and the so-called entropic and energetic contributions:
Note that we dropped the synaptic and pattern indices from all quantities, since we have rearranged the terms and factorized the contributions; in particular, note that the indices were dropped from the weights and from the output.
A.1 Replica Symmetric Ansatz
To proceed with the calculation we now have to make an assumption about the structure of the replicated order parameters, the simplest possible one being the replica-symmetric Ansatz, in which all the dependencies on the replica indices are dropped. The only distinction we need to retain is between the overlap of the reference with the replica that appears in the distance constraint and the overlaps with the remaining replicas, since the former also enters the expression of the constraint on the distance; all the other order parameters and their conjugates take a single replica-independent value.
The first term of the expression of the volume can now be simplified, and the limit of vanishing replica numbers can be taken, obtaining: