# On the geometry of solutions and on the capacity of multi-layer neural networks with ReLU activations

Rectified Linear Units (ReLU) have become the main model for the neural units in current deep learning systems. This choice was originally suggested as a way to compensate for the so-called vanishing gradient problem, which can undercut stochastic gradient descent (SGD) learning in networks composed of multiple layers. Here we provide analytical results on the effects of ReLUs on the capacity and on the geometrical landscape of the solution space in two-layer neural networks with either binary or real-valued weights. We study the problem of storing an extensive number of random patterns and find that, quite unexpectedly, the capacity of the network remains finite as the number of neurons in the hidden layer increases, at odds with the case of threshold units, in which the capacity diverges. Possibly more important, a large deviation approach allows us to find that the geometrical landscape of the solution space has a peculiar structure: while the majority of solutions are close in distance but still isolated, there exist rare regions of solutions which are much denser than the analogous ones in the case of threshold units. These solutions are robust to perturbations of the weights and can tolerate large perturbations of the inputs. The analytical results are corroborated by numerical findings.


## References

• (1) Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
• (2) Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010.
• (3) Richard HR Hahnloser, Rahul Sarpeshkar, Misha A Mahowald, Rodney J Douglas, and H Sebastian Seung. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405(6789):947, 2000.
• (4) Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 315–323, Fort Lauderdale, FL, USA, 11–13 Apr 2011. PMLR.
• (5) Sepp Hochreiter. Untersuchungen zu dynamischen neuronalen netzen. Diploma, Technische Universität München, 91(1), 1991.
• (6) Carlo Baldassi, Christian Borgs, Jennifer T. Chayes, Alessandro Ingrosso, Carlo Lucibello, Luca Saglietti, and Riccardo Zecchina. Unreasonable effectiveness of learning neural networks: From accessible states and robust ensembles to basic algorithmic schemes. Proceedings of the National Academy of Sciences, 113(48):E7655–E7662, November 2016.
• (7) Carlo Baldassi, Alessandro Ingrosso, Carlo Lucibello, Luca Saglietti, and Riccardo Zecchina. Subdominant dense clusters allow for simple learning and high computational performance in neural networks with discrete synapses. Phys. Rev. Lett., 115:128101, Sep 2015.
• (8) Carlo Baldassi, Fabrizio Pittorino, and Riccardo Zecchina. Shaping the learning landscape in neural networks around wide flat minima. arXiv preprint arXiv:1905.07833, 2019. URL: https://arxiv.org/abs/1905.07833.
• (9) Elizabeth Gardner. The space of interactions in neural network models. Journal of Physics A: Mathematical and General, 21(1):257–270, Jan 1988.
• (10) Elizabeth Gardner and Bernard Derrida. Optimal storage properties of neural network models. Journal of Physics A: Mathematical and General, 21(1):271–284, Jan 1988.
• (11) Marc Mézard. The space of interactions in neural networks: Gardner’s computation with the cavity method. Journal of Physics A: Mathematical and General, 22(12):2181, 1989.
• (12) Hidetoshi Nishimori. Statistical Physics of Spin Glasses and Information Processing: An Introduction. International series of monographs on physics. Oxford University Press, 2001.
• (13) Eli Barkai, David Hansel, and Haim Sompolinsky. Broken symmetries in multilayered perceptrons. Phys. Rev. A, 45:4146–4161, Mar 1992.
• (14) A. Engel, H. M. Köhler, F. Tschepke, H. Vollmayr, and A. Zippelius. Storage capacity and learning algorithms for two-layer neural networks. Phys. Rev. A, 45:7590–7609, May 1992.
• (15) Graeme Mitchison and Richard Durbin. Bounds on the learning capacity of some multi-layer networks. Biological Cybernetics, 60(5):345–365, 1989.
• (16) Rémi Monasson and Riccardo Zecchina. Weight space structure and internal representations: a direct approach to learning and generalization in multilayer neural networks. Physical review letters, 75(12):2432, 1995.
• (17) Silvio Franz and Giorgio Parisi. Recipes for metastable states in spin glasses. Journal de Physique I, 5(11):1401–1415, 1995.
• (18) Haiping Huang and Yoshiyuki Kabashima. Origin of the computational hardness for learning with binary synapses. Physical Review E, 90(5):052813, 2014.
• (19) Werner Krauth and Marc Mézard. Storage capacity of memory networks with binary couplings. J. Phys. France, 50:3057–3066, 1989.
• (20) Thomas B Kepler and Laurence F Abbott. Domains of attraction in neural networks. Journal de Physique, 49(10):1657–1662, 1988.
• (21) Carlo Baldassi. A method to reduce the rejection rate in Monte Carlo Markov chains. Journal of Statistical Mechanics: Theory and Experiment, 2017(3):033301, 2017.

## Appendix A Model

Our model is a tree-like committee machine with $N$ weights $W_{li}$ divided into $K$ groups of $N/K$ entries. We use the index $l=1,\dots,K$ for the group and $i=1,\dots,N/K$ for the entry. We consider two cases: the binary case, $W_{li}=\pm1$ for all $l,i$, and the continuous case with spherical constraints on each group, $\sum_i W_{li}^2=N/K$ for all $l$.

The training set consists of $p=\alpha N$ random binary i.i.d. patterns. The inputs are denoted by $\xi^\mu_{li}$ and the outputs by $\sigma^\mu$, where $\mu=1,\dots,p$ is the pattern index.

In our analysis, we write the model using a generic activation function $g(\cdot)$ for the first layer (hidden) units. We will consider two cases: the sign case $g(x)=\mathrm{sign}(x)$ and the ReLU case $g(x)=\max(0,x)$. The output of any given unit $l$ in response to a pattern $\mu$ is thus written as

$$\tau^\mu_l=g\left(\frac{1}{\sqrt{N/K}}\sum_i W_{li}\,\xi^\mu_{li}\right)\tag{10}$$

The connection weights between the first layer and the output are denoted by $c_l$ and considered binary and fixed; for the case of ReLU activations we set the first half, $l=1,\dots,K/2$, to the value $+1$ and the rest to $-1$; for the case of the sign activations we can set them all to $+1$ without loss of generality.

A configuration of the weights solves the training problem if it correctly classifies all the patterns; we denote this with the indicator function

$$\mathbb{X}_{\xi,\sigma}(W)=\prod_\mu\Theta\left(\frac{\sigma^\mu}{\sqrt K}\sum_l c_l\,\tau^\mu_l\right)\tag{11}$$

where $\Theta(x)=1$ if $x>0$ and $0$ otherwise is the Heaviside step function.
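The definitions in eqs. (10) and (11) can be made concrete with a small sketch. The function names, the data layout (`W` and each pattern as `K` lists of `N/K` entries) and the tie-breaking of `sign` at zero are assumptions of this example, not part of the paper.

```python
import math

def sign(x):
    # Tie-breaking at x == 0 is an assumption of this sketch.
    return 1.0 if x > 0 else -1.0

def relu(x):
    return max(0.0, x)

def hidden_outputs(W, xi, g):
    """Eq. (10): one output per group l; W and xi are K lists of N/K entries."""
    outputs = []
    for w_group, xi_group in zip(W, xi):
        field = sum(w * x for w, x in zip(w_group, xi_group))
        outputs.append(g(field / math.sqrt(len(w_group))))
    return outputs

def readout_weights(K, activation):
    """Fixed second-layer weights c_l: first half +1, rest -1 for ReLU; all +1 for sign."""
    if activation == "relu":
        return [1.0] * (K // 2) + [-1.0] * (K - K // 2)
    return [1.0] * K

def is_solution(W, patterns, labels, c, g):
    """Eq. (11): 1 if every pattern is classified correctly, 0 otherwise."""
    K = len(W)
    for xi, sigma in zip(patterns, labels):
        tau = hidden_outputs(W, xi, g)
        margin = sigma / math.sqrt(K) * sum(cl * tl for cl, tl in zip(c, tau))
        if margin <= 0:
            return 0
    return 1

# Tiny example: K = 2 groups of N/K = 3 binary weights, two patterns.
W = [[1, -1, 1], [1, 1, -1]]
patterns = [
    [[1, -1, 1], [-1, -1, 1]],   # pre-activation sums: +3 and -3
    [[-1, 1, -1], [1, 1, -1]],   # pre-activation sums: -3 and +3
]
c = readout_weights(2, "relu")   # [+1, -1]
print(is_solution(W, patterns, [1, -1], c, relu))
```

With ReLU units the two halves of the readout carry opposite signs, so the network can produce both output labels even though each hidden unit is non-negative.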

The volume of the space of configurations that correctly classify the whole training set is then

$$Z=\int d\mu(W)\,\mathbb{X}_{\xi,\sigma}(W)\tag{12}$$

where $d\mu(W)$ is the flat measure over the admissible values of the $W$, depending on whether we are analyzing the binary or the spherical case. The average of the log-volume over the distribution of the patterns, $\frac1N\langle\ln Z\rangle_{\xi,\sigma}$, is the free entropy of the model $\mathcal F$. We can evaluate it in the large $N$ limit with the “replica trick”, using the formula $\langle\ln Z\rangle=\lim_{n\to0}\frac1n\ln\langle Z^n\rangle$, computing the average for integer $n$ and then taking the limit $n\to0$.
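The replica formula can be motivated with a short expansion. The following is a sketch of the standard argument, not a derivation taken from the paper; it assumes that the moments $\langle Z^n\rangle$ can be continued analytically from integer to real $n$.

$$
\begin{aligned}
Z^n &= e^{n\ln Z} = 1+n\ln Z+O(n^2),\\
\langle Z^n\rangle &= 1+n\langle\ln Z\rangle+O(n^2),\\
\frac1n\ln\langle Z^n\rangle &= \langle\ln Z\rangle+O(n)\;\xrightarrow[n\to0]{}\;\langle\ln Z\rangle .
\end{aligned}
$$

For integer $n$, $Z^n$ is the volume of $n$ independent copies (replicas) of the system that share the same patterns, which is what makes the average over the patterns tractable.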

As explained in the main text, in all cases the resulting expression takes the form

$$\mathcal F=\mathcal G_S+\alpha\,\mathcal G_E\tag{13}$$

where the $\mathcal G_S$ part is only affected by the spherical or binary nature of the weights, whereas the $\mathcal G_E$ part is only affected by $K$ and by the activation function $g$. Determining their value requires computing a saddle point over some overlap parameters $q^{ab}_l$, with $a,b=1,\dots,n$, representing overlaps between replicas, and their conjugates $\hat q^{ab}_l$; in turn, this requires an ansatz about the structure of the saddle point in order to perform the $n\to0$ limit.

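For the spherical weights, the $\mathcal G_S$ part (before the $n\to0$ limit) reads

$$\mathcal G_S^{\mathrm{sph}}=-\frac{1}{2nK}\sum_{l=1}^K\sum_{a,b}q^{ab}_l\hat q^{ab}_l+\frac{1}{nK}\sum_{l=1}^K\log\int\prod_a dW^a\,e^{\frac12\sum_{a,b}\hat q^{ab}_lW^aW^b}\tag{14}$$

while for the binary case we have a very similar expression, except that the summations do not include the $a=b$ case and the integral over $W^a$ becomes a summation: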

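$$\mathcal G_S^{\mathrm{bin}}=-\frac{1}{2nK}\sum_{l=1}^K\sum_{a\neq b}q^{ab}_l\hat q^{ab}_l+\frac{1}{nK}\sum_{l=1}^K\log\sum_{W^a=\pm1}e^{\frac12\sum_{a\neq b}\hat q^{ab}_lW^aW^b}\tag{15}$$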

In all cases, we study the problem in the large $K$ limit, which allows us to invoke the central limit theorem and leads to a crucial simplification of the expressions.

## Appendix B Critical capacity

### B.1 Replica symmetric ansatz

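In the replica-symmetric (RS) case we seek solutions of the form $q^{ab}_l=\delta_{ab}Q+(1-\delta_{ab})\,q$ for all $l,a,b$, where $\delta_{ab}$ is the Kronecker delta symbol, and similarly for the conjugated parameters, $\hat q^{ab}_l=\delta_{ab}\hat Q+(1-\delta_{ab})\,\hat q$ for all $l,a,b$. The resulting expressions, as reported in the main text, are:

$$\mathcal G_S^{\mathrm{sph}}=\frac12\hat Q+\frac12q\hat q+\frac12\ln\frac{2\pi}{\hat Q+\hat q}+\frac{\hat q}{2(\hat Q+\hat q)}\tag{17}$$

$$\mathcal G_S^{\mathrm{bin}}=-\frac{\hat q}{2}(1-q)+\int Dz\,\ln2\cosh\left(z\sqrt{\hat q}\right)\tag{18}$$

$$\mathcal G_E=\int Dz_0\,\ln H\left(-\sqrt{\frac{\Delta-\Delta_{-1}}{\Delta_2-\Delta}}\,z_0\right)\tag{19}$$

where $Dz=\frac{dz}{\sqrt{2\pi}}e^{-z^2/2}$ is a Gaussian measure, $H(x)=\int_x^\infty Dz$, and the expressions of $\Delta_{-1}$, $\Delta$ and $\Delta_2$ depend on the activation function $g$:

$$\Delta_{-1}^{\mathrm{sgn}}=0;\qquad\Delta^{\mathrm{sgn}}=1-\frac2\pi\arccos(q);\qquad\Delta_2^{\mathrm{sgn}}=1$$

$$\Delta_{-1}^{\mathrm{ReLU}}=\frac{1}{2\pi};\qquad\Delta^{\mathrm{ReLU}}=\frac{\sqrt{1-q^2}}{2\pi}+\frac q\pi\arctan\sqrt{\frac{1+q}{1-q}};\qquad\Delta_2^{\mathrm{ReLU}}=\frac12$$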
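The expressions for $\Delta$ can be sanity-checked numerically. The sketch below assumes an interpretation that is not spelled out in this appendix: that $\Delta(q)$ equals the average $\langle g(x)g(y)\rangle$ over two standard Gaussian variables with correlation $q$, with $\Delta_{-1}$ and $\Delta_2$ as its values at $q=0$ and $q=1$. Function names and the Monte Carlo settings are illustrative choices.

```python
import math
import random

def delta_sign(q):
    """Delta for sign activations."""
    return 1.0 - (2.0 / math.pi) * math.acos(q)

def delta_relu(q):
    """Delta for ReLU activations; arctan(sqrt((1+q)/(1-q))) tends to pi/2 as q -> 1."""
    angle = math.pi / 2 if q >= 1.0 else math.atan(math.sqrt((1.0 + q) / (1.0 - q)))
    return math.sqrt(1.0 - q * q) / (2.0 * math.pi) + q * angle / math.pi

def monte_carlo_kernel(g, q, samples=100_000, seed=0):
    """Estimate <g(x) g(y)> for standard Gaussians x, y with correlation q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        x = rng.gauss(0.0, 1.0)
        y = q * x + math.sqrt(1.0 - q * q) * rng.gauss(0.0, 1.0)
        total += g(x) * g(y)
    return total / samples

def sign(x):
    return 1.0 if x > 0 else -1.0

def relu(x):
    return max(0.0, x)

# Endpoints reproduce Delta_{-1} (q = 0) and Delta_2 (q = 1) for both activations.
print(delta_sign(0.0), delta_sign(1.0))
print(delta_relu(0.0), delta_relu(1.0))
```

Comparing `monte_carlo_kernel(relu, q)` with `delta_relu(q)` at intermediate values of `q` gives a quick check of the closed form under the stated assumption.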

The values of the overlaps and conjugated parameters are found by setting to $0$ the derivatives of the free entropy.
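Solving the saddle point equations numerically requires evaluating Gaussian averages such as the energetic term of eq. (19). The following is a minimal sketch of one way to do that, assuming $H(x)=\tfrac12\,\mathrm{erfc}(x/\sqrt2)$ and a plain trapezoidal rule; the function names, the grid and the cutoff are illustrative choices, not the authors' numerical scheme.

```python
import math

def log_H(x):
    """log of H(x) = 0.5 * erfc(x / sqrt(2)), with an asymptotic form for large x."""
    if x < 30.0:
        return math.log(0.5 * math.erfc(x / math.sqrt(2.0)))
    # Leading-order tail: H(x) ~ exp(-x^2 / 2) / (x * sqrt(2 * pi))
    return -0.5 * x * x - math.log(x * math.sqrt(2.0 * math.pi))

def gaussian_average(f, z_max=8.0, points=4001):
    """Trapezoidal estimate of the Gaussian average of f(z) on [-z_max, z_max]."""
    h = 2.0 * z_max / (points - 1)
    total = 0.0
    for i in range(points):
        z = -z_max + i * h
        weight = 0.5 if i in (0, points - 1) else 1.0
        total += weight * f(z) * math.exp(-0.5 * z * z)
    return total * h / math.sqrt(2.0 * math.pi)

def G_E_rs(delta_m1, delta, delta_2):
    """Eq. (19) for given Delta_{-1} <= Delta < Delta_2."""
    ratio = math.sqrt(max(0.0, delta - delta_m1) / (delta_2 - delta))
    return gaussian_average(lambda z: log_H(-ratio * z))

# Sign activations at q = 0.5, where Delta = 1 - (2/pi) * arccos(0.5) = 1/3.
print(G_E_rs(0.0, 1.0 / 3.0, 1.0))
```

At $\Delta=\Delta_{-1}$ the argument of $H$ vanishes and the result is $\ln\frac12$, which gives a simple check of the quadrature.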

The critical capacity $\alpha_c$ is found in the binary case by seeking numerically the value of $\alpha$ for which the saddle point solution returns a zero free entropy. For the spherical case, instead, $\alpha_c$ is determined by finding the value of $\alpha$ such that $q\to1$, which can be obtained analytically by reparametrizing $q=1-\delta q$ and expanding around $\delta q=0$. In this limit, a further reparametrization is needed, which involves an exponent that depends on the activation function and takes different values for the sign and for the ReLU. Due to this difference in the exponent, $\alpha_c$ diverges in the sign activation case (as was shown in (13, 14)), while for the ReLU activations it converges to a finite value. However, the RS result for the spherical case is only an upper bound, and a more accurate result requires replica-symmetry breaking.

### B.2 1RSB ansatz

In the one-step replica-symmetry-breaking (1RSB) ansatz we seek solutions with 3 possible values of the overlaps and their conjugates. We group the $n$ replicas in $n/m$ groups of $m$ replicas each, and denote with $q_0$ the overlaps among different groups and with $q_1$ the overlaps within the same group. As before, the self overlap is $Q$ and its conjugate $\hat Q$ (these are only relevant in the spherical case).
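The block structure of the 1RSB ansatz is easy to visualize by building the overlap matrix explicitly. The sketch below assumes the usual reading of the ansatz, with `n` replicas split into `n // m` consecutive groups of `m`, a self overlap `Q` on the diagonal, `q1` inside a group and `q0` across groups; the function name and argument names are illustrative.

```python
def rsb1_overlap_matrix(n, m, Q, q0, q1):
    """Overlap matrix q^{ab} for n replicas split into n/m groups of m replicas."""
    if n % m != 0:
        raise ValueError("n must be a multiple of m")
    matrix = []
    for a in range(n):
        row = []
        for b in range(n):
            if a == b:
                row.append(Q)       # self overlap
            elif a // m == b // m:
                row.append(q1)      # same group
            else:
                row.append(q0)      # different groups
        matrix.append(row)
    return matrix

for row in rsb1_overlap_matrix(4, 2, 1.0, 0.2, 0.7):
    print(row)
```

Counting entries, the off-diagonal part contains $n(m-1)$ elements equal to $q_1$ and $n(n-m)$ elements equal to $q_0$; setting $q_0=q_1$ gives back the RS structure.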

The resulting expressions are:

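$$\mathcal G_S^{\mathrm{sph}}=\frac12\hat Q+\frac12\hat q_1q_1+\frac m2\left(q_0\hat q_0-q_1\hat q_1\right)+\frac12\left[\ln\frac{2\pi}{\hat Q+\hat q_1}+\frac1m\ln\left(\frac{\hat Q+\hat q_1}{\hat Q+\hat q_1-m(\hat q_1-\hat q_0)}\right)+\frac{\hat q_0}{\hat Q+\hat q_1-m(\hat q_1-\hat q_0)}\right]\tag{20}$$

$$\mathcal G_S^{\mathrm{bin}}=-\frac{\hat q_1}{2}(1-q_1)+\frac m2\left(q_0\hat q_0-q_1\hat q_1\right)+\frac1m\int Du\,\ln\int Dv\left(2\cosh\left(\sqrt{\hat q_0}\,u+\sqrt{\hat q_1-\hat q_0}\,v\right)\right)^m\tag{21}$$

$$\mathcal G_E=\frac1m\int Dz_0\,\ln\int Dz_1\,H\left(-\frac{\sqrt{\Delta_0-\Delta_{-1}}\,z_0+\sqrt{\Delta_1-\Delta_0}\,z_1}{\sqrt{\Delta_2-\Delta_1}}\right)^m\tag{22}$$

The expressions of $\Delta_{-1}$ and $\Delta_2$ are the same as in the RS case. The expressions of $\Delta_0$ and $\Delta_1$ take the same form as the RS expressions for $\Delta$, except that $q_0$ and $q_1$ must be used instead of $q$.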
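As a consistency check, the 1RSB expressions should collapse onto the RS ones when the two overlaps coincide. The following sketch sets $q_0=q_1=q$ and $\hat q_0=\hat q_1=\hat q$ (so that $\Delta_0=\Delta_1=\Delta$) in eqs. (20), (21) and (22), and uses only $\int Dv=\int Dz_1=1$:

$$
\begin{aligned}
\mathcal G_S^{\mathrm{sph}}&=\frac12\hat Q+\frac12q\hat q+0+\frac12\left[\ln\frac{2\pi}{\hat Q+\hat q}+\frac1m\ln1+\frac{\hat q}{\hat Q+\hat q}\right]
=\frac12\hat Q+\frac12q\hat q+\frac12\ln\frac{2\pi}{\hat Q+\hat q}+\frac{\hat q}{2(\hat Q+\hat q)},\\
\mathcal G_S^{\mathrm{bin}}&=-\frac{\hat q}{2}(1-q)+\frac1m\int Du\,\ln\left(2\cosh\left(\sqrt{\hat q}\,u\right)\right)^m
=-\frac{\hat q}{2}(1-q)+\int Du\,\ln2\cosh\left(\sqrt{\hat q}\,u\right),\\
\mathcal G_E&=\frac1m\int Dz_0\,\ln H\left(-\sqrt{\frac{\Delta-\Delta_{-1}}{\Delta_2-\Delta}}\,z_0\right)^m
=\int Dz_0\,\ln H\left(-\sqrt{\frac{\Delta-\Delta_{-1}}{\Delta_2-\Delta}}\,z_0\right).
\end{aligned}
$$

In each case the dependence on $m$ drops out and eqs. (17), (18) and (19) are recovered.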

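Similarly to the RS case, in order to determine the critical capacity in the spherical case, we need to find the value of $\alpha$ such that $q_1\to1$. In this case however we must also have $m\to0$. The scaling is such that $\tilde m=m/(1-q_1)$ is finite. The final expression reads

$$\alpha_c=\frac{\ln\left(1+\tilde m(1-q_0)\right)+\dfrac{\tilde mq_0}{1+\tilde m(1-q_0)}}{2f(q_0,\tilde m)}\tag{23}$$

where

$$f(q_0,\tilde m)=\int Dz_0\,\ln\left[\sqrt{\frac{\delta\Delta}{\delta\Delta+(\Delta_2-\Delta_0)\tilde m}}\;e^{-\frac{(\Delta_0-\Delta_{-1})\tilde m}{\delta\Delta+(\Delta_2-\Delta_0)\tilde m}\frac{z_0^2}{2}}\,H\left(-\sqrt{\frac{\Delta_0-\Delta_{-1}}{\Delta_2-\Delta_0}}\,\frac{\sqrt{\delta\Delta}\,z_0}{\sqrt{\delta\Delta+(\Delta_2-\Delta_0)\tilde m}}\right)+H\left(\sqrt{\frac{\Delta_0-\Delta_{-1}}{\Delta_2-\Delta_0}}\,z_0\right)\right]\tag{24}$$

The expression of $\Delta_0$ is the same as in the RS case using $q_0$ instead of $q$; the values of $q_0$ and $\tilde m$ are determined by saddle point equations as usual.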

## Appendix C Franz-Parisi potential

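The Franz-Parisi entropy (17, 18) is defined as (cf. eq. (7) of the main text):

$$\mathcal F_{FP}(S)=\frac1N\left\langle\frac{\int d\mu(\tilde W)\,\mathbb{X}_{\xi,\sigma}(\tilde W)\,\ln\mathcal N_{\xi,\sigma}(\tilde W,S)}{\int d\mu(\tilde W)\,\mathbb{X}_{\xi,\sigma}(\tilde W)}\right\rangle_{\xi,\sigma}\tag{25}$$

where $\mathcal N_{\xi,\sigma}(\tilde W,S)$ counts the number of solutions at a distance from the reference $\tilde W$ fixed by the overlap $S$. In this expression, $\tilde W$ represents a typical solution. The evaluation of this quantity requires the use of two sets of replicas: $n$ replicas of the reference configuration $\tilde W$, for which we use the indices $a$ and $b$, and a second set of replicas of the surrounding configurations $W$, for which we use the indices $c$ and $d$. The computation proceeds following standard steps, leading to these expressions, written in terms of the overlaps $q^{ab}_l$, $p^{cd}_l$, $t^{ac}_l$ and their conjugate quantities:

$$\mathcal F_{FP}=\mathcal G_S+\alpha\,\mathcal G_E\tag{26}$$

$$\begin{aligned}\mathcal G_S^{\mathrm{sph}}={}&-\frac{1}{2nK}\sum_{l=1}^K\sum_{a,b}q^{ab}_l\hat q^{ab}_l-\frac{1}{2nK}\sum_{l=1}^K\sum_{c,d}p^{cd}_l\hat p^{cd}_l-\frac{1}{nK}\sum_{l=1}^K\sum_{a>1,c}t^{ac}_l\hat t^{ac}_l-\frac{S}{nK}\sum_{l=1}^K\sum_{a>1,c}\hat S^c_l\\&+\frac{1}{nK}\ln\int\prod_{al}d\tilde W^a_l\prod_{cl}dW^c_l\prod_l e^{\frac12\sum_{a,b}\hat q^{ab}_l\tilde W^a_l\tilde W^b_l+\frac12\sum_{c,d}\hat p^{cd}_lW^c_lW^d_l+\sum_{a>1,c}\hat t^{ac}_l\tilde W^a_lW^c_l+\tilde W^{a=1}_l\sum_c\hat S^c_lW^c_l}\end{aligned}\tag{27}$$

$$\begin{aligned}\mathcal G_S^{\mathrm{bin}}={}&-\frac{1}{2nK}\sum_{l=1}^K\sum_{a\neq b}q^{ab}_l\hat q^{ab}_l-\frac{1}{2nK}\sum_{l=1}^K\sum_{c\neq d}p^{cd}_l\hat p^{cd}_l-\frac{1}{nK}\sum_{l=1}^K\sum_{a>1,c}t^{ac}_l\hat t^{ac}_l-\frac{S}{nK}\sum_{l=1}^K\sum_{a>1,c}\hat S^c_l\\&+\frac{1}{nK}\ln\sum_{\tilde W^a_l=\pm1}\sum_{W^c_l=\pm1}\prod_l e^{\frac12\sum_{a\neq b}\hat q^{ab}_l\tilde W^a_l\tilde W^b_l+\frac12\sum_{c\neq d}\hat p^{cd}_lW^c_lW^d_l+\sum_{a>1,c}\hat t^{ac}_l\tilde W^a_lW^c_l+\tilde W^{a=1}_l\sum_c\hat S^c_lW^c_l}\end{aligned}\tag{28}$$

$$\begin{aligned}\mathcal G_E={}&\frac1n\,\mathbb E_\sigma\ln\int\prod_{la}\frac{d\lambda^a_l\,d\hat\lambda^a_l}{2\pi}\prod_{lc}\frac{du^c_l\,d\hat u^c_l}{2\pi}\,e^{i\lambda^a_l\hat\lambda^a_l+iu^c_l\hat u^c_l}\prod_a\Theta\left(\frac{\sigma}{\sqrt K}\sum_lc_l\,g(\lambda^a_l)\right)\prod_c\Theta\left(\frac{\sigma}{\sqrt K}\sum_lc_l\,g(u^c_l)\right)\\&\times\prod_le^{-\frac12\sum_a(\hat\lambda^a_l)^2-\frac12\sum_c(\hat u^c_l)^2-\sum_{a>1,c}t^{ac}_l\hat\lambda^a_l\hat u^c_l-S\hat\lambda^{a=1}_l\sum_c\hat u^c_l}\end{aligned}\tag{29}$$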

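The calculation proceeds by taking the RS ansatz with the same structure as that of sec. B.1 and the limit in which the number of replicas in both sets goes to zero. We obtain, in the large $K$ limit:

$$\mathcal G_S^{\mathrm{sph}}=\frac12\hat P+\frac12\hat pp+\hat tt-\hat SS+\frac12\ln\frac{2\pi}{\hat P+\hat p}+\frac{1}{\hat P+\hat p}\left[\frac{\hat p}{2}+\frac{(\hat S-\hat t)^2\left(\frac{\hat Q}{2}+\hat q\right)}{(\hat Q+\hat q)^2}+\frac{\hat t(\hat S-\hat t)}{\hat Q+\hat q}\right]\tag{30}$$

$$\mathcal G_S^{\mathrm{bin}}=-\frac12\hat p(1-p)+\hat tt-\hat SS+\int Du\,\frac{\sum_{\tilde W=\pm1}e^{\tilde W\sqrt{\hat q}\,u}\int Dv\,\ln\left[2\cosh\left(\sqrt{\hat p-\frac{\hat t^2}{\hat q}}\,v+\frac{\hat t}{\sqrt{\hat q}}\,u+(\hat S-\hat t)\tilde W\right)\right]}{2\cosh\left(\sqrt{\hat q}\,u\right)}\tag{31}$$

$$\mathcal G_E=\int Dz_0\int Dz_1\,\frac{H\left(-\frac{D_1z_1+\sqrt{\Sigma_0\Gamma}\,z_0}{\sqrt{\Sigma_1\Gamma-D_1^2}}\right)}{H\left(-\sqrt{\frac{\Sigma_0}{\Sigma_1}}\,z_0\right)}\,\ln H\left(-\frac{\sqrt\Gamma\,z_1+D_0z_0/\sqrt{\Sigma_0}}{\sqrt{\Delta_3}}\right)\tag{32}$$

where we introduced the auxiliary quantities $\Delta_3$, $\Sigma_0$, $\Sigma_1$, $D_0$, $D_1$ and $\Gamma$, which depend on the choice of the activation function (like the $\Delta$ of the previous section). For the sign activations we get:

$$\Delta_3^{\mathrm{sgn}}=\frac2\pi\arccos p\tag{33}$$

$$\Sigma_0^{\mathrm{sgn}}=1-\frac2\pi\arccos q\tag{34}$$

$$\Sigma_1^{\mathrm{sgn}}=\frac2\pi\arccos q\tag{35}$$

$$D_0^{\mathrm{sgn}}=\frac2\pi\arctan\left(\frac{t}{\sqrt{1-t^2}}\right)\tag{36}$$

$$D_1^{\mathrm{sgn}}=\frac2\pi\left[\arctan\left(\frac{S}{\sqrt{1-S^2}}\right)-\arctan\left(\frac{t}{\sqrt{1-t^2}}\right)\right]\tag{37}$$

$$\Gamma^{\mathrm{sgn}}=1-\frac2\pi\arccos p-\frac{\left(D_0^{\mathrm{sgn}}\right)^2}{\Sigma_0^{\mathrm{sgn}}}\tag{38}$$

For the ReLU activations we get:

$$\Delta_3^{\mathrm{ReLU}}=\frac12-\frac{\sqrt{1-p^2}}{2\pi}-\frac p\pi\arctan\sqrt{\frac{1+p}{1-p}}\tag{39}$$

$$\Sigma_0^{\mathrm{ReLU}}=\frac{\sqrt{1-q^2}}{2\pi}+\frac q\pi\arctan\sqrt{\frac{1+q}{1-q}}-\frac{1}{2\pi}\tag{40}$$

$$\Sigma_1^{\mathrm{ReLU}}=\frac12-\frac{\sqrt{1-q^2}}{2\pi}-\frac q\pi\arctan\sqrt{\frac{1+q}{1-q}}\tag{41}$$

$$D_0^{\mathrm{ReLU}}=\frac{\sqrt{1-t^2}}{2\pi}+\frac t\pi\arctan\sqrt{\frac{1+t}{1-t}}-\frac{1}{2\pi}\tag{42}$$

$$D_1^{\mathrm{ReLU}}=\frac{\sqrt{1-S^2}}{2\pi}+\frac S\pi\arctan\sqrt{\frac{1+S}{1-S}}-\frac{\sqrt{1-t^2}}{2\pi}-\frac t\pi\arctan\sqrt{\frac{1+t}{1-t}}\tag{43}$$

$$\Gamma^{\mathrm{ReLU}}=\frac{\sqrt{1-p^2}}{2\pi}+\frac p\pi\arctan\sqrt{\frac{1+p}{1-p}}-\frac{1}{2\pi}-\frac{\left(D_0^{\mathrm{ReLU}}\right)^2}{\Sigma_0^{\mathrm{ReLU}}}\tag{44}$$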
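Comparing eqs. (33) to (44) term by term, the six auxiliary quantities appear to follow one pattern for both activations when written through the function $\Delta(\cdot)$ of sec. B.1 evaluated at the overlaps $q$, $p$, $t$ and $S$. The sketch below encodes that observation, which is made here by inspection of the formulas rather than stated in the text; the names `kernel_sign`, `kernel_relu` and `fp_quantities` are hypothetical.

```python
import math

def kernel_sign(x):
    return 1.0 - (2.0 / math.pi) * math.acos(x)

def kernel_relu(x):
    angle = math.pi / 2 if x >= 1.0 else math.atan(math.sqrt((1.0 + x) / (1.0 - x)))
    return math.sqrt(1.0 - x * x) / (2.0 * math.pi) + x * angle / math.pi

# (kernel, Delta_{-1}, Delta_2) for each activation, as in sec. B.1
CONSTANTS = {
    "sign": (kernel_sign, 0.0, 1.0),
    "relu": (kernel_relu, 1.0 / (2.0 * math.pi), 0.5),
}

def fp_quantities(activation, q, p, t, S):
    """Auxiliary quantities of eqs. (33)-(44) built from a single kernel."""
    kernel, d_m1, d_2 = CONSTANTS[activation]
    sigma0 = kernel(q) - d_m1
    sigma1 = d_2 - kernel(q)
    D0 = kernel(t) - d_m1
    D1 = kernel(S) - kernel(t)
    delta3 = d_2 - kernel(p)
    gamma = kernel(p) - d_m1 - D0 ** 2 / sigma0
    return {"Delta3": delta3, "Sigma0": sigma0, "Sigma1": sigma1,
            "D0": D0, "D1": D1, "Gamma": gamma}

print(fp_quantities("sign", q=0.3, p=0.5, t=0.2, S=0.6))
print(fp_quantities("relu", q=0.3, p=0.5, t=0.2, S=0.6))
```

For the sign case the match with eqs. (36) and (37) relies on the identity $\arctan\left(x/\sqrt{1-x^2}\right)=\arcsin x=\frac\pi2-\arccos x$.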

In order to find the order parameters for any given $\alpha$ and $S$, we need to set to $0$ the derivatives of the free entropy w.r.t. the order parameters $q$, $p$, $t$ and the conjugates $\hat Q$, $\hat q$, $\hat P$, $\hat p$, $\hat t$, $\hat S$, thus obtaining a system of 9 equations (7 for the binary case) to be solved numerically. The equations actually reduce to 6 (5 in the binary case) since $q$, $\hat q$ and $\hat Q$ are the same ones derived from the typical case (sec. B.1).

## Appendix D Large deviation analysis

Following (8), the large deviation analysis for the description of the high-local-entropy landscape uses the same equations as the standard 1RSB expressions, eqs. (20), (21) and (22). In this case, however, the overlap $q_1$ is not determined by a saddle point equation, but rather it is treated as an external parameter that controls the mutual overlap between the $m$ replicas of the system. Also, the parameter $m$ is not optimized and it is not restricted to the range $[0,1]$; instead, it plays the role of the number of replicas and it is generally taken to be large (we normally use either a large integer number to compare the results with numerical simulations, or we take the limit $m\to\infty$). For these reasons, there are two fewer saddle point equations compared to the standard 1RSB calculation.

The resulting expression for the free entropy $\mathcal F$ represents, in the spherical case, the log-volume of valid configurations (solutions at the correct overlap) of the system of $m$ replicas. These configurations are thus embedded in a product of spheres, one for each group of weights of each replica. In order to quantify the solution density, we must normalize $\mathcal F$, subtracting the log-volume of all the admissible configurations at a given $q_1$ without the solution constraint (which is obtained by the analogous computation with $\alpha=0$). The resulting quantity is thus upper-bounded by $0$ (cf. Fig. (2) of the main text). For the binary case, $\mathcal F$ is the log of the number of admissible solutions, and the same normalization procedure can be applied.

#### Large m limit

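In the $m\to\infty$ case the order parameters $q_0$ and $\hat q_0$ need to be rescaled with $m$ and reparametrized with two new quantities $\delta q_0$ and $\delta\hat q_0$, as follows:

$$q_0=q_1-\frac{\delta q_0}{m}\tag{48}$$

$$\hat q_0=\hat q_1-\frac{\delta\hat q_0}{m}\tag{49}$$

As a consequence, we also reparametrize $\Delta_0$ with a new parameter $\delta\Delta_0$ defined as:

$$m(\Delta_1-\Delta_0)=\delta\Delta_0=\begin{cases}\dfrac{2\,\delta q_0}{\pi\sqrt{1-q_1^2}}&\text{(sign)}\\[2ex]\dfrac{\delta q_0}{\pi}\arctan\left(\sqrt{\dfrac{1+q_1}{1-q_1}}\right)&\text{(ReLU)}\end{cases}\tag{50}$$

The expressions eqs. (20), (21) and (22) become:

$$\mathcal G_S^{\mathrm{bin}}(q_1)=-\frac{\hat q_1}{2}(1-q_1)-\frac12\left(\delta q_0\,\hat q_1+q_1\,\delta\hat q_0\right)+\int Du\,\max_v\left[-\frac{v^2}{2}+\ln2\cosh\left(\sqrt{\hat q_1}\,u+\sqrt{\delta\hat q_0}\,v\right)\right]\tag{45}$$

$$\mathcal G_S^{\mathrm{sph}}(q_1)=\frac12\,\frac{q_1}{1-q_1+\delta q_0}+\frac12\ln(1-q_1)\tag{46}$$

$$\mathcal G_E(q_1)=\int Dz_0\,\max_{z_1}\left[-\frac{z_1^2}{2}+\log H\left(-\frac{\sqrt{\Delta_1-\Delta_{-1}}\,z_0+\sqrt{\delta\Delta_0}\,z_1}{\sqrt{\Delta_2-\Delta_1}}\right)\right]\tag{47}$$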
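The two branches of eq. (50) can be checked against the finite-$m$ definition: with $q_0=q_1-\delta q_0/m$ as in eq. (48), the product $m(\Delta_1-\Delta_0)$ should approach $\delta q_0$ times the derivative of $\Delta$ at $q_1$. The sketch below does this numerically, assuming that $\Delta_0$ and $\Delta_1$ are the RS function $\Delta(\cdot)$ evaluated at $q_0$ and $q_1$; the function names and the test values are illustrative.

```python
import math

def kernel_sign(x):
    return 1.0 - (2.0 / math.pi) * math.acos(x)

def kernel_relu(x):
    return (math.sqrt(1.0 - x * x) / (2.0 * math.pi)
            + x / math.pi * math.atan(math.sqrt((1.0 + x) / (1.0 - x))))

def delta_delta0_sign(q1, dq0):
    """Sign branch of eq. (50)."""
    return 2.0 * dq0 / (math.pi * math.sqrt(1.0 - q1 * q1))

def delta_delta0_relu(q1, dq0):
    """ReLU branch of eq. (50)."""
    return dq0 / math.pi * math.atan(math.sqrt((1.0 + q1) / (1.0 - q1)))

def finite_m_value(kernel, q1, dq0, m):
    """m * (Delta_1 - Delta_0) with q0 = q1 - dq0 / m, as in eq. (48)."""
    q0 = q1 - dq0 / m
    return m * (kernel(q1) - kernel(q0))

q1, dq0 = 0.6, 0.3
for m in (10, 1000, 100000):
    print(m, finite_m_value(kernel_sign, q1, dq0, m), finite_m_value(kernel_relu, q1, dq0, m))
print("limit", delta_delta0_sign(q1, dq0), delta_delta0_relu(q1, dq0))
```

The finite-$m$ values approach the closed forms with corrections of order $1/m$.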

## Appendix E Distribution of stabilities

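The stability for a given pattern/label pair $(\xi^*,\sigma^*)$ is defined as:

$$\Xi(\xi^*,\sigma^*)=\frac{\sigma^*}{\sqrt K}\sum_{l=1}^Kc_l\,g\left(\sqrt{\frac KN}\sum_{i=1}^{N/K}W_{li}\,\xi^*_{li}\right)\tag{51}$$

The distribution over the training set for a typical solution can thus be computed as

$$P(\Xi)=\left\langle\frac{\int d\mu(W)\,\mathbb{X}_{\xi,\sigma}(W)\,\delta\left(\Xi-\Xi(\xi^1,\sigma^1)\right)}{\int d\mu(W)\,\mathbb{X}_{\xi,\sigma}(W)}\right\rangle_{\xi,\sigma}\tag{52}$$

where we arbitrarily chose the first pattern/label pair $(\xi^1,\sigma^1)$, without loss of generality. The expression can be computed by the replica method as usual, and the order parameters are simply obtained from the solutions of the saddle point equations for the free entropy. The resulting expression at the RS level is:

$$P(\Xi)=\Theta(\Xi)\int Dz\,\frac{G\left(\frac{\Xi-z\sqrt{\Delta-\Delta_{-1}}}{\sqrt{\Delta_2-\Delta}}\right)}{\sqrt{\Delta_2-\Delta}}\,H\left(-z\sqrt{\frac{\Delta-\Delta_{-1}}{\Delta_2-\Delta}}\right)^{-1}\tag{53}$$

where $G(x)=\frac{1}{\sqrt{2\pi}}e^{-x^2/2}$ is a standard Gaussian. The difference between the models (spherical/binary and sign/ReLU) is encoded in the different values for the overlaps and in the different expressions for the parameters $\Delta_{-1}$, $\Delta$, $\Delta_2$.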
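A quick numerical check of the RS stability distribution is that it integrates to one. The sketch below evaluates eq. (53) by quadrature, reading the Gaussian factor as normalized by $\sqrt{\Delta_2-\Delta}$ (the reading under which the density is normalized, since $\int_0^\infty d\Xi\,G((\Xi-a)/s)/s=H(-a/s)$). It assumes $H(x)=\tfrac12\,\mathrm{erfc}(x/\sqrt2)$; the function names, grids and cutoffs are illustrative choices.

```python
import math

def G(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def H(x):
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def stability_density(xi_value, delta_m1, delta, delta_2, z_max=8.0, points=201):
    """RS stability density of eq. (53), trapezoidal rule over z."""
    if xi_value <= 0.0:
        return 0.0                      # Heaviside factor Theta(Xi)
    a = math.sqrt(delta - delta_m1)     # weight of the common Gaussian field z
    s = math.sqrt(delta_2 - delta)      # width of the residual fluctuations
    h = 2.0 * z_max / (points - 1)
    total = 0.0
    for i in range(points):
        z = -z_max + i * h
        weight = 0.5 if i in (0, points - 1) else 1.0
        total += weight * G(z) * G((xi_value - a * z) / s) / (s * H(-a * z / s))
    return total * h

def total_mass(delta_m1, delta, delta_2, xi_max=12.0, bins=400):
    """Midpoint-rule integral of the density over Xi in (0, xi_max)."""
    h = xi_max / bins
    return h * sum(stability_density((k + 0.5) * h, delta_m1, delta, delta_2)
                   for k in range(bins))

# Sign activations at q = 0.5: Delta_{-1} = 0, Delta = 1/3, Delta_2 = 1.
print(total_mass(0.0, 1.0 / 3.0, 1.0))
```

The density is discontinuous at $\Xi=0$, which is why the outer integral uses midpoints rather than including the endpoint.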

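In the large deviation case, we simply compute the expression with a 1RSB ansatz and fix $q_1$ and $m$ as described in the previous section. The resulting expression is

$$P(\Xi)=\frac{\Theta(\Xi)}{\sqrt{\Delta_2-\Delta_1}}\int Dz_0\,\frac{\int Dz_1\,G\left(\frac{\Xi-z_0\sqrt{\Delta_0-\Delta_{-1}}-z_1\sqrt{\Delta_1-\Delta_0}}{\sqrt{\Delta_2-\Delta_1}}\right)H\left(-\frac{z_0\sqrt{\Delta_0-\Delta_{-1}}+z_1\sqrt{\Delta_1-\Delta_0}}{\sqrt{\Delta_2-\Delta_1}}\right)^{m-1}}{\int Dz_1\,H\left(-\frac{z_0\sqrt{\Delta_0-\Delta_{-1}}+z_1\sqrt{\Delta_1-\Delta_0}}{\sqrt{\Delta_2-\Delta_1}}\right)^m}\tag{54}$$

where the effective parameters $\Delta_{-1}$, $\Delta_0$, $\Delta_1$ and $\Delta_2$ are the same defined in section B.2. In the $m\to\infty$ limit the previous expression reduces to

$$P(\Xi)=\Theta(\Xi)\int Dz_0\,\frac{G\left(\frac{\Xi-z_0\sqrt{\Delta_1-\Delta_{-1}}-z_1^*\sqrt{\delta\Delta_0}}{\sqrt{\Delta_2-\Delta_1}}\right)}{\sqrt{\Delta_2-\Delta_1}}\,H\left(-\frac{z_0\sqrt{\Delta_1-\Delta_{-1}}+z_1^*\sqrt{\delta\Delta_0}}{\sqrt{\Delta_2-\Delta_1}}\right)^{-1}\tag{55}$$

where $z_1^*$ satisfies

$$z_1^*=\underset{z_1}{\mathrm{argmax}}\left[-\frac{z_1^2}{2}+\ln H\left(-\frac{z_0\sqrt{\Delta_1-\Delta_{-1}}+z_1\sqrt{\delta\Delta_0}}{\sqrt{\Delta_2-\Delta_1}}\right)\right]\tag{56}$$

where $\delta\Delta_0$ is defined in equation (50).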
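The inner maximization over $z_1$ has to be solved for every value of $z_0$. A minimal sketch of one way to do it is given below, reading the objective as $-z_1^2/2+\ln H(\cdot)$ in agreement with eq. (47) and assuming $H(x)=\tfrac12\,\mathrm{erfc}(x/\sqrt2)$. Since $H$ is log-concave the objective is concave in $z_1$, so a ternary search on a bracket suffices; the bracket, the function names and the parameter values are choices of this example, not the authors' procedure.

```python
import math

def log_H(x):
    """log of H(x) = 0.5 * erfc(x / sqrt(2)), with an asymptotic form for large x."""
    if x < 30.0:
        return math.log(0.5 * math.erfc(x / math.sqrt(2.0)))
    return -0.5 * x * x - math.log(x * math.sqrt(2.0 * math.pi))

def inner_objective(z1, z0, d_m1, d_1, d_2, dd0):
    """Function maximized over z1: -z1^2/2 + ln H(-(z0*sqrt(D1-D_{-1}) + z1*sqrt(dD0))/sqrt(D2-D1))."""
    arg = -(z0 * math.sqrt(d_1 - d_m1) + z1 * math.sqrt(dd0)) / math.sqrt(d_2 - d_1)
    return -0.5 * z1 * z1 + log_H(arg)

def argmax_z1(z0, d_m1, d_1, d_2, dd0, iterations=200):
    """Ternary search for the maximizer z1*, which is non-negative."""
    ratio = math.sqrt(dd0 / (d_2 - d_1))
    shift = -z0 * math.sqrt(d_1 - d_m1) / math.sqrt(d_2 - d_1)  # argument of H at z1 = 0
    # Stationarity gives z1* = ratio * G(x)/H(x) with x <= shift, and G/H(x) < max(x, 0) + 1.
    lo, hi = 0.0, ratio * (max(shift, 0.0) + 1.0) + 1.0
    for _ in range(iterations):
        left = lo + (hi - lo) / 3.0
        right = hi - (hi - lo) / 3.0
        if inner_objective(left, z0, d_m1, d_1, d_2, dd0) < inner_objective(right, z0, d_m1, d_1, d_2, dd0):
            lo = left
        else:
            hi = right
    return 0.5 * (lo + hi)

# Illustrative parameter values (not taken from the paper).
print(argmax_z1(z0=0.0, d_m1=0.0, d_1=0.5, d_2=1.0, dd0=0.5))
```

When $\delta\Delta_0=0$ the $H$ term no longer depends on $z_1$ and the maximizer is $z_1^*=0$.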