# Discrete perceptrons

Perceptrons have long been known as a promising tool within neural networks theory. The analytical treatment of a special class of perceptrons started in the seminal work of Gardner [Gar88]. The techniques initially employed to characterize perceptrons relied on a statistical mechanics approach. Many of the predictions obtained in [Gar88] (and in the follow-up [GarDer88]) were later established rigorously as mathematical facts (see, e.g. [SchTir02, SchTir03, TalBook, StojnicGardGen13, StojnicGardSphNeg13, StojnicGardSphErr13]). These typically related to spherical perceptrons. A lot of work has also been done on various other types of perceptrons. Among the most challenging are what we will refer to as the discrete perceptrons. An introductory statistical mechanics treatment of such perceptrons was given in [GutSte90]. Relying on the results of [Gar88], [GutSte90] characterized many features of several types of discrete perceptrons. In this paper, we consider a similar subclass of discrete perceptrons and provide a mathematically rigorous set of results related to their performance. As it will turn out, many of the statistical mechanics predictions obtained for discrete perceptrons will in fact appear as mathematically provable bounds. This in a way emulates the type of behavior we observed in [StojnicGardGen13, StojnicGardSphNeg13, StojnicGardSphErr13] when studying spherical perceptrons.


## 1 Introduction

In the last several decades there has been a lot of great work related to the analytical characterization of neural network performance. While neural networks have been known for quite some time, it is probably with the appearance of powerful statistical mechanics techniques that remarkable results related to the characterization of their performance started appearing. Of course, since the classical perceptrons are among the simplest and most fundamental tools within neural networks theory, it is no surprise that among the very first analytical characterizations were the ones related to them. Probably the most successful, and we would say the most widely known, is the seminal approach of Gardner, developed in [11] and complemented in the follow-up [12]. There, Gardner adapted the replica approach, by that time already well known, so that it could treat almost any feature of various perceptron models. She started the story, of course, with probably the simplest possible case, namely the spherical perceptron. Then she in [11], and she and Derrida in [12], proceeded with fairly accurate predictions/approximations for its storage capacities in several different scenarios: positive thresholds (we will often refer to such perceptrons as the positive spherical perceptrons), negative thresholds, correlated/uncorrelated patterns, patterns stored incorrectly, and many others.

While these predictions were believed to be either exact in some cases or fairly good approximations in others, they remained quite a mathematical challenge for a long time. Somewhat paradoxically, one may say that the first successful confirmation of some of the results from [11, 12] had actually arrived long before they appeared. Namely, for the special case of spherical perceptrons with zero thresholds, the storage capacity was already known, either explicitly within the neural networks community or within pure mathematics (see, e.g. [21, 10, 20, 37, 36, 8, 16, 6, 35]). However, the first real confirmation of the complete treatment presented in [11] appeared in [22, 23]. There the authors were able to confirm the predictions made in [11] related to the storage capacity of the positive spherical perceptrons. Moreover, they confirmed that the prediction from [11] related to the volume of the bond strengths that satisfy the perceptron dynamics is also correct. Later on, in [34], Talagrand reconfirmed these predictions through a somewhat different approach. In our own work [24] we also presented a simple framework that can be used to confirm many of the storage capacity predictions made in [11]. Moreover, in [29] we confirmed that the results presented in [11] related to the negative spherical perceptrons are rigorous upper bounds that in a certain range of problem parameters may even be lowered. Along the same lines, we then in [31] attacked a somewhat harder spherical perceptron type of problem, related to their functioning as erroneous storage memories. This problem was initially treated in [12] through an extension of the replica approach utilized in [11]. The predictions obtained based on such an approach were again proved to be rigorous upper bounds in [31]. Moreover, [31] hinted that while the predictions made in [12] are rigorous upper bounds, one may even be able to lower them in a certain range of parameters of interest.

Of course, as one may note, all the above mentioned initial treatments relate to the so-called spherical perceptrons. These have long been believed to be substantially easier for an analytical treatment than some other classes of perceptrons. On the other hand, we believe that among the most difficult for an analytical treatment are the ones that we will call discrete perceptrons. While we will give a detailed description of what we mean by discrete perceptrons below, we mention here that an introductory treatment of such perceptrons was already started in [11, 12]. There it was demonstrated that the framework designed to cover the spherical perceptron can in fact be used to obtain predictions for many other perceptrons as well, certainly among them what we will call discrete perceptrons. However, as already observed in [11, 12], it may happen that the treatment of such perceptrons is substantially more difficult than that of the spherical ones. To be a bit more specific, an initial set of results obtained for the storage capacity in the simple zero-thresholds case indicated that the variant of the framework given in [11] may not be able to match even the simple combinatorial results one can obtain in such a case. As a result, it was hinted that a more advanced version of the framework from [11] may be needed. In [15] the authors went a bit further and considered various other types of discrete perceptrons. For many of them they were able to provide a set of predictions similar to those given in [11] for the simple spherical ones. Moreover, they hinted at a potential way to bridge some of the deficiencies that the predictions given in [11] may have. In this paper we will also study several discrete perceptrons. On top of that, we will cover a “not so discrete” case which in a sense is a limiting case of some of the discrete cases studied in [15], and which itself was also studied in [15]. The framework that we will present will rigorously confirm that the results related to these classes of perceptrons obtained in [15], relying on the replica symmetry approach of [11], are in fact rigorous upper bounds on the true values. For the above mentioned “not so discrete” case it will turn out that the predictions made in [15] can in fact be proven exact.

Before going into the details of our approach, we will recall the basic definitions related to perceptrons that are needed for the analysis. Also, to make the presentation easier to follow, we find it useful to briefly sketch how the rest of the paper is organized. In Section 2 we will, as mentioned above, introduce a more formal mathematical description of how a perceptron operates. Along the same lines, we will formally present the several classes/types of perceptrons that we will study in later sections. In Section 3 we will present several results that are known for the classical spherical perceptron, as we will actually need some of them to establish the main results of this paper as well. In Sections 4, 5, and 6 we will discuss the three types of perceptrons that we plan to study in great detail. Finally, in Section 7 we will discuss the obtained results and present several concluding remarks.

## 2 Perceptrons as mathematical problems

To make this part of the presentation easier to follow, we will try to introduce all the important features of the perceptron that we will need here by closely following what was done in [11] (and, for that matter, in our recent work [24, 29, 31]). So, as in [11], we start with the following dynamics:

$$H_{ik}^{(t+1)}=\mathrm{sign}\Bigg(\sum_{j=1,j\neq k}^{n}H_{ij}^{(t)}X_{jk}-T_{ik}\Bigg). \tag{1}$$

Following [11], for any fixed $k$ we will call each $H_{ik}$ the Ising spin, i.e. $H_{ik}\in\{-1,1\}$. Continuing to follow [11], we will call $X_{jk}$ the interaction strength for the bond from site $j$ to site $k$. To be in complete agreement with [11], we in (1) also introduced the quantities $T_{ik}$; $T_{ik}$ is typically called the threshold for site $k$ in pattern $i$. However, to make the presentation easier to follow, we will typically assume that $T_{ik}=0$. Without going into further details, we mention though that all the results we present below can easily be modified to include scenarios where $T_{ik}\neq 0$.

Now, the dynamics presented in (1) works by moving from $H^{(t)}$ to $H^{(t+1)}$ and so on (of course, one assumes an initial configuration, say $H^{(0)}$). Moreover, the above dynamics will have a fixed point if, say, there are strengths $X_{jk}$, $1\leq j,k\leq n$, such that for any $i$ and $k$

$$H_{ik}\,\mathrm{sign}\Bigg(\sum_{j=1,j\neq k}^{n}H_{ij}X_{jk}-T_{ik}\Bigg)=1 \tag{2}$$
$$\Leftrightarrow\quad H_{ik}\Bigg(\sum_{j=1,j\neq k}^{n}H_{ij}X_{jk}-T_{ik}\Bigg)>0,\quad 1\leq i\leq m,\ 1\leq k\leq n.$$
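To make the dynamics in (1) and the fixed-point condition in (2) concrete, here is a minimal Python sketch (with the thresholds $T_{ik}$ set to $0$, as assumed above). The Hebbian choice of bond strengths $X_{jk}=h_jh_k/n$ is purely illustrative and not part of the setup studied in this paper; it simply produces a configuration that is easy to verify as stable.

```python
import random

def step(h, X):
    # one synchronous update of the dynamics (1), with thresholds T_ik = 0
    n = len(h)
    out = []
    for k in range(n):
        field = sum(h[j] * X[j][k] for j in range(n) if j != k)
        out.append(1 if field >= 0 else -1)
    return out

def is_fixed_point(h, X):
    # condition (2): h_k * (sum_{j != k} h_j X_{jk}) > 0 for every site k
    n = len(h)
    return all(h[k] * sum(h[j] * X[j][k] for j in range(n) if j != k) > 0
               for k in range(n))

random.seed(0)
n = 20
h = [random.choice([-1, 1]) for _ in range(n)]
# Illustrative (Hebbian) bond strengths, NOT the constraint sets studied below
X = [[h[j] * h[k] / n for k in range(n)] for j in range(n)]
print(is_fixed_point(h, X))   # True: h satisfies (2)
print(step(h, X) == h)        # True: one update of (1) leaves h unchanged
```

Under this illustrative choice the local field at site $k$ equals $\frac{n-1}{n}h_k$, so (2) holds with margin $\frac{n-1}{n}$.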

Of course, the above is a well-known property of a very general class of dynamics. In other words, unless one restricts the interaction strengths, the generality of the problem essentially makes it easy. After considering the general scenario introduced above, [11] then proceeded and specialized it to a particular case, which amounts to imposing spherical restrictions on $X$. A more mathematical description of the restrictions considered in [11] essentially boils down to the following constraints

$$\sum_{j=1}^{n}X_{ji}^{2}=1,\quad 1\leq i\leq n. \tag{3}$$

These were of course the same restrictions/constraints considered in a series of our own works [24, 31, 29]. In this paper, however, we will focus on a set of what we will call discrete restrictions/constraints. While the methods that we will present below are powerful enough to handle many different discrete restrictions, to avoid overloading the presentation and for clarity purposes we present here the following two types of discrete constraints.

$$X_{ji}\in\Big\{-\frac{1}{\sqrt{n}},\frac{1}{\sqrt{n}}\Big\},\quad 1\leq i\leq n,\ 1\leq j\leq m$$
$$X_{ji}\in\Big\{0,\frac{1}{\sqrt{n}}\Big\},\quad 1\leq i\leq n,\ 1\leq j\leq m. \tag{4}$$

We will call the perceptron operating with the first set of constraints given in (4) the $\pm 1$ perceptron (in fact, we may often refer to the bond strengths in such a perceptron as the ones from the set $\{-1,1\}$, although for scaling purposes we assumed the above more convenient set). Analogously, we will call the perceptron operating with the second set of constraints given in (4) the $0/1$ perceptron. Moreover, we will also consider a third type of perceptron that operates with the following constraints on the bond strengths

$$X_{ji}\in\Big[-\frac{1}{\sqrt{n}},\frac{1}{\sqrt{n}}\Big],\quad 1\leq i\leq n,\ 1\leq j\leq m. \tag{5}$$

We will refer to the perceptron operating with the set of constraints given in (5) as the box-constrained perceptron.

The fundamental question that one then typically considers is the so-called storage capacity of the above dynamics, or alternatively of the neural network that it would represent (of course, this is exactly one of the questions considered in [11]). Namely, one asks how many patterns (the $i$-th pattern being $H_{i,1:n}=[H_{i1},H_{i2},\dots,H_{in}]$) one can store so that there is an assurance that they are stored in a stable way. Moreover, since having the patterns be fixed points of the above introduced dynamics is not enough to ensure a finite basin of attraction, one often imposes a somewhat stronger threshold condition

$$H_{ik}\,\mathrm{sign}\Bigg(\sum_{j=1,j\neq k}^{n}H_{ij}X_{jk}-T_{ik}\Bigg)=1 \tag{6}$$
$$\Leftrightarrow\quad H_{ik}\Bigg(\sum_{j=1,j\neq k}^{n}H_{ij}X_{jk}-T_{ik}\Bigg)>\kappa,\quad 1\leq i\leq m,\ 1\leq k\leq n,$$

where $\kappa$ is typically a positive number. We will refer to a perceptron governed by the above dynamics, coupled with the spherical restrictions and a positive threshold $\kappa$, as the positive spherical perceptron (alternatively, when $\kappa$ is negative we refer to it as the negative spherical perceptron; for such a perceptron and the resulting mathematical problems/results see e.g. [29]).

Also, we should mention that beyond the above cases many other variants of neural network models are possible from a purely mathematical perspective. Moreover, many of them have found applications in various other fields as well. For example, a nice set of references that contains a collection of results related to various aspects of different neural network models and their bio- and many other applications is [2, 1, 4, 5, 3, 25, 7]. We should also mention that while we chose here a particular set of neural network models, the results that we will present below can be adapted to be of use in pretty much any other known model. Our goal here is to keep the presentation somewhat self-contained and clear, without too much overload. Because of that, we selected only a small number of cases for which we will present concrete results. A treatment of many others we will present elsewhere.

## 3 Known results

As mentioned above, our main interest in this paper is studying what we call discrete perceptrons. However, many of the results that we will present lean, either conceptually or even purely analytically, on results that we created for the so-called spherical perceptrons. In fact, quite a few technical details that we will need here we already needed when treating various aspects of the spherical perceptrons, see e.g. [29, 24, 31]. In that sense we find it useful to have some of the well-known spherical perceptron results readily available. So, before proceeding with the problems that we will study in great detail, we first recall several results known for the standard spherical perceptron.

In the first of the subsections below we hence look at the spherical perceptrons, and in the following one we present a few results known for the discrete perceptrons. That way it will also be easier to later properly position the results we intend to present here within the scope of what is already known.

### 3.1 Spherical perceptron

We should preface this brief presentation of the known results by mentioning that way more is known than what we will present below. However, we restrict ourselves to the facts that we deem of most use for the presentation that follows in later sections.

#### 3.1.1 Statistical mechanics

We of course start by recalling what was presented in [11]. In [11] a replica type of approach was designed, and based on it a characterization of the storage capacity was presented. Before showing what exactly such a characterization looks like, we first formally define it. Namely, throughout the paper we will assume the so-called linear regime, i.e. we will consider the scenario where the length $n$ and the number of different patterns $m$ are large but proportional to each other. Moreover, we will denote the proportionality ratio by $\alpha$ (where obviously $\alpha$ is a constant independent of $n$) and will set

$$m=\alpha n. \tag{7}$$

Now, assuming that $H_{ik}$, $1\leq i\leq m$, $1\leq k\leq n$, are i.i.d. symmetric Bernoulli random variables, [11], using the replica approach, gave the following estimate for $\alpha$ so that (6) holds with overwhelming probability (under overwhelming probability we will in this paper assume a probability that is no more than a number exponentially decaying in $n$ away from $1$):

$$\alpha_c(\kappa)=\Bigg(\frac{1}{\sqrt{2\pi}}\int_{-\kappa}^{\infty}(z+\kappa)^{2}e^{-\frac{z^{2}}{2}}dz\Bigg)^{-1}. \tag{8}$$

Based on the above characterization, one then has that $\alpha_c(\kappa)$ achieves its maximum over positive $\kappa$'s as $\kappa\rightarrow 0$. In fact, one easily has

$$\lim_{\kappa\rightarrow 0}\alpha_c(\kappa)=2. \tag{9}$$

Also, to be completely exact, in [11] it was predicted that the storage capacity relation from (8) holds for the range $\kappa\geq 0$.
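As a quick numerical sanity check of (8) and (9), one can evaluate the Gaussian integral directly. The following is a sketch using a simple midpoint rule; the cutoff at $z=12$ is an assumption that the tail beyond it is negligible.

```python
import math

def alpha_c(kappa, steps=100000, upper=12.0):
    """Evaluate (8): the reciprocal of (1/sqrt(2*pi)) * int_{-kappa}^{inf} (z+kappa)^2 e^{-z^2/2} dz."""
    a = -kappa
    h = (upper - a) / steps
    total = 0.0
    for i in range(steps):
        z = a + (i + 0.5) * h  # midpoint rule; the integrand is negligible beyond z = 12
        total += (z + kappa) ** 2 * math.exp(-z * z / 2)
    return math.sqrt(2 * math.pi) / (total * h)

print(alpha_c(0.0))  # ~ 2.0, matching (9)
print(alpha_c(1.0))  # ~ 0.52: the capacity shrinks as the margin kappa grows
```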

#### 3.1.2 Rigorous results – positive spherical perceptron (κ≥0)

The result given in (9) is of course well known and has been rigorously established either as a pure mathematical fact or even in the context of neural networks and pattern recognition [21, 10, 20, 37, 36, 8, 16, 6, 35]. In more recent work [22, 23, 34] the authors also considered the storage capacity of the spherical perceptron and established that (8) also holds when $\kappa\geq 0$. In our own work [24] we revisited the storage capacity problems and presented an alternative mathematical approach that was also powerful enough to reestablish the storage capacity prediction given in (8). Below we formalize the results obtained in [22, 23, 34, 24].

###### Theorem 1.

[22, 23, 34, 24] Let $H$ be an $m\times n$ matrix with i.i.d. symmetric Bernoulli components. Let $n$ be large and let $m=\alpha n$, where $\alpha>0$ is a constant independent of $n$. Let $\alpha_c(\kappa)$ be as in (8) and let $\kappa\geq 0$ be a scalar constant independent of $n$. If $\alpha>\alpha_c(\kappa)$, then with overwhelming probability there is no $x$ such that $\|x\|_2=1$ and (6) is feasible. On the other hand, if $\alpha<\alpha_c(\kappa)$, then with overwhelming probability there is an $x$ such that $\|x\|_2=1$ and (6) is feasible.

###### Proof.

Presented in various forms in [22, 23, 34, 24]. ∎

As mentioned earlier, the results given in the above theorem essentially settle the storage capacity of the positive spherical perceptron, or the Gardner problem, in a statistical sense (it is rather clear, but we do mention, that the overwhelming probability statement in the above theorem is taken with respect to the randomness of $H$). However, strictly speaking, they relate only to the positive spherical perceptron. It is not clear if they would automatically translate to the case of the negative spherical perceptron. As we hinted earlier, the case of the negative spherical perceptron ($\kappa<0$) may be more of interest from a purely mathematical point of view than from, say, the neural networks point of view. Nevertheless, such a mathematical problem may turn out to be a bit harder than the one corresponding to the standard positive case. In fact, in [34] Talagrand conjectured (Conjecture 8.4.4) that the above mentioned $\alpha_c(\kappa)$ remains an upper bound on the storage capacity even when $\kappa<0$, i.e. even in the case of the negative spherical perceptron. This conjecture was confirmed in our own work [24]. In the following subsection we briefly summarize what in fact was shown in [24].

#### 3.1.3 Rigorous results – negative spherical perceptron (κ<0)

In our recent work [29] we went a step further and considered the negative version of the standard spherical perceptron. While the results that we will present later on in Sections 4, 5, and 6 will relate to any $\kappa$, our main concern will be the neural networks point of view, and consequently the emphasis will be on the positive case, i.e. on the $\kappa\geq 0$ scenario. Still, in our own view the results related to the negative spherical perceptron are important, as they hint that already in the spherical case things may not be as easy as they may seem based on the results of [11, 22, 23, 34, 24] for the positive spherical perceptron.

Moreover, a few technical details needed for presenting the results in later sections were already observed in [24, 29], and we find it convenient to recall them while revisiting the negative spherical perceptron. This will, in our view, substantially facilitate the exposition that follows.

We first recall that in [29] we studied the so-called uncorrelated case of the spherical perceptron (more on the equally important correlated case can be found in e.g. [24, 11]). This is the same scenario that we will study here (so the simplifications that we made in [29], and that we are about to present below, will be in place later on as well). In the uncorrelated case, one views all patterns $H_{i,1:n}$, $1\leq i\leq m$, as uncorrelated (as expected, $H_{i,1:n}$ stands for the vector $[H_{i1},H_{i2},\dots,H_{in}]$). Now, the following becomes the corresponding version of the question of interest mentioned above: assuming that $H$ is an $m\times n$ matrix with i.i.d. symmetric Bernoulli entries and that $\|x\|_2=1$, how large can $\alpha=\frac{m}{n}$ be so that the following system of linear inequalities is satisfied with overwhelming probability:

$$Hx\geq\kappa. \tag{10}$$

This of course is the same as asking how large $\alpha$ can be so that the following optimization problem is feasible with overwhelming probability:

$$Hx\geq\kappa,\qquad \|x\|_2=1. \tag{11}$$

To see that (10) and (11) indeed match the above described fixed point condition, it is enough to observe that due to statistical symmetry one can assume that all $H_{ik}=1$. Also, the constraints essentially decouple over the columns of $X$ (so one can then think of $x$ in (10) and (11) as one of the columns of $X$). Moreover, the dimension of $H$ in (10) and (11) should be changed to $m\times(n-1)$; however, since we consider the large-$n$ scenario, to make writing easier we keep the dimension as $m\times n$. Also, as mentioned to a great extent in [24, 29, 31], we will, without loss of generality, treat $H$ in (11) as if it has i.i.d. standard normal components. Moreover, in [24] we also recognized that (11) can be rewritten as the following optimization problem

$$\xi_n=\min_{x}\max_{\lambda\geq 0}\;\kappa\lambda^T\mathbf{1}-\lambda^THx\quad\text{subject to}\quad\|\lambda\|_2=1,\ \|x\|_2=1, \tag{12}$$

where $\mathbf{1}$ is an $m$-dimensional column vector of all $1$'s. Clearly, if $\xi_n\leq 0$ then (11) is feasible. On the other hand, if $\xi_n>0$ then (11) is not feasible. That basically means that if we can probabilistically characterize the sign of $\xi_n$, then we have a way of determining the range of $\alpha$ for which (11) is feasible. That is exactly what we did in [24], on an ultimate level for $\kappa\geq 0$ and on, say, an upper-bounding level for $\kappa<0$. Relying on the strategy developed in [30, 28] and on a set of results from [13, 14], we in [24] proved the following theorem, which essentially extends Theorem 1 to the $\kappa<0$ case and thereby resolves Conjecture 8.4.4 from [34] in the positive:

###### Theorem 2.

[24] Let $H$ be an $m\times n$ matrix with i.i.d. standard normal components. Let $n$ be large and let $m=\alpha n$, where $\alpha>0$ is a constant independent of $n$. Let $\xi_n$ be as in (12) and let $\kappa$ be a scalar constant independent of $n$. Let all $\epsilon$'s be arbitrarily small constants independent of $n$. Further, let $g_i$ be a standard normal random variable and set

$$f_{gar}(\kappa)=\frac{1}{\sqrt{2\pi}}\int_{-\kappa}^{\infty}(g_i+\kappa)^{2}e^{-\frac{g_i^{2}}{2}}dg_i=\frac{\kappa e^{-\frac{\kappa^{2}}{2}}}{\sqrt{2\pi}}+\frac{(\kappa^{2}+1)\,\mathrm{erfc}\big(-\frac{\kappa}{\sqrt{2}}\big)}{2}. \tag{13}$$

Let $\xi_n^{(l)}$ and $\xi_n^{(u)}$ be scalars such that

$$\big(1-\epsilon_1^{(m)}\big)\sqrt{\alpha f_{gar}(\kappa)}-\big(1+\epsilon_1^{(n)}\big)-\epsilon_5^{(g)} > \frac{\xi_n^{(l)}}{\sqrt{n}}$$
$$\big(1+\epsilon_1^{(m)}\big)\sqrt{\alpha f_{gar}(\kappa)}-\big(1-\epsilon_1^{(n)}\big)+\epsilon_5^{(g)} < \frac{\xi_n^{(u)}}{\sqrt{n}}. \tag{14}$$

If $\kappa\geq 0$, then

$$\lim_{n\rightarrow\infty}P\big(\xi_n^{(l)}\leq\xi_n\leq\xi_n^{(u)}\big)=\lim_{n\rightarrow\infty}P\Big(\xi_n^{(l)}\leq\min_{\|x\|_2=1}\max_{\|\lambda\|_2=1,\lambda_i\geq 0}\big(\kappa\lambda^T\mathbf{1}-\lambda^THx\big)\leq\xi_n^{(u)}\Big)\geq 1. \tag{15}$$

Moreover, if $\kappa<0$, then

$$\lim_{n\rightarrow\infty}P\big(\xi_n\geq\xi_n^{(l)}\big)=\lim_{n\rightarrow\infty}P\Big(\min_{\|x\|_2=1}\max_{\|\lambda\|_2=1,\lambda_i\geq 0}\big(\kappa\lambda^T\mathbf{1}-\lambda^THx\big)\geq\xi_n^{(l)}\Big)\geq 1. \tag{16}$$
###### Proof.

Presented in [24]. ∎

In more informal language (essentially ignoring all technicalities and $\epsilon$'s), one has that as long as

$$\alpha>\frac{1}{f_{gar}(\kappa)}, \tag{17}$$

the problem in (11) will be infeasible with overwhelming probability. On the other hand, one has that when $\kappa\geq 0$, as long as

$$\alpha<\frac{1}{f_{gar}(\kappa)}, \tag{18}$$

the problem in (11) will be feasible with overwhelming probability. This of course settles the $\kappa\geq 0$ case completely and essentially establishes the storage capacity as $\frac{1}{f_{gar}(\kappa)}$, which matches the prediction given in the introductory analysis presented in [11] and was of course rigorously confirmed by the results of [22, 23, 34]. On the other hand, when $\kappa<0$, it only shows that with overwhelming probability the storage capacity is not higher than the quantity given in [11]. As mentioned above, this confirms Talagrand's Conjecture 8.4.4 from [34]. However, it does not settle problem (question) 8.4.2 from [34].

The results obtained based on the above theorem, as well as those obtained based on Theorem 1, are presented in Figure 1. When $\kappa\geq 0$ the curve indicates the exact breaking point between the “overwhelming” feasibility and infeasibility of (11). On the other hand, when $\kappa<0$ the curve is only an upper bound on the storage capacity, i.e. for any value of the pair $(\alpha,\kappa)$ that is above the curve given in Figure 1, (11) is infeasible with overwhelming probability.

Since the $\kappa<0$ case did not appear settled based on the above presented results, in [29] we attempted to lower the upper bounds given in Theorem 2. We created a fairly powerful mechanism that produced the following theorem as a way of characterizing the storage capacity of the negative spherical perceptron.

###### Theorem 3.

Let $H$ be an $m\times n$ matrix with i.i.d. standard normal components. Let $n$ be large and let $m=\alpha n$, where $\alpha>0$ is a constant independent of $n$. Let $\kappa<0$ be a scalar constant independent of $n$. Set

$$\hat{\gamma}^{(s)}=\frac{2c_3^{(s)}+\sqrt{4\big(c_3^{(s)}\big)^{2}+16}}{8}, \tag{19}$$

and

$$I_{sph}\big(c_3^{(s)}\big)=\hat{\gamma}^{(s)}-\frac{1}{2c_3^{(s)}}\log\Bigg(1-\frac{c_3^{(s)}}{2\hat{\gamma}^{(s)}}\Bigg). \tag{20}$$

Set

$$p=1+\frac{c_3^{(s)}}{2\gamma_{per}^{(s)}},\quad q=\frac{c_3^{(s)}\kappa}{2\gamma_{per}^{(s)}},\quad r=\frac{c_3^{(s)}\kappa^{2}}{4\gamma_{per}^{(s)}},\quad s=-\kappa\sqrt{p}+\frac{q}{\sqrt{p}},\quad C=\frac{\exp\big(\frac{q^{2}}{2p}-r\big)}{\sqrt{p}}, \tag{21}$$

and

$$I_{per}^{(1)}\big(c_3^{(s)},\gamma_{per}^{(s)},\kappa\big)=\frac{1}{2}\mathrm{erfc}\Big(\frac{\kappa}{\sqrt{2}}\Big)+\frac{C}{2}\,\mathrm{erfc}\Big(\frac{s}{\sqrt{2}}\Big). \tag{22}$$

Further, set

$$I_{per}\big(c_3^{(s)},\alpha,\kappa\big)=\max_{\gamma_{per}^{(s)}\geq 0}\Bigg(\gamma_{per}^{(s)}+\frac{\alpha}{c_3^{(s)}}\log\Big(I_{per}^{(1)}\big(c_3^{(s)},\gamma_{per}^{(s)},\kappa\big)\Big)\Bigg). \tag{23}$$

If $\alpha$ is such that

$$\min_{c_3^{(s)}\geq 0}\Bigg(-\frac{c_3^{(s)}}{2}+I_{sph}\big(c_3^{(s)}\big)+I_{per}\big(c_3^{(s)},\alpha,\kappa\big)\Bigg)<0, \tag{24}$$

then (11) is infeasible with overwhelming probability.

###### Proof.

Presented in [29]. ∎

The results one can obtain for the storage capacity based on the above theorem are presented in Figure 2 (as mentioned in [29], due to the numerical optimizations involved, the results presented in Figure 2 should be taken only as an illustration; also, as discussed in [29], taking $c_3^{(s)}\rightarrow 0$ in Theorem 3 produces the results of Theorem 2). Even as such, they indicate that a visible improvement in the values of the storage capacity may be possible, though in a range of values of $\alpha$ substantially larger than $2$ (i.e. in a range of $\kappa$'s somewhat smaller than zero). While at this point this observation may look unrelated to the problem that we will consider in the following section, one should keep it in mind (an essentially conceptually similar conclusion will be made later on when we study the capacities with limited errors).

### 3.2 Discrete perceptrons

Below we present the results/predictions known for the discrete perceptrons. We will mostly focus on the $\pm 1$ perceptron, as that one has been studied the most extensively throughout the literature. The known results related to the other two cases that we will study here, namely the $0/1$ and box-constrained perceptrons, we find easier to discuss in parallel as we present our own (these results are a bit involved and we believe it will be easier to discuss them once we have a few other technical details set up).

Before presenting the concrete known results in this direction, we recall the problem given in (10) and (11) and how it changes as one moves from the spherical to the $\pm 1$ constraints. Following what was done in Section 3.1.3, one can ask how large $\alpha$ can be so that the following optimization problem is feasible with overwhelming probability:

$$Hx\geq\kappa,\qquad x_i^{2}=1,\ 1\leq i\leq n. \tag{25}$$

We do, of course, recall that the dimension of $H$ is again $m\times n$ and that $m=\alpha n$, where $\alpha$ is a constant independent of $n$.

As was the case in the previous subsection, we should again preface this brief presentation of the known results by mentioning that way more is known than what we will present below. However, we restrict ourselves to the facts that we deem of most use for the presentation that follows in later sections.

#### 3.2.1 Statistical mechanics

As far as a statistical mechanics approach to the $\pm 1$ perceptron goes, its analytical characterization was to a degree already started in [11]. Although the main (or the more successful) concern of [12] was the spherical perceptron, it was also observed that the $\pm 1$ perceptron can be handled through the replica mechanisms introduced therein. In a nutshell, what was shown in [12] (and later also observed in [15]) related to the $\pm 1$ perceptron was the following: assuming that $H_{ik}$, $1\leq i\leq m$, $1\leq k\leq n$, are i.i.d. symmetric Bernoulli random variables, then if $\alpha$ is such that

$$\alpha_c(\kappa)=\frac{2}{\pi}\Bigg(\frac{1}{\sqrt{2\pi}}\int_{-\kappa}^{\infty}(z+\kappa)^{2}e^{-\frac{z^{2}}{2}}dz\Bigg)^{-1}, \tag{26}$$

then (6) holds with overwhelming probability, with the restriction on $X$ being the first one given in (4). Stated in other words (possibly in a more convenient way), if $\alpha<\alpha_c(\kappa)$ with $\alpha_c(\kappa)$ as in (26), then (25) is feasible with overwhelming probability.

Based on the above characterization, one then has that $\alpha_c(\kappa)$ achieves its maximum over positive $\kappa$'s as $\kappa\rightarrow 0$. In fact, one easily has

$$\lim_{\kappa\rightarrow 0}\alpha_c(\kappa)=\frac{4}{\pi}. \tag{27}$$
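A quick check of (26) and (27) is easy to sketch in Python; the closed form of the Gaussian integral via $\mathrm{erfc}$ is the one appearing in (13), and the value $1$ it is compared against is the simple combinatorial bound derived in Section 3.2.3.

```python
import math

def f_gar(kappa):
    # closed form of the Gaussian integral in (26), cf. (13)
    return (kappa * math.exp(-kappa ** 2 / 2) / math.sqrt(2 * math.pi)
            + (kappa ** 2 + 1) * math.erfc(-kappa / math.sqrt(2)) / 2)

def alpha_c_pm1(kappa):
    # replica-symmetric prediction (26) for the +-1 perceptron
    return (2 / math.pi) / f_gar(kappa)

print(alpha_c_pm1(0.0))        # 4/pi ~ 1.2732, i.e. (27)
print(alpha_c_pm1(0.0) > 1.0)  # True: it exceeds the simple combinatorial bound of 1
```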

Of course, it was immediately pointed out already in [12] that the above prediction is essentially not sustainable. In fact, not only was this pointed out because of the potential instability of the replica approach used in [11]; it was actually rigorously argued through simple combinatorial arguments that $\alpha_c(0)\leq 1$. Many other problems remained open. For example, while it was obvious already based on the considerations presented in [12] that the storage capacity prediction of $\frac{4}{\pi}$ for the $\kappa=0$ case is an upper bound, it was not clear if one can make such a safe prediction for the entire range of the parameter $\kappa$.

Of course, the above considerations then left the replica treatment presented in [11] a bit powerless when it comes to the $\pm 1$ scenario (at the very least in the special case of the so-called zero thresholds, i.e. when $\kappa=0$). However, many other great works in this direction followed, attempting to resolve the problem. A couple of them relied on the statistical mechanics approach mentioned above as well. As one may expect (and as had already been hinted in [12]), the natural next extension of the approach presented in [11, 12] would have been to start breaking the replica symmetry. A study in this direction was presented in [18]. However, as such studies typically run into substantial numerical problems, the authors in [18] resorted to a clever way of predicting the critical value for the storage capacity by taking the value where the entropy becomes zero. For $\kappa=0$, that gave an estimate of $\alpha\approx 0.83$, substantially lower than $1$, which is what the above mentioned simple combinatorial bound gives. A similar argument was repeated in [15] for the $\pm 1$ perceptron and extended to the $0/1$ perceptron and a few other discrete perceptrons studied therein.

#### 3.2.2 Rigorous results – ±1 perceptron

As far as rigorous results go, we should mention that not much seems to be known. While that does not necessarily mean that the problem is hard, it may imply that it is not super easy either. Among the very first rigorous results are probably those from [17]. Roughly speaking, in [17] the authors showed that there is a small positive constant such that for any $\alpha$ below it, (25) is feasible with overwhelming probability. While these bounds can be improved, improving them to reach anywhere close to the prediction of [18, 15] does not seem super easy.

We should also mention a seemingly unrelated line of work of Talagrand. Namely, Talagrand studied a variant of the above problem through a more general partition function type of approach, see e.g. [34]. While he was able to show that a replica symmetry type of approach produces rigorous results for such a consideration, he was able to do so only in the so-called high-temperature regime. However, the problem that he considers boils down to the one of interest here exactly in the opposite, low-temperature, regime.

#### 3.2.3 Simple combinatorial bound – ±1 perceptron

Since we have mentioned it on a couple of occasions in the above discussion, we find it useful to also present the simple approach one can use to upper-bound the storage capacity of many discrete perceptrons (and certainly of the $\pm 1$ perceptron that we consider here). While these bounds may not have been explicitly presented in [12], the approach that we present below follows the same strategy, and we frame it as a known result. Namely, one starts by looking at how likely it is that each of the inequalities in (25) is satisfied. A simple consideration then gives

$$P\left(H_{i,:}x\geq \kappa\,|\,x\right)=P(g\geq \kappa)=\frac{1}{2}\operatorname{erfc}\left(\frac{\kappa}{\sqrt{2}}\right),\quad 1\leq i\leq m. \tag{28}$$

After accounting for all the inequalities in (25) (essentially all the rows of $H$) one then further has

$$P(Hx\geq \kappa\,|\,x)=\left(P\left(H_{i,:}x\geq \kappa\,|\,x\right)\right)^{m}. \tag{29}$$

Using the union bound over all $2^{n}$ allowed $x$ then gives

$$P(\exists x\,|\,Hx\geq\kappa)\leq 2^{n}P(Hx\geq\kappa\,|\,x)=2^{n}\left(P\left(H_{i,:}x\geq\kappa\,|\,x\right)\right)^{m}. \tag{30}$$

A combination of (28) and (30) then gives

$$P(\exists x\,|\,Hx\geq\kappa)\leq 2^{n}\left(\frac{1}{2}\operatorname{erfc}\left(\frac{\kappa}{\sqrt{2}}\right)\right)^{m}. \tag{31}$$

From (31) one then has that if $\alpha=\frac{m}{n}$ is such that

$$\alpha>-\frac{\log(2)}{\log\left(\frac{1}{2}\operatorname{erfc}\left(\frac{\kappa}{\sqrt{2}}\right)\right)}, \tag{32}$$

then

$$\lim_{n\to\infty}P(\exists x\,|\,Hx\geq\kappa)\leq\lim_{n\to\infty}2^{n}\left(\frac{1}{2}\operatorname{erfc}\left(\frac{\kappa}{\sqrt{2}}\right)\right)^{m}=0. \tag{33}$$

The upper bounds one can obtain on the storage capacity based on the above consideration (in particular based on (32)) are presented in Figure 3. Of course, these bounds can be improved (as mentioned earlier, one possible such improvement was already presented in [17]). However, here our goal is more to recall the results that relate to the ones we will present in this paper than to present the best possible ones.
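As a side note, the bound in (32) is trivial to evaluate numerically. The following minimal Python sketch (our own illustration; the function name is ours) computes it for a given $\kappa$; at $\kappa=0$ the per-inequality probability in (28) is exactly $1/2$, so the bound evaluates to $\alpha=1$:

```python
import math

def alpha_comb(kappa):
    # Simple combinatorial upper bound from (32):
    # alpha > -log(2) / log((1/2) * erfc(kappa / sqrt(2)))
    p = 0.5 * math.erfc(kappa / math.sqrt(2.0))  # P(g >= kappa), cf. (28)
    return -math.log(2.0) / math.log(p)

print(alpha_comb(0.0))  # bound at kappa = 0 (equals 1 analytically)
print(alpha_comb(1.5))  # the bound shrinks as the threshold kappa grows
```

The monotone decrease in $\kappa$ simply reflects that a larger threshold makes each inequality in (25) harder to satisfy.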

## 4 ±1 perceptrons

In this section we will present a collection of mathematically rigorous results related to $\pm 1$ perceptrons. We will rely on many simplifications of the original perceptron setup from Section 2 introduced in [24, 29, 31] and presented in Section 3. To that end we start by recalling that, for all practical purposes needed here (and those we needed in [24, 29, 31]), the storage capacity of the $\pm 1$ perceptron can be considered through the feasibility problem given in (25), which we restate below:

$$Hx\geq \kappa,\qquad x_{i}^{2}=\frac{1}{n},\ 1\leq i\leq n. \tag{34}$$

We recall as well that, as argued in [24, 29, 31] (and as mentioned in the previous section), one can assume that the elements of $H$ are i.i.d. standard normals and that the dimension of $H$ is $m\times n$, where as earlier we keep the linear regime, i.e. continue to assume that $m=\alpha n$ where $\alpha$ is a constant independent of $n$. Now, if all inequalities in (34) are satisfied, the established dynamics will be stable and all $m$ patterns can be successfully stored. Following the strategy presented in [24, 29, 31] (and briefly recalled in Section 3.1.3) one can then reformulate (34) so that the feasibility problem of interest becomes

$$\xi_{\pm 1}=\min_{x}\max_{\lambda\geq 0}\ \kappa\lambda^{T}1-\lambda^{T}Hx\quad\text{subject to}\quad \|\lambda\|_{2}=1,\ x_{i}^{2}=\frac{1}{n},\ 1\leq i\leq n. \tag{35}$$

Clearly, following the logic we presented in Section 3.1.3, the sign of $\xi_{\pm 1}$ determines the feasibility of (34). In particular, if $\xi_{\pm 1}>0$ then (34) is infeasible. Given the random structure of the problem (we recall that $H$ is random) one can then pose the following probabilistic feasibility question: how small can $\alpha$ be so that $\xi_{\pm 1}$ in (35) is positive and (34) is infeasible with overwhelming probability? In what follows we will attempt to provide an answer to such a question.
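For intuition, the equivalence between the sign of $\xi_{\pm 1}$ and the (in)feasibility of (34) can be verified by brute force for small $n$. The sketch below is our own illustration (not part of the paper's machinery); it uses the fact that, for a fixed $x$, the inner maximization in (35) has the closed form $\max_{\|\lambda\|_2=1,\lambda\geq 0}\lambda^{T}v=\|v_{+}\|_2$ whenever $v=\kappa 1-Hx$ has a positive component (and $\max_i v_i$ otherwise):

```python
import itertools
import numpy as np

def xi_pm1(H, kappa):
    # Brute-force evaluation of the min-max in (35) for small n
    m, n = H.shape
    best = np.inf
    for signs in itertools.product([-1.0, 1.0], repeat=n):
        x = np.array(signs) / np.sqrt(n)      # enforces x_i^2 = 1/n
        v = kappa * np.ones(m) - H @ x        # v = kappa*1 - H x
        vp = np.maximum(v, 0.0)
        # max over {||lambda||_2 = 1, lambda >= 0} of lambda^T v
        inner = np.linalg.norm(vp) if vp.any() else v.max()
        best = min(best, inner)
    return best

def feasible(H, kappa):
    # Is there an x with x_i^2 = 1/n satisfying H x >= kappa (cf. (34))?
    m, n = H.shape
    return any((H @ (np.array(s) / np.sqrt(n)) >= kappa).all()
               for s in itertools.product([-1.0, 1.0], repeat=n))

rng = np.random.default_rng(0)
H = rng.standard_normal((8, 5))
# xi_pm1 > 0 exactly when (34) has no feasible x
print(xi_pm1(H, 0.5) > 0, not feasible(H, 0.5))
```

The two printed booleans always agree, which is exactly the sign/feasibility correspondence described above.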

### 4.1 Probabilistic analysis

In this section we will present a probabilistic analysis of the above optimization problem given in (35). In a nutshell, we will provide a relation between $\kappa$ and $\alpha$ so that $\xi_{\pm 1}>0$ with overwhelming probability over $H$. Based on the above discussion, this will then of course be enough to conclude that the problem in (34) is infeasible with overwhelming probability when $\kappa$ and $\alpha$ satisfy such a relation.

The analysis that we present below will to a degree rely on a strategy we developed in [30, 28] and utilized in [24] when studying the storage capacity of the standard spherical perceptron. We start by recalling a set of probabilistic results from [13, 14] that were used as an integral part of the strategy developed in [30, 28, 24].

###### Theorem 4.

([14, 13]) Let $X_{ij}$ and $Y_{ij}$, $1\leq i\leq n$, $1\leq j\leq m$, be two centered Gaussian processes which satisfy the following inequalities for all choices of indices:

1. $E(X_{ij}^{2})=E(Y_{ij}^{2})$;
2. $E(X_{ij}X_{ik})\geq E(Y_{ij}Y_{ik})$;
3. $E(X_{ij}X_{lk})\leq E(Y_{ij}Y_{lk})$, $i\neq l$.

Then

$$P\left(\bigcap_{i}\bigcup_{j}\left(X_{ij}\geq\lambda_{ij}\right)\right)\leq P\left(\bigcap_{i}\bigcup_{j}\left(Y_{ij}\geq\lambda_{ij}\right)\right).$$

The following simpler version of the above theorem relates to the expected values.

###### Theorem 5.

([13, 14]) Let $X_{ij}$ and $Y_{ij}$, $1\leq i\leq n$, $1\leq j\leq m$, be two centered Gaussian processes which satisfy the following inequalities for all choices of indices:

1. $E(X_{ij}^{2})=E(Y_{ij}^{2})$;
2. $E(X_{ij}X_{ik})\geq E(Y_{ij}Y_{ik})$;
3. $E(X_{ij}X_{lk})\leq E(Y_{ij}Y_{lk})$, $i\neq l$.

Then

$$E\left(\min_{i}\max_{j}X_{ij}\right)\leq E\left(\min_{i}\max_{j}Y_{ij}\right).$$

Now, since all random quantities of interest below concentrate around their mean values, it would be enough to study only their averages. However, since it will not make the writing of what we intend to present in the remaining parts of this section substantially more complicated, we will present a complete probabilistic treatment and leave the study of the expected values for the presentation in the following subsection, where such a consideration will substantially simplify the exposition.

We will make use of Theorem 4 through the following lemma (the lemma is an easy consequence of Theorem 4 and is in fact fairly similar to Lemma 3.1 in [14]; see also [26, 24] for similar considerations).

###### Lemma 1.

Let $H$ be an $m\times n$ matrix with i.i.d. standard normal components. Let $g$ and $h$ be $m\times 1$ and $n\times 1$ vectors, respectively, with i.i.d. standard normal components. Also, let $g$ be a standard normal random variable and let $\zeta_{\lambda}$ be a function of $x$. Then

$$P\left(\min_{x_{i}^{2}=\frac{1}{n}}\max_{\|\lambda\|_{2}=1,\lambda_{i}\geq 0}\left(-\lambda^{T}Hx+g-\zeta_{\lambda}\right)\geq 0\right)\geq P\left(\min_{x_{i}^{2}=\frac{1}{n}}\max_{\|\lambda\|_{2}=1,\lambda_{i}\geq 0}\left(g^{T}\lambda+h^{T}x-\zeta_{\lambda}\right)\geq 0\right). \tag{36}$$
###### Proof.

The proof is basically similar to the proof of Lemma 3.1 in [14] as well as to the proof of Lemma 7 in [26]. The only difference is in the allowed sets of values for $x$ and $\lambda$. Such a difference introduces no structural changes in the proof, though. ∎
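To see the direction of such comparisons numerically, here is a small Monte Carlo sanity check of the expected-value version (our own illustration, not part of the proof). We discretize $\lambda$ over the canonical basis vectors $e_1,\dots,e_m$ and $x$ over the scaled hypercube vertices (both of unit norm, so the covariance conditions of Theorem 5 can be checked directly), and compare the averages of the coupled quantity $\min_x\max_j(-(Hx)_j+g)$ and the decoupled quantity $\min_x\max_j(g_j+h^{T}x)$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, trials = 5, 5, 2000

# x ranges over the 2^n hypercube vertices scaled so that x_i^2 = 1/n
X = np.array(np.meshgrid(*([[-1.0, 1.0]] * n))).reshape(n, -1).T / np.sqrt(n)

coupled, decoupled = [], []
for _ in range(trials):
    H = rng.standard_normal((m, n))
    g_vec = rng.standard_normal(m)   # the vector g of Lemma 1
    h = rng.standard_normal(n)       # the vector h of Lemma 1
    g = rng.standard_normal()        # the scalar g of Lemma 1
    # coupled side: min over x of max over lambda in {e_1..e_m} of (-lambda^T H x + g)
    coupled.append(np.min(np.max(-(X @ H.T), axis=1)) + g)
    # decoupled side: min over x of max over lambda in {e_1..e_m} of (g^T lambda + h^T x)
    decoupled.append(np.max(g_vec) + np.min(X @ h))

print(np.mean(decoupled), "<=", np.mean(coupled))
```

The empirical averages respect the ordering predicted by the comparison theorem: the decoupled process lower-bounds the coupled one in expectation.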

Let $\zeta_{\lambda}=-\kappa\lambda^{T}1+\epsilon_{5}^{(g)}\sqrt{n}+\xi_{\pm 1}^{(l)}$ with $\epsilon_{5}^{(g)}>0$ being an arbitrarily small constant independent of $n$. We will first look at the right-hand side of the inequality in (36). The following is then the probability of interest:

$$P\left(\min_{x_{i}^{2}=\frac{1}{n}}\max_{\|\lambda\|_{2}=1,\lambda_{i}\geq 0}\left(g^{T}\lambda+h^{T}x+\kappa\lambda^{T}1-\epsilon_{5}^{(g)}\sqrt{n}\right)\geq\xi_{\pm 1}^{(l)}\right). \tag{37}$$

After solving the inner maximization over $\lambda$ and the minimization over $x$ one obtains

$$P\left(\min_{x_{i}^{2}=\frac{1}{n}}\max_{\|\lambda\|_{2}=1,\lambda_{i}\geq 0}\left(g^{T}\lambda+h^{T}x+\kappa\lambda^{T}1-\epsilon_{5}^{(g)}\sqrt{n}\right)\geq\xi_{\pm 1}^{(l)}\right)=P\left(\|(g+\kappa 1)_{+}\|_{2}-\frac{1}{\sqrt{n}}\sum_{i=1}^{n}|h_{i}|-\epsilon_{5}^{(g)}\sqrt{n}\geq\xi_{\pm 1}^{(l)}\right), \tag{38}$$

where $(g+\kappa 1)_{+}$ is the vector $(g+\kappa 1)$ with the negative components replaced by zeros. Following line by line what was done in [24] after the analogous equation one then has

$$P\left(\min_{x_{i}^{2}=\frac{1}{n}}\max_{\|\lambda\|_{2}=1,\lambda_{i}\geq 0}\left(g^{T}\lambda+h^{T}x+\kappa\lambda^{T}1-\epsilon_{5}^{(g)}\sqrt{n}\right)\geq\xi_{\pm 1}^{(l)}\right)\geq\left(1-e^{-\epsilon_{2}^{(m)}m}\right)\left(1-e^{-\epsilon_{2}^{(n)}n}\right)P\left(\left(1-\epsilon_{1}^{(m)}\right)\sqrt{mf_{gar}(\kappa)}-\left(1+\epsilon_{1}^{(n)}\right)\sqrt{\frac{2}{\pi}}\sqrt{n}-\epsilon_{5}^{(g)}\sqrt{n}\geq\xi_{\pm 1}^{(l)}\right), \tag{39}$$

where

$$f_{gar}(\kappa)=\frac{1}{\sqrt{2\pi}}\int_{-\kappa}^{\infty}\left(g_{i}+\kappa\right)^{2}e^{-\frac{g_{i}^{2}}{2}}dg_{i}, \tag{40}$$

$\epsilon_{1}^{(m)}$ and $\epsilon_{1}^{(n)}$ are arbitrarily small positive constants, and $\epsilon_{2}^{(m)}$ and $\epsilon_{2}^{(n)}$ are constants possibly dependent on $\epsilon_{1}^{(m)}$ and $\epsilon_{1}^{(n)}$, respectively, but independent of $n$. If

$$\left(1-\epsilon_{1}^{(m)}\right)\sqrt{mf_{gar}(\kappa)}-\left(1+\epsilon_{1}^{(n)}\right)\sqrt{\frac{2}{\pi}}\sqrt{n}-\epsilon_{5}^{(g)}\sqrt{n}>\xi_{\pm 1}^{(l)}\ \Leftrightarrow\ \left(1-\epsilon_{1}^{(m)}\right)\sqrt{\alpha f_{gar}(\kappa)}-\left(1+\epsilon_{1}^{(n)}\right)\sqrt{\frac{2}{\pi}}-\epsilon_{5}^{(g)}>\frac{\xi_{\pm 1}^{(l)}}{\sqrt{n}}, \tag{41}$$

one then has from (39)

$$\lim_{n\to\infty}P\left(\min_{x_{i}^{2}=\frac{1}{n}}\max_{\|\lambda\|_{2}=1,\lambda_{i}\geq 0}\left(g^{T}\lambda+h^{T}x+\kappa\lambda^{T}1-\epsilon_{5}^{(g)}\sqrt{n}\right)\geq\xi_{\pm 1}^{(l)}\right)\geq 1. \tag{42}$$

We will also need the following simple estimate related to the left hand side of the inequality in (36). From (36) one has the following as the probability of interest

$$P\left(\min_{x_{i}^{2}=\frac{1}{n}}\max_{\|\lambda\|_{2}=1,\lambda_{i}\geq 0}\left(\kappa\lambda^{T}1-\lambda^{T}Hx+g-\epsilon_{5}^{(g)}\sqrt{n}-\xi_{\pm 1}^{(l)}\right)\geq 0\right). \tag{43}$$

Following again what was done in [24] between the analogous equations one has, assuming that (41) holds,

$$\lim_{n\to\infty}P\left(\min_{x_{i}^{2}=\frac{1}{n}}\max_{\|\lambda\|_{2}=1,\lambda_{i}\geq 0}\left(\kappa\lambda^{T}1-\lambda^{T}Hx\right)\geq\xi_{\pm 1}^{(l)}\right)\geq\lim_{n\to\infty}P\left(\min_{x_{i}^{2}=\frac{1}{n}}\max_{\|\lambda\|_{2}=1,\lambda_{i}\geq 0}\left(g^{T}\lambda+h^{T}x+\kappa\lambda^{T}1-\epsilon_{5}^{(g)}\sqrt{n}\right)\geq\xi_{\pm 1}^{(l)}\right)\geq 1. \tag{44}$$

We summarize the above results in the following theorem.

###### Theorem 6.

Let $H$ be an $m\times n$ matrix with i.i.d. standard normal components. Let $n$ be large and let $m=\alpha n$, where $\alpha>0$ is a constant independent of $n$. Let $\xi_{\pm 1}$ be as in (35) and let $\kappa\geq 0$ be a scalar constant independent of $n$. Let all $\epsilon$'s be arbitrarily small constants independent of $n$. Further, let $g_{i}$ be a standard normal random variable and set

$$f_{gar}(\kappa)=\frac{1}{\sqrt{2\pi}}\int_{-\kappa}^{\infty}\left(g_{i}+\kappa\right)^{2}e^{-\frac{g_{i}^{2}}{2}}dg_{i}=\frac{\kappa e^{-\frac{\kappa^{2}}{2}}}{\sqrt{2\pi}}+\frac{\left(\kappa^{2}+1\right)\operatorname{erfc}\left(-\frac{\kappa}{\sqrt{2}}\right)}{2}. \tag{45}$$

Let $\xi_{\pm 1}^{(l)}$ be a scalar such that

$$\left(1-\epsilon_{1}^{(m)}\right)\sqrt{\alpha f_{gar}(\kappa)}-\left(1+\epsilon_{1}^{(n)}\right)\sqrt{\frac{2}{\pi}}-\epsilon_{5}^{(g)}>\frac{\xi_{\pm 1}^{(l)}}{\sqrt{n}}. \tag{46}$$

Then

$$\lim_{n\to\infty}P\left(\xi_{\pm 1}\geq\xi_{\pm 1}^{(l)}\right)=\lim_{n\to\infty}P\left(\min_{x_{i}^{2}=\frac{1}{n}}\max_{\|\lambda\|_{2}=1,\lambda_{i}\geq 0}\left(\kappa\lambda^{T}1-\lambda^{T}Hx\right)\geq\xi_{\pm 1}^{(l)}\right)\geq 1. \tag{47}$$
###### Proof.

Follows from the above discussion and the analysis presented in [24]. ∎

In more informal language (as earlier, essentially ignoring all technicalities and $\epsilon$'s) one has that as long as

$$\alpha>\frac{2}{\pi}\frac{1}{f_{gar}(\kappa)}, \tag{48}$$

the problem in (34) will be infeasible with overwhelming probability. It is an easy exercise to show that the right-hand side of (48) matches the right-hand side of (26). This is then enough to conclude that the prediction for the storage capacity given in [12] for the $\pm 1$ perceptron is in fact a rigorous upper bound on its true value.

The results obtained based on the above theorem, as well as those predicted based on the replica theory and given in (26) (and of course in [12]), are presented in Figure 4. For the values of $\alpha$ that are to the right of the given curve the memory will not operate correctly with overwhelming probability. This of course follows from the fact that, with overwhelming probability over $H$, the inequalities in (34) will not be simultaneously satisfiable.
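The quantities in (45) and (48) are straightforward to evaluate numerically. The sketch below (our own illustration; function names are ours) cross-checks the closed form in (45) against direct numerical integration of (40) and evaluates the upper bound (48); at $\kappa=0$ one has $f_{gar}(0)=\frac{1}{2}$ and hence the bound $\alpha>\frac{4}{\pi}\approx 1.27$:

```python
import math

def f_gar(kappa):
    # Closed form from (45)
    return (kappa * math.exp(-kappa ** 2 / 2) / math.sqrt(2 * math.pi)
            + (kappa ** 2 + 1) * math.erfc(-kappa / math.sqrt(2)) / 2)

def f_gar_numeric(kappa, steps=200000, upper=12.0):
    # Midpoint-rule evaluation of the integral in (40), as a cross-check
    lo = -kappa
    h = (upper - lo) / steps
    total = sum((lo + (i + 0.5) * h + kappa) ** 2
                * math.exp(-(lo + (i + 0.5) * h) ** 2 / 2)
                for i in range(steps))
    return h * total / math.sqrt(2 * math.pi)

def alpha_upper(kappa):
    # Rigorous upper bound on the storage capacity, cf. (48)
    return (2 / math.pi) / f_gar(kappa)

print(alpha_upper(0.0))  # 4/pi, approximately 1.2732
```

Note that at $\kappa=0$ this rigorous bound ($4/\pi$) sits above the simple combinatorial bound of Section 3.2.3, which is precisely the gap revisited in the next subsection.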

### 4.2 Lowering the storage capacity upper bound

The results we presented in the previous section provide a rigorous upper bound on the storage capacity of the $\pm 1$ perceptron. As we mentioned in Section 3, it had been known already from the initial considerations in [12] that the upper bounds we presented in the previous sections are, for certain values of $\kappa$, strict (and in fact quite far away from the optimal values). In this section we will follow the strategy we employed in [29, 31] for studying scenarios where the standard upper bounds are potentially non-exact. Such a strategy essentially attempts to lower the upper bounds provided in the previous subsection. It does so by attempting to lift the lower bounds on $\xi_{\pm 1}$. After doing so, we will be in a position to reveal an interesting phenomenon happening in the analysis of $\pm 1$ perceptrons. Namely, in a certain range of $\kappa$ the upper bounds of the previous sections will indeed end up being lowered by the strategy that we will present. However, it will turn out that the only lowering we were able to uncover is the one that corresponds to the simple combinatorial bounds given in Section 3.2.3. Before arriving at such a conclusion, though, we will need to resolve a few technical problems.

Before proceeding further with the presentation of the above mentioned strategy, we first recall a few technical details from the previous sections that we will need here again. We start by recalling the optimization problem that we will consider here. As is probably obvious, it is basically the one given in (35):

$$\xi_{\pm 1}=\min_{x}\max_{\lambda\geq 0}\ \kappa\lambda^{T}1-\lambda^{T}Hx\quad\text{subject to}\quad \|\lambda\|_{2}=1,\ x_{i}^{2}=\frac{1}{n}. \tag{49}$$

As mentioned below (35), a probabilistic characterization of the sign of $\xi_{\pm 1}$ would be enough to determine the storage capacity or its bounds. Below, we provide a way similar to the one from the previous subsection that can also be used to probabilistically characterize $\xi_{\pm 1}$. Moreover, as mentioned at the beginning of the previous section, since $\xi_{\pm 1}$ concentrates around its mean, for our purposes here it will be enough to study only its mean $E(\xi_{\pm 1})$. We do so by relying on the strategy developed in [27] (and employed in [29, 31]) and ultimately on the following set of results from [13]. (The following theorem presented in [27] is in fact a slight alteration of the original results from [13].)

###### Theorem 7.

([13]) Let $X_{ij}$ and $Y_{ij}$, $1\leq i\leq n$, $1\leq j\leq m$, be two centered Gaussian processes which satisfy the following inequalities for all choices of indices:

1. $E(X_{ij}^{2})=E(Y_{ij}^{2})$;
2. $E(X_{ij}X_{ik})\geq E(Y_{ij}Y_{ik})$;
3. $E(X_{ij}X_{lk})\leq E(Y_{ij}Y_{lk})$, $i\neq l$.

Let $\psi_{ij}(\cdot)$ be increasing functions on the real axis. Then

$$E\left(\min_{i}\max_{j}\psi_{ij}(X_{ij})\right)\leq E\left(\min_{i}\max_{j}\psi_{ij}(Y_{ij})\right).$$

Moreover, let $\psi_{ij}(\cdot)$ be decreasing functions on the real axis. Then

$$E\left(\max_{i}\min_{j}\psi_{ij}(X_{ij})\right)\geq E\left(\max_{i}\min_{j}\psi_{ij}(Y_{ij})\right).$$
###### Proof.

The proof of all statements but the last one is of course given in [13]. The proof of the last statement follows trivially and, in a slightly different scenario, is given for completeness in [27]. ∎

The strategy that we will present below will utilize the above theorem to lift the above mentioned lower bound on $\xi_{\pm 1}$ (of course, since we talk in probabilistic terms, under a bound on $\xi_{\pm 1}$ we essentially assume a bound on $E(\xi_{\pm 1})$). We do mention again that in Section 4.1 we relied on a variant of the above theorem to create a probabilistic lower bound on $\xi_{\pm 1}$. However, the strategy employed in Section 4.1 relied only on a basic version of the above theorem which assumes $\psi_{ij}(x)=x$. Here, we will substantially upgrade the strategy from Section 4.1 by looking at a very simple (but way better) different version of $\psi_{ij}(x)$.
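The mechanism behind an exponential choice of $\psi_{ij}$ can be seen from the elementary soft-min inequality that it rests on: by Jensen's inequality, $-\frac{1}{c_3}\log E(e^{-c_3 Z})\leq E(Z)$ for any random variable $Z$ and any $c_3>0$, and the left-hand side moves from $E(Z)$ (as $c_3\to 0$) toward the infimum of $Z$ (as $c_3\to\infty$). A minimal numeric sketch of this effect (our own illustration, on an empirical sample):

```python
import math
import random

random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(100000)]
mean_z = sum(samples) / len(samples)

def soft_min(c3):
    # -(1/c3) * log E[exp(-c3 * Z)]: a lower bound on E[Z] by Jensen,
    # sliding toward min(Z) as c3 grows
    return -math.log(sum(math.exp(-c3 * z) for z in samples) / len(samples)) / c3

for c3 in (0.1, 0.5, 1.0, 2.0):
    # for Z standard normal the population value is exactly -c3/2
    print(c3, soft_min(c3), "<=", mean_z)
```

On the empirical measure the bound holds deterministically, and the printed values decrease in $c_3$, illustrating the tunable gap that the lifting strategy exploits.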

#### 4.2.1 Lifting the lower bound on ξ±1

In [27, 29] we established lemmas very similar to the following one.

###### Lemma 2.

Let $H$ be an $m\times n$ matrix with i.i.d. standard normal components. Let $g$ and $h$ be $m\times 1$ and $n\times 1$ vectors, respectively, with i.i.d. standard normal components. Also, let $g$ be a standard normal random variable and let $c_{3}$ be a positive constant. Then

$$E\left(\max_{x_{i}^{2}=\frac{1}{n}}\min_{\|\lambda\|_{2}=1,\lambda_{i}\geq 0}e^{-c_{3}\left(-\lambda^{T}Hx+g+\kappa\lambda^{T}1\right)}\right)\leq E\left(\max_{x_{i}^{2}=\frac{1}{n}}\min_{\|\lambda\|_{2}=1,\lambda_{i}\geq 0}e^{-c_{3}\left(g^{T}\lambda+h^{T}x+\kappa\lambda^{T}1\right)}\right). \tag{50}$$
###### Proof.

The proof is the same as the proof of the corresponding lemma in [27]. The only difference is in the structure of the sets of allowed values for $x$ and $\lambda$. However, such a difference introduces no structural changes in the proof. ∎

Following step by step what was done after Lemma 3 in [27] one arrives at the following analogue of the corresponding equation in [27]:

$$E\left(\min_{x_{i}^{2}=\frac{1}{n}}\max_{\|\lambda\|_{2}=1,\lambda_{i}\geq 0}\left(-\lambda^{T}Hx+\kappa\lambda^{T}1\right)\right)\geq\frac{c_{3}}{2}-\frac{1}{c_{3}}\log\left(E\left(\max_{x_{i}^{2}=\frac{1}{n}}e^{-c_{3}h^{T}x}\right)\right)-\frac{1}{c_{3}}\log\left(E\left(\min_{\|\lambda\|_{2}=1,\lambda_{i}\geq 0}e^{-c_{3}\left(g^{T}\lambda+\kappa\lambda^{T}1\right)}\right)\right). \tag{51}$$

Let $c_{3}=c_{3}^{(s)}\sqrt{n}$ where $c_{3}^{(s)}$ is a constant independent of $n$. Then (51) becomes

$$\frac{E\left(\min_{x_{i}^{2}=\frac{1}{n}}\max_{\|\lambda\|_{2}=1,\lambda_{i}\geq 0}\left(-\lambda^{T}Hx+\kappa\lambda^{T}1\right)\right)}{\sqrt{n}}\geq-\left(-\frac{c_{3}^{(s)}}{2}+I_{\pm 1}(c_{3}^{(s)})+I_{sph}(c_{3}^{(s)},\alpha,\kappa)\right), \tag{52}$$

where

$$I_{\pm 1}(c_{3}^{(s)})=\frac{1}{nc_{3}^{(s)}}\log\left(E\left(\max_{x_{i}^{2}=\frac{1}{n}}e^{-c_{3}^{(s)}\sqrt{n}h^{T}x}\right)\right),$$

$$I_{sph}(c_{3}^{(s)},\alpha,\kappa)=\frac{1}{nc_{3}^{(s)}}\log\left(E\left(\min_{\|\lambda\|_{2}=1,\lambda_{i}\geq 0}e^{-c_{3}^{(s)}\sqrt{n}\left(g^{T}\lambda+\kappa\lambda^{T}1\right)}\right)\right).$$
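Although the derivation continues from here, the $x$-side term can already be made explicit, which offers a useful check. Since $x_{i}=\pm\frac{1}{\sqrt{n}}$, the maximum of $e^{-c_{3}^{(s)}\sqrt{n}h^{T}x}$ over $x$ is $e^{c_{3}^{(s)}\sum_{i}|h_{i}|}$, and independence across the components of $h$ then gives $I_{\pm 1}(c_{3}^{(s)})=\frac{1}{c_{3}^{(s)}}\log E(e^{c_{3}^{(s)}|h_{1}|})=\frac{c_{3}^{(s)}}{2}+\frac{1}{c_{3}^{(s)}}\log\left(2\Phi(c_{3}^{(s)})\right)$, with $\Phi$ the standard normal cdf. The sketch below (our own cross-check, not the paper's code) verifies this closed form by Monte Carlo:

```python
import math
import random

def I_pm1_closed(c):
    # c/2 + log(2*Phi(c))/c, using E[exp(c*|h|)] = 2*exp(c^2/2)*Phi(c)
    Phi = 0.5 * (1.0 + math.erf(c / math.sqrt(2.0)))
    return c / 2.0 + math.log(2.0 * Phi) / c

def I_pm1_mc(c, trials=400000, seed=0):
    # Monte Carlo estimate of (1/c) * log E[exp(c*|h|)], h standard normal
    rnd = random.Random(seed)
    acc = sum(math.exp(c * abs(rnd.gauss(0.0, 1.0))) for _ in range(trials))
    return math.log(acc / trials) / c

print(I_pm1_closed(0.5), I_pm1_mc(0.5))
```

The two printed values agree to Monte Carlo accuracy, confirming that the per-component factorization behind $I_{\pm 1}$ is sound.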