## 1 Introduction

In this paper we study a special type of the classical spherical perceptron problem. Spherical perceptrons are a well-studied class of problems with applications in various fields, ranging from neural networks and statistical physics/mechanics to high-dimensional geometry and biology. While spherical-perceptron-like problems had been known for a long time (for various mathematical versions see, e.g. [19, 11, 18, 35, 34, 9, 16, 6, 33]), it is probably the work of Gardner [12] that brought them into the research spotlight. One would be inclined to believe that the main reason for that was Gardner's ability to quantify many features of spherical perceptrons that were not easy to handle through the standard mathematical tools typically used in earlier works. Namely, in [12], Gardner introduced a fairly neat type of analysis based on a statistical mechanics approach typically called the replica theory. As a result she was able to quantify almost all of the spherical perceptron's typical features of interest. While some of the results she obtained were known (for example, the storage capacity with zero thresholds, see, e.g. [19, 11, 18, 35, 34, 9, 16, 6, 33]), many others were not (the storage capacity with non-zero thresholds, the typical volume of interaction strengths for which the memory functions properly, the storage capacities of memories with errors, and so on). Moreover, many of the results that she obtained remained mathematical conjectures (either as quantities believed to be exact predictions or as quantities that may be solid approximations). In recent years some of those that had been believed to be exact have indeed been rigorously proved (see, e.g. [20, 21, 32, 22]), whereas many of those believed to be solid approximations have been shown to be at the very least rigorous bounds (see, e.g. [22, 28]).

In this paper we also look at one of the features of the spherical perceptron. The quantity of interest is closely related to the well-known storage capacity. Namely, we attempt to evaluate the storage capacity of the spherical perceptron; however, instead of insisting that all patterns be memorized correctly, we allow a certain fraction of patterns to be memorized incorrectly. Throughout the paper, we will often refer to the capacity of such a memory as the storage capacity with errors. This problem was already studied in [12], and a nice set of observations related to it was made there. Here, we attempt to confirm many of them through a mathematically rigorous analysis.

Before going into the details of our approach we will recall the basic definitions related to the spherical perceptron and needed for its analysis. Also, to make the presentation easier to follow we find it useful to briefly sketch how the rest of the paper is organized. In Section 2 we will, as mentioned above, introduce a more formal mathematical description of how a perceptron operates. In Section 3 we will present several results that are known for the classical spherical perceptron. In Section 4 we will discuss the storage capacity when errors are allowed and recall the known results. In Section 5 we will present a powerful mechanism that can be used to prove that many of the known results are actually rigorous bounds on the quantities of interest. In Section 6 we will then present a further refinement of the mechanism from Section 5 that can be used to potentially lower the values of the storage capacity obtained in Section 5. Finally, in Section 7 we will discuss the obtained results and present several concluding remarks.

## 2 Mathematical setup of a perceptron

To make this part of the presentation easier to follow we will try to introduce all important features of the spherical perceptron that we will need here by closely following what was done in [12] (and for that matter in our recent work [22, 28]). So, as in [12], we start with the following dynamics:

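$$H_{ik}^{(t+1)}=\mathrm{sign}\left(\sum_{j=1,j\neq k}^{n}H_{ij}^{(t)}X_{jk}-T_{ik}\right). \tag{1}$$

Following [12], for any fixed $1\leq i\leq m$ we will call each $H_{ij}$, $1\leq j\leq n$, the Ising spin, i.e. $H_{ij}\in\{-1,1\}$ for all $i,j$. Continuing to follow [12], we will call $X_{jk}$, $1\leq j\leq n$, the interaction strength for the bond from site $j$ to site $k$. To be in complete agreement with [12], we in (1) also introduced quantities $T_{ik}$, $1\leq i\leq m$, $1\leq k\leq n$; $T_{ik}$ is typically called the threshold for site $k$ in pattern $i$. However, to make the presentation easier to follow, we will typically assume that $T_{ik}=0$. Without going into further details we mention though that all the results that we will present below can be easily modified so that they include scenarios where $T_{ik}\neq 0$.

Now, the dynamics presented in (1) works by moving from $t$ to $t+1$ and so on (of course one assumes an initial configuration for, say, $t=0$). Moreover, the above dynamics will have a fixed point if, say, there are strengths $X_{jk}$, $1\leq j\leq n$, $1\leq k\leq n$, such that for any $1\leq i\leq m$

$$H_{ik}\left(\sum_{j=1,j\neq k}^{n}H_{ij}X_{jk}-T_{ik}\right)>0,\qquad 1\leq k\leq n. \tag{2}$$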

Of course, the above is a well known property of a very general class of dynamics. In other words, unless one specifies the interaction strengths the generality of the problem essentially makes it easy. After considering the general scenario introduced above, [12] then proceeded and specialized it to a particular case which amounts to including spherical restrictions on $X$. A more mathematical description of such restrictions considered in [12] essentially boils down to the following constraints

$$\sum_{j=1}^{n}X_{ji}^2=1,\qquad 1\leq i\leq n. \tag{3}$$

The fundamental question that one typically considers then is the so-called storage capacity of the above dynamics or alternatively of a neural network that it would represent (of course this is exactly one of the questions considered in [12]). Namely, one then asks how many patterns $m$ (the $i$-th pattern being $H_{i,1:n}$) one can store so that there is an assurance that they are stored in a stable way. Moreover, since having patterns being fixed points of the above introduced dynamics is not enough to ensure a finite basin of attraction, one often imposes a somewhat stronger threshold condition

$$H_{ik}\left(\sum_{j=1,j\neq k}^{n}H_{ij}X_{jk}-T_{ik}\right)>\kappa,\qquad 1\leq i\leq m,\ 1\leq k\leq n, \tag{4}$$

where typically $\kappa$ is a positive number. We will refer to a perceptron governed by the above dynamics and coupled with the spherical restrictions and a positive threshold $\kappa$ as the positive spherical perceptron. Alternatively, when $\kappa$ is negative we will refer to it as the negative spherical perceptron (such a perceptron may be of more interest from a purely mathematical point of view than as a neural network concept; nevertheless we will view it as an interesting mathematical problem; consequently, we will on occasion, in addition to the results that we will present for the standard positive perceptron, present quite a few results related to the negative case as well).

Also, we should mention that beyond the above mentioned negative case many other variants of the model that we study here are possible from a purely mathematical perspective. Moreover, many of them have found applications in various other fields as well. For example, a nice set of references that contains a collection of results related to various aspects of different neural network models and their biological and many other applications is [2, 1, 4, 5, 3, 23, 8].

## 3 Standard spherical perceptron – known results

As mentioned above, our main interest in this paper will be a particular type of the spherical perceptron, namely the one that functions as a memory with a limited fraction of errors. However, before proceeding with the problem that we will study here in great detail we find it useful to first recall several results known for the standard spherical perceptron, i.e. the one that functions as a storage memory without errors. That way it will be easier to properly position the results we intend to present here within the scope of what is already known.

### 3.1 Statistical mechanics

We start by recalling what was presented in [12]. There, a replica type of approach was designed and, based on it, a characterization of the storage capacity was presented. Before showing what exactly such a characterization looks like we first formally define it. Namely, throughout the paper we assume the so-called *linear* regime, i.e. a scenario where the length and the number of different patterns, $n$ and $m$, respectively, are large but proportional to each other. Moreover, we denote the proportionality ratio by $\alpha$ (where $\alpha$ is a constant independent of $n$) and set

$$m=\alpha n. \tag{5}$$

Now, assuming that $H_{ij}$, $1\leq i\leq m$, $1\leq j\leq n$, are i.i.d. symmetric Bernoulli random variables, [12], using the replica approach, gave the following estimate for $\alpha$ so that (4) holds with overwhelming probability (by overwhelming probability we mean in this paper a probability that differs from $1$ by no more than a quantity exponentially decaying in $n$)

$$\alpha_c(\kappa)=\left(\frac{1}{\sqrt{2\pi}}\int_{-\kappa}^{\infty}(z+\kappa)^2e^{-\frac{z^2}{2}}\,dz\right)^{-1}. \tag{6}$$

Based on the above characterization one then has that $\alpha_c$ achieves its maximum over positive $\kappa$'s as $\kappa\rightarrow 0$. One in fact easily then has

$$\lim_{\kappa\rightarrow 0}\alpha_c(\kappa)=2. \tag{7}$$

Also, to be completely exact, in [12] it was predicted that the storage capacity relation from (6) holds for the range $\kappa\geq 0$.
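As a quick numerical illustration of the capacity prediction discussed above, the following sketch evaluates the Gardner formula in its standard form, $\alpha_c(\kappa)=\left(\frac{1}{\sqrt{2\pi}}\int_{-\kappa}^{\infty}(z+\kappa)^2e^{-z^2/2}dz\right)^{-1}$. The closed form used in the code, $(1+\kappa^2)\Phi(\kappa)+\kappa\phi(\kappa)$ for the integral (with $\Phi$ and $\phi$ the standard normal distribution and density functions), follows from elementary Gaussian integrals. The function names are hypothetical and chosen only for this example; the midpoint-rule routine is included merely as an independent cross-check.

```python
import math


def gardner_capacity(kappa: float) -> float:
    """Closed form of 1 / E[(z + kappa)^2 ; z > -kappa] for z ~ N(0, 1)."""
    pdf = math.exp(-0.5 * kappa * kappa) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(kappa / math.sqrt(2.0)))
    return 1.0 / ((1.0 + kappa * kappa) * cdf + kappa * pdf)


def gardner_capacity_numeric(kappa: float, upper: float = 12.0,
                             steps: int = 200_000) -> float:
    """Same quantity via a midpoint rule on [-kappa, upper].

    The Gaussian tail beyond `upper` is negligible for moderate kappa.
    """
    width = (upper + kappa) / steps
    total = 0.0
    for i in range(steps):
        z = -kappa + (i + 0.5) * width
        total += (z + kappa) ** 2 * math.exp(-0.5 * z * z)
    # The integral is total * width / sqrt(2 * pi); the capacity is its inverse.
    return math.sqrt(2.0 * math.pi) / (total * width)


if __name__ == "__main__":
    for kappa in (0.0, 0.5, 1.0, 2.0):
        print(kappa, gardner_capacity(kappa), gardner_capacity_numeric(kappa))
```

At $\kappa=0$ the closed form returns exactly $2$, and the capacity decreases as $\kappa$ grows, in agreement with the discussion above.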

### 3.2 Rigorous results – positive spherical perceptron ($\kappa\geq 0$)

The result given in (7) is of course well known and has been rigorously established either as a pure mathematical fact or even in the context of neural networks and pattern recognition [19, 11, 18, 35, 34, 9, 16, 6, 33]. In more recent work [20, 21, 32] the authors also considered the storage capacity of the spherical perceptron and established that (6) also holds when $\kappa\geq 0$. In our own work [22] we revisited the storage capacity problems and presented an alternative mathematical approach that was also powerful enough to reestablish the storage capacity prediction given in (6). We below formalize the results obtained in [20, 21, 32, 22].

###### Theorem 1.

[20, 21, 32, 22] Let $H$ be an $m\times n$ matrix with i.i.d. Bernoulli components. Let $n$ be large and let $m=\alpha n$, where $\alpha>0$ is a constant independent of $n$. Let $\alpha_c$ be as in (6) and let $\kappa\geq 0$ be a scalar constant independent of $n$. If $\alpha>\alpha_c$ then with overwhelming probability there will be no $x$ such that $\|x\|_2=1$ and (4) is feasible. On the other hand, if $\alpha<\alpha_c$ then with overwhelming probability there will be an $x$ such that $\|x\|_2=1$ and (4) is feasible.

As mentioned earlier, the results given in the above theorem essentially settle the storage capacity of the positive spherical perceptron or the Gardner problem. However, there are a couple of facts that should be pointed out (emphasized):

1) The results presented above relate to the *positive* spherical perceptron. It is not clear at all if they would automatically translate to the case of the negative spherical perceptron. As we hinted earlier, the case of the negative spherical perceptron ($\kappa<0$) may be of more interest from a purely mathematical point of view than from, say, the neural networks point of view. Nevertheless, such a mathematical problem may turn out to be a bit harder than the one corresponding to the standard positive case. In fact, in [32], Talagrand conjectured (Conjecture 8.4.4) that the above mentioned $\alpha_c$ remains an upper bound on the storage capacity even when $\kappa<0$, i.e. even in the case of the negative spherical perceptron. However, he does seem to leave it as an open problem what the exact value of the storage capacity in the negative case should be. In our own work [22] we confirmed Talagrand's conjecture and showed that even in the negative case $\alpha_c$ from (6) is indeed an upper bound on the storage capacity.

2) It is rather clear but we do mention that the overwhelming probability statement in the above theorem is taken with respect to the randomness of $H$. To analyze the feasibility of (9) (stated in Section 3.3) we in [22] relied on a mechanism we recently developed for studying various optimization problems in [29]. Such a mechanism works for various types of randomness. However, the easiest way to present it was to assume that the underlying randomness is standard normal. So to fit the feasibility of (9) into the framework of [29] we in [22] formally assumed that the elements of matrix $H$ are i.i.d. standard normals. In that regard, what was proved in [22] is a bit different from what is stated in the above theorem. However, as mentioned in [22] (and in more detail in [29, 26]), all our results from [22] continue to hold for a very large set of types of randomness and certainly for the Bernoulli one assumed in Theorem 1.

3) We will continue to call the critical value of $\alpha$ for which (4) is feasible the storage capacity even when $\kappa<0$, even though it may be linguistically a bit incorrect, given the neural network interpretation of finite basins of attraction mentioned above.

### 3.3 Rigorous results – negative spherical perceptron ($\kappa<0$)

In our recent work [28] we went a step further and considered the negative version of the standard spherical perceptron. While the results that we will present later on in Sections 5 and 6 will be valid for any $\kappa$, our main concern will be from a neural network point of view and as such will be related to the positive case, i.e. to the $\kappa\geq 0$ scenario. In that regard the results that we review in this subsection may seem less important than those from the previous subsections. However, once we present the main results in Sections 5 and 6 it will be clear that there is an interesting conceptual similarity, deeply rooted in a combinatorial similarity, between what we present in this subsection (and what was essentially proved in [22, 28]) and the results that we will present in Sections 5 and 6.

As mentioned above under point 3), we in [28] called the corresponding limiting $\alpha$ in the $\kappa<0$ case the storage capacity of the negative spherical perceptron. Before presenting the storage capacity results that we obtained in [22, 28] we find it useful to slightly redefine the original feasibility problem considered above. This will be of great use in the exposition that follows as well.

We first recall that in [28] we studied the so-called uncorrelated case of the spherical perceptron (more on an equally important correlated case can be found in e.g. [22, 12]). This is the same scenario that we will study here (so the simplifications that we made in [28] and that we are about to present below will be in place later on as well). In the uncorrelated case, one views all patterns $H_{i,1:n}$, $1\leq i\leq m$, as uncorrelated (as expected, $H_{i,1:n}$ stands for the vector $[H_{i1},H_{i2},\dots,H_{in}]$). Now, the following becomes the corresponding version of the question of interest mentioned above: assuming that $H$ is an $m\times n$ matrix with i.i.d. $\{-1,1\}$ Bernoulli entries and that $\|x\|_2=1$, how large can $\alpha$ be so that the following system of linear inequalities (understood componentwise) is satisfied with overwhelming probability

$$Hx\geq\kappa. \tag{8}$$

This of course is the same as if one asks how large $\alpha$ can be so that the following optimization problem is feasible with overwhelming probability

$$Hx\geq\kappa,\qquad\|x\|_2=1. \tag{9}$$

To see that (8) and (9) indeed match the above described fixed point condition it is enough to observe that due to statistical symmetry one can assume $H_{i1}=1$, $1\leq i\leq m$. Also, the constraints essentially decouple over the columns of $X$ (so one can then think of $x$ in (8) and (9) as one of the columns of $X$). Moreover, the dimension of $H$ in (8) and (9) should be changed to $m\times(n-1)$; however, since we will consider a large $n$ scenario, to make writing easier we keep the dimension as $m\times n$. Also, as mentioned under point 2) above, we will, without loss of generality, treat $H$ in (9) as if it has i.i.d. standard normal components. Moreover, in [22] we also recognized that (9) can be rewritten as the following optimization problem

$$\xi_n=\min_{x}\max_{\lambda\geq 0}\ \kappa\lambda^T\mathbf{1}-\lambda^THx\qquad\text{subject to}\qquad\|\lambda\|_2=1,\ \|x\|_2=1, \tag{10}$$

where $\mathbf{1}$ is an $m$-dimensional column vector of all $1$'s. Clearly, if $\xi_n\leq 0$ then (9) is feasible. On the other hand, if $\xi_n>0$ then (9) is not feasible (for a fixed $x$, the inner maximization is positive exactly when some component of $\kappa\mathbf{1}-Hx$ is positive, i.e. exactly when $x$ violates (9)). That basically means that if we can probabilistically characterize the sign of $\xi_n$ then we could have a way of determining $\alpha$ such that $\xi_n\leq 0$. That is exactly what we have done in [22] on an ultimate level for $\kappa\geq 0$ and on a, say, upper-bounding level for $\kappa<0$. We do mention again that, as far as point 2) goes, we in [28] (and in this paper as well) without loss of generality made the same type of assumption that we had made in [22] related to the statistics of $H$. In other words, as far as the presentation below is concerned, we will continue to assume that the elements of matrix $H$ are i.i.d. standard normals (as mentioned above, such an assumption changes nothing in the validity of the results that we will present; also, more on this topic can be found in e.g. [24, 25, 29] where we discussed it a bit further). Relying on the strategy developed in [29, 27] and on a set of results from [14, 15] we in [22] proved the following theorem that essentially extends Theorem 1 to the $\kappa<0$ case and thereby resolves Conjecture 8.4.4 from [32] in the affirmative:

###### Theorem 2.

[22] Let be an matrix with i.i.d. standard normal components. Let be large and let , where is a constant independent of . Let be as in (10) and let be a scalar constant independent of . Let all ’s be arbitrarily small constants independent of . Further, let be a standard normal random variable and set

(11) |

Let and be scalars such that

(12) |

If then

(13) |

Moreover, if then

(14) |

###### Proof.

Presented in [22]. ∎

In more informal language (essentially ignoring all technicalities and $\epsilon$'s) one has that as long as

$$\alpha>\left(\frac{1}{\sqrt{2\pi}}\int_{-\kappa}^{\infty}(z+\kappa)^2e^{-\frac{z^2}{2}}\,dz\right)^{-1}, \tag{15}$$

the problem in (9) will be infeasible with overwhelming probability. On the other hand, one has that when $\kappa\geq 0$, as long as

$$\alpha<\left(\frac{1}{\sqrt{2\pi}}\int_{-\kappa}^{\infty}(z+\kappa)^2e^{-\frac{z^2}{2}}\,dz\right)^{-1}, \tag{16}$$

the problem in (9) will be feasible with overwhelming probability. This settles the $\kappa\geq 0$ case completely and essentially establishes the storage capacity as $\alpha_c$, which matches the prediction given in the introductory analysis presented in [12] and rigorously confirmed by the results of [20, 21, 32]. On the other hand, when $\kappa<0$ it only shows that the storage capacity with overwhelming probability is not higher than the quantity given in [12]. As mentioned above, this confirms Talagrand's Conjecture 8.4.4 from [32]. However, it does not settle problem (question) 8.4.2 from [32].

The results obtained based on the above theorem as well as those obtained based on Theorem 1 are presented in Figure 1. When $\kappa\geq 0$ (i.e. when $\alpha\leq 2$) the curve indicates the exact breaking point between the "overwhelming" feasibility and infeasibility of (9). On the other hand, when $\kappa<0$ (i.e. when $\alpha>2$) the curve is only an upper bound on the storage capacity, i.e. for any value of the pair $(\alpha,\kappa)$ that is above the curve given in Figure 1, (9) is infeasible with overwhelming probability.
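To make the feasibility question behind Figure 1 a bit more tangible, the following sketch looks for a unit-norm $x$ with $Hx>0$ componentwise, i.e. a feasible point of the problem considered above with $\kappa=0$, for a small Gaussian $H$. It uses the classical perceptron learning rule, which is not the technique used in this paper and is brought in here only for illustration. All function names, the seed, and the sizes ($m=10$, $n=40$, so that the ratio $m/n$ is well below $2$) are assumptions made for this example.

```python
import math
import random


def find_unit_solution(H, max_updates=100_000):
    """Perceptron learning rule: look for x with H x > 0 componentwise.

    Returns a unit-norm x on success, or None if the update budget runs out
    (running out of budget proves nothing about infeasibility).
    """
    n = len(H[0])
    x = [0.0] * n
    for _ in range(max_updates):
        # First row whose inequality is currently violated, if any.
        violated = next(
            (row for row in H if sum(h * v for h, v in zip(row, x)) <= 0.0),
            None,
        )
        if violated is None:
            norm = math.sqrt(sum(v * v for v in x))
            return [v / norm for v in x]
        # Perceptron update: move x toward the violated row.
        x = [v + h for v, h in zip(x, violated)]
    return None


def min_margin(H, x):
    """Smallest component of H x."""
    return min(sum(h * v for h, v in zip(row, x)) for row in H)


def gaussian_matrix(m, n, seed):
    """m x n matrix with i.i.d. standard normal entries."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]


if __name__ == "__main__":
    H = gaussian_matrix(m=10, n=40, seed=0)
    x = find_unit_solution(H)
    print("feasible point found:", x is not None)
    if x is not None:
        print("smallest margin:", min_margin(H, x))
```

A returned point certifies feasibility for that particular $H$; the sketch says nothing about the infeasible regime above the curve, where the perceptron rule simply never terminates successfully.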

Since the $\kappa<0$ case did not appear as settled based on the above presented results, we then in [28] attempted to lower the upper bounds given in Theorem 2. We created a fairly powerful mechanism that produced the following theorem as a way of characterizing the storage capacity of the negative spherical perceptron.

###### Theorem 3.

Let be an matrix with i.i.d. standard normal components. Let be large and let , where is a constant independent of . Let be a scalar constant independent of . Let all ’s be arbitrarily small constants independent of . Set

(17) |

and

(18) |

Set

(19) |

and

(20) |

Further, set

(21) |

If is such that

(22) |

then (9) is infeasible with overwhelming probability.

###### Proof.

Presented in [28]. ∎

The results one can obtain for the storage capacity based on the above theorem are presented in Figure 2 (as mentioned in [28], due to the numerical optimizations involved the results presented in Figure 2 should be taken only as an illustration; also, as discussed in [28], a particular choice of the parameters in Theorem 3 produces the results of Theorem 2). Even as such, they indicate that a visible improvement in the values of the storage capacity may be possible, though in a range of values of $\alpha$ substantially larger than $2$ (i.e. in a range of $\kappa$'s somewhat smaller than zero). While at this point this observation may look unrelated to the problem that we will consider in the following section, one should keep it in mind (essentially, a conceptually similar conclusion will be made later on when we study the capacities with limited errors).

## 4 Spherical perceptron with errors

What we described in the previous section is a typical setup of a standard spherical perceptron. To be a bit more precise, it is a setup one can use to quantify the storage capacity of the standard spherical perceptron. In this section we will slightly change this standard notion of how the spherical perceptron operates. In fact, what we will change is what counts as an acceptable way for the spherical perceptron to operate. Of course, such a change is not our invention. While it had been known for a long time, it is the work of Gardner [12] that popularized its analytical study. Before we present the known analytical predictions we will briefly sketch the main idea behind spherical perceptrons that are allowed to function as memories with errors. We will rely on many simplifications of the original perceptron setup from Section 2 introduced in [22, 28] and presented in Section 3.

To that end we start by recalling that for all practical purposes needed here (and those we needed in [22, 28]) the storage capacity of the standard spherical perceptron can be considered through the feasibility problem given in (9), which we restate below

$$Hx\geq\kappa,\qquad\|x\|_2=1. \tag{23}$$

We recall as well that, as argued in [22, 28] (and as mentioned in the previous section), one can assume that the elements of $H$ are i.i.d. standard normals and that the dimension of $H$ is $m\times n$, where as earlier we keep the linear regime, i.e. continue to assume that $m=\alpha n$ where $\alpha$ is a constant independent of $n$. Now, if all inequalities in (23) are satisfied, the dynamics established will be stable and all $m$ patterns can be successfully stored. On the other hand, if one relaxes such a constraint so that only a fraction of them (say larger than $1-f_{wrong}$) is satisfied, then only such a fraction of patterns can be successfully stored (of course one views storage at each site; however, due to symmetry as discussed earlier, one can simply switch to consideration of (23)). This is of course the same as saying that if a fraction (say smaller than $f_{wrong}$) of the inequalities may not hold then such a fraction of patterns may be incorrectly stored. One can then reformulate (23) so that it provides a mathematical description for such a scenario. The resulting feasibility problem one can then consider becomes

(24) |
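The relaxed requirement described in words above (a limited fraction of the linear inequalities is allowed to fail) can be sketched as a simple check on a candidate vector. This is only a reading of the verbal description and is not taken from the displayed formulation; the function names and the parameter name `f_wrong` (the allowed fraction of incorrectly stored patterns) are labels chosen for this example.

```python
import math


def fraction_violated(H, x, kappa):
    """Fraction of rows i with (H x)_i < kappa, after rescaling x to unit norm."""
    norm = math.sqrt(sum(v * v for v in x))
    unit = [v / norm for v in x]
    violated = sum(
        1 for row in H if sum(h * u for h, u in zip(row, unit)) < kappa
    )
    return violated / len(H)


def stores_with_limited_errors(H, x, kappa, f_wrong):
    """True when at most a fraction f_wrong of the inequalities H x >= kappa fails."""
    return fraction_violated(H, x, kappa) <= f_wrong


if __name__ == "__main__":
    # Tiny hand-made instance: one of the four inequalities fails for this x.
    H = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [1.0, 1.0]]
    x = [1.0, 1.0]
    print(fraction_violated(H, x, kappa=0.0))
    print(stores_with_limited_errors(H, x, kappa=0.0, f_wrong=0.25))
```

The capacity question is then whether some unit-norm $x$ passes this check with overwhelming probability over the random $H$; the sketch only evaluates a given $x$ and does not search for one.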

Using the replica approach Gardner developed for a similar problem in [12], Gardner and Derrida in [13] proceeded to characterize the feasibility of (24). Namely, they gave a prediction for the value of the critical storage capacity $\alpha_c$ as a function of $\kappa$ and $f_{wrong}$ so that (24) is feasible (as mentioned earlier, in what follows we may often refer to $\alpha_c$ as the storage capacity of the spherical perceptron with limited errors). The prediction given in [13] essentially boils down to the following two equations: first one determines an auxiliary parameter as the solution of

(25) |

Then one determines a prediction for the storage capacity as

(26) |

Now, assuming the standard setup (where no errors are allowed) one has $f_{wrong}=0$, which through (25) fixes the value of the parameter determined there. One then from (26) has

$$\alpha_c(\kappa)=\left(\frac{1}{\sqrt{2\pi}}\int_{-\kappa}^{\infty}(z+\kappa)^2e^{-\frac{z^2}{2}}\,dz\right)^{-1}. \tag{27}$$

In other words, if no errors are allowed, (25) and (26) give the same result for the storage capacity as does (6). Now, looking back at what was presented in Figure 1, one should note that when $\kappa\geq 0$ (the case primarily of interest here) the curve denotes the exact values of the storage capacity for any $\kappa$. On the other hand, one has from the same plot that if a pair $(\alpha,\kappa)$ is above the curve the memory is not stable, i.e. with overwhelming probability one cannot find a spherical $x$ such that (9) is feasible. However, if one attempts to be a bit more precise with respect to this instability, one may find it useful to introduce a number of allowed wrong patterns (bits). This is in essence what (25) and (26) do. They basically attempt to characterize the number of incorrectly stored patterns when $\kappa\geq 0$ and a pair $(\alpha,\kappa)$ is above the curve given in Figure 1 (in fact one can use them to give a prediction for the number of incorrectly stored patterns, say $f_{wrong}m$, even when $\kappa<0$). Alternatively, as framed above, one can think of all of this as a way of finding the storage capacity if a fraction of errors (incorrectly stored patterns), say $f_{wrong}$, is allowed. This is exactly the problem that we will be attacking below and, based on the above, is exactly what (25) and (26) characterize.

Before proceeding further we should provide a few comments on the potential accuracy of the above predictions. As is now well known, if $f_{wrong}=0$ and $\kappa\geq 0$ then the above prediction boils down to the standard storage capacity of the positive spherical perceptron, which based on [20, 21] (and later on [32, 22]) is known to be correct. On the other hand, as discussed in [28] (and briefly in the previous section), the above prediction is only a rigorous upper bound on the storage capacity of the negative spherical perceptron. In fact, many of the conclusions already made in [12, 13] indicated this kind of behavior. Namely, a stability analysis of the replica approach done in [13] indicated that some of the predictions (essentially in a certain range of the $(\alpha,\kappa)$ plane) related to the storage capacities when errors are allowed may not be accurate. In [7] the replica stability range given in [13] was corrected a bit, and as a consequence [7] actually established that the replica analysis of [13] may in fact produce incorrect results in the entire regime above the curve given in Figure 1. Still, even if the results given in (25) and (26) are incorrect, they may be fairly good approximate predictions for the storage capacity (or alternatively the fraction of incorrectly stored patterns), or they may even be rigorous bounds on the true values (as were the predictions of [12] related to the negative spherical perceptron). Below we will show that the above given predictions (namely, those given in (25) and (26)) are in fact rigorous upper bounds on the storage capacity of the spherical perceptron when a fraction of incorrectly stored patterns is allowed.

## 5 Upper bounds on the storage capacity of the spherical perceptrons with limited errors

As we have mentioned at the end of the previous section, in this section we will create a set of results that will essentially establish the predictions obtained in [13] (and given in (25) and (26)) as rigorous upper bounds on the storage capacity of the spherical perceptron with limited errors. We start by writing an analogue to $\xi_n$ for the feasibility problem of interest here, namely the one given in (24)

subject to | (28) | ||||

Although it is probably obvious, we mention that $D$ is an $m\times m$ matrix with the elements of vector $d$ on its main diagonal and zeros elsewhere. Clearly, following the logic we presented in previous sections, the sign of the optimal value of (28) determines the feasibility of (24). In particular, if that value is positive then (24) is infeasible. Given the random structure of the problem (we recall that $H$ is random) one can then pose the following probabilistic feasibility question: how small can $\alpha$ be so that the optimal value of (28) is positive and (24) is infeasible with overwhelming probability? In what follows we will attempt to provide an answer to such a question.

### 5.1 Probabilistic analysis

In this subsection we will present a probabilistic analysis of the optimization problem given in (28). In a nutshell, we will provide a relation between $\kappa$ and $\alpha$ so that the optimal value of (28) is positive with overwhelming probability over $H$. Based on the above discussion, this will then be enough to conclude that the problem in (24) is infeasible with overwhelming probability when $\kappa$ and $\alpha$ satisfy such a relation.

The analysis that we will present below will to a degree rely on a strategy we developed in [29, 27] and utilized in [22] when studying the storage capacity of the standard spherical perceptrons. We start by recalling a set of probabilistic results from [14, 15] that were used as an integral part of the strategy developed in [29, 27, 22].

###### Theorem 4.

The following, simpler, version of the above theorem relates to the expected values.

###### Theorem 5.

Now, since all random quantities of interest below will concentrate around their mean values, it will be enough to study only their averages. However, since it will not make the writing of what we intend to present in the remaining parts of this section substantially more complicated, we will present a complete probabilistic treatment here and will leave the study of the expected values for the following section, where such a consideration will substantially simplify the exposition.

We will make use of Theorem 4 through the following lemma (the lemma is an easy consequence of Theorem 4 and in fact is fairly similar to Lemma 3.1 in [15], see also [24, 22] for similar considerations).

###### Lemma 1.

Let be an matrix with i.i.d. standard normal components. Let and be and vectors, respectively, with i.i.d. standard normal components. Also, let be a standard normal random variable and let be a function of . Then

(29) |

###### Proof.

The proof is basically similar to the proof of Lemma 3.1 in [15] as well as to the proof of Lemma 7 in [24]. However, one has to be a bit careful about the structures of sets of allowed values for . For completeness we will sketch the core of the argument. The remaining parts follow easily as in Lemma 3.1 in [15] (or as in the proof of Lemma 7 in [24]). Namely, one starts by defining processes and in the following way

(30) |

Then clearly

(31) |

One then further has

(32) |

and clearly

(33) |

Moreover,

(34) |

And after a small algebraic transformation

(35) | |||||

Combining (31), (33), and (35) and using results of Theorem 4 one then easily obtains (29). ∎

Let with being an arbitrarily small constant independent of . We will first look at the right-hand side of the inequality in (29). The following is then the probability of interest

(36) |

After solving the minimization over one obtains

(37) |

where

(38) |

Since is a vector of i.i.d. standard normal variables it is rather trivial that

(39) |

where is an arbitrarily small constant and is a constant dependent on but independent of . Along the same lines, due to the linearity of the objective function in the definition of and the fact that is a vector of i.i.d. standard normals, one has

(40) |

where

(41) |

and is an arbitrarily small constant and analogously as above is a constant dependent on and but independent of . Then a combination of (37), (39), and (40) gives

(42) |

If

(43) | |||||

one then has from (42)

(44) |

To make the result in (44) operational one needs an estimate for . In the following subsection we will present a way that can be used to estimate . Before doing so we will briefly take a look at the left-hand side of the inequality in (29).
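The concentration step invoked around (39) can be illustrated numerically. The sketch below assumes that (39) is the usual statement that the Euclidean norm of an $n$-dimensional vector with i.i.d. standard normal components stays below $(1+\epsilon)\sqrt{n}$ with probability exponentially close to one; since the displayed formula is not reproduced here, that reading is an assumption. The function names, seed, and sample sizes are likewise chosen only for this example.

```python
import math
import random


def scaled_norms(n, trials, seed=0):
    """Sample ||h||_2 / sqrt(n) for h with i.i.d. standard normal entries."""
    rng = random.Random(seed)
    return [
        math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n)) / n)
        for _ in range(trials)
    ]


def max_deviation(n, trials, seed=0):
    """Largest observed |  ||h||_2 / sqrt(n) - 1  | over the sampled vectors."""
    return max(abs(r - 1.0) for r in scaled_norms(n, trials, seed))


if __name__ == "__main__":
    # The deviation from 1 shrinks as the dimension n grows.
    for n in (10, 100, 1000, 10000):
        print(n, max_deviation(n, trials=50))
```

The printed deviations shrink roughly like $1/\sqrt{n}$, which is the behavior that makes it sufficient to work with $\sqrt{n}$ in place of the random norm at the cost of an arbitrarily small $\epsilon$.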