A Non-Binary Associative Memory with Exponential Pattern Retrieval Capacity and Iterative Learning: Extended Results

02/05/2013
by   Amir Hesam Salavati, et al.
EPFL
Qualcomm

We consider the problem of neural association for a network of non-binary neurons. Here, the task is to first memorize a set of patterns using a network of neurons whose states assume values from a finite number of integer levels. Later, the same network should be able to recall previously memorized patterns from their noisy versions. Prior work in this area considers storing a finite number of purely random patterns, and has shown that the pattern retrieval capacities (maximum number of patterns that can be memorized) scale only linearly with the number of neurons in the network. In our formulation of the problem, we concentrate on exploiting redundancy and internal structure of the patterns in order to improve the pattern retrieval capacity. Our first result shows that if the given patterns have a suitable linear-algebraic structure, i.e. comprise a sub-space of the set of all possible patterns, then the pattern retrieval capacity is in fact exponential in terms of the number of neurons. The second result extends the previous finding to cases where the patterns have weak minor components, i.e. the smallest eigenvalues of the correlation matrix tend toward zero. We will use these minor components (or the basis vectors of the pattern null space) to increase both the pattern retrieval capacity and the error correction capabilities. An iterative algorithm is proposed for the learning phase, and two simple neural update algorithms are presented for the recall phase. Using analytical results and simulations, we show that the proposed methods can tolerate a fair amount of errors in the input while being able to memorize an exponentially large number of patterns.


I Introduction

Neural associative memory is a particular class of neural networks capable of memorizing (learning) a set of patterns and recalling them later in the presence of noise, i.e. retrieving the correct memorized pattern from a given noisy version. Starting from the seminal work of Hopfield in 1982 [1], various artificial neural networks have been designed to mimic the task of the neuronal associative memory (see for instance [2], [3], [4], [5], [6]).

In essence, the neural associative memory problem is very similar to the one faced in communication systems where the goal is to reliably and efficiently retrieve a set of patterns (so-called codewords) from noisy versions. More interestingly, the techniques used to implement an artificial neural associative memory look very similar to some of the methods used in graph-based modern codes to decode information. This makes the pattern retrieval phase in neural associative memories very similar to iterative decoding techniques in modern coding theory.

However, despite the similarity in the task and techniques employed in both problems, there is a huge gap in terms of efficiency. Using binary codewords of length $n$, one can construct codes that are capable of reliably transmitting $2^{rn}$ codewords over a noisy channel, where $0 < r < 1$ is the code rate [7]. The optimal $r$ (i.e. the largest possible value that permits the almost sure recovery of transmitted codewords from the corrupted received versions) depends on the noise characteristics of the channel and is known as the Shannon capacity [8]. In fact, the Shannon capacity is achievable in certain cases, for example by LDPC codes over AWGN channels.

In current neural associative memories, however, with a network of size $n$ one can only memorize $O(n)$ binary patterns of length $n$ [9], [2]. To be fair, it must be mentioned that these networks are designed such that they are able to memorize any possible set of randomly chosen patterns (with size $O(n)$ of course) (e.g., [1], [2], [3], [4]). Therefore, although humans cannot memorize random patterns, these methods provide artificial neural associative memories with a pleasant sense of generality.

However, this generality severely restricts the efficiency of the network: even if the input patterns have some internal redundancy or structure, current neural associative memories cannot exploit this redundancy in order to increase the number of memorizable patterns or improve error correction during the recall phase. In fact, concentrating on redundancies within patterns is a fairly new viewpoint. This point of view is in harmony with coding techniques, where one designs codewords with a certain degree of redundancy and then uses this redundancy to correct corrupted signals at the receiver's side.

In this paper, we focus on bridging the performance gap between the coding techniques and neural associative memories. Our proposed neural network exploits the inherent structure of the input patterns in order to increase the pattern retrieval capacity from $O(n)$ to $O(a^n)$ with $a > 1$. More specifically, the proposed neural network is capable of learning and reliably recalling given patterns when they come from a subspace with dimension $k < n$ of all possible $n$-dimensional patterns. Note that although the proposed model does not have the versatility of traditional associative memories to handle any set of inputs, such as the Hopfield network [1], it enables us to boost the capacity by a great extent in cases where there is some input redundancy. In contrast, traditional associative memories will still have linear pattern retrieval capacity even if the patterns have good linear algebraic structures.

In [10], we presented some preliminary results in which two efficient recall algorithms were proposed for the case where the neural graph had the structure of an expander [11]. Here, we extend the previous results to general sparse neural graphs as well as proposing a simple learning algorithm to capture the internal structure of the patterns (which will be used later in the recall phase).

The remainder of this paper is organized as follows: In Section II, we will discuss the neural model used in this paper and formally define the associative memory problem. We explain the proposed learning algorithm in Section III. Sections IV and V are respectively dedicated to the recall algorithm and analytically investigating its performance in retrieving corrupted patterns. In Section VI we address the pattern retrieval capacity and show that it is exponential in $n$. Simulation results are discussed in Section VII. Section VIII concludes the paper and discusses future research topics. Finally, the Appendices contain some extra remarks as well as the proofs for certain lemmas and theorems.

II Problem Formulation and the Neural Model

II-A The Model

In the proposed model, we work with neurons whose states are integers from a finite set of non-negative values $\mathcal{Q} = \{0, 1, \dots, Q-1\}$. A natural way of interpreting this model is to think of the integer states as the short-term firing rate of neurons (possibly quantized). In other words, the state of a neuron in this model indicates the number of spikes fired by the neuron in a fixed short time interval.

As in other neural networks, neurons can only perform simple operations. We consider neurons that can do linear summation over the input and possibly apply a non-linear function (such as thresholding) to produce the output. More specifically, neuron $x$ updates its state based on the states of its neighbors $\{s_i\}_{i=1}^{n}$ as follows:

  1. It computes the weighted sum $h = \sum_{i=1}^{n} w_i s_i$, where $w_i$ denotes the weight of the input link from the $i$-th neighbor.

  2. It updates its state as $x = f(h)$, where $f: \mathbb{R} \to \mathcal{Q}$ is a possibly non-linear function from the field of real numbers $\mathbb{R}$ to $\mathcal{Q}$.

We will refer to these two as "neural operations" in the sequel.
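The two neural operations can be illustrated with a minimal sketch. The function names and the particular choice of $f$ (rounding followed by clipping to $\{0, \dots, Q-1\}$) are assumptions made only for illustration; the text merely requires $f$ to map real numbers to the finite state set.

```python
import numpy as np

def neuron_update(weights, neighbor_states, f):
    """Apply the two neural operations: a weighted sum, then a nonlinearity f."""
    h = float(np.dot(weights, neighbor_states))  # step 1: weighted input sum
    return f(h)                                  # step 2: map the sum to a state

def make_clipped_rounding(Q):
    """Hypothetical choice of f: round to the nearest integer, clip to [0, Q-1]."""
    def f(h):
        return int(np.clip(np.rint(h), 0, Q - 1))
    return f

f = make_clipped_rounding(Q=4)
new_state = neuron_update([0.5, -1.0, 2.0], [2, 1, 1], f)
```

Any other map into the state set (for example a pure threshold) could be passed as `f` without changing `neuron_update`.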

II-B The Problem

The neural associative memory problem consists of two parts: learning and pattern retrieval.

II-B1 The learning phase

We assume to be given $C$ vectors of length $n$ with integer-valued entries belonging to $\mathcal{Q}$. Furthermore, we assume these patterns belong to a subspace of $\mathcal{Q}^n$ with dimension $k \le n$. Let $\mathcal{X}_{C \times n}$ be the matrix that contains the set of patterns in its rows. Note that if $k = n$, then we are back to the original associative memory problem. However, our focus will be on the case where $k < n$, which will be shown to yield much larger pattern retrieval capacities. Let us denote the model specification by a triplet $(Q, n, k)$.

The learning phase then comprises a set of steps to determine the connectivity of the neural graph (i.e. finding a set of weights) as a function of the training patterns in $\mathcal{X}$ such that these patterns are stable states of the recall process. More specifically, in the learning phase we would like to memorize the patterns in $\mathcal{X}$ by finding a set of non-zero vectors $w_1, \dots, w_m \in \mathbb{R}^n$ that are orthogonal to the set of given patterns. Note that such vectors exist (for instance, a basis of the null space).
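The existence claim can be checked numerically. The sketch below is an offline computation using a singular value decomposition, not the neural learning rule developed later in the paper; the toy patterns (non-negative integer combinations of three basis vectors in a 6-dimensional space) and all names are assumptions chosen for illustration.

```python
import itertools
import numpy as np

# Hypothetical toy data: patterns lie in a 3-dimensional subspace of a 6-dimensional space.
basis = np.array([[1, 1, 1, 0, 0, 0],
                  [0, 1, 1, 1, 1, 0],
                  [0, 1, 0, 0, 1, 1]])
coeffs = np.array(list(itertools.product([0, 1], repeat=3)))
X = coeffs @ basis  # 8 patterns, one per row, entries in {0, ..., 3}

def null_space_basis(X, rel_tol=1e-10):
    """Return orthonormal vectors orthogonal to every row of X (via SVD)."""
    _, s, vt = np.linalg.svd(X)              # vt is n x n because full_matrices=True
    rank = int(np.sum(s > rel_tol * s[0]))   # numerical rank of the pattern matrix
    return vt[rank:]                         # remaining right singular vectors

W = null_space_basis(X)

noisy = X[-1].copy()
noisy[4] += 1  # a single corrupted entry violates at least one constraint
```

Here `W @ X.T` is numerically zero, while `W @ noisy` is not, which is the property the recall phase relies on.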

Our interest is to come up with a neural scheme to determine these vectors. The inherent structure of the patterns is then captured in the obtained null-space vectors, denoted by the matrix $W \in \mathbb{R}^{m \times n}$, whose $i$-th row is $w_i$. This matrix can be interpreted as the adjacency matrix of a bipartite graph which represents our neural network. The graph comprises pattern and constraint neurons (nodes). Pattern neurons, as their name suggests, correspond to the states of the patterns we would like to learn or recall. The constraint neurons, on the other hand, should verify whether the current pattern belongs to the database $\mathcal{X}$. If not, they should send proper feedback messages to the pattern neurons in order to help them converge to the correct pattern in the dataset. The overall network model is shown in Figure 1.

Fig. 1: A bipartite graph that represents the constraints on the training set.

II-B2 The recall phase

In the recall phase, the neural network should retrieve the correct memorized pattern from a possibly corrupted version. In this case, the states of the pattern neurons are initialized with the given (noisy) input pattern. Here, we assume that the noise is integer-valued and additive (see Footnote 1). Therefore, assuming the input to the network is a corrupted version of pattern $x^\mu$, the state of the pattern nodes is $x = x^\mu + z$, where $z$ is the noise. Now the neural network should use the given states together with the fact that $W x^\mu = 0$ to retrieve pattern $x^\mu$, i.e. it should estimate $z$ from $W x = W z$ and return $x^\mu = x - z$. Any algorithm designed for this purpose should be simple enough to be implemented by neurons. Therefore, our objective is to find a simple algorithm capable of eliminating noise using only neural operations.

Footnote 1: It must be mentioned that neural states below $0$ and above $Q-1$ will be clipped to $0$ and $Q-1$, respectively. This is biologically justified as the firing rate of neurons cannot exceed an upper bound and of course cannot be less than zero.
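The relation between the constraint values and the noise can be seen on a small example. The matrix `W` below is a hypothetical sparse integer constraint matrix (the oriented incidence matrix of a 4-node graph), and `pattern` is one non-negative integer vector orthogonal to its rows; both are assumptions chosen for illustration, not data from the paper.

```python
import numpy as np

# Hypothetical sparse constraint matrix: 4 constraint neurons, 6 pattern neurons.
W = np.array([[-1,  0,  1, -1,  0,  0],
              [ 1, -1,  0,  0,  1,  0],
              [ 0,  1, -1,  0,  0, -1],
              [ 0,  0,  0,  1, -1,  1]])
Q = 4                                     # states take values in {0, ..., Q-1}
pattern = np.array([1, 3, 2, 1, 2, 1])    # satisfies W @ pattern == 0

def corrupt(pattern, noise, Q):
    """Add integer noise and clip the result to the valid state range."""
    return np.clip(pattern + noise, 0, Q - 1)

noise = np.array([0, 0, 0, 0, 1, 0])
noisy = corrupt(pattern, noise, Q)
syndrome = W @ noisy                      # equals W @ noise when nothing is clipped
```

Because `W @ pattern` is zero, the constraint neurons see only the contribution of the noise. Noise that would push a state outside the valid range is removed by the clipping itself.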

II-C Related Works

Designing a neural associative memory has been an active area of research for the past three decades. Hopfield was the first to design an artificial neural associative memory in his seminal work in 1982 [1]. The so-called Hopfield network is inspired by Hebbian learning [12] and is composed of binary-valued ($\{-1, 1\}$) neurons, which together are able to memorize a certain number of patterns. In our terminology, the Hopfield network corresponds to a $(2, n, n)$ neural model. The pattern retrieval capacity of a Hopfield network of $n$ neurons was derived later by Amit et al. [13] and shown to be $0.13n$, under the vanishing bit error probability requirement. Later, McEliece et al. [9] proved that under the requirement of vanishing pattern error probability, the capacity of Hopfield networks is $O(n / \log(n))$.

In addition to neural networks with online learning capability, offline methods have also been used to design neural associative memories. For instance, in [2] the authors assume the complete set of patterns is given in advance and calculate the weight matrix using the pseudo-inverse rule [14] offline. In return, this approach helps them improve the capacity of a Hopfield network to $n/2$, under the vanishing pattern error probability condition, while being able to correct one bit of error in the recall phase. Although this is a significant improvement over the scaling of the pattern retrieval capacity in [9], it comes at the price of much higher computational complexity and the lack of gradual learning ability.

While the connectivity graph of a Hopfield network is a complete graph, Komlos and Paturi [15] extended the work of McEliece to sparse neural graphs. Their results are of particular interest as physiological data is also in favor of sparsely interconnected neural networks. They have considered a network in which each neuron is connected to $d$ other neurons, i.e., a $d$-regular network. Assuming that the network graph satisfies certain connectivity measures, they prove that it is possible to store a linear number of random patterns (in terms of $d$) with vanishing bit error probability or $C = O(d / \log n)$ random patterns with vanishing pattern error probability. Furthermore, they show that in spite of the capacity reduction, the error correction capability remains the same as the network can still tolerate a number of errors which is linear in $n$.

It is also known that the capacity of neural associative memories could be enhanced if the patterns are of low-activity nature, in the sense that at any time instant many of the neurons are silent [14]. However, even these schemes fail when required to correct a fair amount of erroneous bits as the information retrieval is not better compared to that of normal networks.

Extension of associative memories to non-binary neural models has also been explored in the past. Hopfield addressed the case of continuous neurons and showed that, similar to the binary case, neurons with states between $-1$ and $1$ can memorize a set of random patterns, albeit with less capacity [16]. Prados and Kak considered a digital version of non-binary neural networks in which neural states could assume integer (positive and negative) values [17]. They show that the storage capacity of such networks is in general larger than that of their binary peers. However, the capacity would still be less than $n$ in the sense that the proposed neural network cannot have more than $n$ patterns that are stable states of the network, let alone being able to retrieve the correct pattern from corrupted input queries.

In [3] the authors investigated a multi-state complex-valued neural associative memory for which the estimated capacity is . Under the same model but using a different learning method, Muezzinoglu et al. [4] showed that the capacity can be increased to . However the complexity of the weight computation mechanism is prohibitive. To overcome this drawback, a Modified Gradient Descent learning Rule (MGDR) was devised in [18]. In our terminology, all these models are neural associative memories.

Given that even very complex offline learning methods cannot improve the capacity of binary or multi-state neural associative memories, a group of recent works has made considerable efforts to exploit the inherent structure of the patterns in order to increase capacity and improve error correction capabilities. Such methods focus merely on memorizing those patterns that have some sort of inherent redundancy. As a result, they differ from previous methods in which the network was designed to be able to memorize any random set of patterns. Pioneering this approach, Berrou and Gripon [19] achieved considerable improvements in the pattern retrieval capacity of Hopfield networks by utilizing Walsh-Hadamard sequences. Walsh-Hadamard sequences are a particular type of low correlation sequences and were initially used in CDMA communications to overcome the effect of noise. The only slight downside to the proposed method is the use of a decoder based on the winner-take-all approach, which requires a separate neural stage, increasing the complexity of the overall method. Using low correlation sequences has also been considered in [5], where the authors introduced two novel mechanisms of neural association that employ binary neurons to memorize patterns belonging to another type of low correlation sequences, called the Gold family [20]. The network itself is very similar to that of Hopfield, with a slightly modified weighting rule. Therefore, similar to a Hopfield network, the complexity of the learning phase is small. However, the authors failed to increase the pattern retrieval capacity beyond $n$ and it was shown that the pattern retrieval capacity of the proposed model is $C = n$, while being able to correct a fair number of erroneous input bits.

Later, Gripon and Berrou came up with a different approach based on neural cliques, which increased the pattern retrieval capacity to $O(n^2)$ [6]. Their method is based on dividing a neural network of size $n$ into $c$ clusters of size $n/c$ each. Then, the messages are chosen such that only one neuron in each cluster is active for a given message. Therefore, one can think of messages as a random vector of length $c \log(n/c)$, where the $\log(n/c)$ part specifies the index of the active neuron in a given cluster. The authors also provide a learning algorithm, similar to that of Hopfield, to learn the pair-wise correlations within the patterns. Using this technique and exploiting the fact that the resulting patterns are very sparse, they could boost the capacity to $O(n^2)$ while maintaining the computational simplicity of Hopfield networks.

In contrast to the pairwise correlation of the Hopfield model, Peretto et al. [21] deployed higher order neural models: the models in which the state of the neurons not only depends on the state of their neighbors, but also on the correlation among them. Under this model, they showed that the storage capacity of a higher-order Hopfield network can be improved to , where is the degree of correlation considered. The main drawback of this model is the huge computational complexity required in the learning phase, as one has to keep track of neural links and their weights during the learning period.

Recently, the present authors introduced a novel model inspired by modern coding techniques in which a neural bipartite graph is used to memorize the patterns that belong to a subspace [10]. The proposed model can be also thought of as a way to capture higher order correlations in given patterns while keeping the computational complexity to a minimal level (since instead of weights one needs to only keep track of of them). Under the assumptions that the bipartite graph is known, sparse, and expander, the proposed algorithm increased the pattern retrieval capacity to , for some , closing the gap between the pattern retrieval capacities achieved in neural networks and that of coding techniques. For completeness, this approach is presented in the appendix (along with the detailed proofs). The main drawbacks in the proposed approach were the lack of a learning algorithm as well as the expansion assumption on the neural graph.

In this paper, we focus on extending the results described in [10] in several directions: first, we will suggest an iterative learning algorithm, to find the neural connectivity matrix from the patterns in the training set. Secondly, we provide an analysis of the proposed error correcting algorithm in the recall phase and investigate its performance as a function of input noise and network model. Finally, we discuss some variants of the error correcting method which achieve better performance in practice.

It is worth mentioning that an extension of this approach to a multi-level neural network is considered in [22]. There, the novel structure enables better error correction. However, the learning algorithm lacks the ability to learn the patterns one by one and requires the patterns to be presented all at the same time in the form of a big matrix. In [23] we have further extended this approach to a modular single-layer architecture with online learning capabilities. The modular structure makes the recall algorithm much more efficient while the online learning enables the network to learn gradually from examples. The learning algorithm proposed in this paper is also virtually the same as the one we proposed in [23], giving it the advantage of

Another important point to note is that learning linear constraints by a neural network is hardly a new topic, as one can learn a matrix orthogonal to a set of patterns in the training set (i.e., $W x^\mu = 0$) using simple neural learning rules (we refer the interested readers to [24] and [25]). However, to the best of our knowledge, finding such a matrix subject to sparsity constraints has not been investigated before. This problem can also be regarded as an instance of compressed sensing [26], in which the measurement matrix is given by the big patterns matrix $\mathcal{X}_{C \times n}$ and the set of measurements are the constraints we look to satisfy, denoted by the tall vector $b$, which for simplicity we assume to be all zero. Thus, we are interested in finding a sparse vector $w$ such that $\mathcal{X} w = 0$. Nevertheless, many decoders proposed in this area are very complicated and cannot be implemented by a neural network using simple neuron operations. Some exceptions are [27] and [28], which are closely related to the learning algorithm proposed in this paper.

II-D Solution Overview

Before going through the details of the algorithms, let us give an overview of the proposed solution. To learn the set of given patterns, we have adopted the neural learning algorithm proposed in [29] and modified it to favor sparse solutions. In each iteration of the algorithm, a random pattern from the data set is picked and the neural weights corresponding to constraint neurons are adjusted in such a way that the projection of the pattern along the current weight vectors is reduced, while trying to make the weights sparse as well.

In the recall phase, we exploit the fact that the learned neural graph is sparse and orthogonal to the set of patterns. Therefore, when a query is given, if it is not orthogonal to the connectivity matrix of the weighted neural graph, it is noisy. We will use the sparsity of the neural graph to eliminate this noise using a simple iterative algorithm. In each iteration, there is a set of violated constraint neurons, i.e. those that receive a non-zero sum over their input links. These nodes will send feedback to their corresponding neighbors among the pattern neurons, where the feedback is the sign of the received input-sum. At this point, the pattern nodes that receive feedback from a majority of their neighbors update their state according to the sign of the sum of received messages. This process continues until noise is eliminated completely or a failure is declared.
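The iteration described above can be sketched as follows. This is a simplified illustration under stated assumptions rather than the exact recall algorithm of the later sections: the constraint matrix `W` is a hypothetical sparse integer matrix, the majority threshold `phi` and the function name are made up for the example, and every pattern neuron that passes the threshold moves by one integer level per iteration.

```python
import numpy as np

# Hypothetical sparse constraint matrix: 4 constraint neurons, 6 pattern neurons.
W = np.array([[-1,  0,  1, -1,  0,  0],
              [ 1, -1,  0,  0,  1,  0],
              [ 0,  1, -1,  0,  0, -1],
              [ 0,  0,  0,  1, -1,  1]])

def recall(W, x_noisy, Q, phi=0.8, max_iter=20):
    """Iterative noise removal with sign feedback and a majority rule (sketch)."""
    x = np.clip(np.asarray(x_noisy), 0, Q - 1).astype(int)
    adjacency = (W != 0).astype(float)
    degree = adjacency.sum(axis=0)              # number of constraints per pattern neuron
    for _ in range(max_iter):
        y = W @ x                               # input sums of the constraint neurons
        if not np.any(y):
            return x, True                      # all constraints satisfied
        s = np.sign(y)                          # feedback: sign of each violated constraint
        violated_fraction = (adjacency.T @ np.abs(s)) / degree
        direction = np.sign(W.T @ s)            # sign of the weighted feedback sum
        update = np.where(violated_fraction >= phi, direction, 0).astype(int)
        x = np.clip(x - update, 0, Q - 1)
    return x, not np.any(W @ x)                 # failure is reported as False

pattern = np.array([1, 3, 2, 1, 2, 1])          # satisfies W @ pattern == 0
recovered, ok = recall(W, pattern + np.array([0, 0, 0, 0, 1, 0]), Q=4)
```

In this toy graph a single error makes both constraints attached to the corrupted pattern neuron fire, while every other pattern neuron sees at most half of its constraints violated, so only the corrupted neuron is updated.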

In short, we propose a neural network with online learning capabilities which uses only neural operations to memorize an exponential number of patterns.

III Learning Phase

Since the patterns are assumed to come from a subspace in the $n$-dimensional space, we adapt the algorithm proposed by Oja and Karhunen [29] to learn the null-space basis of the subspace defined by the patterns. In fact, a very similar algorithm is also used in [24] for the same purpose. However, since we need the basis vectors to be sparse (due to requirements of the algorithm used in the recall phase), we add an additional term to penalize non-sparse solutions during the learning phase.

Another difference between the proposed method and that of [24] is that the learning algorithm proposed in [24] yields dual vectors that form an orthogonal set. Although one can easily extend our suggested method to such a case as well, we find this requirement unnecessary in our case. This gives us the additional advantage of making the algorithm parallel and adaptive: parallel in the sense that we can design an algorithm to learn one constraint and repeat it several times in order to find all constraints with high probability, and adaptive in the sense that we can determine the number of constraints on-the-go, i.e. start by learning just a few constraints. If needed (for instance due to bad performance in the recall phase), the network can easily learn additional constraints. This increases the flexibility of the algorithm and provides a nice trade-off between the time spent on learning and the performance in the recall phase. Both these points make the approach more biologically realistic.

It should be mentioned that the core of our learning algorithm here is virtually the same as the one we proposed in [23].

III-A Overview of the proposed algorithm

The problem of finding one sparse constraint vector $w$ is given by equations (1) and (2), in which pattern $\mu$ is denoted by $x^\mu$.

$$\min_{w} \; \sum_{\mu=1}^{C} |\langle x^\mu, w \rangle|^2 + \eta\, g(w) \tag{1}$$

subject to:

$$\|w\|_2 = 1 \tag{2}$$

In the above problem, $\langle \cdot, \cdot \rangle$ is the inner product, $\|\cdot\|_2$ represents the $\ell_2$ vector norm, $g(w)$ is a penalty function to encourage sparsity and $\eta$ is a positive constant. There are various ways to choose $g(w)$. For instance, one can pick $g(w)$ to be $\|\cdot\|_1$, which leads to an $\ell_1$-norm penalty and is widely used in compressed sensing applications [27], [28]. Here, we will use a different penalty function, as explained later.

To form the basis for the null space of the patterns, we need $m = n - k$ vectors, which we can obtain by solving the above problem several times, each time from a random initial point (see Footnote 2).

Footnote 2: It must be mentioned that in order to have exactly $m$ linearly independent vectors, we should pay some additional attention when repeating the proposed method several times. This issue is addressed later in the paper.

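As for the sparsity penalty term $g(w)$ in this problem, in this paper we consider the function

$$g(w) = \sum_{i=1}^{n} \tanh(\sigma w_i^2),$$

where $\sigma$ is chosen appropriately. Intuitively, $\tanh(\sigma w_i^2)$ approximates $|\mathrm{sign}(w_i)|$, so that $g(w)$ approximates the $\ell_0$-norm of $w$. Therefore, the larger $\sigma$ is, the closer $g(w)$ will be to $\|w\|_0$. By calculating the derivative of the objective function, and by considering the update due to each randomly picked pattern $x$, we get the following iterative algorithm:

$$y(t) = x(t) \cdot w(t) \tag{3}$$
$$\tilde{w}(t+1) = w(t) - \alpha_t \left( 2 y(t) x(t) + \eta\, \Gamma(w(t)) \right) \tag{4}$$
$$w(t+1) = \frac{\tilde{w}(t+1)}{\|\tilde{w}(t+1)\|_2} \tag{5}$$

In the above equations, $t$ is the iteration number, $x(t)$ is the sample pattern chosen at iteration $t$ uniformly at random from the patterns in the training set $\mathcal{X}$, and $\alpha_t$ is a small positive constant. Finally, $\Gamma(w) = \nabla g(w)$ is the gradient of the penalty term for non-sparse solutions. This function has the interesting property that for very small values of $w_i(t)$, $\Gamma_i(w(t)) \simeq 2 \sigma w_i(t)$. To see why, consider the $i$-th entry of the function $\Gamma(w(t))$:

$$\Gamma_i(w(t)) = \frac{\partial g(w(t))}{\partial w_i(t)} = 2 \sigma w_i(t) \left( 1 - \tanh^2(\sigma w_i(t)^2) \right)$$

It is easy to see that $\Gamma_i(w(t)) \simeq 2 \sigma w_i(t)$ for relatively small $w_i(t)$'s. And for larger values of $w_i(t)$, we get $\Gamma_i(w(t)) \simeq 0$ (see Figure 2). Therefore, by proper choice of $\eta$ and $\sigma$, equation (4) suppresses small entries of $w(t)$ by pushing them towards zero, thus favoring sparser results. To simplify the analysis, with some abuse of notation, we approximate the function $\Gamma(w(t))$ entry-wise with the following function:

$$\Gamma(w_i(t), \theta_t) = \begin{cases} w_i(t) & \text{if } |w_i(t)| \le \theta_t \\ 0 & \text{otherwise} \end{cases} \tag{6}$$

where $\theta_t$ is a small positive threshold.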

Fig. 2: The sparsity penalty $\Gamma_i(w(t))$, which suppresses small values of the $i$-th entry of $w(t)$ in each iteration, as a function of $w_i$ and $\sigma$. Note that the normalization constant $2\sigma$ has been omitted here to make comparison with the function $\Gamma(w_i, \theta)$ possible.
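The behaviour of the penalty gradient can be checked numerically. The sketch below assumes the penalty has the form $g(w) = \sum_i \tanh(\sigma w_i^2)$, which is our reading of the formula in this section, and compares its gradient with the thresholded approximation of equation (6). The function names and the sample values are hypothetical.

```python
import numpy as np

def penalty(w, sigma):
    """Assumed sparsity penalty g(w) = sum_i tanh(sigma * w_i^2)."""
    return float(np.sum(np.tanh(sigma * w ** 2)))

def penalty_gradient(w, sigma):
    """Entry-wise gradient: 2*sigma*w_i*(1 - tanh^2(sigma*w_i^2))."""
    return 2.0 * sigma * w * (1.0 - np.tanh(sigma * w ** 2) ** 2)

def thresholded_gradient(w, theta):
    """Approximation of equation (6): keep entries with |w_i| <= theta, zero the rest."""
    return np.where(np.abs(w) <= theta, w, 0.0)

sigma = 10.0
w = np.array([0.01, -0.02, 0.5, 2.0])
grad = penalty_gradient(w, sigma)          # about 2*sigma*w_i for small entries, about 0 for large ones
approx = thresholded_gradient(w, theta=0.05)
```

Small entries receive a pull towards zero that is proportional to their value, while large entries are left essentially untouched, which is the qualitative behaviour the thresholded form is meant to capture.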

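Following the same approach as [29] and assuming $\alpha_t$ to be small enough such that equation (5) can be expanded in powers of $\alpha_t$, we can approximate equations (3)-(5) with the following simpler version (constant factors are absorbed into $\alpha_t$ and $\eta$):

$$y(t) = x(t) \cdot w(t) \tag{7}$$
$$w(t+1) = w(t) - \alpha_t \left( y(t) \left( x(t) - \frac{y(t) w(t)}{\|w(t)\|_2^2} \right) + \eta\, \Gamma(w(t), \theta_t) \right) \tag{8}$$

In the above approximation, we also omitted the term $\alpha_t \eta \left( w(t) \cdot \Gamma(w(t), \theta_t) \right) w(t)$ since $w(t) \cdot \Gamma(w(t), \theta_t)$ would be negligible, especially as $\theta_t$ in equation (6) becomes smaller.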

The overall learning algorithm for one constraint node is given by Algorithm 1. In words, $y(t)$ in Algorithm 1 is the projection of $x(t)$ on the basis vector $w(t)$. If for a given data vector $x(t)$, $y(t)$ is equal to zero, namely, the data is orthogonal to the current weight vector $w(t)$, then according to equation (8) the weight vector will not be updated by the pattern term. However, if the data vector $x(t)$ has some projection over $w(t)$, then the weight vector is updated in the direction that reduces this projection.

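Algorithm 1 Iterative Learning
Input: Set of patterns $x^\mu \in \mathcal{X}$ with $\mu = 1, \dots, C$, stopping point $\varepsilon$.
Output: $w$
  while $\sum_{\mu} |x^\mu \cdot w(t)|^2 > \varepsilon$ do
     Choose $x(t)$ at random from patterns in $\mathcal{X}$
     Compute $y(t) = x(t) \cdot w(t)$
     Update $w(t+1) = w(t) - \alpha_t y(t) \left( x(t) - \frac{y(t) w(t)}{\|w(t)\|_2^2} \right) - \alpha_t \eta\, \Gamma(w(t), \theta_t)$.
     $t \leftarrow t + 1$.
  end while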
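A minimal Python sketch of the loop in Algorithm 1 is given below, assuming the update subtracts a pattern term $\alpha\, y\,(x - y w / \|w\|_2^2)$ and a thresholded sparsity term $\alpha\, \eta\, \Gamma(w, \theta)$. The toy patterns, the constant step size, the iteration cap and all names are assumptions for illustration; the paper's own parameter schedules are not reproduced here.

```python
import itertools
import numpy as np

# Hypothetical training set: 27 integer patterns from a 3-dimensional subspace of an 8-dimensional space.
G = np.array([[1, 0, 0, 1, 1, 0, 1, 0],
              [0, 1, 0, 1, 0, 1, 0, 1],
              [0, 0, 1, 0, 1, 1, 1, 1]])
coeffs = np.array(list(itertools.product(range(3), repeat=3)))
X = coeffs @ G
alpha = 0.5 / np.max(np.sum(X ** 2, axis=1))   # assumed constant step, scaled by the largest pattern norm

def thresholded_gradient(w, theta):
    """Sparsity term: returns w_i where |w_i| <= theta and 0 elsewhere."""
    return np.where(np.abs(w) <= theta, w, 0.0)

def learn_constraint(X, alpha, eta=0.0, theta=0.05, eps=1e-10, max_iter=50000, seed=0):
    """Learn one unit-norm vector (approximately) orthogonal to all rows of X."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(X.shape[1])
    w /= np.linalg.norm(w)                      # random unit-norm starting point
    for _ in range(max_iter):
        if np.sum((X @ w) ** 2) <= eps:         # stopping rule of the while loop
            break
        x = X[rng.integers(len(X))]             # pick a pattern uniformly at random
        y = x @ w                               # projection of the pattern on w
        w = (w - alpha * y * (x - y * w / (w @ w))
               - alpha * eta * thresholded_gradient(w, theta))
    return w / np.linalg.norm(w)

w = learn_constraint(X, alpha)                  # eta = 0: plain anti-Hebbian learning
```

With `eta=0` the sketch reduces to the rule without the sparsity term; setting `eta > 0` adds the pull of small entries towards zero, at the cost of a slower or only approximate decrease of the projections.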

Since we are interested in finding $m$ basis vectors, we have to do the above procedure at least $m$ times in parallel (see Footnote 3).

Footnote 3: In practice, we may have to repeat this process more than $m$ times to ensure the existence of a set of $m$ linearly independent vectors. However, our experimental results suggest that most of the time, repeating $m$ times would be sufficient.
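One simple way to check linear independence when the procedure is repeated is to keep a newly learned vector only if it raises the rank of the set collected so far. The helper below is a hypothetical offline check given for illustration; it is not described in the paper and is not a neural operation.

```python
import numpy as np

def select_independent(candidates, tol=1e-8):
    """Greedily keep the candidate vectors that are linearly independent of those already kept."""
    chosen = []
    for w in candidates:
        trial = np.array(chosen + [np.asarray(w, dtype=float)])
        if np.linalg.matrix_rank(trial, tol=tol) == len(trial):
            chosen.append(np.asarray(w, dtype=float))
    return np.array(chosen)

kept = select_independent([[1, 0, 0], [2, 0, 0], [0, 1, 0]])
```

In this example the second candidate is a multiple of the first and is discarded, so two vectors remain.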

Remark 1.

Although we are interested in finding a sparse graph, note that too much sparseness is not desired. This is because we are going to use the feedback sent by the constraint nodes to eliminate input noise at pattern nodes during the recall phase. Now if the graph is too sparse, the number of feedback messages received by each pattern node is too small to be relied upon. Therefore, we must adjust the penalty coefficient $\eta$ such that the resulting neural graph is sufficiently sparse without becoming too sparse. In the section on experimental results, we compare the error correction performance for different choices of $\eta$.

III-B Convergence analysis

In order to prove that Algorithm 1 converges to the proper solution, we use results from statistical learning. More specifically, we benefit from the convergence of Stochastic Gradient Descent (SGD) algorithms [30]. To prove the convergence, let be the cost function we would like to minimize. Furthermore, let be the correlation matrix for the patterns in the training set. Therefore, due to uniformity assumption for the patterns in the training set, one can rewrite . Finally, denote . Now consider the following assumptions:

  1. and .

  2. , and , where is the small learning rate defined in III-A.

The following lemma proves the convergence of Algorithm 1 to a local minimum .

Lemma 1.

Let assumptions A1 and A2 hold. Then, Algorithm 1 converges to a local minimum for which .

Proof.

To prove the lemma, we use the convergence results in [30] and show that the assumptions required to ensure convergence hold for the proposed algorithm. For simplicity, these assumptions are listed here:

  1. The cost function is three-times differentiable with continuous derivatives. It is also bounded from below.

  2. The usual conditions on the learning rates are fulfilled, i.e. and .

  3. The second moment of the update term should not grow more than linearly with size of the weight vector. In other words,

    for some constants and .

  4. When the norm of the weight vector is larger than a certain horizon , the opposite of the gradient points towards the origin. Or in other words:

  5. When the norm of the weight vector is smaller than a second horizon , with , then the norm of the update term is bounded regardless of . This is usually a mild requirement:

To start, assumption holds trivially as the cost function is three-times differentiable, with continuous derivatives. Furthermore, . Assumption holds because of our choice of the step size , as mentioned in the lemma description.

Assumption ensures that the vector could not escape by becoming larger and larger. Due to the constraint , this assumption holds as well.

Assumption holds as well because:

(9)

Finally, assumption holds because:

(10)

Therefore, such that as long as :

(11)

Since all necessary assumptions hold for the learning algorithm 1, it converges to a local minimum where . ∎

Next, we prove the desired result, i.e. the fact that at the local minimum, the resulting weight vector is orthogonal to the patterns, i.e. .

Theorem 2.

In the local minimum where , the optimal vector is orthogonal to the patterns in the training set.

Proof.

Since , we have:

(12)

The first term is always greater than or equal to zero. Now as for the second term, we have that and , where is the entry of . Therefore, . Therefore, both terms on the right hand side of (12) are greater than or equal to zero. And since the left hand side is known to be equal to zero, we conclude that and . The former means . Therefore, we must have , for all . This simply means that the vector is orthogonal to all the patterns in the training set. ∎
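The orthogonality property in Theorem 2 can be checked numerically. The following is a minimal sketch, not the iterative Algorithm 1: it assumes the patterns are stacked as the rows of a hypothetical matrix `X` and obtains null-space vectors from an SVD, which gives orthogonal (but generally not sparse) constraint vectors.

```python
import numpy as np

def null_space_basis(X, tol=1e-10):
    """Return rows spanning the null space of the pattern matrix X (patterns as rows).

    This uses an SVD instead of the paper's iterative learning rule, so the
    result is orthogonal to the patterns but not encouraged to be sparse.
    """
    X = np.asarray(X, dtype=float)
    _, s, vt = np.linalg.svd(X, full_matrices=True)
    if s.size == 0 or s.max() == 0:
        rank = 0
    else:
        rank = int(np.sum(s > tol * s.max()))
    return vt[rank:]

# Hypothetical training set: three patterns of length 3 lying in a 2-dimensional subspace.
X = np.array([[1, 2, 3],
              [2, 4, 6],
              [0, 1, 1]])
W = null_space_basis(X)
# Every row of W should satisfy w . x = 0 for every pattern x, up to rounding.
residual = np.abs(X @ W.T).max()
```

Here `residual` should be zero up to floating-point error, which is the condition stated in Theorem 2 for each learned vector.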

Remark 2.

Note that the above theorem only proves that the obtained vector is orthogonal to the data set and says nothing about its degree of sparsity. The reason is that there is no guarantee that the dual basis of a subspace be sparse. The introduction of the penalty function in problem (III-A) only encourages sparsity by suppressing the small entries of , i.e. shifting them towards zero if they are really small or leaving them intact if they are rather large. And from the fact that , we know this is true as the entries in are either large or zero, i.e. there are no small entries. Our experimental results in section VII show that in fact this strategy works perfectly and the learning algorithm results in sparse solutions.

III-C Avoiding the all-zero solution

Although in problem (III-A) we have the constraint to make sure that the algorithm does not converge to the trivial solution , due to approximations we made when developing the optimization algorithm, we should make sure to choose the parameters such that the all-zero solution is still avoided.

To this end, denote and consider the following inequalities:

Now in order to have , we must have that . Given that, , it is therefore sufficient to have . On the other hand, we have:

(14)

As a result, in order to have , it is sufficient to have . Finally, since we have (entry-wise), we know that . Therefore, having ensures .

Remark 3.

Interestingly, the above choice for the function looks very similar to the soft thresholding function (15) introduced in [27] to perform iterative compressed sensing. The authors show that their choice of the sparsity function is very competitive in the sense that one cannot get much better results by choosing other thresholding functions. However, one main difference between their work and ours is that we enforce the sparsity as a penalty in equation (4) while they apply the soft thresholding function in equation (15) to the whole , i.e. if the updated value of is larger than a threshold, it is left intact while it will be put to zero otherwise.

(15)

where is the threshold at iteration and tends to zero as grows.
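The thresholding behaviour described in Remark 3 (leave an entry intact if it exceeds the threshold, set it to zero otherwise) can be sketched as follows. This is an illustration of that verbal description only; equation (15) itself is not reproduced here, and comparing the magnitude of each entry against the threshold is an assumption. The names `threshold_entries` and `theta` are hypothetical.

```python
import numpy as np

def threshold_entries(w, theta):
    """Keep entries whose magnitude exceeds theta; set the others to zero.

    Assumption: the comparison is made on |w_i|, so that large negative
    entries are also kept.
    """
    w = np.asarray(w, dtype=float)
    return np.where(np.abs(w) > theta, w, 0.0)

# Example: small entries are suppressed, large ones are left untouched.
thresholded = threshold_entries([0.05, -0.5, 0.2, -0.01], theta=0.1)
```

In an iterative scheme, `theta` would be decreased over the iterations, in line with the statement that the threshold tends to zero.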

III-D Making the Algorithm Parallel

In order to find constraints, we need to repeat Algorithm 1 several times. Fortunately, we can repeat this process in parallel, which speeds up the algorithm and is more meaningful from a biological point of view as each constraint neuron can act independently of other neighbors. Although doing the algorithm in parallel may result in linearly dependent constraints once in a while, our experimental results show that starting from different random initial points, the algorithm converges to different distinct constraints most of the time. And the chance of getting redundant constraints reduces if we start from a sparse random initial point. Besides, as long as we have enough distinct constraints, the recall algorithm in the next section can start eliminating noise and there is no need to learn all the distinct basis vectors of the null space defined by the training patterns (albeit the performance improves as we learn more and more linearly independent constraints). Therefore, we will use the parallel version to have a faster algorithm in the end.

IV Recall Phase

In the recall phase, we are going to design an iterative algorithm that corresponds to message passing on a graph. The algorithm exploits the fact that our learning algorithm resulted in the connectivity matrix of the neural graph which is sparse and orthogonal to the memorized patterns. Therefore, given a noisy version of the learned patterns, we can use the feedback from the constraint neurons in Fig. 1 to eliminate noise. More specifically, the linear input sums to the constraint neurons are given by the elements of the vector , with being the integer-valued input noise (biologically speaking, the noise can be interpreted as a neuron skipping some spikes or firing more spikes than it should). Based on observing the elements of , each constraint neuron feeds back a message (containing info about ) to its neighboring pattern neurons. Based on this feedback, and exploiting the fact that is sparse, the pattern neurons update their states in order to reduce the noise .

It must also be mentioned that we initially assume asymmetric neural weights during the recall phase. More specifically, we assume that the backward weight from constraint neuron to pattern neuron , denoted by , is equal to the sign of the weight from pattern neuron to constraint neuron , i.e. , where sign(x) is equal to +1, 0 or -1 if x > 0, x = 0 or x < 0, respectively. This assumption simplifies the error correction analysis. Later in section IV-B, we are going to consider another version of the algorithm which works with symmetric weights, i.e. , and compare the performance of all suggested algorithms together in section VII.

IV-A The Recall Algorithms

The proposed algorithm for the recall phase comprises a series of forward and backward iterations. Two different methods are suggested in this paper, which slightly differ from each other in the way pattern neurons are updated. The first one is based on the Winner-Take-All approach (WTA) and is given by Algorithm 2. In this version, only the pattern node that receives the highest amount of normalized feedback updates its state while the other pattern neurons maintain their current states. The normalization is done with respect to the degree of each pattern neuron, i.e. the number of edges connected to each pattern neuron in the neural graph. The winner-take-all circuitry can be easily added to the neural model shown in Figure 1 using any of the classic WTA methods [14].

Input:  Connectivity matrix , iteration
Output:  
1:  for  do
2:     Forward iteration: Calculate the weighted input sum for each constraint neuron and set:
3:     Backward iteration: Each neuron with degree computes
4:     Find
5:     Update the state of winner : set
6:     
7:  end for
Algorithm 2 Recall Algorithm: Winner-Take-All
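The steps of Algorithm 2 can be sketched in Python as below. Since the update formulas are not legible in the listing above, the concrete choices here are assumptions: the forward step takes the sign of the weighted input sum, the backward weights are the signs of the forward weights (as stated at the start of this section), `g2` is the fraction of a neuron's neighbors that send non-zero feedback, `g1` is the normalized net signed feedback, and the winner moves one integer step against the sign of `g1`. The names `recall_wta`, `W_example`, `t_max` and `tol` are hypothetical.

```python
import numpy as np

def recall_wta(W, x_noisy, t_max=20, tol=1e-9):
    """Winner-take-all recall sketch; W has one row per constraint neuron."""
    W = np.asarray(W, dtype=float)
    x = np.array(x_noisy, dtype=float)
    Wb = np.sign(W)                                    # assumed backward weights
    deg = np.maximum(np.count_nonzero(W, axis=0), 1)   # degree of each pattern neuron
    for _ in range(t_max):
        h = W @ x                                      # forward iteration
        y = np.where(np.abs(h) > tol, np.sign(h), 0.0)
        if not y.any():                                # all constraints satisfied
            break
        g1 = (Wb.T @ y) / deg                          # net signed feedback
        g2 = (np.abs(Wb).T @ np.abs(y)) / deg          # fraction of violated neighbors
        winner = int(np.argmax(g2))                    # ties go to the lowest index
        x[winner] -= np.sign(g1[winner])               # move one step against the noise
    return x

# Toy sparse graph whose null space contains the constant patterns.
W_example = np.array([[1, -1, 0, 0],
                      [0, 1, -1, 0],
                      [0, 0, 1, -1],
                      [1, 0, 0, -1]], dtype=float)
# Pattern (2, 2, 2, 2) with +1 noise on the first neuron.
recovered = recall_wta(W_example, [3, 2, 2, 2])
```

In this toy case the noisy neuron is the only one whose neighbors are all violated, so it wins the first round and the pattern is restored after a single update.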

The second approach, given by Algorithm 3, is much simpler: in every iteration, each pattern neuron decides locally whether or not to update its current state. More specifically, if the amount of feedback received by a pattern neuron exceeds a threshold, the neuron updates its state; otherwise, it remains unchanged. (Footnote 4: Note that in order to maintain the current value of a neuron in case no input feedback is received, we can add self-loops to pattern neurons in Figure 1. These self-loops are not shown in the figure for clarity.)

Input:  Connectivity matrix , threshold , iteration
Output:  
1:  for  do
2:     Forward iteration: Calculate the weighted input sum for each neuron and set:
3:     Backward iteration: Each neuron with degree computes
4:     Update the state of each pattern neuron according to only if .
5:     
6:  end for
Algorithm 3 Recall Algorithm: Majority-Voting
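A minimal Python sketch of the Majority-Voting recall is given below. As the formulas in the listing are not legible, the following are assumptions rather than the authors' exact definitions: the forward step uses the sign of the weighted input sum, the backward weights are the signs of the forward weights, `g2` is the fraction of neighbors sending non-zero feedback, `g1` is the normalized net signed feedback, and every neuron with `g2` above the threshold `phi` moves one integer step against the sign of `g1`. The names `recall_majority_voting`, `W_example`, `phi`, `t_max` and `tol` are hypothetical.

```python
import numpy as np

def recall_majority_voting(W, x_noisy, phi=0.8, t_max=20, tol=1e-9):
    """Majority-voting recall sketch; W has one row per constraint neuron."""
    W = np.asarray(W, dtype=float)
    x = np.array(x_noisy, dtype=float)
    Wb = np.sign(W)                                    # assumed backward weights
    deg = np.maximum(np.count_nonzero(W, axis=0), 1)   # degree of each pattern neuron
    for _ in range(t_max):
        h = W @ x                                      # forward iteration
        y = np.where(np.abs(h) > tol, np.sign(h), 0.0)
        if not y.any():                                # all constraints satisfied
            break
        g1 = (Wb.T @ y) / deg                          # net signed feedback
        g2 = (np.abs(Wb).T @ np.abs(y)) / deg          # fraction of violated neighbors
        update = g2 > phi                              # each neuron decides locally
        x[update] -= np.sign(g1[update])
    return x

# Toy sparse graph whose null space contains the constant patterns.
W_example = np.array([[1, -1, 0, 0],
                      [0, 1, -1, 0],
                      [0, 0, 1, -1],
                      [1, 0, 0, -1]], dtype=float)
# Pattern (2, 2, 2, 2) with -1 noise on the third neuron.
recovered = recall_majority_voting(W_example, [2, 2, 1, 2])
```

Unlike the winner-take-all variant, all neurons are updated synchronously from the same feedback, so no global comparison across neurons is needed.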

In both algorithms, the quantity can be interpreted as the number of feedback messages received by pattern neuron from the constraint neurons. On the other hand, the sign of provides an indication of the sign of the noise that affects , and indicates the confidence level in the decision regarding the sign of the noise.

It is worthwhile mentioning that the Majority-Voting decoding algorithm is very similar to the Bit-Flipping algorithm of Sipser and Spielman to decode LDPC codes [31] and a similar approach in [32] for compressive sensing methods.

Remark 4.

To give the reader some insight about why the neural graph should be sparse in order for the above algorithms to work, consider the backward iteration of both algorithms: it is based on counting the fraction of received input feedback messages from the neighbors of a pattern neuron. In the extreme case, if the neural graph is complete, then a single noisy pattern neuron results in the violation of all constraint neurons in the forward iteration. As a result, in the backward iteration all the pattern neurons receive feedback from their neighbors and it is impossible to tell which of the pattern neurons is the noisy one.

However, if the graph is sparse, a single noisy pattern neuron only makes some of the constraints unsatisfied. Consequently, in the recall phase only the nodes which share the neighborhood of the noisy node receive input feedback. Moreover, the fraction of received feedback would be much larger for the original noisy node. Therefore, by merely looking at the fraction of received feedback from the constraint neurons, one can identify the noisy pattern neuron with high probability as long as the graph is sparse and the input noise is reasonably bounded.
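The contrast drawn in Remark 4 can be illustrated on two toy graphs. The sketch below computes, for each pattern neuron, the fraction of its neighboring constraints that are violated; the matrices `W_dense` and `W_sparse` and the function name `feedback_fraction` are made up for illustration, and both matrices are orthogonal to the pattern (1, 1, 1, 1).

```python
import numpy as np

def feedback_fraction(W, x, tol=1e-9):
    """Fraction of each pattern neuron's neighboring constraints that are violated."""
    W = np.asarray(W, dtype=float)
    h = W @ np.asarray(x, dtype=float)
    violated = (np.abs(h) > tol).astype(float)
    adjacency = (W != 0).astype(float)
    deg = np.maximum(adjacency.sum(axis=0), 1.0)
    return (adjacency.T @ violated) / deg

# Complete graph: every constraint touches every pattern neuron.
W_dense = np.array([[1, -1, 1, -1],
                    [1, 1, -1, -1],
                    [2, -1, -2, 1]], dtype=float)
# Sparse graph: every constraint touches two pattern neurons.
W_sparse = np.array([[1, -1, 0, 0],
                     [0, 1, -1, 0],
                     [0, 0, 1, -1],
                     [1, 0, 0, -1]], dtype=float)

x_noisy = [2, 1, 1, 1]  # pattern (1, 1, 1, 1) with +1 noise on the first neuron
dense_fraction = feedback_fraction(W_dense, x_noisy)
sparse_fraction = feedback_fraction(W_sparse, x_noisy)
```

In the complete graph every neuron sees all of its constraints violated, so the noisy neuron cannot be singled out, whereas in the sparse graph only the noisy neuron reaches a fraction of one.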

IV-B Some Practical Modifications

Although Algorithm 3 is fairly simple and practical, each pattern neuron still needs two types of information: the number of received feedback messages and the net input sum. Although one can think of simple neural architectures to obtain the necessary information, we can modify the recall algorithm to make it more practical and simpler. The trick is to replace the degree of each node with the -norm of the outgoing weights. In other words, instead of using , we use . Furthermore, we assume symmetric weights, i.e. .

Interestingly, in some of our experimental results corresponding to denser graphs, this approach performs much better, as will be illustrated in section VII. One possible reason behind this improvement might be the fact that using the -norm instead of the -norm in Algorithm 3 will result in better differentiation between two vectors that have the same number of non-zero elements, i.e. have equal -norms, but differ from each other in the magnitudes of their elements, i.e. their -norms differ. Therefore, the network may use this additional information in order to identify the noisy nodes in each update of the recall algorithm.

V Performance Analysis

In order to obtain analytical estimates on the recall probability of error, we assume that the connectivity graph is sparse. With respect to this graph, we define the pattern and constraint degree distributions as follows.

Definition 1.

For the bipartite graph , let () denote the fraction of edges that are adjacent to pattern (constraint) nodes of degree (). We call and the pattern and constraint degree distribution from the edge perspective, respectively. Furthermore, it is convenient to define the degree distribution polynomials as

The degree distributions are determined after the learning phase is finished and in this section we assume they are given. Furthermore, we consider an ensemble of random neural graphs with a given degree distribution and investigate the average performance of the recall algorithms over this ensemble. Here, the word "ensemble" refers to the fact that we assume having a number of random neural graphs with the given degree distributions and do the analysis for the average scenario.
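The edge-perspective degree distributions of Definition 1 can be read off a learned connectivity matrix. The sketch below assumes a hypothetical matrix `W` with one row per constraint neuron and one column per pattern neuron, and treats any entry with magnitude above `tol` as an edge; the function names are illustrative.

```python
import numpy as np
from collections import Counter

def edge_degree_distribution(degrees):
    """Map each degree d to the fraction of edges attached to nodes of degree d."""
    degrees = [int(d) for d in degrees if d > 0]
    total_edges = sum(degrees)
    if total_edges == 0:
        return {}
    counts = Counter(degrees)
    # A node of degree d contributes d edges, hence the factor d.
    return {d: d * c / total_edges for d, c in sorted(counts.items())}

def degree_distributions(W, tol=1e-12):
    """Return (pattern, constraint) edge-perspective degree distributions of W."""
    A = np.abs(np.asarray(W, dtype=float)) > tol
    pattern_deg = A.sum(axis=0)      # column degrees: pattern neurons
    constraint_deg = A.sum(axis=1)   # row degrees: constraint neurons
    return edge_degree_distribution(pattern_deg), edge_degree_distribution(constraint_deg)

W = np.array([[1, -1, 0, 0],
              [0, 1, -1, 0],
              [1, 0, 1, -2]], dtype=float)
lam, rho = degree_distributions(W)
```

For this example graph with 7 edges, three pattern nodes have degree 2 and one has degree 1, so `lam` assigns 6/7 of the edges to degree 2 and 1/7 to degree 1.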

To simplify analysis, we assume that the noise entries are . However, the proposed recall algorithms can work with any integer-valued noise and our experimental results suggest that this assumption is not necessary in practice.

Finally, we assume that the errors do not cancel each other out in the constraint neurons (as long as the number of errors is fairly bounded). This is in fact a realistic assumption because the neural graph is weighted, with weights belonging to the real field, and the noise values are integers. Thus, the probability that the weighted sum of some integers be equal to zero is negligible.

We do the analysis only for the Majority-Voting algorithm since if we choose the Majority-Voting update threshold , roughly speaking, we will have the winner-take-all algorithm. (Footnote 5: It must be mentioned that choosing does not yield the WTA algorithm exactly because in the original WTA, only one node is updated in each round. However, in this version with , all nodes that receive feedback from all their neighbors are updated. Nevertheless, the performance of both algorithms is rather similar.)

As mentioned earlier, in this paper we will perform the analysis for general sparse bipartite graphs. However, restricting ourselves to a particular type of sparse graphs known as "expanders" allows us to prove stronger results on the recall error probabilities. More details can be found in Appendix C and in [10]. However, since it is very difficult, if not impossible in certain cases, to make a graph an expander during an iterative learning method, we focus on the more general case of sparse neural graphs.

To start the analysis, let denote the set of erroneous pattern nodes at iteration , and be the set of constraint nodes that are connected to the nodes in , i.e. these are the constraint nodes that have at least one neighbor in . In addition, let denote the (complementary) set of constraint neurons that do not have any connection to any node in . Denote also the average neighborhood size of by . Finally, let be the set of correct pattern nodes.

Based on the error correcting algorithm and the above notations, in a given iteration two types of error events are possible:

  1. Type-1 error event: A node decides to update its value. The probability of this phenomenon is denoted by .

  2. Type-2 error event: A node updates its value in the wrong direction. Let denote the probability of error for this type.

We start the analysis by finding explicit expressions and upper bounds on the average of and over all nodes as a function of . We then find an exact relationship for as a function of , which will provide us with the required expressions on the average bit error probability as a function of the number of noisy input symbols, . Having found the average bit error probability, we can easily bound the block error probability for the recall algorithm.

V-A Error probability - type 1

To begin, let be the probability that a node with degree updates its state. We have:

(16)

where is the neighborhood of . Assuming random construction of the graph and relatively large graph sizes, one can approximate by

(17)

In the above equation, represents the probability of having one of the edges connected to the constraint neurons that are neighbors of the erroneous pattern neurons.

As a result of the above equations, we have:

(18)

where denotes the expectation over the degree distribution .

Note that if , the above equation simplifies to

V-B Error probability - type 2

A node makes a wrong decision if the net input sum it receives has a different sign than the sign of noise it experiences. Instead of finding an exact relation, we bound this probability by the probability that the neuron shares at least half of its neighbors with other neurons, i.e. , where . Letting , we will have:

(19)

where

Therefore, we will have:

(20)

Combining equations (18) and (20), the bit error probability at iteration would be

(21)

And finally, the average block error rate is given by the probability that at least one pattern node is in error. Therefore:

(22)

Equation (22) gives the probability of making a mistake in iteration . Therefore, we can bound the overall probability of error, , by setting . To this end, we have to recursively update in equation (21) and using . However, since we have assumed that the noise values are , we can provide an upper bound on the total probability of error by considering

(23)

In other words, we assume that the recall algorithms either correct the input error in the first iteration or an error is declared. Obviously, this bound is not tight, as in practice one might be able to correct errors in later iterations. In fact, simulation results confirm this expectation. However, this approach provides a nice analytical upper bound since it only depends on the initial number of noisy nodes. As the initial number of noisy nodes grows, the above bound becomes tight. Thus, in summary we have:

(24)

where and is the number of noisy nodes in the input pattern initially.

Remark 5.

One might hope to further simplify the above inequalities by finding closed form approximation of equations (17) and (19). However, as one expects, this approach leads to very loose and trivial bounds in many cases. Therefore, in our experiments shown in section VII we compare simulation results to the theoretical bound derived using equations (17) and (19).

Now, what remains to do is to find an expression for and as a function of . The following lemma will provide us with the required relationship.

Lemma 3.

The average neighborhood size in iteration is given by:

(25)

where is the average degree for pattern nodes.

Proof.

The proof is given in Appendix A. ∎

VI Pattern Retrieval Capacity

It is interesting to see that, except for its obvious influence on the learning time, the number of patterns does not have any effect in the learning or recall algorithm. As long as the patterns come from a subspace, the learning algorithm will yield a matrix which is orthogonal to all of the patterns in the training set. And in the recall phase, all we deal with is , with being the noise which is independent of the patterns.

Therefore, in order to show that the pattern retrieval capacity is exponential in , all we need to show is that there exists a "valid" training set with patterns of length for which , for some and . By valid we mean that the patterns should come from a subspace with dimension and the entries in the patterns should be non-negative integers. The next theorem proves the desired result.

Theorem 4.

Let be a matrix, formed by vectors of length with non-negative integer entries between and . Furthermore, let for some . Then, there exists a set of such vectors for which , with , and .

Proof.

The proof is based on construction: we construct a data set with the required properties. To start, consider a matrix with rank and , with . Let the entries of be non-negative integers, between and , with .

We start constructing the patterns in the data set as follows: consider a set of random vectors , , with integer-valued entries between and , where . We set the pattern to be , if all the entries of are between and . Obviously, since both and have only non-negative entries, all entries in are non-negative. Therefore, it is the upper bound that we have to worry about.

The entry in is equal to , where is the column of . Suppose has non-zero elements. Then, we have:

Therefore, denoting , we could choose , and such that

(26)

to ensure all entries of are less than .

As a result, since there are vectors with integer entries between and , we will have patterns forming . This means , which is exponential in if . ∎

As an example, if can be selected to be a sparse matrix with entries (i.e. ) and , and is also chosen to be a vector with elements (i.e. ), then it is sufficient to choose to have a pattern retrieval capacity of .
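The construction in the proof of Theorem 4 can be sketched for a small case. The ranges below are assumptions, since the bounds are not legible in the text: message entries are taken from {0, ..., upsilon-1}, a pattern is kept only if all of its entries are below S, and each pattern is the message vector multiplied by a hypothetical sparse binary generator matrix `G` with one row per message symbol.

```python
import itertools
import numpy as np

def build_patterns(G, upsilon, S):
    """Enumerate messages u in {0..upsilon-1}^k and keep x = u G with all entries < S."""
    G = np.asarray(G)
    k = G.shape[0]
    patterns = []
    for u in itertools.product(range(upsilon), repeat=k):
        x = np.array(u) @ G          # j-th entry is the inner product of u with column j of G
        if x.max() < S:
            patterns.append(x)
    return np.array(patterns)

# k = 3 message symbols, n = 6 pattern neurons, at most 2 non-zeros per column.
G = np.array([[1, 0, 0, 1, 1, 0],
              [0, 1, 0, 1, 0, 1],
              [0, 0, 1, 0, 1, 1]])

# With binary messages each entry is at most 2, so S = 3 admits every message.
all_patterns = build_patterns(G, upsilon=2, S=3)
# A tighter S discards the patterns that contain an entry equal to 2.
fewer_patterns = build_patterns(G, upsilon=2, S=2)
subspace_dim = np.linalg.matrix_rank(all_patterns)
```

With S = 3 all 2^3 messages yield valid patterns and they span a subspace of dimension 3 inside a 6-dimensional space; with S = 2 only the messages with at most one non-zero symbol survive, which mirrors the observation that sparse message vectors remain valid when the worst-case bound is not met.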

Remark 6.

Note that inequality (26) was obtained for the worst-case scenario and in fact is very loose. Therefore, even if it does not hold, we will still be able to memorize a very large number of patterns since a big portion of the generated vectors will have entries less than . These vectors correspond to the message vectors that are "sparse" as well, i.e. do not have all entries greater than zero. The number of such vectors is a polynomial in , the degree of which depends on the number of non-zero entries in .

VII Simulation Results

VII-A Simulation Scenario

We have simulated the proposed learning and recall algorithms for three different network sizes , with for all cases. For each case, we considered a few different setups with different values for , , and in the learning algorithm 1, and different for the Majority-Voting recall algorithm 3. For brevity, we do not report all the results for various combinations but present only a selection of them to give insight on the performance of the proposed algorithms.

In all cases, we generated random training sets using the approach explained in the proof of theorem 4, i.e. we generated a generator matrix at random with entries and . We also used generating message words and put to ensure the validity of the generated training set.

However, since in this setup we will have patterns to memorize, doing a simulation over all of them would take a lot of time. Therefore, we have selected a random sample sub-set each time with size for each of the generated sets and used these subsets as the training set.

For each setup, we performed the learning algorithm and then investigated the average sparsity of the learned constraints over the ensemble of instances. As explained earlier, all the constraints for each network were learned in parallel, i.e. to obtain constraints, we executed Algorithm 1 from random initial points times.

As for the recall algorithms, the error correcting performance was assessed for each set-up, averaged over the ensemble of instances. The empirical results are compared to the theoretical bounds derived in Section V as well.

VII-B Learning Phase Results

In the learning algorithm, we pick a pattern from the training set each time and adjust the weights according to Algorithm 1. Once we have gone over all the patterns, we repeat this operation several times to make sure that the update for one pattern does not adversely affect the other learned patterns. Let be the iteration number of the learning algorithm, i.e. the number of times we have gone over the training set so far. Then we set to ensure that the conditions of Theorem 1 are satisfied. Interestingly, all of the constraints converged in at most two learning iterations for all different setups. Therefore, the learning is very fast in this case.

Figure 3 illustrates the percentage of pattern nodes with the specified sparsity measure defined as , where is the number of non-zero elements. From the figure we notice two trends. The first is the effect of the sparsity threshold: as it is increased, the network becomes sparser. The second is the effect of network size: as it grows, the connections become sparser.

Fig. 3: The percentage of variable nodes with the specified sparsity measure and different values of network sizes and sparsity thresholds. The sparsity measure is defined as , where is the number of non-zero elements.

VII-C Recall Phase Results

For the recall phase, in each trial we pick a pattern randomly from the training set, corrupt a given number of its symbols with noise and use the suggested algorithm to correct the errors. A pattern error is declared if the output does not match the correct pattern. We compare the performance of the two recall algorithms: Winner-Take-All (WTA) and Majority-Voting (MV). Table I shows the simulation parameters in the recall phase for all scenarios (unless specified otherwise).
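The trial procedure described above can be sketched as a small Monte Carlo harness. The sketch assumes that each corrupted symbol is shifted by +1 or -1 and that `recall` is any function mapping a noisy vector to a recalled vector; the names `pattern_error_rate`, `recall`, `num_errors` and `trials` are hypothetical and no simulation results from the paper are reproduced.

```python
import numpy as np

def pattern_error_rate(recall, patterns, num_errors, trials, rng):
    """Estimate the pattern error rate of `recall` under +/-1 noise (assumed noise model)."""
    patterns = np.asarray(patterns)
    n = patterns.shape[1]
    failures = 0
    for _ in range(trials):
        x = patterns[rng.integers(len(patterns))]        # pick a stored pattern at random
        noisy = x.copy()
        idx = rng.choice(n, size=num_errors, replace=False)
        noisy[idx] += rng.choice([-1, 1], size=num_errors)
        # A pattern error is declared if the output differs from the stored pattern.
        if not np.array_equal(recall(noisy), x):
            failures += 1
    return failures / trials

# Toy memory of constant patterns, with a median-based recall used only to exercise the harness.
toy_patterns = np.array([[1] * 5, [2] * 5, [3] * 5])
rng = np.random.default_rng(0)
rate_identity = pattern_error_rate(lambda v: v, toy_patterns, 1, 50, rng)
rate_median = pattern_error_rate(
    lambda v: np.full_like(v, int(np.median(v))), toy_patterns, 1, 50, rng
)
```

A recall routine such as the Winner-Take-All or Majority-Voting algorithm, with its connectivity matrix bound in advance, would be passed in place of the toy lambdas.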

Parameter
Value
TABLE I: Simulation parameters

Figure 4 illustrates the effect of the sparsity threshold on the performance of the error correcting algorithm in the recall phase. Here, we have and . Two different sparsity thresholds are compared, namely and . Clearly, as the network becomes sparser, i.e. increases, the performance of both recall algorithms improves.

Fig. 4: Pattern error rate against the initial number of erroneous nodes for two different values of . Here, the network size is and . The blue curves correspond to the sparser network (larger ) and clearly show a better performance.

In Figure 5 we have investigated the effect of network size on the performance of the recall algorithms by comparing the pattern error rates for two different network sizes, namely and with in both cases. As is evident from the figure, the performance improves to a great extent when we have a larger network. This is partially because in larger networks, the connections are relatively sparser as well.

Fig. 5: Pattern error rate against the initial number of erroneous nodes for two different network sizes and . In both cases .

Figure 6 compares the results obtained in simulation with the upper bound derived in Section V. Note that as expected, the bound is quite loose since in deriving inequality (22) we only considered the first iteration of the algorithm.

Fig. 6: Pattern error rate against the initial number of erroneous nodes and comparison with theoretical upper bounds for , , and .

We have also investigated the tightness of the bound given in equation (23) with simulation results. To this end, we compare and in our simulations for the case of noise. Figure 7 illustrates the result and it is evident that allowing the recall algorithm to iterate improves the final probability of error to a great extent.

Fig. 7: Pattern error rate in the first and last iterations against the initial number of erroneous nodes for , , , and .

Finally, we investigate the performance of the modified, more practical version of the Majority-Voting algorithm, which was explained in Section IV-B. Figure 8 compares the performance of the WTA and original MV algorithms with the modified version of the MV algorithm for a network with size , and learning parameters , and . The neural graph of this particular example is rather dense, because of small and sparsity threshold . Therefore, here the modified version of the Majority-Voting algorithm performs better because of the extra information provided by the -norm (compared with the -norm in the original version of the Majority-Voting algorithm). However, note that we did not observe this trend for the other simulation scenarios where the neural graph was sparser.

Fig. 8: Pattern error rate against the initial number of erroneous nodes for two different values of . Here, the network size is and . The blue curves correspond to the sparser network (larger ) and clearly show a better performance.

VIII Conclusions and Future Works

In this paper, we proposed a neural associative memory which is capable of exploiting inherent redundancy in input patterns to enjoy an exponentially large pattern retrieval capacity. Furthermore, the proposed method uses simple iterative algorithms for both learning and recall phases, which makes gradual learning possible and maintains rather good recall performance. The convergence of the proposed learning algorithm was proved using techniques from stochastic approximation. We also analytically investigated the performance of the recall algorithm by deriving an upper bound on the probability of recall error as a function of input noise. Our simulation results confirm the consistency of the theoretical results with those obtained in practice, for different network sizes and learning/recall parameters.

Improving the error correction capabilities of the proposed network is def