I Introduction
We are all familiar with conventional memory systems where the address space and the information content stored in the memory are kept separate. For instance, given a mailbox number, we can fetch the parcels inside, and in a modern computer, the CPU retrieves a stored integer from RAM by accessing a specified 32- or 64-bit hardware address.
An associative memory is a device or data structure that maps input patterns to output patterns. It differs from conventional memory systems in that no explicit addresses are constructed. Associative memories store paired patterns. Then, given an input pattern, the associative memory produces the paired output pattern. Since no explicit address is involved in its operation, the content of the input pattern itself associates directly with the paired output pattern, from which the name associative memory originates. Although associative memories could be implemented using conventional memory systems, neural networks have been used as associative memories which retrieve patterns without having to search through the stored pattern space. It is worth noting that hash tables, implemented using conventional memory systems, resemble associative memories since they map keys (inputs) to values (outputs), but still an explicit address needs to be generated first.
Associative memories can be categorized into two types [1]: heteroassociative (e.g., linear associator [2, 3], bidirectional associative memories [4] and Willshaw networks [5]) and autoassociative (e.g., Hopfield networks [6, 7]). Heteroassociative memories associate input with output patterns of possibly distinct nature and formats, whereas autoassociative memories are a special case where input and output patterns coincide. This paper focuses on autoassociative memories.
Associative memories have applications in a variety of domains. For instance, in communication networks [8], routers need to quickly determine which port an incoming frame should be forwarded to based on the destination IP address. In signal and image processing [9], one commonly needs to match noisy or corrupted data to a predefined template. Similar tasks appear in database engines [10], anomaly detection systems [11], data compression algorithms [12], face recognition systems [13], and many other machine learning frameworks.
I-A Historical Background
Associative memories have a long history within the field of neural networks. Associative memories provide two operations: storing and retrieving. In the storing operation, pairs of patterns are fed into the memory and the internal connections between neurons are modified, forming an aggregated representation of the stored pairs. In the retrieving operation (also referred to as “decoding”), the associative memory is presented with a probe pattern, which may be a corrupted or modified version of the stored pattern, and the memory should retrieve the most relevant pattern that was previously stored in a quick and reliable manner.
The linear associator [2, 3] is one of the simplest and earliest associative memory models; see Fig. 1
for an illustration. A linear associator has an input layer and an output layer. Synapses only exist between these two layers, hence the network can be viewed as a bipartite graph. Connections in the network are directed from input to output neurons. The number of neurons in each layer can be different in general, so the linear associator can be used as both an autoassociative and a heteroassociative memory. While storing patterns, the linear associator modifies link weights according to Hebb’s rule
[14]. While decoding a pattern, the network is presented with a given input pattern, and the paired output pattern is retrieved from the output layer immediately after one step of feed-forward computation. Since the paired pattern depends on a linear combination of the input pattern values, if all the input patterns are pairwise orthogonal, then the linear associator can reconstruct the paired patterns perfectly. However, in most cases the orthogonality does not hold, and thus the network diversity (i.e., the number of patterns that the network can store) is extremely low.

The first formal analysis of associative memories, by Willshaw [5, 15], dates back to the early 1970s. The structure of a Willshaw network is similar to a linear associator; it is a two-layer fully connected network, with the exception that the weights on the synapses are constrained to be 0 or 1. The plausibility of biological neural networks discourages a fully connected network. Therefore, Buckingham and Willshaw [16] study an incompletely connected network and propose several retrieval strategies to recall the patterns. Although simple, the Willshaw network is one of the most efficient models in terms of information stored per bit of memory (0.68 for heteroassociative and half of that for autoassociative [17, 18], compared to 0.14 for a Hopfield network [19]). For the history and interesting developments of the Willshaw network, see the recent survey [20] and the references therein.
Hopfield’s seminal work [6, 7] on associative memories brought these structures to the attention of the neural network community in the early 1980’s. Fig. 2 shows an example of a Hopfield network. Instead of having two layers, Hopfield networks comprise one layer of a bidirectional complete graph, acting as both input and output. Therefore, a Hopfield network can only be used as an autoassociative memory. Retrieval of a pattern from the network proceeds recurrently; i.e., when an impulse enters the network, the (output) values at iteration t serve as the input values at iteration t + 1, and the values iterate until the network reaches its stable configuration, if it ever converges. Kosko [4] extends the Hopfield network into a two-layer bidirectional associative memory (BAM). BAMs are different from linear associators because the edges in a BAM are not directed, and the retrieval rule is different. In a BAM, values at the input and output iterate until an equilibrium is reached. Since a BAM incorporates distinct input and output layers, it can be used as both a heteroassociative and an autoassociative memory, filling the gap between the linear associator and Hopfield networks.
I-B Related Work
The recent work of Gripon and Berrou [21, 22] proposes a new family of sparse neural network architectures for associative memories. We refer to these as Gripon-Berrou neural networks (GBNNs). In short, GBNNs are a variant of the Willshaw networks with a c-partite structure, where c is the number of clusters of neurons. The GBNN combines the notion of recurrence from Hopfield networks with ideas from the field of error correcting codes, and achieves nearly optimal retrieval performance. A detailed description of the GBNN architecture and operation is given in Section II.
The GBNN is not the first attempt to link associative memories with error correcting codes. For example, Berrou and Gripon [23] successfully introduce a set of Walsh-Hadamard codes in the framework of BAMs. The same authors also consider the use of sparse coding in a Hopfield network. They show that, given the same amount of storage, the GBNN outperforms conventional Hopfield networks in diversity, capacity (i.e., the maximum amount of stored information in bits), and efficiency (i.e., the ratio between capacity and the amount of information in bits consumed by the network when diversity reaches its maximum), while decreasing the retrieval error. In [24], GBNNs are interpreted using the formalism of error correcting codes, and a new retrieval rule is introduced to further decrease the error rate. Jiang et al. [25] modify the GBNN structure to learn long sequences by incorporating directed edges into the network. Aliabadi et al. [26] make the extension to learn sparse messages.
The literature mentioned in the paragraphs above focuses on studying theoretical properties of GBNNs. To be useful in many applications, it is also essential to develop fast and efficient implementations of GBNNs. Jarollahi et al. [27] demonstrate a proof-of-concept implementation using a field programmable gate array (FPGA). Due to hardware limitations, their implementation is constrained to a small number of neurons. Larras et al. [28] implement an analog version of the same network which consumes less energy and is more efficient in both circuit area and speed, compared with an equivalent digital circuit. However, the network size is even further constrained.
I-C Contributions
The primary contribution of this paper is to demonstrate an implementation of GBNNs on a GPU using the compute unified device architecture (CUDA). Our massively parallel implementation supports a much larger number of neurons than existing ones, and is faster than a CPU implementation using optimized C++ libraries for linear algebra operations, without any loss of retrieval accuracy. We hope that the existing algorithms built on GBNNs can all benefit from the results we present.
Towards developing an efficient parallel GBNN implementation, we study two retrieval rules: sum-of-sum and sum-of-max, which have been previously proposed in [22] and [24]. sum-of-sum is fast to implement in CUDA, because it requires only matrix-vector multiplications, a highly optimized operation. sum-of-max is slower because it involves nonlinear operations, but it gives superior retrieval performance (lower error rates). We illustrate that, although faster, sum-of-sum can lead to problematic oscillations. We also prove that the sum-of-max rule is guaranteed to converge, and we derive properties of both rules.
The tremendous speedup mentioned above comes from two main sources. First, we exploit the highly parallel architecture of the GPU to carry out operations efficiently. Second, we develop a hybrid retrieval scheme using aspects of both sum-of-sum and sum-of-max, which is tailored to parallel decoding architectures. Although we discuss a GPU implementation, we believe the ideas presented here can be used to accelerate associative memory implementations on other parallel architectures.
We emphasize that this work neither focuses on the GBNN model itself (see [21, 22]), nor carries out comparative studies with other associative memory implementations, e.g., [29]. Instead, we are interested in developing robust and fast procedures to recall stored information given incomplete probes. As a motivating example, consider recovering messages over an erasure channel, which is a common scenario, especially in ubiquitous IP-based communications. The errors encountered by IP packets can be mitigated by checksums and error correcting codes. However, missing packets have to be recovered to avoid time-consuming and unreliable retransmissions.
I-D Paper Organization
The rest of this paper is structured as follows. Section II reviews the GBNN associative memory architecture. Section III reviews the sum-of-sum and sum-of-max retrieval rules. Section IV presents the proposed acceleration techniques and discusses the customized CUDA kernel functions which implement these techniques. Section V provides theoretical analysis and discussion of some properties of the retrieval rules considered in this work. Section VI proposes the novel hybrid retrieval rule. Section VII presents experimental results demonstrating the significant performance improvements obtained using GPUs. The paper concludes in Section VIII.
II Gripon-Berrou Neural Networks (GBNNs)
II-A Structure
A message (pattern) can be divided into a tuple of smaller symbols. Specifically, we divide the message into c symbols, m = (m_1, m_2, ..., m_c), where each symbol m_i takes values in a finite set of size l. For example, English words of a fixed length could be represented character by character as symbols from an alphabet of size 26; alternatively, groups of characters could be treated as single symbols from a correspondingly larger alphabet. Similarly, in an image, a symbol could correspond to the intensity of a specific pixel, or to the collective intensities of a patch of pixels. Here we work in the abstract setting of messages and symbols defined above; precisely how the associative memory is used is application-dependent.
A GBNN [22] architecture to learn such messages comprises n = cl binary-valued (0 or 1) neurons. The neurons are grouped into c clusters of l neurons each, and edges only exist between different clusters. A message m is represented in the network by activating (i.e., setting to 1) one neuron in each cluster i corresponding to the value of m_i, and setting all other neurons to 0. In this way, the message is naturally encoded as a binary string of length n with exactly c ones.
When a network is initialized, all edge weights are set to zero (equivalently, there are no edges in the network). When storing a message, we add edges to the network connecting all pairs of nodes which are activated for the particular message. For example, consider the network depicted in Fig. 3, where each message contains c = 4 symbols and each symbol takes one of l = 16 different values. Let us use the convention that clusters are numbered from left to right and from top to bottom; let us use the same convention within each cluster, so that the neurons within each cluster are numbered 1 through 16 from left to right along the first row, and so on. The message indicated by the bold edges is (9, 4, 3, 10). The edges corresponding to any single message stored in the network thus correspond to a clique, since the neurons connected for that message form a complete subgraph. The binary code that represents the bold clique in Fig. 3 reads 0000000010000000 0001000000000000 0010000000000000 0000000001000000.
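To make the storing operation concrete, the following pure-Python sketch (a simplified model of our own, not the paper's CUDA implementation; the 0-based symbol indexing and all names are illustrative) encodes a message as a sparse binary string and stores its clique, with the cluster sizes of the Fig. 3 example:

```python
c, l = 4, 16          # clusters and neurons per cluster, as in Fig. 3
n = c * l             # total number of neurons

def encode(message):
    """Binary code of length n with exactly c ones (one per cluster)."""
    code = [0] * n
    for i, symbol in enumerate(message):
        code[i * l + symbol] = 1
    return code

def store(W, message):
    """Add the clique connecting the neurons activated by `message`."""
    active = [i * l + symbol for i, symbol in enumerate(message)]
    for a in active:
        for b in active:
            if a // l != b // l:      # edges only between different clusters
                W[a][b] = 1

W = [[0] * n for _ in range(n)]
store(W, (8, 3, 2, 9))   # the bold clique of Fig. 3, with 0-based symbols
```

Storing further messages only adds edges; overlapping messages share neurons and edges, which is what ultimately limits diversity.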
For retrieval, the network is presented with an incomplete message as a probe, e.g., the message of Fig. 3 with its last symbol erased, and it must determine which (if any) stored message matches this input best. In this paper we focus on the case where only partial messages are presented for retrieval. If the network is presented with an entire message as a probe, then the problem boils down to deciding whether or not this message has been stored. For this case, it has been shown that the missed detection rate is zero (i.e., messages which were previously stored are always recognized by the GBNN), and the false positive rate depends on the number of messages which have been stored in the network [22].
The retrieval rules studied in this paper (Section III) are specifically designed for the case where the probe contains missing values. GBNNs can also be used with inputs which contain errors (e.g., flipped bits), but the decoding rule must be changed significantly and the decoding rules studied in this paper are no longer applicable.
II-B Characteristics
In a classic Willshaw network model, a unique activation threshold needs to be chosen globally, e.g., either the number of active neurons in the input probe or the maximum number of signals accumulated in output neurons. This global threshold is no longer required in a GBNN, since each cluster can naturally decide for itself. The most closely related model in the literature is that of Shim et al. [30]. However, there are two main advantages of GBNNs, both of which are supported by simulations in Section VII:
First, sparseness has been heavily exploited to perform machine learning tasks and statistical inference. One of the most famous examples is compressive sensing [31]. Neuroscientists are also aware of the sparseness principle [32], not only because of its practical applications but also because of the low firing rate of the neurons in biological networks [33]. The cluster structure of GBNN produces an extremely sparse binary code by definition (one active neuron per cluster), which makes the network biologically plausible, and also makes fast implementations possible.

Second, it is mentioned in [34] that the performance of an associative memory is severely affected given correlated patterns. There, an intermediate “grandmother cell” layer is suggested to encode each pair of patterns using a dedicated neuron. For GBNN, however, this particular problem can be mitigated by padding messages with extra random symbols, at the cost of additional resources. The cluster structure again makes this extension straightforward.
III Retrieval Rules
In this section, we review two existing retrieval rules for GBNN: sum-of-sum and sum-of-max.
III-A The sum-of-sum Rule
The simplest rule [22] is to add all the signals a neuron receives in the current iteration. When presented with an incomplete message, we initialize the network by deactivating (i.e., setting to 0) all the neurons within the clusters associated with erased symbols. We then repeat the following iterations. First, each neuron computes the sum of all connected neurons which are presently active. Then the neurons within each cluster with the most active connected neurons remain activated at the beginning of the next iteration.
Formally, let n_{ij} denote the j-th neuron in the i-th cluster, and let w_{(ij)(i′j′)} denote an indicator variable for whether or not a connection is present between n_{ij} and n_{i′j′}; i.e.,

w_{(ij)(i′j′)} = 1 if n_{ij} and n_{i′j′} are connected, and w_{(ij)(i′j′)} = 0 otherwise.  (1)

We also denote by s_{ij}(t) and v_{ij}(t) respectively the score function for the number of signals n_{ij} receives and the indicator function for whether or not n_{ij} is activated at iteration t, with v_{ij}(0) being the corresponding value for n_{ij} in the probe; i.e.,

v_{ij}(0) = 1 if n_{ij} is activated by the probe, and v_{ij}(0) = 0 otherwise.  (2)
As a consequence, the retrieval procedure can be formalized as

s_{ij}(t) = γ v_{ij}(t) + Σ_{i′=1}^{c} Σ_{j′=1}^{l} w_{(i′j′)(ij)} v_{i′j′}(t),  (3)

s_{i,max}(t) = max_{1 ≤ j ≤ l} s_{ij}(t),  (4)

v_{ij}(t + 1) = 1 if s_{ij}(t) = s_{i,max}(t), and v_{ij}(t + 1) = 0 otherwise,  (5)

where γ ≥ 0 is a reinforcement factor. Essentially, Eq. (3) counts the score for each neuron. It involves summing over all clusters and all neurons within each cluster, hence the name sum-of-sum. Eq. (4) finds the value of the neurons with the strongest signal in each cluster, and Eq. (5) keeps them activated.
At the retrieval stage, the variables w_{(ij)(i′j′)} are fixed. These binary-valued variables are only changed when storing new messages. The only parameter to be tuned for retrieval using sum-of-sum is γ, which determines the extent to which a neuron’s own value influences its signal at the current iteration.
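As a sanity check of Eqs. (3)-(5), here is a minimal pure-Python sketch of one sum-of-sum iteration on a toy network (an illustrative setup of our own, not the paper's implementation), which recovers a stored message from a probe with one erased symbol:

```python
c, l, gamma = 3, 2, 1       # toy sizes; gamma is the reinforcement factor
n = c * l

# store the single message m = (0, 1, 0) as a clique (cf. Section II)
active = [i * l + s for i, s in enumerate((0, 1, 0))]
W = [[0] * n for _ in range(n)]
for a in active:
    for b in active:
        if a // l != b // l:
            W[a][b] = 1

def sum_of_sum_step(v):
    # Eq. (3): gamma * own value + number of active connected neurons
    s = [gamma * v[k] + sum(W[kk][k] * v[kk] for kk in range(n))
         for k in range(n)]
    v_next = [0] * n
    for i in range(c):
        peak = max(s[i * l:(i + 1) * l])     # Eq. (4): per-cluster maximum
        for j in range(l):                   # Eq. (5): keep the maxima active
            if peak > 0 and s[i * l + j] == peak:
                v_next[i * l + j] = 1
    return v_next

probe = [0, 0, 0, 1, 1, 0]     # first symbol erased: cluster 0 deactivated
print(sum_of_sum_step(probe))  # → [1, 0, 0, 1, 1, 0], the stored message
```

The `peak > 0` guard is a sketch choice so that fully silent clusters stay deactivated rather than activating every neuron with score zero.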
III-B Problems with the sum-of-sum Rule
The sum-of-sum rule, although straightforward and natural, might lead to unnecessary errors. This is due to the fact that, during the iterations, after evaluating Eq. (5), there might be multiple neurons in one cluster achieving the maximum value simultaneously. In this case, all these neurons will stay activated and contribute to the signal strengths in the next iteration.
Consider the scenario shown in Fig. 4, where two neurons in the same cluster both receive the same number of signals. The first neuron receives two signals from a single cluster, while the second receives one signal from each of two different clusters. In this case, the second neuron should be favored, because we know that for any individual pattern that has been stored, only one neuron in each cluster should be activated. A possible but worse situation arises when the first neuron receives more signals than the second, since then the first will be the only activated neuron in this cluster at the beginning of the next iteration, even if the second was actually the correct neuron in the cluster. An increasing number of clusters complicates the problem even further. This behavior can also cause sum-of-sum to diverge; see Section V.
III-C The sum-of-max Rule
To avoid the problem mentioned in the previous subsection, the sum-of-max rule is proposed in [24]. The rule is formally described as follows:
s_{ij}(t) = γ v_{ij}(t) + Σ_{i′ ≠ i} max_{1 ≤ j′ ≤ l} w_{(i′j′)(ij)} v_{i′j′}(t),  (6)

v_{ij}(t + 1) = 1 if s_{ij}(t) = c − 1 + γ, and v_{ij}(t + 1) = 0 otherwise.  (7)

Eq. (6) involves a summation over a max operation, hence the name sum-of-max. The basic idea is that, to retrieve the correct message, the score of a neuron should not be larger if it receives multiple signals from the same cluster, and the maximum taken in Eq. (6) ensures that each neuron receives at most one signal from each cluster. Since each stored message corresponds to a clique of c neurons, one in each cluster, a neuron should be activated if it receives exactly c − 1 signals from the other clusters plus the value γ from its self loop.
For sum-of-max to work properly, the network must be initialized appropriately when a probe is presented. Instead of initializing all neurons associated with erased symbols to be 0 as in sum-of-sum, we initialize them to be 1. In that case, other neurons will definitely receive signals from these missing clusters, one signal per missing cluster, but they will be regulated by Eq. (7).
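The sum-of-max update and its initialization can likewise be sketched in pure Python (again an illustrative toy of our own, with γ = 1 and a single stored message); note that the erased cluster is initialized all-active, and the probe converges to the stored message in one step:

```python
c, l, gamma = 3, 2, 1
n = c * l

# store the single message m = (0, 1, 0) as a clique (cf. Section II)
active = [i * l + s for i, s in enumerate((0, 1, 0))]
W = [[0] * n for _ in range(n)]
for a in active:
    for b in active:
        if a // l != b // l:
            W[a][b] = 1

def sum_of_max_step(v):
    v_next = [0] * n
    for k in range(n):
        i = k // l
        s = gamma * v[k]            # self-loop term of Eq. (6)
        for ii in range(c):
            if ii != i:             # Eq. (6): at most one signal per cluster
                s += max(W[kk][k] * v[kk]
                         for kk in range(ii * l, (ii + 1) * l))
        if s == c - 1 + gamma:      # Eq. (7): all other clusters + self loop
            v_next[k] = 1
    return v_next

probe = [1, 1, 0, 1, 1, 0]     # erased cluster 0 initialized all-active
print(sum_of_max_step(probe))  # → [1, 0, 0, 1, 1, 0], the stored message
```

If the erased cluster were instead initialized to all zeros, no neuron could reach the threshold c − 1 + γ and the network would go all-deactivated, which is why the all-active initialization matters.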
IV Accelerating Retrieval
In this section, we first briefly introduce the CUDA architecture. We discuss different approaches to speeding up the GBNN retrieval procedure in general, and then we focus on specific techniques for sum-of-sum and sum-of-max separately. We also illustrate graphically the dedicated CUDA kernel functions for both rules. Note that, although we implement GBNN using CUDA, the acceleration techniques do not depend on any CUDA-specific attribute, and thus can be easily extended to other architectures.
IV-A CUDA
The Compute Unified Device Architecture (CUDA), introduced in 2007, is NVIDIA’s computing platform solution to general purpose computing on graphics processing units (GPGPU), which enables dramatic increases in computing performance by harnessing the massively parallel resources of GPUs. See [35] by Kirk and Hwu for more information.
The basic programming pattern in CUDA is as shown in Fig. 5, where CPUs play the role of managers, invoking on the GPUs some computationally intensive functions called kernel functions. After a kernel function is executed on the GPU, the CPU collects the results back to the host and then may invoke more kernel functions if necessary. Although a GPU can spawn many threads working simultaneously, each thread must run the same sequence of instructions. Kernel functions, and hence GPU computing in general, fit the category of “single instruction multiple data” (SIMD) [36] parallel computing platforms. The data are transferred back and forth between the CPU and GPU over the (slow) PCI or PCIe bus, one of the performance bottlenecks. Unfortunately, since the code control flow is on the CPU side, the time-costly transfers between the host and the video card are inevitable. Therefore, keeping data transfers to a minimum is one of the crucial concerns.
IV-B General Tricks
IV-B1 Vectorization
Although GBNN is a recurrent model, conceptually we can treat it as a layered network nevertheless. We regard each iteration as one layer, so that the number of layers can grow as large as the network needs to converge. Let t_max denote the total number of iterations to be run. The only two constraints to be satisfied are that all layers share the same weight matrix and that the number of layers is not fixed in advance, growing until the network converges or t_max is reached.
The benefit is that we can borrow the matrix notation from layered networks, which is more efficient in the parallel implementation. We map the original clustered structure into a flat space, where neuron n_{ij} becomes neuron k = (i − 1)l + j, with k ranging from 1 to n = cl. Then Eq. (1) and Eq. (2) can be rewritten as
W(k, k′) = w_{(ij)(i′j′)},  (8)

v_k(t) = v_{ij}(t),  (9)

with k = (i − 1)l + j and k′ = (i′ − 1)l + j′.
We consider the edge weights as elements of an n × n matrix W, and the neuron potentials as elements of a vector v(t) of length n. Taking into account the reinforcement factor γ, we can rewrite Eq. (3) as
s(t) = W′ v(t),  (10)
with W′ = W + γI being a symmetric matrix whose diagonal elements are all equal to γ and whose off-diagonal elements are all binary valued; i.e., W′(k, k′) = γ if k = k′, and W′(k, k′) = W(k, k′) otherwise.
Thus, the score equation (10) is a matrix-vector product, which is computed efficiently in parallel on a GPU.
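In matrix form, the score computation is an ordinary matrix-vector product. A tiny pure-Python sketch (the weight matrix here is hypothetical; on the GPU this product is performed by an optimized library routine):

```python
gamma, n = 1, 4
W = [[0, 0, 1, 0],      # hypothetical binary weight matrix (zero diagonal)
     [0, 0, 0, 1],
     [1, 0, 0, 0],
     [0, 1, 0, 0]]

# W' = W + gamma * I, as defined below Eq. (10)
Wp = [[W[a][b] + (gamma if a == b else 0) for b in range(n)]
      for a in range(n)]

def matvec(M, v):
    return [sum(M[r][k] * v[k] for k in range(n)) for r in range(n)]

v = [1, 0, 1, 0]
s = matvec(Wp, v)   # score vector s(t) = W' v(t), Eq. (10)
print(s)            # → [2, 0, 2, 0]
```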
IV-B2 Batch Retrieval
A straightforward extension to vectorization is to bundle and process q probes simultaneously. To do so, we collect the q test messages into a value matrix V(t) = [v_1(t), v_2(t), ..., v_q(t)], with each column being a value vector as in Eq. (10), so that Eq. (10) becomes

S(t) = W′ V(t).  (11)
Instead of retrieving messages one after another, we aggregate q messages together and feed them into the GPU card in one shot. Speedups are achieved using this approach because it allows us to exploit the SIMD nature of GPUs. It is also more efficient to perform one large I/O transfer over the bus rather than multiple smaller transfers.
Batch retrieval arises naturally in applications where simultaneous retrievals are preferred. For instance, in face recognition, an associative memory can be used to recognize face features even when areas are obstructed by sunglasses or a scarf. If we treat each image as a single message, the hardware requirement is simply prohibitive: even a moderately sized grayscale image would require an impractically large adjacency matrix. Alternatively, we can divide the image into smaller patches, treat each patch as a different message, and process them in parallel. For another example, consider a network anomaly detection algorithm where we are given a batch of IP addresses, and we would like to check whether each belongs to a predefined blacklist. In Section VII below, we will refer to Eq. (11) as parallel decoding and Eq. (10) as serial decoding.
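Batch retrieval simply stacks the probe vectors as columns, turning q matrix-vector products into a single matrix-matrix product; a minimal sketch with a hypothetical W′ and two probes:

```python
def matmul(A, B):
    return [[sum(A[r][k] * B[k][col] for k in range(len(B)))
             for col in range(len(B[0]))] for r in range(len(A))]

Wp = [[1, 0, 1],        # hypothetical W' for n = 3 neurons
      [0, 1, 0],
      [1, 0, 1]]
V = [[1, 0],            # q = 2 probes, one per column
     [0, 1],
     [1, 0]]

S = matmul(Wp, V)       # Eq. (11): column k is the score vector of probe k
print(S)                # → [[2, 0], [0, 1], [2, 0]]
```

Each column of S equals the serial result of Eq. (10) for the corresponding probe; the batched product just computes all of them in one pass.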
IV-B3 Reduction
Reduction refers to an operation that aggregates a vector of elements into a scalar (e.g., sum, max and min). In sum-of-sum, the max operation is needed when evaluating Eq. (4) to determine which neurons remain active in the next iteration. In both rules, when deciding whether or not the retrieval procedure has converged, we need to compare two long vectors v(t) and v(t + 1) of length n, and test if all of the neuron values stay unchanged. This reduction operation can be done in an efficient manner as illustrated in Fig. 6, where we invoke n/2 threads in the first step, afterwards halving the number of threads in every successive step. The time complexity thus decreases from O(n) to O(log n).
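The halving pattern of Fig. 6 can be mimicked sequentially: each pass of the loop below corresponds to one round of parallel threads, so a length-n reduction takes O(log n) rounds (a sketch of our own; the CUDA version runs each round concurrently):

```python
def tree_reduce(values, op):
    """Reduce `values` with the binary operation `op`, halving each round."""
    vals = list(values)
    rounds = 0
    while len(vals) > 1:
        half = (len(vals) + 1) // 2   # pair element k with element k + half
        vals = [op(vals[k], vals[k + half]) if k + half < len(vals)
                else vals[k]
                for k in range(half)]
        rounds += 1
    return vals[0], rounds

print(tree_reduce(range(16), lambda a, b: a + b))  # → (120, 4)
print(tree_reduce([3, 1, 7, 2], max))              # → (7, 2)
```

Convergence testing reduces the element-wise comparison of v(t) and v(t + 1) with a logical AND in exactly this fashion.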
IV-B4 Sparsification
Memory access is an expensive operation on the video card, where both reading from and writing to memory cells are much slower than on the host CPU. In order to combat this inefficiency, we can reduce the number of memory accesses by accounting for the fact that GBNN is actually a sparse network; i.e., for a given message, ideally only one neuron should be activated in each cluster. Typically, the network structure should also be sparse, so we could implement both W′ and V(t) as sparse matrices using a compressed format, where we only record the nonzero elements and their coordinates. Then evaluating Eq. (3) and Eq. (6) requires many fewer terms. However, the compressed format does not lead to a significant performance gain for both rules: sum-of-max benefits from the sparsification, while sum-of-sum does not. The reason is that the dense matrix product in Eq. (11) for sum-of-sum is an optimized operation on the GPU, whereas the compressed format deviates from the optimized pattern. Moreover, since V(t) changes from one iteration to the next, it is not economical to implement V(t) using a compressed format either. On the contrary, W′ is fixed at the retrieval stage. We therefore use a sparse matrix representation of W′ only for sum-of-max. Detailed numerical results are presented in Section VII.
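The compressed representation can be sketched as a per-column list of nonzero row indices (a simplified CSC-like layout of our own; the actual GPU format may differ), so that scoring walks only the stored entries:

```python
dense = [[0, 1, 0, 0],          # hypothetical sparse weight matrix
         [1, 0, 0, 1],
         [0, 0, 0, 0],
         [0, 1, 0, 0]]
n = 4

# compressed format: for each column, record only the nonzero row indices
cols = [[r for r in range(n) if dense[r][c] == 1] for c in range(n)]

def score(v):
    """Accumulate signals by touching only the stored (nonzero) entries."""
    s = [0] * n
    for c in range(n):
        if v[c]:                 # only active neurons emit signals
            for r in cols[c]:
                s[r] += 1
    return s

print(score([1, 1, 0, 0]))  # → [1, 1, 0, 1], equal to the dense product
```

With one active neuron per cluster, the number of entries visited is proportional to the number of stored edges rather than to n², which is where sum-of-max gains.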
IV-C Accelerating the sum-of-sum Rule
The pseudocode for the sum-of-sum procedure is given in Algorithm 1. It requires as inputs the maximum number of iterations permitted t_max, the weight matrix W′ with all of the clique structures preserved during the storing stage, and the message matrix V(0), with the k-th column being the value vector for test message k and the erased clusters deactivated. On Line 4, S(t) is the score matrix for iteration t, where the k-th column is the score vector of length n for test message k. On Line 5, the kernel function takes S(t) as input and essentially produces V(t + 1) by evaluating Eq. (4).
The first two columns of S(t) are drawn in Fig. 7. In this particular example, each message can be divided into c clusters. In our implementation, a dedicated thread processes one cluster, finding the maximum value in that cluster, and then keeping the neurons that reach the maximum value activated. Assuming that there are q messages to be recovered, a total of cq threads are used. The retrieval procedure terminates when either the network converges or it reaches the maximum number of iterations permitted.
IV-D Accelerating the sum-of-max Rule
The pseudocode for sum-of-max is almost the same as Algorithm 1, except that Lines 4 and 5 are replaced by another kernel function, illustrated in Fig. 8. In order to better explain the concept, the serial decoding of a single message is presented here, where as many threads are needed as there are neurons in the network, i.e., n threads. The extension to the parallel decoding scheme of q bundled messages is straightforward, where nq threads are needed.
We do not follow Eq. (6) and Eq. (7) strictly to evaluate the max function. Instead, we apply an alternative procedure. Essentially, we check if a neuron receives signals from every cluster; hence, for neuron k, the k-th row of W′ and the vector v(t) require examination. Since W′ is symmetric and the memory storage on the GPU is column major, we check the k-th column of W′ instead to make the computation more pleasant. To update a neuron value v_k(t), a dedicated thread is required, scanning through both v(t) and the k-th column of W′.
Thread k loops through the clusters i, from 1 to c.

For any positive γ, if neuron k belongs to cluster i, i.e., ⌊(k − 1)/l⌋ + 1 = i (⌊·⌋ is the standard floor operator), we directly add the self-loop contribution, setting s(k) ← s(k) + γ v_k(t).

Otherwise, we check the neurons within cluster i, i.e., W′(k′, k) and v_{k′}(t), where k′ goes from (i − 1)l + 1 to il. The first time we encounter W′(k′, k) = 1 and v_{k′}(t) = 1, we set s(k) ← s(k) + 1, and proceed to the next cluster without further investigation.

If cluster i does not contribute any signal to neuron k, i.e., s(k) does not change, we stop right away without checking the following clusters.
We call this procedure bail-out-early and favor it over a direct evaluation of Eq. (6) and Eq. (7) for two reasons:

It explicitly clarifies the requirement that every cluster should contribute one and only one signal.

It proceeds to subsequent clusters or stops processing as quickly as possible so that further expensive memory accesses are avoided.
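The steps above can be sketched in pure Python (one loop iteration per would-be thread; γ fixed to 1, on a toy one-message network of our own; names are illustrative):

```python
c, l = 3, 2                 # toy sizes; gamma is fixed to 1
n = c * l

# store the single message m = (0, 1, 0) as a clique
active = [i * l + s for i, s in enumerate((0, 1, 0))]
W = [[0] * n for _ in range(n)]
for a in active:
    for b in active:
        if a // l != b // l:
            W[a][b] = 1

def bail_out_early(v):
    v_next = [0] * n
    for k in range(n):                  # one dedicated thread per neuron
        score = 0
        bailed = False
        for i in range(c):              # loop through the clusters
            if k // l == i:             # own cluster: add the self loop
                score += v[k]           # gamma = 1
                continue
            got = 0
            for kk in range(i * l, (i + 1) * l):
                if W[kk][k] == 1 and v[kk] == 1:
                    got = 1             # at most one signal per cluster
                    break               # proceed to the next cluster
            if got == 0:
                bailed = True           # this cluster contributes nothing
                break                   # stop without checking the rest
            score += 1
        if not bailed and score == c:   # c - 1 signals plus the self loop
            v_next[k] = 1
    return v_next

probe = [1, 1, 0, 1, 1, 0]     # erased cluster 0 initialized all-active
print(bail_out_early(probe))   # → [1, 0, 0, 1, 1, 0], same as sum-of-max
```

On this toy network the output matches the sum-of-max update, illustrating the equivalence claimed in Theorem 1 below.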
Theorem 1.
The bail-out-early approach is equivalent to sum-of-max, i.e., for any positive γ, given the same W′ and v(t), bail-out-early produces the same v(t + 1) as sum-of-max.
Proof:
For the cluster containing neuron k, there is only one possibly nonzero contribution, namely the self loop with weight γ, since by design, within the same cluster, a neuron can only receive contributions from itself. bail-out-early directly adds this self-loop contribution γ v_k(t), which, for the activation test of Eq. (7), makes any positive γ equivalent to γ = 1.
For the other clusters, the per-cluster contribution max_{j′} w_{(i′j′)(ij)} v_{i′j′}(t) is either 0 or 1, depending on whether or not a connection exists between n_{ij} and an activated neuron of cluster i′. Notice that in either case, v(t) is always a binary vector.
For a neuron to be activated, it needs to receive c − 1 signals from the other clusters, plus the contribution γ from itself. Since the particular positive value of γ does not affect the dynamics, due to Eq. (6) and Eq. (7), we deliberately set γ = 1. The weight matrix W′ thus becomes binary valued, which can be implemented more efficiently. ∎
V Properties
In this section, we discuss properties of GBNNs and the two retrieval rules introduced in previous sections. We illustrate these properties via examples and theoretical claims.
V-A The sum-of-sum Rule May Oscillate
We first give an example which illustrates that sum-of-sum may oscillate. Consider a small network in which several messages are stored, all of which match the non-erased part of a given test message. In such a scenario, we expect that the retrieval rule either returns an arbitrary stored message which matches the input, or returns all of the stored messages matching the input. Unfortunately, sum-of-sum does not converge to any output. After constructing the network and initializing the neurons of the erased clusters of the test message to be deactivated, it is easy to verify that, in the first iteration, every cluster activates its own neurons with the most signals, and more than one neuron per cluster can attain the cluster maximum, so that several spurious neurons are activated in v(1). The scores then change again in the next iteration, so that v(2) ≠ v(1), and the network does not converge, oscillating between v(1) and v(2) forever.
There is another level of complication: the reinforcement factor γ plays a delicate role in the retrieval procedure. If we increase γ, then the network in the example above converges. However, we will see in Section VII below that enlarging γ leads to a worse retrieval rate in general.
V-B The sum-of-max Rule Converges
We now show that sum-of-max (bail-out-early) always converges when all the neurons in erased clusters are initialized to be activated.
Lemma 1.
For sum-of-max, once deactivated, a neuron stays deactivated forever, i.e., if v_{ij}(t) = 0 then v_{ij}(t + 1) = 0.
Proof:
Recall, from Eq. (7), that v_{ij}(t + 1) = 1 if and only if s_{ij}(t) = c − 1 + γ. Assume in iteration t that n_{ij} is deactivated, i.e., v_{ij}(t) = 0. Then its self loop contributes nothing to s_{ij}(t). Since the only possible contribution a neuron might obtain from its own cluster is the self loop, s_{ij}(t) ≤ c − 1 < c − 1 + γ, and thus v_{ij}(t + 1) = 0. ∎
Lemma 2.
A clique is stable, i.e., once all neurons of a clique are activated, they stay activated forever.
Proof:
Once all neurons of a clique are activated, each of them receives a signal from the clique's neuron in every other cluster, and thus satisfies the activation condition of Eq. (7) in the next iteration. By induction, the clique stays activated forever. ∎
Lemma 3.
Given a partially erased message, sum-of-max always converges to a state which contains an ensemble of cliques.
Proof:
As each previously stored message corresponds to a clique, a partially erased message corresponds to part of a clique, with the neurons in the non-erased clusters activated. sum-of-max initializes all the neurons in the missing clusters to be activated. Therefore, the already activated neurons in non-erased clusters will receive contributions from the missing clusters, staying activated in the next iteration. The neurons in the missing clusters which, together with the already activated neurons in non-erased clusters, form a clique will also receive exactly signals and will stay activated in the next iteration. By Lemma 2, the ensemble of these cliques will be present in the converged state. ∎
Theorem 2.
Given a partially erased message, if a neuron is the only one activated in its cluster, it remains activated, i.e., for a given cluster , if there exists an such that and for all , , then .
Proof:
Suppose, to arrive at a contradiction, that at some point cluster has no neuron activated. Then the other clusters will not receive any signal from cluster . By Eq. (7), every neuron throughout the network will be deactivated in the next iteration. By Lemma 1, the network then stays in this all-deactivated state forever, which violates Lemma 3. Therefore, if a neuron is the only one activated in its cluster, it remains activated. ∎
Theorem 3.
For any given probe pattern, sum-of-max always converges.
Proof:
For a partially erased message, the theorem follows directly from Lemma 3.
Now consider an input probe in which some parts of a previously stored message are modified (corrupted). If the probe can still be explained by a clique in the network, the memory converges to this clique by Lemma 2. If the probe cannot be explained by any clique in the network, the activated neurons in the unchanged clusters cannot receive signals from the corrupted clusters. Hence, by Eq. (7), the memory converges to the all-deactivated state. ∎
Since bail-out-early is equivalent to sum-of-max (see Theorem 1), we also have the following.
Corollary 1.
For any given probe pattern, bail-out-early always converges.
It is worth emphasizing that, by Lemma 3, sum-of-max converges to a state which contains an ensemble of cliques. We can randomly choose one of them as the reconstructed message.
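For contrast with sum-of-sum, a minimal sketch of one sum-of-max iteration (function and variable names are illustrative; the activation condition follows the "one signal from every other cluster" reading of Eq. (7)):

```python
import numpy as np

def sum_of_max_step(W, v, clusters):
    """One sum-of-max iteration: a neuron stays active only if it is
    currently active AND receives at least one signal from every other
    cluster; multiple active neighbors in one cluster count only once."""
    n_clusters = len(clusters)
    v_next = np.zeros_like(v)
    for cluster in clusters:
        for i in cluster:
            if not v[i]:
                continue  # Lemma 1: a deactivated neuron never reactivates
            support = sum(
                1 for other in clusters
                if other is not cluster
                and any(W[i, j] and v[j] for j in other))
            if support == n_clusters - 1:
                v_next[i] = 1
    return v_next
```

Since a neuron can only lose support and never regain activation, the set of active neurons shrinks monotonically, which is the crux of the convergence argument above.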
VI Joint Retrieval Rule
VI-A Proposal
We have just seen that sum-of-sum is not guaranteed to converge, whereas sum-of-max is. In Section VII below we will see that sum-of-sum is generally much faster than sum-of-max, but the accuracy of sum-of-max is much better when either the number of stored messages or the number of erased symbols increases. It is natural to ask whether we can merge the two rules to obtain a fast and accurate reconstruction scheme.
In this section we propose such a hybrid retrieval scheme which combines the best aspects of both procedures. The pseudocode for the joint decoding scheme is given in Algorithm 2. Essentially, this decoding algorithm performs one refined iteration of sum-of-sum followed by subsequent, optimized iterations of bail-out-early until a convergence criterion is satisfied.
VI-B Justification and Convergence
As mentioned in Section IV, memory access is extremely expensive on GPUs compared to the host CPU. Therefore, it is of vital importance to eliminate any unnecessary memory operations. We notice that Lemma 1 and Theorem 2 have crucial implications for designing our new scheme. The former suggests that if then there is no need to loop through two long vectors of length , i.e., and the th column of , since we will have . Thus, we only need to focus on updating those for which . In this sense, the currently active neurons can be considered a candidate pool to be investigated further in the next iteration. The latter suggests that clusters with only one active neuron (including those which are not erased in the test message) will not change during decoding. Hence, we only update neurons in erased clusters that have not yet reached convergence. In general, this notion of “freezing good clusters” can also be justified as preventing good neurons from being altered undesirably by any retrieval procedure.
One final but important consideration is the all-activated initialization scheme. Although it is crucial for the correctness of the sum-of-max rule, it also introduces too many candidates from the beginning. We will show a motivating example later in Section VII-E. Fortunately, sum-of-sum can help us bypass this particular problem.
Theorem 4.
The first iteration of sum-of-sum affects neither the correctness of sum-of-max nor its convergence.
Proof:
For correctness, let us revisit sum-of-sum. The only flaw making sum-of-sum inferior to sum-of-max is that, during the retrieval procedure, as in Fig. 4, multiple neurons can be activated simultaneously in one cluster without regulation, which in turn propagates errors or even causes oscillation. However, if during the initialization phase we deactivate all the neurons in erased clusters, preserving good clusters only, then by definition there will be at most one activated neuron per cluster, and the aforementioned flaw no longer exists.
For convergence, recall the clique structure in Fig. 3. For a given message with clusters erased, there are good neurons transmitting signals. Therefore, the desired neuron in the erased clusters should receive exactly signals. After one iteration of sum-of-sum, we keep only those neurons with signals in the candidate pool. The sole effect of the first sum-of-sum iteration is to shrink the pool size, leaving convergence untouched. ∎
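Under the reading above, the first pass of the joint scheme might be sketched as follows (function and variable names are our own): instead of activating everything in the erased clusters, keep as candidates only the neurons supported by every good cluster.

```python
def initial_candidate_pool(W, probe, clusters, erased):
    """First pass of the joint scheme (a sketch): keep as candidates
    only those neurons of the erased clusters that receive a signal
    from every non-erased (good) cluster of the probe."""
    good = [c for k, c in enumerate(clusters) if k not in erased]
    pool = {}
    for k in erased:
        pool[k] = [i for i in clusters[k]
                   if all(any(W[i][j] and probe[j] for j in g) for g in good)]
    return pool
```

The subsequent sum-of-max iterations then only have to examine this (typically tiny) pool rather than all neurons of each erased cluster.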
Ideally, there should be only one such neuron per erased cluster in the candidate pool, rather than candidates for sum-of-max, with two exceptional cases.

There are two memorized messages and which differ only in the erased clusters, e.g., we have and , and the test message is . In this case, both neurons and in the erased cluster will be present in the pool.

Spurious cliques. While storing messages, distinct cliques may overlap and produce a spurious clique that does not correspond to any stored message, where different edges of the clique were added for different stored messages. In other words, a stored message corresponds to a clique, but not vice versa.
We argue that for a relatively large network and a reasonable number of messages to memorize, the candidate pool size is sufficiently small.
Corollary 2.
The new joint retrieval scheme always converges.
Proof:
By Theorem 4, the first sum-of-sum iteration merely shrinks the candidate pool, and by Corollary 1 the subsequent bail-out-early iterations always converge. ∎
Combining all these factors, we propose the joint scheme as in Algorithm 2.
VII Experiments
In this section, we compare sum-of-sum and sum-of-max using the different acceleration approaches discussed previously in Section IV and Section VI. We show that a significant performance gain is achieved in terms of running time after parallelizing the retrieval procedure and applying our new joint retrieval scheme.
All the CPU experiments are executed on a GHz AMD Phenom(tm) 9950 Quad-Core processor with GB of RAM, and all the GPU experiments are executed on an NVIDIA C1060 card, which runs at a frequency of GHz with GB of memory and has stream multiprocessors. In order to make as fair a comparison as possible, our CPU code makes use of the Armadillo library [37], linked against BLAS and LAPACK, for optimized linear algebra operations.
VII-A sum-of-sum versus sum-of-max
First, we compare sum-of-sum and sum-of-max. In this experiment, we have clusters with neurons each, and the reinforcement factor . We randomly generate and store messages, each consisting of symbols uniformly sampled from the integers to . After the storing stage, we randomly select out of the stored messages, erase some parts of them, and try to retrieve the erased messages from the GBNN associative memory. We refer to this experiment setting as Scenario . Since sum-of-sum does not necessarily converge, we set the maximum number of iterations to . We vary the number of erased clusters and plot the retrieval rate, i.e., the fraction of successfully retrieved messages, in Fig. 9.
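For reference, storing a batch of messages as cliques can be sketched in a few lines (the symbol-to-neuron mapping below is our own illustrative convention, not necessarily the paper's):

```python
def store_messages(messages, c, l):
    """Store each length-c message over an alphabet of size l as a clique:
    symbol m[k] of cluster k maps to neuron k * l + m[k], and every pair
    of a message's neurons is connected with a binary weight."""
    n = c * l
    W = [[0] * n for _ in range(n)]     # binary weight matrix
    for m in messages:
        idx = [k * l + s for k, s in enumerate(m)]
        for a in idx:
            for b in idx:
                if a != b:              # no cross-pair is skipped
                    W[a][b] = 1
    return W
```

Storage is thus a one-shot, order-independent operation: adding a message only ever turns weights on, never off.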
Observe that when the number of erased clusters is relatively small ( erased clusters), both rules perform equally well, above . As the number of erased clusters increases, although both rules make more errors, the performance of sum-of-sum degrades more drastically than that of sum-of-max. When out of clusters are erased, sum-of-sum can only recover slightly above of the messages, while sum-of-max still recovers over . If clusters are erased, sum-of-max is still able to retrieve over , which is significantly higher than sum-of-sum.
VII-B Influence of
Second, we explore the subtle dependence of the retrieval rate on the value of the reinforcement factor used in sum-of-sum. We plot the trend for different in Fig. 10 using the same experiment Scenario as above. In general, increasing hurts the retrieval rate, with the only exception of , which suggests that can be used as a default value.
VII-C CPU versus GPU
Next, we consider the improvements in runtime achieved by running both rules on a GPU versus on a CPU. A larger network is simulated in this case. We have clusters with neurons each, out of which clusters are erased. We generate and store random messages, and we use a random subset of of these to test. We refer to this experiment setting as Scenario . The runtimes, in seconds, of both parallel decoding (i.e., decoding a batch of messages concurrently) and serial decoding (i.e., decoding messages one after another) on both GPU and CPU are shown in Fig. 11.
We make three observations. First, for each CPU versus GPU and parallel versus serial decoding configuration, sum-of-max is always significantly slower than sum-of-sum. For now, let us keep in mind that the fastest retrieval configuration of this entire experiment is roughly seconds, for sum-of-sum parallel decoding on a GPU. We have previously seen that sum-of-max leads to much better retrieval accuracy, so below we focus on achieving the accuracy of the sum-of-max method while improving its runtime.
Second, in each group, the bars at the 1st and 3rd locations are results for the CPU implementation, and the 2nd and 4th bars show results for the GPU implementation. Comparing each adjacent pair, we see that the GPU versions consistently run much faster than the CPU versions, as expected. The GPU accelerations without any further optimization are respectively (from left to right) , , and faster.
Finally, parallel decoding is faster than serial decoding on the GPU, while the situation is reversed on the CPU. This is reasonable, since parallel decoding can take full advantage of the GPU’s computing power. However, in the CPU case, if we consider a bundle of messages, even if only one message does not converge, all messages will be updated. With serial decoding, on the other hand, the retrieval rule stops as soon as each individual message converges.
VII-D Further Accelerating the sum-of-max Rule
In Fig. 12 we show the effect of applying the different techniques discussed in Sections IV and V to accelerate the sum-of-max rule on a GPU. Although all of the techniques combined reduce the runtime eightfold, from roughly seconds to seconds, the sum-of-max rule still cannot compete with sum-of-sum’s -second record, which is highlighted in yellow and bold font in the figure. However, the proposed joint scheme cuts the record by another two thirds, achieving the fastest runtime of only seconds for Scenario .
In Fig. 11, the faster configuration for sum-of-max on the CPU is the serial decoding scheme; compared to it, our joint scheme achieves a speedup while retaining the decoding accuracy.
VII-E Motivation for Combining Rules
Here we provide an example to better illustrate why the joint scheme achieves a significant speedup. We again use Scenario and apply all of the acceleration techniques discussed in Section IV. We initialize the matrix according to the vanilla sum-of-max rule, so that all neurons in clusters corresponding to erased symbols are activated, and only one neuron within each cluster corresponding to a non-erased symbol is active. Fig. 13 depicts a typical run of the experiment. Fig. 13(a) shows the total runtime spent in each iteration of the sum-of-max decoding. One observation is that every iteration requires less time than the previous one, due to the application of Lemma 1 and Theorem 2; otherwise, the runtime of each iteration would be roughly the same. Another observation is that the majority of the total runtime is spent in the 1st iteration; this occurs because initially there are too many unnecessary active neurons in erased clusters, and sum-of-max takes time to process each one of them. Fig. 13(b) shows the number of test messages (out of ) which have converged after each iteration.
VII-F New Joint Scheme
Finally, we demonstrate the behavior of the joint decoding scheme across a range of experimental settings. Fig. 14 shows the runtime (in seconds) and retrieval rate compared with sum-of-sum and sum-of-max for both Scenarios and , while varying the number of erased symbols. The spikes in runtime for sum-of-max and for the joint scheme in Fig. 14(a) are due to the fact that decoding becomes more difficult as the number of erased clusters increases, so more iterations are required in these cases. In these settings ( out of clusters erased for Scenario , and out of clusters erased for Scenario ), although the sum-of-sum rule is only slightly faster than sum-of-max and the joint scheme, its retrieval rate is significantly lower. Another reason that sum-of-sum runs faster here is the limit on the number of iterations which we impose in our experiments. Note that increasing this limit does not improve the retrieval rate, but it can make the runtime arbitrarily worse because sum-of-sum oscillates. Also observe that in both Fig. 14(b) and Fig. 14(d), the retrieval rates of sum-of-max and the joint scheme are identical. In Fig. 14(d), all three approaches effectively achieve a retrieval rate for up to erased clusters. This is because the number of messages stored () is relatively small for this network. If this number increases, the gap in retrieval rate between the joint scheme (as well as sum-of-max) and sum-of-sum becomes more pronounced. We conclude from Fig. 14 that the joint retrieval scheme combines the benefits of both existing rules, achieving fast decoding while maintaining a high retrieval rate.
VII-G Correlated vs. Uncorrelated Messages
The experiments above involve storing and retrieving random messages which are generated uniformly and independently. If messages are correlated and the GBNN architecture is unchanged, we expect the performance to degrade in comparison to the results reported above. Modifying GBNNs to accommodate correlated messages is an area of ongoing investigation. As mentioned in Section II, one approach is to use a “grandmother layer”, similar to [34]. Another approach is to append each message with i.i.d. uniform random bits. This requires additional neurons, but reduces the correlation.
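The second approach can be sketched in a few lines (the function and parameter names are ours; `extra` controls how many neurons are traded for reduced correlation):

```python
import random

def decorrelate(message, extra, l):
    """Append `extra` i.i.d. symbols drawn uniformly from {0, ..., l - 1}
    to a message. The appended part is independent across messages,
    so the padded messages are less correlated overall."""
    return tuple(message) + tuple(random.randrange(l) for _ in range(extra))
```

Each appended symbol requires one extra cluster of neurons, so this is a direct trade of network size for robustness to correlated inputs.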
VIII Summary
In this work, we present optimized implementations of the Gripon-Berrou neural network associative memory on a GPU. We analyze two existing retrieval rules, namely sum-of-sum and sum-of-max. We show that sum-of-sum may lead to network oscillation, whereas we prove the convergence of sum-of-max. To achieve the full speedup, we combine the two rules and propose a hybrid scheme, minimizing unnecessary computation. The experimental results show a substantial acceleration over a CPU implementation that uses an optimized linear algebra library.
GBNNs embrace an LDPC-like sparse encoding setup, which makes the network extremely resilient to noise and errors. As associative memories serve as building blocks for many machine learning algorithms, we hope the parallel scheme proposed here can help pave the path to more widespread adoption of large-scale associative memory applications.
An obvious piece of future work is to extend GBNNs to deal with correlated messages, which severely impair the retrieval performance; the structure of GBNNs seems promising in this direction. We will also try to develop other retrieval schemes, e.g., to handle corrupted patterns as well as incomplete probes. Since sum-of-sum runs orders of magnitude faster, another sensible topic is to emulate sum-of-max using sum-of-sum so that both accuracy and speed can be retained simultaneously. We may also seek ways to generalize GBNNs and to extend the sparse neural network’s use to tasks other than associative memory, e.g., classification and regression.
References

[1] S. Rajasekaran and G. Pai, Neural Networks, Fuzzy Logic and Genetic Algorithms: Synthesis and Applications. PHI Learning Pvt. Ltd., 2004.
[2] J. A. Anderson and E. Rosenfeld, Neurocomputing: Foundations of Research, ser. Bradford Books. MIT Press, 1988, vol. 1.
[3] J. A. Anderson, A. Pellionisz, and E. Rosenfeld, Neurocomputing 2: Directions of Research, ser. Bradford Books. MIT Press, 1993, vol. 2.
[4] B. Kosko, “Bidirectional associative memories,” IEEE Transactions on Systems, Man and Cybernetics, vol. 18, no. 1, pp. 49–60, 1988.
[5] D. J. Willshaw, O. P. Buneman, and H. C. Longuet-Higgins, “Non-holographic associative memory,” Nature, vol. 222, pp. 960–962, 1969.
[6] J. Hopfield, “Neural networks and physical systems with emergent collective computational abilities,” Proceedings of the National Academy of Sciences, vol. 79, no. 8, pp. 2554–2558, 1982.
[7] ——, “Neurons with graded response have collective computational properties like those of two-state neurons,” Proceedings of the National Academy of Sciences, vol. 81, no. 10, pp. 3088–3092, 1984.
[8] S. Kaxiras and G. Keramidas, “IPStash: a set-associative memory approach for efficient IP-lookup,” in Proc. 24th Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM), vol. 2, Miami, FL, USA, 2005, pp. 992–1001.
[9] M. E. Valle, “A class of sparsely connected autoassociative morphological memories for large color images,” IEEE Transactions on Neural Networks, vol. 20, no. 6, pp. 1045–1050, 2009.
[10] C. S. Lin, D. C. P. Smith, and J. M. Smith, “The design of a rotating associative memory for relational database applications,” ACM Transactions on Database Systems, vol. 1, no. 1, pp. 53–65, March 1976.
[11] L. Bu and J. Chandy, “FPGA based network intrusion detection using content addressable memories,” in IEEE Symposium on Field-Programmable Custom Computing Machines, April 2004, pp. 316–317.
[12] K.-J. Lin and C.-W. Wu, “A low-power CAM design for LZ data compression,” IEEE Transactions on Computers, vol. 49, no. 10, pp. 1139–1145, October 2000.
[13] H. Zhang, B. Zhang, W. Huang, and Q. Tian, “Gabor wavelet associative memory for face recognition,” IEEE Transactions on Neural Networks, vol. 16, no. 1, pp. 275–278, 2005.
[14] A. Jain, J. Mao, and K. Mohiuddin, “Artificial neural networks: A tutorial,” Computer, vol. 29, no. 3, pp. 31–44, March 1996.
[15] D. Willshaw, “Models of distributed associative memory,” Ph.D. dissertation, Edinburgh University, 1971.
[16] J. Buckingham and D. Willshaw, “On setting unit thresholds in an incompletely connected associative net,” Network: Computation in Neural Systems, vol. 4, no. 4, pp. 441–459, 1993.
[17] G. Palm, “On associative memory,” Biological Cybernetics, vol. 36, no. 1, pp. 19–31, 1980.
[18] H. Bosch and F. J. Kurfess, “Information storage capacity of incompletely connected associative memories,” Neural Networks, vol. 11, no. 5, pp. 869–876, 1998.
[19] D. J. Amit, H. Gutfreund, and H. Sompolinsky, “Statistical mechanics of neural networks near saturation,” Annals of Physics, vol. 173, no. 1, pp. 30–67, 1987.
[20] G. Palm, “Neural associative memories and sparse coding,” Neural Networks, vol. 37, pp. 165–171, January 2013.
[21] V. Gripon and C. Berrou, “A simple and efficient way to store many messages using neural cliques,” in IEEE Symposium on Computational Intelligence, Cognitive Algorithms, Mind, and Brain (CCMB), Paris, France, 2011, pp. 1–5.
[22] ——, “Sparse neural networks with large learning diversity,” IEEE Transactions on Neural Networks, vol. 22, no. 7, pp. 1087–1096, 2011.
[23] C. Berrou and V. Gripon, “Coded Hopfield networks,” in International Symposium on Turbo Codes and Iterative Information Processing (ISTC), September 2010, pp. 1–5.
[24] V. Gripon and C. Berrou, “Nearly-optimal associative memories based on distributed constant weight codes,” in Information Theory and Applications Workshop (ITA), San Diego, CA, USA, 2012, pp. 269–273.
[25] X. Jiang, V. Gripon, and C. Berrou, “Learning long sequences in binary neural networks,” in International Conference on Advanced Cognitive Technologies and Applications, 2012, pp. 165–170.
[26] B. K. Aliabadi, C. Berrou, V. Gripon, and X. Jiang, “Learning sparse messages in networks of neural cliques,” ACM Computing Research Repository, 2012, arXiv:1208.4009v1 [cs.NE]. [Online]. Available: http://arxiv.org/abs/1208.4009v1
[27] H. Jarollahi, N. Onizawa, V. Gripon, and W. Gross, “Architecture and implementation of an associative memory using sparse clustered networks,” in IEEE International Symposium on Circuits and Systems (ISCAS), Seoul, Korea, 2012, pp. 2901–2904.
[28] B. Larras, C. Lahuec, M. Arzel, and F. Seguin, “Analog implementation of encoded neural networks,” in IEEE International Symposium on Circuits and Systems, Beijing, China, 2013, pp. 1–4.
[29] J. Austin, J. Kennedy, and K. Lees, “The advanced uncertain reasoning architecture, AURA,” RAM-based Neural Networks, ser. Progress in Neural Processing, vol. 9, pp. 43–50, 1998.
[30] G. Shim, D. Kim, and M. Choi, “Statistical-mechanical formulation of the Willshaw model with local inhibition,” Physical Review A, vol. 43, no. 12, p. 7012, 1991.
[31] R. G. Baraniuk, “Compressive sensing [lecture notes],” IEEE Signal Processing Magazine, vol. 24, no. 4, pp. 118–121, 2007.
[32] S. Ganguli and H. Sompolinsky, “Compressed sensing, sparsity, and dimensionality in neuronal information processing and data analysis,” Annual Review of Neuroscience, vol. 35, pp. 485–508, 2012.
[33] D. Golomb, N. Rubin, and H. Sompolinsky, “Willshaw model: Associative memory with sparse coding and low firing rates,” Physical Review A, vol. 41, no. 4, p. 1843, 1990.
[34] A. Knoblauch, “Neural associative memory for brain modeling and information retrieval,” Information Processing Letters, vol. 95, no. 6, pp. 537–544, 2005.
[35] D. B. Kirk and W.-m. W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann, 2010.
[36] M. Flynn, “Some computer organizations and their effectiveness,” IEEE Transactions on Computers, vol. C-21, no. 9, pp. 948–960, 1972.
[37] C. Sanderson, “Armadillo: An open source C++ linear algebra library for fast prototyping and computationally intensive experiments,” NICTA, Tech. Rep., 2010.