Improving Sparse Associative Memories by Escaping from Bogus Fixed Points

08/27/2013, by Zhe Yao et al., McGill University

The Gripon-Berrou neural network (GBNN) is a recently invented recurrent neural network embracing an LDPC-like sparse encoding setup, which makes it extremely resilient to noise and errors. A natural use of GBNN is as an associative memory. There are two activation rules for the neuron dynamics, namely sum-of-sum and sum-of-max. The latter outperforms the former in terms of retrieval rate by a huge margin. In prior discussions and experiments, it was believed that although sum-of-sum may lead the network to oscillate, sum-of-max always converges to an ensemble of neuron cliques corresponding to previously stored patterns. However, this is not entirely correct. In fact, sum-of-max often converges to bogus fixed points where the ensemble comprises only a small subset of the converged state. By taking advantage of this overlooked fact, we can greatly improve the retrieval rate. We discuss this particular issue and propose a number of heuristics to push sum-of-max beyond these bogus fixed points. To tackle the problem directly and completely, a novel post-processing algorithm is also developed and customized to the structure of GBNN. Experimental results show that the new algorithm achieves a huge performance boost in terms of both retrieval rate and run-time, compared to the standard sum-of-max and all the other heuristics.


I Introduction

I-A Background

Associative memories differ from conventional memory systems in that they do not require explicit addresses for the information we are interested in. They store paired patterns. When an associative memory is given an input pattern as a probe, the content of the input itself addresses the paired output pattern directly. The parallel nature of associative memories and their ability to perform pattern queries efficiently make them suitable for a variety of application domains. For instance, in communication networks [1], routers need to determine quickly the destination port of an incoming data frame based on its IP address. In signal and image processing [2], one often needs to match an incomplete or noisy version of the information against predefined templates. Database engines [3], anomaly detection systems [4], compression algorithms [5], face recognition systems [6] and many other machine learning tasks are all legitimate users of associative memories.

Among all different architectures to implement associative memories, the neural network is the most popular and widely adopted approach. Associative memories provide two operations: storing and retrieving (also known as decoding). In the storing operation, pairs of patterns are fed into the network, modifying the internal connections between neurons. An aggregated representation of the patterns stored thus far is obtained. In the retrieving operation, probes (input) are presented, which might be a noisy or incomplete version of some previously stored pattern, and the associative memory needs to retrieve the most relevant or associated pattern quickly and reliably.

Although early realizations of this approach (e.g., linear associators [7, 8] and Willshaw networks [9, 10]) date back to the 1960s, it was not until the seminal work of Hopfield [11, 12] in the 1980s that the neural network community started to catch on. For the history of and interesting developments in associative memories, see the recent survey [13] and the references therein. Quite recently, Gripon and Berrou proposed a new family of sparse neural networks for associative memories [14, 15], which we refer to as the Gripon-Berrou neural network (GBNN); in short, it is a variant of the Willshaw network with a partite cluster structure. GBNNs resemble the model proposed by Moopenn et al. [16], although the underlying motivations are different. Moopenn et al. choose the cluster structure mainly because the resulting dilute binary codes suppress the retrieval errors due to the appearance of spurious ones in electronic circuits, whereas GBNNs use the structure to provide an immediate mapping between patterns and activated neurons, so that efficient iterative algorithms can be developed. The cluster structure also permits GBNNs to consider “blurred” or “imprecise” inputs where some symbols are not exactly known. In a conventional Willshaw network, a global threshold is required to determine the neuron activities, which can be either the number of active neurons in the probe or the maximum number of signals accumulated in output neurons. For GBNNs, however, the global threshold is no longer needed, since each cluster can naturally decide for itself. In addition to Moopenn’s model, GBNNs also allow self excitations, as well as a new retrieval scheme, sum-of-max [17]; both help to increase the retrieval rate given incomplete patterns as probes. This manuscript focuses on an issue with the state-of-the-art retrieval rule (sum-of-max) for GBNNs. A brief description of GBNNs is given in Section II.

I-B Related Work

There are three important concepts describing the quality of an associative memory: diversity (the number of paired patterns that the network can store), capacity (the maximum amount of stored information in bits) and efficiency (the ratio between the capacity and the amount of information that the network can store when the diversity reaches its maximum). In [15], Gripon and Berrou have shown that, given the same amount of storage, GBNN outperforms the conventional Hopfield network on all three measures, while decreasing the retrieval error rate. The initial retrieval rule used in [15] was sum-of-sum. Later, in [17], the same authors also interpret GBNN using the formalism of error correcting codes, and propose a second retrieval rule, sum-of-max, which further decreases the error rate. We discuss the mechanics of both rules in Section II. Jiang et al. [18] modify GBNN to learn long sequences by incorporating directed links. Aliabadi et al. [19] extend GBNN to learn sparse messages.

Another line of research focuses on efficient implementations of GBNNs. Jarollahi et al. demonstrate a proof-of-concept implementation of sum-of-sum on a field programmable gate array (FPGA) in [20], though the network size is constrained to 400 neurons due to hardware limitations. The same authors implement sum-of-max in [21], which runs 1.9× faster than [20], since bitwise operations are used in place of a resource-demanding module required by sum-of-sum. In [22], the same group of authors also develop a content addressable memory using GBNNs which saves 90% of the energy consumption. Larras et al. [23] implement an analog version of the network which consumes less energy and is more efficient in both circuit area and speed than an equivalent digital circuit. However, the network size is even more constrained in the total number of neurons. After analyzing the convergence and computational properties of both sum-of-sum and sum-of-max, Yao et al. [24] propose a hybrid scheme and successfully implement GBNNs on a GPU; an acceleration of 900× is obtained without any loss of accuracy.

I-C Contributions

The state-of-the-art activation rule for GBNN is sum-of-max, which outperforms sum-of-sum in terms of successful retrieval rate by a large margin given incomplete pattern probes; see [17, 24]. In prior discussions and experiments, it was believed that sum-of-max always converges to an ensemble of neuron cliques corresponding to previously stored patterns. Lemma 3 in [24] proves that such an ensemble always exists in the final converged state. It is also argued in [24] that “We can randomly choose one of them (cliques) as the reconstructed message.” However, this random selection step was itself disregarded, and it in fact deserves additional attention.

The contributions of this work are threefold:

  1. We identify the bogus fixed point problem: the ensemble of neuron cliques comprises only a subset of the converged state in which sum-of-max gets trapped.

  2. We propose six different heuristics, pushing sum-of-max beyond the bogus fixed point, which helps to improve the retrieval rate.

  3. We develop a novel post-processing algorithm which involves finding a maximum clique in the network and improves both retrieval rate and run-time.

Although finding the maximum cliques in an arbitrary graph is a well known NP-hard problem [25], our algorithm can serve its purpose efficiently. This is accomplished by taking into account the special structure of GBNN; see Section II. Experimental results show that the new algorithm achieves a huge performance boost in terms of both retrieval rate and run-time, compared to the standard sum-of-max and all the other heuristics, which also indicates that the activation rule itself still has plenty of room for improvement.

I-D Paper Organization

The rest of this paper is structured as follows. Section II reviews the architecture of GBNN to set up the context and briefly explains why sum-of-max outperforms sum-of-sum in terms of retrieval rate. Section III describes the bogus fixed point problem after sum-of-max has converged. We show that, interestingly, the cause of this problem is also the reason that sum-of-max outperforms sum-of-sum, and we develop a number of heuristics in Section IV to push sum-of-max beyond the bogus fixed point. In Section V, we propose the clique finding algorithm as our post-processing step to fix the problem directly and completely. Section VI compares the different approaches proposed in this work numerically, and the paper concludes in Section VII.

II Gripon-Berrou Neural Networks

II-A Structure

The structure of GBNN [14] is closely related to the patterns it tries to store. A pattern or a message can be viewed as a tuple of symbols. Let us consider a message $m$ of length $c$ symbols, i.e., $m = (m_1, m_2, \dots, m_c)$. Each symbol takes a value from a finite alphabet of size $\ell$, i.e., $m_i \in \{1, 2, \dots, \ell\}$. To store messages of this kind, we use a network of $n = c\ell$ neurons, which comprises $c$ clusters containing $\ell$ neurons each. In this setup, cluster $i$ corresponds to the symbol $m_i$, and neuron $j$ within cluster $i$ corresponds to the specific value which $m_i$ takes.

GBNN is a binary valued neural network with a neuron’s state being either 0 (inactive) or 1 (active). We denote by $n_{ij}$ the $j$th neuron in cluster $i$. Therefore, to express a message $m$ in GBNN, if $m_i = j$, it is equivalent to setting $n_{ij} = 1$. Since a symbol can only take one value at a time, for a given message there is exactly one active neuron in each cluster. Consequently, a message can be naturally encoded as a sparse binary string of length $c\ell$ with exactly $c$ 1s. In other words, the locations of the active neurons express a particular message. Once the values of the $c$ symbols are determined, the $c$ corresponding neurons are fixed. These neurons form a clique (complete sub-graph), which is the representation of a message stored in a GBNN.

Fig. 1 illustrates an example of GBNN with $c = 4$ clusters, each containing $\ell = 16$ neurons. We number the neurons sequentially row by row from 1 to 16 in each cluster. There are three cliques drawn in Fig. 1, so this particular instance of GBNN stores three messages; the black clique indicates the message (9, 4, 3, 10), i.e., $m_1 = 9$ for cluster 1, $m_2 = 4$ for cluster 2, $m_3 = 3$ for cluster 3 and $m_4 = 10$ for cluster 4.

Fig. 1: An example of a network with $c = 4$ clusters of $\ell = 16$ neurons each [15]. We number the clusters from left to right and from top to bottom. The same scheme applies to the neurons within each cluster.

Initially, there are no edges in the network. In the storing phase, as a message is presented, a clique with all its edges is added to the network. As more messages are stored, more edges are added. There are no edges within a cluster. The weights on these edges are binary as well, i.e., an edge either exists or it does not.
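As a concrete illustration of the storing operation, here is a minimal Python sketch (our own, not the authors' implementation; the helper names neuron_index and store and the use of a dense NumPy matrix are assumptions made for illustration) that builds the binary connection matrix for the toy network of Fig. 1.

    import numpy as np

    c, l = 4, 16                      # c clusters of l neurons each, as in Fig. 1
    n = c * l                         # total number of neurons

    def neuron_index(i, j):
        """Global index of the j-th neuron (0-based) in cluster i (0-based)."""
        return i * l + j

    def store(W, message):
        """Add the clique of one message (a tuple of c 0-based symbols) to W."""
        idx = [neuron_index(i, m) for i, m in enumerate(message)]
        for a in idx:
            for b in idx:
                if a != b:            # no self loops; intra-cluster edges never arise
                    W[a, b] = 1       # binary weights: an edge either exists or not
        return W

    W = np.zeros((n, n), dtype=np.uint8)
    store(W, (8, 3, 2, 9))            # the message (9, 4, 3, 10) of Fig. 1, 0-based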

For retrieval, an incomplete probe is given, e.g., (9, ?, ?, 10), and the network is asked which stored message is most similar to the query. Since the values for clusters 1 and 4 are known, to complete the query, the network activates the corresponding neurons in clusters 1 and 4. Iterative activation rules can then be exploited to determine which neurons in clusters 2 and 3 need to be active; sum-of-sum [14] and sum-of-max [17] are both such examples, and we discuss them next.

II-B Activation Rules

We regard active neurons as energy sources sending out signals along the edges. We first explain the operations of sum-of-sum and sum-of-max at a high level, and then introduce rigorous notation.

sum-of-sum is the default activation rule for GBNN [14, 15], and it is also used in the model of Moopenn et al. [16]. Initially, the neurons corresponding to the non-erased part of the probe are active and transmit signals, whereas all the neurons in the missing clusters are deactivated. After each iteration, the neurons might receive different numbers of signals; in each cluster, only the neurons with the most signals remain active in the next iteration. In contrast, sum-of-max [17] keeps a neuron active if and only if it receives signals from every other cluster plus the self excitation; multiple signal contributions from the same cluster do not add up. However, in order for sum-of-max to proceed correctly, we initially activate all the neurons in the missing clusters instead, which is the opposite of sum-of-sum.

Let $w_{(i'j')(ij)}$ denote the indicator function of whether $n_{i'j'}$ connects to $n_{ij}$, i.e.,

w_{(i'j')(ij)} = \begin{cases} 1, & \text{if } n_{i'j'} \text{ and } n_{ij} \text{ are connected,} \\ 0, & \text{otherwise.} \end{cases}    (1)

Let $v_{ij}^t$ denote the indicator function of the potential for $n_{ij}$ in iteration $t$, i.e.,

v_{ij}^t = \begin{cases} 1, & \text{if } n_{ij} \text{ is active in iteration } t, \\ 0, & \text{otherwise.} \end{cases}    (2)

We denote by $s_{ij}^t$ the count of the number of signals $n_{ij}$ receives at iteration $t$.

The sum-of-sum decoding dynamics are given by

s_{ij}^{t+1} = \gamma v_{ij}^t + \sum_{i'=1}^{c} \sum_{j'=1}^{\ell} w_{(i'j')(ij)} v_{i'j'}^t,    (3)
s_{i,\max}^{t+1} = \max_{1 \le j \le \ell} s_{ij}^{t+1},    (4)
v_{ij}^{t+1} = \mathbb{1}\left[ s_{ij}^{t+1} = s_{i,\max}^{t+1} \right],    (5)

where $\gamma$ is a reinforcement factor, representing the strength of self excitations. sum-of-max modifies the procedure to be

s_{ij}^{t+1} = \gamma v_{ij}^t + \sum_{i' \ne i} \max_{1 \le j' \le \ell} \left( w_{(i'j')(ij)} v_{i'j'}^t \right),    (6)
v_{ij}^{t+1} = \mathbb{1}\left[ s_{ij}^{t+1} = \gamma + c - 1 \right].    (7)

Thus for sum-of-max, one neuron receives at most one signal from each cluster.
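To make the two rules concrete, the following minimal Python sketch (ours, not part of the original implementations; it assumes the dense matrix W from the storing sketch above and a flattened 0/1 state vector v) performs one iteration of each rule following the dynamics described above.

    import numpy as np

    def sum_of_sum_step(W, v, c, l, gamma=1):
        """One sum-of-sum iteration, Eqs. (3)-(5); v is a length c*l 0/1 vector."""
        v = v.astype(np.int64)
        s = gamma * v + W.astype(np.int64) @ v               # Eq. (3): count every signal
        v_next = np.zeros_like(v)
        for i in range(c):
            cl = slice(i * l, (i + 1) * l)
            s_max = s[cl].max()                               # Eq. (4): per-cluster maximum
            v_next[cl] = (s[cl] == s_max).astype(np.int64)    # Eq. (5)
        return v_next

    def sum_of_max_step(W, v, c, l, gamma=1):
        """One sum-of-max iteration, Eqs. (6)-(7)."""
        v = v.astype(np.int64)
        s = gamma * v
        for i in range(c):
            cl = slice(i * l, (i + 1) * l)
            # Eq. (6): each cluster contributes at most one signal to a neuron;
            # a neuron's own cluster adds nothing since intra-cluster weights are 0
            s = s + (W[:, cl].astype(np.int64) * v[cl]).max(axis=1)
        # Eq. (7): stay active iff a signal arrives from every other cluster plus self
        return (s == gamma + c - 1).astype(np.int64)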

This process continues until the network converges, if it ever does. In fact, [24] already provides a simple example showing that sum-of-sum might oscillate, and also proves that sum-of-max is guaranteed to converge. This is one of the reasons we prefer sum-of-max over sum-of-sum.

The other reason can be illustrated by Fig. 2, where two neurons in cluster 3 each receive two individual signals; only the signals flowing into cluster 3 are drawn. One of them receives its two signals from neurons in the same cluster (cluster 1), whereas the other receives its two signals from different clusters. In this case, the latter should be favored, since by design a stored message corresponds to a clique, and each cluster can only accommodate one single active neuron. sum-of-sum activates both neurons, since they receive equal numbers of signals, whereas sum-of-max can differentiate between them and favor the latter as desired. A worse but possible situation is that the former receives more signals than the latter; in that case only the former remains active, and the correct neuron is deactivated. Therefore, for sum-of-sum, decoding errors may propagate from the current iteration to the next.

Fig. 2: Illustration of the sum-of-sum trap. Only the signals flowing into cluster 3 are drawn.

In summary, sum-of-sum counts individual signals, possibly propagating decoding errors, whereas sum-of-max counts cluster-wise contributions, preserving the desired decoding procedure. This is exactly the reason that sum-of-max outperforms sum-of-sum in terms of retrieval rate by a large margin, especially in challenging scenarios, e.g., when either the number of stored messages or the number of erased symbols increases. For detailed performance comparisons and different initialization schemes, see [17, 24].

III Bogus Fixed Point Problem

Concerning the activation rules, all prior discussions and experiments concentrate on comparative studies between sum-of-sum and sum-of-max. However, little effort has been put into investigations of the activation rules themselves. In this section, we will illustrate a formerly disregarded aspect and identify a hidden issue embedded in sum-of-max.

Fig. 3 depicts a part of the final state after sum-of-max has converged. For brevity and clarity, we do not draw the edges of the network described by the connection weights $w$, but only the signal paths. For the same purpose, we also omit a large number of signal paths; each dashed circle stands for some active neuron in a different cluster, contributing signals to each of the solid neurons.

Fig. 3: Illustration of the overlooked sum-of-max problem after the network has converged. Only some effective signal paths are drawn. Dashed circles are active neurons in other clusters; they all contribute signals to each of the solid neurons in the left corner.

Let us focus on the solid neurons in the left corner. There are three clusters and each of them has two active neurons in the converged state. All six neurons receive signals from every other cluster, including the signals from the dashed neurons which we do not draw in Fig. 3. Hence, sum-of-max will keep them all active. To decode a message, we try to find a $c$-partite clique, since we have $c$ clusters. The problem is that after sum-of-max has converged, other signal paths which do not form such a clique also exist, e.g., the thinner lines in Fig. 3. A subset of a clique is a clique of a smaller size. Therefore, if the message is to be decoded correctly, a clique of size three, the only dark triangle we can find, needs to be identified. This is simply a small illustration of a three-cluster scenario; imagine the complication when all the signal paths are present and the network gets larger. If we do not select active neurons strategically but arbitrarily pick some random neurons in the converged state as the final answer, it is highly likely that we will not choose the ones that can actually form a clique.

One might wonder, since each stored message corresponds to a clique, how it is possible for these thinner lines which cannot form a clique to remain in the converged state. For a neuron to receive a signal from others, two prerequisites exist:

  1. There is an edge. This part is fixed after the storing phase finishes.

  2. The neuron on the other side of the edge is active, i.e., not every edge carries a signal. This part is dynamic as the retrieval process continues.

The storing phase preserves all cliques, whereas the retrieving phase only searches for cliques with active edges (edges carrying signals). As a matter of fact, these thinner lines are part of cliques corresponding to other messages (which we do not draw in Fig. 3), but they fail to form an active clique as the decoding result, which requires all of its edges to be active. Nevertheless, sum-of-max undesirably preserves all these thinner lines.

IV Heuristic Solutions

In Section III, we have shown the bogus fixed point problem of sum-of-max: the cliques corresponding to stored messages are hidden in the converged state. Interestingly, the cause of the problem is exactly the reason that sum-of-max outperforms sum-of-sum in terms of retrieval rate: it does not differentiate individual signals from the same cluster. For instance, in Fig. 3, both active neurons of each cluster receive contributions from the other two clusters, hence at the end they all remain active. We need some post-processing to break the tie and let sum-of-max continue with its work.

We choose to deactivate some neurons when sum-of-max gets trapped and makes no further progress, until a clique of size $c$ is hopefully found. Since it is the cluster-wise signal contributions that lead sum-of-max into the bogus fixed point, we take individual signals into account to fix the problem. For example, in Fig. 3, one active neuron of a cluster may receive three individual signals while the other receives only two. In this case, the former should be favored, because more individual signals means that more stored messages have a symbol corresponding to that neuron, which also means that we have a larger chance of successfully finding a clique using it.

Two options are available: either we keep the neuron with the most individual signals active, or we deactivate the neuron with the fewest individual signals. Once the tie is broken, sum-of-max is able to continue eliminating neurons. In this particular example, both options favor the neuron with more individual signals, and irrelevant neurons in other clusters will all be deactivated in the next iteration, with the desired dark triangle preserved.

So far, we have discussed the operations within a single cluster. The next question is in which cluster we should break the tie. Fig. 3 is again a simple illustration in the sense that all three clusters of interest have exactly the same number of active neurons. In general, the bogus fixed point will have clusters with different numbers of active neurons. We can choose the cluster with either the most or the fewest active neurons to start off.

Therefore, we consider the following four heuristics:

  • activating the neuron with the most individual signals in the cluster with the most active neurons, (mm).

  • activating the neuron with the most individual signals in the cluster with the fewest active neurons, (mf).

  • deactivating the neuron with the fewest individual signals in the cluster with the most active neurons, (fm).

  • deactivating the neuron with the fewest individual signals in the cluster with the fewest active neurons, (ff).

We argue at this point that intuitively fm and ff ought to outperform mm and mf in terms of retrieval rate. mm and mf activate a particular neuron in some cluster, and they tend to take a guess too early in the retrieving process, whereas fm and ff eliminate unlikely neurons and let sum-of-max clean up the path in the coming iterations. In terms of run-time, the competition should be reversed due to the same reasoning. We also argue that mf ought to outperform mm, since both of them are required to choose a neuron in some cluster first and then presume that the selected neuron will be in the desired clique. If the neuron comes from a cluster with more active neurons (mm), we are more likely to make an erroneous guess. Simulation results are available in Section VI-B.

We also consider two other alternatives:

  • deactivating the neuron with the fewest edges across the network, i.e., the node with the fewest neighbors, (fe).

  • deactivating the neuron with the fewest signals across the network, i.e., the node with the fewest active neighbors, (fs).

These two options have interesting justifications. fe tends to deactivate neurons that appear less often in the stored messages: neurons with fewer edges are expected to have appeared less often in the storing process, so fe intuitively takes into account the relative frequency of the corresponding symbols in the stored messages. fs does something similar; however, it is only interested in the subset of the neurons which the input probe tries to address.
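As an illustration of how such a tie-break can be implemented, the sketch below (ours; it reuses the W, v and cluster layout of the earlier sketches, and the restriction to clusters with more than one active neuron is our assumption) applies the ff heuristic once.

    import numpy as np

    def ff_tie_break(W, v, c, l):
        """Apply ff once: deactivate the neuron with the fewest individual
        signals inside the cluster with the fewest active neurons."""
        signals = W.astype(np.int64) @ v.astype(np.int64)    # individual signal counts
        best_cluster, best_count = None, None
        for i in range(c):
            cl = slice(i * l, (i + 1) * l)
            active = int(v[cl].sum())
            if active > 1 and (best_count is None or active < best_count):
                best_cluster, best_count = i, active          # fewest (>1) active neurons
        if best_cluster is None:
            return v                                           # no tie left to break
        cl = slice(best_cluster * l, (best_cluster + 1) * l)
        masked = np.where(v[cl] == 1, signals[cl], np.inf)     # ignore inactive neurons
        loser = int(np.argmin(masked))                         # fewest individual signals
        v = v.copy()
        v[best_cluster * l + loser] = 0
        return v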

Note that all these heuristics aim to increase the chance of finding a correct clique, but none of them guarantees that a clique will be found. Although the heuristics do not affect the convergence of the retrieving process, it is possible that eventually some cluster will have all its neurons deactivated, and in the next iteration all neurons across the network will deactivate according to Eq. (6) and Eq. (7). After applying these heuristics once, sum-of-max might get caught in some other state later on, thus we may have to break ties several times along the way until either a clique is found or some cluster has all its neurons deactivated.

Here we present an example showing that these heuristics are not guaranteed to find the desired clique. Let us assume that we use the heuristic ff, that is, eliminating the neuron with the fewest signals in the cluster with the fewest active neurons. We will see in the numerical experiments of Section VI-B that ff actually performs much better than the other heuristics. However, look at Fig. 4. The dark triangle is again the only active clique we can find at this stage. According to ff, the cluster with the fewest active neurons will be chosen, and the active neuron in it with fewer individual signals will be eliminated right away, even though it is precisely the one that belongs to the dark triangle.

Fig. 4: Illustration of the failure of the heuristic ff. Although ff performs better than the other heuristics, it still cannot guarantee an active clique to be found.

V Maximum Clique

We first introduce necessary notation and definitions. Let $G = (V, E)$ be an undirected graph, where $V$ is the set of the nodes and $E$ is the set of the edges, with $n = |V|$ being the number of nodes in $G$. We denote by $N(v)$ the neighborhood of the node $v$, i.e.,

N(v) = \{ u \in V : (u, v) \in E \}.    (8)
Definition 1.

A maximal clique is a clique which is not a subgraph of any other clique in the graph $G$.

Definition 2.

A maximum clique is a maximal clique of the largest size in the graph $G$.

To better differentiate these two definitions, see Fig. 5. The two highlighted cliques are both maximal cliques, since adding an extra node to either of them does not form a larger clique. The larger of the two is a maximum clique, since it is a maximal clique of the largest size (4 in this example).

Fig. 5: The difference between a maximal clique and a maximum clique. There are two maximal cliques in the graph; the larger one is a maximum clique.
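To make the distinction concrete, here is a small example using the networkx library (our own illustration; the graph is arbitrary and unrelated to Fig. 5): nx.find_cliques enumerates the maximal cliques, and the largest of them is a maximum clique.

    import networkx as nx

    G = nx.Graph()
    # a 4-clique {0, 1, 2, 3} plus a triangle {3, 4, 5} attached at node 3
    G.add_edges_from([(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3),
                      (3, 4), (3, 5), (4, 5)])

    maximal = list(nx.find_cliques(G))   # all maximal cliques, e.g. [[0, 1, 2, 3], [3, 4, 5]]
    maximum = max(maximal, key=len)      # a maximum clique: [0, 1, 2, 3], of size 4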

V-a Motivation

We have seen in previous sections that retrieving a message is equivalent to finding an active clique in the network. We have also seen in Section IV that although extremely helpful for sum-of-max to escape from the bogus fixed point, the heuristics proposed there are not guaranteed to find a clique.

Since there are no connections among the neurons within the same cluster, given a $c$-clustered GBNN, the network is a $c$-partite graph. The active cliques we aim to find are all cliques of size $c$, and no clique contains more than $c$ neurons. Thus these cliques are also the maximum cliques in the network. Therefore, the question becomes whether we can find the maximum cliques of a $c$-partite graph reliably and efficiently.

V-B Prior Work

Finding the maximum cliques in an arbitrary graph is one of Karp’s famous 21 NP-hard problems [25]. Generally speaking, to solve a difficult problem, two approaches exist: either answer it exactly but inefficiently, or approximately but quickly. However, the maximum clique problem does not lend itself to the approximate approach [26], in the sense that for any $\epsilon > 0$, it cannot be approximated in polynomial time (unless P = NP) within the performance ratio

n^{1 - \epsilon},    (9)

where $n$ here is the number of neurons in the network.

Modern algorithms for solving the maximum clique problem all follow the same branch-and-bound meta template. Typical representatives are the algorithms of Carraghan and Pardalos [27], Östergård [28], Tomita and Seki [29], Konc and Janezic [30], and Pattabiraman et al. [31]. After presenting the classic algorithm [27] as Algorithm 1, we spend some time explaining the rationale behind it; the improvements by the others are then much easier to follow in an incremental manner.

Input: an undirected graph G = (V, E)
Output: Q_max (a maximum clique of G)

global Q ← ∅, Q_max ← ∅
clique(V)
return Q_max

function clique(R):
    if R = ∅ and |Q| > |Q_max| then
        Q_max ← Q
        return
    end if
    while R ≠ ∅ do
        if |Q| + |R| ≤ |Q_max| then
            return
        end if
        take a node v out of R, i.e., R ← R \ {v}
        Q ← Q ∪ {v}
        clique(R ∩ N(v))
        Q ← Q \ {v}
    end while

Algorithm 1: The classic maximum clique finding algorithm by Carraghan and Pardalos [27]

Algorithm 1 starts by defining two global sets $Q$ and $Q_{\max}$, where $Q_{\max}$ records the largest clique we have encountered thus far, and $Q$ is the current clique we are investigating. After initializing $Q$ and $Q_{\max}$, the algorithm recursively calls the function clique, which takes a node set $R$ (when implemented, it can be a vector or an array in memory), the current sub-graph of interest, as its argument. In the initial call, the full node set $V$ is under consideration. Algorithm 1 first checks the maximal cliques containing the first node, then those containing the second, and so on. It tries to enumerate all the maximal cliques and then keeps the largest one as the final answer. When the algorithm terminates, the maximum clique is reported in $Q_{\max}$.

Every time the recursion enters the function clique with $R$ empty, a maximal clique has been recorded in $Q$; thus, if its size exceeds that of the current largest clique, we replace $Q_{\max}$ with $Q$. Inside the while loop, as long as the node set $R$ is not empty, we take one node $v$ out at a time, assume it is in the current clique under investigation, and then recursively check a smaller sub-graph. Since $Q$ must record a clique at all times, we have to ensure that all the nodes in the next sub-graph we are about to check connect to every node in $Q$; this is accomplished by the set intersection $R \cap N(v)$ passed to the recursive call. Using the terminology of branch-and-bound algorithms, this recursive call is the branch step, which goes one level deeper into the search tree, whereas the test $|Q| + |R| \le |Q_{\max}|$ is the bound step: even if all the nodes in the current pool $R$ were added to the current clique $Q$, they still could not form a larger clique than the largest one seen so far, so we prune the search branch immediately. Here $|Q_{\max}|$ serves as a lower bound on the size of the global optimum, whereas $|Q| + |R|$ can be regarded as an upper bound on the best solution reachable from the current branch.
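For concreteness, a direct Python transcription of Algorithm 1 might look as follows (our own sketch, not the original implementation; adj is assumed to map every node to its set of neighbors).

    def max_clique(nodes, adj):
        """Carraghan-Pardalos style branch and bound; adj[v] is the neighbor set of v."""
        Q, Q_max = [], []

        def clique(R):
            nonlocal Q_max
            if not R and len(Q) > len(Q_max):
                Q_max = Q.copy()                        # a larger maximal clique found
                return
            R = list(R)
            while R:
                if len(Q) + len(R) <= len(Q_max):
                    return                              # bound: this branch cannot win
                v = R.pop()                             # take one candidate node
                Q.append(v)                             # assume v belongs to the clique
                clique([u for u in R if u in adj[v]])   # branch on R ∩ N(v)
                Q.pop()                                 # backtrack

        clique(list(nodes))
        return Q_max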

Östergård [28] accelerates the algorithm with some extra bookkeeping as well as by reversing the order in which the nodes are investigated, so that a new type of pruning technique can be applied. The vertex coloring problem is a closely related NP-hard problem which can be used to accelerate the clique finding process; in this task, one is required to assign colors to the vertices of the graph such that no adjacent vertices share the same color. Since solving the coloring problem exactly is also exponentially difficult, Tomita and Seki [29] run an approximate vertex coloring algorithm in a greedy manner on the candidate set to construct a relaxed upper bound, which replaces $|Q| + |R|$ in the bound test, so that more search branches can be cut as early as possible. Konc and Janezic [30] take a similar approach to [29], but they also reduce the steps needed for the approximate vertex coloring by carefully maintaining a non-increasing coloring order of the vertices. Since evaluating this auxiliary bound also takes time, it is not economical to recompute it at every level; therefore, [30] also provides a variant of the algorithm which, based on empirical experience, only carries out the approximate vertex coloring at the top levels of the search tree, so that the total run-time is not hurt by unnecessary computations once the time-consuming vertex coloring starts to slow down the whole algorithm. The relatively recent work by Pattabiraman et al. [31] argues that most real world networks are rarely dense. After each update of $Q_{\max}$, they prune the search domain in successive recursive calls by only checking the nodes with degree at least $|Q_{\max}|$, since a node with a smaller degree can never be part of a clique larger than the current largest one found.

V-C Proposed Approach

The aforementioned algorithms are all designed to find maximum cliques in general graphs, whereas GBNNs have a strict and convenient structure. Algorithms in the existing literature must enumerate maximal cliques until every node has been investigated; otherwise there is no way to ensure that the final $Q_{\max}$ is the largest one. However, as pointed out above, a GBNN is a $c$-partite graph, and the active clique we try to find is a maximum clique of size exactly $c$. In other words, we know the targeted size in advance, which can be used directly to prune the search space. We also know that once a clique of size $c$ has been found, it is guaranteed to be a maximum clique. All of this extra information can be used to accelerate our approach. We present our algorithm as Algorithm 2, mimicking the structure of Algorithm 1.

Input: a GBNN with c clusters after sum-of-max has converged to the bogus fixed point
Output: Q (an active clique)

global Q ← ∅, found ← false
obtain a smaller graph with c′ clusters by eliminating the non-erased clusters and the erased clusters with only one active neuron
obtain an even smaller graph by eliminating the inactive neurons in the remaining clusters
split the remaining neurons into c′ sets R_1, …, R_{c′}, each accommodating the neurons of a different cluster
clique({R_1, …, R_{c′}}, level = 1)
return Q

function clique(R = {R_1, …, R_{c′}}, level):
    if level > c′ then
        found ← true
        return
    end if
    sort R_level according to the degrees of its nodes
    while R_level ≠ ∅ do
        if any of R_level, …, R_{c′} is empty then
            return
        end if
        take a node v out of R_level
        Q ← Q ∪ {v}
        clique(update(R, v, level), level + 1)
        if found then
            return
        end if
        Q ← Q \ {v}
    end while

function update(R, v, level):
    for i = level + 1, …, c′ do
        R_i ← R_i ∩ N(v)
    end for
    sort R_{level+1}, …, R_{c′} according to the number of neurons in each set
    return R

Algorithm 2: The proposed algorithm to fully exploit the nice structure of GBNN.

V-D Justifications

Two reasons prevent GBNNs in our context from being a sparse network:

  1. Although the representation of a given message in GBNN is extremely sparse, we focus on the situation when the retrieval scenario is challenging for the network. This means when the number of stored messages is large, the network is dense, which also means the converged state is a highly connected graph; see Fig. 4.

  2. Even if we store only a few messages, i.e., the total number of edges is small, we still get fully connected sub-graphs (cliques); meanwhile a large number of neurons are isolated. This is not a sparse network in the usual sense either.

These facts motivate the extra initialization steps at the beginning of Algorithm 2, right before the main recursion takes place. Non-erased clusters and erased clusters with only one single active neuron do not need to go into the recursion. This trick alone saves a great amount of time; see Section VI-C. Therefore, the recursive calls involve the reduced graph only, which consists of the active neurons in the bogus fixed point after sum-of-max has converged. The reduced graph has $c'$ clusters, where $c'$ is the number of clusters with multiple active neurons in the original network. Another main difference between Algorithm 2 and Algorithm 1 is that the argument of the clique function is no longer a single node set; it is now an aggregation of sets, each of which stores the neurons of a different cluster, so that the partite structure is preserved even after the network has been reduced.

The global set $Q$ stores the current clique under consideration. We choose on purpose that our algorithm terminates once the first clique has been found, for efficiency reasons; see the tests on the found flag. These tests can be safely deleted if all the cliques, i.e., all patterns matched by the input probe, are required to be retrieved instead of a particular one. The variable level is simply for notational convenience, so that we know the current size of $Q$. Meanwhile, level indicates not only the current cluster we are checking, but also the current level of the search tree. The maximum value of level is bounded by $c'$.

In addition, we need a simple helper function update, which essentially does the same thing as the set intersection $R \cap N(v)$ in Algorithm 1, updating the sub-graph for the next level of recursion, adapted to the fact that the argument is now a set of sets.

Two interesting steps in Algorithm 2 are its two sorting procedures. In the clique function, the nodes in the current cluster are sorted according to their degrees, so that the ones with fewer connections are expanded first. In the update function, all the clusters of the next recursive level are sorted according to the number of active neurons they retain after updating the sub-graph, so that the cluster with the fewest active neurons after the update will be the next one to expand. A nice property of the algorithm is that, once a cluster has been determined to be the next one to expand, we do not have to sort the clusters again (the sort in update starts from level+1, leaving the upper levels untouched), because each iteration of the while loop takes one neuron out of the current cluster. Therefore, the current cluster remains the one with the fewest active neurons when the algorithm returns from a deeper level. The purpose of these two sorting procedures is to ensure the search tree is flat instead of deep. This arrangement brings acceleration for two reasons:

  1. From an algorithmic point of view, a flat tree means if we prune once, a larger portion of the search domain can be discarded.

  2. From a programming point of view, a flat tree also means fewer recursive function calls.

Due to the way we sort the neurons and clusters, the final clique tends to include neurons with fewer signals (sub-messages with lower frequencies), which at first glance appears to hurt the retrieval rate. On closer thought, however, the opposite holds. We need to choose a neuron from some cluster anyway, so the same reasoning as for why mf ought to outperform mm applies here as well. Therefore, our two sorting procedures not only produce a faster algorithm, but also increase the retrieval rate. Simulations in Section VI-D provide evidence to support this claim.
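A compact Python sketch in the spirit of Algorithm 2 (ours, not the authors' code; clusters is assumed to be the list of per-cluster lists of active neurons in the reduced graph, and adj the adjacency among those active neurons) could read as follows.

    def find_active_clique(clusters, adj):
        """Search the reduced c'-partite graph for one clique spanning every cluster."""
        # expand the cluster with the fewest active neurons first
        pools0 = sorted((list(C) for C in clusters), key=len)
        Q = []

        def search(level, pools):
            if level == len(pools):
                return True                              # one neuron per cluster: done
            # within the current cluster, expand low-degree neurons first (flat tree)
            for v in sorted(pools[level], key=lambda u: len(adj[u])):
                Q.append(v)
                # keep only neurons connected to v in the deeper clusters and
                # re-sort those clusters by how many candidates remain
                rest = sorted(([u for u in C if u in adj[v]] for C in pools[level + 1:]),
                              key=len)
                if all(rest) and search(level + 1, pools[:level + 1] + rest):
                    return True                          # stop at the first clique found
                Q.pop()                                  # backtrack
            return False

        return Q if search(0, pools0) else None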

VI Experiments

In previous work [17, 24], the authors experiment mainly on easy scenarios, with a focus on comparative studies between sum-of-sum and sum-of-max, which, we believe, is one of the reasons that the bogus fixed point problem was not previously uncovered. Hence, in this section we concentrate on difficult scenarios where the number of erased clusters and the number of stored messages are challenging for a given network. All the simulations are performed on a 2.93 GHz Intel Core 2 Duo T9800 processor with 4 GB of memory.

Two testing scenarios are investigated. A small network contains 8 clusters with 128 neurons each, storing 5000 randomly generated messages. A large one contains 16 clusters with 256 neurons each, storing 40000 messages. Both scenarios use 5000 test messages for retrieving, and three quarters of the clusters are erased, i.e., 6 for the small and 12 for the large. The reinforcement factor $\gamma$ is set as suggested in [24] in all cases.

VI-A Bogus Fixed Point Problem

We have already argued that the bogus fixed point problem happens when the testing scenarios are challenging for a given network. We plot in Fig. 6 the empirical probability that sum-of-max reaches a bogus fixed point for the small scenario (with 4000 test messages) as the number of erased clusters and the number of stored messages increase. To estimate this probability, we first run sum-of-max and record the converged state as a binary vector $v$ with 1s marking the active neurons. Then we run a clique finding algorithm; instead of quitting immediately after finding the first clique, we find all the cliques in the converged state. We represent the ensemble of all the cliques found as another binary vector $v^*$. If $v^* \ne v$, there must exist active neurons that do not form a clique, hence the converged state is indeed a bogus fixed point. We increment a counter in this case, and the probability is the ratio between the counter and the number of test messages.
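In code, this test can be sketched as follows (our own illustration; it reuses the binary-vector convention above, and the driver at the bottom uses hypothetical helper names).

    import numpy as np

    def is_bogus(v, all_cliques):
        """v: converged 0/1 state; all_cliques: list of cliques (lists of neuron indices)."""
        v_star = np.zeros_like(v)
        for clique in all_cliques:               # ensemble of every clique in the state
            v_star[list(clique)] = 1
        return not np.array_equal(v_star, v)     # bogus iff some active neuron is in no clique

    # hypothetical driver for the empirical probability:
    # bogus = sum(is_bogus(run_sum_of_max(W, p), enumerate_cliques(W, p)) for p in probes)
    # probability = bogus / len(probes)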

Fig. 6: The probability that sum-of-max reaches a bogus fixed point for the small scenario as the number of erased clusters and the number of stored messages increase.

The bogus fixed point problem happens regardless of the network size, although we only plot the small scenario for illustration purposes. As we can see from Fig. 6, the probability that sum-of-max converges to a bogus fixed point is a complicated function of both the number of erased clusters and the number of stored messages. When the number of erased clusters is small (e.g., 1 or 2), the probability increases with the number of stored messages, which also holds for the initial part of the curves when the number of erased clusters is large (e.g., 3, 4 or 5). Intuitively, more stored messages saturate the network by adding more cliques, so it is more likely for the network to converge to a bogus fixed point. Quite interestingly, the tail of the curves decreases to 0 when the number of erased clusters is large. This happens because too many clusters are missing and the input probe provides too little information to retrieve the desired clique. As a result, all neurons in the erased clusters remain active, which makes $v^*$ identical with $v$. In this case, although the probability of a bogus fixed point is small, the retrieved pattern is useless.

VI-B Different Heuristics

The approaches we test here include the original random selection scheme and the six heuristics proposed in Section IV. As mentioned in Section IV, the proposed heuristics do not guarantee that a clique will be found. Therefore, in our implementation we record the states along the way, so that once a cluster has no active neurons, we can rewind the state and give the post-processing a second chance when converting the binary encoding back to the message form. We also compare them with the clique finding algorithm in [30] (mcqd). It is the high retrieval rate this algorithm offers that motivated us to develop our own clique finding algorithm, Algorithm 2, which is more customized to GBNN. Comparisons and accelerations of different clique finding algorithms are carried out in the next sub-section.

Fig. 7: Comparisons of different heuristics to escape from the bogus fixed point, where random is the standard sum-of-max implemented in prior work and mcqd is the algorithm in [30]. Panels (a) and (b) show the retrieval rate and run-time for the small testing scenario respectively, whereas (c) and (d) are for the large scenario. Only the post-processing time is reported. Before converging, sum-of-max runs in about 6 seconds for the small scenario and about 100 seconds for the large scenario across the board. The exciting results of the clique finding approach (mcqd) encourage us to develop our proposed Algorithm 2, customized to GBNN's structure.

We report both the retrieval rate and the run-time for the different heuristics in Fig. 7. The run-time reported is only for the post-processing part after sum-of-max has converged to the bogus fixed point. We see from Fig. 7(a) and Fig. 7(c) that all of the heuristics are better than the random selection scheme of the standard sum-of-max in terms of retrieval rate. In Section IV, we remarked that fm and ff ought to outperform mm and mf. This is true for the small scenario, but it does not hold for the large case (fm is much worse than the rest). We also argued in Section IV that mf ought to be better than mm, which is evident in both scenarios. Although fe tries to reflect the pattern frequencies, it does not work out in either case. This is mainly because we concentrate on difficult situations where the number of stored patterns is demanding for a given network, so the resulting GBNN is highly connected; the frequencies are then no longer a valid indicator (imagine the extreme case in which every neuron connects to every other neuron in the other clusters). fs performs better than fe since it is only interested in the subset of neurons addressed by the probe, and a lot of unnecessary edges are eliminated before it makes the decision. The clique finding approach is quite encouraging in both scenarios: for the small case, the retrieval rate quadruples from 20% to 80%, and it almost perfectly recovers all the queries in the large setting.

In terms of run-time, from Fig. 7(b) and Fig. 7(d), we can tell that mm and mf are much faster than the rest, which accords with our judgement in Section IV. The fs heuristic is really slow, even slower than the clique finding approach in the large scenario. Considering both retrieval rate and run-time, we argue that ff strikes a better balance between these two metrics than any other heuristic.

The astonishing retrieval rate but slow run-time of the clique finding approach (mcqd) indeed stimulates us to work on faster solutions.

VI-C Different Clique Finding Algorithms

In the previous sub-section, we saw a terrific performance gain from exploiting clique finding algorithms. However, the run-time is not satisfactory. This sub-section compares and accelerates the clique finding approaches, so that we can have both correctness and speed.

We focus on three algorithms: the fastest available variant of the vertex coloring approach [30], the classic algorithm [27] as in Algorithm 1, and our newly developed Algorithm 2. We first demonstrate that renumbering and reordering the original GBNN into a reduced graph brings tremendous acceleration. Fig. 8 shows the box plot of the number of recursive function calls, the original versus the reduced graph, for the large setting using mcqd. There is no significant difference between these two graphs in terms of the number of recursive calls. However, the run-time is a totally different story: the original graph requires 439.011 seconds, whereas the reduced graph only needs 6.197 seconds, which indicates that the reduced graph cuts the time spent at each level of the recursive calls.

Fig. 8: Comparisons of mcqd for the large scenario between the original and the reduced graph in terms of the number of recursive function calls.

Then we apply the same reducing trick to all the algorithms in comparison. Fig. 9 presents the retrieval rate and run-time for different clique finding algorithms in both small and large settings. All of the retrieval rates are identical and promising for different algorithms, which also validates that our implementations are correct. The run-time of the newly developed Algorithm 2 is a big win over all the other alternatives, not only among the clique finding algorithms, but also across all the other heuristics; see Fig. 7.

Fig. 9: Comparisons of different clique finding algorithms in different experiment settings. Panels (a) and (b) show the retrieval rate and run-time for the small scenario respectively, whereas (c) and (d) are for the large setting.

VI-D Sorting Procedures in Algorithm 2

First we would like to provide evidence that, compared to a deep search tree, a flat one not only accelerates the algorithm but also brings a better retrieval rate. To make this argument, we first run the newly developed Algorithm 2, and then reverse both sorting procedures so that a deep search tree is constructed instead. For a better illustration, we take the large scenario again and increase the number of stored messages to 50000 to challenge the algorithms even further. The flat tree runs in 42.573 seconds, giving us 77.4% successful retrievals, whereas the deep tree runs in 584.337 seconds, giving us 65.6% successful retrievals. This is a roughly 14× faster version with an over 10% gain in retrieval rate, using almost the same algorithm except for the branching order.

Also see Fig. 10, which shows the number of recursive calls either approach goes through to retrieve a message. The red dashed triangles are for the deep tree, and the blue circles are for the flat tree. We can easily tell that, to retrieve some messages, the deep tree approach has to invoke nearly 300000 function calls, while in general the flat tree requires far fewer function calls than the deep one. To be more precise, the medians of the number of function calls for the flat and deep trees are 1008 and 19082 respectively, roughly 20× smaller for the flat case.

Fig. 10: The number of recursive function calls for each message by the flat and the deep tree approach respectively. The red dashed triangles are for the deep tree while the blue circles are for the flat tree. The flat tree needs far fewer function calls.

Finally, we would like to know whether these two sorting procedures contribute equally to the algorithm. We first run the algorithm with both sorting procedures turned on: it runs in 43.072 seconds with a retrieval rate of 77.4%. Then we keep only the sorting within the current cluster according to the node degrees (the per-node sort in the clique function of Algorithm 2): it runs in 106.541 seconds with the retrieval rate unchanged. We run the algorithm again, this time with only the sorting between clusters turned on (the per-cluster sort in the update function of Algorithm 2): it runs in 42.595 seconds with the retrieval rate dropping to 71.1%. All three versions run against the same data, eliminating all undesired factors. The results indicate that the order in which clusters are expanded is crucial to the correctness of the decoding process; once the cluster order has been determined, which node to branch on first mainly affects the run-time.

VII Summary

GBNN is a recently invented recurrent neural network embracing an LDPC-like sparse encoding setup, which makes it extremely resilient to noise and errors. Two activation rules exist for the neuron dynamics, namely sum-of-sum and sum-of-max. In this work, we look into the activation rules themselves. sum-of-sum focuses on individual signals, whereas sum-of-max concentrates on cluster-wise signal contributions. This particular trait helps sum-of-max stand out in terms of successful retrieval rate by a large margin. However, the same peculiarity ensnares sum-of-max once it has reached its converged state. We identify this overlooked situation for the first time and propose a number of heuristics to facilitate sum-of-max's decoding process, pushing it beyond the bogus fixed point by taking into account individual signals, which was the original spirit of sum-of-sum. Prior work, e.g., [24], combines sum-of-sum and sum-of-max mainly for computational reasons, so that sum-of-sum can be exploited to accelerate the sum-of-max process, whereas this work blends the two from a totally orthogonal perspective.

To solve the bogus fixed point problem directly and completely, a post-processing algorithm is also developed, which essentially finds a maximum clique in the network. The algorithm is tailored to the special property that a GBNN is a $c$-partite graph. Experimental results show that the algorithm outperforms the random selection of the standard sum-of-max scheme and all the heuristics proposed in this work, in terms of both retrieval rate and run-time, which also suggests that there is plenty of room to improve the activation rules themselves.

Possible directions for future research include stochastic activation schemes, as used in Boltzmann machines or deep learning networks, so that the retrieval process for GBNN does not need to be divided into two distinct stages. It may also be a good heuristic to introduce a temperature parameter, as in Boltzmann machines, into the decoding process, since it might help avoid the plateau that corresponds to the bogus fixed point of sum-of-max. In addition, we are interested in adapting and testing sum-of-sum or sum-of-max in general networks other than partite graphs, e.g., Erdős-Rényi graphs.

Acknowledgement

This work was funded, in part, by the Natural Sciences and Engineering Research Council of Canada (NSERC), the Fonds Québécois de la recherche sur la nature et les technologies (FQRNT) and the European Research Council project NEUCOD.

References

  • [1] S. Kaxiras and G. Keramidas, “IPStash: a set-associative memory approach for efficient IP-lookup,” in Proc. 24th Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM), vol. 2, Miami, FL, USA, 2005, pp. 992–1001.
  • [2] M. E. Valle, “A class of sparsely connected autoassociative morphological memories for large color images,” IEEE Transactions on Neural Networks, vol. 20, no. 6, pp. 1045–1050, 2009.
  • [3] C. S. Lin, D. C. P. Smith, and J. M. Smith, “The design of a rotating associative memory for relational database applications,” ACM Transactions on Database Systems, vol. 1, no. 1, pp. 53–65, March 1976.
  • [4] L. Bu and J. Chandy, “FPGA based network intrusion detection using content addressable memories,” in IEEE Symposium on Field-Programmable Custom Computing Machines, Napa, CA, USA, 2004, pp. 316–317.
  • [5] K.-J. Lin and C.-W. Wu, “A low-power CAM design for LZ data compression,” IEEE Transactions on Computers, vol. 49, no. 10, pp. 1139–1145, October 2000.
  • [6] H. Zhang, B. Zhang, W. Huang, and Q. Tian, “Gabor wavelet associative memory for face recognition,” IEEE Transactions on Neural Networks, vol. 16, no. 1, pp. 275–278, 2005.
  • [7] J. A. Anderson and E. Rosenfeld, Neurocomputing: foundations of research, ser. Bradford Books.   MIT Press, 1988, vol. 1.
  • [8] J. A. Anderson, A. Pellionisz, and E. Rosenfeld, Neurocomputing 2: Directions of Research, ser. Bradford Books.   MIT Press, 1993, vol. 2.
  • [9] D. J. Willshaw, O. P. Buneman, and H. C. Longuet-Higgins, “Non-holographic associative memory.” Nature, vol. 222, pp. 960–962, 1969.
  • [10] D. Willshaw, “Models of distributed associative memory.” Ph.D. dissertation, Edinburgh University, 1971.
  • [11] J. Hopfield, “Neural networks and physical systems with emergent collective computational abilities,” Proceedings of the National Academy of Sciences, vol. 79, no. 8, pp. 2554–2558, 1982.
  • [12] ——, “Neurons with graded response have collective computational properties like those of two-state neurons,” Proceedings of the National Academy of Sciences, vol. 81, no. 10, pp. 3088–3092, 1984.
  • [13] G. Palm, “Neural associative memories and sparse coding,” Neural Networks, vol. 37, pp. 165–171, January 2013.
  • [14] V. Gripon and C. Berrou, “A simple and efficient way to store many messages using neural cliques,” in IEEE Symposium on Computational Intelligence, Cognitive Algorithms, Mind, and Brain (CCMB), Paris, France, 2011, pp. 1–5.
  • [15] ——, “Sparse neural networks with large learning diversity,” IEEE Transactions on Neural Networks, vol. 22, no. 7, pp. 1087–1096, 2011.
  • [16] A. Moopenn, J. Lambe, and A. Thakoor, “Electronic implementation of associative memory based on neural network models,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 17, no. 2, pp. 325–331, 1987.
  • [17] V. Gripon and C. Berrou, “Nearly-optimal associative memories based on distributed constant weight codes,” in Information Theory and Applications Workshop (ITA), San Diego, CA, USA, 2012, pp. 269–273.
  • [18] X. Jiang, V. Gripon, and C. Berrou, “Learning long sequences in binary neural networks,” in International Conference on Advanced Cognitive Technologies and Applications, Nice, France, 2012, pp. 165–170.
  • [19] B. K. Aliabadi, C. Berrou, V. Gripon, and X. Jiang, “Learning sparse messages in networks of neural cliques,” ACM Computing Research Repository, 2012. [Online]. Available: http://arxiv.org/abs/1208.4009v1
  • [20] H. Jarollahi, N. Onizawa, V. Gripon, and W. Gross, “Architecture and implementation of an associative memory using sparse clustered networks,” in IEEE International Symposium on Circuits and Systems (ISCAS), Seoul, Korea, 2012, pp. 2901–2904.
  • [21] ——, “Reduced-complexity binary-weight-coded associative memories,” in IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, Canada, 2013, pp. 2523–2527.
  • [22] H. Jarollahi, V. Gripon, N. Onizawa, and W. Gross, “A low-power content-addressable-memory based on clustered-sparse-networks,” in IEEE International Conference on Application-specific Systems, Architectures and Processors, Washington, DC, USA, 2013, pp. 305–308.
  • [23] B. Larras, C. Lahuec, M. Arzel, and F. Seguin, “Analog implementation of encoded neural networks,” in IEEE International Symposium on Circuits and Systems, Beijing, China, 2013, pp. 1–4.
  • [24] Z. Yao, V. Gripon, and M. Rabbat, “A massively parallel associative memory based on sparse neural networks,” IEEE Transactions on Neural Networks and Learning Systems, submitted for publication. [Online]. Available: http://arxiv.org/abs/1303.7032
  • [25] R. M. Karp, Reducibility among combinatorial problems.   Springer, 1972.
  • [26] D. P. Williamson and D. B. Shmoys, The Design of Approximation Algorithms.   Cambridge University Press, 2011.
  • [27] R. Carraghan and P. M. Pardalos, “An exact algorithm for the maximum clique problem,” Operations Research Letters, vol. 9, no. 6, pp. 375–382, 1990.
  • [28] P. R. Östergård, “A fast algorithm for the maximum clique problem,” Discrete Applied Mathematics, vol. 120, no. 1, pp. 197–207, 2002.
  • [29] E. Tomita and T. Seki, “An efficient branch-and-bound algorithm for finding a maximum clique,” in Discrete Mathematics and Theoretical Computer Science.   Springer, 2003, pp. 278–289.
  • [30] J. Konc and D. Janezic, “An improved branch and bound algorithm for the maximum clique problem,” MATCH Communications in Mathematical and Computer Chemistry, vol. 58, pp. 569–590, 2007.
  • [31] B. Pattabiraman, M. M. A. Patwary, A. H. Gebremedhin, W.-k. Liao, and A. Choudhary, “Fast algorithms for the maximum clique problem on massive sparse graphs,” ACM Computing Research Repository, 2012. [Online]. Available: http://arxiv.org/abs/1209.5818