1 Introduction
A network of simple neural units can form a physical system that exhibits computational properties. Notable examples include Hopfield networks [16] and Boltzmann machines [1]. Such systems have global states that evolve over time through the interactions among local neural units. Typically, one is interested in a system whose motion converges towards locally stable limit points, with the limit points representing the computational objective of interest. For example, a Hopfield network's limit points correspond to stored memories, and a Boltzmann machine's to a learned data representation. These computational systems are particularly interesting from a hardware implementation standpoint. A subset of the neurons can be mapped to one processing element in a massively parallel architecture [9, 26]. By allocating private local memory to each processing element, the so-called von Neumann memory bottleneck in modern computers can be eliminated [21].

We are interested in using such systems to solve the sparse coding and dictionary learning problem, which is of fundamental importance in many areas, e.g., see [24]. It is well-known that even the sparse coding problem alone, with a prescribed dictionary, is nontrivial to solve, mainly due to the nonsmooth objective involving an ℓ₁ norm [10, 3]. It is therefore remarkable that a dynamical network known as the LCA network [31] can be carefully constructed so that its limit points are identical to the solution of the sparse coding problem. A dynamical network thus provides an alternative, and potentially more power-efficient, method for sparse coding than standard numerical optimization techniques. Nevertheless, while extending numerical optimization algorithms to also learn the underlying dictionary is somewhat straightforward, there is very little understanding of how to use dynamical networks to learn a dictionary with provable guarantees.
In this work, we devise a new network topology and learning rules that enable dictionary learning. At a high level, our learning strategy is similar to the contrastive learning procedure developed for training Boltzmann machines, which has also gathered much recent interest for deriving implementations of backpropagation under neural-network locality constraints [1, 27, 29, 40, 33, 39]. During training, the network is run in two different configurations – a “normal” one and a “perturbed” one.¹ The network's limit points under these two configurations will be identical if the weights to be trained are already optimal, but different otherwise. The learning process is a scheme that adjusts the weights so as to minimize the difference in the limit points. In a Boltzmann machine, the weight adjustment can be formulated as minimizing a KL-divergence objective function.

For dictionary learning, we adopt a neuron model whose activation function corresponds to the unbounded ReLU function rather than the bounded sigmoid-like function of Hopfield networks or Boltzmann machines, together with a special network topology in which the connection weights are mutually dependent. Interestingly, the learning processes are still similar: we also rely on running our network in two configurations. The difference in states after a long enough evolution, called limiting states for short, is shown to hold the gradient information of a dictionary learning objective function which the network minimizes, as well as the gradient information needed for the network to maintain its weight dependency.

¹In a Boltzmann machine, the two configurations are called the free-running phase and the clamped phase.


1.1 Related Work
Dictionary learning is thought to be related to the formation of receptive fields in the visual cortex [28]. The typical architecture studied is a feedforward-only, two-layer neural network with inhibitory lateral connections among the second-layer neurons [12, 41, 6, 19, 34, 38, 5], as shown in Figure 1(a). The lateral connections allow the coding neurons to compete among themselves and hence induce sparseness in the neural activities, giving dynamics more complex than those of conventional deep neural networks, which do not have intralayer connections.²

²This should not be confused with conventional recurrent neural networks. Although RNNs also have intralayer connections, those connections are still unidirectional over a sequence of inputs.
In [31], it is shown that the coding neuron activations can correspond to a sparse coding solution if the connection weights are set consistently according to a single global dictionary.³ To enable learning in this network (that is, each neuron locally adjusts its connection weights to adapt the dictionary; see Section 2.2 for the definition of weight locality), one must address the following two questions:

³The exact formulation depends on the neuron model; in the spiking-neuron formulation the firing thresholds also enter the weight configuration. See Section 3.1 for more details.
How does an individual neuron locally compute the gradient needed for learning?

How do the neurons collectively maintain global consistency between the feedforward and lateral weights?
The first line of work [12, 41, 6] adopts Hebbian/anti-Hebbian heuristics for learning the feedforward and lateral weights, respectively, and empirically demonstrates that such learning yields Gabor-like receptive fields when trained on natural images. However, unlike the network in [31], this learning heuristic does not correspond to a rigorous learning objective and hence cannot address either of the two questions above. Recently, this learning strategy has been linked to minimizing a similarity-matching objective between input and output correlations [19]. That formulation is somewhat different from the common autoencoder-style dictionary learning formulation discussed in this work.
Another line of work [38, 5] notes the importance of the balance between excitation and inhibition among the coding neurons, and proposes that the learning target of the lateral connections should be to maintain such balance; that is, the inhibitory lateral weights should grow in proportion to the feedforward excitation. This idea provides a way to ensure consistency between the feedforward and lateral weights. Nevertheless, similar to the first line of work, both [38, 5] resort to a pure Hebbian rule when learning the feedforward weights (or equivalently, the dictionary), which does not necessarily follow a descent direction of the dictionary learning objective function.
1.2 Contributions
The major advance in this work is to recognize the inadequacy of the customary feedforward-only architecture, and to introduce the top-down feedback connections shown in Figure 1(b). As will later be shown, this network structure allows the true learning gradients to be provably computable from the resulting network dynamics. Further, the existence of feedback allows us to devise a separate mechanism that acts as an inner loop during learning to continuously ensure weight consistency among all connections. Combining these two, we can successfully address both questions above and solve the dictionary learning problem.
We will focus our discussion on a network that uses spiking neurons as the basic units, which are suited for digital circuit implementations with high computational efficiency. Note that this does not result in a loss of generality: the principles of the LCA network apply to both continuous-valued and spiking neurons [35, 37], and similarly the results established in this paper can easily be applied to construct a network of continuous-valued neurons for dictionary learning.
2 Background
2.1 IntegrateandFire Spiking Neuron Model and Network Dynamics
An integrate-and-fire neuron has two internal state variables that govern its dynamics: the current u(t) and the potential v(t). The key output of a neuron is the time sequence of spikes – the spike train – that it produces. A neuron's spike train is generated by its potential v(t); v(t) is in turn driven by the current u(t), which is itself driven by a constant bias (bias for short) and the spike trains of the other neurons to which it is connected. Specifically, each neuron has a configured firing threshold θ. When v(t) reaches θ, say at time t₀, a spike given by a Dirac delta function is generated and the potential is reset to 0. After t₀ and before v(t) reaches θ again, the potential continues to integrate the current.
In a system of n neurons, let σ_j(t) denote the spike train of neuron j. The current of neuron i is given in terms of its bias b_i and the spike trains of the other neurons:
(1) 
where α is a causal filter kernel (vanishing for negative arguments) and ∗ is the convolution operator. Neuron j inhibits (excites) neuron i if the corresponding weight is negative (positive); if the weight is zero, neurons i and j are not connected. For simplicity, we consider only the unit-step filter throughout the paper. Equation 1 yields the dynamics
(2) 
where the vectors u(t) and σ(t) denote the currents and spike trains. The network dynamics can be studied via the filtered quantities of average current and spike rate:
(3) 
In terms of the average current and spike rate, Equation 2 becomes
(4) 
The trajectory of these filtered quantities has interesting properties. In particular, Theorem 1 below (cf. [37]) shows that any limit point satisfies a fixed-point equation together with nonnegativity and a complementarity condition expressed with an elementwise product. These properties are crucial to Section 3.
Theorem 1.
Let , , then
(5) 
where and .
As with all other theorems, Theorem 1 is given in a conceptual form; the corresponding rigorous quantitative versions are detailed in the Appendix.
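Although the model above is stated in continuous time, its limiting spike rates can be illustrated with a simple discrete-time simulation. The sketch below is a forward-Euler discretization under our own modeling assumptions (an exponential synaptic filter with time constant `tau`, a potential clipped at zero, and illustrative constants), not the paper's exact formulation:

```python
import numpy as np

def simulate_if_network(W, b, theta, T=200.0, dt=0.01, tau=0.1):
    """Discrete-time sketch of an integrate-and-fire network.

    W[i, j] is the weight from neuron j to neuron i (negative = inhibitory),
    b the constant biases, theta the firing thresholds.  Returns the
    empirical spike rates a(T) = spike_count / T.
    """
    b = np.asarray(b, dtype=float)
    n = len(b)
    u = b.copy()              # currents
    v = np.zeros(n)           # potentials
    s = np.zeros(n)           # exponentially filtered incoming spike trains
    counts = np.zeros(n)
    for _ in range(int(T / dt)):
        v = np.maximum(v + dt * u, 0.0)  # potential integrates the current
        fired = v >= theta
        v[fired] = 0.0                   # reset to zero on a spike
        counts[fired] += 1.0
        s -= dt * s / tau                # leaky synaptic filter
        s[fired] += 1.0 / tau            # unit-area impulse per spike
        u = b + W @ s                    # current = bias + weighted spike input
    return counts / T
```

An isolated neuron with bias 1 and threshold 1 spikes at a rate close to 1, and mutual inhibition lowers the rates, consistent with the limit-point picture of Theorem 1.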
2.2 Parallel Model of Dynamical Neural Networks
We view the dynamical network as a computational model in which each neuron evolves in parallel and asynchronously. One-sided communication, in the form of a one-bit signal from one neuron to another, occurs only if the two are connected and only when the former spikes. The network therefore can be mapped to a massively parallel architecture, such as [9], where the connection weights are stored distributively in each processing element's (PE's) local memory. In the most general case, we assume the architecture has as many PEs as neurons; each PE hosts one neuron and stores the weights directed toward this neuron, that is, each PE stores one row of the weight matrix in Equation 2. With proper interconnects among the PEs to deliver spike messages, the dynamical network can be realized to compute sparse coding solutions.
This architectural model imposes a critical weight locality constraint on learning algorithms for dynamical networks: The connection weights must be adjusted with rules that rely only on locally available information such as connection weights, a neuron’s internal states, and the rate of spikes it receives. The goal of this paper is to enable dictionary learning under this locality constraint.
3 Dictionary Learning
In dictionary learning, we are given a set of training images. The goal is to find a dictionary consisting of a prescribed number of atoms such that each of the images can be sparsely coded in it. We focus here on a nonnegative dictionary and formulate our minimization problem as
(6) 
where the sparsity penalty is weighted by a positive diagonal scaling matrix.
Computational methods such as stochastic online training [2] are known to be effective for dictionary learning. With this method, one iterates over the following two steps, starting from a random dictionary.

Pick a random image and obtain its sparse code under the current dictionary; that is, solve Equation 6 with the dictionary fixed.

Use gradient descent to update the dictionary with a learning rate. The gradient with respect to the dictionary takes a simple form, and the update is
(7)
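As a reference point, the two steps above can be sketched with a conventional numerical solver: a projected ISTA inner loop for the nonnegative sparse code (Step 1) followed by the gradient update of Equation 7 (Step 2). The solver choice, the ℓ₁ weight `lam`, and all other hyperparameters are our illustrative assumptions, not the paper's:

```python
import numpy as np

def nonneg_sparse_code(D, x, lam=0.1, iters=200):
    """Step 1: projected ISTA for min_{a>=0} 0.5||x - D a||^2 + lam * 1'a."""
    L = np.linalg.norm(D, 2) ** 2 + 1e-12   # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(iters):
        grad = D.T @ (D @ a - x)
        a = np.maximum(a - (grad + lam) / L, 0.0)  # shrink, project onto a >= 0
    return a

def dictionary_learning(X, m, lam=0.1, eta=0.05, epochs=10, seed=0):
    """Stochastic online training: alternate Steps 1 and 2 over random images."""
    rng = np.random.default_rng(seed)
    D = np.abs(rng.standard_normal((X.shape[1], m)))  # random nonnegative start
    D /= np.linalg.norm(D, axis=0)
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            a = nonneg_sparse_code(D, X[i], lam)               # Step 1
            D = D + eta * np.outer(X[i] - D @ a, a)            # Step 2 (Eq. 7)
            D = np.maximum(D, 0.0)                             # keep D nonnegative
            D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)    # cap atom norms
    return D
```

The norm cap in the last line stands in for the atom-norm control discussed later in Section 3.3.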
Implementing these steps with a dynamical network is challenging. First, previous works have only shown that Step 1 can be solved when the configuration uses the dictionary for the feedforward connection weights and its Gram matrix for the lateral connection weights ([35], cf. Figure 1(a)). For dictionary learning, both sets of weights evolve without maintaining this exact relationship, casting doubt on whether Step 1 can be solved at all. Second, the network in Figure 1(a) has no feedback connections, rendering the reconstruction-error term needed in Equation 7 uncomputable using information local to each neuron. Note that, in general, gradients to minimize certain objective functions in a neural network can be mathematically derived, but oftentimes they cannot be computed locally, e.g., standard backpropagation and general gradient calculations for spiking networks [20]. We now show that our design depicted in Figure 1(b) can indeed implement Steps 1 and 2 and solve dictionary learning.
3.1 Sparse Coding – Getting the Sparse Code
Nonnegative sparse coding (Equation 6 with the dictionary fixed) is a constrained optimization problem. The standard approach (cf. [4]) is to augment the problem with nonnegative slack variables, with which the optimal solutions are characterized by the KKT conditions. Consider now Figure 1(b), which has explicit feedback weights whose strength is controlled by a parameter γ. Equation 5, reflecting the structure of the coding and input neurons, takes the form:
(8) 
Here the two groups of variables denote the average currents and spike rates for the coding and input neurons, respectively. When the feedback strength γ is zero, the network is equivalent to Figure 1(a). Equation 8 then simplifies, and at a limit point the limiting rates and currents can be identified with the optimal primal and slack variables that satisfy the KKT conditions. In particular, the coding neurons' limiting spike rate is the optimal sparse code.
We extend this previously established result [37] in several respects: (1) the feedback strength can be set to any value in a small range, not just zero, and the limiting spike rate remains the optimal sparse code; (2) the connection weights need not be exactly consistent – a small inconsistency suffices; and (3) as long as the running time is large enough, the measured spike rate solves an approximate sparse coding problem. These are summarized as follows (the rigorous form is presented in the Appendix).
Theorem 2.
Let the feedback strength γ and the weight inconsistency be small. Then, for t large enough, the measured spike rate is close to an exact solution of Equation 6 (dictionary fixed) with the dictionary replaced by a small perturbation of itself.
The significant implication is that, despite slight discrepancies among the connection weights, the average spike rate measured at a large enough time is a practical solution to Step 1 of the stochastic learning procedure.
3.2 Dictionary Adjustment – Updating the Weights
To obtain the learning gradients, we run the network long enough to sparse code twice, once in the normal configuration (feedback disabled) and once in the perturbed configuration (feedback enabled), obtaining the average currents and spike rates of both neuron layers under the two configurations. We use a tilde to denote the obtained states and loosely call them limiting states.
Theorem 3.
The limiting states satisfy
(9)  
(10) 
We now show that Theorem 3 lays the foundation for computing all the necessary gradients. Equation 9 relates the input-layer spike rate difference between the two runs to the reconstruction residual of the sparse code.
In other words, the spike rate differences at the input layer reflect the reconstruction error of the sparse code we just computed. Following Equation 7, this implies that the update to each weight can be approximated from the spike rates of the two neurons that it connects, and these two spike rates are certainly locally available to the destination neuron that stores the weight. Specifically, each coding neuron stores a row of the feedforward weight matrix, and each input neuron stores a row of the feedback weight matrix. Each of these neurons updates its row via
(11)  
Note that the symmetry between the feedforward and feedback weights is maintained.
Ideally, at this point the lateral weights stored distributively in the coding neurons would be updated to reflect the newly updated dictionary. Unfortunately, each coding neuron possesses only one row of the lateral weight matrix and has no access to the rows stored at other neurons. To keep the lateral weights close to consistent throughout the learning process, we do the following. First, we aim to bring the lateral weights closer to consistency with the pre-update dictionary (not the freshly updated one) by reducing a quadratic mismatch cost function, whose gradient is computable as follows. Equation 10 shows that
Using this approximation, each coding neuron has the information to compute its row of the gradient, and the lateral weights are modified by a gradient step with its own learning rate. This modification can be thought of as a catch-up correction, because the feedforward and feedback weights it chases were already updated in a previous iteration. Since the magnitude of that update is of the order of the dictionary learning rate, the correction's learning rate should be larger, lest the correction be too small to be effective. In practice, a moderately larger correction rate works very well.
In addition to this catch-up correction, we also correct the lateral weights for the change induced by the current updates to the feedforward and feedback weights. Consequently, after Equation 11, we update the lateral weights by
(12) 
Note that this update involves the connection weights as well as the firing thresholds (recall that the thresholds enter the lateral weight configuration). Combining the above, we summarize the full dictionary learning algorithm below.
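The row-wise updates of Equation 11 can be sketched as plain array operations once the limiting spike rates of the two runs have been measured (the network simulation itself is abstracted away here). Every name below is an illustrative assumption, and the lateral-weight catch-up correction of Equation 12 is omitted since it depends on quantities not reproduced here; note that each row update uses only rates locally available to the neuron storing that row:

```python
import numpy as np

def contrastive_weight_update(F, B, a0, x0, x_gamma, gamma, eta=0.1):
    """One sketch step of the contrastive update of Equation 11.

    F : (m, p) feedforward weights; coding neuron i stores row F[i].
    B : (p, m) feedback weights;    input neuron j stores row B[j].
    a0          : (m,) coding-neuron limiting rates in the normal run.
    x0, x_gamma : (p,) input-neuron limiting rates in the normal and
                  perturbed runs; their difference, scaled by the feedback
                  strength gamma, stands in for the reconstruction error.
    """
    resid = (x0 - x_gamma) / gamma         # ~ reconstruction error
    F_new = F + eta * np.outer(a0, resid)  # coding neuron i: rank-1 row update
    B_new = B + eta * np.outer(resid, a0)  # input neuron j: same, transposed
    return F_new, B_new
```

Because both matrices receive transposed versions of the same rank-one increment, initial symmetry between the feedforward and feedback weights is preserved, matching the remark after Equation 11.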
3.3 Discussions
Dictionary norm regularization. In dictionary learning, one typically needs to control the norms of the atoms to prevent them from growing arbitrarily large. The most common approach is to constrain the atoms to have exactly (or at most) unit norm, achieved by renormalizing each atom after a dictionary update. This method, however, cannot be directly adopted in our distributed setting: each input neuron holds only a row of the feedback weight matrix, not a column – an atom – and thus cannot perform the renormalization.
We choose instead to regularize the Frobenius norm of the dictionary, which translates to a simple decay term in the learning rules. This regularization alone may result in learning degenerate zero-norm atoms, because sparse coding tends to favor larger-norm atoms for active updates, leaving smaller-norm ones subject solely to continual weight decay. By tying the scaling factors to the atom norms, sparse coding instead favors smaller-norm atoms to be active, which effectively mitigates the problem of degeneracy.
Boundedness of network activities. Our proposed network is a nonlinear feedback system, and one may wonder whether the network activities remain bounded. While we cannot yet rigorously guarantee boundedness and stability under a priori conditions, the currents and spike rates remained bounded throughout learning in all our experiments. One observation is that the feedback excitation and the lateral inhibition approximately cancel when the weights are consistent; the feedback excitation is then nullified, keeping the network from growing out of bound.
Network execution in practice. Theoretically, an accurate spike rate can only be measured at a very large time, as the precision improves only at a rate inversely proportional to the elapsed time. In practice, we observed that a small time window suffices for dictionary learning purposes. Stochastic gradient descent is known to be very robust against noise and thus can tolerate the low-precision spike rates as well as the approximate sparse codes caused by imperfect weight consistency. For faster network convergence, the second network run starts right after the first with all neuron states preserved.

Weight symmetry. The sparse code and the dictionary gradient are computed using the feedforward and feedback weights, respectively. Therefore, symmetry between those weights is the most effective arrangement for credit assignment. We have assumed that such symmetry holds at initialization and that the learning rules subsequently maintain it. One interesting observation is that even if the weights start out asymmetric, our learning rules will still symmetrize them: letting Δ_k denote the weight difference at the k-th iteration, it is straightforward to show that Δ_{k+1} is a fixed contraction of Δ_k, and hence the difference vanishes as k grows. In training deep neural networks, symmetric feedforward and feedback weights are important for similar reasons. The lack of a local mechanism by which such symmetry can emerge makes backpropagation biologically implausible and hardware unfriendly; see for example [23] for more discussion. Our learning model may serve as a building block in the pursuit of biologically plausible deep networks with backpropagation-style learning.
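The symmetrization claim can be checked numerically. Under an assumed decay-plus-shared-gradient update (our illustrative model of the learning rules, with made-up constants), the weight difference contracts by a constant factor per iteration, so asymmetric initial weights symmetrize:

```python
import numpy as np

rng = np.random.default_rng(0)
F = rng.standard_normal((4, 6))    # feedforward weights
Bt = rng.standard_normal((4, 6))   # transpose of feedback weights (asymmetric init)
eta, beta = 0.1, 0.5               # learning rate and weight-decay strength (made up)

norms = []
for _ in range(50):
    G = rng.standard_normal((4, 6))      # shared gradient estimate for this step
    F = (1 - eta * beta) * F + eta * G   # decay + shared additive update
    Bt = (1 - eta * beta) * Bt + eta * G
    norms.append(np.linalg.norm(F - Bt)) # difference contracts by (1 - eta*beta)
```

The shared increment cancels in the difference, so the mismatch shrinks geometrically regardless of the gradients, which is the mechanism behind the symmetrization observed in Figure 3.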
(a) Random dictionary (training sample No.1)  (b) Learned dictionary (training sample No.99900) 
4 Numerical Experiments
We examined the proposed learning algorithm using three datasets. Dataset A: 100K randomly sampled patches from the grayscale Lena image, to learn 256 atoms. Dataset B: 50K MNIST images [22], to learn 512 atoms. Dataset C: 200K randomly sampled patches from whitened natural scenes [28], to learn 1024 atoms. These are standard datasets in image processing (A), machine learning (B), and computational neuroscience (C).⁴

⁴For Datasets A and C, the patches are further mean-subtracted, normalized, and split into positive and negative channels to create nonnegative inputs [18].

For each input, the network is run first in the normal configuration and then in the perturbed one, over consecutive time windows with a discrete time step. Note that although this time window of 20 is relatively small and yields a spike rate precision of only 0.05, we observed that it is sufficient for gradient calculation and dictionary learning purposes. We explored two different connection weight initialization schemes. First, we initialize the weights to be fully consistent with respect to a random dictionary. Second, we initialize the weights to be asymmetric: the feedforward and feedback weights are set to independent column-normalized random matrices, and the entries of the lateral weight matrix to random values in [0, 1.5], with the diagonal set to 1.5.
4.1 Network Dynamics
We first show the spike patterns from a network with fully consistent initial weights in Figure 2. The spike patterns quickly settle into a steady state, indicating that a small time window may suffice for spike rate calculations. Further, we can observe that the feedback perturbs only the input neuron spike rates while keeping the coding neuron spike rates approximately the same, validating our results in Sections 3.1 and 3.2.
Another aim of the algorithm is to approximately maintain weight consistency during learning. Figure 3 shows that this is indeed the case. Note that our learning rule acts as a catch-up correction, so exact consistency cannot be achieved. An interesting observation is that, as learning proceeds, weight consistency becomes easier to maintain because the dictionary gradually converges.
Although we have limited theoretical understanding of networks with random initial weights, Figure 3 shows that our learning procedure can automatically discover consistent and symmetric weights with respect to a single global dictionary. This is especially interesting given that the neurons learn with local information only: no neuron has a global picture of the network weights.
4.2 Convergence of Dictionary Learning
The learning problem is nonconvex, and hence it is important that our proposed algorithm find a satisfactory local minimum. We compare the convergence of spiking networks against the standard stochastic gradient descent (SGD) method with the unit atom-norm constraint. For simplicity, both algorithms use a batch size of 1 for gradient calculations. The quality of the learned dictionary is measured on a separate test set of 10K samples using a surrogate dictionary learning objective [25]. For a fair comparison, the weight decay parameters of the spiking networks are chosen so that the average atom norms converge to approximately one.
Figure 4 shows that our algorithm indeed converges and consistently obtains solutions with objective values similar to, if not better than, SGD across the datasets. Surprisingly, our algorithm can even reach a better solution with fewer training samples, while SGD can get stuck at a poor local minimum, especially when the dictionary is large. This can be attributed to the norm-reweighting heuristic that encourages more dictionary atoms to be actively updated during learning. Finally, we observe that a network initialized with random nonsymmetric weights still manages to reach objective values comparable to those of networks initialized with symmetric weights, albeit with slower convergence due to less accurate gradients. From Figure 3, we see that the network weights for Dataset A are not yet symmetric early in training; on the other hand, Figure 4 shows that the network already improves the dictionary during this period, demonstrating that perfectly symmetric weights are not necessary for learning to proceed.
5 Conclusion
We have presented a dynamical neural network formulation that can learn dictionaries for sparse representations. Our work represents a significant step forward in that it not only provides a link between the well-established dictionary learning problem and dynamical neural networks, but also demonstrates that the contrastive learning approach is a fruitful direction. We believe there is still much to be explored in dynamical neural networks. In particular, learning in such networks respects data locality and therefore has the unique potential, especially with spiking neurons, to enable low-power, high-throughput training on massively parallel architectures.
Appendix A Detailed Description of Proposed Network Structure
We propose a novel network topology with feedback, shown in Figure 1(b). The figure shows two “layers” of neurons. The lower layer consists of neurons we call input neurons; the upper layer consists of neurons we call coding neurons.
Each coding neuron receives excitatory signals from all the input neurons; that is, each coding neuron stores one row of the feedforward weight matrix. In addition, each coding neuron receives inhibitory signals from all the other coding neurons; these inhibitory weights form a matrix with zero diagonal. The firing thresholds of the coding neurons together with the inhibitory weights form a matrix that appears often and will be given its own notation. Each coding neuron also receives a constant negative bias whose strength involves an important parameter that will be varied during the learning process, detailed momentarily.
Each input neuron, with firing threshold fixed to 1, receives a bias that typically corresponds to one pixel value of the input image in question during the learning process. In addition, it receives excitatory spikes from each of the coding neurons; that is, each input neuron stores one row of the feedback weight matrix. These excitatory signals from the coding neurons constitute the crucial feedback mechanism devised here that enables dictionary learning.
Appendix B Proof of Theorems
b.1 Theorem 1: SNN dynamics, trajectory, and limit points
In the simplest case, when none of the neurons are interconnected, each neuron's current is simply its constant bias. Hence a neuron with a positive bias produces a spike train with a constant inter-spike interval, while a neuron with a nonpositive bias has no spiking activity. When the neurons are interconnected, however, the dynamics become nontrivial. It turns out that the dynamics can be described mathematically in such a way that useful properties relating the current and the spike train can be derived. Consequently, a network of spiking neurons can be configured to help solve certain practical problems.
Given a system of n neurons, we use vector notation u(t) and σ(t) to denote the currents and spike trains. The vectors b and θ collect the input biases and firing thresholds. The convolution α ∗ σ is the vector whose i-th component is (α ∗ σ_i)(t). For simplicity, we consider only the unit-step filter throughout the paper: α(t) = 1 for t ≥ 0 and 0 otherwise. Equation 1 in vector form is
(13) 
where the weight matrix, with zero diagonal, encodes the inhibitory/excitatory connections among the neurons.
Because each spike resets the potential after it accumulates to the firing threshold, we have
(14) 
Filtering Equation 14 yields
(15) 
where the filtered quantities are as defined in Equation 3. Theorem B.1 was established previously in [37] in a slightly different form. We attach a proof consistent with our notation below for completeness. It is established under the following assumptions:

The currents of all neurons remain bounded from above for all time. This implies no neuron can spike arbitrarily fast, and the fact that neurons cannot spike arbitrarily rapidly implies the currents are bounded from below as well.

There is a positive number bounding the gap between consecutive spike times whenever they exist. This assumption says that unless a neuron stops spiking altogether after a certain time, the duration between consecutive spikes cannot become arbitrarily long.
Theorem B.1.
As , , and all converge to .
Proof.
Let
( stands for “active”), and
( stands for “inactive”). First consider . Let be the time of the final spike. For any ,
Note that always. If , then
If ,
Since , obviously. Thus
Consider the case of . For any , let be the largest spike time that is no bigger than . Because , as .
Furthermore, note that because of the assumption always, where , . In other words, there is a time large enough such that for all and . Moreover, and . Thus
When this term is eventually smaller in magnitude than , we have
or equivalently,
∎
Theorem B.2.
Given any , there exists such that for all ,
The following theorem characterizes the limit points of the trajectory. Recall that a point is a limit point if, for any given tolerance, there exists a time large enough at which the trajectory comes within that tolerance of the point.
Theorem B.3.
Given any limit point , we must have , and , where is the elementwise product.
b.2 Nonnegative sparse coding by spiking neural networks
Given a nonnegative dictionary , a positive scaling vector and an image , the nonnegative sparse coding problem can be formulated as
(16) 
where the objective and constraint are as in Equation 6. Using the well-known KKT conditions in optimization theory (see for example [4]), a point is an optimal solution iff there exist multipliers such that all of the following hold:
(17) 
where the generalized gradient of the ℓ₁ norm appears: it equals 1 at a positive coordinate and the interval [−1, 1] at a zero coordinate. A straightforward derivation then shows that a point is an optimal solution iff
(18)  
We now configure a neuron system, depicted in Figure 5, so as to solve Equation 16. The firing thresholds and the biases are set from the scaling vector and the input image, and the inhibition matrix is defined from the dictionary so that each coding neuron inhibits the others in proportion to the correlations between their atoms. In this configuration, it is easy to establish that the currents remain bounded from above, as all connections are inhibitory. From Theorem B.3, any limit point of the trajectory satisfies the optimality conditions of Equation 18, and thus the limiting spike rate solves Equation 16. In particular, if the solution to Equation 16 is unique, the trajectory can have only one limit point, which means the trajectory in fact converges to the sparse coding solution. This result extends easily to the network in Figure 1(a) by expanding the bias into another layer of input neurons.
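The KKT characterization used above can be verified numerically for the standard nonnegative sparse coding problem min_{a≥0} ½‖x − Da‖² + λ1ᵀa: a point a is optimal iff λ − Dᵀ(x − Da) ≥ 0 elementwise, with equality on the support of a. The sketch below (an illustrative projected proximal-gradient solver, not the paper's spiking network) checks both conditions:

```python
import numpy as np

def solve_nonneg_lasso(D, x, lam, iters=20000):
    """Projected proximal gradient for min_{a>=0} 0.5||x - D a||^2 + lam * 1'a."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the smooth part
    a = np.zeros(D.shape[1])
    for _ in range(iters):
        grad = D.T @ (D @ a - x)
        a = np.maximum(a - (grad + lam) / L, 0.0)  # shrink, project onto a >= 0
    return a

rng = np.random.default_rng(0)
D = np.abs(rng.standard_normal((10, 5)))   # nonnegative dictionary
x = np.abs(rng.standard_normal(10))        # nonnegative input
lam = 0.1
a = solve_nonneg_lasso(D, x, lam)

# KKT check: dual feasibility and complementary slackness
slack = lam - D.T @ (x - D @ a)            # must be >= 0 at optimality
```

At convergence, `slack` is nonnegative up to solver tolerance and vanishes on the coordinates where the code is active, which is exactly the condition that the network's limit points satisfy.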
b.3 Theorem 2: sparse coding with feedback perturbation
Equation 5, reflecting the structure of the coding and input neurons, takes the form:
(19) 
Here the variables denote the average currents and spike rates for the coding and input neurons, respectively. Note that the residual quantities of Theorem B.1 all converge to 0 as t → ∞.
Theorem B.4.
Consider the configuration and . Suppose the soma currents and thus spike rates are bounded. Let , , be such that . Then, for any there is such that for all , and solves Equation 16 with replaced by where .
Proof.
Consider and define the vectors and for by each of their components:
where is the diagonal of . Denote the perturbations , , , and . This construction of and ensures , , and . Recall that (Theorem B.1); thus at large enough.
Next, observe that the bound holds always, for any setting of the feedback strength. Thus Theorem B.1 implies
(20) 
as . From Equation 19, this implies that
for some where . Thus
where , and . By assumption on , . Moreover, can be made arbitrarily small by taking and large enough. Thus there exist such that for all , and , implying in particular . Finally, note that
which shows (recall Equation 18) that solves Equation 16 with replaced by and the proof is complete. ∎
At present, we cannot establish a priori that the currents stay bounded when the feedback is enabled. Nevertheless, the theorem is applicable in practice as long as the observed currents stay bounded and the feedback strength is small enough. See Section 3.3 for further comments.
b.4 Theorem 3: gradient calculations from contrastive learning
Theorem B.5.
Given any , there is a such that for all ,
(21)  
(22) 