Why do similarity matching objectives lead to Hebbian/anti-Hebbian networks?

03/23/2017 · Cengiz Pehlevan, et al.

Modeling self-organization of neural networks for unsupervised learning using Hebbian and anti-Hebbian plasticity has a long history in neuroscience. Yet, derivations of single-layer networks with such local learning rules from principled optimization objectives became possible only recently, with the introduction of similarity matching objectives. What explains the success of similarity matching objectives in deriving neural networks with local learning rules? Here, using dimensionality reduction as an example, we introduce several variable substitutions that illuminate the success of similarity matching. We show that the full network objective may be optimized separately for each synapse using local learning rules both in the offline and online settings. We formalize the long-standing intuition of the rivalry between Hebbian and anti-Hebbian rules by formulating a min-max optimization problem. We introduce a novel dimensionality reduction objective using fractional matrix exponents. To illustrate the generality of our approach, we apply it to a novel formulation of dimensionality reduction combined with whitening. We confirm numerically that the networks with learning rules derived from principled objectives perform better than those with heuristic learning rules.


1 Introduction

The human brain generates complex behaviors via the dynamics of electrical activity in a network of neurons each making synaptic connections. As there is no known centralized authority determining which specific connections a neuron makes or specifying the weights of individual synapses, synaptic connections must be established based on local rules. Therefore, a major challenge in neuroscience is to determine local synaptic learning rules that would ensure that the network acts coherently, i.e. guarantee robust network self-organization.

Much work has been devoted to the self-organization of neural networks for solving unsupervised computational tasks using Hebbian and anti-Hebbian learning rules (Földiak, 1990, 1989; Rubner and Tavan, 1989; Rubner and Schulten, 1990; Carlson, 1990; Plumbley, 1993b; Leen, 1991; Plumbley, 1993a; Linsker, 1997). The unsupervised setting is natural in biology because large-scale labeled datasets are typically unavailable. Hebbian and anti-Hebbian learning rules are biologically plausible because they are local: the weight of an (anti-)Hebbian synapse is proportional to the (minus) correlation in activity between the two neurons the synapse connects.

In networks for dimensionality reduction, for example, feedforward connections are updated by Hebbian rules and lateral connections by anti-Hebbian rules, Figure 1. Hebbian rules attempt to align each neuronal feature vector, whose components are the weights of synapses impinging onto the neuron, with the input-space direction of greatest variance. Anti-Hebbian rules mediate competition among neurons, which prevents their feature vectors from aligning in the same direction. The rivalry between the two kinds of rules results in an equilibrium where synaptic weight vectors span the principal subspace of the input covariance matrix, i.e. the subspace spanned by the eigenvectors corresponding to the largest eigenvalues.

However, in most existing single-layer networks, Figure 1, Hebbian and anti-Hebbian learning rules were postulated rather than derived from a principled objective. Such a derivation should yield better-performing rules and a deeper understanding than has been achieved with heuristic rules. But, until recently, all derivations of single-layer networks from principled objectives led to biologically implausible non-local learning rules, in which the weight of a synapse depends on the activities of neurons other than the two the synapse connects.

Recently, single-layer networks with local learning rules have been derived from similarity matching objective functions (Pehlevan et al., 2015; Pehlevan and Chklovskii, 2014; Hu et al., 2014). But why do similarity matching objectives lead to neural networks with local, Hebbian and anti-Hebbian learning rules? A clear answer to this question has been lacking.

Here, we answer this question by performing several illuminating variable transformations. Specifically, we reduce the full network optimization problem to a set of trivial optimization problems for each synapse which can be solved locally. Eliminating neural activity variables leads to a min-max objective in terms of feedforward and lateral synaptic weight matrices. This finally formalizes the long-held intuition about the adversarial relationship of Hebbian and anti-Hebbian learning rules.

In this paper, we make the following contributions. In Section 2, we present a more transparent derivation of the previously proposed online similarity matching algorithm for Principal Subspace Projection (PSP). In Section 3, we propose a novel objective for PSP combined with spherizing, or whitening, the data, which we name Principal Subspace Whitening (PSW), and derive from it a biologically plausible online algorithm. Also, in Sections 2 and 3, we demonstrate that stability in the offline setting guarantees projection onto the principal subspace and give principled learning rate recommendations. In Section 4, by eliminating activity variables from the objectives, we derive min-max formulations of PSP and PSW which lend themselves to game-theoretical interpretations. In Section 5, by expressing the optimization objectives in terms of feedforward synaptic weights only, we arrive at novel formulations of dimensionality reduction in terms of fractional powers of matrices. In Section 6, we demonstrate numerically that the performance of our online algorithms is superior to that of the heuristic ones.

2 From similarity matching to Hebbian/anti-Hebbian networks for PSP

2.1 Derivation of a mixed PSP from similarity matching

The PSP problem is formulated as follows. Given $T$ centered input data samples, $x_t \in \mathbb{R}^n$, find their projections, $y_t \in \mathbb{R}^k$, onto the principal subspace ($k \leq n$), i.e. the subspace spanned by the eigenvectors corresponding to the top $k$ eigenvalues of the input covariance matrix:

$$C = \frac{1}{T}\sum_{t=1}^{T} x_t x_t^\top = \frac{1}{T} X X^\top, \quad (1)$$

where we resort to matrix notation by concatenating input column vectors into $X = [x_1, \ldots, x_T] \in \mathbb{R}^{n \times T}$. Similarly, outputs are $Y = [y_1, \ldots, y_T] \in \mathbb{R}^{k \times T}$.

Our goal is to derive a biologically plausible single-layer neural network implementing PSP by optimizing a principled objective. Biological plausibility requires that the learning rules are local, i.e. synaptic weight update depends on the activity of only the two neurons the synapse connects. The only PSP objective known to yield a single-layer neural network with local learning rules is based on similarity matching (Pehlevan et al., 2015). This objective, borrowed from Multi-Dimensional Scaling (MDS), minimizes the mismatch between the similarity of inputs and outputs (Mardia et al., 1980; Williams, 2001; Cox and Cox, 2000):

$$\min_{Y} \frac{1}{T^2}\left\| X^\top X - Y^\top Y \right\|_F^2. \quad (2)$$

Here, similarity is quantified by the inner products between all pairs of inputs (outputs) comprising the Grammians $X^\top X$ ($Y^\top Y$).
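
To make the objective concrete, here is a minimal numerical sketch (assuming the similarity matching form $\min_Y \frac{1}{T^2}\|X^\top X - Y^\top Y\|_F^2$ reconstructed above; the toy data and variable names are illustrative, not from the paper). It checks that PCA scores achieve a lower cost than a random projection, and that the cost is invariant to an orthogonal rotation of the output, illustrating the degeneracy discussed below:

```python
import numpy as np

def similarity_mismatch(X, Y):
    """Similarity matching cost: (1/T^2) * ||X^T X - Y^T Y||_F^2."""
    T = X.shape[1]
    return np.linalg.norm(X.T @ X - Y.T @ Y, "fro") ** 2 / T**2

rng = np.random.default_rng(0)
n, k, T = 5, 2, 400
X = np.diag([3.0, 2.0, 1.0, 0.5, 0.2]) @ rng.standard_normal((n, T))

# PCA scores: project onto the top-k eigenvectors of the sample covariance
C = X @ X.T / T
w, V = np.linalg.eigh(C)            # eigenvalues in ascending order
Y_pca = V[:, -k:].T @ X             # k x T matrix of principal components

Y_rand = rng.standard_normal((k, n)) @ X   # a random k-dim projection

assert similarity_mismatch(X, Y_pca) < similarity_mismatch(X, Y_rand)

# rotating the output leaves Y^T Y, and hence the cost, unchanged
c, s = np.cos(0.7), np.sin(0.7)
Q = np.array([[c, -s], [s, c]])
assert np.isclose(similarity_mismatch(X, Q @ Y_pca),
                  similarity_mismatch(X, Y_pca))
```

The rotation check makes the degeneracy of the objective tangible: any orthogonal rotation of the principal components is an equally good solution.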

One can understand intuitively that the objective (2) is optimized by the projection onto the principal subspace by considering the following argument (for a rigorous proof see Pehlevan and Chklovskii (2015); Mardia et al. (1980); Cox and Cox (2000)). First, substitute a singular value decomposition (SVD) for the matrices $X$ and $Y$ and note that the mismatch is minimized by matching the right singular vectors of $Y$ to those of $X$. Then, rotating the Grammians to the diagonal basis reduces the minimization problem to minimizing the mismatch between the corresponding squared singular values. Therefore, the optimal $Y$ is given by the top $k$ right singular vectors of $X$ scaled by the corresponding singular values. As the objective (2) is invariant to left-multiplication of $Y$ by an orthogonal matrix, it has infinitely many degenerate solutions. One such solution corresponds to Principal Component Analysis (PCA).

Unlike non-neural-network formulations of PSP or PCA, similarity matching outputs principal components (scores) rather than principal eigenvectors of the input covariance (loadings). This difference in formulation is motivated by our interest in PSP or PCA neural networks (Diamantaras and Kung, 1996) that output principal components, $y_t$, rather than principal eigenvectors. Principal eigenvectors are not transmitted downstream of the network but can be recovered computationally from the synaptic weight matrices. Although synaptic weights do not enter the objective (2), in previous work (Pehlevan et al., 2015) they arose naturally in the derivation of the online algorithm (see below) and stored correlations between input and output neural activities.

Next, we derive the min-max PSP objective from Eq. (2), starting by expanding the square of the Frobenius norm and dropping the $Y$-independent term:

$$\min_{Y}\; -\frac{2}{T^2}\,\mathrm{Tr}\!\left(X^\top X Y^\top Y\right) + \frac{1}{T^2}\,\mathrm{Tr}\!\left(Y^\top Y Y^\top Y\right). \quad (3)$$

We can rewrite Eq. (3) by introducing two new dynamical variable matrices, $W$ and $M$, in place of the covariance matrices $\frac{1}{T} Y X^\top$ and $\frac{1}{T} Y Y^\top$:

$$-\frac{2}{T^2}\,\mathrm{Tr}\!\left(X^\top X Y^\top Y\right) = \min_{W}\; -\frac{4}{T}\,\mathrm{Tr}\!\left(X^\top W^\top Y\right) + 2\,\mathrm{Tr}\!\left(W^\top W\right), \qquad \frac{1}{T^2}\,\mathrm{Tr}\!\left(Y^\top Y Y^\top Y\right) = \max_{M}\; \frac{2}{T}\,\mathrm{Tr}\!\left(Y^\top M Y\right) - \mathrm{Tr}\!\left(M^\top M\right), \quad (4)$$

yielding

$$\min_{Y}\min_{W}\max_{M}\, L(W, M, Y), \qquad L(W, M, Y) = \mathrm{Tr}\!\left(-\frac{4}{T} X^\top W^\top Y + 2\, W^\top W + \frac{2}{T}\, Y^\top M Y - M^\top M\right). \quad (5)$$

To see that Eq. (5) is equivalent to Eq. (3), find the optimal $W$ and $M$ by setting the corresponding derivatives of the objective (5) to zero. Then, substitute $W = \frac{1}{T} Y X^\top$ and $M = \frac{1}{T} Y Y^\top$ into Eq. (5) to obtain (3).
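
These substitution identities can be verified numerically. The sketch below (assuming the reconstructed forms above, with optimal $W = \frac{1}{T}YX^\top$ and $M = \frac{1}{T}YY^\top$; the random data are illustrative) plugs in the optimal variables and recovers the two terms of Eq. (3):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, T = 4, 2, 50
X = rng.standard_normal((n, T))
Y = rng.standard_normal((k, T))

# optimal substitution variables, obtained by setting derivatives to zero
W = Y @ X.T / T
M = Y @ Y.T / T

# min_W of  -(4/T) Tr(X^T W^T Y) + 2 Tr(W^T W)  is attained at W = YX^T/T
term_W = -(4 / T) * np.trace(X.T @ W.T @ Y) + 2 * np.trace(W.T @ W)
assert np.isclose(term_W, -(2 / T**2) * np.trace(X.T @ X @ Y.T @ Y))

# max_M of  (2/T) Tr(Y^T M Y) - Tr(M^T M)  is attained at M = YY^T/T
term_M = (2 / T) * np.trace(Y.T @ M @ Y) - np.trace(M.T @ M)
assert np.isclose(term_M, (1 / T**2) * np.trace(Y.T @ Y @ Y.T @ Y))
```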

Finally, we exchange the order of minimization with respect to $Y$ and $W$, as well as the order of minimization with respect to $Y$ and maximization with respect to $M$, in Eq. (5). The last exchange is justified by the saddle point property (see Proposition 1 in Appendix A). Then, we arrive at the following min-max optimization problem:

$$\min_{W}\max_{M}\min_{Y}\, L(W, M, Y), \quad (6)$$

where $L$ is defined in Eq. (5). We call this a mixed objective because it includes both the output variables, $Y$, and the covariances, $W$ and $M$.

2.2 Offline PSP algorithm

In this section, we present an offline optimization algorithm to solve the PSP problem and analyze fixed points of the corresponding dynamics. These results will be used in the next Section for the biologically plausible online algorithm implemented by neural networks.

In the offline setting, we can solve Eq. (6) by the alternating optimization approach used commonly in the neural networks literature (Olshausen et al., 1996; Olshausen and Field, 1997; Arora et al., 2015). We first minimize with respect to $Y$ while keeping $W$ and $M$ fixed,

$$Y \leftarrow \underset{Y}{\operatorname{argmin}}\, L(W, M, Y) = M^{-1} W X, \quad (7)$$

and, second, make a gradient descent-ascent step with respect to $W$ and $M$ while keeping $Y$ fixed:

$$W \leftarrow W + 2\eta\left(\frac{1}{T} Y X^\top - W\right), \qquad M \leftarrow M + \frac{\eta}{\tau}\left(\frac{1}{T} Y Y^\top - M\right), \quad (8)$$

where $\eta$ is the learning rate and $\tau$ is the ratio of the learning rates for $W$ and $M$. In Appendix C, we analyze how $\tau$ affects the linear stability of the fixed points of this dynamics. These two phases are iterated until convergence (Algorithm 1).$^1$

$^1$ This alternating optimization is identical to a gradient descent-ascent (see Proposition 2 in Appendix B) in $W$ and $M$ on the objective $\min_Y L(W, M, Y)$.

1:  Initialize $W$. Initialize $M$ as a positive definite matrix.
2:  Iterate until convergence:
3:     Minimize Eq. (5) with respect to $Y$, keeping $W$ and $M$ fixed:
           $Y \leftarrow M^{-1} W X$   (9)
4:     Perform a gradient descent-ascent step with respect to $W$ and $M$ for a fixed $Y$:
           $W \leftarrow W + 2\eta\left(\frac{1}{T} Y X^\top - W\right), \qquad M \leftarrow M + \frac{\eta}{\tau}\left(\frac{1}{T} Y Y^\top - M\right)$   (10)
       where the step size, $\eta$, may depend on the iteration.
Algorithm 1 Offline min-max PSP

The optimal $Y$ in Eq. (9) exists because $M$ stays positive definite if initialized as such.
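
The offline procedure is compact enough to sketch in a few lines. The following is a minimal NumPy sketch of Algorithm 1 under the updates reconstructed above; the step sizes and iteration count are illustrative assumptions, not values from the paper:

```python
import numpy as np

def offline_psp(X, k, n_iter=3000, eta=0.05, tau=0.5, seed=0):
    """Minimal sketch of offline min-max PSP (Algorithm 1)."""
    rng = np.random.default_rng(seed)
    n, T = X.shape
    W = rng.standard_normal((k, n)) / np.sqrt(n)
    M = np.eye(k)                                    # positive definite init
    for _ in range(n_iter):
        Y = np.linalg.solve(M, W @ X)                # Eq. (9): Y = M^{-1} W X
        W = W + 2 * eta * (Y @ X.T / T - W)          # descent in W (Hebbian)
        M = M + (eta / tau) * (Y @ Y.T / T - M)      # ascent in M (anti-Hebbian)
    return W, M
```

At a stable fixed point the neural filters $F = M^{-1} W$ should be orthonormal and span the principal subspace, consistent with Theorem 1 below.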

2.3 Linearly stable fixed points of Algorithm 1 correspond to the PSP

Here we demonstrate that convergence of Algorithm 1 to a fixed $W$ and $M$ implies that $Y$ is a PSP of $X$. To this end, we approximate the gradient descent-ascent dynamics in the limit of small learning rate by the system of differential equations:

$$\frac{dW}{dt} = 2\left(\frac{1}{T} Y X^\top - W\right), \qquad \tau\,\frac{dM}{dt} = \frac{1}{T} Y Y^\top - M, \qquad Y = M^{-1} W X, \quad (11)$$

where $t$ is now the time index for the gradient descent-ascent dynamics.

To state our main result in Theorem 1, we define the "filter matrix," whose rows are "neural filters,"

$$F := M^{-1} W, \quad (12)$$

so that, according to Eq. (9),

$$Y = F X. \quad (13)$$
Theorem 1.

Fixed points of the dynamical system (11) have the following properties:

  1. The neural filters, $F$, are orthonormal, i.e. $F F^\top = I_k$.

  2. The neural filters span a $k$-dimensional subspace of $\mathbb{R}^n$ spanned by some $k$ eigenvectors of the input covariance matrix $C$.

  3. Stability of a fixed point requires that the neural filters span the principal subspace of $C$.

  4. Suppose the neural filters span the principal subspace. Define

    (14)

    where $\lambda_1 \geq \ldots \geq \lambda_k$ are the top $k$ principal eigenvalues of $C$. We assume $\lambda_k > \lambda_{k+1}$. This fixed point is linearly stable if and only if

    (15)

    holds for all pairs $(i, j)$. By linearly stable we mean that linear perturbations of $W$ and $M$ converge to a configuration in which the new neural filters are merely rotations of the original neural filters within the principal subspace.

Proof.

See Appendix C. ∎

Based on Theorem 1, we claim that, provided the dynamics converges to a fixed point, Algorithm 1 has found a PSP of the input data. Note that the orthonormality of the neural filters is desired and consistent with PSP since, in this approach, the outputs, $y_t$, are interpreted as coordinates with respect to a basis spanning the principal subspace.

Theorem 1 yields a practical recommendation for choosing the learning rate parameter $\tau$ in simulations. In a typical situation, one will not know the eigenvalues of the covariance matrix a priori, but one can rely on the fact that they are nonnegative. Then, Eq. (15) implies that for $\tau \leq 1/2$ the principal subspace is linearly stable, leading to numerical convergence and stability.

2.4 Online neural min-max optimization algorithms

Unlike the offline setting considered so far, where all the input data are available from the outset, in the online setting input data are streamed to the algorithm sequentially, one at a time. The algorithm must compute the corresponding output before the next input arrives and transmit it downstream. Once transmitted, the output cannot be altered. Moreover, the algorithm cannot store in memory any sizable fraction of past inputs or outputs, but only a few state variables.

Whereas developing algorithms for the online setting is more challenging than for the offline setting, it is necessary both for data analysis and for modeling biological neural networks. The size of modern datasets may exceed that of available RAM, and/or the output must be computed before the dataset is fully streamed. Biological neural networks operating on data streamed by the sensory organs are incapable of storing any significant fraction of it and must compute the output on the fly.

Figure 1: Dimensionality reduction neural networks derived by min-max optimization in the online setting. A. Network with autapses. B. Network without autapses.

Pehlevan et al. (2015) gave a derivation of a neural online algorithm for PSP, starting from the original similarity matching cost function (2). Here, instead, we start from the min-max form of similarity matching (6) and end up with a class of algorithms that reduce to the algorithm of Pehlevan et al. (2015) for special choices of learning rates. Our main contribution, however, is that the current derivation is much simpler and more intuitive, offering insight into why similarity matching leads to local learning rules.

We start by rewriting the min-max PSP objective (6) as a sum of time-separable terms that can be optimized independently:

$$\min_{W}\max_{M}\, \frac{1}{T}\sum_{t=1}^{T} l_t(W, M), \quad (16)$$

where

$$l_t(W, M) = 2\,\mathrm{Tr}\!\left(W^\top W\right) - \mathrm{Tr}\!\left(M^\top M\right) + \min_{y_t} q_t(W, M, y_t), \quad (17)$$

and

$$q_t(W, M, y_t) = -4\, x_t^\top W^\top y_t + 2\, y_t^\top M y_t. \quad (18)$$

This separation in time is a benefit of the min-max PSP objective (6), and it leads to a natural derivation of an online algorithm that was not available for the original similarity matching cost function (2).

To solve the optimization problem, Eq. (16), in the online setting, we optimize each $l_t$ sequentially. For each $t$, first minimize Eq. (18) with respect to $y_t$ while keeping $W$ and $M$ fixed. Second, make a gradient descent-ascent step with respect to $W$ and $M$ for fixed $y_t$:

$$W \leftarrow W + 2\eta\left(y_t x_t^\top - W\right), \qquad M \leftarrow M + \frac{\eta}{\tau}\left(y_t y_t^\top - M\right), \quad (19)$$

where $\eta$ is the learning rate and $\tau$ is the ratio of the $W$ and $M$ learning rates. As before, Proposition 2 (Appendix B) ensures that the online gradient descent-ascent updates, Eq. (19), follow from alternating optimization (Olshausen et al., 1996; Olshausen and Field, 1997; Arora et al., 2015) of $l_t$.

1:  At $t = 0$, initialize the synaptic weight matrices, $W$ and $M$. $M$ must be symmetric and positive definite.
2:  Repeat for each $t = 1, \ldots, T$:
3:     Receive input $x_t$
4:     Neural activity: Run until convergence
           $\dfrac{dy_t}{d\gamma} = W x_t - M y_t$   (20)
5:     Plasticity: Update synaptic weight matrices,
           $W \leftarrow W + 2\eta\left(y_t x_t^\top - W\right), \qquad M \leftarrow M + \frac{\eta}{\tau}\left(y_t y_t^\top - M\right)$   (21)
Algorithm 2 Online min-max PSP

Algorithm 2 can be implemented by a biologically plausible neural network. The dynamics (20) corresponds to neural activity in a recurrent circuit, where $W$ is the feedforward synaptic weight matrix and $M$ is the lateral synaptic weight matrix, Fig. 1A. Since $M$ is always positive definite, Eq. (18) is a Lyapunov function for the neural activity. Hence the dynamics is guaranteed to converge to the unique fixed point, $y_t = M^{-1} W x_t$, where the matrix inversion is computed iteratively in a distributed manner.

The updates of the covariance matrices, Eq. (21), can be interpreted as synaptic learning rules: Hebbian for the feedforward weights and anti-Hebbian (due to the minus sign of the $M y_t$ term in (20)) for the lateral weights. Importantly, these rules are local, meaning the weight of each synapse depends only on the activity of the pair of neurons that the synapse connects, and are therefore biologically plausible.

Even requiring full optimization with respect to $y_t$ versus a single gradient step with respect to $W$ and $M$ may have a biological justification: as neural activity dynamics is typically faster than synaptic plasticity, it may settle before the arrival of the next input.
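
A minimal sketch of one iteration of Algorithm 2 in NumPy, with the neural dynamics integrated by the Euler method (step sizes, iteration counts, and variable names are illustrative assumptions, not values from the paper):

```python
import numpy as np

def online_psp_step(x, W, M, eta=0.01, tau=0.5, n_dyn=300, dt=0.1):
    """One stimulus presentation of online min-max PSP (Algorithm 2 sketch)."""
    y = np.zeros(W.shape[0])
    # neural activity, Eq. (20): dy/dt = W x - M y, run to convergence
    for _ in range(n_dyn):
        y = y + dt * (W @ x - M @ y)
    # local plasticity, Eq. (21): Hebbian feedforward, anti-Hebbian lateral
    W = W + 2 * eta * (np.outer(y, x) - W)
    M = M + (eta / tau) * (np.outer(y, y) - M)
    return y, W, M
```

With enough dynamics iterations, the returned activity approximates the fixed point $y_t = M^{-1} W x_t$ before the weights are updated.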

To see why similarity matching leads to local learning rules, let us consider Eqs. (6) and (16). Aside from separating in time, which is useful for the derivation of online learning rules, $l_t$ also separates over synaptic weights and their pre- and postsynaptic neural activities:

$$l_t(W, M, y_t) = \sum_{i,j}\left(2\, W_{ij}^2 - 4\, W_{ij}\, y_{t,i}\, x_{t,j}\right) - \sum_{i,j}\left(M_{ij}^2 - 2\, M_{ij}\, y_{t,i}\, y_{t,j}\right). \quad (22)$$

Therefore, the derivative with respect to a synaptic weight depends only on quantities accessible to that synapse.
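
This locality can be checked numerically. Assuming the per-sample objective $l_t(W,M,y_t) = 2\,\mathrm{Tr}(W^\top W) - \mathrm{Tr}(M^\top M) - 4 x_t^\top W^\top y_t + 2 y_t^\top M y_t$ reconstructed above (a sketch based on our reading, not the paper's code), the gradient with respect to $W_{ij}$ is $4W_{ij} - 4 y_i x_j$ and with respect to $M_{ij}$ is $-2M_{ij} + 2 y_i y_j$, each involving only the synapse's own weight and its pre- and postsynaptic activities:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 4, 3
x = rng.standard_normal(n)
y = rng.standard_normal(k)
W = rng.standard_normal((k, n))
M = rng.standard_normal((k, k))

def l_t(W, M, x, y):
    # per-sample objective with y held fixed
    return (2 * np.trace(W.T @ W) - np.trace(M.T @ M)
            - 4 * x @ W.T @ y + 2 * y @ M @ y)

i, j, eps = 1, 2, 1e-6

# feedforward synapse: dl/dW_ij = 4 W_ij - 4 y_i x_j  (local)
E = np.zeros_like(W); E[i, j] = eps
num_grad = (l_t(W + E, M, x, y) - l_t(W - E, M, x, y)) / (2 * eps)
assert np.isclose(num_grad, 4 * W[i, j] - 4 * y[i] * x[j], atol=1e-4)

# lateral synapse: dl/dM_ij = -2 M_ij + 2 y_i y_j  (local)
E = np.zeros_like(M); E[i, j] = eps
num_grad = (l_t(W, M + E, x, y) - l_t(W, M - E, x, y)) / (2 * eps)
assert np.isclose(num_grad, -2 * M[i, j] + 2 * y[i] * y[j], atol=1e-4)
```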

Finally, we address two potential criticisms of the neural PSP algorithm. The first is the existence of autapses, i.e. self-couplings of neurons, in our network, manifested in the nonzero diagonal of the lateral connectivity matrix, $M$, Fig. 1A. Whereas autapses are encountered in the brain, they are rarely seen in principal neurons (Ikeda and Bekkers, 2006). The second is the symmetry of lateral synaptic weights in our network, which is not observed experimentally. We derive an autapse-free network architecture (zeros on the diagonal of the lateral synaptic weight matrix $M$) with asymmetric lateral connectivity, Fig. 1B, by using coordinate descent (Pehlevan et al., 2015) in place of gradient descent in the neural dynamics stage (20) (see Appendix F). The resulting algorithm produces the same outputs as the current algorithm and, for a special choice of learning rates, reduces to the algorithm with "forgetting" of Pehlevan et al. (2015).

3 From constrained similarity matching to Hebbian/anti-Hebbian networks for PSW

The variable substitution method we introduced in the previous section can be applied to other computational objectives in order to derive neural networks with local learning rules. To give an example, we derive a neural network for PSW, which can be formulated as a constrained similarity matching problem. This example also illustrates how an optimization constraint can be implemented by biological mechanisms.

3.1 Derivation of PSW from constrained similarity matching

The PSW problem is closely related to PSP: project centered input data samples onto the principal subspace ($k \leq n$) and "spherize" the data in the subspace so that the variances in all directions are 1. To derive a neural PSW algorithm, we use the similarity matching objective with an additional constraint:

$$\min_{Y} \frac{1}{T^2}\left\| X^\top X - Y^\top Y \right\|_F^2 \quad \text{s.t.}\quad \frac{1}{T} Y Y^\top = I_k. \quad (23)$$

We rewrite Eq. (23) by expanding the squared Frobenius norm and dropping the $\mathrm{Tr}\!\left(Y^\top Y Y^\top Y\right)$ term, which is constant under the constraint, thus reducing (23) to a constrained similarity alignment problem:

$$\max_{Y} \frac{1}{T^2}\,\mathrm{Tr}\!\left(X^\top X Y^\top Y\right) \quad \text{s.t.}\quad \frac{1}{T} Y Y^\top = I_k. \quad (24)$$

To see that objective (24) is optimized by the PSW, first substitute a singular value decomposition (SVD) for the matrices $X$ and $Y$, and note that the alignment is maximized by matching the right singular vectors of $Y$ to those of $X$ and rotating to the diagonal basis (for a rigorous proof see Pehlevan and Chklovskii (2015)). Since the squared singular values of $\frac{1}{\sqrt{T}} Y$ equal unity, the objective (24) reduces to a sum of squared singular values of $X$ and is optimized by choosing the top $k$. Then, $Y$ is given by the top $k$ right singular vectors of $X$ scaled by $\sqrt{T}$. As before, objective (24) is invariant to left-multiplication of $Y$ by an orthogonal matrix and, therefore, has infinitely many degenerate solutions.

Next, we derive a mixed PSW objective from Eq. (24) by introducing two new dynamical variable matrices: the input-output correlation matrix, $W$, and the Lagrange multiplier matrix, $M$, for the whitening constraint:

$$\min_{Y}\min_{W}\max_{M}\, L(W, M, Y), \quad (25)$$

where

$$L(W, M, Y) = \mathrm{Tr}\!\left(-\frac{4}{T} X^\top W^\top Y + 2\, W^\top W + \frac{2}{T}\, Y^\top M Y - 2M\right). \quad (26)$$

To see that Eq. (26) is equivalent to Eq. (24), find the optimal $W$ by setting the corresponding derivatives of the objective (26) to zero. Then, substitute $W = \frac{1}{T} Y X^\top$ into Eq. (26) to obtain the Lagrangian of Eq. (24).

Finally, we exchange the order of minimization with respect to $Y$ and $W$, as well as the order of minimization with respect to $Y$ and maximization with respect to $M$, in Eq. (26) (see Proposition 5 in Appendix D for a proof). Then, we arrive at the following min-max optimization problem with a mixed objective:

$$\min_{W}\max_{M}\min_{Y}\, L(W, M, Y), \quad (27)$$

where $L$ is defined in Eq. (26).

3.2 Offline PSW algorithm

Next, we give an offline algorithm for the PSW problem, using the same alternating optimization procedure as before. We solve Eq. (27) by, first, optimizing with respect to $Y$ for fixed $W$ and $M$ and, second, making a gradient descent-ascent step with respect to $W$ and $M$ while keeping $Y$ fixed.$^2$ We arrive at the following algorithm:

$^2$ This alternating optimization is identical to a gradient descent-ascent (see Proposition 2 in Appendix B) in $W$ and $M$ on the objective $\min_Y L(W, M, Y)$.

1:  Initialize $W$. Initialize $M$ as a positive definite matrix.
2:  Iterate until convergence:
3:     Minimize Eq. (26) with respect to $Y$, keeping $W$ and $M$ fixed:
           $Y \leftarrow M^{-1} W X$   (28)
4:     Perform a gradient descent-ascent step with respect to $W$ and $M$ for a fixed $Y$:
           $W \leftarrow W + 2\eta\left(\frac{1}{T} Y X^\top - W\right), \qquad M \leftarrow M + \frac{\eta}{\tau}\left(\frac{1}{T} Y Y^\top - I_k\right)$   (29)
       where the step size, $\eta$, may depend on the iteration.
Algorithm 3 Offline min-max PSW

Convergence of Algorithm 3 requires the input covariance matrix, $C$, to have at least $k$ non-zero eigenvalues. Otherwise, a consistent solution cannot be found, because the $M$ update in Eq. (29) forces $\frac{1}{T} Y Y^\top$ to be full-rank while Eq. (28) limits its rank to that of $C$.
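
A minimal NumPy sketch of Algorithm 3 under the reconstructed updates (the learning rates are illustrative assumptions; $\tau$ is chosen small to stay inside the stability region discussed below):

```python
import numpy as np

def offline_psw(X, k, n_iter=8000, eta=0.02, tau=0.1, seed=0):
    """Minimal sketch of offline min-max PSW (Algorithm 3)."""
    rng = np.random.default_rng(seed)
    n, T = X.shape
    W = rng.standard_normal((k, n)) / np.sqrt(n)
    M = np.eye(k)                                   # positive definite init
    for _ in range(n_iter):
        Y = np.linalg.solve(M, W @ X)               # Eq. (28): Y = M^{-1} W X
        W = W + 2 * eta * (Y @ X.T / T - W)         # Hebbian feedforward
        # M is the Lagrange multiplier enforcing the whitening constraint
        M = M + (eta / tau) * (Y @ Y.T / T - np.eye(k))
    return W, M
```

At a fixed point the output is whitened, $\frac{1}{T} Y Y^\top = I_k$, consistent with Theorem 2 below.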

3.3 Linearly stable fixed points of Algorithm 3 correspond to PSW

Here we claim that convergence of Algorithm 3 to a fixed $W$ and $M$ implies a PSW of $X$. In the limit of small learning rate, the gradient descent-ascent dynamics can be approximated by the system of differential equations:

$$\frac{dW}{dt} = 2\left(\frac{1}{T} Y X^\top - W\right), \qquad \tau\,\frac{dM}{dt} = \frac{1}{T} Y Y^\top - I_k, \qquad Y = M^{-1} W X, \quad (30)$$

where $t$ is now the time index for the gradient descent-ascent dynamics. We again define the neural filter matrix $F := M^{-1} W$.

Theorem 2.

Fixed points of the dynamical system (30) have the following properties:

  1. The outputs are whitened, i.e. $\frac{1}{T} Y Y^\top = I_k$.

  2. The neural filters span a $k$-dimensional subspace of $\mathbb{R}^n$ which is spanned by some $k$ eigenvectors of the input covariance matrix $C$.

  3. Stability of a fixed point requires that the neural filters span the principal subspace of $C$.

  4. Suppose the neural filters span the principal subspace. This fixed point is linearly stable if and only if

    (31)

    holds for all pairs $(i, j)$. By linear stability we mean that linear perturbations of $W$ and $M$ converge to a rotation of the original neural filters within the principal subspace.

Proof.

See Appendix E. ∎

Based on Theorem 2, we claim that, provided Algorithm 3 converges to a fixed point, that fixed point corresponds to a PSW of the input data. Unlike in the PSP case, the neural filters are not orthonormal.

3.4 Online algorithm for PSW

As before, we start by rewriting the min-max PSW objective (27) as a sum of time-separable terms that can be optimized independently:

$$\min_{W}\max_{M}\, \frac{1}{T}\sum_{t=1}^{T} l_t(W, M), \quad (32)$$

where

$$l_t(W, M) = 2\,\mathrm{Tr}\!\left(W^\top W\right) - 2\,\mathrm{Tr}\!\left(M\right) + \min_{y_t} q_t(W, M, y_t), \quad (33)$$

and $q_t$ is defined in Eq. (18). In the online setting, Eq. (32) can be optimized by sequentially minimizing each $l_t$. For each $t$, first minimize (18) with respect to $y_t$ for fixed $W$ and $M$; second, update $W$ and $M$ according to a gradient descent-ascent step for fixed $y_t$:

$$W \leftarrow W + 2\eta\left(y_t x_t^\top - W\right), \qquad M \leftarrow M + \frac{\eta}{\tau}\left(y_t y_t^\top - I_k\right), \quad (34)$$

where $\eta$ is the learning rate and $\tau$ is the ratio of the $W$ and $M$ learning rates.

As before, Proposition 2 ensures that the online gradient descent-ascent updates, Eq. (34), follow from alternating optimization (Olshausen et al., 1996; Olshausen and Field, 1997; Arora et al., 2015) of $l_t$.

1:  At $t = 0$, initialize the synaptic weight matrices, $W$ and $M$. $M$ must be symmetric and positive definite.
2:  Repeat for each $t = 1, \ldots, T$:
3:     Receive input $x_t$
4:     Neural activity: Run until convergence
           $\dfrac{dy_t}{d\gamma} = W x_t - M y_t$   (35)
5:     Plasticity: Update synaptic weight matrices,
           $W \leftarrow W + 2\eta\left(y_t x_t^\top - W\right), \qquad M \leftarrow M + \frac{\eta}{\tau}\left(y_t y_t^\top - I_k\right)$   (36)
Algorithm 4 Online min-max PSW

Algorithm 4 can be implemented by a biologically plausible single-layer neural network with lateral connections, as in Algorithm 2, Fig. 1A. The updates to synaptic weights, Eq. (36), are local Hebbian/anti-Hebbian plasticity rules. An autapse-free network architecture, Fig. 1B, may be obtained using coordinate descent (Pehlevan et al., 2015) in place of gradient descent in the neural dynamics stage (35) (see Appendix G).

The lateral connections here are the Lagrange multipliers introduced in the offline problem, Eq. (26), whereas in the PSP network they resulted from a variable transformation of the output covariance matrix. This difference carries over to the learning rules: in Algorithm 4, the lateral learning rule enforces the whitening of the output, whereas in Algorithm 2, the lateral learning rule sets the lateral weight matrix to the output covariance matrix.
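
In code, the two online algorithms differ by a single line. The sketch below follows the same assumptions as the PSP step (Euler-integrated dynamics; illustrative parameters):

```python
import numpy as np

def online_psw_step(x, W, M, eta=0.01, tau=0.1, n_dyn=300, dt=0.1):
    """One stimulus presentation of online min-max PSW (Algorithm 4 sketch)."""
    y = np.zeros(W.shape[0])
    for _ in range(n_dyn):                      # neural dynamics, Eq. (35)
        y = y + dt * (W @ x - M @ y)
    W = W + 2 * eta * (np.outer(y, x) - W)      # Hebbian, same as PSP
    # lateral rule enforces whitening: (y y^T - I), not (y y^T - M) as in PSP
    M = M + (eta / tau) * (np.outer(y, y) - np.eye(len(y)))
    return y, W, M
```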

4 Game theoretical interpretation of Hebbian/anti-Hebbian learning

In the original similarity matching objective, Eq. (2), the only variables are neuronal activities which, at the optimum, represent principal components. In Section 2, we rewrote this objective by introducing the matrices $W$ and $M$, corresponding to synaptic connection weights, Eq. (5). Here, we eliminate the neural activity variables altogether and arrive at a min-max formulation in terms of the feedforward, $W$, and lateral, $M$, connection weight matrices only. This formulation lends itself to a game-theoretical interpretation.

Since, in the offline PSP setting, the optimal $M$ in Eq. (6) is an invertible matrix (it stays positive definite; see also Appendix A), we can restrict our optimization to invertible matrices, $M$, only. Then, we can optimize the objective (5) with respect to $Y$, substitute its optimal value $Y = M^{-1} W X$ into (5) and (6), and obtain:

$$\min_{W}\max_{M}\; 2\,\mathrm{Tr}\!\left(W^\top W\right) - \mathrm{Tr}\!\left(M^\top M\right) - 2\,\mathrm{Tr}\!\left(M^{-1} W C W^\top\right) \quad \text{s.t. } M \text{ is invertible}. \quad (37)$$

This min-max objective admits a game-theoretical interpretation in which the feedforward, $W$, and lateral, $M$, synaptic weight matrices oppose each other. To reduce the objective, the feedforward synaptic weight vectors of each output neuron attempt to align with the direction of maximum variance of the input data. However, if this were the only driving force, then all output neurons would learn the same synaptic weight vectors and represent the same top principal component. At the same time, linear dependency between different feedforward synaptic weight vectors can be exploited by the lateral synaptic weights to increase the objective by canceling the contributions of different components. To avoid this, the feedforward synaptic weight vectors become linearly independent and span the principal subspace.

A similar interpretation can be given for PSW, where the feedforward, $W$, and lateral, $M$, synaptic weight matrices oppose each other adversarially.

5 Novel formulations of dimensionality reduction using fractional exponents

In this section, we point to a new class of dimensionality reduction objective functions that naturally follow from the min-max objectives (5) and (6). Eliminating both the neural activity variables, Y, and the lateral connection weight matrix, M, we arrive at optimization problems in terms of the feedforward weight matrix, W, only. The rows of optimal W form a non-orthogonal basis of the principal subspace. Such formulations of principal subspace problems involve fractional exponents of matrices and, to the best of our knowledge, have not been proposed previously.

By replacing the optimization over $M$ and $Y$ in the min-max PSP objective, Eq. (6), by its saddle point value (see Proposition 1 in Appendix A), we find the following objective expressed solely in terms of $W$:

$$\min_{W}\; 2\,\mathrm{Tr}\!\left(W W^\top\right) - 3\,\mathrm{Tr}\!\left(\left(W C W^\top\right)^{2/3}\right). \quad (38)$$

The rows of the optimal $W$ are not principal eigenvectors; rather, the rowspace of $W$ spans the principal subspace.
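
The elimination of $M$ can be checked numerically. For the reconstructed inner objective $-\mathrm{Tr}(M^2) - 2\,\mathrm{Tr}(M^{-1} W C W^\top)$ (a sketch based on our reading of Eq. (37), with $A$ standing in for the positive definite matrix $W C W^\top$), the maximizing $M$ is $A^{1/3}$ and the maximum value is $-3\,\mathrm{Tr}(A^{2/3})$, which is the source of the fractional exponent in Eq. (38):

```python
import numpy as np

def frac_power(A, p):
    """Fractional power of a symmetric PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * np.maximum(w, 0) ** p) @ V.T

rng = np.random.default_rng(3)
k = 3
B = rng.standard_normal((k, k))
A = B @ B.T + 0.1 * np.eye(k)      # stands in for W C W^T (positive definite)

def inner(M, A):
    # M-dependent part of the PSP objective after optimizing Y out
    return -np.trace(M @ M) - 2 * np.trace(np.linalg.inv(M) @ A)

M_opt = frac_power(A, 1 / 3)       # stationarity: M^3 = A
# saddle-point value: -3 Tr(A^{2/3})
assert np.isclose(inner(M_opt, A), -3 * np.trace(frac_power(A, 2 / 3)))

# M_opt is a maximum: random symmetric perturbations only decrease the value
for _ in range(5):
    E = rng.standard_normal((k, k))
    E = 0.01 * (E + E.T)
    assert inner(M_opt + E, A) <= inner(M_opt, A) + 1e-9
```

The same computation with $-2\,\mathrm{Tr}(M)$ in place of $-\mathrm{Tr}(M^2)$ gives $M = A^{1/2}$ and exponent $1/2$, matching the PSW case, Eq. (39).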

Similarly, replacing the optimization over $M$ and $Y$ in the min-max PSW objective, Eq. (27), by its optimal value (see Proposition 5 in Appendix D) yields:

$$\min_{W}\; 2\,\mathrm{Tr}\!\left(W W^\top\right) - 4\,\mathrm{Tr}\!\left(\left(W C W^\top\right)^{1/2}\right). \quad (39)$$

As before, the rows of the optimal $W$ are not principal eigenvectors; rather, the rowspace of $W$ spans the principal eigenspace.

We observe that the only material difference between Eqs. (38) and (39) is in the value of the fractional exponent. Based on this, we conjecture that any objective function of such form with a fractional exponent from a continuous range is optimized by spanning the principal subspace. Such solutions would differ in the eigenvalues associated with the corresponding components.

A supporting argument for our conjecture comes from the work of Miao and Hua (1998), which studied the cost

$$\min_{W}\; \mathrm{Tr}\!\left(W W^\top\right) - \log\det\!\left(W C W^\top\right). \quad (40)$$

Eq. (40) can be seen as a limiting case of our conjecture, in which the fractional exponent goes to zero. Indeed, Miao and Hua (1998) proved that the rows of the optimal $W$ form an orthonormal basis for the principal eigenspace.

6 Numerical experiments

Figure 2: Demonstration of the stability of the PSP (top row) and PSW (bottom row) algorithms. We constructed an by data matrix from its SVD, where the left and right singular vectors are chosen randomly, the top three singular values are set to and the rest of the singular values are chosen uniformly in . Learning rates were . Errors were defined using deviation of the neural filters from their optimal values (Pehlevan et al., 2015). Let be the matrix whose columns are the top 3 left singular vectors of . PSP error: , PSW error: , with in MATLAB notation. Solid (dashed) lines indicate linearly stable (unstable) choices of . A) Small perturbations to the fixed point. and matrices were initialized by adding a random Gaussian variable, , elementwise to their fixed point values. B) Offline algorithm, initialized with random and matrices. C) Online algorithm, initialized with the same initial condition as in B). A random column of is processed at each time.

Next, we test our findings on a simple artificial dataset. We generated a dataset as described in the caption of Fig. 2 and simulated our offline and online algorithms to reduce its dimensionality, using different values of the parameter $\tau$. The results are plotted in Figs. 2, 3, 4 and 5, with details of the simulations given in the figure captions.

Consistent with Theorems 1 and 2, small perturbations to the PSP and PSW fixed points decayed (solid lines) or grew (dashed lines) depending on the value of $\tau$, Fig. 2A. Offline simulations starting from random initial conditions converged to the PSP (or PSW) solution if the fixed point was linearly stable, Fig. 2B. Interestingly, the online algorithms' performance was very close to that of the offline ones, Fig. 2C.

The error for linearly unstable simulations in Fig. 2 saturates rather than blowing up. This may seem at odds with Theorems 1 and 2, which state that if there is a stable fixed point of the dynamics, it should be the PSP/PSW solution. A closer look resolves this dilemma. In Fig. 3, we plot the evolution of an element of the $W$ matrix in the offline algorithms for stable and unstable choices of $\tau$. When the principal subspace is linearly unstable, the synaptic weights exhibit undamped oscillations. The dynamics seems to be confined to a manifold at a fixed distance (in terms of the error metric) from the principal subspace. That the error does not grow to infinity is a result of the stabilizing effect of the min-max antagonism of the synaptic weights. The online algorithms behave similarly.

Figure 3: Evolution of a synaptic weight. Same dataset was used as in Fig. 2. .

Next, we studied in detail the effect of the parameter $\tau$ on convergence. For the offline algorithms, we plot the error after a fixed number of gradient steps as a function of $\tau$, Fig. 4. For PSP, there is an optimal $\tau$. Decreasing $\tau$ below the optimal value does not degrade performance; however, increasing it leads to a rapid increase in the error. For PSW, there is a plateau of low error for low values of $\tau$, but a rapid increase as one approaches the linear instability threshold. The online algorithms behave similarly.

Figure 4: Effect of the parameter on performance. The error after a fixed number of gradient steps is plotted as a function of the parameter. The same dataset was used as in Fig. 2, with the same network initialization and learning rates. Both curves extend up to the maximum parameter value allowed for linear stability.

Finally, we compared the performance of our online PSP algorithm to neural PSP algorithms with heuristic learning rules such as the Subspace Network (Oja, 1989) and the Generalized Hebbian Algorithm (GHA) (Sanger, 1989), on the same dataset. We found that our algorithm converges much faster (Fig. 5). Previously, the original similarity matching network (Pehlevan et al., 2015), which is a special case of the online PSP algorithm of this paper, was shown to converge faster than the APEX (Kung et al., 1994) and Földiak’s (Földiak, 1989) networks.
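For reference, the two heuristic baselines can be sketched as follows. The update rules are the standard published ones (Oja's Subspace Network and Sanger's GHA), but the dataset, learning rate, and subspace-error metric here are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data with a dominant k-dimensional principal subspace.
n, k, T = 8, 2, 4000
U, _ = np.linalg.qr(rng.standard_normal((n, k)))
X = U @ (3.0 * rng.standard_normal((k, T))) + 0.1 * rng.standard_normal((n, T))

def subspace_error(W, X, k):
    """Squared Frobenius distance between learned and true projectors."""
    Q, _ = np.linalg.qr(W.T)                      # basis of learned subspace
    _, evecs = np.linalg.eigh(X @ X.T / X.shape[1])
    V = evecs[:, -k:]                             # top-k principal eigenvectors
    return np.linalg.norm(Q @ Q.T - V @ V.T) ** 2

eta = 0.005
W_sub = 0.1 * rng.standard_normal((k, n))         # Oja's Subspace Network
W_gha = 0.1 * rng.standard_normal((k, n))         # Sanger's GHA

for t in range(T):
    x = X[:, t]
    # Subspace rule: dW = eta * (y x^T - y y^T W), with y = W x
    y = W_sub @ x
    W_sub += eta * (np.outer(y, x) - np.outer(y, y) @ W_sub)
    # GHA: dW = eta * (y x^T - LT[y y^T] W), LT = lower-triangular part
    y = W_gha @ x
    W_gha += eta * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W_gha)

err_sub = subspace_error(W_sub, X, k)
err_gha = subspace_error(W_gha, X, k)
```

Both rules are purely feedforward (no lateral connections), which is the structural difference from the Hebbian/anti-Hebbian networks derived here.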

Figure 5: Comparison of the online PSP algorithm with the Subspace Network (Oja, 1989) and the GHA (Sanger, 1989). The dataset and the error metric are as in Fig. 2. For fairness of comparison, the learning rates were set to the same value in all networks, including the online PSP algorithm. Feedforward connectivity matrices were initialized randomly. For the online PSP algorithm, the lateral connectivity matrix was initialized to the identity matrix. Curves show averages over 10 trials.

7 Discussion

In this paper, through transparent variable substitutions, we demonstrated why biologically plausible neural networks can be derived from similarity matching objectives. We mathematically formalized the adversarial relationship between Hebbian feedforward and anti-Hebbian lateral connections as a min-max optimization, lending itself to a game-theoretical interpretation, and we formulated dimensionality reduction tasks as optimizations involving fractional powers of matrices. The formalism we developed should generalize to unsupervised tasks other than dimensionality reduction and could provide a theoretical foundation for both natural and artificial neural networks.

Most importantly for comparison with biological networks, our networks rely only on local learning rules that can be implemented by synaptic plasticity. While Hebbian learning is famously observed in neural circuits (Bliss and Lømo, 1973; Bliss and Gardner-Medwin, 1973), our networks also require anti-Hebbian learning, which can be interpreted as the long-term potentiation of inhibitory postsynaptic potentials. Experimentally, such long-term potentiation can arise from pairing action potentials in inhibitory neurons with subthreshold depolarization of postsynaptic pyramidal neurons (Komatsu, 1994; Maffei et al., 2006). However, plasticity in inhibitory synapses does not have to be Hebbian, i.e. depend on the correlation between pre- and postsynaptic activity (Kullmann et al., 2012).

To make progress, we had to make several simplifications sacrificing biological realism. In particular, we assumed that neuronal activity is a continuous variable, which would correspond to membrane depolarization (in graded potential neurons) or firing rate (in spiking neurons). We also ignored the nonlinearity of the neuronal input-output function. Such a linear regime could be implemented via a resting state bias (in graded potential neurons) or a resting firing rate (in spiking neurons).

The applicability of our networks as models of biological networks can be judged by experimentally testing the following predictions. First, we predict a relationship between the feedforward and lateral synaptic weight matrices, which could be tested using modern connectomics datasets. Second, we suggest that the similarity of output activity matches that of the input, which could be tested by measuring neuronal population activity using calcium imaging.

Often the choice of a learning rate is crucial to the learning performance of neural networks. Here, we encountered a nuanced case where the ratio of the feedforward and lateral learning rates affects performance significantly. First, there is a maximum value of this ratio beyond which the principal subspace solution is linearly unstable. The maximum value depends on the principal eigenvalues; for PSP there is always a linearly stable choice, but for PSW there is no universally safe choice, and using the same learning rates for the feedforward and lateral weights may actually be unstable. Second, linear stability is not the only factor that affects performance. In simulations, for PSP we observed an optimal value of the ratio; for PSW, decreasing the ratio improves performance until a plateau is reached. This difference between PSP and PSW may be attributed to the different origins of lateral connectivity: in PSW algorithms, the lateral weights originate from Lagrange multipliers enforcing an optimization constraint, and a low ratio, meaning a higher lateral learning rate, forces the network to satisfy the constraint throughout the evolution of the algorithm.

Based on these observations, we can make practical suggestions for this parameter. For PSP, the value preferred by another derivation of an online similarity matching algorithm (Pehlevan et al., 2015) seems to be a good choice. For PSW, the smaller the ratio the better, although one should make sure that the lateral weight learning rate remains sufficiently small.

Acknowledgments

We thank Alex Genkin, Sebastian Seung, Mariano Tepper and Jonathan Zung for discussions.

Appendix A Proof of strong min-max property for PSP objective

Here we show that the minimization with respect to the feedforward weights and the maximization with respect to the lateral weights in Eq. (5) can be exchanged. We will make use of the following min-max theorem (Boyd and Vandenberghe, 2004), for which we give a proof for completeness:

Theorem 3.

Let $f : X \times \Lambda \to \mathbb{R}$. Suppose the saddle-point property holds, i.e. there exists $(x^*, \lambda^*)$ such that $\forall x \in X$ and $\forall \lambda \in \Lambda$,

$$f(x^*, \lambda) \;\le\; f(x^*, \lambda^*) \;\le\; f(x, \lambda^*). \tag{41}$$

Then,

$$\min_{x \in X}\, \max_{\lambda \in \Lambda} f(x, \lambda) \;=\; \max_{\lambda \in \Lambda}\, \min_{x \in X} f(x, \lambda) \;=\; f(x^*, \lambda^*). \tag{42}$$

Proof.

The saddle-point property (41) implies

$$\min_{x}\, \max_{\lambda} f(x, \lambda) \;\le\; \max_{\lambda} f(x^*, \lambda) \;=\; f(x^*, \lambda^*) \;=\; \min_{x} f(x, \lambda^*) \;\le\; \max_{\lambda}\, \min_{x} f(x, \lambda). \tag{43}$$

Since $\max_{\lambda} \min_{x} f(x, \lambda) \le \min_{x} \max_{\lambda} f(x, \lambda)$ is always true, we get an equality. ∎
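As a minimal numerical illustration of the saddle-point criterion (a toy example of our own, not taken from the paper), consider the convex-concave function f(x, λ) = x² + 2xλ − λ², which has a saddle point at (0, 0); exchanging min and max leaves the value unchanged:

```python
import numpy as np

# f(x, lam) = x^2 + 2*x*lam - lam^2 is convex in x and concave in lam,
# with a saddle point at (x*, lam*) = (0, 0) and saddle value 0.
def f(x, lam):
    return x**2 + 2 * x * lam - lam**2

grid = np.linspace(-2.0, 2.0, 401)       # fine grid covering the saddle point
F = f(grid[:, None], grid[None, :])      # F[i, j] = f(grid[i], grid[j])

min_max = F.max(axis=1).min()            # min over x of max over lam
max_min = F.min(axis=0).max()            # max over lam of min over x
saddle_value = f(0.0, 0.0)
```

On the grid, both orders of optimization agree with the saddle value, as Theorem 3 guarantees once the saddle-point property (41) holds.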

Now, we present the main result of this section.

Proposition 1.

Define

(44)

where the arguments are real-valued matrices of arbitrary compatible sizes. This objective obeys a strong min-max property:

(45)
Proof.

We will show that the saddle-point property holds for Eq. (44). Then the result follows from Theorem 3.

If a saddle point exists, it must satisfy the stationarity conditions:

(46)

Note that the matrix is symmetric and positive semidefinite. Multiplying the first equation on the left and the right, and using the second equation, we arrive at

(47)

Solutions to Eq. (47) are not unique, because the matrix may not be invertible. However, all solutions give the same value of the objective:

(48)

Now, we check if the saddle-point property, Eq. (41), holds. The first inequality is satisfied:

(49)

The second inequality is also satisfied:

(50)

where the last line follows from the matrix being positive semidefinite.

Eqs. (49) and (50) show that the saddle-point property (41) holds; therefore the minimization and maximization can be exchanged, and the value of the objective at the saddle point is as given in Eq. (48). ∎

Appendix B Taking a derivative using a chain rule

Proposition 2.

Suppose a differentiable, scalar function $f(\mathbf{x}, \mathbf{y})$ of two sets of variables. Assume that, for a given $\mathbf{x}$, a finite minimum with respect to $\mathbf{y}$ exists:

$$h(\mathbf{x}) := \min_{\mathbf{y}} f(\mathbf{x}, \mathbf{y}) = f\!\left(\mathbf{x}, \mathbf{y}^*(\mathbf{x})\right), \tag{51}$$

and that the optimum $\mathbf{y}^*(\mathbf{x})$ is a stationary point:

$$\left.\frac{\partial f(\mathbf{x}, \mathbf{y})}{\partial \mathbf{y}}\right|_{\mathbf{y} = \mathbf{y}^*(\mathbf{x})} = 0. \tag{52}$$

Then,

$$\frac{d h(\mathbf{x})}{d \mathbf{x}} = \left.\frac{\partial f(\mathbf{x}, \mathbf{y})}{\partial \mathbf{x}}\right|_{\mathbf{y} = \mathbf{y}^*(\mathbf{x})}. \tag{53}$$

Proof.

The result follows from the chain rule and the stationarity of the minimum:

$$\frac{d h(\mathbf{x})}{d \mathbf{x}} = \left.\frac{\partial f}{\partial \mathbf{x}}\right|_{\mathbf{y} = \mathbf{y}^*(\mathbf{x})} + \left.\frac{\partial f}{\partial \mathbf{y}}\right|_{\mathbf{y} = \mathbf{y}^*(\mathbf{x})} \frac{d \mathbf{y}^*(\mathbf{x})}{d \mathbf{x}}, \tag{54}$$

where the second term is zero due to Eq. (52). ∎
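A quick numerical sanity check of this envelope-style argument on a toy function of our own choosing: for f(x, y) = (y − x²)² + x², the inner minimization gives y*(x) = x², and the total derivative of h(x) = min_y f(x, y) matches the partial derivative of f with respect to x evaluated at the optimum:

```python
# Toy function: f(x, y) = (y - x^2)^2 + x^2.
# The inner minimum over y is attained at y*(x) = x^2 (a stationary point),
# so h(x) = min_y f(x, y) = x^2 and dh/dx = 2x.
def f(x, y):
    return (y - x**2) ** 2 + x**2

def y_star(x):
    return x**2                     # solves df/dy = 2*(y - x^2) = 0

def h(x):
    return f(x, y_star(x))

x0, eps = 0.7, 1e-6
# Total derivative of h via central finite differences.
dh_numeric = (h(x0 + eps) - h(x0 - eps)) / (2 * eps)
# Partial derivative of f w.r.t. x with y held fixed at y*(x0):
# df/dx = -4*x*(y - x^2) + 2*x, which reduces to 2*x at y = y*(x).
df_dx_at_opt = -4 * x0 * (y_star(x0) - x0**2) + 2 * x0
```

The indirect term through y*(x) drops out exactly as in Eq. (54), because the inner optimum is stationary.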

Appendix C Proof of Theorem 1

Here we prove Theorem 1 using methodology from (Pehlevan et al., 2015).

The fixed points of Eq. (2.3) satisfy (with bars denoting fixed-point values):

(55)

where the input covariance matrix is defined as in Eq. (1).

c.1 Proof of item 1

The result follows from Eqs. (12) and (55):

(56)

c.2 Proof of item 2

First note that, at the fixed points, the two matrices commute:

(57)
Proof.

The result follows from Eqs. (12) and (55):

(58)

Because they commute, the two matrices share the same eigenvectors. Orthonormality of the neural filters, Eq. (56), implies that the rows of the filter matrix are degenerate eigenvectors with unit eigenvalue. Because the filters are degenerate, the corresponding shared eigenvectors may not be the filters themselves but linear combinations of them. Nevertheless, the shared eigenvectors composed of filters span the same space as the filters.

Since we are interested in the PSP, it is desirable that the top eigenvectors of the covariance matrix span the filter space. A linear stability analysis around the fixed point reveals that any other combination is unstable, and that the principal subspace is stable if the parameter is chosen appropriately.
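The commuting-matrices argument can be verified numerically. In this sketch (our own construction, with assumed names), C plays the role of the input covariance, P is the projector onto its principal subspace, and the rows of F play the role of orthonormal neural filters that span that subspace without individually being eigenvectors of C:

```python
import numpy as np

rng = np.random.default_rng(2)

# C plays the role of the input covariance matrix; P is the orthogonal
# projector onto its top-k principal eigenvectors, which commutes with C.
n, k = 6, 2
A = rng.standard_normal((n, n))
C = A @ A.T                                   # symmetric positive semidefinite
evals, evecs = np.linalg.eigh(C)              # eigenvalues in ascending order
V = evecs[:, -k:]                             # top-k principal eigenvectors
P = V @ V.T                                   # projector onto principal subspace

commute_err = np.linalg.norm(C @ P - P @ C)

# Any rotation of the top-k eigenvectors yields an equally valid set of
# orthonormal "filters": the rows of F = R V^T are unit-eigenvalue
# eigenvectors of P, yet they span the same principal subspace.
R, _ = np.linalg.qr(rng.standard_normal((k, k)))
F = R @ V.T
span_err = np.linalg.norm(F @ P - F)          # rows of F lie in the subspace
orthonormal_err = np.linalg.norm(F @ F.T - np.eye(k))
```

This mirrors the degeneracy noted above: the individual filters are not pinned down, only the subspace they span.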

c.3 Proof of item 3

Preliminaries

In order to perform a linear stability analysis, we linearize the system of equations (2.3) around the fixed point. Even though Eq. (2.3) depends on the original weight variables, we will find it convenient to change variables and work with a transformed pair instead.

Using the relation , one can express linear perturbations of around its fixed point, , in terms of perturbations of and :

(59)

Linearization of Eq. (2.3) gives:

(60)

and

(61)

Using these, we arrive at: