1 Introduction
The human brain generates complex behaviors via the dynamics of electrical activity in a network of neurons, each making synaptic connections. As there is no known centralized authority determining which specific connections a neuron makes or specifying the weights of individual synapses, synaptic connections must be established based on local rules. Therefore, a major challenge in neuroscience is to determine local synaptic learning rules that ensure that the network acts coherently, i.e. to guarantee robust network self-organization.
Much work has been devoted to the self-organization of neural networks for solving unsupervised computational tasks using Hebbian and anti-Hebbian learning rules (Földiak, 1990, 1989; Rubner and Tavan, 1989; Rubner and Schulten, 1990; Carlson, 1990; Plumbley, 1993b; Leen, 1991; Plumbley, 1993a; Linsker, 1997). The unsupervised setting is natural in biology because large-scale labeled datasets are typically unavailable. Hebbian and anti-Hebbian learning rules are biologically plausible because they are local: the weight of an (anti-)Hebbian synapse is proportional to the (minus) correlation in activity between the two neurons the synapse connects.
In networks for dimensionality reduction, for example, feedforward connections use Hebbian rules and lateral connections use anti-Hebbian rules, Figure 1. Hebbian rules attempt to align each neuronal feature vector, whose components are the weights of synapses impinging onto the neuron, with the input space direction of greatest variance. Anti-Hebbian rules mediate competition among neurons, which prevents their feature vectors from aligning in the same direction. The rivalry between the two kinds of rules results in an equilibrium where the synaptic weight vectors span the principal subspace of the input covariance matrix, i.e. the subspace spanned by the eigenvectors corresponding to the largest eigenvalues.
However, in most existing single-layer networks, Figure 1, Hebbian and anti-Hebbian learning rules were postulated rather than derived from a principled objective. Such a derivation should yield better-performing rules and a deeper understanding than has been achieved using heuristic rules. But, until recently, all derivations of single-layer networks from principled objectives led to biologically implausible non-local learning rules, in which the weight of a synapse depends on the activities of neurons other than the two the synapse connects.
Recently, single-layer networks with local learning rules have been derived from similarity matching objective functions (Pehlevan et al., 2015; Pehlevan and Chklovskii, 2014; Hu et al., 2014). But why do similarity matching objectives lead to neural networks with local, Hebbian and anti-Hebbian learning rules? A clear answer to this question has been lacking.
Here, we answer this question by performing several illuminating variable transformations. Specifically, we reduce the full network optimization problem to a set of trivial optimization problems for each synapse, which can be solved locally. Eliminating neural activity variables leads to a min-max objective in terms of feedforward and lateral synaptic weight matrices. This finally formalizes the long-held intuition about the adversarial relationship of Hebbian and anti-Hebbian learning rules.
In this paper, we make the following contributions. In Section 2, we present a more transparent derivation of the previously proposed online similarity matching algorithm for Principal Subspace Projection (PSP). In Section 3, we propose a novel objective for PSP combined with spherizing, or whitening, the data, which we name Principal Subspace Whitening (PSW), and derive from it a biologically plausible online algorithm. Also, in Sections 2 and 3, we demonstrate that stability in the offline setting guarantees projection onto the principal subspace and give principled learning rate recommendations. In Section 4, by eliminating activity variables from the objectives, we derive min-max formulations of PSP and PSW which lend themselves to game-theoretical interpretations. In Section 5, by expressing the optimization objectives in terms of feedforward synaptic weights only, we arrive at novel formulations of dimensionality reduction in terms of fractional powers of matrices. In Section 6, we demonstrate numerically that the performance of our online algorithms is superior to the heuristic ones.
2 From similarity matching to Hebbian/anti-Hebbian networks for PSP
2.1 Derivation of a mixed PSP objective from similarity matching
The PSP problem is formulated as follows. Given T centered input data samples, x_t ∈ ℝ^n, find their projections, y_t ∈ ℝ^k, onto the principal subspace (k ≤ n), i.e. the subspace spanned by the eigenvectors corresponding to the top k eigenvalues of the input covariance matrix:
(1) C ≡ (1/T) Σ_{t=1}^T x_t x_t⊤ = (1/T) X X⊤,
where we resort to matrix notation by concatenating the input column vectors into X = [x_1, …, x_T] ∈ ℝ^{n×T}. Similarly, the outputs are Y = [y_1, …, y_T] ∈ ℝ^{k×T}.
Our goal is to derive a biologically plausible single-layer neural network implementing PSP by optimizing a principled objective. Biological plausibility requires that the learning rules are local, i.e. a synaptic weight update depends on the activity of only the two neurons the synapse connects. The only PSP objective known to yield a single-layer neural network with local learning rules is based on similarity matching (Pehlevan et al., 2015). This objective, borrowed from Multi-Dimensional Scaling (MDS), minimizes the mismatch between the similarity of inputs and outputs (Mardia et al., 1980; Williams, 2001; Cox and Cox, 2000):
(2) min_Y (1/T²) ||X⊤X − Y⊤Y||²_F.
Here, similarity is quantified by the inner products between all pairs of inputs (outputs) comprising the Grammians X⊤X (Y⊤Y).
One can understand intuitively that the objective (2) is optimized by the projection onto the principal subspace by considering the following argument (for a rigorous proof see (Pehlevan and Chklovskii, 2015; Mardia et al., 1980; Cox and Cox, 2000)). First, substitute a Singular Value Decomposition (SVD) for the matrices X and Y and note that the mismatch is minimized by matching the right singular vectors of Y to those of X. Then, rotating the Grammians to the diagonal basis reduces the minimization problem to minimizing the mismatch between the corresponding squared singular values. Therefore, Y is given by the top k right singular vectors of X scaled by the corresponding singular values. As the objective (2) is invariant to the left-multiplication of Y by an orthogonal matrix, it has infinitely many degenerate solutions. One such solution corresponds to Principal Component Analysis (PCA).
Unlike non-neural-network formulations of PSP or PCA, similarity matching outputs principal components (scores) rather than principal eigenvectors of the input covariance (loadings). This difference in formulation is motivated by our interest in PSP or PCA neural networks (Diamantaras and Kung, 1996) that output principal components, Y, rather than principal eigenvectors. Principal eigenvectors are not transmitted downstream of the network but can be recovered computationally from the synaptic weight matrices. Although synaptic weights do not enter the objective (2), in previous work (Pehlevan et al., 2015) they arose naturally in the derivation of the online algorithm (see below) and stored correlations between input and output neural activities.
Next, we derive the min-max PSP objective from Eq. (2), starting by expanding the square of the Frobenius norm:
(3) min_Y (1/T²) [ Tr(X⊤X X⊤X) − 2 Tr(X⊤X Y⊤Y) + Tr(Y⊤Y Y⊤Y) ].
We can rewrite Eq. (3) by introducing two new dynamical variable matrices, W and M, in place of the covariance matrices YX⊤/T and YY⊤/T:
(4) −(2/T²) Tr(X⊤X Y⊤Y) = min_W [ −(4/T) Tr(X⊤W⊤Y) + 2 Tr(W⊤W) ],
    (1/T²) Tr(Y⊤Y Y⊤Y) = max_M [ (2/T) Tr(Y⊤MY) − Tr(M⊤M) ],
yielding
(5) min_Y min_W max_M L(W, M, Y),
    L(W, M, Y) ≡ −(4/T) Tr(X⊤W⊤Y) + 2 Tr(W⊤W) + (2/T) Tr(Y⊤MY) − Tr(M⊤M).
To see that Eq. (5) is equivalent to Eq. (3), find the optimal W* = YX⊤/T and M* = YY⊤/T by setting the corresponding derivatives of objective (5) to zero. Then, substitute W* and M* back into Eq. (5) to obtain (3).
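These variable substitutions are easy to verify numerically. The following NumPy sketch (the dimensions are arbitrary assumptions) checks both identities at the optimal W* = YX⊤/T and M* = YY⊤/T for random matrices, and that perturbing W* away from the optimum only increases the objective:

```python
import numpy as np

rng = np.random.default_rng(0)
T, n, k = 50, 4, 2
X = rng.standard_normal((n, T))   # inputs
Y = rng.standard_normal((k, T))   # outputs

# Substitution for the input-output term: the minimizer is W* = Y X^T / T.
W = Y @ X.T / T
lhs = -(2 / T**2) * np.trace(X.T @ X @ Y.T @ Y)
obj = lambda V: -(4 / T) * np.trace(X.T @ V.T @ Y) + 2 * np.trace(V.T @ V)
assert np.isclose(lhs, obj(W))

# Substitution for the output-output term: the maximizer is M* = Y Y^T / T.
M = Y @ Y.T / T
lhs2 = (1 / T**2) * np.trace(Y.T @ Y @ Y.T @ Y)
rhs2 = (2 / T) * np.trace(Y.T @ M @ Y) - np.trace(M.T @ M)
assert np.isclose(lhs2, rhs2)

# W* is indeed the minimizer: any perturbation increases the objective.
for _ in range(5):
    assert obj(W + 0.1 * rng.standard_normal(W.shape)) >= obj(W) - 1e-9
```

The same check, run with different random seeds and shapes, confirms that the equivalence holds identically and not just at special points.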
Finally, we exchange the order of minimization with respect to Y and W, as well as the order of minimization with respect to Y and maximization with respect to M in Eq. (5). The last exchange is justified by the saddle point property (see Proposition 1 in Appendix A). Then, we arrive at the following min-max optimization problem:
(6) min_W max_M min_Y L(W, M, Y),
where L(W, M, Y) is defined in Eq. (5). We call this a mixed objective because it includes both the output variables, Y, and the covariances, W and M.
2.2 Offline PSP algorithm
In this section, we present an offline optimization algorithm to solve the PSP problem and analyze the fixed points of the corresponding dynamics. These results will be used in the next section to derive the biologically plausible online algorithm implemented by neural networks.
In the offline setting, we can solve Eq. (6) by the alternating optimization approach commonly used in the neural networks literature (Olshausen et al., 1996; Olshausen and Field, 1997; Arora et al., 2015). We, first, minimize with respect to Y while keeping W and M fixed,
(7) Y = M^{-1} W X,
and, second, make a gradient descent-ascent step with respect to W and M while keeping Y fixed:
(8) W ← W + 2η (YX⊤/T − W),   M ← M + (η/τ) (YY⊤/T − M),
where η is the learning rate for W and τ is the ratio of the learning rates for W and M. In Appendix C, we analyze how τ affects the linear stability of the fixed-point dynamics. These two phases are iterated until convergence (Algorithm 1).¹

¹ This alternating optimization is identical to a gradient descent-ascent (see Proposition 2 in Appendix B) in W and M on the objective:
(10) L(W, M) = −(2/T) Tr(X⊤W⊤M^{-1}WX) + 2 Tr(W⊤W) − Tr(M⊤M).
The optimal Y in Eq. (9) exists because M stays positive definite if initialized as such.
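As an illustration, the offline procedure can be sketched in a few lines of NumPy (the data dimensions, learning rates, and iteration count below are illustrative assumptions; τ = 1/2 is chosen inside the stable regime, and the neural filters are recovered as F = M^{-1}W):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, T = 6, 2, 1000
# Synthetic centered data with a well-separated covariance spectrum.
X = np.diag([3.0, 2.0, 1.0, 0.5, 0.3, 0.1]) @ rng.standard_normal((n, T))

eta, tau = 0.02, 0.5                          # learning rate and learning-rate ratio
W = 0.1 * rng.standard_normal((k, n))
M = np.eye(k)

for _ in range(5000):
    Y = np.linalg.solve(M, W @ X)             # phase 1: Y = M^{-1} W X
    W += 2 * eta * (Y @ X.T / T - W)          # phase 2: gradient descent in W
    M += (eta / tau) * (Y @ Y.T / T - M)      # phase 2: gradient ascent in M

F = np.linalg.solve(M, W)                     # neural filters
C = X @ X.T / T
V = np.linalg.eigh(C)[1][:, -k:]              # top-k principal eigenvectors of C
P_F = F.T @ np.linalg.solve(F @ F.T, F)       # projector onto the filter row space
orth_err = np.linalg.norm(F @ F.T - np.eye(k))
subspace_err = np.linalg.norm(P_F - V @ V.T)
```

At convergence, both `orth_err` and `subspace_err` should be small, consistent with the orthonormality and principal subspace properties stated in Theorem 1 below.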
2.3 Linearly stable fixed points of Algorithm 1 correspond to the PSP
Here we demonstrate that convergence of Algorithm 1 to fixed W and M implies that Y is a PSP of X. To this end, we approximate the gradient descent-ascent dynamics in the limit of small learning rate with the system of differential equations:
(11) dW/dt = 2 (YX⊤/T − W),   τ dM/dt = YY⊤/T − M,   with Y = M^{-1}WX,
where t is now the time index for the gradient descent-ascent dynamics.
To state our main result in Theorem 1, we define the "filter matrix" F, whose rows are the "neural filters",
(12) F ≡ M^{-1} W,
so that, according to Eq. (9),
(13) Y = F X.
Theorem 1.
Fixed points of the dynamical system (2.3) have the following properties:

The neural filters, F, are orthonormal, i.e. F F⊤ = I_k, where I_k is the k × k identity matrix.

The neural filters span a k-dimensional subspace in ℝ^n spanned by some k eigenvectors of the input covariance matrix C.

Stability of a fixed point requires that the neural filters span the principal subspace of C.

Suppose the neural filters span the principal subspace. Define
(14) where λ_1 ≥ λ_2 ≥ … ≥ λ_k are the top k principal eigenvalues of C. We assume λ_k > λ_{k+1}. This fixed point is linearly stable if and only if:
(15) holds for all pairs (i, j). By linearly stable we mean that linear perturbations of W and M converge to a configuration in which the new neural filters are merely rotations, within the principal subspace, of the original neural filters.
Proof.
See Appendix C. ∎
Based on Theorem 1, we claim that, provided the dynamics converges to a fixed point, Algorithm 1 has found a PSP of the input data. Note that the orthonormality of the neural filters is desired and consistent with PSP since, in this approach, the outputs, Y, are interpreted as coordinates with respect to a basis spanning the principal subspace.
Theorem 1 yields a practical recommendation for choosing learning rate parameters in simulations. In a typical situation, one will not know the eigenvalues of the covariance matrix a priori but can rely on the fact that they are nonnegative. Then, Eq. (15) implies that for τ ≤ 1/2 the principal subspace is linearly stable, leading to numerical convergence and stability.
2.4 Online neural min-max optimization algorithms
Unlike the offline setting considered so far, where all the input data are available from the outset, in the online setting input data are streamed to the algorithm sequentially, one at a time. The algorithm must compute the corresponding output before the next input arrives and transmit it downstream. Once transmitted, the output cannot be altered. Moreover, the algorithm cannot store in memory any sizable fraction of past inputs or outputs, only a few state variables.
Whereas developing algorithms for the online setting is more challenging than for the offline setting, it is necessary both for data analysis and for modeling biological neural networks. The size of modern datasets may exceed that of available RAM, and/or the output must be computed before the dataset is fully streamed. Biological neural networks operating on data streamed by the sensory organs are incapable of storing any significant fraction of it and compute the output on the fly.
Pehlevan et al. (2015) gave a derivation of a neural online algorithm for PSP, starting from the original similarity matching cost function (2). Here, instead, we start from the min-max form of similarity matching (6) and end up with a class of algorithms that reduce to the algorithm of Pehlevan et al. (2015) for special choices of learning rates. Our main contribution, however, is that the current derivation is much simpler and more intuitive, with insights into why similarity matching leads to local learning rules.
We start by rewriting the min-max PSP objective (6) as a sum of time-separable terms that can be optimized independently:
(16) min_W max_M (1/T) Σ_{t=1}^T l_t(W, M),
where
(17) l_t(W, M) = min_{y_t} l_t(W, M, y_t),
and
(18) l_t(W, M, y_t) = 2 Tr(W⊤W) − Tr(M⊤M) − 4 x_t⊤ W⊤ y_t + 2 y_t⊤ M y_t.
This separation in time is a benefit of the min-max PSP objective (6) and leads to a natural way to derive an online algorithm, which was not available for the original similarity matching cost function (2).
To solve the optimization problem, Eq. (16), in the online setting, we optimize each l_t sequentially. For each t, first, minimize Eq. (18) with respect to y_t while keeping W and M fixed. Second, make a gradient descent-ascent step with respect to W and M for fixed y_t:
(19) W ← W + 2η (y_t x_t⊤ − W),   M ← M + (η/τ) (y_t y_t⊤ − M),
where η is the learning rate and τ is the ratio of the W and M learning rates. As before, Proposition 2 (Appendix B) ensures that the online gradient descent-ascent updates, Eq. (19), follow from alternating optimization (Olshausen et al., 1996; Olshausen and Field, 1997; Arora et al., 2015) of l_t(W, M).
(20) ẏ_t = W x_t − M y_t,
(21) y_t = M^{-1} W x_t.
Algorithm 2 can be implemented by a biologically plausible neural network. The dynamics (20) corresponds to neural activity in a recurrent circuit, where W is the feedforward synaptic weight matrix and −M is the lateral synaptic weight matrix, Fig. 1A. Since M is always positive definite, Eq. (18) is a Lyapunov function for the neural activity. Hence the dynamics is guaranteed to converge to a unique fixed point, y_t = M^{-1} W x_t, where the matrix inversion is computed iteratively in a distributed manner.
The updates of the covariance matrices W and M, Eq. (19), can be interpreted as synaptic learning rules: Hebbian for the feedforward and anti-Hebbian (due to the minus sign in (20)) for the lateral synaptic weights. Importantly, these rules are local (the weight of each synapse depends only on the activity of the two neurons that synapse connects) and therefore biologically plausible.
Even requiring full optimization with respect to y_t vs. a gradient step with respect to W and M may have a biological justification. As neural activity dynamics is typically faster than synaptic plasticity, it may settle before the arrival of the next input.
To see why similarity matching leads to local learning rules, let us consider Eqs. (6) and (16). Aside from separating in time, which is useful for the derivation of online learning rules, l_t also separates over synaptic weights and their pre- and postsynaptic neural activities,
(22) l_t(W, M, y_t) = Σ_{i,j} ( 2 W_ij² − 4 W_ij y_{t,i} x_{t,j} ) − Σ_{i,j} ( M_ij² − 2 M_ij y_{t,i} y_{t,j} ).
Therefore, a derivative with respect to a synaptic weight depends only on quantities accessible to that synapse.
Finally, we address two potential criticisms of the neural PSP algorithm. The first is the existence of autapses, i.e. self-coupling of neurons, in our network, manifested in the nonzero diagonal of the lateral connectivity matrix, M, Fig 1A. Whereas autapses are encountered in the brain, they are rarely seen in principal neurons (Ikeda and Bekkers, 2006). The second is the symmetry of lateral synaptic weights in our network, which is not observed experimentally. We derive an autapse-free network architecture (zeros on the diagonal of the lateral synaptic weight matrix M) with asymmetric lateral connectivity, Fig 1B, by using coordinate descent (Pehlevan et al., 2015) in place of gradient descent in the neural dynamics stage (20) (see Appendix F). The resulting algorithm produces the same outputs as the current algorithm and, for a special choice of learning rates, reduces to the algorithm with "forgetting" of Pehlevan et al. (2015).
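A minimal sketch of the online PSP algorithm in NumPy follows (the stream statistics, learning rates, and the replacement of the neural dynamics (20) by its fixed point y_t = M^{-1}W x_t are simplifying assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 6, 2
scales = np.array([3.0, 2.0, 1.0, 0.5, 0.3, 0.1])  # input std devs; the top-2 subspace is axis-aligned

def subspace_error(W, M):
    # Distance between the filter row space and the true principal subspace.
    F = np.linalg.solve(M, W)
    P_F = F.T @ np.linalg.solve(F @ F.T, F)
    P_V = np.diag([1.0, 1.0, 0.0, 0.0, 0.0, 0.0])
    return np.linalg.norm(P_F - P_V)

eta, tau = 0.01, 0.5
W = 0.1 * rng.standard_normal((k, n))
M = np.eye(k)
err_init = subspace_error(W, M)

for t in range(20000):
    x = scales * rng.standard_normal(n)
    y = np.linalg.solve(M, W @ x)              # neural dynamics settled at its fixed point
    W += 2 * eta * (np.outer(y, x) - W)        # Hebbian feedforward update
    M += (eta / tau) * (np.outer(y, y) - M)    # anti-Hebbian lateral update

err_final = subspace_error(W, M)
```

With a fixed learning rate, the filters fluctuate around the principal subspace, so `err_final` plateaus at a small but nonzero value; a decaying learning rate would let it approach zero.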
3 From constrained similarity matching to Hebbian/anti-Hebbian networks for PSW
The variable substitution method we introduced in the previous section can be applied to other computational objectives in order to derive neural networks with local learning rules. To give an example, we derive a neural network for PSW, which can be formulated as a constrained similarity matching problem. This example also illustrates how an optimization constraint can be implemented by biological mechanisms.
3.1 Derivation of PSW from constrained similarity matching
The PSW problem is closely related to PSP: project the centered input data samples onto the principal subspace (k ≤ n) and "spherize" the data in the subspace so that the variances in all directions are 1. To derive a neural PSW algorithm, we use the similarity matching objective with an additional whitening constraint:
(23) min_Y ||X⊤X − Y⊤Y||²_F,   s.t. (1/T) Y Y⊤ = I_k.
We rewrite Eq. (23) by expanding the squared Frobenius norm and dropping the Tr(Y⊤Y Y⊤Y) term, which is constant under the constraint, thus reducing (23) to a constrained similarity alignment problem:
(24) max_Y Tr(X⊤X Y⊤Y),   s.t. (1/T) Y Y⊤ = I_k.
To see that objective (24) is optimized by the PSW, first, substitute a Singular Value Decomposition (SVD) for the matrices X and Y and note that the alignment is maximized by matching the right singular vectors of Y to those of X and rotating to the diagonal basis (for a rigorous proof see (Pehlevan and Chklovskii, 2015)). Since the constraint fixes all the singular values of Y to √T, the objective (24) is reduced to a summation of squared singular values of X and is optimized by choosing the top k. Then, Y is given by the top k right singular vectors of X scaled by √T. As before, objective (24) is invariant to the left-multiplication of Y by an orthogonal matrix and, therefore, has infinitely many degenerate solutions.
Next, we derive a mixed PSW objective from Eq. (24) by introducing two new dynamical variable matrices: the input-output correlation matrix, W, and the Lagrange multiplier matrix, M, for the whitening constraint:
(25) min_Y min_W max_M L_PSW(W, M, Y),
where
(26) L_PSW(W, M, Y) ≡ −(4/T) Tr(X⊤W⊤Y) + 2 Tr(W⊤W) + (2/T) Tr(Y⊤MY) − 2 Tr(M).
To see that Eq. (26) is equivalent to Eq. (24), find the optimal W* = YX⊤/T by setting the corresponding derivative of the objective (26) to zero. Then, substitute W* into Eq. (26) to obtain the Lagrangian of Eq. (24).
Finally, we exchange the order of minimization with respect to Y and W, as well as the order of minimization with respect to Y and maximization with respect to M in Eq. (26) (see Proposition 5 in Appendix D for a proof). Then, we arrive at the following min-max optimization problem with a mixed objective:
(27) min_W max_M min_Y L_PSW(W, M, Y),
where L_PSW(W, M, Y) is defined in Eq. (26).
3.2 Offline PSW algorithm
Next, we give an offline algorithm for the PSW problem, using the same alternating optimization procedure as before. We solve Eq. (27) by, first, optimizing with respect to Y for fixed W and M and, second, making a gradient descent-ascent step with respect to W and M while keeping Y fixed.²

² This alternating optimization is identical to a gradient descent-ascent (see Proposition 2 in Appendix B) in W and M on the objective:
(29) L_PSW(W, M) = −(2/T) Tr(X⊤W⊤M^{-1}WX) + 2 Tr(W⊤W) − 2 Tr(M).
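A minimal sketch of this offline PSW procedure in NumPy (dimensions, learning rates, and iteration count are illustrative assumptions; the M step performs gradient ascent on the Lagrange multiplier, which pushes the output covariance toward the identity):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k, T = 6, 2, 1000
X = np.diag([3.0, 2.0, 1.0, 0.5, 0.3, 0.1]) @ rng.standard_normal((n, T))

eta, tau = 0.005, 0.1                           # low tau: constraint enforced quickly
W = 0.1 * rng.standard_normal((k, n))
M = np.eye(k)

for _ in range(10000):
    Y = np.linalg.solve(M, W @ X)               # phase 1: Y = M^{-1} W X
    W += 2 * eta * (Y @ X.T / T - W)            # phase 2: gradient descent in W
    M += (eta / tau) * (Y @ Y.T / T - np.eye(k))  # ascent in the Lagrange multiplier M

whitening_err = np.linalg.norm(Y @ Y.T / T - np.eye(k))
```

At convergence, `whitening_err` should be small: the outputs are spherized within the principal subspace.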
3.3 Linearly stable fixed points of Algorithm 3 correspond to PSW
Here we claim that convergence of Algorithm 3 to fixed W and M implies a PSW of X. In the limit of small learning rate, the gradient descent-ascent dynamics can be approximated with the system of differential equations:
(30) dW/dt = 2 (YX⊤/T − W),   τ dM/dt = YY⊤/T − I_k,   with Y = M^{-1}WX,
where t is now the time index for the gradient descent-ascent dynamics. We again define the neural filter matrix F ≡ M^{-1}W.
Theorem 2.
Fixed points of the dynamical system (3.3) have the following properties:

The outputs are whitened, i.e. (1/T) Y Y⊤ = I_k.

The neural filters span a k-dimensional subspace in ℝ^n which is spanned by some k eigenvectors of the input covariance matrix.

Stability of the fixed point requires that the neural filters span the principal subspace of C.

Suppose the neural filters span the principal subspace. This fixed point is linearly stable if and only if
(31) holds for all pairs (i, j). By linear stability we mean that linear perturbations of W and M converge to a rotation, within the principal subspace, of the original neural filters.
Proof.
See Appendix E. ∎
3.4 Online algorithm for PSW
As before, we start by rewriting the min-max PSW objective (27) as a sum of time-separable terms that can be optimized independently:
(32) min_W max_M (1/T) Σ_{t=1}^T l_t^PSW(W, M),
where
(33) l_t^PSW(W, M) = min_{y_t} [ l_t(W, M, y_t) + Tr(M⊤M) − 2 Tr(M) ],
and l_t(W, M, y_t) is defined in Eq. (18). In the online setting, Eq. (32) can be optimized by sequentially minimizing each l_t^PSW. For each t, first, minimize (18) with respect to y_t for fixed W and M and, second, update W and M according to a gradient descent-ascent step for fixed y_t:
(34) W ← W + 2η (y_t x_t⊤ − W),   M ← M + (η/τ) (y_t y_t⊤ − I_k),
where η is the learning rate and τ is the ratio of the W and M learning rates.
As before, Proposition 2 ensures that the online gradient descent-ascent updates, Eq. (34), follow from alternating optimization (Olshausen et al., 1996; Olshausen and Field, 1997; Arora et al., 2015) of l_t^PSW(W, M).
(35) ẏ_t = W x_t − M y_t,
(36) y_t = M^{-1} W x_t.
Algorithm 4 can be implemented by a biologically plausible single-layer neural network with lateral connections, as in Algorithm 2, Fig. 1A. The updates to the synaptic weights, Eq. (34), are local, Hebbian/anti-Hebbian plasticity rules. An autapse-free network architecture, Fig 1B, may be obtained by using coordinate descent (Pehlevan et al., 2015) in place of gradient descent in the neural dynamics stage (35) (see Appendix G).
The lateral connections here are the Lagrange multipliers introduced in the offline problem, Eq. (26). In the PSP network, in contrast, they resulted from a variable transformation of the output covariance matrix. This difference carries over to the learning rules: in Algorithm 4, the lateral learning rule enforces the whitening of the output, whereas in Algorithm 2, the lateral learning rule sets the lateral weight matrix to the output covariance matrix.
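In code, the online PSW network differs from the online PSP network only in the lateral update. A sketch (under the same simplifying assumptions as for PSP: fixed stream statistics, small fixed learning rates, and neural dynamics replaced by its fixed point):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 6, 2
scales = np.array([3.0, 2.0, 1.0, 0.5, 0.3, 0.1])

eta, tau = 0.005, 0.1
W = 0.1 * rng.standard_normal((k, n))
M = np.eye(k)

for t in range(40000):
    x = scales * rng.standard_normal(n)
    y = np.linalg.solve(M, W @ x)                     # output of the recurrent circuit
    W += 2 * eta * (np.outer(y, x) - W)               # Hebbian feedforward update
    M += (eta / tau) * (np.outer(y, y) - np.eye(k))   # lateral update enforces whitening

# The outputs on fresh inputs should now be approximately white.
Y = np.stack([np.linalg.solve(M, W @ (scales * rng.standard_normal(n)))
              for _ in range(2000)], axis=1)
whitening_err = np.linalg.norm(Y @ Y.T / Y.shape[1] - np.eye(k))
```

Because the learning rate is fixed, the weights fluctuate around the fixed point, so the output covariance is only approximately the identity; the fluctuations shrink with the learning rate.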
4 Game-theoretical interpretation of Hebbian/anti-Hebbian learning
In the original similarity matching objective, Eq. (2), the only variables are neuronal activities which, at the optimum, represent principal components. In Section 2, we rewrote these objectives by introducing the matrices W and M corresponding to synaptic connection weights, Eq. (5). Here, we eliminate the neural activity variables altogether and arrive at a min-max formulation in terms of the feedforward, W, and lateral, M, connection weight matrices only. This formulation lends itself to a game-theoretical interpretation.
Since, in the offline PSP setting, the optimal M in Eq. (6) is an invertible matrix (because M stays positive definite, see also Appendix A), we can restrict our optimization to invertible matrices, M, only. Then, we can optimize objective (5) with respect to Y and substitute its optimal value, Y = M^{-1}WX, into (5) and (6) to obtain:
(37) min_W max_M −(2/T) Tr(X⊤W⊤M^{-1}WX) + 2 Tr(W⊤W) − Tr(M⊤M),   s.t. M is invertible.
This min-max objective admits a game-theoretical interpretation in which the feedforward, W, and lateral, M, synaptic weight matrices oppose each other. To reduce the objective, the feedforward synaptic weight vectors of each output neuron attempt to align with the direction of maximum variance of the input data. However, if this were the only driving force, all output neurons would learn the same synaptic weight vectors and represent the same top principal component. At the same time, linear dependency between different feedforward synaptic weight vectors can be exploited by the lateral synaptic weights to increase the objective by cancelling the contributions of different components. To avoid this, the feedforward synaptic weight vectors become linearly independent and span the principal subspace.
A similar interpretation can be given for PSW, where the feedforward, W, and lateral, M, synaptic weight matrices again oppose each other adversarially.
5 Novel formulations of dimensionality reduction using fractional exponents
In this section, we point to a new class of dimensionality reduction objective functions that naturally follow from the min-max objectives (5) and (6). Eliminating both the neural activity variables, Y, and the lateral connection weight matrix, M, we arrive at optimization problems in terms of the feedforward weight matrix, W, only. The rows of the optimal W form a non-orthogonal basis of the principal subspace. Such formulations of principal subspace problems involve fractional exponents of matrices and, to the best of our knowledge, have not been proposed previously.
By replacing the optimization over M in the min-max PSP objective, Eq. (6), by its saddle point value (see Proposition 1 in Appendix A), we find the following objective expressed solely in terms of W:
(38) min_W 2 Tr(W⊤W) − 3 Tr( ( (1/T) W X X⊤ W⊤ )^{2/3} ).
The rows of the optimal W are not principal eigenvectors; rather, the row space of W spans the principal subspace.
By replacing the optimization over M in the min-max PSW objective, Eq. (27), by its optimal value (see Proposition 5 in Appendix D), we obtain:
(39) min_W 2 Tr(W⊤W) − 4 Tr( ( (1/T) W X X⊤ W⊤ )^{1/2} ).
As before, the rows of the optimal W are not principal eigenvectors; rather, the row space of W spans the principal eigenspace.
We observe that the only material difference between Eqs. (38) and (39) is the value of the fractional exponent. Based on this, we conjecture that any objective function of this form, with a fractional exponent from a continuous range, is optimized by spanning the principal subspace. Such solutions would differ in the eigenvalues associated with the corresponding components.
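The elimination of M behind Eqs. (38) and (39) can be checked numerically. The sketch below is our reconstruction: it assumes that, for PSP, the maximizing multiplier is M* = (WCW⊤)^{1/3} with C = XX⊤/T, which yields the saddle value 2Tr(WW⊤) − 3Tr((WCW⊤)^{2/3}). It verifies that this closed form matches the objective at M* and that M* beats nearby positive definite alternatives:

```python
import numpy as np

def mat_power(A, p):
    # Fractional power of a symmetric positive definite matrix via eigendecomposition.
    w, V = np.linalg.eigh(A)
    return (V * w**p) @ V.T

rng = np.random.default_rng(4)
n, k, T = 5, 2, 200
X = rng.standard_normal((n, T))
W = rng.standard_normal((k, n))
C = X @ X.T / T
A = W @ C @ W.T                          # k x k, generically positive definite

def g(M):
    # The M-dependent part of the min-max objective, W held fixed:
    # -2 Tr(M^{-1} A) + 2 Tr(W W^T) - Tr(M^2)
    return (-2 * np.trace(np.linalg.solve(M, A))
            + 2 * np.trace(W @ W.T) - np.trace(M @ M))

M_star = mat_power(A, 1 / 3)             # assumed maximizer M* = A^{1/3}
closed_form = 2 * np.trace(W @ W.T) - 3 * np.trace(mat_power(A, 2 / 3))
assert np.isclose(g(M_star), closed_form)

# M* should beat nearby symmetric (still positive definite) perturbations.
for _ in range(10):
    S = rng.standard_normal((k, k))
    assert g(M_star + 0.05 * (S + S.T)) <= g(M_star) + 1e-9
```

The objective g is concave in M on the positive definite cone, so the stationary point M* is the global maximizer, which is what the perturbation test probes.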
A supporting argument for our conjecture comes from the work of Miao and Hua (1998), which studied the cost
(40) min_W Tr(W⊤W) − ln det( (1/T) W X X⊤ W⊤ ).
Eq. (40) can be seen as a limiting case of our conjecture, in which the fractional exponent goes to zero. Indeed, Miao and Hua (1998) proved that the rows of the optimal W form an orthonormal basis for the principal eigenspace.
6 Numerical experiments
Next, we test our findings using a simple artificial dataset. We generated an n-dimensional dataset and simulated our offline and online algorithms to reduce it to k dimensions, using different values of the parameter τ. The results are plotted in Figs. 2, 3, 4 and 5, with the details of the simulations given in the figure captions.
Consistent with Theorems 1 and 2, small perturbations to the PSP and PSW fixed points decayed (solid lines) or grew (dashed lines) depending on the value of τ, Fig. 2A. Offline simulations starting from random initial conditions converged to the PSP (or the PSW) solution if the fixed point was linearly stable, Fig. 2B. Interestingly, the online algorithms' performance was very close to that of the offline ones, Fig. 2C.
The error for linearly unstable simulations in Fig. 2 saturates rather than blowing up. This may seem at odds with Theorems 1 and 2, which state that if there is a stable fixed point of the dynamics, it should be the PSP/PSW solution. A closer look resolves this dilemma. In Fig. 3, we plot the evolution of an element of the W matrix in the offline algorithms for stable and unstable choices of τ. When the principal subspace is linearly unstable, the synaptic weights exhibit undamped oscillations. The dynamics seems to be confined to a manifold at a fixed distance (in terms of the error metric) from the principal subspace. That the error does not grow to infinity is a result of the stabilizing effect of the min-max antagonism of the synaptic weights. Online algorithms behave similarly.

Next, we studied in detail the effect of the parameter τ on convergence. For the offline algorithms, we plot the error after a fixed number of gradient steps as a function of τ. For PSP, there is an optimal τ. Decreasing τ beyond the optimal value does not degrade performance; however, increasing it leads to a rapid increase in the error. For PSW, there is a plateau of low error for low values of τ but a rapid increase in error as one approaches the linear instability threshold. Online algorithms behave similarly.
Finally, we compared the performance of our online PSP algorithm to neural PSP algorithms with heuristic learning rules such as the Subspace Network (Oja, 1989) and the Generalized Hebbian Algorithm (GHA) (Sanger, 1989), on the same dataset. We found that our algorithm converges much faster (Fig. 5). Previously, the original similarity matching network (Pehlevan et al., 2015), which is a special case of the online PSP algorithm of this paper, was shown to converge faster than the APEX (Kung et al., 1994) and Földiak’s (Földiak, 1989) networks.
7 Discussion
In this paper, through transparent variable substitutions, we demonstrated why biologically plausible neural networks can be derived from similarity matching objectives, mathematically formalized the adversarial relationship between Hebbian feedforward and anti-Hebbian lateral connections using min-max optimization lending itself to a game-theoretical interpretation, and formulated dimensionality reduction tasks as optimizations of fractional powers of matrices. The formalism we developed should generalize to unsupervised tasks other than dimensionality reduction and could provide a theoretical foundation for both natural and artificial neural networks.
In comparing our networks with biological ones, most importantly, our networks rely only on local learning rules that can be implemented by synaptic plasticity. While Hebbian learning is famously observed in neural circuits (Bliss and Lømo, 1973; Bliss and Gardner-Medwin, 1973), our networks also require anti-Hebbian learning, which can be interpreted as the long-term potentiation of inhibitory postsynaptic potentials. Experimentally, such long-term potentiation can arise from pairing action potentials in inhibitory neurons with subthreshold depolarization of postsynaptic pyramidal neurons (Komatsu, 1994; Maffei et al., 2006). However, plasticity in inhibitory synapses does not have to be Hebbian, i.e. depend on the correlation between pre- and postsynaptic activity (Kullmann et al., 2012).
To make progress, we had to make several simplifications sacrificing biological realism. In particular, we assumed that neuronal activity is a continuous variable, which would correspond to membrane depolarization (in graded potential neurons) or firing rate (in spiking neurons). We ignored the nonlinearity of the neuronal input-output function. Such a linear regime could be implemented via a resting state bias (in graded potential neurons) or a resting firing rate (in spiking neurons).
The applicability of our networks as models of biological networks can be judged by experimentally testing the following predictions. First, we predict a relationship between the feedforward and lateral synaptic weight matrices, which could be tested using modern connectomics datasets. Second, we suggest that the similarity of output activity matches that of the input, which could be tested by neuronal population activity measurements using calcium imaging.
Often the choice of learning rate is crucial to the learning performance of neural networks. Here, we encountered a nuanced case where the ratio of the feedforward and lateral weight learning rates, τ, affects the learning performance significantly. First, there is a maximum value of this ratio beyond which the principal subspace solution is linearly unstable. The maximum value depends on the principal eigenvalues, but for PSP, τ ≤ 1/2 is always linearly stable. For PSW there is no always-safe choice; having the same learning rates for the feedforward and lateral weights, τ = 1, may actually be unstable. Second, linear stability is not the only factor affecting performance. In simulations, for PSP, we observed that there is an optimal value of τ. For PSW, decreasing τ seems to increase performance until a plateau is reached. This difference between PSP and PSW may be attributed to the different origins of the lateral connectivity. In the PSW algorithms, the lateral weights originate from Lagrange multipliers enforcing an optimization constraint. Low τ, meaning a higher lateral learning rate, forces the network to satisfy the constraint during the whole evolution of the algorithm.
Based on these observations, we can make practical suggestions for the parameter τ. For PSP, τ = 1/2 seems to be a good choice, which is also preferred from another derivation of an online similarity matching algorithm (Pehlevan et al., 2015). For PSW, the smaller the τ, the better, although one should make sure that the lateral weight learning rate, η/τ, remains sufficiently small.
Acknowledgments
We thank Alex Genkin, Sebastian Seung, Mariano Tepper and Jonathan Zung for discussions.
Appendix A Proof of the strong min-max property for the PSP objective
Here we show that minimization with respect to Y and maximization with respect to M can be exchanged in Eq. (5). We will make use of the following min-max theorem (Boyd and Vandenberghe, 2004), for which we give a proof for completeness:
Theorem 3.
Let f : ℝ^m × ℝ^n → ℝ. Suppose the saddle-point property holds, i.e., there exists (x*, y*) such that, for all x and y,
(41) f(x*, y) ≤ f(x*, y*) ≤ f(x, y*).
Then,
(42) min_x max_y f(x, y) = max_y min_x f(x, y) = f(x*, y*).
Proof.
For all x, max_y f(x, y) ≥ f(x, y*), and for all y, min_x f(x, y) ≤ f(x*, y), which, together with Eq. (41), implies
(43) min_x max_y f(x, y) ≤ max_y f(x*, y) = f(x*, y*) = min_x f(x, y*) ≤ max_y min_x f(x, y).
Since max_y min_x f(x, y) ≤ min_x max_y f(x, y) is always true, we get an equality. ∎
Now, we present the main result of this section.
Proposition 1.
Define
(44) L(W, M, Y) ≡ −(4/T) Tr(X⊤W⊤Y) + 2 Tr(W⊤W) + (2/T) Tr(Y⊤MY) − Tr(M⊤M),
where X, Y, W and M are arbitrarily sized, real-valued matrices of compatible dimensions. L obeys a strong min-max property:
(45) min_Y max_M L(W, M, Y) = max_M min_Y L(W, M, Y).
Proof.
We will show that the saddle-point property holds for Eq. (44). Then the result follows from Theorem 3.
If the saddle point (Y*, M*) exists, it is where
(46) M* Y* = W X,   M* = (1/T) Y* Y*⊤.
Note that M* is symmetric and positive semi-definite. Multiplying the first equation in (46) by its own transpose on the right and using the second equation, we arrive at
(47) M*³ = (1/T) W X X⊤ W⊤.
Solutions to Eq. (47) are not unique because M* may not be invertible, depending on W. However, all solutions give the same value of L:
(48) L(W, M*, Y*) = 2 Tr(W⊤W) − 3 Tr( ( (1/T) W X X⊤ W⊤ )^{2/3} ).
Now, we check that the saddle-point property, Eq. (41), holds. The first inequality is satisfied:
(49) L(W, M*, Y*) − L(W, M, Y*) = Tr( (M − M*)² ) ≥ 0.
The second inequality is also satisfied:
(50) L(W, M*, Y) − L(W, M*, Y*) = (2/T) Tr( (Y − Y*)⊤ M* (Y − Y*) ) ≥ 0,
where the last line follows from M* being positive semi-definite. ∎
Appendix B Taking a derivative using a chain rule
Proposition 2.
Suppose a differentiable, scalar function L(θ, Y), where θ and Y are arbitrarily sized matrices. Assume that a finite minimum with respect to Y exists for a given θ:
(51) f(θ) ≡ min_Y L(θ, Y) = L(θ, Y*(θ)),
and that the optimal Y*(θ) is a stationary point:
(52) ∂L(θ, Y)/∂Y |_{Y = Y*(θ)} = 0.
Then,
(53) df(θ)/dθ = ∂L(θ, Y)/∂θ |_{Y = Y*(θ)}.
Proof.
The result follows from an application of the chain rule and the stationarity of the minimum:
(54) df(θ)/dθ = ∂L/∂θ |_{Y = Y*(θ)} + Tr( (∂L/∂Y |_{Y = Y*(θ)})⊤ ∂Y*(θ)/∂θ ),
where the second term is zero due to Eq. (52). ∎
Appendix C Proof of Theorem 1
The fixed points of Eq. (2.3) (using bars to denote fixed-point values) are:
(55) W̄ = F̄ C,   M̄ = F̄ C F̄⊤,   with F̄ ≡ M̄^{-1} W̄,
where C is the input covariance matrix defined as in Eq. (1).
C.1 Proof of item 1
C.2 Proof of item 2
First note that, at fixed points, F̄⊤F̄ and C commute:
(57) F̄⊤ F̄ C = C F̄⊤ F̄,
and therefore F̄⊤F̄ and C share the same eigenvectors. Orthonormality of the neural filters, Eq. (56), implies that the rows of F̄ are degenerate eigenvectors of F̄⊤F̄ with unit eigenvalue. To see this: F̄⊤F̄ F̄⊤ = F̄⊤ (F̄ F̄⊤) = F̄⊤. Because the filters are degenerate, the corresponding shared eigenvectors of C may not be the filters themselves but linear combinations of them. Nevertheless, the shared eigenvectors composed of filters span the same space as the filters.
Since we are interested in PSP, it is desirable that it is the top k eigenvectors of C that span the filter space. A linear stability analysis around the fixed point reveals that any other combination is unstable and that the principal subspace is stable if τ is chosen appropriately.
C.3 Proof of item 3
Preliminaries
In order to perform a linear stability analysis, we linearize the system of equations (2.3) around the fixed point. Even though Eq. (2.3) depends on W and M, we will find it convenient to change variables and work with F and M instead.
Using the relation W = M F, one can express linear perturbations of F around its fixed point, F̄, in terms of perturbations of W and M:
(59) δF = M̄^{-1} (δW − δM F̄),
Linearization of Eq. (2.3) gives:
(60) 
and
(61) 
Using these, we arrive at: