1 Introduction
Inventors of the original artificial neural networks (NNs) derived their inspiration from biology [1]. However, today, most artificial NNs, such as backpropagation-based convolutional deep learning networks, resemble natural NNs only superficially. Given that, on some tasks, such artificial NNs achieve human or even superhuman performance, why should one care about such dissimilarity with natural NNs? The algorithms of natural NNs are relevant if one's goal is not just to outperform humans on certain tasks but to develop general-purpose artificial intelligence rivaling humans. As contemporary artificial NNs are far from achieving this goal and natural NNs, by definition, achieve it, natural NNs must contain some "secret sauce" that artificial NNs lack. This is why we need to understand the algorithms implemented by natural NNs.
Motivated by this argument, we have been developing artificial NNs that could plausibly model natural NNs on the algorithmic level. In our artificial NNs, we do not attempt to reproduce many biological details, as in existing biophysical modeling work, but rather develop algorithms that respect major biological constraints.
For example, biologically plausible algorithms must be formulated in the online (or streaming), rather than offline (or batch), setting. This means that input data are streamed to the algorithm sequentially, and the corresponding output must be computed before the next input sample arrives. Moreover, memory accessible to a biological algorithm is limited so that no significant fraction of previous input or output can be stored.
Another key constraint is that, in biologically plausible NNs, learning rules must be local: a biological synapse can update its weight based on the activity of only the two neurons that the synapse connects. Such "locality" of the learning rule is violated by most artificial NNs, including backpropagation-based deep learning networks. In contrast, our NNs employ exclusively local learning rules. Such rules are also helpful for hardware implementations of artificial NNs in neuromorphic chips
[2, 3]. We derive the algorithms performed by our NNs from optimization objectives. In addition to deriving learning rules for synaptic weights, as is done in existing artificial NNs, we also derive the architecture, activation functions, and dynamics of neural activity from the same objectives. To do this, we postulate only a cost function and an optimization algorithm, which in our case is alternating stochastic gradient descent-ascent [4]. The steps of this algorithm map to a NN, specifying its architecture, activation functions, dynamics, and learning rules. Viewing both weight and activity updates as the steps of an online optimization algorithm allows us to predict the output of our NNs to a wide range of stimuli without relying on exhaustive numerical simulation.

To derive local learning rules, we employ optimization objectives operating with pairwise similarities of inputs and outputs of a NN rather than individual data points. Typically, our objectives favor similar outputs for similar inputs; hence the name similarity-matching objectives. The transformation of dissimilar inputs in the NN depends on the optimization constraints. Despite using pairwise similarities, we still manage to derive online optimization algorithms.
Our focus is on unsupervised learning. This is not a hard constraint but rather a matter of priority. Whereas humans are clearly capable of supervised learning, most of our learning tasks lack big labeled datasets. On the mechanistic level, most neurons lack a clear supervision signal.
This paper is organized as follows. We start by presenting the conventional approach to deriving unsupervised NNs (Section 2). While the conventional approach generates a reasonable algorithmic model of a single biological neuron, multi-neuron networks violate biological constraints. To overcome this difficulty, in Section 3, we introduce similarity-based cost functions and show that linear dimensionality reduction NNs derived from such cost functions are biologically plausible. In Section 4, we introduce a sign-constrained similarity-matching objective and discuss algorithms for sparse feature extraction and nonnegative independent component analysis. In Section 5, we discuss other sign-constrained networks for clustering and manifold learning. We conclude by discussing potential applications of our work to neuromorphic computing and charting future directions.
2 Background
2.1 Single-neuron online Principal Component Analysis (PCA)
In the seminal 1982 paper [5], Oja proposed that a biological neuron can be viewed as an implementation of a mathematical algorithm solving a computational objective. He proposed to model a neuron by an online Principal Component Analysis (PCA) algorithm. As PCA is a workhorse of data analysis used for dimensionality reduction, denoising, and latent factor discovery, Oja's model offers an algorithmic-level description of biological NNs.
Oja's single-neuron online PCA algorithm works as follows. At each time step, $t$, it receives an input data sample, $\mathbf{x}_t \in \mathbb{R}^n$. As our focus is on the online setting, we use the same variable, $t$, to measure time and index the data points. Then, the algorithm computes and outputs the corresponding top principal component value, $y_t$:
y_t = \mathbf{w}_{t-1}^\top \mathbf{x}_t \qquad (1)
where $\mathbf{w}_{t-1} \in \mathbb{R}^n$ is the feature vector computed at time step $t-1$. Here and below, lowercase italic letters designate scalar variables and lowercase boldfaced letters designate vectors.

At the same time step, $t$, after computing the principal component, the algorithm updates the (normalized) feature vector with a learning rate, $\eta$:
\mathbf{w}_t = \mathbf{w}_{t-1} + \eta\, y_t \left( \mathbf{x}_t - y_t \mathbf{w}_{t-1} \right) \qquad (2)
If data are drawn i.i.d. from a stationary distribution with a mean vector of zero, the feature vector, $\mathbf{w}_t$, converges to the eigenvector corresponding to the largest eigenvalue of the input covariance matrix [5].

The steps of the Oja algorithm (1),(2) naturally correspond to the operations of a biological neuron. Assuming that the components of the input vector, $\mathbf{x}_t$, are represented by the activities (firing rates) of the upstream neurons, (1) describes a weighted summation of the inputs by the output neuron. Such weighted summation can be naturally implemented by storing the components of the feature vector, $\mathbf{w}$, in the corresponding synaptic weights. If the activation function is linear, the output, $y_t$, is simply the weighted sum.
The weight update (2) is a biologically plausible local synaptic learning rule. The first term of the update, $\eta\, y_t \mathbf{x}_t$, is proportional to the correlation of the pre- and postsynaptic neurons' activities, and the second term, $-\eta\, y_t^2 \mathbf{w}_{t-1}$, also local, asymptotically normalizes the synaptic weight vector to one. In neuroscience, synaptic weight updates proportional to the correlation of the pre- and postsynaptic neurons' activities are called Hebbian.
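As a concrete illustration, the single-neuron algorithm (1),(2) can be sketched in a few lines of NumPy. The function name, data, and learning rate below are our own illustrative choices, not part of the original formulation:

```python
import numpy as np

def oja_pca(X, eta=0.01, seed=0):
    """Single-neuron online PCA (Oja's rule).

    X: array of shape (T, n); one row arrives per time step.
    Returns the learned feature (weight) vector w.
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(X.shape[1])
    w /= np.linalg.norm(w)
    for x in X:
        y = w @ x                    # Eq. (1): weighted summation of inputs
        w += eta * y * (x - y * w)   # Eq. (2): Hebbian term plus normalization
    return w
```

On zero-mean data with a dominant variance direction, w converges (up to sign) to the top eigenvector of the input covariance.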
2.2 Minimization of the reconstruction error yields biologically implausible multi-neuron networks
Next, we would like to build on Oja's insightful identification of biological processes with the steps of an online PCA algorithm by computing multiple principal components using multi-neuron NNs. Instead of trying to extend the Oja model heuristically, we will derive such networks from the optimization of a principled objective function. Specifically, we postulate that the algorithm minimizes the reconstruction error, derive an online algorithm optimizing this objective, and map the steps of the algorithm onto biological processes.
In the conventional reconstruction error minimization approach, each data sample, $\mathbf{x}_t$, is approximated as a linear combination of each neuron's feature vector weighted by its activity [4]. Then the minimization of the reconstruction (or coding) error can be expressed as follows:
\min_{\mathbf{W}} \sum_{t=1}^{T} \min_{\mathbf{y}_t} \left\| \mathbf{x}_t - \mathbf{W} \mathbf{y}_t \right\|_2^2 \qquad (3)
where the matrix $\mathbf{W} \in \mathbb{R}^{n \times k}$ is a concatenation of feature column-vectors and $T$ is both the number of data samples and (in the online setting) the number of time steps.

In the offline setting, a solution to the optimization problem (3) is PCA: the columns of the optimum $\mathbf{W}$ form a basis for the $k$-dimensional principal subspace [6]. Elements of $\mathbf{W}$ could be constrained to avoid unreasonably low or high values.
In the online setting, (3) can be solved by alternating minimization [4]. After the arrival of data sample, $\mathbf{x}_t$, the feature vectors are kept fixed while the objective (3) is minimized with respect to the principal components, $\mathbf{y}_t$, by running the following gradient-descent dynamics until convergence:
\dot{\mathbf{y}}_t = \mathbf{W}^\top \mathbf{x}_t - \mathbf{W}^\top \mathbf{W}\, \mathbf{y}_t \qquad (4)
where the dot designates a derivative with respect to a continuous time variable that runs within a single time step, $t$. Unlike the closed-form output of a single Oja neuron (1), (4) is iterative.

After the output, $\mathbf{y}_t$, converges, at the same time step, $t$, the feature vectors are updated by the following gradient-descent step with respect to $\mathbf{W}$ on the total objective:
\mathbf{W} \leftarrow \mathbf{W} + \eta \left( \mathbf{x}_t - \mathbf{W} \mathbf{y}_t \right) \mathbf{y}_t^\top \qquad (5)
If there were a single output channel, the algorithm (4),(5) would reduce to (1),(2), provided that the scalar $\mathbf{w}^\top \mathbf{w}$ is rescaled to unity.
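A minimal sketch of the alternating scheme (4),(5), assuming our own choices for the iteration count, step sizes, and orthonormal initialization (none of which are prescribed by the derivation):

```python
import numpy as np

def reconstruction_pca(X, k, eta=0.01, n_iter=100, step=0.1, seed=0):
    """Online principal subspace learning by reconstruction-error minimization.

    X: (T, n) data stream; k: number of output channels.
    Returns the feature matrix W of shape (n, k).
    """
    rng = np.random.default_rng(seed)
    W, _ = np.linalg.qr(rng.standard_normal((X.shape[1], k)))  # orthonormal start
    for x in X:
        y = np.zeros(k)
        for _ in range(n_iter):                   # Eq. (4): output dynamics
            y += step * (W.T @ x - W.T @ W @ y)
        W += eta * np.outer(x - W @ y, y)         # Eq. (5): feature update
    return W
```

Note that the update of each element of W involves the activities of all output channels through the term W y, which is exactly the nonlocality discussed below.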
In NN implementations of algorithm (4),(5), feature vectors are represented by synaptic weights and the components of $\mathbf{y}_t$ by the activities of output neurons. Then (4) can be implemented by a single-layer NN, Fig. 1A, in which activity dynamics converges faster than the time interval between the arrival of successive data samples. The lateral connection weights, $-\mathbf{W}^\top \mathbf{W}$, decorrelate neuronal feature vectors by suppressing the activities of correlated neurons.
However, implementing update (5) in the single-layer NN architecture, Fig. 1A, requires nonlocal learning rules, making it biologically implausible. Indeed, the last term in (5) implies that updating the weight of a synapse requires the knowledge of the output activities of all other neurons, which are not available to the synapse. Moreover, the matrix of lateral connection weights, $\mathbf{W}^\top \mathbf{W}$, in the last term of (4) is computed as a Grammian of the feedforward weights, clearly a nonlocal operation. This problem is not limited to PCA and arises in networks of nonlinear neurons as well [4, 11].
To respect the locality constraint, many authors constructed biologically plausible single-layer networks using heuristic local learning rules that were not derived from an objective function [12, 13]. However, we think that abandoning the optimization approach creates more problems than it solves. Alternatively, NNs with local learning rules can be derived if one introduces a second layer of neurons [14]. However, such an architecture does not map naturally onto biological networks.
We have outlined how the conventional reconstruction approach fails to generate biologically plausible multi-neuron networks for online PCA. In the next section, we introduce an alternative approach that overcomes this limitation. Moreover, this approach suggests a novel view of neural computation, leading to many interesting extensions.
3 Similarity-based approach to linear dimensionality reduction
In this section, we propose a different objective function for the optimization approach to constructing PCA NNs, which we term similarity matching. From this objective function, we derive an online algorithm implementable by a NN with local learning rules. Then, we introduce other similarity-based algorithms for linear dimensionality reduction that include more biological features, such as different neuron classes.
3.1 Similarity-matching objective function
We start by stating an objective function that will be used to derive NNs for linear dimensionality reduction. The similarity of a pair of inputs, $\mathbf{x}_t$ and $\mathbf{x}_{t'}$, both in $\mathbb{R}^n$, can be defined as their dot product, $\mathbf{x}_t^\top \mathbf{x}_{t'}$. Analogously, the similarity of a pair of outputs, which live in $\mathbb{R}^k$ with $k < n$, is $\mathbf{y}_t^\top \mathbf{y}_{t'}$. Similarity matching, as its name suggests, learns a representation in which the similarity between each pair of outputs matches that of the corresponding inputs:
\min_{\mathbf{y}_1, \ldots, \mathbf{y}_T} \frac{1}{T^2} \sum_{t=1}^{T} \sum_{t'=1}^{T} \left( \mathbf{x}_t^\top \mathbf{x}_{t'} - \mathbf{y}_t^\top \mathbf{y}_{t'} \right)^2 \qquad (6)
This offline objective function, previously employed for multidimensional scaling, is optimized by the projections of inputs onto the principal subspace of their covariance, i.e. performing PCA up to an orthogonal rotation. Moreover, (6) has no local minima other than the principal subspace solution [7, 15].
The similaritymatching objective (6) may seem like a strange choice for deriving an online algorithm implementable by a NN. In (6), pairs of inputs and outputs from different time steps interact with each other. Yet, in the online setting, an output must be computed at each time step without accessing inputs or outputs from other time steps. In addition, synaptic weights do not appear explicitly in (6) seemingly precluding mapping onto a NN.
3.2 Variable substitution trick
Both of the above concerns can be resolved by a simple mathematical trick akin to completing the square [16]. We first focus on the cross-term in the expansion of the square in (6), which we call similarity alignment. By introducing a new variable, $\mathbf{W} \in \mathbb{R}^{k \times n}$, we can rewrite the cross-term:
-\frac{1}{T^2} \sum_{t=1}^{T} \sum_{t'=1}^{T} \mathbf{x}_t^\top \mathbf{x}_{t'}\, \mathbf{y}_t^\top \mathbf{y}_{t'} = \min_{\mathbf{W}} \left[ -\frac{2}{T} \sum_{t=1}^{T} \mathbf{y}_t^\top \mathbf{W} \mathbf{x}_t + \operatorname{Tr}\left( \mathbf{W}^\top \mathbf{W} \right) \right] \qquad (7)
To prove this identity, find the optimal $\mathbf{W} = \frac{1}{T} \sum_t \mathbf{y}_t \mathbf{x}_t^\top$ by taking a derivative of the expression on the right with respect to $\mathbf{W}$ and setting it to zero, and then substitute the optimal $\mathbf{W}$ back into the expression. Similarly, for the quartic term in (6):
\frac{1}{T^2} \sum_{t=1}^{T} \sum_{t'=1}^{T} \left( \mathbf{y}_t^\top \mathbf{y}_{t'} \right)^2 = \max_{\mathbf{M}} \left[ \frac{2}{T} \sum_{t=1}^{T} \mathbf{y}_t^\top \mathbf{M} \mathbf{y}_t - \operatorname{Tr}\left( \mathbf{M}^\top \mathbf{M} \right) \right] \qquad (8)
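Both identities are easy to check numerically. The sketch below verifies (7) on random data by comparing the left-hand side with the right-hand side evaluated at the optimal W = (1/T) Σ_t y_t x_tᵀ (all variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
T, n, k = 50, 4, 2
X = rng.standard_normal((n, T))  # inputs x_t as columns
Y = rng.standard_normal((k, T))  # outputs y_t as columns

# Left-hand side of (7): -(1/T^2) * sum_{t,t'} (x_t . x_t')(y_t . y_t')
lhs = -np.trace((X.T @ X) @ (Y.T @ Y)) / T**2

# Right-hand side of (7) at the optimal W = (1/T) sum_t y_t x_t^T
W = (Y @ X.T) / T
rhs = -(2 / T) * np.trace(Y.T @ W @ X) + np.trace(W.T @ W)
assert np.isclose(lhs, rhs)

# Any other W gives a strictly larger value, confirming that W attains the minimum
W2 = W + 0.1
rhs_perturbed = -(2 / T) * np.trace(Y.T @ W2 @ X) + np.trace(W2.T @ W2)
assert rhs_perturbed > rhs
```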
By substituting (7) and (8) into (6) we get:
\min_{\mathbf{W}} \max_{\mathbf{M}} \frac{1}{T} \sum_{t=1}^{T} l_t\left( \mathbf{W}, \mathbf{M}, \mathbf{x}_t \right) \qquad (9)
where
l_t\left( \mathbf{W}, \mathbf{M}, \mathbf{x}_t \right) = 2 \operatorname{Tr}\left( \mathbf{W}^\top \mathbf{W} \right) - \operatorname{Tr}\left( \mathbf{M}^\top \mathbf{M} \right) + \min_{\mathbf{y}_t} \left[ -4\, \mathbf{y}_t^\top \mathbf{W} \mathbf{x}_t + 2\, \mathbf{y}_t^\top \mathbf{M} \mathbf{y}_t \right] \qquad (10)
In the resulting objective function, (9),(10), optimal outputs at different time steps can be computed independently, making the problem amenable to an online algorithm. The price paid for this simplification is the appearance of the minimax optimization problem in the variables $\mathbf{W}$ and $\mathbf{M}$. Minimization over $\mathbf{W}$ aligns output channels with the directions of greatest variance in the input, and maximization over $\mathbf{M}$ diversifies the output by decorrelating output channels, similarly to the Grammian, $\mathbf{W}^\top \mathbf{W}$, used previously. This competition in a gradient descent/ascent algorithm results in the principal subspace projection, which is the only stable fixed point of the corresponding dynamics [17].

3.3 Online algorithm and neural network
We are now ready to derive an online algorithm for optimizing (6). First, we minimize (10) with respect to the output variables, $\mathbf{y}_t$, while holding $\mathbf{W}$ and $\mathbf{M}$ fixed, using gradient-descent dynamics:
\dot{\mathbf{y}}_t = \mathbf{W} \mathbf{x}_t - \mathbf{M} \mathbf{y}_t \qquad (11)
As before, dynamics (11) converges within a single time step, $t$, and outputs $\mathbf{y}_t = \mathbf{M}^{-1} \mathbf{W} \mathbf{x}_t$. After the convergence of $\mathbf{y}_t$, we update $\mathbf{W}$ and $\mathbf{M}$ by a gradient descent step on (7) and a gradient ascent step on (8), respectively:
\mathbf{W} \leftarrow \mathbf{W} + \eta \left( \mathbf{y}_t \mathbf{x}_t^\top - \mathbf{W} \right), \qquad \mathbf{M} \leftarrow \mathbf{M} + \frac{\eta}{\tau} \left( \mathbf{y}_t \mathbf{y}_t^\top - \mathbf{M} \right) \qquad (12)
Algorithm (11),(12), first derived in [17], can be naturally implemented by a biologically plausible NN, Fig. 1B. Here, the activity of the upstream neurons corresponds to the input variables, $\mathbf{x}_t$. The output variables, $\mathbf{y}_t$, are computed by the dynamics of activity (11) in a single layer of neurons. The variables $\mathbf{W}$ and $\mathbf{M}$ are represented by the weights of synapses in feedforward and lateral connections, respectively. The learning rules (12) are local, i.e. the weight update for the synapse between an input neuron and an output neuron depends only on the activities of those two neurons and on the synaptic weight itself. In neuroscience, the learning rules (12) for the synaptic weights $\mathbf{W}$ and $-\mathbf{M}$ (here the minus sign indicates inhibitory synapses, see Eq. (11)) are called Hebbian and anti-Hebbian, respectively.
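A compact sketch of this network, with our own hyperparameters; for brevity, the fixed point of the dynamics (11) is computed directly by solving the linear system rather than simulating the dynamics:

```python
import numpy as np

def similarity_matching(X, k, eta=0.01, tau=0.5, seed=0):
    """Online similarity matching with Hebbian/anti-Hebbian learning, Eqs. (11),(12).

    X: (T, n) data stream; k: output dimensionality (k <= n).
    Returns outputs Y of shape (T, k) and the learned weights W, M.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((k, X.shape[1])) / np.sqrt(X.shape[1])
    M = np.eye(k)
    Y = np.zeros((len(X), k))
    for t, x in enumerate(X):
        y = np.linalg.solve(M, W @ x)             # fixed point of dynamics (11)
        W += eta * (np.outer(y, x) - W)           # Hebbian update, Eq. (12)
        M += (eta / tau) * (np.outer(y, y) - M)   # anti-Hebbian update, Eq. (12)
        Y[t] = y
    return Y, W, M
```

After the weights converge, the outputs carry the projection of each input onto its principal subspace.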
To summarize this section so far: starting with the similarity-matching objective, we derived a Hebbian/anti-Hebbian NN for dimensionality reduction. The minimax objective can be viewed as a zero-sum game played by the weights of the feedforward and lateral connections [16, 18]. This demonstrates that synapses with local updates can still collectively work together to optimize a global objective. A similar, although not identical, NN was proposed heuristically by Földiak [12]. The advantage of our optimization approach is that the offline solution is known.
Although no proof of convergence exists in the online setting, algorithm (11),(12) performs well on large-scale data. A recent paper [10] introduced an efficient, albeit non-biological, modification of the similarity-matching algorithm, Fast Similarity Matching (FSM), and demonstrated its competitiveness with state-of-the-art principal subspace projection algorithms in both processing speed and convergence rate, Figure 1D. (A package with implementations of these algorithms is available at https://github.com/flatironinstitute/online_psp and https://github.com/flatironinstitute/online_psp_matlab.) FSM produces the same output for each input as similarity matching by optimizing (10) through matrix inversion, $\mathbf{y}_t = \mathbf{M}^{-1} \mathbf{W} \mathbf{x}_t$. It achieves extra computational efficiency by keeping in memory and updating the matrix $\mathbf{M}^{-1}$ rather than $\mathbf{M}$. We refer the reader to [10] for suggestions on the implementation of these algorithms.
3.4 Other similarity-based objectives and linear networks
As the algorithm (11),(12) and the NN in Fig. 1B were derived from the similarity-matching objective (6), they project data onto the principal subspace but do not necessarily recover principal components per se. To derive PCA algorithms, we modified the objective function (6) to encourage orthogonality of the learned features [19, 20]. Such algorithms are implemented by NNs of the same architecture as in Fig. 1B but with slightly different local learning rules.
Although the similarity-matching NN in Fig. 1B relies on biologically plausible local learning rules, it lacks biological realism in several other ways. For example, computing the output requires recurrent activity that must settle faster than the time scale of the input variation, which is unlikely in biology. To respect this constraint, we modified the dimensionality reduction algorithm to avoid recurrency [20].
Another non-biological feature of the NN in Fig. 1B is that the output neurons compete with each other by communicating via lateral connections. In biology, such interactions are not direct but are mediated by interneurons. To reflect these observations, we modified the objective function by introducing a whitening constraint:
\min_{\mathbf{y}_1, \ldots, \mathbf{y}_T} -\frac{1}{T^2} \sum_{t=1}^{T} \sum_{t'=1}^{T} \mathbf{x}_t^\top \mathbf{x}_{t'}\, \mathbf{y}_t^\top \mathbf{y}_{t'}, \quad \text{subject to} \quad \frac{1}{T} \sum_{t=1}^{T} \mathbf{y}_t \mathbf{y}_t^\top = \mathbf{I}_k \qquad (13)
where $\mathbf{I}_k$ is the $k$-by-$k$ identity matrix. Then, by implementing the whitening constraint using the Lagrange formalism, we derived NNs where interneurons appear naturally: their activity is modeled by the Lagrange multipliers, $\mathbf{z}_t$ (Fig. 1C) [7]:
\min_{\mathbf{y}_1, \ldots, \mathbf{y}_T}\ \max_{\mathbf{z}_1, \ldots, \mathbf{z}_T} \frac{1}{T^2} \sum_{t=1}^{T} \sum_{t'=1}^{T} \left[ -\mathbf{x}_t^\top \mathbf{x}_{t'}\, \mathbf{y}_t^\top \mathbf{y}_{t'} + \mathbf{z}_t^\top \mathbf{z}_{t'} \left( \mathbf{y}_t^\top \mathbf{y}_{t'} - T \delta_{t t'} \right) \right] \qquad (14)
where $\delta_{tt'}$ is the Kronecker delta. Notice how (14) contains a $\mathbf{y}$-$\mathbf{z}$ similarity-alignment term similar to (7). We can now derive learning rules for the $\mathbf{y}$-$\mathbf{z}$ connections using the variable substitution trick, leading to the network in Figure 1C. For details of this and other NN derivations, see [7].
4 Nonnegative similarity-matching objective and nonnegative independent component analysis
Table 3: Accuracy of a linear classifier on CIFAR-10 trained on unsupervised features extracted by convolutional nonnegative similarity matching [22], k-means [23], and a convolutional deep belief network [24] on the same task. Detailed comparisons are available in [22].

Algorithm                                            Accuracy
Convolutional Nonnegative Similarity Matching [22]   80.42%
K-means [23]                                         79.60%
Convolutional DBN [24]                               78.90%

So far, we considered similarity-based NNs comprising linear neurons. But many interesting computations require nonlinearity, and biological neurons are not linear. To construct more realistic and powerful similarity-based NNs, we note that the output of biological neurons is nonnegative (a firing rate cannot be below zero). Hence, we modified the optimization problem by requiring that the output of the similarity-matching cost function (6) is nonnegative:
\min_{\forall t:\ \mathbf{y}_t \ge 0} \frac{1}{T^2} \sum_{t=1}^{T} \sum_{t'=1}^{T} \left( \mathbf{x}_t^\top \mathbf{x}_{t'} - \mathbf{y}_t^\top \mathbf{y}_{t'} \right)^2 \qquad (15)
Here, the number of output dimensions, $k$, may be greater than the number of input dimensions, $n$, leading to a dimensionally expanded representation. Eq. (15) can be solved by the same online algorithm as (6), except that the output variables are projected onto the nonnegative domain. Such an algorithm maps onto the same network and the same learning rules as in Eq. (12), Fig. 1B, but with rectifying neurons (ReLUs) [21, 25, 26], Fig. 3A.
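A sketch of the rectified variant (hyperparameters are our own choices); projected gradient iterations replace the closed-form solve, since the output must respect the constraint y ≥ 0:

```python
import numpy as np

def nonnegative_similarity_matching(X, k, eta=0.01, tau=0.5, n_iter=200, step=0.1, seed=0):
    """Nonnegative similarity matching with rectifying (ReLU) neurons.

    X: (T, n) data stream; k: output dimensionality (may exceed n).
    Returns outputs Y of shape (T, k), all entries nonnegative.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((k, X.shape[1])) / np.sqrt(X.shape[1])
    M = np.eye(k)
    Y = np.zeros((len(X), k))
    for t, x in enumerate(X):
        y = np.zeros(k)
        for _ in range(n_iter):   # dynamics (11) projected onto the nonnegative domain
            y = np.maximum(0.0, y + step * (W @ x - M @ y))
        W += eta * (np.outer(y, x) - W)
        M += (eta / tau) * (np.outer(y, y) - M)
        Y[t] = y
    return Y
```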
The nonnegative similarity-matching network learns features that are very different from PCA. For example, when the network is trained on whitened natural scenes, it extracts edge filters [21] (Fig. 3), as opposed to the Fourier harmonics expected for a translationally invariant dataset. Motivated by this observation, Bahroun and Soltoggio [22] developed a convolutional nonnegative similarity-matching network with multiple resolutions and used it as an unsupervised feature extractor for subsequent linear classification on the CIFAR-10 dataset. They found that nonnegative similarity-matching NNs are superior to other single-layer unsupervised techniques [22, 27], Table 3.
As edge filters also emerge in the independent component analysis (ICA) of natural scenes [28], we investigated a connection between nonnegative similarity matching and nonnegative independent component analysis (NICA), which is used for blind source separation. The NICA problem is to recover independent, nonnegative, and well-grounded (finite probability density in any positive neighborhood of zero) sources, $\mathbf{s}_t$, from observing only their linear mixture, $\mathbf{x}_t = \mathbf{A} \mathbf{s}_t$, where $\mathbf{A}$ is an unknown mixing matrix.

Our solution of NICA is based on the observation that NICA can be solved in two steps [29], Fig. 4A. First, whiten the data and reduce it to the dimensionality of the sources to obtain an orthogonal rotation of the sources (assuming that the mixing matrix is full-rank). Second, find an orthogonal rotation of the whitened sources that yields a nonnegative output, Fig. 4A. The first step can be implemented by the whitening network in Fig. 1C. The second step can be implemented by the nonnegative similarity-matching network (Fig. 3A) because an orthogonal rotation does not affect dot-product similarities [25]. Therefore, NICA is solved by stacking the whitening and the nonnegative similarity-matching networks, Fig. 4B. This algorithm performs well compared to other popular NICA algorithms [25], Fig. 4C.
5 Nonnegative similarity-based networks for clustering and manifold tiling
Nonnegative similarity matching can also cluster well-segregated data [30, 21] and, for data concentrated on manifolds, it can tile them [31]. To understand this behavior, we analyze the optimal solutions of nonnegative similarity-based objectives. Finding the optimal solution of a constrained similarity-based objective is rather challenging, as has been observed for the nonnegative matrix factorization problem. Here, we introduce a simplified similarity-based objective that allows us to make progress with the analysis and admits an intuitive interpretation. First, we address the simpler clustering task, which, for highly segregated data, has a straightforward optimal solution. Second, we address manifold learning by viewing it as a soft-clustering problem.
5.1 A similarity-based cost function and NN for clustering
The key to our analysis is formulating a similarity-based cost function, the optimization of which will yield an online algorithm and a NN for clustering. The algorithm should assign inputs to clusters based on pairwise similarities and output cluster assignment indices, $\mathbf{y}_t$.
To arrive at a cost function, consider first a single pair of data points, $\mathbf{x}_t$ and $\mathbf{x}_{t'}$. If their similarity, $\mathbf{x}_t^\top \mathbf{x}_{t'} < \alpha$, where $\alpha$ is a preset threshold, then the points should be assigned to separate clusters, i.e. to nonoverlapping assignment vectors $\mathbf{y}_t$ and $\mathbf{y}_{t'}$, setting the output similarity, $\mathbf{y}_t^\top \mathbf{y}_{t'}$, to 0. If $\mathbf{x}_t^\top \mathbf{x}_{t'} > \alpha$, then the points are assigned to the same cluster, e.g. $\mathbf{y}_t = \mathbf{y}_{t'}$ with $\|\mathbf{y}_t\| = 1$. Such $\mathbf{y}_t$ and $\mathbf{y}_{t'}$ are optimal solutions (although not unique) to the following optimization problem:
\min_{\mathbf{y}_t \ge 0,\ \mathbf{y}_{t'} \ge 0,\ \|\mathbf{y}_t\| \le 1,\ \|\mathbf{y}_{t'}\| \le 1} -\left( \mathbf{x}_t^\top \mathbf{x}_{t'} - \alpha \right) \mathbf{y}_t^\top \mathbf{y}_{t'} \qquad (16)
To obtain an objective function that clusters the whole dataset of inputs, we simply sum (16) over all possible input pairs:
\min_{\forall t:\ \mathbf{y}_t \ge 0,\ \|\mathbf{y}_t\| \le 1} -\frac{1}{T^2} \sum_{t=1}^{T} \sum_{t'=1}^{T} \left( \mathbf{x}_t^\top \mathbf{x}_{t'} - \alpha \right) \mathbf{y}_t^\top \mathbf{y}_{t'} \qquad (17)
Does optimization of (17) produce the desired clustering output? This depends on the dataset. If a threshold, $\alpha$, exists such that the similarities of all pairs within the same cluster are greater than $\alpha$ and the similarities of pairs from different clusters are less than $\alpha$, then the cost function (17) is minimized by the desired hard-clustering output, provided that the number of output dimensions, $k$, is greater than or equal to the number of clusters.
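This claim is easy to probe numerically on a toy example (the data, threshold, and helper function below are our own illustrative choices): with α set between the within-cluster and between-cluster similarities, the hard-clustering assignment attains a lower value of (17) than lumping all points into one cluster.

```python
import numpy as np

def clustering_cost(X, Y, alpha):
    """Objective (17): -(1/T^2) sum_{t,t'} (x_t . x_t' - alpha) * (y_t . y_t')."""
    T = len(X)
    return -np.sum((X @ X.T - alpha) * (Y @ Y.T)) / T**2

rng = np.random.default_rng(0)
# Two well-separated clusters of 20 points each, near (3, 0) and (0, 3)
X = np.vstack([rng.normal([3.0, 0.0], 0.1, size=(20, 2)),
               rng.normal([0.0, 3.0], 0.1, size=(20, 2))])
alpha = 4.0  # between within-cluster (~9) and cross-cluster (~0) similarities

hard = np.repeat(np.eye(2), 20, axis=0)  # correct one-hot assignments, ||y_t|| = 1
lumped = np.tile([1.0, 0.0], (40, 1))    # all points forced into a single cluster

assert clustering_cost(X, hard, alpha) < clustering_cost(X, lumped, alpha)
```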
To solve the objective (17) in the online setting, we introduce the constraints into the cost via Lagrange multipliers and, using the variable substitution trick, derive a NN implementation of this algorithm [31] (Fig. 5A). The algorithm operates with local Hebbian and anti-Hebbian learning rules whose functional form is equivalent to (12).
5.2 Manifold-tiling solutions
In many real-world problems, data points are not well-segregated but lie on low-dimensional manifolds. For such data, the optimal solution of the objective (17) effectively tiles the data manifold [31].
We can understand such optimal solutions using soft-clustering, i.e. clustering where each stimulus may be assigned to more than one cluster and assignment indices are real numbers between zero and one. Each output neuron is characterized by the weight vector of its incoming synapses, which defines a centroid in the input data space. The response of a neuron is maximal when the data fall on the centroid and decays away from it. Manifold-tiling solutions for several datasets are shown in Fig. 6.
We can prove this result analytically by taking advantage of a convex relaxation in the limit of an infinite number of output dimensions, i.e. $k \to \infty$. Indeed, if we introduce the Grammians $\mathbf{D}$, such that $D_{tt'} = \mathbf{x}_t^\top \mathbf{x}_{t'}$, and $\mathbf{Q}$, such that $Q_{tt'} = \mathbf{y}_t^\top \mathbf{y}_{t'}$, and do not specify the dimensionality of $\mathbf{y}_t$ by leaving the rank of $\mathbf{Q}$ open, we can rewrite (17) as:
\min_{\mathbf{Q} \in \mathcal{CP},\ \operatorname{diag}(\mathbf{Q}) \le \mathbf{1}} -\frac{1}{T^2} \operatorname{Tr}\left( \left( \mathbf{D} - \alpha \mathbf{E} \right)^\top \mathbf{Q} \right) \qquad (18)
where $\mathbf{E}$ is a matrix whose elements are all ones, and the cone of completely positive matrices, i.e. matrices with $\mathbf{Q} = \mathbf{Y}^\top \mathbf{Y}$ for some $\mathbf{Y} \ge 0$, is denoted by $\mathcal{CP}$ [34]. Redefining the variables makes the optimization problem convex. For arbitrary datasets, optimization problems over $\mathcal{CP}$ are often intractable for large $T$ [34], despite the convexity. However, for symmetric datasets, such as a circle, a 2-sphere, or SO(3), we can optimize (18) by analyzing the Karush-Kuhn-Tucker conditions [31] (Fig. 6A).
5.3 Other similarity-based NNs for clustering and manifold-tiling
A problem related to objective (17) is the previously studied convex semidefinite programming relaxation of community detection in graphs [35], which is closely related to clustering. The semidefinite program is related to (18) by requiring the elementwise nonnegativity of $\mathbf{Q}$ instead of the nonnegativity of $\mathbf{Y}$:
\min_{\mathbf{Q} \succeq 0,\ \mathbf{Q} \ge 0,\ \operatorname{diag}(\mathbf{Q}) \le \mathbf{1}} -\frac{1}{T^2} \operatorname{Tr}\left( \left( \mathbf{D} - \alpha \mathbf{E} \right)^\top \mathbf{Q} \right) \qquad (19)
While we chose to present our similarity-based NN approach to clustering and manifold-tiling through the cost function in (17), similar results can be obtained for other versions of similarity-based clustering objective functions. The nonnegative similarity-matching cost function, Eq. (15), and the NN derived from it (Fig. 3A) can be used for clustering and manifold learning as well [30, 21, 31]. The k-means cost function can be cast into a similarity-based form, and a NN (Fig. 5B) can be derived for its online implementation [36]. We introduced a soft-k-means cost, also a relaxation of another semidefinite program for clustering [37], and an associated NN (Fig. 5B) [36], and showed that they can perform manifold tiling [38].
The algorithms we discussed operate with the dot product as the measure of similarity of the inputs. By augmenting the presented NNs with an initial random, nonlinear projection layer (Fig. 5C), it is possible to implement nonlinear similarity measures associated with certain kernels [32]. A clustering algorithm using this idea performs on par with other online kernel clustering algorithms [32], Fig. 5D.
6 Discussion
To overcome the nonlocality of the learning rules in NNs derived from reconstruction error minimization, we proposed a new class of cost functions, which we call similarity-based. To summarize, the first term in the similarity-based cost functions,

-\frac{1}{T^2} \sum_{t=1}^{T} \sum_{t'=1}^{T} \mathbf{x}_t^\top \mathbf{x}_{t'}\, \mathbf{y}_t^\top \mathbf{y}_{t'} \qquad (20)

is (the negative of) the covariance of the similarity of the outputs and the similarity of the inputs. Hence the name "similarity-based" cost functions. Previously, such objectives were used in linear kernel alignment [39]. Our key observation is that optimization of objective functions containing such a term in the online setting gives rise to local synaptic learning rules (7) [16].
To derive biologically plausible NNs from (20), one must choose not just the first term but also the additional terms and the optimization constraints so that the online optimization algorithm is implementable by biological mechanisms. We and others have identified a whole family of such terms and constraints (Table 1), some of which were reviewed in this article. As a result, we can relate many features of biological NNs to different terms and constraints in similarity-based cost functions and, hence, give them computational interpretations.
Our framework provides a systematic procedure to design novel NN algorithms by formulating a learning task using similaritybased cost functions. As evidenced by the highperforming algorithms discussed in this paper, our procedure of incorporating biological constraints does not impede but rather facilitates the design process by limiting the algorithm search to a useful part of the NN algorithm space.
Table 1: Correspondence between features of similarity-based optimization and biological features.

OPTIMIZATION FEATURE                     BIOLOGICAL FEATURE
Similarity (anti-)alignment              (Anti-)Hebbian plasticity [17, 16]
Nonnegativity constraint                 ReLU activation function [21, 18]
Sparsity regularizer                     Adaptive neural threshold [40]
Constrained output correlation matrix    Adaptive lateral weights [7, 18]
Constrained PSD output Grammian          Anti-Hebbian interneurons [7]
Copositive output Grammian               Anti-Hebbian inhibitory neurons [31]
Constrained activity norm                Giant interneuron [36]
The locality of learning rules in similarity-based NNs makes them naturally suitable for implementation in adaptive neuromorphic systems, which have already been explored in custom analog arrays [3]. For broader use in the rapidly growing world of low-power, spike-based hardware with on-chip learning [2], similarity-based NNs were missing a key ingredient: spiking neurons. Very recent work [26] developed a spiking version of the nonnegative similarity-matching network and took a step toward neuromorphic applications.
Despite the successes of similarity-based NNs, many interesting challenges remain. 1) Whereas numerical experiments indicate that our online algorithms perform well, most of them lack global convergence proofs. Even for PCA networks, we can only prove linear stability of the desired solution in the stochastic approximation setting. 2) Motivated by biological learning, which is mostly unsupervised, we focused on unsupervised learning. Yet supervision, or reinforcement, does take place in the brain. Therefore, it is desirable to extend our framework to supervised, semi-supervised, and reinforcement learning settings. Such extensions may be valuable as general-purpose machine learning algorithms. 3) Whereas most sensory stimuli are correlated time series, we assumed that data points at different times are independent. How are temporal correlations analyzed by NNs? Solving this problem is important both for modeling brain function and for developing general-purpose machine learning algorithms. 4) Another challenge is stacking similarity-based NNs. A heuristic approach to stacking yields promising results [22]. Yet, except for the nonnegative ICA problem introduced in Section 4, we do not have a theoretical understanding of how and why to stack similarity-based NNs. 5) Finally, neurons in biological NNs signal each other using all-or-none spikes, or action potentials, as opposed to the real-valued signals we considered. Is there an optimization theory accounting for spiking in biological NNs?

Acknowledgments
We thank our collaborators Anirvan Sengupta, Mariano Tepper, Andrea Giovannucci, Alex Genkin, Victor Minden, Sreyas Mohan, Yanis Bahroun for their contributions, Andrea Soltoggio for discussions, and Siavash Golkar and Alper Erdogan for commenting on the manuscript. This work was in part supported by a gift from the Intel Corporation.
References

[1] F. Rosenblatt, "The perceptron: a probabilistic model for information storage and organization in the brain," Psychol. Rev., vol. 65, no. 6, p. 386, 1958.
[2] M. Davies, N. Srinivasa, T.-H. Lin, G. Chinya, Y. Cao, S. H. Choday, G. Dimou, P. Joshi, N. Imam, S. Jain et al., "Loihi: A neuromorphic manycore processor with on-chip learning," IEEE Micro, vol. 38, no. 1, pp. 82–99, 2018.
[3] J. H. Poikonen and M. Laiho, "A mixed-mode array computing architecture for online dictionary learning," in ISCAS. IEEE, 2017, pp. 1–4.
[4] B. A. Olshausen and D. J. Field, "Emergence of simple-cell receptive field properties by learning a sparse code for natural images," Nature, vol. 381, pp. 607–609, 1996.
[5] E. Oja, "Simplified neuron model as a principal component analyzer," J. Math. Biol., vol. 15, no. 3, pp. 267–273, 1982.
[6] M. Udell, C. Horn, R. Zadeh, S. Boyd et al., "Generalized low rank models," Foundations and Trends in Machine Learning, vol. 9, no. 1, pp. 1–118, 2016.
[7] C. Pehlevan and D. Chklovskii, "A normative theory of adaptive dimensionality reduction in neural networks," in NeurIPS, 2015, pp. 2260–2268.
[8] R. Arora et al., "Stochastic optimization for PCA and PLS," in ACSSC. IEEE, 2012, pp. 861–868.
[9] J. Weng, Y. Zhang, and W.-S. Hwang, "Candid covariance-free incremental principal component analysis," IEEE TPAMI, vol. 25, no. 8, pp. 1034–1040, 2003.
[10] A. Giovannucci et al., "Efficient principal subspace projection of streaming data through fast similarity matching," IEEE Big Data, 2018.
[11] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788–791, 1999.
[12] P. Földiak, "Adaptive network for optimal linear feature extraction," in Int. Joint Conf. on Neural Networks. IEEE, 1989, pp. 401–405.
[13] K. I. Diamantaras and S. Y. Kung, Principal Component Neural Networks: Theory and Applications. John Wiley & Sons, Inc., 1996.
[14] B. A. Olshausen and D. J. Field, "Sparse coding with an overcomplete basis set: A strategy employed by V1?" Vision Research, vol. 37, no. 23, pp. 3311–3325, 1997.
[15] R. Ge, J. D. Lee, and T. Ma, "Matrix completion has no spurious local minimum," in NeurIPS, 2016, pp. 2973–2981.
[16] C. Pehlevan, A. M. Sengupta, and D. B. Chklovskii, "Why do similarity matching objectives lead to Hebbian/anti-Hebbian networks?" Neural Comput., vol. 30, no. 1, pp. 84–124, 2018.
[17] C. Pehlevan, T. Hu, and D. Chklovskii, "A Hebbian/anti-Hebbian neural network for linear subspace learning: A derivation from multidimensional scaling of streaming data," Neural Comput., vol. 27, pp. 1461–1495, 2015.
[18] H. S. Seung and J. Zung, "A correlation game for unsupervised learning yields computational interpretations of Hebbian excitation, anti-Hebbian inhibition, and synapse elimination," arXiv preprint arXiv:1704.00646, 2017.
[19] C. Pehlevan and D. B. Chklovskii, "Optimization theory of Hebbian/anti-Hebbian networks for PCA and whitening," in Allerton. IEEE, 2015, pp. 1458–1465.
 [20] V. Minden, C. Pehlevan, and D. B. Chklovskii, “Biologically plausible online pca without recurrent dynamics,” in ACSSC. IEEE, 2018.
 [21] C. Pehlevan and D. B. Chklovskii, “A hebbian/antihebbian network derived from online nonnegative matrix factorization can cluster and discover sparse features,” in ACSSC. IEEE, 2014, pp. 769–775.
 [22] Y. Bahroun and A. Soltoggio, “Online representation learning with single and multilayer hebbian networks for image classification,” in ICANN, 2017, pp. 354–363.
 [23] A. Coates, A. Ng, and H. Lee, “An analysis of singlelayer networks in unsupervised feature learning,” in AISTATS, 2011, pp. 215–223.
 [24] A. Krizhevsky and G. Hinton, “Convolutional deep belief networks on cifar10,” Unpublished manuscript, vol. 40, no. 7, 2010.
 [25] C. Pehlevan, S. Mohan, and D. B. Chklovskii, “Blind nonnegative source separation using biological neural networks,” Neural Comput., vol. 29, pp. 2925–2954, 2017.
 [26] C. Pehlevan, “A spiking neural network with local learning rules derived from nonnegative similarity matching,” in ICASSP, 2019, pp. 7958–7962.
 [27] Y. Bahroun, E. Hunsicker, and A. Soltoggio, “Building efficient deep hebbian networks for image classification tasks,” in ICANN, 2017, pp. 364–372.
 [28] A. J. Bell and T. J. Sejnowski, “The “independent components” of natural scenes are edge filters,” Vision research, vol. 37, no. 23, pp. 3327–3338, 1997.
 [29] M. Plumbley, “Conditions for nonnegative independent component analysis,” Signal Processing Letters, IEEE, vol. 9, no. 6, pp. 177–180, 2002.
 [30] D. Kuang, C. Ding, and H. Park, “Symmetric nonnegative matrix factorization for graph clustering,” in SDM. SIAM, 2012, pp. 106–117.
 [31] A. Sengupta et al., “Manifoldtiling localized receptive fields are optimal in similaritypreserving neural networks,” in NeurIPS, 2018.
 [32] Y. Bahroun, E. Hunsicker, and A. Soltoggio, “Neural networks for efficient nonlinear online clustering,” in ICONIP. Springer, 2017, pp. 316–324.
 [33] A. Rahimi and B. Recht, “Random features for largescale kernel machines,” in NeurIPS, 2008, pp. 1177–1184.
 [34] A. Berman and N. ShakedMonderer, Completely positive matrices. World Scientific, 2003.

[35]
T. T. Cai, X. Li et al.
, “Robust and computationally feasible community detection in the presence of arbitrary outlier nodes,”
Ann. Stat., vol. 43, no. 3, pp. 1027–1059, 2015.  [36] C. Pehlevan, A. Genkin, and D. B. Chklovskii, “A clustering neural network model of insect olfaction,” in ACSSC. IEEE, 2017, pp. 593–600.
 [37] B. Kulis et al., “Semisupervised graph clustering: a kernel approach,” Machine learning, vol. 74, no. 1, pp. 1–22, 2009.
 [38] M. Tepper, A. M. Sengupta, and D. Chklovskii, “Clustering is semidefinitely not that hard: Nonnegative sdp for manifold disentangling,” JMLR, vol. 19, pp. 1–30, 2018.
 [39] N. Cristianini et al., “On kerneltarget alignment,” in NeurIPS, 2002, pp. 367–373.
 [40] T. Hu, C. Pehlevan, and D. B. Chklovskii, “A hebbian/antihebbian network for online sparse dictionary learning derived from symmetric matrix factorization,” in ACSSC. IEEE, 2014, pp. 613–619.