Early sensory processing reduces the dimensionality of streamed inputs (Hyvärinen et al., 2009) as evidenced by a high ratio of input to output nerve fiber counts (Shepherd, 2003). For example, in the human retina, information gathered by 125 million photoreceptors is conveyed to the Lateral Geniculate Nucleus through 1 million ganglion cells (Hubel, 1995). By learning a lower-dimensional subspace and projecting the streamed data onto that subspace the nervous system de-noises and compresses the data simplifying further processing. Therefore, a biologically plausible implementation of dimensionality reduction may offer a model of early sensory processing.
For a single neuron, a biologically plausible implementation of dimensionality reduction in the streaming, or online, setting has been proposed in the seminal work of(Oja, 1982), Figure 1A. At each time point,
, an input vector,, is presented to the neuron, and, in response, it computes a scalar output, , were w is a row-vector of input synaptic weights. Furthermore, synaptic weights w are updated according to a version of Hebbian learning called Oja’s rule:
where is a learning rate and
designates a transpose. Then, the neuron’s synaptic weight vector converges to the principal eigenvector of the covariance matrix of the streamed data(Oja, 1982)
. Importantly, Oja’s learning rule is local meaning that synaptic weight updates depend on the activities of only pre- and postsynaptic neurons accessible to each synapse and, therefore, biologically plausible.
Oja’s rule can be derived by an approximate gradient descent of the mean squared representation error (Cichocki and Amari, 2002; Yang, 1995), a so-called synthesis view of principal component analysis (PCA) (Pearson, 1901; Preisendorfer and Mobley, 1988):
Computing principal components beyond the first requires more than one output neuron and motivated numerous neural networks. Some well-known examples are the Generalized Hebbian Algorithm (GHA) (Sanger, 1989), Földiak’s network (Földiak, 1989), the subspace network (Karhunen and Oja, 1982), Rubner’s network (Rubner and Tavan, 1989; Rubner and Schulten, 1990), Leen’s minimal coupling and full coupling networks (Leen, 1990, 1991) and the APEX network (Kung and Diamantaras, 1990; Kung et al., 1994). We refer to (Becker and Plumbley, 1996; Diamantaras and Kung, 1996; Diamantaras, 2002) for a detailed review of these and further developments.
However, none of the previous contributions was able to derive a multineuronal single-layer network with local learning rules by minimizing a principled cost function, in a way that Oja’s rule (1) was derived for a single neuron. The GHA and the subspace rules rely on nonlocal learning rules: feedforward synaptic updates depend on other neurons’ synaptic weights and activities. Leen’s minimal network is also nonlocal: feedforward synaptic updates of a neuron depend on its lateral synaptic weights. While Földiak’s, Rubner’s and Leen’s full coupling networks use local Hebbian and anti-Hebbian rules, they were postulated rather than derived from a principled cost function. APEX network, perhaps, comes closest to our criterion: the rule for each neuron can be related separately to a cost function which includes contributions from other neurons. But no cost function describes all the neurons combined.
At the same time, numerous dimensionality reduction algorithms have been developed for data analysis needs disregarding the biological plausibility requirement. Perhaps the most common approach is again PCA, which was originally developed for batch processing (Pearson, 1901) but later adapted to streaming data (Yang, 1995; Crammer, 2006; Arora et al., 2012; Goes et al., 2014). For a more detailed collection of references, see e.g. (Balzano, 2012). These algorithms typically minimize the representation error cost function:
where is a data matrix and is a wide matrix (for detailed notation, see below). The minimum of (3) is when rows of are orthonormal and span the -dimensional principal subspace, and therefore is the projection matrix to the subspace (Yang, 1995)111Recall that, in general, the projection matrix to the row space of a matrix P is given by , provided is full rank (Plumbley, 1995). If the rows of are orthonormal this reduces to ..
A gradient descent minimization of such cost function can be approximately implemented by the subspace network (Yang, 1995), which, as pointed out above, requires nonlocal learning rules. While this algorithm can be implemented in a neural network using local learning rules, it requires a second layer of neurons (Oja, 1992), making it less appealing.
In this paper, we derive a single-layer network with local Hebbian and anti-Hebbian learning rules, similar in architecture to Földiak’s (Földiak, 1989) (see Figure 1B), from a principled cost function and demonstrate that it recovers a principal subspace from streaming data. The novelty of our approach is that, rather than starting with the representation error cost function traditionally used for dimensionality reduction, such as PCA, we use the cost function of classical multidimensional scaling (CMDS), a member of the family of multidimensional scaling (MDS) methods (Cox and Cox, 2000; Mardia et al., 1980). Whereas the connection between CMDS and PCA has been pointed out previously (Williams, 2001; Cox and Cox, 2000; Mardia et al., 1980), CMDS is typically performed in the batch setting. Instead, we developed a neural network implementation of CMDS for streaming data.
The rest of the paper is organized as follows. In Section 2, by minimizing the CMDS cost function we derive two online algorithms implementable by a single-layer network, with synchronous and asynchronous synaptic weight updates. In Section 3, we demonstrate analytically that synaptic weights define a principal subspace whose dimension is determined by the number of output neurons and that the stability of the solution requires that this subspace corresponds to top principal components. In Section 4, we show numerically that our algorithm recovers the principal subspace of a synthetic dataset, and does it faster than the existing algorithms. Finally, in Section 5, we consider the case when data are generated by a nonstationary distribution and present a generalization of our algorithm to principal subspace tracking.
2 Derivation of online algorithms from the CMDS cost function
CMDS represents high-dimensional input data in a lower-dimensional output space while preserving pairwise similarities between samples222Whereas MDS in general starts with dissimilarities between samples that may not live in Euclidean geometry, in CMDS data are assumed to have a Euclidean representation. (Young and Householder, 1938; Torgerson, 1952).
Let centered input data samples in be represented by column-vectors concatenated into an matrix . The corresponding output representations in , , are column-vectors, , concatenated into an dimensional matrix . Similarities between vectors in Euclidean spaces are captured by their inner products. For the input (output) data, such inner products are assembled into a Gram matrix333When input data are pairwise Euclidean distances, assembled into a matrix , the Gram matrix, , can be constructed from by , where , is the centering matrix, and is the dimensional identity matrix
dimensional identity matrix(Cox and Cox, 2000; Mardia et al., 1980). (). For a given , CMDS finds by minimizing the so-called “strain” cost function (Carroll and Chang, 1972) :
For discovering a low-dimensional subspace, the CMDS cost function (4) is a viable alternative to the representation error cost function (3) because its solution is related to PCA (Williams, 2001; Cox and Cox, 2000; Mardia et al., 1980). Specifically, Y is the linear projection of X onto the (principal sub-)space spanned by principal eigenvectors of the sample covariance matrix . The CMDS cost function defines a subspace rather than individual eigenvectors because left orthogonal rotations of an optimal Y stay in the subspace and are also optimal, as is evident from the symmetry of the cost function.
In order to reduce the dimensionality of streaming data, we minimize the CMDS cost function (4) in the stochastic online setting. At time , a data sample, , drawn independently from a zero-mean distribution is presented to the algorithm which computes a corresponding output, prior to the presentation of the next data sample. Whereas in the batch setting, each data sample affects all outputs, in the online setting, past outputs cannot be altered. Thus, at time the algorithm minimizes the cost depending on all inputs and ouputs up to time with respect to while keeping all the previous outputs fixed:
where the last equality follows from the definition of the Frobenius norm. By keeping only the terms that depend on current output we get:
In the large- limit, expression (6) simplifies further because the first two terms grow linearly with , and therefore dominate over the last two. After dropping the last two terms we arrive at:
We term the cost in expression (7) the “online CMDS cost”. Because the online CMDS cost is a positive semi-definite quadratic form in for sufficiently large , this optimization problem is convex. While it admits a closed-form analytical solution via matrix inversion, we are interested in biologically plausible algorithms. Next, we consider two algorithms that can be mapped onto single-layer neural networks with local learning rules: coordinate descent leading to asynchronous updates and Jacobi iteration leading to synchronous updates.
2.1 A neural network with asynchronous updates
The online CMDS cost function (7) can be minimized by coordinate descent which at every step finds the optimal value of one component of while keeping the rest fixed. The components can be cycled through in any order until the iteration converges to a fixed point. Such iteration is guaranteed to converge under very mild assumptions: diagonals of has to be positive (Luo and Tseng, 1991), meaning that each output coordinate has produced at least one non-zero output before current time step . This condition is almost always satisfied in practice.
The cost to be minimized at each coordinate descent step with respect to channel’s activity is:
Keeping only those terms that depend on yields:
By taking a derivative with respect to and setting it to zero we arrive at the following closed-form solution:
To implement this algorithm in a neural network we denote normalized input-output and output-output covariances,
allowing us to rewrite the solution (8) in a form suggestive of a linear neural network:
where and represent the synaptic weights of feedforward and lateral connections respectively, Figure 1B.
Finally, to formulate a fully online algorithm we rewrite (9) in a recursive form. This requires introducing a scalar variable representing cumulative activity of a neuron up to time ,
Then, at each time point, , after the output is computed by the network, the following updates are performed:
Equations (10) and (2.1) define a neural network algorithm that minimizes the online CMDS cost function (7) for streaming data by alternating between two phases: neural activity dynamics and synaptic updates. After a data sample is presented at time , in the neuronal activity phase, neuron activities are updated one-by-one, i.e. asynchronously, (10) until the dynamics converges to a fixed point defined by the following equation:
where is the -dimensional identity matrix.
In the second phase of the algorithm, synaptic weights are updated, according to a local Hebbian rule (2.1) for feedforward connections, and according to a local anti-Hebbian rule (due to the sign in equation (10)) for lateral connections. Interestingly, these updates have the same form as the single-neuron Oja’s rule (1) (Oja, 1982), except that the learning rate is not a free parameter but is determined by the cumulative neuronal activity 444The single neuron Oja’s rule derived from the minimization of a least squares optimization cost function ends up with the identical learning rate (Diamantaras, 2002; Hu et al., 2013). Motivated by this fact, such learning rate has been argued to be optimal for the APEX network (Diamantaras and Kung, 1996; Diamantaras, 2002) and used by others (Yang, 1995).. To the best of our knowledge such single-neuron rule (Hu et al., 2013) has not been derived in the multineuron case. An alternative derivation of this algorithm is presented in Appendix A.1
2.2 A neural network with synchronous updates
Here, we present an alternative way to derive a neural network algorithm from the large- limit of the online CMDS cost function (7) . By taking a derivative with respect to and setting it to zero we arrive at the following linear matrix equation:
We solve this system of equations using Jacobi iteration (Strang, 2009), by first splitting the output covariance matrix that appears on the left side of (14) into its diagonal component and the remainder :
Interestingly, the matrices obtained on the right side are algebraically equivalent to the feedforward and lateral synaptic weight matrices defined in (9):
Hence, the Jacobi iteration for solving (14)
converges to the same fixed point as the coordinate descent, (13).
Iteration (16) is naturally implemented by the same single-layer linear neural network as for the asynchronous update, Figure 1B. For each stimulus presentation the network goes through two phases. In the first phase, iteration (16) is repeated until convergence. Unlike the coordinate descent algorithm which updated activity of neurons one after another, here, activities of all neurons are updated synchronously. In the second phase, synaptic weight matrices are updated according to the same rules as in the asynchronous update algorithm (2.1).
Unlike the asynchronous update (10), for which convergence is almost always guaranteed (Luo and Tseng, 1991), convergence of iteration (16) is guaranteed only when the spectral radius of is less than 1 (Strang, 2009). Whereas we cannot prove that this condition is always met, in practice, the synchronous algorithm works well. While in the rest of the paper, we consider only the asynchronous updates algorithm, our results hold for the synchronous updates algorithm provided it converges.
3 Stationary synaptic weights define a principal subspace
What is the nature of the lower dimensional representation found by our algorithm? In CMDS, outputs are the Euclidean coordinates in the principal subspace of the input vector (Cox and Cox, 2000; Mardia et al., 1980). While our algorithm uses the same cost function as CMDS, the minimization is performed in the streaming, or online, setting. Therefore, we cannot take for granted that our algorithm will find the principal subspace of the input. In this section, we provide analytical evidence, by a stability analysis in a stochastic setting, that our algorithm extracts the principal subspace of the input data and projects onto that subspace. We start by previewing our results and method.
Our algorithm performs a linear dimensinality reduction since the transformation between the input and the output is linear. This can be seen from the neural activity fixed point (13), which we rewrite as
where is a matrix defined in terms of the synaptic weight matrices and :
Relation (17) shows that the linear filter of a neuron, which we term a “neural filter”, is the corresponding row of . The space that neural filters span, the rowspace of , is termed a “filter space”.
First, we prove that in the stationary state of our algorithm, neural filters are indeed orthonormal vectors (section3.2, Theorem 1). Second, we demonstrate that the orthonormal filters form a basis of a space spanned by some eigenvectors of the covariance of the inputs (section 3.3, Theorem 2). Third, by analyzing linear perturbations around the stationary state, we find that stability requires these eigenvectors to be the principal eigenvectors and, therefore, the filter space to be the principal subspace (section 3.4, Theorem 3).
These results show that even though our algorithm was derived starting from the CMDS cost function (4), converges to the optimal solution of the representation error cost function (3). This correspondence suggests that
is the algorithm’s current estimate of the projection matrix to the principal subspace. Further, in (3), columns of are interpreted as data features. Then, columns of , or neural filters, are the algorithm’s estimate of such features.
Rigorous stability analyses of PCA neural networks (Oja, 1982; Oja and Karhunen, 1985; Sanger, 1989; Oja, 1992; Hornik and Kuan, 1992; Plumbley, 1995) typically use the ODE method (Kushner and Clark, 1978): Using a theorem of stochastic approximation theory (Kushner and Clark, 1978), the convergence properties of the algorithm are determined using a corresponding deterministic differential equation555Application of stochastic approximation theory to PCA neural networks depends on a set of mathematical assumptions. See (Zufiria, 2002) for a critique of the validity of these assumptions and an alternative approach to stability analysis..
Unfortunately the ODE method cannot be used for our network. While the method requires learning rates that depend only on time, in our network learning rates () are activity dependent. Therefore we take a different approach. We directly work with the discrete-time system, assume convergence to a “stationary state”, to be defined below, and study the stability of the stationary state.
We adopt a stochastic setting where the input to the network at each time point, , is an -dimensional i.i.d. random vector with zero mean, , where brackets denote an average over the input distribution, and covariance .
Our analysis is performed for the “stationary state” of synaptic weight updates, i.e. when averaged over the distribution of input values, the updates on and average to zero. This is the point of convergence of our algorithm. For the rest of the section, we drop the time index to denote stationary state variables.
The remaining dynamical variables, learning rates , keep decreasing at each time step due to neural activity. We assume that the algorithm has run for a sufficiently long time such that the change in learning rate is small and it can be treated as a constant for a single update. Moreover, we assume that the algorithm converges to a stationary point sufficiently fast such that the following approximation is valid at large :
where is calculated with stationary state weight matrices.
We collect these assumptions into a definition.
Definition 1 (Stationary State).
In the stationary state,
The stationary state assumption leads us to define various relations between synaptic weight matrices, summarized in the following corollary:
In the stationary state,
where is the Kronecker-delta.
The last equality does not hold for since diagonal elements of are zero. To cover the case , we add an identity matrix to , and hence one recovers (20). ∎
Note that (20) implies , i.e. that lateral connection weights are not symmetrical.
3.2 Orthonormality of neural filters
Here we prove the orthonormality of neural filters in the stationary state. First, we need the following lemma:
In the stationary state, the following equality holds:
Now we can prove our theorem.
In the stationary state, neural filters are orthonormal:
Theorem 1 implies that .
3.3 Neural filters and their relationship to the eigenspace of the covariance matrix
How is the filter space related to the input? We partially answer this question in Theorem 2, using the following lemma:
In the stationary state, and commute:
See Appendix A.2. ∎
Now we can state our second theorem.
At the stationary state state, the filter space is an -dimensional subspace in that is spanned by some eigenvectors of the covariance matrix.
Because and commute (Lemma 2), they must share the same eigenvectors. Equation (22) of Theorem 1 implies that eigenvalues of are unity and the rest are zero. Eigenvectors associated with unit eigenvalues span the rowspace of 666If this fact is not familiar to the reader, we recommend Strang’s (Strang, 2009) discussion of Singular Value Decomposition.
discussion of Singular Value Decomposition.and are identical to some eigenvectors of C. ∎
Which eigenvectors of C span the filter space? To show that these are the eigenvectors corresponding to the largest eigenvalues of , we perform a linear stability analysis around the stationary point and show that any other combination would be unstable.
3.4 Linear stability requires neural filters to span a principal subspace
The strategy here is to perturb from its equilibrium value and show that the perturbation is linearly stable only if the row space of is the space spanned by the eigenvectors corresponding to the highest eigenvalues of . To prove this result, we will need two more lemmas.
Let be an real matrix with orthonormal rows and is an real matrix with orthonormal rows, whose rows are chosen to be orthogonal to the rows of . Any real matrix can be decomposed as:
where is an skew-symmetric matrix, is an symmetric matrix and is an matrix.
Define , and . Then, . ∎
Let’s denote an arbitrary perturbation of as , where a small parameter is implied. We can use Lemma 3 to decompose as
where the rows of are orthogonal to the rows of . Skew-symmetric corresponds to rotations of filters within the filter space, i.e. it keeps neural filters orthonormal. Symmetric keeps the filter space invariant but destroys orthonormality. is a perturbation that takes the neural filters outside of the filter space.
Next, we calculate how evolves under the learning rule, i.e. .
A perturbation to the stationary state has the following evolution under the learning rule to linear order in perturbation and linear order in :
Proof is provided in Appendix A.3. ∎
Now, we can state our main result in the following theorem:
The stationary state of neuronal filters F is stable, in large- limit, only if the dimensional filter space is spanned by the eigenvectors of the covariance matrix corresponding to the highest eigenvectors.
Full proof is given in Appendix A.4. Here we sketch the proof.
To simplify our analysis, we choose a specific in Lemma 3 without losing generality. Let be eigenvectors of and be corresponding eigenvalues, labeled so that the first eigenvectors span the row space of (or filter space). We choose rows of to be the remaining eigenvectors, i.e. .
By extracting the evolution of components of from (25) using (24), we are ready to state the conditions under which perturbations of are stable. Mutlipying (25) on the right by gives the evolution of :
Here we changed our notation to to make it explicit that for each we have one matrix equation. These equations are stable when all eigenvalues of all are negative, which requires as shown in the Appendix A.4:
This result proves that the perturbation is stable only if the filter space is identical to the space spanned by eigenvectors corresponding to the highest eigenvalues of .
It remains to analyze the stability of and perturbations. Multiplying (25) on the right by gives,